Interesting article raising questions about how designers approach SoC architecture to achieve power and performance targets. Semiconductor Engineering .:. Do Circuits Whisper Or Shout?
If performance is your thing, see my earlier blogs.
Following on from Tuesday's announcement from ARM of the CoreLink CCI-550 interconnect and CoreLink DMC-500 memory controller products, it is clear there are a lot of performance advantages to be had from a well-matched interconnect and memory system. You can discover some of those benefits in eoin_mccann's blog The Foundation for Next Generation Heterogeneous Devices. For those in Silicon Valley next Thursday, you can find out all about Exploring System Coherency in jdefilippi's talk on the new products at ARM TechCon.
Configuring your SoC infrastructure to match the number and type of processors and DDR channels present in your design is a vital step to ensuring your product is competitive. Measuring the resultant performance is the proof you need to know you have met your design objectives.
At ARM TechCon on Nov 12th I'll be teaming up with nickheaton from Cadence to discuss some of the key interconnect configuration options, demonstrating tools for automating the configuration process and verifying the performance of the interconnect and memory system.
In our presentation on Architecting and Optimizing SoC Infrastructure, I'll discuss ways to optimise cache coherency, avoid stalls while waiting for transactions to complete, set up QoS contracts, and keep the memory controller fully utilized. Nick will demonstrate, using performance verification tools, the importance of correct buffer sizing and of having multiple outstanding transactions in flight.
System optimization involves running Linux applications and understanding the impact on the hardware and other software in a system. It would be great if system optimization could be done by running benchmarks once and gathering all the information needed to fully understand the system, but anybody who has ever done it knows it takes numerous runs to understand system behavior and a fair amount of experimentation to identify corner cases and stress points.
Any benchmark run should be repeatable. Determinism is one of the challenges that must be understood in order to make reliable observations and to guarantee that system improvements have the intended impact. For this reason, it's important to create an environment which is as repeatable as possible. Sometimes this is easy and sometimes it's more difficult.
Traditionally, some areas of system behavior are not deterministic. For example, the networking traffic of a system connected to a network is hard to predict and control if there are uncontrolled machines on the network. Furthermore, even in a very controlled environment, the detailed timing of individual networking packets will always have some variance in when they arrive at the system under analysis.
Another source of nondeterministic behavior could be something as simple as entering Linux commands at the command prompt. The timing of how fast a user is typing will vary from person to person and from run to run when multiple test runs are required to compare performance. A solution for this could be an automated script which automatically launches a benchmark upon Linux boot so there is no human input needed.
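One way to remove the human-input variable is a small wrapper script launched automatically at boot (for example from the init scripts or an @reboot cron entry). The sketch below is hypothetical; the script name, log path, and default command are placeholders, not part of any CPAK or benchmark suite:

```shell
#!/bin/sh
# run_bench.sh -- hypothetical wrapper: launch a benchmark with no human
# input so that typing speed never affects run-to-run timing.
BENCH_CMD="${1:-true}"            # placeholder benchmark command
LOG="/tmp/bench.$$.log"
start=$(date +%s)
$BENCH_CMD > "$LOG" 2>&1          # run the benchmark, capture its output
status=$?
end=$(date +%s)
echo "benchmark '$BENCH_CMD' exited $status after $((end - start))s; log: $LOG"
```

Invoking the same script from the boot sequence on every run means each measurement starts from an identical command timeline.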
Understanding the variables which can be controlled, and countering any which cannot, is required to obtain consistent results. Sometimes unexpected things occur. Recently, I was made aware of a new source of non-determinism: ASLR.
Address Space Layout Randomization (ASLR) has nothing to do with system I/O, but with the internals of the Linux kernel itself. ASLR is a security feature which randomizes where various parts of a Linux application are loaded into memory. One of the things it can do is change the load address of the C library. When ASLR is enabled the C library will be loaded at a different memory address each time the program is run. This is great for security, but is a hindrance for somebody trying to perform system analysis by keeping track of the executed instructions for the purpose of making performance improvements.
The good news is ASLR can be disabled in Linux during benchmarking activities so that programs will generate the same address traces.
A simple command can be used to disable ASLR.
$ echo 0 > /proc/sys/kernel/randomize_va_space
The default value is 2. The Linux sysctl documentation is a good place to find information on randomize_va_space:
This option can be used to select the type of process address space randomization that is used in the system, for architectures that support this feature.
0 - Turn the process address space randomization off. This is the default for architectures that do not support this feature anyways, and kernels that are booted with the "norandmaps" parameter.
1 - Make the addresses of mmap base, stack and VDSO page randomized. This, among other things, implies that shared libraries will be loaded to random addresses. Also for PIE-linked binaries, the location of code start is randomized. This is the default if the CONFIG_COMPAT_BRK option is enabled.
2 - Additionally enable heap randomization. This is the default if CONFIG_COMPAT_BRK is disabled.
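Since ASLR is a security feature, it is good practice to put it back when the measurement session is over. A minimal sketch (the writes require root; reading does not):

```shell
# Save the current ASLR mode, disable it for a measurement session, then
# restore it afterwards.
ASLR=/proc/sys/kernel/randomize_va_space
saved=$(cat "$ASLR" 2>/dev/null || echo 2)   # fall back to the default, 2
echo "saved ASLR mode: $saved"
echo 0 > "$ASLR" 2>/dev/null || echo "need root to change $ASLR"
# ... run the benchmarks here ...
echo "$saved" > "$ASLR" 2>/dev/null || true  # restore the original mode
```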
There is a file /proc/[pid]/maps for each process which has the address ranges where the .so files are loaded.
Launching a program and printing the maps file shows the addresses where the libraries are loaded.
For example, if the benchmark being run is sysbench, run it like this:
$ sysbench & cat /proc/$!/maps
Without setting randomize_va_space to 0, different addresses will be printed each time the benchmark is run. After setting randomize_va_space to 0, the same addresses are used from run to run.
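The effect is easy to demonstrate without a benchmark at all. Each invocation of grep below is a separate process reading its own /proc/self/maps, so with ASLR enabled the libc base address usually differs between the two runs, and with randomize_va_space set to 0 it repeats:

```shell
# Print the libc base address seen by two separate processes.
# (Assumes a Linux host with a dynamically linked grep; on other hosts
# the libc mapping may simply not be found.)
run1=$(grep -m1 'libc' /proc/self/maps | cut -d- -f1)
run2=$(grep -m1 'libc' /proc/self/maps | cut -d- -f1)
echo "run 1 libc base: ${run1:-<no libc mapping found>}"
echo "run 2 libc base: ${run2:-<no libc mapping found>}"
```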
Below is example output from the maps file.
If you ever find that your benchmarking activity starts driving you crazy because the programs you are tracing keep moving around in memory it might be worth looking into ASLR and turning it off!
We are so pleased to have Atrenta people and technology join Synopsys!
Verification requirements have exploded as designs have become increasingly complex. Atrenta's early design analysis tools enable efficient, early verification and optimization of SoC designs at the RTL level. Combined with Synopsys' industry-leading verification technologies, Atrenta's leading static and formal technology further strengthens Synopsys' Verification Continuum™ platform and enables customers with this unique verification environment to meet the demands of today's complex electronic designs. Atrenta's SoC design analysis technology also fortifies the Synopsys Galaxy™ platform with additional power, test and timing-related analysis technologies. By integrating Atrenta's complementary technology into Synopsys' platforms, Synopsys can offer designers a more comprehensive, robust portfolio of silicon to software solutions for complex electronic systems.
The Atrenta products include:
More information on the former Atrenta products is available at synopsys.com/Tools/Verification/Atrenta
SAN FRANCISCO—Design complexity is soaring. Node-to-node transitions now take a year to a year and a half, not several years. Market pressures mount.
This means third-party IP integration is crucial not only to managing system-on-chip (SoC) design complexity but to getting to market in a reasonable amount of time. But IP integration has often been easier said than done: if it is done improperly, design teams can experience schedule slips, which mean added cost and lost market opportunity. So is it worth the risk? Are there any real alternatives? These were fundamental questions a panel of experts addressed here at the 52nd Design Automation Conference in June.
Albert Li, director with Global Unichip Corp, said his company is “hugely dependent” on IP but “there are a lot of problems with HW IP” in terms of implementation and verification.
“Things are getting complicated,” said Navraj Nanda, senior director of marketing for the DesignWare Analog and mixed-signal IP products at Synopsys. “In terms of technology nodes on one end they’re getting smaller. On the other end, (more technologically mature) nodes are getting refreshed.”
He said a key challenge is how does the industry serve those markets “with the same types of IP?”
Thomas Wong, director within strategic business operations at Cadence's IP Group, said with the node-to-node transition shrinking from two to three years to sometimes 18 months, that pace is “outstripping the capacity of smart engineers” to keep up and exploit the node benefits.
While it’s always cathartic to talk about the shared challenges when it comes to the evolution of electronics design, the panelists quickly coalesced around the notion that IP—for all its challenges—is here to stay but that optimization and efficiencies must be found.
“I don't think there's any other way of designing chips with a very small number of exceptions,” said Leah Schuth, director of technical marketing with ARM.
Schuth suggested that the industry address IP and tools the same way it looks at consumer devices. “We need to make the complexity almost invisible to the user” through increased standardization or some kind of certification, she said.
File sizes are part of the integration problem, and here experts on the panel—which was moderated by Semiconductor Engineering Executive Editor Ann Steffora Mutschler—offered some jarring challenges as well as potential solutions.
Cadence’s Wong said that a customer recently told him that downloading the libraries for a project was going to take seven days. Even when the terabytes of information were delivered on a physical hard drive instead, loading the data still took several days.
Schuth wondered how much data across IP is duplicated, bloating file sizes. Is there a way to not transmit “non-data” like header fields or duplicative data to cap file size, Schuth asked.
Nanda said he believes the file-size problem is actually worsening, even with EDA solutions to manage database sizes like OASIS (Open Artwork System Interchange Standard).
“You can be idealistic and say ‘hey let’s try to limit the data size because we understand the applications in the customer’s market,’” Nanda said, “but in reality our customers are in brainstorming mode so they want the whole enchilada.”
Wong noted that Cadence’s multi-protocol IP offering can be one way of getting around the file-size problem, because you load a database once and can use various protocols across different designs.
“It was invented for that, but it’s a bonus,” he said.
Schuth said another way to improve IP integration challenges is to work hard to ensure the IP works “right out of the box” for customers, along the lines of ARM’s Socrates Design Environment or IP-XACT.
Wong suggested thinking about an integrated approach, and he summoned the ghosts of PC design past as an example. Chips & Technologies soared to prominence in the 1990s as a chipset vendor because it delivered complete motherboard reference designs into the market to ease and speed design, he said. This model carries over today into smart phone design, he added.
At the end of the day, in design engineering there are always challenges and usually gradual improvement. As the IP market and methodology mature, the integration stress eases and becomes a “100-piece puzzle instead of a 1,000-piece puzzle,” said Schuth. That’s because IP vendors are learning more and more about customer needs and then applying those lessons to subsequent engagements.
Recently, Carbon released the first ARMv8 Linux CPAK utilizing the ARM CoreLink CCN-504 Cache Coherent Network on Carbon System Exchange. The CCN family of interconnect offers a wide range of high bandwidth, low latency options for networking and data center infrastructure.
The new CPAK uses an ARM Cortex-A57 octa-core configuration to run Linux on a system with AMBA 5 CHI. Switching the Cortex-A57 configuration from ACE to CHI on Carbon IP Exchange is as easy as changing a pull-down menu item on the model build page. After that, a number of configuration parameters must be set to enable the CHI protocol correctly; many of them were discussed in a previous article covering usage of the CCN-504. Using native AMBA 5 CHI for the CPU interface coupled with the CCN-504 interconnect provides high-frequency, non-blocking data transfers. Linux is commonly used in many infrastructure products such as set-top boxes, networking equipment, and servers, so the Linux CPAK is applicable to many of these system designs.
Selecting AMBA 5 CHI for the memory interface makes the system drastically different at the hardware level compared to a Linux CPAK using the ARM CoreLink CCI-400 Cache Coherent Interconnect, but the software stack is not significantly different.
From the software point of view, a change in interconnect usually requires some change in initial system configuration. It also impacts performance analysis as each interconnect technology has different solutions for monitoring performance metrics. An interconnect change can also impact other system construction issues such as interrupt configuration and connections.
Some of the details involved in migrating a multi-cluster Linux CPAK from CCI to CCN are covered below.
Special configuration for the CCN-504 is done using the Linux boot wrapper, which runs immediately after reset. The CPAK doesn’t include the boot wrapper source code; instead it uses git to download the source from kernel.org and then applies a patch with the changes needed for CCN configuration. The added code performs the following tasks:
The most critical software task is to make sure multi-cluster snooping is operational. Without this Linux will not run properly. If you are designing a new multi-cluster CCN-based system it is worth running a bare metal software program to verify snooping across clusters is working correctly. It’s much easier to debug the system with bare metal software, and there are a number of multi-cluster CCN CPAKs available with bare metal software which can be used.
I always recommend a similar approach for other hardware-specific programming. Many times users have hardware registers that need to be programmed before starting Linux, and putting this code into the boot wrapper is easy and less error prone than using simulator scripts to force register values.

The Linux device tree provided with the CPAK also contains a device tree entry for the CCN-504. The device tree entry has a base address which must match the PERIPHBASE parameter on the CCN-504 model. In this case PERIPHBASE is set to 0x30, which means the address in the device tree is 0x30000000.
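As a sketch, the CCN-504 node in the device tree would look something like the fragment below. The reg address matches the PERIPHBASE of 0x30000000 described above; the interrupt values are placeholders, not taken from the CPAK:

```dts
/* Hypothetical CCN-504 node; reg must match the model's PERIPHBASE. */
ccn@30000000 {
    compatible = "arm,ccn-504";
    reg = <0x0 0x30000000 0x0 0x1000000>;
    interrupts = <0 105 4>;    /* placeholder overflow interrupt */
};
```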
All Linux CPAKs come with an application note which provides details on how to configure and compile Linux to generate a single .axf file.
One of the new things in the CPAK is the method used to get the CPU Core ID and Cluster ID information to the GIC-400.
The GIC-400 requires the AWUSER and ARUSER bits on AXI be used to indicate the CPU which is making an access to the GIC. A number between 0 and 7 must be driven on these signals so the GIC knows which CPU is reading or writing, but getting the proper CPU number on the AxUSER bits can be a challenge.
In Linux CPAKs with CCI, this is done by the GIC automatically by inspecting the AXI transaction ID bits and then setting the AxUSER bits as input to the GIC-400. Each CPU will indicate the CPU number within the core (0-3) and the CCI will add information about which slave port received the transaction to indicate the cluster.
Users don’t need to add any special components in the design because the mapping is done inside the Carbon model of the GIC-400 using a parameter called “AXI User Gen Rule”. This parameter has a default value which assumes a 2 cluster system in which each cluster has 4 cores. This is a standard 8 core configuration which uses all of the ports of the GIC-400. The parameter can be adjusted for other configurations as needed.
The User Gen Rule does even more, because the ARM Fast Model for the GIC-400 uses the concept of Cluster ID to indicate which CPU is accessing the GIC. The Cluster ID concept is familiar to software from reading the MPIDR register, and it exists in hardware as a CPU configuration input, but it is not present in each bus transaction coming from a CPU and has no direct correlation to the CCI's approach of adding to the ID bits based on slave port.
To create systems which use cycle accurate models and can also be mapped to ARM Fast Models the User Gen Rule includes all of the following information for each of the 8 CPUs supported by the GIC:
With all of this information Linux can successfully run on multi-cluster systems with the GIC-400.
In a system with CHI, the Cluster ID and CPU ID values must be presented to the GIC in the same way as in ACE systems. For CHI systems, the CPU uses the RSVDC signals to indicate the Core ID. The new CCN-504 CPAK introduces a SoC Designer component to add the Cluster ID information. This component is a CHI-to-CHI pass-through which has a parameter for Cluster ID and adds the given Cluster ID into the RSVDC bits.
For CCN configurations with AXI master ports to memory, the CCN will automatically drive the AxUSER bits correctly for the GIC-400. For systems which bridge CHI to AXI using the SoC Designer CHI-AXI converter, this converter takes care of driving the AxUSER bits based on the RSVDC inputs. In both cases, the AxUSER bits are driven to the GIC. The main difference for CHI systems is the GIC User Gen Rule parameter must be disabled by setting the “AXI4 Enable Change USER” parameter to false so no additional modification is done by the Carbon model of the GIC-400.
All of this may be a bit confusing, but demonstrates the value of Carbon CPAKs. All of the system requirements needed to put various models together to form a running Linux system have already been figured out so users don’t need to know it all if they are not interested. For engineers who are interested, CPAKs offer a way to confirm the expected behavior read in the documentation by using a live simulation and actual waveforms.
What do electronics design engineers and ancient people have in common? Not much, save for one significant passion: Tools.
Bullman, general manager of ARM's Development Solutions Group, spoke on “Dealing with a Complex World" and the ancient and constant human need to develop tools to solve problems.
Read Cadence Editor Christine Young's complete coverage of Bullman's presentation here.
We had a very successful series of events at DAC in which we highlighted collaborative successes with our strategic partners.
We would love to hear your feedback and/or questions.
Many of you no doubt have heard of the passing of longtime EDA industry analyst Gary Smith, who died after an illness July 3. Gary was a life-loving, spirited and brilliant man whose long passion for electronics design was reflected in everything from his work to his music-playing to his efforts with the ITRS semiconductor roadmap. Richard Goering, longtime EDA editor with EE Times and Cadence, penned an obituary as did Dylan McGrath for EE Times.
A celebration of Gary's life, hosted by his widow, ARM's lorikate, is scheduled for Sunday July 12, 11 a.m. at the DoubleTree Hotel in San Jose. Additional information can be found here at Gary Smith EDA.
This year at the 52nd Design Automation Conference (DAC) in San Francisco, ARM® and its ARM Connected Community® partners will showcase how we drive industry innovation through collaborative initiatives that shape the technology for tomorrow's leading-edge devices, such as the Internet of Things (IoT) and wearable electronics.
Presentations and demonstrations of ARM's newest industry-leading technology will include:
ARM experts will also be participating in more than 30 technical sessions, panels and events during the conference. View the ARM speaking schedule.
Join us to learn how ARM works with our community of EDA and embedded partners to enable leading semiconductor and consumer electronics companies with integrated and optimized ARM-based IP for advanced systems-on-chips (SoCs) targeting a wide range of end applications from sensors to servers.
Over a dozen ARM partners will be speaking and featuring their advanced tools, services and products in the ARM Connected Community Pavilion (2414), including ANSYS, Carbon, eSilicon, Imperas Software, Lauterbach, Magillem, Mentor Embedded, Mentor Verification, NetSpeed, Rambus, Sonics and Synopsys.
ARM is sponsoring a Scavenger Hunt to win cool ARM-based products in the daily raffle drawing. To participate:
Drawings will be held at the ARM Connected Community Theater on Monday/Tuesday at 6:45 pm and on Wednesday at 4:45 pm.
I was delighted to receive a significant amount of enthusiastic feedback regarding my talk. The general premise of my talk is seemingly obvious to those of you in this SoC Design Community, but not necessarily to many in the lithography community: The papers at the lithography conferences focus on what line and space can be printed, but when you put an SoC together, there's a lot of other stuff going on.
That other stuff was quite interesting to much of the audience.
Below is a summary write-up of my talk that I'm working on for the SPIE newsroom.
Of course EUV, should it meet its production targets, would help. But as we get closer to an actual EUV insertion point, it is informative to move beyond the basic metrics of pitch and wafers per hour to consider the key SoC metrics: Power, Performance, Area, and Cost (PPAC), and how lithography mixes with other concerns to produce the final results. In doing so, we find pluses and minuses that might nudge the argument one way or the other.

The first section of my talk detailed that we are in an era where transistor performance is not helped by shrinking the gate pitch. This is primarily due to parasitic resistance and capacitance, and it is not solved by switching to a high-mobility channel material in the fin, or even to a nanowire, but only mitigated. Contacting ever-shrinking transistor source/drains is becoming as limiting a problem as the transistor itself. Along these lines, I showed cases where a larger gate pitch can result in smaller chips. One of the morals of those examples is that overall chip area is a combination of transistor characteristics and wiring capabilities, and the two should be considered together when defining a process technology. A related issue is the scaling of minimum-area metal shapes, which are now often limited not by lithography but by copper fill. Larger minimum-area metal shapes take up more routing resources and can result in larger chips.
Furthermore, the interconnect parasitics that the transistors must drive in order to operate a full circuit are also not helped by feature size scaling. As circuits are scaled, the ratio of wire load to transistor drive can be expected to worsen, and then in order to hit frequency scaling targets the transistor drive strengths have to be increased—either by increasing the width of a given transistor or by increasing the number of repeater gates in a critical path. Either way, area and power suffer due to poor interconnect scaling.
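A first-order sketch shows why the wires fail to keep up, under idealized assumptions (a uniform shrink factor s < 1 and constant dielectric; this ignores the resistivity and barrier effects that make real scaling even worse):

```latex
% Local wire of length l, width w, thickness t, resistivity \rho:
%   R_{wire} \propto \rho l / (w t), \quad C_{wire} \propto l
% Shrinking l, w, t by the factor s:
R_{\text{wire}} \;\to\; \frac{\rho\,(s l)}{(s w)(s t)} = \frac{R_{\text{wire}}}{s},
\qquad
C_{\text{wire}} \;\to\; s\,C_{\text{wire}},
\qquad
R_{\text{wire}}\,C_{\text{wire}} \;\to\; R_{\text{wire}}\,C_{\text{wire}}
```

So even in this idealized picture the wire RC delay does not improve while the intrinsic gate delay does, which is why drive strengths or repeater counts must grow to hold frequency targets.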
A further aspect challenged by feature size scaling is variability.
Designers must include margins for variability—if we are presented with larger variability, design margins increase, and those larger margins translate into larger, more power-hungry chips. Part of what we are fighting is Pelgrom’s Law, which reminds us that transistor variability increases as we decrease the area of the channel. The fabs have made remarkable progress in reducing this variability over the generations, including a big step with the 2nd generation of FinFETs, where multiple gate work function materials allow us to reduce the doping in a wider range of transistor options. Historically, the two largest contributors to random variability in transistors have been the dopant fluctuations (reduced as noted above) and line edge roughness (LER) of the gate. With the reduction of the dopant component of random fluctuations, and with FinFETs adding fin edge roughness variability to gate etch roughness, reducing LER in general will become more important. This will be a key issue to monitor with the development of the EUV ecosystem, as source power and resist improvements will be needed to improve LER.
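For reference, Pelgrom’s relation can be written as follows, with A_VT a process-dependent matching coefficient and W and L the channel width and length:

```latex
\sigma_{\Delta V_T} \;=\; \frac{A_{VT}}{\sqrt{W L}}
```

Halving the channel area therefore increases the threshold-voltage mismatch between matched devices by a factor of roughly the square root of two, which is margin the designer must absorb somewhere.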
The circuits that are most affected by local variability are the memories. Scaling the minimum-area SRAM bitcells (which are typically the ones bandied about in public marketing materials) from 28 to 20 to 16/14 to 10 can take them from relatively normal operation to non-functionality. This does not mean that we can’t make embedded memory anymore, just that designers need to choose one of three actions: use a bitcell that has larger transistors, use a bitcell that contains more transistors (8 transistors instead of 6 is a popular option), or wrap “assist” circuitry around the memory arrays to help them overcome their native variability limitations. Any of these options will add area and/or power, and might end up limiting the speed of the chip as well. I received a lot of feedback on this point—underscoring that many people outside of the direct chip design ecosystem did not know that we were already in a regime where the inherent variability of the transistors has limited our abilities to scale memory to the degree that pitch scaling would imply.
The key message of this opening part of the talk: There are many issues that will dilute the final product metrics, independent of the pitch scaling we enable. The good news side of this is somewhat tongue-in-cheek: All of the semiconductor industry isn’t hanging on the fate of EUV pitch scaling.
Transistors, their contacts, and the vias and wires, will all need to have fundamental improvements in order to fully take advantage of the pitch scaling that EUV may offer.
Transitioning to more specific lithography topics, I spent some time comparing the “with-EUV” option to the “without EUV” option (multiple patterning).
In the case of the most critical logic layer, the 1st metal layer, which necessarily contains a high degree of 2-dimensional shapes, we’ve taken on Litho-Etch-Litho-Etch style patterning, making do with this in the absence of EUV. Interestingly, with the great capabilities in overlay in the state-of-the-art 193i steppers, many of our critical 2-dimensional shapes (such as a line end facing the side of a line) may not scale if switched to EUV at a next node, due to the numerical aperture limitations. This disadvantage must be added to the obvious advantage of printing with fewer masks. One wild card here would be the ability to route in M1 with EUV. We removed the ability for the routers to manipulate M1 shapes several technology nodes ago, because the associated rules had become too complex for routers in the extreme low-k1 regimes.
EUV should help with some of the key constructs used to create low power (i.e., small) standard cell logic. A key area is in the local interconnect that is used to wire transistors under the M1. This leads to an interesting point, quantified in a later paper by Lars Liebmann of IBM, that low power designs will likely benefit more from EUV than higher performance designs, as the higher performance designs can use larger standard cells that would not put as much pressure on local patterning.
Another possibility to consider ties back to the interconnect parasitics that I discussed as key scaling limiters above. The only reason we have added the local interconnect is to make up for the patterning limitations of 193i in these small standard cells.
It’s possible that EUV may allow us to eliminate or at least simplify the local interconnect, saving wafer cost but perhaps also importantly reducing wiring parasitics—which would reduce the area and power of chip implementations, all other things being equal.
There are many other “second order” issues that EUV may help with—meaning chip design issues that extend beyond basic pitch scaling. One example I used was the need for multiple patterning in the signal routing layers. If, for instance, LELE-type multiple patterning is used, different wires on the same metal layer will move with respect to each other, due to misalignment between the different masks of the decomposed layout. This misalignment creates extra capacitance variation, which then increases the required design margins and, as discussed earlier, will end up producing larger, more power-hungry designs. Self-Aligned Double Patterning (SADP) is a possible alternative to LELE, and with SADP the line spacing won’t vary according to mask overlay, but it will vary according to “pitch walking” due to mandrel CD variation. Furthermore, most embodiments of SADP impose extra restrictions on the placement of line ends (where we via up/down to other metal layers), and any extra restrictions increase chip area. This brings up a point I emphasized during the talk: many designs, especially low power designs, are limited by the wiring, and much of the wiring limitation comes not from simple line/space but from the line ends and the rules associated with vias at the line ends.
Either way we choose to do multiple patterning, we will limit the ability to scale power, performance and/or area (PPA) of our designs, and this is another issue that should be comprehended when comparing EUV to non-EUV options.
Another potential EUV benefit relates to the routers that I mentioned earlier: merely enabling simpler design rules for the routers can result in improved design implementations, because optimizing the PPA of a design typically involves dozens and dozens of iterations of the floor-planning, placing, and routing of the design. The slower the router (due to more complex design rules), the fewer iterations will be possible prior to design tape-out, and the products taping out won’t be able to achieve ultimate PPA entitlement.
And that brings me to the message I wanted to leave the SPIE EUV conference attendees with:
No one would dispute the benefit that an ideal EUV capability would bring to the industry.
But as EUV closes in on its pitch and throughput targets and approaches viability, we must consider the practical design aspects in order to accurately quantify the potential benefit of EUV. Unfortunately, these design questions are not easy to answer—they ideally require a full Process Design Kit (PDK), with transistor models, parasitic extraction models, and wiring design rules, and fully considered implementations of mock designs in order to benchmark the PPA results.
Furthermore, there won’t be one right answer—low power and high performance designs will likely arrive at different value assessments for EUV, which will add complexity to the choices the foundries will have to make regarding the timing and specific process layers for EUV insertion.
The latest high-performance ARMv8-A processor is the Cortex-A72. The press release reports that the A72 delivers CPU performance 50x greater than leading smartphones from five years ago and that it will be the anchor in premium smartphones for 2016. The Cortex-A72 delivers 3.5x the sustained performance of an ARM Cortex-A15 design from 2014. Last week ARM began providing more details about the Cortex-A72 architecture; AnandTech has a great summary of the A72 details.
The Carbon model of the A72 is now available on Carbon IP Exchange along with 10 Carbon Performance Analysis Kits (CPAKs). Since current design projects may be considering the A72, it’s a good time to highlight some of the differences between the Cortex-A72 and the Cortex-A57.
IP Exchange enables users to configure, build, and download models for ARM IP, and there are a few configuration differences between the A57 and the A72. The first is the L2 cache size: the A57 can be configured with a 512 KB, 1 MB, or 2 MB L2 cache, while the A72 adds a fourth option of 4 MB.
Another new configuration option available on IP Exchange for the A72 is the ability to disable the GIC CPU interface. Many designs continue to use version 2 of the ARM GIC architecture with IP such as the GIC-400; these designs can take advantage of excluding the GIC CPU interface.
The A72 also offers an option to include or exclude the ACP (Accelerator Coherency Port) interface.
The last new configuration option is the number of FEQ (Fill/Evict Queue) entries: the A72 offers 20, 24, or 28 entries, compared to 16 or 20 on the A57. This parameter has been important to Carbon users doing performance analysis and studying the impact of various L2 cache parameters.
The Cortex-A72 configuration from IP Exchange is shown below.
The main change to the A72 interface is that the width of the transaction ID signals has been increased from 6 bits to 7 bits. The wider *IDM signals apply only when the A72 is configured with an ACE interface. The main impact occurs when connecting an A72 to a CCI-400 that was previously used with an A53 or A57: because those CPUs have 6-bit *IDM signals, the CCI-400 will need to be reconfigured for 7-bit *IDM signals. All of the A72 CPAKs which use the CCI-400 already include this change so they operate properly, but it’s something to watch when upgrading existing systems to the A72.
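To see why the extra ID bit matters, here is a quick back-of-the-envelope sketch (not from the article): each additional transaction ID bit doubles the number of distinct IDs the interconnect must be able to track.

```python
# Number of distinct transaction IDs an n-bit *IDM field can encode.
# This bounds how many independently identified transaction streams
# the interconnect can distinguish at once.
def id_space(width_bits: int) -> int:
    return 2 ** width_bits

a57_ids = id_space(6)  # A53/A57: 6-bit *IDM signals
a72_ids = id_space(7)  # A72: 7-bit *IDM signals
print(a57_ids, a72_ids)  # 64 128
```

A CCI-400 port sized for 6-bit IDs therefore cannot represent half of the ID values an A72 can legally issue, which is why the reconfiguration is required.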
This applies to the following signals for A72:
A number of system registers are updated with new values to reflect the A72. The primary part number field in the Main ID Register (MIDR) is 0xD08 for the A72, versus 0xD07 for the A57 and 0xD03 for the A53. Clearly, the 8 was chosen well before the A72 marketing number was assigned. A number of other ID registers change value from 7 on the A57 to 8 on the A72.
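As a sketch of how software distinguishes these cores, the primary part number sits in bits [15:4] of the MIDR (the field layout is architectural; the sample register value below is illustrative, not taken from the article):

```python
# Decode the primary part number from an ARM Main ID Register (MIDR) value.
# MIDR layout: Implementer [31:24], Variant [23:20], Architecture [19:16],
# PartNum [15:4], Revision [3:0].
def midr_part_number(midr: int) -> int:
    """Extract the 12-bit primary part number field."""
    return (midr >> 4) & 0xFFF

PARTS = {0xD03: "Cortex-A53", 0xD07: "Cortex-A57", 0xD08: "Cortex-A72"}

sample_midr = 0x410FD080  # hypothetical A72 MIDR: ARM implementer (0x41), r0p0
print(PARTS[midr_part_number(sample_midr)])  # Cortex-A72
```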
There are a number of new events tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:
The screenshots below from the Carbon Analyzer show the PMU events. All of these are automatically instrumented by the Carbon model and are recorded without any software programming.
The A72 contains many micro-architecture updates for incremental performance improvement. The most obvious one described above is the larger L2 FEQ, and there are certainly many more in the branch prediction, caches, TLB, pre-fetch, and floating point units. As an example, I ran an A57 CPAK and an A72 CPAK with the exact same software program. Both CPUs reported about 21,500 instructions retired, the instruction count if the program were viewed as a sequential instruction stream. Of course, both CPUs perform a number of speculative operations: the A57 reported about 37,000 instructions speculatively executed, and the A72 about 35,700.
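Using the approximate counts quoted above, one simple way to compare the two runs is the ratio of speculatively executed to architecturally retired instructions; a lower ratio suggests less wasted speculative work for the same program:

```python
# Speculation ratio: speculatively executed / retired instructions,
# using the rounded counts reported in the text above.
retired = 21_500  # same program on both CPUs
speculated = {"Cortex-A57": 37_000, "Cortex-A72": 35_700}

for cpu, executed in speculated.items():
    ratio = executed / retired
    print(f"{cpu}: {ratio:.2f} speculated per retired instruction")
```

By this rough measure the A72 run speculates about 1.66 instructions per retired instruction versus about 1.72 on the A57, consistent with improved branch prediction and prefetching.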
The screenshots of the instruction events are shown below, first A72 followed by A57. All of the micro-architecture improvements of the A72 combine to provide the highest performance CPU created by ARM to date.
Carbon users can easily run the A57, A53, and now the A72 with various configuration options and directly compare and contrast the performance results using their own software and systems. The CPAKs available from Carbon System Exchange provide a great starting point and can be easily modified to investigate system performance topics.
Please join ARM's Pierre-Alexandre Bou-Ach to learn how he has been able to reduce the area of Mali cost-efficient GPUs using Synopsys Design Compiler Graphical and IC Compiler. In this technical session, Pierre, physical design lead, will be joined by Synopsys' Priti Vijayvargiya to present methodologies, design choices and results achieved.
The webinar will be held on April 23, 2015 (9AM PST) and can also be viewed as a recorded event.
Electronics applications are exploding in diversity, breadth and design complexity. The time-to-market pressure cooker is boiling. Performance, power and area are all demanding engineers' attention. As design teams push themselves, they often look to the enablement ecosystem for clues to what technologies they can exploit next.
This is particularly true in the spring of each year when the TSMC Technology Symposium is held in San Jose, Calif. Just before the event kicked off, I sat down with Suk Lee, Senior Director, Design Infrastructure Marketing Division of TSMC, to get a sense for what his company is rolling out to the industry in the coming months and how previously announced process nodes are progressing.
In short, Lee said:
Here's a link to the complete Q&A with Lee. And watch for additional dispatches from the 2015 TSMC Technology Symposium as I transcribe my notebook!
One of my associates, Achim Nohl, an expert in virtual prototyping and software bring-up on ARM® processors, will be conducting a webinar on 16 April 2015 (9AM PDT) on driver development for ARMv8-based designs using Hybrid (Virtual+FPGA) Prototyping. Don't worry if you can't see it live, since it will also be available recorded.
In this webinar aimed at firmware and Linux driver developers, Achim will introduce Virtualizer™ Development Kits (VDKs) using ARM Fast Models and how they can be used to bring up drivers on ARMv8-based designs for DesignWare® interface IP such as USB, Ethernet, PCI Express, UFS mobile storage, etc. In addition, he'll show how you can connect a virtual prototype of your ARMv8-based processor subsystem to a HAPS® FPGA-based prototype hosting the DesignWare digital core and analog PHY daughterboard to perform final hardware validation.
Here's what you can expect to learn: