
SoC Implementation



This year at the 52nd Design Automation Conference (DAC) in San Francisco, ARM® and its ARM Connected Community® partners will showcase how we drive industry innovation through collaborative initiatives that shape the technology for tomorrow's leading-edge devices, such as the Internet of Things (IoT) and wearable electronics.


ARM Exhibiting (booth 2428)

Presentations and demonstrations of ARM's newest industry-leading technology.

ARM experts will also be participating in more than 30 technical sessions, panels and events during the conference. View the ARM speaking schedule.


The ARM Connected Community Pavilion

Join us to learn how ARM works with our community of EDA and embedded partners to enable leading semiconductor and consumer electronics companies with integrated and optimized ARM-based IP for advanced systems-on-chip (SoCs) targeting a wide range of end applications, from sensors to servers.


Over a dozen ARM partners will be speaking and featuring their advanced tools, services and products in the ARM Connected Community Pavilion (2414), including ANSYS, Carbon, eSilicon, Imperas Software, Lauterbach, Magillem, Mentor Embedded, Mentor Verification, NetSpeed, Rambus, Sonics and Synopsys.

Click here to view the ARM Connected Community Theater schedule.


The ARM Scavenger Hunt

ARM is sponsoring a Scavenger Hunt to win cool ARM-based products in the daily raffle drawing. To participate:

  1. Pick up a playing card in the ARM booth (2428)
  2. Visit 5 partners in the ARM Connected Community Pavilion and collect an IoT sticker from each one
  3. Go visit 1 ARM partner on the exhibit floor to collect the last sticker
  4. Return the completed card to the ARM booth for entry in the daily drawing and receive an ARM lapel pin and raffle ticket


Drawings will be held at the ARM Connected Community Theater on Monday/Tuesday at 6:45 pm and on Wednesday at 4:45 pm.


I recently had the honor of presenting the keynote at SPIE's EUV (Extreme Ultraviolet) conference. The SPIE conference is the premier gathering place of the lithography community.

I was delighted to receive a significant amount of enthusiastic feedback regarding my talk. The general premise of my talk is seemingly obvious to those of you in this SoC Design Community, but not necessarily to many in the lithography community: the papers at the lithography conferences focus on what line and space can be printed, but when you put an SoC together, there's a lot of other stuff going on.

That other stuff was quite interesting to much of the audience. 

Below is a summary write-up of my talk that I'm working on for the SPIE newsroom.


EUV and SoC:  Does it Really Help?


Of course EUV, should it meet its production targets, would help. But as we get closer to an actual EUV insertion point, it is informative to move beyond the basic metrics of pitch and wafers per hour to consider the key SoC metrics—Power, Performance, Area, and Cost (PPAC)—and how lithography mixes with other concerns to produce the final results. In so doing, we find pluses and minuses that might nudge the argument one way or the other.

The first section of my talk detailed that we are in an era where transistor performance is not helped by shrinking the gate pitch. This is primarily due to parasitic resistance and capacitance. This is not solved by switching to a high-mobility channel material in the fin, or even to a nanowire, but only mitigated. Contacting ever-shrinking transistor source/drains is becoming as limiting a problem as the transistor itself. Along these lines, I showed cases where a larger gate pitch can result in smaller chips. One of the morals of those examples is that the overall chip area is a combination of transistor characteristics and wiring capabilities, and the two should be considered together when defining a process technology.

A related issue is the scaling of minimum-area metal shapes, which are now often not lithographically limited but limited by copper fill. Larger minimum-area metal shapes take up more routing resources and can result in larger chips.


Furthermore, the interconnect parasitics that the transistors must drive in order to operate a full circuit are also not helped by feature size scaling.  As circuits are scaled, the ratio of wire load to transistor drive can be expected to worsen, and then in order to hit frequency scaling targets the transistor drive strengths have to be increased—either by increasing the width of a given transistor or by increasing the number of repeater gates in a critical path.  Either way, area and power suffer due to poor interconnect scaling.

A further aspect challenged by feature size scaling is variability.

Designers must include margins for variability—if we are presented with larger variability, design margins increase, and those larger margins translate into larger, more power-hungry chips. Part of what we are fighting is Pelgrom’s Law, which reminds us that transistor variability increases as we decrease the area of the channel. The fabs have made remarkable progress in reducing this variability over the generations, including a big step with the 2nd generation of FinFETs by including multiple gate work function materials that allow us to reduce the doping in a wider range of transistor options. Historically, the two largest contributors to random variability in transistors have been dopant fluctuations (reduced as noted above) and line edge roughness (LER) of the gate. With the reduction of the dopant component of random fluctuations, and with FinFETs adding fin edge roughness variability to gate etch roughness, reducing LER in general will become more important. This will be a key issue to monitor with the development of the EUV ecosystem, as source power and resist improvements will be needed to improve LER.
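Pelgrom's Law can be sketched numerically: the standard deviation of threshold-voltage mismatch grows as the channel area shrinks. The matching coefficient used below is purely illustrative, not a value for any real process.

```python
import math

# Pelgrom's Law: sigma(dVt) = A_vt / sqrt(W * L).
# A_VT is a hypothetical matching coefficient (mV*um) for illustration only.
A_VT = 1.5

def sigma_dvt_mv(width_um: float, length_um: float) -> float:
    """Standard deviation of threshold-voltage mismatch, in mV."""
    return A_VT / math.sqrt(width_um * length_um)

# Halving both W and L halves the channel area's square root twice over,
# doubling the mismatch sigma that design margins must absorb.
print(sigma_dvt_mv(0.10, 0.02))  # larger device
print(sigma_dvt_mv(0.05, 0.01))  # scaled device: exactly 2x the sigma
```

The point of the sketch is just the trend: each pitch-scaled generation pays a variability tax unless the process (or the bitcell/circuit design) compensates for it.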


The circuits that are most affected by local variability are the memories. Scaling the minimum-area SRAM bitcells (which are typically the ones bandied about in public marketing materials) from 28 to 20 to 16/14 to 10 can take them from relatively normal operation to non-functionality. This does not mean that we can’t make embedded memory anymore, just that designers need to choose one of three actions: use a bitcell that has larger transistors, use a bitcell that contains more transistors (8 transistors instead of 6 is a popular option), or wrap “assist” circuitry around the memory arrays to help them overcome their native variability limitations. Any of these options will add area and/or power, and might end up limiting the speed of the chip as well. I received a lot of feedback on this point—underscoring that many people outside of the direct chip design ecosystem did not know that we were already in a regime where the inherent variability of the transistors has limited our abilities to scale memory to the degree that pitch scaling would imply.


The key message of this opening part of the talk:   There are many issues that will dilute the final product metrics, independent of the pitch scaling we enable.  The good news side of this is somewhat tongue-in-cheek:    All of the semiconductor industry isn’t hanging on the fate of EUV pitch scaling.

Transistors, their contacts, and the vias and wires, will all need to have fundamental improvements in order to fully take advantage of the pitch scaling that EUV may offer.


Transitioning to more specific lithography topics, I spent some time comparing the “with-EUV” option to the “without EUV” option (multiple patterning).

In the case of the most critical logic layer, the 1st metal layer, which necessarily contains a high degree of 2-dimensional shapes, we’ve taken on Litho-Etch-Litho-Etch style patterning, making do with this in the absence of EUV.   Interestingly, with the great capabilities in overlay in the state-of-the-art 193i steppers, many of our critical 2-dimensional shapes (such as a line end facing the side of a line) may not scale if switched to EUV at a next node, due to the numerical aperture limitations.   This disadvantage must be added to the obvious advantage of printing with fewer masks.   One wild card here would be the ability to route in M1 with EUV.   We removed the ability for the routers to manipulate M1 shapes several technology nodes ago, because the associated rules had become too complex for routers in the extreme low-k1 regimes.


EUV should help with some of the key constructs used to create low power (i.e., small) standard cell logic.   A key area is in the local interconnect that is used to wire transistors under the M1.   This leads to an interesting point, quantified in a later paper by Lars Liebmann of IBM, that low power designs will likely benefit more from EUV than higher performance designs, as the higher performance designs can use larger standard cells that would not put as much pressure on local patterning.

Another possibility to consider ties back to the interconnect parasitics that I discussed as key scaling limiters above. The only reason we have added the local interconnect is to make up for patterning limitations of 193i in these small standard cells.


It’s possible that EUV may allow us to eliminate or at least simplify the local interconnect, saving wafer cost but perhaps also importantly reducing wiring parasitics—which would reduce the area and power of chip implementations, all other things being equal.

There are many other “second order” issues that EUV may help with—meaning chip design issues that extend beyond basic pitch scaling.   One example I used was the need for multiple patterning in the signal routing layers.   If for instance LELE type multiple patterning is used, different wires on the same metal layer will move with respect to each other, due to misalignment between the different masks of the decomposed layout.    This misalignment creates extra capacitance variation, which then increases the required design margins, and as discussed earlier will end up with larger, more-power hungry designs.   Self-Aligned Double Patterning (SADP) is a possible alternative to LELE, and with SADP the line spacing won’t vary according to mask overlay, but it will vary according to “pitch walking” due to mandrel CD variation. Furthermore, most embodiments of SADP impose extra restrictions on the placement of line ends (where we via up/down to other metal layers), and any extra restrictions increase chip area.    This brings up a point I emphasized during the talk:  Many designs, especially low power designs, are limited by the wiring, and much of the wiring limitations  are not from simple line/space but from the line ends and the rules associated with vias at the line ends.

Either way we choose to do multiple patterning, we will limit the ability to scale power, performance and/or area (PPA) of our designs, and this is another issue that should be comprehended when comparing EUV to non-EUV options.

Another potential EUV benefit relates to the routers that I mentioned earlier:  merely enabling simpler design rules for the routers can result in improved design implementations, because optimizing the PPA of a design typically involves dozens and dozens of iterations of the floor-planning, placing, and routing of the design.  The slower the router (due to more complex design rules), the fewer iterations will be possible prior to design tape-out, and the products taping out won’t be able to achieve ultimate PPA entitlement.


And that brings me to the message I wanted to leave the SPIE EUV conference attendees with:

No one would dispute the benefit that an ideal EUV capability would bring to the industry.

But as EUV closes in on its pitch and throughput targets and approaches viability, we must consider the practical design aspects in order to accurately quantify the potential benefit of EUV.    Unfortunately, these design questions are not easy to answer—they ideally require a full Process Design Kit (PDK), with transistor models, parasitic extraction models, and wiring design rules, and fully considered implementations of mock designs in order to benchmark the PPA results.

Furthermore, there won’t be one right answer—low power and high performance designs will likely arrive at different value assessments for EUV, which will add complexity to the choices the foundries will have to make regarding the timing and specific process layers for EUV insertion.

The latest high-performance ARMv8-A processor is the Cortex-A72. The press release reports that the A72 delivers CPU performance that is 50x greater than leading smartphones from five years ago and will be the anchor in premium smartphones for 2016. The Cortex-A72 delivers 3.5x the sustained performance compared to an ARM Cortex-A15 design from 2014. Last week ARM began providing more details about the Cortex-A72 architecture. AnandTech has a great summary of the A72 details.




The Carbon model of the A72 is now available on Carbon IP Exchange along with 10 Carbon Performance Analysis Kits (CPAKs). Since current design projects may be considering the A72, it’s a good time to highlight some of the differences between the Cortex-A72 and the Cortex-A57.


Carbon IP Exchange Portal Changes

IP Exchange enables users to configure, build, and download models for ARM IP. There are a few differences between the A57 and the A72. The first difference is the L2 cache size. The A57 can be configured with a 512 KB, 1 MB, or 2 MB L2 cache, but the A72 adds a fourth option of 4 MB.


Another new configuration which is available on IP Exchange for the A72 is the ability to disable the GIC CPU interface. Many designs continue to use version 2 of the ARM GIC architecture with IP such as the GIC-400. These designs can take advantage of excluding the GIC CPU interface.


The A72 also offers an option to include or exclude the ACP (Accelerator Coherency Port) interface.


The last new configuration option is the number of FEQ (Fill/Evict Queue) entries, which on the A72 has been increased to options of 20, 24, and 28, compared to the 16 or 20 entries offered by the A57. This feature has been important to Carbon users doing performance analysis and studying the impact of various L2 cache parameters.


The Cortex-A72 configuration from IP Exchange is shown below.



ACE Interface Changes

The main change to the A72 interface is that the width of the transaction ID signals has been increased from 6 bits to 7 bits. The wider *IDM signals only apply when the A72 is configured with an ACE interface. The main impact occurs when connecting an A72 to a CCI-400 which was used with an A53 or A57. Since those CPUs have 6-bit wide *IDM signals, the CCI-400 will need to be reconfigured for 7-bit wide *IDM signals. All of the A72 CPAKs which use the CCI-400 have this change made to them so they operate properly, but it’s something to watch if upgrading existing systems to the A72.


This applies to the following signals for A72:

  • AWIDM[6:0]
  • WIDM[6:0]
  • BIDM[6:0]
  • ARIDM[6:0]
  • RIDM[6:0]
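A trivial sketch of why the extra ID bit matters (illustrative only, not actual interconnect behavior): a port configured for 6-bit IDs simply cannot represent the top bit of a 7-bit A72 transaction ID.

```python
def id_fits_port(txn_id: int, port_id_width: int) -> bool:
    """True if a transaction ID is representable on a port of the given ID width."""
    return txn_id < (1 << port_id_width)

# A maximal 6-bit ID, as produced by an A53/A57, fits a 6-bit CCI-400 port...
assert id_fits_port(0b111111, 6)
# ...but an A72 can emit 7-bit IDs, which overflow an unreconfigured 6-bit port.
assert not id_fits_port(0b1000000, 6)
```

This is the reason the CCI-400 ID widths must be reconfigured when an A72 replaces an A53/A57, as noted above.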

System Register Changes

A number of system registers are updated with new values to reflect the A72.  The primary part number field in the Main ID register (MIDR) for A72 is 0xD08 vs the A57 value of 0xD07 and the A53 value of 0xD03. Clearly, the 8 was chosen well before the A72 number was assigned. A number of other ID registers change value from 7 on the A57 to 8 on the A72.
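A small sketch of how software might distinguish these CPUs from the MIDR value. The bit positions follow the architected MIDR layout (implementer in bits [31:24], primary part number in bits [15:4]); the example MIDR value is constructed for illustration.

```python
# Primary part numbers from the text above.
PART_NUMBERS = {
    0xD03: "Cortex-A53",
    0xD07: "Cortex-A57",
    0xD08: "Cortex-A72",
}

def decode_midr(midr: int) -> str:
    """Decode the implementer and primary part number fields of a MIDR value."""
    implementer = (midr >> 24) & 0xFF       # 0x41 is ASCII 'A' for ARM
    part = (midr >> 4) & 0xFFF              # primary part number, bits [15:4]
    vendor = "ARM" if implementer == 0x41 else hex(implementer)
    return f"{vendor} {PART_NUMBERS.get(part, 'unknown')}"

# Illustrative MIDR for a Cortex-A72 r0p0: implementer 0x41, part 0xD08.
print(decode_midr(0x410FD080))  # -> ARM Cortex-A72
```

Software that must behave differently per core (errata workarounds, tuning) typically keys off exactly this field.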

New PMU Events

There are a number of new events tracked by the Cortex-A72 Performance Monitor Unit (PMU). All of the new events have event numbers 0x100 and greater. There are three main sections covering:

  • Branch Prediction
  • Queues
  • Cache

The screenshots below from the Carbon Analyzer show the PMU events. All of these are automatically instrumented by the Carbon model and are recorded without any software programming.





The A72 contains many micro-architecture updates for incremental performance improvement. The most obvious one described here is the larger L2 FEQ, and there are certainly many more in the branch prediction, caches, TLB, pre-fetch, and floating-point units. As an example, I ran an A57 CPAK and an A72 CPAK with the exact same software program. Both CPUs reported about 21,500 instructions retired. This is the instruction count if the program were viewed as a sequential instruction stream. Of course, both CPUs perform a number of speculative operations. The A57 reported about 37,000 instructions speculatively executed and the A72 reported 35,700.
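The speculation overhead implied by these counts can be worked out directly (numbers taken from the run described above):

```python
# Both CPUs retire the same sequential instruction stream (~21,500 instructions);
# they differ in how many instructions they execute speculatively along the way.
RETIRED = 21_500
SPECULATED = {"Cortex-A57": 37_000, "Cortex-A72": 35_700}

for cpu, spec in SPECULATED.items():
    wasted = spec - RETIRED  # speculatively executed but never retired
    print(f"{cpu}: {wasted} discarded ops, {spec / RETIRED:.2f}x speculation ratio")
```

So for this workload the A72 executes roughly 1.66 instructions per retired instruction versus the A57's 1.72—a small but measurable improvement in speculation efficiency.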


The screenshots of the instruction events are shown below, first A72 followed by A57. All of the micro-architecture improvements of the A72 combine to provide the highest performance CPU created by ARM to date.





Carbon users can easily run the A57, A53, and now the A72 with various configuration options and directly compare and contrast the performance results using their own software and systems. The CPAKs available from Carbon System Exchange provide a great starting point and can be easily modified to investigate system performance topics.

Please join ARM's Pierre-Alexandre Bou-Ach to learn how he has been able to reduce the area of cost-efficient Mali GPUs using Synopsys Design Compiler Graphical and IC Compiler. In this technical session, Pierre, physical design lead, will be joined by Synopsys' Priti Vijayvargiya to present methodologies, design choices and results achieved.


The webinar will be held on April 23, 2015 (9AM PDT) and can also be viewed as a recorded event.


Click here to learn more about the webinar and to register for it (live or recorded).

Electronics applications are exploding in diversity, breadth and design complexity. The time-to-market pressure cooker is boiling. Performance, power and area are all demanding engineers' attention. As design teams push themselves, they often look to the enablement ecosystem for clues to what technologies they can exploit next.

This is particularly true in the spring of each year when the TSMC Technology Symposium is held in San Jose, Calif. Just before the event kicked off, I sat down with Suk Lee, Senior Director, Design Infrastructure Marketing Division of TSMC, to get a sense for what his company is rolling out to the industry in the coming months and how previously announced process nodes are progressing.

In short, Lee said:

  • Moore's Law is not slowing down
  • TSMC capex will jump 25% this year to roughly $12 billion
  • The ecosystem in general has done an excellent job of enabling double patterning and coloring at 16FF
  • 10nm is "progressing well," with more than 35 EDA tools certified using ARM's Cortex-A57 core as the vehicle
  • Development of EUV lithography solutions continues in joint research work with ASML, but more work needs to be done to achieve a useable power source.

Here's a link to the complete Q&A with Lee. And watch for additional dispatches from the 2015 TSMC Technology Symposium as I transcribe my notebook!

Related stories:

TSMC Symposium Updates (2014)


FinFET production and the ARM Ecosystem: TSMC readies 16nm FinFET ramp and tips 10nm FinFET plans

One of my associates, Achim Nohl, an expert in virtual prototyping and software bring-up on ARM® processors, will be conducting a webinar on 16 April 2015 (9AM PDT) on  driver development for ARMv8-based designs using Hybrid (Virtual+FPGA) Prototyping. Don't worry if you can't see it live, since it will also be available recorded.


Learn more about the webinar and register at TechOnLine for the live or recorded session.


In this webinar aimed at firmware and Linux driver developers, Achim will introduce Virtualizer™ Development Kits (VDKs) using ARM Fast Models and how they can be used to bring up drivers on ARMv8-based designs for DesignWare® interface IP like USB, Ethernet, PCI Express, UFS mobile storage, etc. In addition, he'll show how you can connect a virtual prototype of your ARMv8-based processor subsystem to a HAPS® FPGA-based prototype hosting the DesignWare digital core and analog PHY daughterboard to perform final hardware validation.


Here's what you can expect to learn:

  • The basic usage of a virtual prototype for hardware dependent software development
  • The anatomy of a Linux software stack for ARM and the integration of drivers into Linux and firmware
  • The use and benefits of hybrid prototypes in which virtual and physical prototype are connected

In addition to the webinar, Achim's also just posted a blog that gives a good introduction to the topic.

In late 2014, Carbon released the first Carbon Performance Analysis Kit (CPAK) utilizing the ARM CoreLink CCN-504 Cache Coherent Network. Today, the CCN-504 can be built on Carbon IP Exchange with a wide range of configuration options. There are now four CPAKs utilizing the CCN-504 on Carbon System Exchange. The largest design includes sixteen Cortex-A57 processors, the most processors ever included in a Carbon CPAK.


At the same time SoC Designer has added new AMBA 5 CHI features including support for monitors, breakpoints, Carbon Analyzer support, and a CHI stub component for testing.

Introduction to AMBA 5 CHI

To get a good introduction on AMBA 5 CHI I recommend the article, "What is AMBA 5 CHI and how does it help?".


Another interesting ARM Community article is “5 things you might not know about AMBA 5 CHI”.


Although the cache coherency builds on AMBA 4 ACE and is likely familiar, some of the aspects of CHI are quite different.


CCN-504 Configuration

Configuring the CCN-504 on Carbon IP Exchange is similar to all Carbon models. Select the desired interface types, node population, and other hardware details and click the "Build It" button to compile a model.



Understanding the Memory Map

One of the challenges of configuring CHI systems is to make sure the System Address Map (SAM) is correctly defined. As indicated in the table above, the process is more complex compared to a simple memory map with address ranges assigned to ports.


The network layer of the protocol is responsible for routing packets between nodes. Recall from the previous article that CHI is a layered protocol consisting of nodes of various types. Each node has a unique Network ID and each packet specifies a Target ID to send the packet to and a Source ID to be able to route the response.


For a system with A57 CPUs and a CCN-504 each Request Node (RN), such as a CPU, has a System Address Map (SAM) which is used to determine where to send packets. There are three possible node types a message could be sent to: Miscellaneous Node (MN), Home Node I/O coherent (HN-I), or Home Node Fully coherent (HN-F). DVM and Barrier messages are always sent to the MN so the challenge is to determine which of the possible Home Nodes an address is destined for.


To calculate which HN-F is targeted, the RN uses an address hash function, which can be found in the CCN-504 TRM.


Each CCN has a different hashing function depending on how many HN-F partitions are being used.


The hashing function calculates the HN-F to be used, but this is still not a Network ID. Additional configuration signals provide the mapping from HN-F number to Node ID.
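As a purely illustrative sketch of this two-step lookup: the real hash function is defined in the CCN-504 TRM and depends on the HN-F partition count, so the fold function and Node ID values below are made up, not the actual algorithm.

```python
# Hypothetical configuration: 4 HN-F partitions with assumed Node ID values.
HNF_NODE_IDS = [8, 12, 16, 20]

def hnf_index(address: int, num_hnf: int) -> int:
    """Stand-in hash: XOR-fold cache-line address bits into an HN-F index.

    The real CCN-504 hash differs; this just shows that the hash spreads
    addresses across partitions deterministically.
    """
    bits = num_hnf.bit_length() - 1   # log2(num_hnf), num_hnf a power of two
    line = address >> 6               # cache-line granule
    folded = 0
    while line:
        folded ^= line & (num_hnf - 1)
        line >>= bits
    return folded

def target_node_id(address: int) -> int:
    """Step 2: map the hashed HN-F index to its configured Node ID."""
    return HNF_NODE_IDS[hnf_index(address, len(HNF_NODE_IDS))]

print(target_node_id(0x8000_0000))  # every address lands on exactly one HN-F
```

The same two-step structure—hash to an HN-F number, then look up its Node ID from configuration—is why both the SAM parameters and the Node ID mapping must agree between the RNs and the CCN for the memory map to work.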


All of this means there are a number of SAM* parameters for the A57 and the CCN-504 which must be set correctly for the memory map to function. It also means that a debugging tool which makes use of back-door memory access needs to understand the hashing function to know where to find the correct data for a given address. SoC Designer takes all of this into consideration to provide system debugging.


As you can see, setting up a working memory map is more complex compared to routing addresses to ports.


Carbon models use configuration parameters to perform the following tasks:

  • Associate each address region with HN-Fs or HN-Is
  • Specify the Node ID values of Home Nodes and the Miscellaneous Node
  • Define the number of Home Nodes
  • Specify the Home Nodes as Fully Coherent or I/O Coherent


The parameters for the A57 CPU model are shown below:




The parameters for the CCN-504 model are similar: a list of SAMADDRMAP* values and SAM*NODEID values.


It’s key to make sure the parameters are correctly set for the system to function properly.


Cheat Sheet

Sometimes it’s helpful to have a picture of all of the parts of a CCN system. The cheat sheet below has been a tremendous help for Carbon engineers to keep track of the node types and node id values in a system.



SoC Designer Features

With the introduction of AMBA 5 CHI, SoC Designer has been enhanced to provide CHI breakpoints, monitors, and profiling information.


Screenshots of CHI transactions and CHI profiling are shown below. The Target ID and the Source ID for each transaction are shown. This is from the single-core A57 CPAK so the SourceID values are always 1. Multi-core CPAKs will create transactions with different SourceID values.


The CCN-504 has a large number of PMU events which can be used to understand performance.






AMBA 5 CHI is targeted at systems with larger numbers of coherent masters. The AMBA 5 CHI system memory map is more complex compared to ACE systems. A number of System Address Map parameters are required to build a working system, both for the CPU and for the interconnect.


Carbon SoC Designer is a great way to experiment and learn how CHI systems work. Pre-configured Carbon Performance Analysis Kits (CPAKs) available on Carbon System Exchange can be downloaded and run, and demonstrate both the hardware configuration and the software programming needed to initialize a CHI system. Just like the address map, the initialization software is more complex compared to an ACE system with a CCI-400 or CCI-500.

NUREMBERG, Germany—One of the most significant announcements in design synthesis popped here earlier this year, when Cadence unveiled Stratus at a press conference during Embedded World 2015.

As Frank Schirrmeister, Cadence group director for the System Development Suite, put it, “High-level synthesis has come of age.” Stratus combines the best of both the Cadence C-to-Silicon Compiler HLS tool and Cynthesizer from Forte Design Systems, which Cadence acquired in 2014. Here's a roundup of stories about the technology and its utility.





I hope to see you at the 25th annual Synopsys Users' Group meeting in Silicon Valley, where we'll have a bunch of excellent sessions relating directly to ARM-based design. Stop on by for one of these and stay for the many other incredible designer-driven technical presentations.


Also, don't miss the ARM Keynote on Tuesday by Dipesh Patel on IoT - "The Internet of Completely Different Things."


You can find me around these various sessions as well as at the SNUG Designer Community Expo, doing video interviews with all of our ecosystem partners (4-8PM today, Monday 23 March).


All that technical goodness and free food and beer at the Designer Community Expo on Monday evening and the SNUG Pub on Tuesday evening!


Mon 23 March

12:30-2        IC Compiler II Lunch – Accelerating Products to Market with the power of 10X [including ARM on Cortex-A72] - [MA-12]

2-3:30          Minimum Energy Design for Sub-threshold Wireless Sensor Nodes [by ARM] [MB-03]


Tues 24 March

9-10              Keynote Address - IoT - The Internet of Completely Different Things - Dr. Dipesh Patel, EVP, ARM

10:30-12      Area-centric Reference Implementation Flow for ARM Mali GPU [by ARM] [TA-02]

12-1:30        Design Compiler Lunch – accelerating innovation with Design Compiler [including ARM on Cortex-A72] [TA-12]

1:30-3:30     Renesas successful Cortex-A57 MPCore processor implementation in 16nm FinFET [Renesas] [TB-01]

                    High-performance, energy efficient Cortex-A72 processor core implementation in 16FFLL+ with Galaxy tools [ARM+ Synopsys]

1:30-3:30     Achieving Highest Accuracy FinFET Extraction w/ StarRC "QuickCap Inside" Solution [ARM] [TB-09]


Wed 25 March

12:15-1:45     Lynx Lunch-and-Learn – Design exploration of Cortex-A53 mobile in Samsung 28LPP and 14LPP FinFET [WB-08]


I hope to see you there!

Jason Andrews

EDA Containers

Posted by Jason Andrews Mar 21, 2015

Linux containers provide a way to build, ship, and run applications such as the EDA tools used in SoC design and verification. EDA Containers is a LinkedIn group to explore and discover the possibilities of using container technology for EDA application development and deployment. Personally, I work in virtual prototyping, doing simulation of ARM systems and software. This is a challenging area because it involves not only hardware simulation tools, but also software development tools to compile and run software. We are looking for other engineers interested in exploring containers as they relate to EDA tools and the embedded software development process. If you are interested in learning, or have expertise to share related to Docker, LXC, LXD, or Red Hat containers, please join us. The group is not specific to any EDA company or product. The members are from various companies who just happen to be interested in learning and exploring what can be done with Linux containers.


If you are interested, please join the group, or feel free to discuss related topics here in the ARM Community!



SANTA CLARA, Calif.—The launch of Cadence's new Innovus Implementation System heralds “a new era” in physical implementation technology, breaking longstanding electronic system-design bottlenecks, according to Rahul Deokar, product management director with Cadence.

Deokar gave a technical overview of the new technology at CDNLive Silicon Valley, just hours after it was unveiled during a keynote address at the annual event (March 10). “Older implementation tools had forced you guys as designers to do smaller design blocks,” he told a standing room-only audience at the Santa Clara Convention Center here. “You can now handle 5-10 million instance design blocks…and you can take weeks or even months off your SoC design schedules.”

Leapfrog Effect

Against the backdrop of SoC Implementation, the new technology represents a fundamental overhaul of the Encounter system that “leapfrogs” the industry and delivers a far more compelling digital implementation solution than the industry has experienced, Deokar said.

Previously, optimizing for power, performance and area (PPA) and improving turn-around time (TAT) was an either-or choice, he said.

"These were two conflicting objectives in a lot of ways. Traditional tools have effectively tackled just one or the other; however, what good is it if the tool runs super-fast but ends up with sub-optimal PPA?" Deokar said. "Innovus gets you the best of both worlds on TAT and PPA."

By delivering performance that is up to 10x faster, design blocks that took 7-10 days can now be run in 1-2 days. The 10-20 percent PPA improvement is equivalent to a half-node or even a full-node transition, without actually moving to the new node, he added.

Furthermore, because the technology is integrated with Cadence signoff solutions, significant additional productivity gains can be achieved along the flow, he added.

And Innovus is not targeted at just bleeding-edge nodes such as 16/14/10nm; it has vast utility for established process nodes as well, Deokar said.

Driving improved TAT

A massively parallel architecture is key to improved turn-around time, Deokar said. The core algorithms have been improved such that “even if you're not running on 16 or 32 or 64 CPUs, the core algorithms of placement, optimization and routing have been sped up. Even on 2- and 4-CPU machines, you should be able to see TAT advantages,” he said. “Now, add multithreading, distributed network processing and MMMC (multi-mode/multi-corner) scenario acceleration, and you get the complete massively parallel system.”

That means really large chips that forced teams to divide the SoC into many blocks to manage the placement and routing complexities can now work with fewer blocks, which cuts design time and saves money, Deokar said.

He cited as one example a 28nm 2.8 million-cell networking IP running on 8 CPUs (pictured) that saw implementation time cut from 336 hours to 48 hours—a 7x improvement.

Pushing PPA

The other key Innovus benefit for PPA represents a big step forward, he said.

Traditionally, placement ran on heuristic-based algorithms, but GigaPlace in Innovus is solver- or equation-based.

“That means you can model in the equation a lot of different design variables - timing, slack, power, wire length, congestion, layer awareness,” Deokar said. “GigaPlace concurrently solves both the electrical and physical objectives. As a result you get better PPA.”

Another feature is that Innovus is now power aware throughout the optimization process, Deokar said.

“For all the transforms that were timing and area aware, power is now a part of that same cost function,” he told his audience.

A third key component is that the concurrent clock and datapath optimization technology from Azuro, which Cadence acquired in 2011, is now fully integrated.

“A lot of high-performance designers have unique clocking methodologies -- H-trees, clock meshes, multipoint CTS,” he said. “You guys invest a lot of manual effort building these, but since these are customized, they’re not flexible when process and technology changes occur.”

The CCOpt FlexH feature integrated into Innovus is a combination of a regular clock tree and an H-tree, he said. “You get the best of both worlds in automation and in cross-corner variation, as well as in high performance and a power-efficient clock network.”

Deokar also highlighted the track-aware optimization features of NanoRoute.

“Before you go into your detailed route step, right after track assignment to the different metal layers, we do timing-aware optimization,” Deokar said. “This proactively prevents signal integrity issues from occurring downstream in the flow, and dramatically reduces the timing jump between pre-route and post-route optimization.”

Finally, Deokar noted productivity gains from the integration of Innovus with existing Cadence signoff technologies such as Tempus, Voltus and Quantus, as well as from a common user interface and from reporting and visualization enhancements.

More information about Innovus can be found by navigating to the technology's landing page.

SANTA CLARA, Calif.—It was amazing he didn’t lose us at the first slide.

ARM CEO Simon Segars ventured to Cadence's annual CDNLive Silicon Valley this week to deliver the opening-day keynote. And his first slide was an aerial photo of a small but eye-poppingly beautiful tropical island.

That the standing-room-only audience here didn’t mentally drift off on vacation to that lovely place was a testament to Segars’ sobering message: Given the continuing constraints on battery-life improvement, we have a lot of work ahead of us.

Not that the electronics system design world hasn’t put in Herculean effort already.

Said Segars:

“We've delivered orders of magnitude more compute performance. We've gone from having just a CPU in a mobile device to having a dedicated graphics engine, a video engine, really high-performance DSPs. What's incredible to think about is how the power efficiency of this device has improved. Battery technology is pretty poor and doesn't advance anywhere near what we're able to achieve in semiconductors.”

Even so, much work lies ahead. Given not only the continued robust growth of mobile and IoT devices in the coming years but their impact on networking, “We need to evolve the way we design products and deliver silicon chips,” Segars said.

For design teams, the pressures across those segments are enormous and the stakes are high.

In mobile, for example, the smart phone is now effectively a remote control device for our lives—from controlling lighting, TVs and cameras to hailing a cab through ride services such as Uber, Segars noted. And smart phones are increasingly creating sophisticated content, not just consuming it. That requires more performance, storage and power than ever.

The rise of sensors within phones and sensor fusion within the larger IoT world (golf clubs that help you improve your swing based on sensor data) also create new design challenges, he said.

A new frontier of design challenges is emerging rapidly in the network, where the very proliferation of edge devices is stressing bandwidth, capacity and storage needs, he said. It’s also forcing the industry to rethink fundamentally what the network is.

Said Segars:

“The architecture of the network hasn't changed for a long time. There is a lot of switching and funneling of data that goes on. But there essentially is a client device, there's a network in the middle and a computer and storage at the end of it—what we call the cloud. We think we can extend intelligence throughout it to turn it into what can be another platform of innovation.”


In this vision, the distinction between network and cloud vanishes and is replaced by a continuum of computing.

“If you distribute compute and storage throughout the network, there's a lot you can do,” Segars said.

For example, moving storage closer to the edge, where data is captured or consumed, can eliminate the need to shunt terabytes' worth of data through the network. That effectively increases bandwidth and reduces latency. If evolved properly, this new network vision will become a platform for immense innovation, in the way that the evolution of the system-on-chip has become a platform for innovation within systems, Segars said.

The executive just returned from Mobile World Congress in Barcelona, where there was much discussion of what 5G networks will look like. However that vision evolves, it is clear that “one size does not fit all” when it comes to enabling semiconductor, system and software technologies, he noted.

For example, ARM positioned its most recent major announcement not solely as the debut of a new microprocessor, the Cortex-A72 (ARM Cortex-A72 and the New Premium Mobile Experience), but as an IP suite designed to enable the next generation of smart devices.

In addition, this new platform “requires a range of semiconductor solutions. We're going to need different amounts of processing and optimization and acceleration for networking, security and storage,” Segars added.

“We need to get creative in how we build chips. Integrated development tools and methodologies are essential,” Segars said.

To that end, he praised the semiconductor ecosystem and highlighted his company’s longtime partnership with Cadence in particular:

"Good chip design starts with system design, so the work we're doing on Palladium and fast models (Accelerating the Time to Point of Interest by 50X Using Cadence Palladium Platform with ARM Fast Models) is really important to allow you to get down to building the thing that's going to best solve the technical problems you're addressing."

He added:

"The improvement in IP, the work on the tools and flows, the evolution of software, of networking and of the platform creates this ecosystem where we can turn this mobile device from something that's great today to something that's amazing in the future.”

--Cadence announces complete SoC development environment for new ARM mobile IP suite

-- Optimizing ARM Cortex-M7 with Cadence


In Part 1 of this blog series (found here) we introduced the ARM CoreLink™ CCI-500 Cache Coherent Interconnect and described some of the new configurable features which are available over and above what was available with the previous-generation CoreLink™ CCI-400. We described how a UVM testbench is needed in order to start exploring the enhanced performance capabilities that are on offer, and we introduced an automation tool, Interconnect Workbench, which removes the need for manual testbench creation.

In Part 2 of the blog we start by exploring how CoreLink CCI-500 performs in a CoreLink CCI-400 like configuration and follow that by showing the full performance potential of CoreLink CCI-500 when configured for maximum performance.


CoreLink CCI-500 as a CoreLink CCI-400 replacement

As a first experiment we have configured CoreLink CCI-500 with 2x ACE input ports, 3x ACE-Lite input ports, 2x memory ports and 1 system port; this matches the fixed configuration of the previous-generation Cache Coherent Interconnect, the CoreLink CCI-400. We have then created a scenario which drives saturating transactions into all of the input ports, targeting the two memory ports. The transactions are all defined as Non-shareable, so we eliminate the effect of L2 cache snoops and see just the raw throughput. Running the testbench at 500MHz also provides a useful point of reference, as many CoreLink CCI-400 designs run at this speed.


As in Part 1, we can easily generate a testbench for the CoreLink CCI-500 configuration; a diagram of the generated UVM testbench is shown below.


Screen Shot 02-19-15 at 11.53 AM.JPG


The generated testbench contains all the necessary instances of fully configured AMBA VIP, along with an instance of the Interconnect Validator VIP connected to all of the interface VIP. This additional VIP provides full system scoreboard functionality to support tracking each transaction from its entry point to its exit point, including all the coherency modelling needed for ACE. It also captures all the timing details needed for performance analysis; the graphs shown in this blog all come from this source.


It is also important to note that all the Slave VIP instances which model the memory in the system are configured to have zero delay, so that we see only the effective delays and bandwidth of the CoreLink CCI-500.

Screen Shot 02-18-15 at 08.43 AM.JPG


As can be seen from the chart, the CCI-500 delivers around 14GB/s of both READ and WRITE bandwidth.


Unleashing the full CoreLink CCI-500 performance

The big benefit of CoreLink CCI-500 over its predecessor is the capability to support two additional memory ports over and above the two supported by CoreLink CCI-400; in addition, the design can be targeted to run at 667MHz in appropriate technologies. The figure below shows the same non-shareable saturating test running on the CoreLink CCI-500 configuration described previously, i.e. with two memory ports at 500MHz (the red lines), alongside a CoreLink CCI-500 configuration with four memory ports running at 667MHz (the blue lines). The graphs show both read and write bandwidth for the two implementations, and the improvement is clear: around 14GB/s vs 33GB/s of bandwidth for both read and write traffic.


Screen Shot 02-19-15 at 02.52 AM.JPG

One of the interesting phenomena we see in the simulations is that there is a delay between simulation startup and high bandwidth levels being achieved, despite the fact that all masters are trying to make saturating memory accesses from the start. This is caused by the need for the Snoop Filter (more blogs are coming to explain the Snoop Filter) to initialize its RAM. The configuration we have chosen for this test is the “Large” configuration of CoreLink CCI-500, with four memory ports (blue graphs) and also 4x ACE ports (compared to 2x in the 500MHz case). To support more ACE ports the Snoop Filter RAM is larger and hence takes longer to initialize.

Managing Bandwidth Requests

In order to understand how well CoreLink CCI-500 handles demanding scenarios, it is useful to visualize how much it is stalling the requesting masters; this is sometimes called “back pressure”. A useful proxy for back pressure in AMBA infrastructure is the concept of Outstanding Transactions. The AMBA ACE, ACE-Lite, AXI4 and AXI3 protocols all support issuing multiple transactions, assuming that the receiving interface can support it.


As multiple transactions get issued into the system, the number of Outstanding Transactions (OT), i.e. transactions that have been initiated but are incomplete, increases. The OT level will increase until the receiving interface (in this case the CoreLink CCI-500) throttles it; this limit is generally called the read_acceptance or write_acceptance limit. The chart below shows WRITE bandwidth in red (as seen on the last chart) plotted with the WRITE OT level, in blue, for all of the initiating masters combined. While the system is stalled waiting for the Snoop Filter RAM to be initialized, the OT level is flat; once the RAM initialization is complete, the CoreLink CCI-500 starts trying to balance the requesting masters.


Screen Shot 02-19-15 at 11.27 PM.JPG


After a brief peak of nearly 100 Outstanding Transactions, the OT level starts to decrease, settling at around 65 OT, plus or minus around 5. We could run the simulation for longer to determine whether this is steady state.


If you happen to be attending DVCon in a week's time, I will be jointly presenting alongside Simon Rance from ARM at Session 6.3 on Tuesday 3rd March 2015; we would be very happy to chat about this or other ARM system performance topics.


Event Details | DVCon


Watch out for more parts of this blog in which we will further explore key features of CoreLink CCI-500.


Exploring the ARM CoreLink™ CCI-500 performance envelope - Part 1

Carbon cycle accurate models of ARM CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). Carbon models instrument the PMU registers and record PMU events into the Carbon System Analyzer database without any software programming. Contrast this non-intrusive PMU event collection with other common methods of software execution:


  • ARM Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU

The ARM Cortex-A53 is a good example to demonstrate the features of SoC Designer. The A53 PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events.

The Carbon A53 model instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.

The A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.



Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the Carbon System Analyzer database.

Bare Metal Software


The automatic instrumentation of PMU events is ideal for bare metal software since it requires no programming and will automatically cover the entire timeline of the software test or benchmark. Full control is available to enable the PMU events at any time by stopping the simulator and enabling or disabling profiling.


All of the profiling data from the PMU events, as well as the bus transactions, and the software profiling information end up in the Carbon Analyzer database. The picture below shows a section of the Carbon Analyzer GUI loaded with PMU events, bus activity, and software activity.




The Carbon Analyzer provides many out-of-the-box calculations of interesting metrics, as well as a complete API which allows plugins to be written to compute additional system- or application-specific metrics.

Linux Performance Analysis


Things get more interesting in a Linux environment. A common use case is to run Linux benchmarks to profile how the software executes on a given hardware design. Linux can be booted quickly and then a benchmark can be run using a cycle accurate virtual prototype by making use of Swap & Play.


Profiling enables events to be collected in the analyzer database, but the user doesn’t have the ability to understand which events apply to each Linux process or to differentiate events from the Linux kernel vs. those from user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using techniques from Three Tips for Using Linux Swap & Play with ARM Cortex-A Systems.

Using PMU Counters from User Space


Since the PMU can be used for Linux benchmarks, the first thing that comes to mind is to write some initialization code to set up the PMU, enable counters, run the test, and collect the PMU events at the end. This strategy works pretty well for those willing to get their hands dirty writing system control coprocessor instructions.

Enable User Space Access


The first step to being able to write a Linux application which accesses the PMU is to enable user-mode access. This needs to be done from the Linux kernel. It's very easy to do, but requires a kernel module to be loaded or compiled into the kernel. All that is needed is to set bit 0 in the PMUSERENR register to 1. It takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.
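A minimal sketch of such a module for an AArch32 target (the module name, structure, and use of on_each_cpu() are illustrative; the one essential piece is the MCR write to PMUSERENR):

```c
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp.h>

MODULE_LICENSE("GPL");

/* Write PMUSERENR (AArch32 encoding: CP15, c9, c14, opcode2 0).
 * Setting bit 0 grants user space access to the PMU registers. */
static void set_pmuserenr(void *en)
{
    unsigned long v = (unsigned long)en;
    asm volatile("mcr p15, 0, %0, c9, c14, 0" : : "r"(v));
}

static int __init enable_pmu_init(void)
{
    on_each_cpu(set_pmuserenr, (void *)1, 1); /* EN=1 on every core */
    return 0;
}

static void __exit enable_pmu_exit(void)
{
    on_each_cpu(set_pmuserenr, (void *)0, 1); /* restore EN=0 */
}

module_init(enable_pmu_init);
module_exit(enable_pmu_exit);
```

Running the enable on every core matters because PMUSERENR is a per-CPU register; a user-space benchmark may be scheduled on any core.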




Building a kernel module requires a source tree for the running kernel. If you are using a Carbon Performance Analysis Kit (CPAK), this source tree is available in the CPAK or can easily be downloaded by using the CPAK scripts.


A source code example as well as a Makefile to build it can be obtained by registering here.


The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs it’s easier for me to just add it to the kernel. When I’m working with a board where I can natively compile it on the machine it’s easier to dynamically load it using:

$ sudo insmod enable_pmu.ko

Remember to use the lsmod command to see which modules are loaded and the rmmod command to unload it when finished.

The exit function of the module returns the user mode enable bit back to 0 to restore the original value.

PMU Application


Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:

  • Reset count values
  • Select which of the six PMU counter registers to use
  • Set the event to be counted, such as instructions executed
  • Enable the counters to start counting

Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.




The cycle counter is distinct from the six event count registers; it is read from a separate CP15 system control register. For this example, event 0x8, instruction architecturally executed, is monitored using event count register 0. Please take a look at the source code for the simple test application used to count cycles and instructions of a simple printf() call.
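The flow above might look like the following on an AArch32 target (a sketch: the helper names are illustrative, and the MRC/MCR lines use the CP15 PMU register encodings documented in the ARM architecture manual; PMUSERENR must already be enabled by the kernel):

```c
#include <stdio.h>

/* Reset and start the PMU: select counter 0, count event 0x08
 * (instruction architecturally executed), enable counter 0 and the
 * cycle counter, then write PMCR with E (enable), P and C (reset). */
static inline void pmu_start(void)
{
    asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(0));    /* PMSELR */
    asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r"(0x08)); /* PMXEVTYPER */
    asm volatile("mcr p15, 0, %0, c9, c12, 1"                 /* PMCNTENSET */
                 : : "r"(1u | (1u << 31)));
    asm volatile("mcr p15, 0, %0, c9, c12, 0"                 /* PMCR */
                 : : "r"(1u | (1u << 1) | (1u << 2)));
}

static inline unsigned int pmu_cycles(void)
{
    unsigned int v;
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); /* PMCCNTR */
    return v;
}

static inline unsigned int pmu_events(void)
{
    unsigned int v;
    asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v)); /* PMXEVCNTR */
    return v;
}

int main(void)
{
    pmu_start();
    unsigned int c0 = pmu_cycles(), e0 = pmu_events();
    printf("hello from the PMU example\n"); /* code of interest */
    unsigned int c1 = pmu_cycles(), e1 = pmu_events();
    printf("cycles: %u, instructions: %u\n", c1 - c0, e1 - e0);
    return 0;
}
```

The same read-before and read-after pattern can be wrapped around any region of interest in a benchmark.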




This article provided an introduction to using the Carbon Analyzer to automatically gather information on ARM PMU events for bare metal and Linux software workloads. Carbon models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured.


It also explained how additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application. This can be done with no kernel changes, but does require the PMU to be enabled from user mode and is limited to the number of counters available in the PMU; six for CPUs such as the Cortex-A15 and A57.


Next time I will look at an alternative approach to use the ARM Linux PMU driver and a system call to collect PMU events. 

There's little doubt that the Internet of Things (IoT) market is keenly dependent on sensor design. This raises a number of engineering challenges, not the least of which are power, area and integration. In addition, because it's a fast-growing, fast-changing segment, time to market is critical. This means design tools and methodologies need to evolve to enable these systems.

The second of two webinars about MEMS, IoT and sensor design is now available. In it, experts from ARM, Cadence and Coventor discuss

  • How to create a MEMS component using Coventor tools
  • How to design and integrate analog conditioning circuits using Cadence platforms
  • How to design energy-efficient discrete smart sensors and sensor fusion hubs with the ARM® Cortex®-M processor family

My Cadence colleague, Richard Goering, offers a summary. In the webinar, Tim Menasveta, CPU Product Manager, Cortex-M0 and Cortex-M0+ Processors, at ARM; Ian Dennison, Solutions Marketing Senior Group Director for the Custom IC and PCB Groups at Cadence; and Chris Welham, Worldwide Applications Engineering Manager at Coventor, present how to design a MEMS vibration sensor (right).

In the first webinar, Diya Soubra, CPU Product Manager for ARM Cortex-M3 processors at ARM, and Dennison guided listeners through ways to reduce time to market and realize power-performance-area design targets.


Related stories:

-- IoT Webinar Series Part 1

-- IoT Webinar Series Part 2

-- Upcoming Webinar: SoC Verification Challenges in the IoT Age

-- Iconic wearable hits the Million mark, sign of things to come?

-- Whitepaper: Pushing the Performance Boundaries of ARM Cortex-M Processors for Future Embedded Design

--Cortex-M7 Launches: Embedded, IoT and Wearables

--New Cortex-M7 Processor Balances Performance, Power

--The new ARM® Cortex®-M7 »
