
ARM Processors


SemiWiki recently published a book on FPGA-based prototyping titled “PROTOTYPICAL: The Emergence of FPGA-Based Prototyping for SoC Design.” Among other things, the book explores ARM’s role in FPGA prototyping technology. Below is an excerpt from the book. If you want to read the entire book, you can download it from the S2C web site at


“Developing for ARM Architecture

Since ARM introduced its Cortex strategy, with A cores for application processors, R cores for real-time processors, and M cores for microcontrollers, designers have been able to choose price/performance points – and migrate software between them. How do designers, who are often doing co-validation of SoC designs with production software, prototype with these cores?


Some teams elect to use ARM’s hard macro IP offering, with optimized implementations of cores. ARM has a mixed prototyping solution with their CoreTile Express and LogicTile Express products. CoreTile Express versions are available for the Cortex-A5, Cortex-A7, Cortex-A9, and Cortex-A15 MPCore processors, based on a dedicated chip with the hardened core and test features. The LogicTile Express comes in versions with a single Xilinx Virtex-5, dual Virtex-6, or single Virtex-7 FPGAs, allowing loose coupling of peripheral IP.


Others try to attack the challenge entirely in software. Cycle-accurate and instruction-accurate models of ARM IP exist, which can be run in a simulator testbench along with other IP. With growing designs come growing simulation complexity, and with complexity comes drastic increases in execution time or required compute resources. Simulation supports test vectors well, but is not very good at supporting production software testing – a large operating system can take practically forever to boot in a simulated environment.

Full-scale hardware emulation has the advantage of accommodating very large designs, but at substantial cost. ARM has increased its large design prototyping efforts with the Juno SoC for ARMv8-A, betting on enabling designers with a production software-ready environment with a relatively inexpensive development board.


However, as we have seen, SoC design is rarely about just the processor core; other IP must be integrated and verified. Without a complete pass at the full chip design with the actual software, too much is left to chance in committing to silicon. While useful, these other platforms do not provide a cost-effective end-to-end solution for development and debug with distributed teams. Exploration capability in a prototyping environment is also extremely valuable: design elements can be changed out in a search for better performance, lower power consumption, third-party IP evaluation, or other tradeoffs.


The traditional knock on FPGA-based prototyping has been a lack of capacity and the hazards of partitioning, which introduces uncertainty and potential faults. With bigger FPGAs and synthesizable RTL versions of ARM core IP, many of the ARM core offerings now fit in a single FPGA without partitioning. Larger members of the ARM Cortex-A core family have been successfully partitioned across several large FPGAs without extensive effort and adverse timing effects, running at speeds significantly higher than simulation but without the cost of full-scale hardware emulation.


A hybrid solution has emerged in programmable SoCs, typified by the Xilinx Zynq family. The Zynq UltraScale+ MPSoC has a quad-core ARM Cortex-A53 with a dual-core ARM Cortex-R5 and an ARM Mali-400MP GPU, plus a large complement of programmable logic and a full suite of I/O. If that is a similar configuration to the payload of the SoC under design, it may be extremely useful to jumpstart efforts and add peripheral IP as needed. If not, mimicking the target SoC design may be difficult.


True FPGA-based prototyping platforms offer a combination of flexibility, allowing any ARM core plus peripheral IP payload, and debug capability. Advanced FPGA synthesis tools provide platform-aware partitioning, automating much of the process, and are able to deal with RTL and packaged IP such as encrypted blocks. Debug features such as deep trace and multi-FPGA visibility and correlation speed the process of finding issues.


The latest FPGA-based prototyping technology adds co-simulation, using a chip-level interconnect such as AXI to download and control joint operations between a host-based simulator and the hardware-based logic execution. This considerably increases the speed of a traditional simulation and allows use of a variety of host-based verification tools. Using co-simulation allows faster turnaround and more extensive exploration of designs, with greater certainty in the implementation running in hardware.


Integration rollup is also an advantage of scalable FPGA-based prototyping systems. Smaller units can reside on the desk of a software engineer or IP block designer, allowing dedicated and thorough investigation. Larger units can support integration of multiple blocks or the entire SoC design. With the same synthesis, debug, and visualization tools, artifacts are reused from the lower level designs, speeding testing of the integrated solution and shortening the time-to-success.


Another consideration in ARM design is that not all cores are stock. In many cases, hardware IP is designed using an architectural license and customized to fit specific needs. In these cases, FPGA-based prototyping is ideal for quickly experimenting with and modifying designs, which may undergo many iterations. Turnaround time becomes very important and is a large productivity advantage for FPGA-based prototyping.”

The ISC16 event took place last week in Frankfurt, Germany. ISC stands for International Supercomputing Conference, and while ARM is known for its energy-efficient mobile CPU cores, we are beginning to make some waves in the arena of the world’s largest computers.


ARMv8-A brings out the real strength of Fujitsu’s microarchitecture

To kick off the week, our partner Fujitsu unveiled their plan for the next-generation “Post-K” supercomputer to be based on ARMv8-A technology. It turns out ARM Research has been working hard for several years on a number of technical advantages that will give ARM partners an edge in the HPC market, and Fujitsu has taken note. At ISC16, both Fujitsu and RIKEN, the user of Japan’s fastest and current “K” supercomputer, presented their plans to collaborate on the ARM-based Post-K supercomputer. The significance of this announcement can’t be overstated, as this strategic project is seen as Japan’s stepping stone to the Exascale tier of supercomputing. Exascale requires roughly 10x the computing power of today’s fastest computers, yet must function within similar power envelopes. It is a lofty goal.


More will be divulged by ARM on its HPC technology at HotChips this August, but in the meantime, here are a few links to recent articles covering the Fujitsu announcement and others relating to ARM in HPC. The Next Platform article does a particularly good job of highlighting why Fujitsu and RIKEN see value in the ARMv8-A architecture and ARM server ecosystem:



Designed for HPC Applications

The last two articles linked above are interesting in that they seem to imply that ARM is delving into HPC based on “mobile chips”. This certainly isn’t the case. ARM and its partners are taking advantage of the architectural flexibility the ARM business model provides them. Fujitsu and others are designing CPUs from the ground up with HPC codes and end-user supercomputer applications fully in mind, while still benefiting from the energy efficiency of the ARMv8-A architecture. As noted in the slide shown above, Fujitsu’s own “POST-K” microarchitecture and their collaboration with RIKEN and ARM is a great example of this. We expect more to come from other ARM partners in the future, so stay tuned.

SAN FRANCISCO--In the decades since the open source software movement emerged, it’s always seemed to pick up momentum, never abating.

This year is no exception as we roll into Red Hat Summit, June 27-30, in San Francisco.

ARM and its ecosystem partners will be at the Moscone Center outlining how server, networking and storage applications are deploying today and how optimized ARM technology platforms provide scalable, power-efficient processing.

For a sneak peek at one of the interesting trends in open source, check out Jeff Underhill’s post on Ceph’s embrace of ARM (Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!).

Then join us in booth #129 with our partners Cavium, Linaro, Penguin, SoftIron, AppliedMicro and Western Digital to get the latest insights on open source software and hardware design.

Don’t miss the Thursday panel (3:30-4:30 p.m.) “Building an ARM ecosystem for the enterprise: Through the thorns to the stars,” moderated by Red Hat’s Jon Masters and featuring Underhill, Yan Fisher (Red Hat), Mark Orvek (Linaro), and Larry Wikelius (Cavium).


Related stories:

Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!

The amount of data that consumers are producing is increasing at a phenomenal rate, and shows no signs of slowing down anytime soon. Cisco estimated last year that global mobile data traffic would multiply tenfold between 2014 and 2019, up to a rate of 24.3 exabytes per month. In order to support this continuing evolution, the infrastructure backbone of the cloud needs to stay ahead of the curve.



Cloud and server infrastructure needs to stay ahead of predicted usage trends


This requires large-volume deployments of servers in the “cloud provider” space. Large Internet companies are building out datacenters at an unprecedented scale to manage all of this data. There is an insatiable appetite for more compute, with the caveat that it needs to be delivered at the highest compute density within given server constraints to minimize Total Cost of Ownership (TCO). Datacenters are replacing servers on a shorter cycle as well as evaluating new installations more often because workflow demands are constantly changing. There is a huge opportunity for server SoC vendors to innovate, with some aspects being critical to successfully building a server SoC:


  • Time-to-market
  • Performance/Watt
  • Higher levels of integration in costly advanced technology nodes




The explosion in computing applications brings opportunity for tailored server SoCs


To help more ecosystem partners enter the market, ARM has designed processors and System IP blocks (e.g. CoreLink™ Cache Coherent Network, Memory Controllers) that can meet the performance and reliability expectations for the mainstream server market. This helps our partners to develop SoCs for their target applications, which in turn enables OEMs and Datacenter providers to get the right performance within budget. ARM has now taken this a step further in enabling our silicon partners to deliver a mainstream server SoC by developing and delivering a server subsystem.



What is a Server Subsystem?


A server subsystem is a collection of ARM IP (processors and System IP) that has been architected and integrated together, along with all the glue logic necessary, to lay down the foundation for a server-class SoC. The subsystem configuration specifically targets mainstream requirements, which cover roughly 80% of the server market. The subsystem allows a partner to go quickly from Register Transfer Level (RTL), the abstraction at which digital circuits are described in a hardware description language, to silicon. Even with ARM delivering the soft IP blocks to our partners, it can still take multiple months to architect the SoC and integrate all the IP correctly while meeting power and performance targets. In addition, the design then needs to be fully verified and validated prior to taping out. The ARM subsystem helps “short circuit” this process by delivering verified and validated top-level RTL that has already integrated the various ARM IP blocks in a mainstream server configuration. (Find out more about ARM’s system validation process in the whitepaper System Validation at ARM: Enabling our Partners to Build Better Systems). This can save our partners up to a year of effort. The silicon partner can take our subsystem and then add the necessary Input/Output logic (e.g. PCIe, SATA, and USB) along with any of their own differentiating IP to complete the SoC design. Providing partners with the subsystem significantly reduces the effort of integrating, verifying and validating the IP for this configuration, which reduces overall development time and allows silicon partners to focus resources on differentiation.


Accelerating the Path to Server SoCs


So, how does this server subsystem help our partners build a competitive server SoC with faster time to market? ARM has architected the server subsystem to provide enough CPU compute to allow partners to efficiently manage the majority of server workloads. The subsystem consists of up to 48 cores, arranged as 12 Cortex®-A72 processor clusters of four CPU cores each, attached to the CoreLink CCN-512 Cache Coherent Network along with four server-class memory controllers (CoreLink DMC-520). Other ARM System IP has been integrated to perform specialized tasks within the subsystem for the kind of use cases expected:

  • CoreLink NIC-450 Network Interconnect provides the low-power, low-latency rest-of-SoC interconnect for peripheral inputs such as PCIe.
  • CoreLink GIC-500 Generic Interrupt Controller performs the critical tasks of interrupt management, prioritization and routing, supporting virtualization and boosting processor efficiency.

The real value of the subsystem lies in the fact that all of the IP has been pre-integrated and pre-validated with ARM engineering “know-how” of our IP to ensure predictable performance with much less engineering resource or time required. By taking a holistic view of system performance, the integration teams were able to make the whole subsystem greater than the sum of its parts.
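As a quick sanity check on the configuration just described, the short Python sketch below tallies the CPU cores. The component names and counts come from the text; the dictionary layout and the function name are illustrative assumptions only and do not reflect any ARM deliverable.

```python
# Illustrative tally of the server subsystem configuration described above.
# Counts are taken from the text; the data layout itself is an assumption.
subsystem = {
    "cpu_clusters": 12,            # Cortex-A72 processor clusters
    "cores_per_cluster": 4,        # CPU cores in each cluster
    "interconnect": "CoreLink CCN-512",
    "memory_controllers": 4,       # CoreLink DMC-520 instances
}

def total_cores(cfg):
    """Return the maximum number of CPU cores in the configuration."""
    return cfg["cpu_clusters"] * cfg["cores_per_cluster"]

print(total_cores(subsystem))  # 48
```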


The picture below shows a high level view of the subsystem.



So what about Power, Performance, and Area?


In addition to the above, ARM provides a system analysis report along with the pre-configured and optimized RTL. The system analysis report gives the silicon partner data we collected on the performance, power, and area of the subsystem. It includes industry standard benchmark emulation results such as SPECCPU 2006, STREAM, and LMBench. Based on early analysis, expect this subsystem to scale to performance levels needed to win mainstream server deployments in large datacenters.

These benchmarks are key data points that an end customer, buying a hardware platform based on an SoC that leverages the subsystem, uses to decide which platform to buy and deploy in their datacenter. It is critical that our silicon partners have a good understanding of performance expectations well before they have actual silicon they can test. The investment to develop server SoCs is high, and reducing the likelihood of additional spins is key to time-to-market. In addition to the performance results, ARM also analyzes the power draw of the subsystem and includes this in the report. ARM’s physical design team also does preliminary floor planning and some timing constraint analysis for the target process technology. In effect, this helps our partners understand die size and cost implications, which ultimately helps ensure their design will meet customers’ expectations.



Reference Software and Documentation


In addition to giving our partners a head start on hardware design and understanding PPA (Performance, Power, and Area) targets, the subsystem also comes with a reference software package. The subsystem has been built to adhere to industry server standards (e.g. UEFI, ACPI). The reference software includes ARM Trusted Firmware and UEFI source code ported to the subsystem, ACPI tables populated for the subsystem, and any patches needed to run the latest Linux kernel, along with release documentation and a guide on how to use the software. In addition, a Fixed Virtual Platform (FVP) of the subsystem is available. The FVP is a fast, accurate platform built on Fast Models that helps our partners’ software development activities. The software developed for the subsystem is ready to run on the FVP. Delivering this reference software stack along with the optimized RTL allows silicon partners to more rapidly develop the necessary software to allow booting an OS as soon as silicon arrives. On the hardware side, the subsystem also includes a technical reference manual that describes the various pieces of the subsystem in detail, an implementation guide, and an integration manual. All of this documentation is delivered along with the RTL to help our partners quickly understand the subsystem. This is critical in enabling SoC designers to get up to speed fast and devote as much time and resource as possible to differentiating their design through proprietary IP, customized software, or a mixture of both.


ARM’s Server Subsystem Provides a Fast Start to SoC Development


As I mentioned previously, the ever increasing data processing requirements that are occurring due to the continued electronics revolution have big implications for datacenters. It means that mainstream server SoCs are becoming increasingly complex every year. In addition, companies are replacing their server platforms at an unprecedented rate. This requires our silicon partners to deliver more capable SoCs faster. ARM has been enabling our partners with ARM processors and System IP that can be leveraged to deliver server SoCs. The ARM subsystem now takes this enabling activity a step further by giving our partners a fast start to their SoC development. By providing a pre-integrated, pre-verified foundation, it reduces the entry barriers for the ARM ecosystem to enter the changing server market and develop optimized SoCs for their target applications. For more information please contact me directly here via the comments below or private message and I’ll make sure to get back to you.

Functional safety for Silicon IP used to be a niche activity, limited to an elite circle of chip and system developers in automotive, industrial, aerospace and similar markets. However, over the last few years that’s changed significantly. There’s now a more tangible vision of self-driving cars, with increasingly adventurous Advanced Driver Assistance Systems (ADAS) capturing people’s interest along with media-rich in-vehicle infotainment. Moreover, the emergence of drones in all shapes and sizes and the growing ubiquity of the industrial Internet of Things are also proliferating the requirement for functional safety, all of which are relevant to ARM®.


Much like any technology market surrounded in ‘buzz’ these burgeoning applications require semiconductors to make them happen and the fast-pace of product innovation has attracted huge interest from ARM’s partners. In the IP community ARM leads the way with a broad portfolio of IP from ARM Cortex®-M0+ to the mighty Cortex-A72 and beyond. With a heritage in secure compute platforms and functional safety ARM is well placed to enable the success of its silicon partners.



What’s functional safety all about?


In a nutshell, functional safety is what the name says: it’s about ensuring that products operate safely and continue to do so even when they go wrong. ISO 26262, the standard for automotive electronics, defines functional safety as:


ISO 26262 “the absence of unreasonable risk due to hazards caused by malfunctioning behaviour of electrical / electronics systems”.



Standards for other markets such as IEC 61508 for electrical and electronic systems and DO-254 for airborne electronic hardware have their own definitions, although more importantly they also set their own expectations for engineering developments. Hence it’s important to identify the target markets before starting development and ensure suitable processes are followed – attempts to ‘retrofit’ development processes can be costly and ineffective so best avoided. Figure 1 illustrates a variety of standards applicable to Silicon IP.



Figure 1. Standards for functional safety of silicon IP



In practice, functionally safe means a system that is demonstrably safe to a skilled third-party assessor, behaving predictably in the event of a fault. It must fail safe, which could mean continuing with full functionality or degrading gracefully, for example with reduced functionality or a clean shutdown followed by a reset and restart. It's important to realize that not all faults will lead to hazardous events immediately. For example, a fault in a car's power steering might lead to incorrect sudden steering action. However, since the electronic and mechanical designs will have natural timing delays, faults can often be tolerated for a specific amount of time. In ISO 26262 this time is known as the fault tolerant time interval, and it depends on the potential hazardous event and the system design.
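The timing argument can be sketched as a simple budget check: a fault is tolerable if the time to detect it plus the time to reach a safe state fits inside the fault tolerant time interval. The function name and all numbers below are hypothetical illustrations, not values from ISO 26262 or any real steering system.

```python
def within_ftti(detection_ms, reaction_ms, ftti_ms):
    """True if fault detection plus reaction completes inside the FTTI."""
    if min(detection_ms, reaction_ms, ftti_ms) < 0:
        raise ValueError("times must be non-negative")
    return detection_ms + reaction_ms <= ftti_ms

# Hypothetical example: 10 ms to detect the fault, 25 ms to reach a safe
# state, checked against an assumed 50 ms fault tolerant time interval.
print(within_ftti(10, 25, 50))   # True: 35 ms fits in the 50 ms budget
print(within_ftti(10, 45, 50))   # False: 55 ms exceeds the budget
```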



What’s at fault?


Failures can be systematic, such as due to human error in specifications and design, or due to the tools used. One way to reduce these errors is to have rigorous quality processes that include a range of plans, reviews and measured assessments. Being able to manage and track requirements is also important as is good planning and qualification of the tools to be used. ARM provides ARM Compiler 5 certified by TÜV SÜD to enable safety-related development without further compiler qualification.


Another class of failure is random hardware faults; they could be permanent faults such as a short or a broken via, as illustrated by Figure 2. Alternatively, they could be soft errors caused by exposure to natural radiation. Such faults can be detected by counter measures designed into the hardware and software; system-level approaches are also important. For example, Logic Built-In-Self-Test can be applied at startup or shutdown in order to distinguish between soft and permanent faults. Error logging and reporting is also an essential part of any functionally safe system, although it’s important to remember that faults can occur in the safety infrastructure too.




Figure 2. Classes of fault



Selection of counter measures is the part of the process I enjoy the most; it relates strongly to my background as a platform and system architect, and often starts with a concept-level Failure Modes and Effects Analysis (FMEA). Available counter measures include diverse checkers, selective hardware and software redundancy, full lock-step replication (available for Cortex-R5), and the ‘old chestnut’ of error correcting codes, which we use to protect the memories of many ARM products.
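To illustrate how an error correcting code detects and repairs a soft error, here is a minimal Python sketch of the textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is a teaching example chosen for brevity and is not a description of the codes used in ARM products; memory protection typically uses wider codes over whole data words.

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1                        # simulate a soft error in one bit
print(decode(stored) == word)         # True
```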



Get the measure of functional safety


Faults that build up over time without effect are called latent faults, and ISO 26262 proposes that a system designated ASIL D, its highest Automotive Safety Integrity Level, should be able to detect at least 90% of all latent faults. As identified by Table 1, it also proposes a target of 99% diagnostic coverage of all single point failures and a probabilistic metric for random hardware failures of ≤10⁻⁸ per hour.



Table 1. ISO 26262 proposed metrics



These metrics are often seen as a normative requirement, although in practice they are a proposal, and developers can justify their own target metrics because the objective is to enable safe products, not add bullet points to a product datasheet.
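The ASIL D figures quoted above can be written down as a simple check against analysis results. The sketch below uses only the three targets mentioned in the text; the function name, dictionary keys, and the sample results are hypothetical, and because the targets are proposals, a real project might justify different thresholds.

```python
# ASIL D targets quoted in the text (proposals, not hard requirements).
ASIL_D = {
    "single_point_coverage": 0.99,    # minimum diagnostic coverage
    "latent_fault_coverage": 0.90,    # minimum latent fault detection
    "random_hw_failure_rate": 1e-8,   # maximum, per hour
}

def check_asil_d(single_point_coverage, latent_fault_coverage, random_hw_failure_rate):
    """Return a pass/fail flag per metric against the ASIL D figures."""
    return {
        "single_point_coverage": single_point_coverage >= ASIL_D["single_point_coverage"],
        "latent_fault_coverage": latent_fault_coverage >= ASIL_D["latent_fault_coverage"],
        "random_hw_failure_rate": random_hw_failure_rate <= ASIL_D["random_hw_failure_rate"],
    }

# Hypothetical analysis results for an imaginary design.
result = check_asil_d(0.995, 0.88, 5e-9)
print(result)
print("meets all targets:", all(result.values()))
```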


A question I often ask myself in respect of semi-autonomous driving is whether it’s safer to meet the standard’s proposed metrics for ASIL D with 10,000 DMIPS of processing or have 100,000 DMIPS with reduced diagnostic coverage and enable ‘smarter’ algorithms with better judgement? The answer is application specific, although in many cases a more capable performant system could save more lives than a more resilient system with basic functionality, so long as its failure modes are not wildly non-deterministic.


Irrespective of the diagnostic coverage achieved, it’s essential to follow suitable processes when targeting functionally safe applications – and this is where the standards really help. Even if you’re not targeting safety, more rigorous processes can improve overall quality.



Get it delivered


When developing for functional safety, an essential part of the product is the supporting documentation which needs to include a safety manual to outline the product’s safety case, covering aspects such as the assumptions of use, explanation of its fault detection and control capabilities and the development process followed.


Safety cases are hierarchical in use: the case for an IP is needed by chip developers to form part of their safety case, which then enables their customer, and so forth. Most licensable silicon IP will be developed as a Safety Element out of Context (SEooC), where its designers will have little or no idea how it will subsequently be utilised. Hence the safety manual must also capture insight from the IP developers about their expectations in order to avoid inappropriate use.


At ARM we support users of targeted IP with safety documentation packages, which always include a safety manual.


So in summary when planning for functional safety think PDS:

  • Process
  • Development
  • Safety documentation package

The trend for the electronics industry remains the same as ever; we want chips that are smaller, faster, more efficient. When you look at the trajectory of SoC designs you can see that the cost of integrating IP rises sharply with each change of process node. For example, at 10nm the IP integration cost is projected to be almost 4 times that of a 28nm process. It is a growing drain on project resources in terms of the money and effort needed to properly integrate a system.


In an effort to solve this integration issue, we need to look within the design flow to identify areas where improvements can be made. One of these is IP configuration. IP configurability is evolving due to the growing reoccurrence of highly complex IP that designers are integrating into their SoCs. Add to this the amount of competition in the IP market, where silicon partners are looking for IP that is tailored to their design in order to optimize system performance.


[Figure: IP integration cost per node]


The above graph, provided by SemiCo Research, shows the costs for 1st time effort at each new node with design parameters maxed out. The trend is clear.



As systems become more complex, the configurability requirements for certain types of IP become exponentially more complex, e.g. a system interconnect (CoreLink NIC-450) or a debug and trace subsystem (CoreSight SoC-400). These IPs can be considered to have an infinite configuration space, which brings a new class of problem.


  • Where do I start?
  • How do I configure all the bits of the IP that I need?
  • How do I know it will work?


What we need, then, is more Intelligent IP configuration that is based on the system context and configured with awareness of PPA constraints making the downstream IP integration process simplified and highly automated.


Another thing to consider is the highly iterative nature of the IP Integration cycle. Between specification, configuration and integration of components it takes many versions before an optimized system can be built. When you add in the increase in data, dependencies and complexities of current IP, it only adds to the problem. Examples of complex IP configurations that need iteration include debug & trace, interrupts, interconnect, MMU, memory, I/O etc.



A solution we have been developing at ARM defines an intelligent IP configuration flow to make system integration more scalable and easier to manage. It involves the following:


  • Consistent method of IP configuration
  • Configure IP consistent with a system context
  • Automatic creation of IP micro architecture (µarchitecture synthesis)
  • Refinement step with quality assurance (µArchitecture DRCs)
  • Automatic integration of IP into the system (auto-integration)


[Figure: IP Catalog]


To enable this concept of intelligent IP configuration, you need tooling to automate the configuration and integration of IP, ensure system viability and reduce the time spent on iterations. ARM® Socrates IP Tooling can do this, using what we are calling ‘IP Creators’. ARM IP creators have a unique flow (lifecycle) that includes features such as:


  • Metadata harvesting for ensuring IP configuration is consistent with the system
  • µArchitecture synthesis
  • µArchitecture DRCs
  • µArchitecture refinement
  • Auto-Integration



These features accelerate the design cycle (a case study shows an 8x reduction), reduce risk and simplify system design. Let’s take a closer look at how this is done.



Metadata Harvesting for initial IP configuration

First, you need to automatically create the system specification. This is done by harvesting the system data and identifying the interfaces on the particular IP, for example an interconnect. Our current flow reads IP-XACT metadata from a system and infers certain interface configuration for the IP. For an interconnect, we can extract interface requirements such as AMBA® protocol type, data size and address size. For debug and trace, we can infer information like the number of ATB interfaces and their size. This process accelerates the specification of the IP interfaces, and the harvested information is used to drive the final IP configuration.


[Figure: System Specification - IP Tooling]
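The harvesting idea can be sketched in a few lines of Python: walk the system description, collect each bus interface, and derive what the interconnect must provide. The XML below is a simplified, made-up stand-in and does not follow the real IP-XACT schema; the element names, attributes, and the `harvest_interfaces` function are all assumptions for illustration and say nothing about how Socrates works internally.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical system metadata; NOT the real IP-XACT schema.
SYSTEM_XML = """
<system>
  <component name="cpu_cluster">
    <busInterface name="m_axi" protocol="AXI4" mode="master" dataWidth="128" addrWidth="40"/>
  </component>
  <component name="dma">
    <busInterface name="m_axi" protocol="AXI4" mode="master" dataWidth="64" addrWidth="32"/>
  </component>
  <component name="sram">
    <busInterface name="s_axi" protocol="AXI4" mode="slave" dataWidth="64" addrWidth="32"/>
  </component>
</system>
"""

def harvest_interfaces(xml_text):
    """Collect the interfaces an interconnect would need to expose."""
    root = ET.fromstring(xml_text.strip())
    interfaces = []
    for comp in root.findall("component"):
        for bus in comp.findall("busInterface"):
            interfaces.append({
                "component": comp.get("name"),
                "port": bus.get("name"),
                "protocol": bus.get("protocol"),
                # A master in the system attaches to a slave port on the
                # interconnect, and vice versa.
                "interconnect_port": "slave" if bus.get("mode") == "master" else "master",
                "data_width": int(bus.get("dataWidth")),
                "addr_width": int(bus.get("addrWidth")),
            })
    return interfaces

for itf in harvest_interfaces(SYSTEM_XML):
    print(itf)
```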



The next step is to define and create the System IP µarchitecture. The system architect can input high-level information, e.g. data paths, memory maps, and other data that are processed by algorithms to configure the IP. The µArchitecture synthesis automatically creates the IP in a way that is correct-by-construction, through design rule checks (DRCs) that validate the configuration. You can see in the image below the master/slave connections that are generated by the algorithms.



[Figure: Microarchitecture - IP Tooling]
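A design rule check on a configuration can be as simple as validating the memory map before any hardware is generated. The sketch below is one hypothetical example of such a check; the specific rules (power-of-two region sizes, size-aligned base addresses, no overlapping regions) and the function name are assumptions chosen for illustration, not the rules the ARM tooling actually applies.

```python
def check_memory_map(regions):
    """Return a list of rule violations for (name, base, size) regions."""
    errors = []
    for name, base, size in regions:
        if size <= 0 or size & (size - 1):
            errors.append(f"{name}: size {size:#x} is not a power of two")
        elif base % size:
            errors.append(f"{name}: base {base:#x} is not aligned to its size")
    # After sorting by base address, any overlap shows up between neighbours.
    ordered = sorted(regions, key=lambda r: r[1])
    for (n1, b1, s1), (n2, b2, _) in zip(ordered, ordered[1:]):
        if b1 + s1 > b2:
            errors.append(f"{n1} overlaps {n2}")
    return errors

good = [("rom", 0x0000_0000, 0x1000),
        ("sram", 0x2000_0000, 0x10000),
        ("uart", 0x4000_0000, 0x1000)]
bad = [("sram", 0x2000_0000, 0x10000),
       ("uart", 0x2000_8000, 0x1000)]

print(check_memory_map(good))  # []
print(check_memory_map(bad))   # ['sram overlaps uart']
```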



The major effect of the µArchitecture synthesis is that configuration iterations are greatly reduced. It results in a system assembly process that is faster and easier. Interfaces are automatically AMBA-compliant through the IP-XACT-driven approach to integration. The image below shows a fully connected system resulting from the µArchitecture synthesis.



[Figure: System assembly - IP Tooling]



Once system integration is complete, a number of deliverables are generated that can be easily used by different stakeholders within the design team. The RTL of the integrated system design, testbench, test cases, design spec and reports are all automatically published and ready for the next step of SoC design.


[Figure: RTL Generation - IP Tooling]



Putting this methodology of intelligent IP configuration and automated IP integration to the test, we conducted some internal studies. Typically the creation of a debug and trace subsystem is a time-consuming and iterative process. When using this new approach, the time spent was dramatically reduced from three months to just one week. Even more impressive was the elimination of 90 bugs when comparing the two approaches, as the intelligent methodology did not return a single bug during the design cycle.


[Figure: Debug and trace results - IP Tooling]


Looking to Future Productivity Gains with Socrates IP Tooling


The SoCs that are being designed today have increased dramatically in complexity over the last number of years, and will continue to do so. A combination of smaller process nodes, more complex IP and designs targeting highly specific performance points means that system integration plays an important role in the creation of an SoC. Using an automated tooling methodology based on designer input rules can make system assembly easier and faster. Looking to the future, there is potential for innovation around adding physical awareness as new metadata to enable better PPA analysis and trade-off.

At DAC today, June 6th, we announced the creation of a new partnership program for design houses. Called the ARM Approved Design Partner program, this initiative creates a group of design houses which ARM is happy to recommend to anyone needing design services around ARM IP.


We have linked it very closely with the DesignStart program. Launched last year, DesignStart allows registered users to evaluate the Cortex-M0 processor IP completely free of charge. A follow-on fast-track licence route then allows easy and cost-effective access to the full IP to go into production. DesignStart has generated significant interest since launch and one thing we have noticed is that many registrants do not have in-house SoC design capability. To fill this gap, we have recruited ARM Approved Design Partners, all fully audited, approved and recommended by ARM for their capability in successfully designing with ARM IP.


To find out more about the program, have a look at


The founder members, all present at DAC to join in the launch, are Sondrel (based in Reading, UK), eInfoChips (based in Ahmedabad, India), Open-Silicon (based in Milpitas, CA) and SoC Solutions (based in Atlanta, GA). We are delighted to welcome them on board and to be able to recommend them.



Early adopters of ARM's 2017 premium mobile experience IP suite, including the ARM® Cortex®-A73 CPU, Mali™-G71 GPU, and the CoreLink™ CCI-550 cache coherent interconnect as well as the related Artisan POP™ technology, have successfully taped out using Synopsys tools and verification IP.


In support of ARM's launch of this new premium mobile suite, Synopsys issued a concurrent news release highlighting our mutual customer tape-out success. Among the Synopsys products used in these tapeouts are:


In addition, Synopsys announced the immediate availability of a Reference Implementation for the Cortex-A73 processor (using ARM Artisan POP technology) that you can use to jump start optimized implementation of your Cortex-A73 core.


Come and see ARM, TSMC and Synopsys at DAC to learn more about the Reference Implementation. The breakfast on DAC Monday is a great way to kick off DAC!


Congratulations to ARM on the well received product rollout and also to our mutual customers who are already moving toward products with this new premium mobile IP suite.

The end of the year is approaching, but I’d like to have one last delta before taking some time off. PowerPC vs. ARM seems like an appropriate stand-off. In this rendition, however, I will focus on the e200z0 core and the Cortex-M4 core, which are the MCU implementations of their respective ISAs. For the sake of simplicity, each time the word PowerPC is uttered, I am referring to the e200z0 core; similarly, ARM will stand as a simplification of Cortex-M4.


Getting the obvious out of the way

PowerPC is sold both as silicon (i.e. MCUs) and as synthesizable IP blocks; ARM only sells IP, but there are a number of companies that sell microcontrollers built around said IP. At the end of the day, the two cores cannot be compared in terms of technology node because their implementation depends on a third party. I will say, however, that PowerPCs are typically used in automotive and industrial applications, which tend to use more robust technology nodes than the consumer applications where ARM is typically found. I suspect, but cannot confirm, that one of the reasons for this is that the ARM core is relatively big (physically) and really benefits from a smaller node. Therefore, it is not strange to find PowerPC devices qualified for the -40 to 125 °C range, in LQFP packages; ARM devices are normally only qualified for the 0 to 85 °C range and come in smaller BGA packages.



Perhaps the easiest way to compare and contrast the two standards is a side-by-side comparison of the blocks each spec defines:


Image 1: Side by side comparison of blocks defined in each spec; Cortex-M4 to the left, e200z0 to the right.

Similarly colored boxes show the equivalent blocks for each architecture. It should be immediately obvious that the Cortex-M4 to the left has a significant number of blocks without equivalent on the e200z0 architecture.


And this is what I’d like to talk about. The Power ISA specification has done an excellent job of defining a powerful core, one that is flexible and capable of being hooked up to an almost infinite number of peripherals. And then it stops. Standard peripherals, such as an interrupt handler unit or a debug trace unit, are not defined in the standard, which means each vendor is free to implement them as they wish. ARM, on the other hand, tightly integrates these “standard” peripherals into the core. ARM wins in this situation because tighter integration of debug peripherals means compatibility with standard tools; tighter integration of the interrupt handler unit means quicker interrupts (but let’s not get ahead of ourselves). This approach also helps vendors integrating the IP, as they do not have to worry about handling these elements (which are more than likely far away from their target application, or from where they want to add value).


The direct effect of one approach vs. the other is quickly visible when it comes to interrupts: ARM’s Cortex-M4 guarantees a latency of 3 cycles from the time the interrupt is flagged to the time the core is actually doing something with it. All context registers are stored automatically. The e200z0, on the other hand, requires an external controller to flag the interrupt to the core as an external interrupt. Next, some code needs to be written to ensure that the context registers are correctly stored. Finally, it is also code that jumps to the pending interrupt and services it. Latency is therefore not guaranteed, and will vary from implementation to implementation.
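The software dispatch step on the e200z0 side can be pictured with a minimal C sketch. Everything here is hypothetical: `isr_table`, `register_isr`, `dispatch_external_interrupt` and the example `timer_isr` handler are illustrative names, and the vector number is assumed to come from whichever external interrupt controller the vendor integrates. Saving and restoring the context registers, which would normally be done in assembly before and after this point, is not shown.

```c
#include <stddef.h>

#define NUM_VECTORS 8u   /* assumed table size, for illustration only */

typedef void (*isr_t)(void);

/* Software-maintained vector table (hypothetical). */
static isr_t isr_table[NUM_VECTORS];

/* Example handler: counts how many times it has been called. */
unsigned tick_count;
void timer_isr(void) { tick_count++; }

/* Install a handler; returns 0 on success, -1 if the vector is out of range. */
int register_isr(unsigned vector, isr_t handler)
{
    if (vector >= NUM_VECTORS) return -1;
    isr_table[vector] = handler;
    return 0;
}

/* Called from the single external-interrupt entry point, after the
 * context registers have been saved by software.
 * Returns 0 if a handler ran, -1 if none is installed. */
int dispatch_external_interrupt(unsigned vector)
{
    if (vector >= NUM_VECTORS || isr_table[vector] == NULL) return -1;
    isr_table[vector]();
    return 0;
}
```

Every step in this path is ordinary code, which is why the resulting latency depends on the implementation rather than being fixed by the core.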


But that is not to say that the e200z0 is inferior. Let’s take a look at Table 1:

| | Cortex-M4 | e200z0 |
|---|---|---|
| Memory Management/Protection Unit | Y | N |
| Instruction cache | N | N |
| Signal processing extension | Y | N |
| Branch unit processor | Not explicit | Y |
| Integer divide cycles | 2 – 12 cycles | 5 – 34 cycles |
| Interrupt controller | Internal | External |
| Jump-to-ISR latency | 3 cycles | Code dependent; several cycles |
| Relocatable ISR table | Yes | Yes |
| Debug interfaces | JTAG, J-Link | JTAG |
| Number of core registers | 13 + SP, LR, PC (16 total) | 32 + SP, CR, LR, CTR |
| Instruction set supported | Thumb 16-bit instructions | VLE 16-bit instructions |

Table 1.

In fact, when you look at the generalities, the e200z0 and the Cortex-M4 are very similar: Harvard-architecture, 32-bit RISC machines with no out-of-order execution and 1-cycle execution times for most instructions. Yes, the Cortex-M4 is about twice as fast as the e200z0 when it comes to division, but the fact that the latter has double the number of core registers means that it can economize on load/store cycles.


Which brings us to the instruction set architecture.


In a similar effort, both ARM and the Power Architecture community have created extensions to their original ISAs with the goal of reformatting instructions into 16-bit words to help with code density. Both communities have later released devices that are only compatible with these extensions, removing all support for the original ISA. This is the case for both the e200z0 and the Cortex-M4, with the Variable Length Encoding (VLE) and Thumb ISAs, respectively.


Comparing and contrasting both ISAs probably deserves a blog entry by itself, but the gist of it is that both instruction sets have similar encodings. Perhaps worthy of a special mention is Thumb’s immediate rotate addressing mode, which allows a core register to be shifted while another operation is performed, within the same execution cycle as the original operation.
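As a small illustration in C, the expression below combines a shift with an add. A compiler targeting Thumb-2 is free to emit this as a single `ADD` with a shifted register operand (for example `ADD r0, r0, r1, LSL #2`), although the actual output depends on the compiler and optimization level. The function name `word_offset` is hypothetical.

```c
#include <stdint.h>

/* base + index * 4, written as an explicit shift so the
 * shift-plus-operation pattern is visible in the source. */
uint32_t word_offset(uint32_t base, uint32_t index)
{
    return base + (index << 2);
}
```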


Truth be told, both ISAs are so complex that it will be up to the compiler to fully exploit their advantages. Take, for example, the Cortex-M4 DSP extension which adds a DSP-like unit capable of 1-cycle Multiply-and-accumulate operations, among others. When writing code, a simple line such as

y = (m * x + b);

will compile using a standard sequence of loads, multiplies, stores, and adds. In order to use the DSP extension, an abstraction layer needs to be downloaded, and function-like calls need to be made (these are replaced by macros that take advantage of said extension).


Which means that code is no longer portable to, say, a PowerPC architecture.
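One way to soften the portability problem is to keep a plain-C reference version of each DSP operation next to the intrinsic call. The sketch below is such a reference for a dual 16-bit multiply-accumulate. To my understanding this is the operation the CMSIS `__SMLAD` intrinsic performs on the Cortex-M4, but treat that mapping as an assumption and check it against the CMSIS documentation. The names `dual_mac16` and `pack16` are hypothetical.

```c
#include <stdint.h>

/* Pack two signed 16-bit samples into one 32-bit word. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Portable reference for a dual 16-bit multiply-accumulate:
 *   acc + lo(x) * lo(y) + hi(x) * hi(y)
 * with each 16-bit half treated as a signed value.
 * The narrowing casts assume two's complement conversion, which
 * holds on mainstream compilers. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```

On a target with the DSP extension, the body of `dual_mac16` could be swapped for the intrinsic behind a compile-time switch, while the same source still builds for a PowerPC target.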


Toolchain support

This category is tough. Both organizations have done an excellent job of standardizing their architectures, and a plethora of compilers and standard tools is available for both. Since both are also JTAG-compliant, this means that almost anything can be used to develop for them:


  • gcc

  • CodeWarrior

  • Green Hills

  • IAR Workbench (ARM only)


I’d say there’s a tie here. Although there may be specialized tools in each case, debugging activities are not necessarily harder on one platform than on the other.



If both architectures were to hit the market for the first time today, with the same IP-based distribution model, it’s really hard to predict who would win. The Cortex-M4 is tightly integrated with an interrupt controller and debugging support, while the e200z0 allows a greater amount of customization to vendors. The Cortex-M4 allows bit-shifting as part of a register load or store, but the e200z0 doesn’t need to perform loads and stores as often because it has more core registers. The Cortex-M4 is slightly faster with fixed-point math division. Toolchain support is excellent for both architectures. Without bringing these characteristics down to specific products, it’s hard to pick a winner!


Power ISA v. 2.06B

Cortex-M4 Reference Manual

By now you would have read the news about the latest ARM® Cortex®-A73 processor and Mali™-G71 GPU. These new processors allow for more performance in an ever thinner mobile device, and accelerate new use cases such as Virtual Reality (VR), Augmented Reality (AR) and the playback and capture of rich 4K content. However, these applications place increased demands on the system, and require more data to be moved between processors, cameras, displays and memory. This is the job of the memory system.





ARM Develops IP Together at the System Level


To get the best user experience the memory system must balance the demands of peak performance, low latency and high efficiency. The ARM CoreLink™ interconnect and memory controller IP provide the solution. ARM develops processor, multimedia and system IP together, including design, verification and performance optimization, to get the best overall system performance and to help our silicon partners get to market faster with a reduced integration cost.



There are three key properties that the memory system must deliver:


  • Lower memory latency - to ensure a responsive and fluid experience. This helps maintain a high frame rate providing a more natural VR & AR experience, as well as improving most other use cases such as web browsing and social media interactions.
  • Higher peak bandwidth - to support the increase in pixels and frame rate expected by 4K and HDR content. We’re also seeing mobile devices with higher megapixel counts or multiple cameras; in both cases we need to move more data to and from memory.
  • Improved memory efficiency - to move more data in the same or lower power budget. This can be enabled by innovation in the interconnect, for example hardware cache coherency, as well as improvements in the memory controller to get the best utilization of dynamic memory.
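To get a feel for the peak bandwidth requirement, here is a back-of-the-envelope calculation in C. The parameters are assumptions chosen for illustration, not ARM figures: a 3840x2160 frame, 4 bytes per pixel, 60 frames per second, and a single pass over each frame. The function name is hypothetical.

```c
#include <stdint.h>

/* Bytes per second needed to move one uncompressed video stream
 * through memory once (a single read or a single write per frame). */
uint64_t stream_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bytes_per_pixel, uint32_t fps)
{
    return (uint64_t)width * height * bytes_per_pixel * fps;
}
```

With the assumed parameters, `stream_bytes_per_second(3840, 2160, 4, 60)` comes to 1,990,656,000 bytes per second, roughly 2 GB/s for one pass. A real pipeline reads and writes each frame several times, so the demand on the memory system is a multiple of this.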


This blog describes how the latest CoreLink System IP delivers on the above requirements.



Optimized Path to Memory with CoreLink CCI-550 and DMC-500


The ARM CoreLink CCI-550 Cache Coherent Interconnect and DMC-500 Dynamic Memory Controller have been optimized to get the best from Cortex-A73 and Mali-G71. ARM big.LITTLE™ processing has relied on CCI products to provide full cache coherency between Cortex processors for a number of years now. For the first time, Mali-G71 offers a fully coherent memory interface with AMBA® 4 ACE. This makes sharing data between CPU and GPU easier to develop for, with lower latency and lower power.





Accelerating Heterogeneous GPU Compute


GPU compute exists today, but with software or IO coherency it can be difficult to use. Here’s a quote from a middleware developer regarding the cost:


“30% of our development effort was spent on the design, implementation and debugging of complex software coherency.”

Mukund Srinivasan, VP Media Client Business, Ittiam Systems



A fully coherent CPU and GPU memory system offers a simplified programming model and improved performance efficiency. This is enabled by two fundamental technologies:


  • Shared Virtual Memory (SVM) - where all processors use the same virtual address to access a shared data buffer. Sharing data between processors is now as simple as passing a pointer.
  • Hardware Coherency - which ensures all coherent processors see the same shared data and removes the need to clean and invalidate caches.


The following chart summarizes the benefit of these technologies and highlights how a fully coherent memory system can provide a ‘fine-grained’ shared virtual memory where the CPU and GPU can work on a shared buffer at the same time.



5-Shared-Virtual-Memory-and-full-coherency.png

For a more detailed explanation see this blog:

Exploring How Cache Coherency Accelerates Heterogeneous Compute



OpenCL 2.0 is one API that enables programming with fine-grained SVM. Initial benchmarking at ARM is showing promising results. We have created a simple test called “Workload Balancing” that is designed to stress the processing and moving of data between CPU and GPU. As you can see from the chart below, moving from software coherency to a fine-grained fully coherent memory system can reduce overheads by as much as 90%.




Increasing Cortex-A73 Processor Performance


A high performance and low latency path to memory for the Cortex processors is fundamental to providing a fluid and responsive experience for all applications. The snoop filter technology integrated into the CoreLink CCI-550 enables a higher peak performance and offers system power savings which are discussed later in the blog.


The following example shows how the snoop filter can improve memory performance of a Cortex-A73 in a system where the LITTLE core, Cortex-A53, is idle and running at a low frequency. Under these conditions, any big core memory access will snoop the LITTLE core and will see a higher latency. This could slow down any applications that access memory and may make the device feel sluggish and less responsive.

With the snoop filter enabled the memory requests are managed by the snoop filter and see a consistently low latency, even if the LITTLE core is in a lower power state and running at a low clock frequency.
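The idea behind a snoop filter can be sketched as a small directory that remembers which caches hold each line, so that only those caches need to be snooped. The C model below is a toy under stated assumptions: the sizes, the `snoop_filter_t` type and all function names are hypothetical, and it is not a description of how the CoreLink CCI-550 is actually implemented.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LINES  16u  /* tracked cache lines (assumed, tiny for clarity) */
#define NUM_CACHES  4u  /* coherent masters, e.g. big, LITTLE, GPU (assumed) */

/* One presence bit per cache for every tracked line. */
typedef struct {
    uint8_t present[NUM_LINES];
} snoop_filter_t;

/* Record that `cache` now holds a copy of `line`. */
void sf_record_fill(snoop_filter_t *sf, unsigned line, unsigned cache)
{
    assert(line < NUM_LINES && cache < NUM_CACHES);
    sf->present[line] |= (uint8_t)(1u << cache);
}

/* Record that `cache` has evicted `line`. */
void sf_record_evict(snoop_filter_t *sf, unsigned line, unsigned cache)
{
    assert(line < NUM_LINES && cache < NUM_CACHES);
    sf->present[line] &= (uint8_t)~(1u << cache);
}

/* Number of caches that must be snooped when `requester` accesses `line`:
 * only the other caches whose presence bit is set. */
unsigned sf_snoops_needed(const snoop_filter_t *sf, unsigned line,
                          unsigned requester)
{
    assert(line < NUM_LINES && requester < NUM_CACHES);
    uint8_t others = sf->present[line] & (uint8_t)~(1u << requester);
    unsigned count = 0;
    for (unsigned c = 0; c < NUM_CACHES; c++)
        count += (others >> c) & 1u;
    return count;
}

/* Without a filter, every other cache is snooped on each access. */
unsigned broadcast_snoops_needed(void)
{
    return NUM_CACHES - 1u;
}
```

In this model, an access to a line that no other cache holds needs zero snoops, so a slow or idle LITTLE cluster is not on the critical path of the request.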





As can be seen from the chart below, when the snoop filter is enabled the memory tests in the ‘Geekbench’ benchmark see a significant improvement, as much as 241%. Other tests, such as integer and floating point, run within the processor caches and do not access memory, so they see less of a benefit. Overall, the improvement in Geekbench score is as much as 28%. In terms of real-world applications this would deliver a more fluid user experience.




Reducing Memory Latency with Advanced Quality-of-Service (QoS)


Reducing latency can give a boost to any application that is working with memory, especially gaming, VR, productivity and web browser tasks. CoreLink CCI-550, NIC-450 and DMC-500 introduce a new interface called ‘QoSAccept’ which is designed to minimize the latency of important memory requests.

Benchmarking within ARM has shown a 38% reduction in latency through the interconnect for worst-case traffic. In this example, a CPU workload is limited to one outstanding transaction.



10-QoS-Accept-Demonstrates-lowest-latency.png

For more details, refer to this whitepaper:

Whitepaper: Optimizing Performance for an ARM Mobile Memory Subsystem



System Power Savings with CoreLink CCI-550


Mobile devices are getting ever thinner while compute requirements keep increasing, which means the whole system must deliver improved power efficiency. The CoreLink CCI-550 and DMC-500 play an important role as they are central to the memory system power. The snoop filter technology allows the number of coherent devices to scale without negatively impacting system power consumption. In fact, the snoop filter saves power in two ways:


  • On-chip power savings - by resolving coherency in one central location instead of broadcasting snoops to every processor.

  • DRAM + PHY power savings - by reducing the number of expensive external memory accesses, whenever data is found in on-chip caches.


As the chart below demonstrates, we see more power savings as the number of coherent ACE interfaces increases, and as the proportion of sharable data increases. In this example “30% sharable” might represent a system where only the big.LITTLE CPU accesses are coherent, and “100% sharable” might represent a future GPU compute use case where all CPU and multimedia traffic is coherent.





While this example shows a system with 4x ACE interfaces, the CoreLink CCI-550 can scale to a total of 6x ACE interfaces to support systems with the highest-performance 32-core Mali-G71.



Scalability to Minimize Area and Cost


Cost, including die area, is always important to the silicon partner and OEM. Reducing the area of silicon gates is also important for reducing power. For these reasons CoreLink CCI-550 has been designed to scale from low cost mobile up to high resolution, high performance tablets and clamshell devices. This scalability also allows the system integrator to tune the design to meet their exact system needs. In terms of peak system bandwidth, CoreLink CCI-550 can offer up to 60% higher peak bandwidth than the CoreLink CCI-500.




Memory System is Key to User Experience


To summarize, the interconnect and memory controller play an important role in delivering the performance expected from the latest Cortex and Mali processors. As noted above, CoreLink CCI-550 and DMC-500 can give a 28% increase in Geekbench score, a 38% reduction in memory latency, and potentially save hundreds of mW of memory system power. This is fundamental to delivering the best possible user experience within a strict power envelope.

ARM’s coherent interconnect products are silicon proven, have been implemented across a range of applications, and have been licensed over 60 times by silicon partners including AMD, HiSilicon, NXP, Samsung and Xilinx to name a few.

I look forward to seeing CoreLink CCI-550 in the latest devices!




Further Information:


Please feel free to comment below if you have any questions.

Consider this: The performance of smartphones, nearly all of which are powered by ARM processors, has grown by 100x since 2009. One hundred times in seven years! With that has emerged entirely new functionality, lightning-fast user responsiveness, and immersive user experiences – all in the same power footprint. It’s really an unrivaled engineering achievement, given the challenging design constraints in the mobile space.

Evolution of the smartphone.jpg

This performance, functionality and user experience dynamic has driven a truly remarkable market, which will see more than 1.5 billion handsets sold in 2016.

With this consumer embrace, smartphone design has become, in many ways, the platform for future innovation. Augmented and virtual reality, ultra-HD visualization, object-based audio processing and computer vision all drive the demand for extra system performance. At the same time, smartphone designs have slimmed considerably in recent years, which limits thermal dissipation and ratchets up the need for thoughtful power management design. Battery capacity improvement cannot continue as smartphones have gotten as large as they practically can. To continue delivering more immersive user experiences and staying on the smartphone innovation path we’ve blazed in the past decade, we need to deliver more sustained performance with higher efficiency.


To this end, ARM has announced its latest high-performance processor, the Cortex-A73. After introducing Cortex-A72 just last year, ARM is accelerating its innovation pace with the Cortex-A73 processor, which will power premium smartphones by early 2017.


The Cortex-A73 is designed and optimized specifically for mobile and consumer devices. The aspects of Cortex-A73 that I’m most excited about are all about efficient performance:


  • Delivers the highest performance in the mobile power envelope, at frequencies up to 2.8GHz
  • With 30% better power efficiency to sustain the best user experience
  • Inside the smallest ARMv8-A footprint ever.


I’ve had the privilege of sitting alongside the design team that has created the Cortex-A73, with the specific intent of meeting this challenge: to be the most efficient and highest performance ARM processor. What follows is an overview of the main features and key enhancements of the Cortex-A73 and their resulting benefits.


Cortex-A73: ARMv8-A high-performance processor


Cortex-A73 diagram


Starting with the basics, the Cortex-A73 supports the full ARMv8-A architecture. Its feature set is ideal for mobile and consumer devices. ARMv8-A includes ARM TrustZone technology, NEON, virtualization and cryptography. Both in 32-bit and 64-bit, the Cortex-A73 gives access to the widest mobile application and middleware eco-system – mobile software is developed and optimized by default on the ARM architecture.

The Cortex-A73 includes a 128-bit AMBA 4 ACE interface enabling integration in ARM big.LITTLE systems, either with the highly efficient Cortex-A53 in premium designs or with our latest ultra-efficient Cortex-A35 processor in mid-range and more cost constrained designs.


Highest performance

The Cortex-A73 processor is designed for your next-generation premium smartphone. When implemented in the advanced 10nm technology, the Cortex-A73 delivers 30% more sustained performance than our most recent previous high-performance CPU, the Cortex-A72. Running at frequencies up to 2.8GHz, the Cortex-A73 also delivers the highest peak performance, which is almost matched by its sustained performance thanks to its extreme energy efficiency. What you’ll notice in the chart below is that the Cortex-A73 can sustain operation at nearly peak frequency, a rarity in mobile phone processors today, where real-world frequencies get throttled back.

Cortex-A73 Maximizes performance



Performance optimized for mobile


The Cortex-A73 micro-architecture includes several interesting performance optimizations that I can share (and quite a few others that I can’t share). It supports a 64kB instruction cache, state-of-the-art branch prediction based on the most advanced algorithms, and high-performance instruction prefetching. The main performance improvements are actually implemented in the data memory system. It uses advanced L1 and L2 data prefetchers, with complex pattern detection. We have also optimized the store buffer for continuous write streams and increased the data cache to 64kB without any timing impacts.


These enhancements translate into a performance uplift of up to 10% in mobile use cases compared to Cortex-A72 at iso-frequency. We expect silicon designs with Cortex-A73 to push further on frequency than in previous generations, a venture that is assisted by the increased efficiency. Moreover, the Cortex-A73 consistently beats Cortex-A72 in all memory workloads by at least 15%, which increases performance across multiple applications, operating system operations and complex compute execution such as NEON processing.



A73 performance optimized for mobile



Power efficiency benefits

To deliver the uplift in performance, the Cortex-A73 requires less power than the Cortex-A72. The Cortex-A73 implements several optimizations such as an aggressive clock-gating scheme, power optimized RAM organization, and optimal resource sharing for AArch32 and AArch64 execution to reduce power.


Compared to Cortex-A72, the power saving for a combination of integer workloads is above 20%, and even higher for workloads such as floating-point or memory access. This power efficiency enables a better user experience and extends battery life. Alternatively, it can be used to give extra headroom to the rest of the SoC, enabling the overall system and the graphics processor to increase performance and to provide better visual effects, higher frame rate or new functionality.


A73 power efficiency benefit


The smallest ARM Premium CPU


In addition to delivering the highest sustained and peak performance,  the Cortex-A73 is even more compelling as it delivers this performance in the smallest area for an ARMv8-A premium processor. This translates into a premium experience at mid-range costs for the increasingly important mid-range smartphone market. The Cortex-A73 is smaller than the ARMv7-A Cortex-A15; when compared to the Cortex-A57 and Cortex-A72, it offers 70% and 46% area reduction respectively, well over the benefit of the technology itself. At iso-process, Cortex-A73 core is up to 25% smaller than Cortex-A72. Optimal for implementation in advanced technology nodes such as 16nm and 10nm, the Cortex-A73 also scales very efficiently in mass-market nodes such as 28nm to provide significant performance uplift for mid-range devices. The reduced footprint offers silicon area for integrating more functionality or increasing the performance of the other IPs in premium systems, or to decrease SoC and device costs in mid-range systems.


A73 smallest ARM premium CPU


Boost your mid-range smartphone


With our big.LITTLE technology and CoreLink CCI, ARM provides great scalability to enable our partners to differentiate and optimize their systems. What does that mean? SoC designers can create designs with 1 or 2 big cores and 2 or 4 LITTLE cores that rival the performance and user experience of premium designs. An exclusive L2 cache can scale down to 1MB and still provide enough cache to support the big cores in real-world high performance workloads. big.LITTLE software can adapt to all of these scalable configurations by placing work optimally based on an energy model.


big.LITTLE technology is widely deployed in the mobile market today. The Cortex-A73, combined with Cortex-A53, will power the next-generation of premium smartphones, typically in an octa-core configuration. In addition, Cortex-A73 provides the opportunity to boost the mid-range user experience to a higher level. For example, in a hexa-core big.LITTLE configuration, a dual-core Cortex-A73 and quad-core Cortex-A53 or Cortex-A35 enables significant performance uplift in the same or less area than an octa-core Cortex-A53 - a common topology that has been very successful in entry and mid-range devices. In comparison to an octa-core Cortex-A53, the Cortex-A73 hexa-core delivers 30% more multi-core performance and twice the single-thread peak performance resulting in a considerable improvement of the user experience, thanks to a reduced response time for applications such as web browsing and interface scrolling.


A73 more performance


In summary, I am proud to have worked alongside the team that has developed the most efficient high-performance processor, all in pursuit of the continuous improvement of user experience that has come to characterize mobile devices based on the ARM architecture. With the Cortex-A73 processor, you get more for less: more performance and more battery life, for less power and less area. Later this year and in 2017, our partners will integrate the Cortex-A73, bringing new functionality and new innovation into premium smartphones, tablets, clamshells, DTVs, and a wide range of consumer devices. I can’t wait to see what they will build.


Related stories:

A walk-through of the Microarchitectural improvements in Cortex-A72

Introducing Cortex-A32: ARM’s smallest, lowest power ARMv8-A processor for next generation 32-bit embedded applications

Memory System is Key to User Experience with Cortex-A73 and Mali-G71


Cache Coherency and Shared Virtual Memory

The Heterogeneous System Architecture (HSA) Foundation is a not-for-profit consortium of SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easier for software developers to take advantage of all the advanced processing hardware on a modern SoC. The CPU and GPU on a typical applications processor occupy a significant proportion of die area, and applying these resources efficiently across multiple applications can improve the end user experience. Done right, efficiency can be gained in power, performance, programmability and portability.


This blog focuses on some of the hardware innovations and changes that are relevant to shared virtual memory and cache coherency, which are components of the HSA hardware specification.


What is Shared Virtual Memory?

Traditional memory systems defined separate memory for CPU and GPU. In the case of PCs, the GPU may have completely separate discrete memory chips on a different board. In these systems, any application that wants to share data between CPU and GPU will need to copy it from CPU memory to graphics memory at a significant cost of latency and power.


Mobile systems have had a unified memory system for many years where all processors can access the same physical memory. However, even though this is physically possible, the software APIs and memory management hardware and software may not allow this. Graphics buffers may still be defined separately from other memory regions and data sharing may still require an expensive copy of data between buffers.


Shared virtual memory (SVM) allows processors to see the same view of memory; specifically, the same virtual address on the CPU and GPU will point to the same physical memory location. With this architecture, an application only needs to pass a pointer between processors that are sharing data.


There are multiple ways to implement SVM; it doesn’t mean you have to share the exact same page table. The only requirement is that if a buffer is to be shared between processors then it must appear in the page tables for both memory management units (MMUs). With SVM in place, sharing data becomes as simple as passing a pointer between processors.
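The requirement that a shared buffer "must appear in the page tables for both MMUs" can be sketched with a toy model in C. This is an illustration only: the single-level table, the page size, and every name (`page_table_t`, `pt_map`, `pt_translate`, `svm_share_page`) are assumptions, and real MMUs use multi-level tables with permissions and attributes.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PAGES  8u          /* virtual pages per table (assumed) */
#define PAGE_SHIFT 12u         /* 4 KiB pages (assumed) */
#define UNMAPPED   UINT32_MAX  /* marker for "no entry" */

/* A toy single-level page table: virtual page number -> physical frame. */
typedef struct {
    uint32_t frame[NUM_PAGES];
} page_table_t;

/* Start with every page unmapped. */
void pt_init(page_table_t *pt)
{
    for (unsigned i = 0; i < NUM_PAGES; i++)
        pt->frame[i] = UNMAPPED;
}

/* Map one virtual page to a physical frame. */
void pt_map(page_table_t *pt, unsigned vpn, uint32_t frame)
{
    assert(vpn < NUM_PAGES);
    pt->frame[vpn] = frame;
}

/* Translate a virtual address; returns UNMAPPED if no entry exists. */
uint32_t pt_translate(const page_table_t *pt, uint32_t va)
{
    uint32_t vpn = va >> PAGE_SHIFT;
    if (vpn >= NUM_PAGES || pt->frame[vpn] == UNMAPPED)
        return UNMAPPED;
    return (pt->frame[vpn] << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1u));
}

/* Share a buffer: the same virtual page points at the same physical
 * frame in both the CPU and the GPU page tables. */
void svm_share_page(page_table_t *cpu, page_table_t *gpu,
                    unsigned vpn, uint32_t frame)
{
    pt_map(cpu, vpn, frame);
    pt_map(gpu, vpn, frame);
}
```

The two tables stay separate objects, which reflects the point above: the processors do not have to share one page table, only agree on the entries for the shared buffer. A pointer into a shared page then translates to the same physical address on both sides.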


So What is Cache Coherency?

Let’s go back to basics and ask: what does coherency mean? Coherency is about ensuring that all processors, or bus masters, in the system see the same data. For example, if I have a processor which is creating a data structure in its local cache and then passing it to a GPU, both the processor and GPU must see the same data. If the GPU reads from external DDR before the processor’s cached data has been written back, the GPU will read old, stale data.


There are three mechanisms to maintain coherency:


  • Disabling caching is the simplest mechanism but may cost significant processor performance. To get the highest performance, processors are pipelined to run at high frequency and access caches which offer a very low latency. Caching of data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as “non-cached” could impact performance and power, and in reality is not used.
  • Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean dirty data from caches and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
  • Hardware managed coherency offers an alternative to simplify software. With this solution any cached data marked ‘shared’ will always be up to date, automatically. All processors and bus masters in that sharing domain see the exact same value.


Challenges with Software Coherency

A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.


Software managed coherency manages cache contents with two key mechanisms:


  • Cache Cleaning:
    • If any data stored in a cache is modified, it is marked as ‘dirty’ and must be written back to DRAM at some point in the future. The process of cleaning will force dirty data to be written to external memory. There are two ways to do this: 1) clean the whole cache which would impact all applications, or 2) clean specific addresses one by one. Both are very expensive in CPU cycles.
    • With modern multi-core systems this cache cleaning must happen on all cores.
  • Cache Invalidation:
    • If a processor has a local copy of data, but an external agent updates main memory, then the cache contents are out of date, or ‘stale’. Before reading this data the processor must remove the stale data from its caches; this is known as ‘invalidation’ (a cache line is marked invalid).
    • An example is a region of memory used as a shared buffer for network traffic, which may be updated by network interface DMA hardware; a processor wishing to access this data must invalidate any old stale copy before reading the new data.
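The two mechanisms above can be sketched with a toy write-back cache. This is an illustrative model only: the `Memory` and `WriteBackCache` classes and the buffer address are hypothetical, and a real cache works on lines rather than single addresses.

```python
class Memory:
    """Toy external DRAM: address -> value, defaulting to 0."""

    def __init__(self):
        self.data = {}

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value


class WriteBackCache:
    """Toy write-back cache: each entry holds [value, dirty flag]."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:  # miss: fetch from DRAM
            self.lines[addr] = [self.memory.read(addr), False]
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = [value, True]  # held locally, marked dirty

    def clean(self, addr):
        """Write a dirty entry back to DRAM and clear its dirty flag."""
        line = self.lines.get(addr)
        if line and line[1]:
            self.memory.write(addr, line[0])
            line[1] = False

    def invalidate(self, addr):
        """Drop the local copy (this would discard data if it were dirty)."""
        self.lines.pop(addr, None)


dram = Memory()
cpu = WriteBackCache(dram)
BUF = 0x1000  # hypothetical shared buffer address

# CPU produces data; a device reading DRAM directly sees the old value.
cpu.write(BUF, 42)
seen_before_clean = dram.read(BUF)
cpu.clean(BUF)
seen_after_clean = dram.read(BUF)

# A DMA engine updates DRAM; the CPU still holds the old copy.
dram.write(BUF, 99)
read_before_invalidate = cpu.read(BUF)
cpu.invalidate(BUF)
read_after_invalidate = cpu.read(BUF)

print(seen_before_clean, seen_after_clean)
print(read_before_invalidate, read_after_invalidate)
```

Forgetting either the `clean` or the `invalidate` call in this model reproduces the stale-data problem described above.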


Complexity of Software Coherency

“We would like to connect more devices with hardware coherency to simplify software and accelerate product schedules”

“50% of debug time is spent on SW coherency issues as these are difficult to find and pinpoint”

Quotes from a system architect at an application processor vendor.


Software coherency is hard to debug; the cache cleaning and invalidation must be done at the right time. If done too often it wastes power and CPU effort. If done too infrequently it will result in stale data which may cause unpredictable application behaviour, if not a crash. Debugging this is extremely difficult as it will present occasional data corruption.


Looking specifically at CPU and GPU sharing, this software cache maintenance will be difficult to optimize and applications on these systems will try and avoid sharing data due to cost and complexity. One middleware vendor using GPU compute with software coherency noted that the developers spent around 30% of their time architecting, implementing and debugging the data sharing including breaking down image data into sub-frames and careful timing of the mapping and unmapping functions.


When sharing is used with software coherency, the size of the task running on the GPU must be large enough to make it worthwhile, taking into account the cost of software coherency.
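A rough way to reason about "large enough" is a break-even calculation. The sketch below assumes a simple linear cost model (a fixed cache maintenance cost per offload plus a per-item cost on each processor); the function name and every number in it are made up for illustration, not measurements.

```python
import math


def min_worthwhile_items(cpu_us_per_item, gpu_us_per_item, maintenance_us):
    """Smallest item count for which offloading beats staying on the CPU.

    Offload wins when n * cpu > n * gpu + maintenance,
    i.e. n > maintenance / (cpu - gpu).
    """
    saving_per_item = cpu_us_per_item - gpu_us_per_item
    if saving_per_item <= 0:
        return None  # under this model the GPU never wins
    return math.floor(maintenance_us / saving_per_item) + 1


# Hypothetical costs: 1.0 us per item on the CPU, 0.25 us per item on the GPU,
# and 300 us of cache clean/invalidate work per offload.
print(min_worthwhile_items(1.0, 0.25, 300.0))
```

With hardware coherency the fixed maintenance term shrinks, so the break-even task size falls with it.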


Hardware Coherency Requires an Advanced Bus Protocol

Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM® released the AMBA® 4 ACE specification which introduces the “AXI Coherency Extensions” on top of the popular AXI protocol. The full ACE interface allows hardware coherency between processor clusters and allows an SMP operating system to extend to more cores.


With the example of two clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory (DDR). In mobile, this has enabled the big.LITTLE™ processing model which improves performance and power efficiency by utilizing the right core to suit the size of the task.


The AMBA 4 ACE-Lite interface is designed for IO (or one-way) coherent system masters like DMA engines, network interfaces and accelerators. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but these would still need to be cleaned and invalidated by software.


While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency such as big.LITTLE processing.


Adding Hardware Coherency to the GPU

While processor clusters have implemented cache coherency protocols for many years, this is a new area for GPUs. As applications look to share more data between CPU and GPU, hardware cache coherency ensures this can be done at a low cost in power and latency, which in turn makes it easier, more power efficient and higher performance than any software managed mechanism. Most importantly it makes it easy for the software developer to share data.


There are two ways a GPU could be connected with hardware coherency:


  • IO coherency (also known as one-way coherency) using ACE-Lite where the GPU can read from CPU caches. Examples include the ARM Mali™-T600, 700 and 800 series GPUs.
  • Full coherency using full ACE, where CPU and GPU can see each other’s caches.



The Powerful Combination of SVM and Hardware Coherency

The following diagrams summarize what we’ve learned so far and also describe the coarse and fine grain shared virtual memory. These charts approximate elapsed time on the horizontal axis, and address space on the vertical axis.



The above chart shows traditional memory systems where software coherency required data to be cleaned from caches and copied between processors to ‘share’ the data. In addition to cache cleaning, the target cache would also need to invalidate any old data before reading new data from DRAM. This is time consuming and power hungry, and it limits the applications that can take advantage of heterogeneous processing.


2- SVM2.png

With shared virtual memory the CPU and GPU can now share physical memory and operate on the same virtual address, which eliminates the copy. If we have an IO coherent GPU, in other words one-way coherent where GPU can read CPU caches, then we remove the need to clean data from CPU caches. However, because this is one-way, the CPU cannot see the GPU caches. This means the GPU caches must be cleaned with cache maintenance operations after processing completes. This ‘coarse-grain’ SVM means the processors must take turns accessing the shared buffer.


3- SVM3.png

Finally, if we enable a fully coherent memory system then both CPU and GPU can see exactly the same data at all times, and we can use ‘fine-grained’ SVM. This means both processors can access the same buffer at the same time instead of taking turns. Handshaking between processors uses cross-device atomics. By removing all of the cache maintenance overheads we can get the best overall performance.


Connecting Hardware with Software: Compute APIs

At this point it’s useful to map these hardware technologies to the software APIs. Compute APIs like OpenCL 2.0 can take full advantage of SVM and hardware coherency, and can run on HSA platforms. Not all OpenCL 2.0 implementations are the same; there are a number of optional features that can be enabled if the hardware supports them. These features can also be mapped to the HSA profiles: base profile and full profile, as shown in the table below.


OpenCL Feature      | Shared Virtual Memory  | Fully Coherent Memory    | HSA Profile
Fine Grained Buffer | Required, buffer level | Required, fully coherent | Base Profile
Fine Grained System | Required, full memory  | Required, fully coherent | Full Profile
Coarse Grain        | Required, buffer level | Not required             | (legacy, software or IO coherency)



HSA always requires hardware coherency, and with the base profile the scope of shared virtual memory can be limited to the shared buffers. This means only the shared buffers would appear in both CPU and GPU page tables, not the full system memory. This may be easier and lower cost to implement in hardware.


Full coherency is required for fine grain, and this enables both CPU and GPU to work on different addresses within the same data buffer at the same time.


Full coherency also allows the use of atomic operations, which allow processors to work on the same address within the same buffer. Atomic operations allow synchronization between threads, much like in a multi-core CPU. Atomics are optional for OpenCL but required for HSA.


For coarse grain, if hardware coherency is not present then it would need to use software managed coherency including cache maintenance operations, or optionally IO coherency for the GPU.
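The mapping in the table above can be written down as a small lookup. This is a sketch of the table only, not an OpenCL or HSA API: the function name and the string labels are hypothetical, and treating system-scope SVM without full coherency as coarse grain is an assumption, since that combination is not listed in the table.

```python
def classify_svm(svm_scope, coherency):
    """Return (OpenCL SVM feature, HSA profile or None) for a platform.

    svm_scope: "buffer" (only shared buffers in both page tables)
               or "system" (full memory shared).
    coherency: "full", "io" or "software".
    """
    if svm_scope not in ("buffer", "system"):
        raise ValueError(f"unknown SVM scope: {svm_scope}")
    if coherency not in ("full", "io", "software"):
        raise ValueError(f"unknown coherency type: {coherency}")

    if coherency == "full":
        if svm_scope == "system":
            return ("Fine Grained System", "Full Profile")
        return ("Fine Grained Buffer", "Base Profile")

    # Assumption: without full hardware coherency, any SVM scope falls back
    # to coarse grain, which has no corresponding HSA profile.
    return ("Coarse Grain", None)


for scope, coh in [("buffer", "full"), ("system", "full"), ("buffer", "io")]:
    print(scope, coh, "->", classify_svm(scope, coh))
```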


Hardware Requirements for Cache Coherency and Shared Virtual Memory

The hardware required to implement these technologies already exists today in the form of fully coherent processors and cache coherent interconnects. The interconnect is responsible for connecting processors, peripherals and memory together on the system on chip (SoC). The AMD Kaveri APU already has fully coherent memory between the CPU and GPU. ARM offers IP such as the CoreLink™ CCI-550 Cache Coherent Interconnect, Cortex®-A72 processor and the Mali Mimir GPU, which together support the full coherency and shared virtual memory techniques described above.


Interconnect innovations such as snoop filters are essential to support scaling to higher performance memory systems. The snoop filter acts as a directory of processor cache contents and allows any memory access to be targeted directly to the processor that holds that data. More detail on this can be found in this blog: CoreLink CCI-500 and Snoop Filter.
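The directory idea can be sketched in a few lines. This is a toy model under stated assumptions, not the CCI implementation: the `SnoopFilter` class, the cluster names and the line addresses are hypothetical. It contrasts targeted snoops with broadcasting to every other master.

```python
class SnoopFilter:
    """Toy directory recording which clusters hold a copy of each cache line."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.directory = {}  # line address -> set of cluster names

    def record_fill(self, cluster, line_addr):
        self.directory.setdefault(line_addr, set()).add(cluster)

    def record_evict(self, cluster, line_addr):
        holders = self.directory.get(line_addr)
        if holders:
            holders.discard(cluster)
            if not holders:
                del self.directory[line_addr]

    def snoop_targets(self, requester, line_addr):
        """Only the clusters known to hold the line need to be snooped."""
        holders = self.directory.get(line_addr, set())
        return sorted(holders - {requester})

    def broadcast_targets(self, requester):
        """Without a filter, every other cluster would be snooped."""
        return sorted(c for c in self.clusters if c != requester)


sf = SnoopFilter(["big", "little", "gpu"])
sf.record_fill("big", 0x80)

print(sf.snoop_targets("gpu", 0x80))  # line is held by one cluster
print(sf.snoop_targets("gpu", 0xC0))  # held nowhere: go straight to memory
print(sf.broadcast_targets("gpu"))    # what a broadcast would have snooped
```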


Cache Coherency Brings Heterogeneous Compute One Step Closer

HSA, with full coherency and shared virtual memory, is all about delivering new, enhanced user experiences through advances in computing architectures that bring improvements across key areas:


  • performance
  • power efficiency
  • reduced software complexity


Application developers now have access to the complete compute potential of an SoC, where workloads can be moved seamlessly between computing devices, enabling right-sized computing for the given task.

The modern smartphone has long since ceased to be just a tool for calling other people. Nowadays we use it as a remote control for our world: recording our experiences first hand, entertaining ourselves, augmenting our hobbies and keeping always-on social calendars. Those are just some examples of how we consciously use our phones, but they are also used as a primary computing device to support embedded applications such as fitness monitoring and smart home applications like climate control and security.


Modern SoCs need to be designed with these use cases in mind, in order to enable the always-on computing and high functioning computing tasks with an efficiency requirement to last all day. ARM partner MediaTek has a successful track record of designing SoCs that are tailored towards the demands of its end users, people like you and me.


MediaTek have announced a new addition to the Helio line of SoCs, the Helio X20. The Helio X family of SoCs are known for using cutting edge innovation and technology to deliver powerful performance for the rich experiences that are demanded of the modern smartphone.



Helio X20.png

    MediaTek's flagship mobile SoC



The Helio X20 is the world’s first mobile processor with a Tri-Cluster CPU architecture, featuring ten processing cores. The Tri-Cluster architecture is an innovation based on ARM’s big.LITTLE™ processing technology that delivers increased parallel performance at significantly lower average power.  Each processor cluster is designed to efficiently handle different types of workload, enabling a better allocation of tasks which results in optimum performance and extended battery life for users.



For example:

  • The smallest cluster may be used for background computing or simple tasks such as sending messages
  • The middle cluster gives the user an unprecedented balance of computing power and energy efficiency
  • The biggest cluster will be called into play for a task that requires more performance, such as video streaming



The chip delivers leading edge camera, display, gaming and audio features for today’s most demanding applications. The display is refreshed at an accelerated 120 frames per second for crisp and responsive scrolling of web content and maps, and for uncompromised motion viewing in mobile games with high-resolution graphics. The quality of the video and pictures is so sharp that the screen is forgotten; it feels like real life.



State of the art visuals with MiraVision


Let’s take a look under the hood to see how MediaTek managed to do this with the Helio X20.



ARM Technology Powers the Helio X20


The ARM Cortex-A72 processor was used in a dual-core configuration as the large cluster on the chip, running at 2.3GHz. ARM’s flagship application processor ensured this level of performance was reached with 30% less power consumption, due to micro-architectural innovations that enhance floating point, integer and memory performance which improve the execution of every major class of workload.


The Cortex-A53 processor was used in the medium and small clusters, with a quad-core configuration running at 2.0GHz as the medium cluster and another quad-core running at 1.4GHz as the small cluster. The 64-bit processor delivered these frequencies while operating within a low power and area footprint.


The graphics computing power was provided by the ARM Mali-T880, the highest-performance GPU in the Mali family.  Running at 780MHz, it gives 140% higher GPU performance in the Helio X20 chip when compared to its predecessor, while also providing 60% better power performance.


Advanced Power Management is provided through the Cortex-M4 based SCP, which is directed by requests from the OS but is also “system aware” and can enable clock and power control on key system events.


With the Tri-Cluster architecture running different tasks simultaneously, the complexity of managing interrupts increases significantly. The CoreLink GIC-500 generic interrupt controller used affinity level routing to manage the increase in scale in a simple and efficient manner. This boosts processor efficiency, leading to some of the power improvements in the new chip.



MediaTek SoCs Enable Mobile Devices to Perform All Day


As the mobile phone has cemented itself as the primary computing device, we are placing more and more demands on it every day to monitor our health, communicate with friends and entertain us. The variety of use cases means that an optimized approach to system design is needed if we are to be able to rely on our phone’s battery all day.


The MediaTek Helio X family focuses on uncompromised multimedia performance backed by state of the art mobile computing. The Helio X20 uses some world-leading innovations to deliver lasting premium performance due to its advanced power efficiency, and has already begun shipping to customers. The continued partnership between ARM and MediaTek allows both companies to satisfy customer demand in the mobile space.

Chinese version (中文版): System Validation at ARM: Enabling Partners to Build Better Systems

(Huge thanks to the system validation team in Bangalore for providing me with all of the technical information here. Much appreciated!)


Functional validation is widely acknowledged as one of the primary bottlenecks in System-on-Chip (SoC) design. A significant portion of the engineering effort spent on productizing the SoC goes into validation. According to the Wilson Research Group, verification consumed more than 57% of a typical SoC project in 2014.



Source: Wilson Research Group



In spite of these efforts, functional failures are still a prevalent risk for first-time designs. Since the advent of multi-processor chips, including heterogeneous designs, the complexity of SoCs has increased considerably. As you can see in the diagram below, the number of IP components in a SoC is growing at a strong rate.


#IP Blocks in System.png


Source: ChipDesignMag



SoCs have evolved into complex entities that integrate several diverse units of intellectual property (IP). A modern SoC may include several components such as CPUs, GPU, interconnect, memory controller, System MMU, interrupt controller etc. The IPs themselves are complex units of design that are verified individually. Yet, despite rigorous IP-level verification, it is not possible to detect all bugs – especially those that are sensitized only when the IPs interact within a system. This article intends to give you some behind-the-scenes insight into the system validation work done at ARM to enable a wide range of applications for our IP.



Many SoC design teams attempt to solve the verification problem individually using a mix of homegrown and commercially available tools and methods. The goal of system validation at ARM is to provide partners with high quality IP that has been verified to interoperate correctly. This provides a standardized foundation upon which partners are able to build their own SoC system validation solutions. Starting from a strong position, their design and verification efforts can be directed more at the design differentiation they add to the SoC and its interactions with the rest of the system.



Verification Flow


The verification flow at ARM is similar to what is widely practiced in the industry.


Validation flow.png

The ARM verification flow pyramid



Verification of designs starts early and at the granularity of units, which combine to form a stand-alone IP. Unit level is the point in the verification cycle at which engineers have the greatest visibility into the design. Individual signals that would otherwise be deep within the design may be probed or set to desired values to aid validation. Once unit-level verification has reached a degree of maturity, the units are combined to form a complete IP (e.g. a CPU). Only then can IP-level verification commence. For CPUs this is very often the first time assembly program level testing can begin: most of the testing until this point is done by toggling individual wires and signals, whereas at IP level the tests are written in assembly language. The processor fetches instructions from (simulated) memory, decodes them, executes them and so on. Once top-level verification reaches some stability, multiple IPs are combined into a system and the system validation effort begins.


IPs go through multiple milestones during their design-verification cycle that reflect their functional completeness and correctness. Of these, Alpha and Beta milestones are internal quality milestones.  LAC (Limited Access) represents the milestone after which lead partners get access to the IP. This is followed by EAC (Early Access), which represents the point after which the IP is ready to be fabricated for obtaining engineering samples and testing. By the REL (Release) milestone the IP has gone through rigorous testing and is ready for mass production.


IPs are usually between Alpha and Beta quality before going through the system validation flow. By this phase of the design cycle the IPs have already been subjected to a significant amount of testing and most low-level bugs have already been found. Stimulus has to be carefully crafted so that the internal state of the micro-architecture of each IP is stressed to the utmost. The stimulus is provided by either assembly code or by using specially designed verification IPs integrated into the system. ARM uses a combination of both methods.


Many of these bugs could result in severe malfunctions in the end product if they were left undetected. Based on past experience ARM estimates these types of bugs to take between 1-2 peta cycles of verification to discover and 4-7 man months of debug effort. In many cases, a delay that large would prove fatal to a chip’s opportunity to hit its target window in the market. Catching them early enough in the design cycle is critical to ensure the foundations in the IP are stable, before they go on to being integrated as part of an SoC.




System Validation


The nature of ARM’s IP means it is used in a diverse range of SoCs, from IoT devices to high end smartphones to enterprise class products. Ensuring that the technology does exactly what it is designed to do in a consistent and reproducible manner is the key goal of system validation, and the IP is robustly verified with that in mind. In other words, the focus of verification is the IP, but in a realistic system context. Towards this end, ARM tests IPs in a wide variety of realistic system configurations that are called Kits.


A kit is defined as a “group of IPs” integrated together in the form of a subsystem for a specific target application segment (e.g. Mobile, IoT, Networking etc.). It typically includes the complete range of IPs developed within ARM – CPUs, interconnect, memory controller, system controller, interrupt controller, debug logic, GPU and media processing components.

A kit is further broken down into smaller components, called Elements. Elements can be considered building blocks for kits. Each element contains at least one major IP and the white space logic around it, though some elements have several IPs integrated together.



IP, Elements, Kits.png



These are designed to be representative of typical SoCs with different applications. One result is that it gives ARM a more complete picture of the challenges faced by the ecosystem of integrating various IP components together to achieve a target system performance.


The system validation team uses a combination of stimulus and test methodology to stress test kits. Stimulus is primarily software tests that are run on the CPUs in the system. The tests may be hand-created (in either assembly or a high-level language) or generated using Random Instruction Sequence (RIS) tools, which will be explained in the upcoming sections. In addition to code running on CPUs, a set of Verification IPs (VIPs) are used to inject traffic into the system and to act as observers.


In preparation for validation, a test plan is created for every IP in the kit. Test planning captures the various IP configurations, features to be verified, scenarios that will be covered, stimulus, interoperability considerations with other IPs, verification metrics, tracking mechanisms, and the various flows that will be a part of verification. Testing of kits starts with simple stimulus that is gradually ramped up to more complex stress cases and scenarios.


The testing performs various subsystem level assessments such as performance verification, functional verification, and power estimation. Reports documenting reference data, namely the performance, power, and functional quality, of selected kits are published internally. This document focuses on functional aspects only; performance and power related topics will be covered in future blogs.


The system validation team at ARM has established a repeatable and automated kit development flow, which allows us to build multiple kits for different segments. ARM currently builds and validates about 25 kits annually.


The mix of IPs, their internal configuration, and the topology of the system are chosen to reflect the wide range of end uses. The kits are tested on two primary platforms – emulation and FPGA. Typically testing starts on the emulator and subsequently soak testing is done on FPGA. On average every IP is subjected to 5-6 trillion emulator cycles and 2-3 peta FPGA cycles of system validation. In order to run this level of testing, ARM has developed some internal tools.
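To get a feel for these cycle counts, they can be converted into wall-clock time. The sketch below is simple arithmetic under loud assumptions: the clock frequencies and the number of platforms running in parallel are hypothetical values chosen for illustration and are not stated in the text. Only the cycle counts (trillion = 10^12, peta = 10^15) come from the figures above.

```python
SECONDS_PER_DAY = 86_400


def wall_clock_days(cycles, clock_hz, parallel_platforms=1):
    """Days needed to run `cycles` at `clock_hz`, split evenly across platforms."""
    seconds = cycles / (clock_hz * parallel_platforms)
    return seconds / SECONDS_PER_DAY


# Assumed, not sourced: an emulator running at 1 MHz and FPGAs running at 20 MHz.
ASSUMED_EMULATOR_HZ = 1e6
ASSUMED_FPGA_HZ = 20e6

emulator_days = wall_clock_days(5e12, ASSUMED_EMULATOR_HZ)
fpga_days_single = wall_clock_days(2e15, ASSUMED_FPGA_HZ)
fpga_days_farm = wall_clock_days(2e15, ASSUMED_FPGA_HZ, parallel_platforms=10)

print(f"emulator: {emulator_days:.1f} days")
print(f"one FPGA platform: {fpga_days_single:.1f} days")
print(f"ten FPGA platforms: {fpga_days_farm:.1f} days")
```

Whatever the real frequencies are, the calculation makes the motivation visible: peta-cycle soak testing only becomes practical on faster platforms such as FPGAs, and running platforms in parallel divides the elapsed time.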




System Validation Tools


There are three primary tools used in system validation, which focus on areas such as the instruction pipeline, the IP-level and system-level memory system, system coherency and interface-level interoperability. Two of these tools are Random Instruction Sequence (RIS) generators. RIS tools explore the architecture and micro-architecture design space in an automated fashion, attempting to trigger failures in the design, and they are more effective at covering that space than hand-written directed tests. The tests they generate are multi-threaded assembly code, composed of random ARM and Thumb instructions, designed to thoroughly exercise the functioning of different portions of the implementation.


The third tool is a lightweight kernel that can be used as a platform to develop directed tests. It supports basic memory management, thread scheduling, and a subset of the pthreads API, which allows users to develop parameterized directed tests. The validation methodology uses a combination of directed testing and random instruction based automated testing.
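The general idea of a random instruction sequence generator can be sketched as follows. This is a toy illustration and not ARM's internal tooling: the instruction templates, register pool and function names are hypothetical, and a real RIS tool covers far more of the instruction set and checks results as well. The sketch shows one property that matters in practice: seeding the generator makes a failing test reproducible.

```python
import random

REGISTERS = [f"r{i}" for i in range(8)]

# Hypothetical templates; a tiny subset chosen only for illustration.
TEMPLATES = [
    "add {d}, {a}, {b}",
    "sub {d}, {a}, {b}",
    "eor {d}, {a}, {b}",
    "ldr {d}, [{a}]",
    "str {d}, [{a}]",
    "dmb ish",
]


def generate_thread(rng, length):
    """Build one thread's instruction stream from random templates."""
    lines = []
    for _ in range(length):
        template = rng.choice(TEMPLATES)
        lines.append(
            template.format(
                d=rng.choice(REGISTERS),
                a=rng.choice(REGISTERS),
                b=rng.choice(REGISTERS),
            )
        )
    return lines


def generate_test(seed, threads=2, length=8):
    """Generate a multi-threaded test; the same seed gives the same test."""
    rng = random.Random(seed)
    return {f"thread{t}": generate_thread(rng, length) for t in range(threads)}


test = generate_test(seed=1234)
for name, body in test.items():
    print(name)
    for line in body:
        print("    " + line)
```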






In order to stress test IP at the system level, a more random approach is used rather than a directed approach. This enables ARM to cover a range of scenarios, stimulate multiple timing conditions and create complex events. To this end, kits support various verification-friendly features such as changing the clock ratios at different interfaces, enabling error injectors, and stubbing out components that are not required for verifying a given feature. Bus clock ratios at various interfaces in the system, such as the CPU, interconnect and dynamic memory controller, can be changed to stimulate realistic system clocking conditions.

System Validation bring up.png
System validation bring up

The diagram above shows how the system is initially brought up and how test complexity is gradually scaled up.


Integration Tests & KVS

Initial testing starts with a set of simple integration tests that are run to confirm the basic stability of the kit and flush out minor integration issues. Following this, a suite of tests called the Kit Validation Suite (KVS) is used to thoroughly test the integration of the kit. These tests are run early in the verification cycle to validate that the kit is good enough to run more stressful payloads. KVS can be configured to run on a wide variety of kits. It includes sub-suites to test integration, power, CoreSight debug and trace, and media IPs. There are specific tests in KVS to test integration of GPU and display as well as GPU coherence. Initial boot is usually done in simulation, and testing gradually transitions to emulators (hardware accelerators) for the integration testing.


RIS Boot and Bring up

After that we boot all the RIS tools with basic bring up tests on the kit to work through any hardware/software configuration issues.


RIS: Default and Focused Configurations

Once the kit is stable the complexity of tests and therefore the stress that they place on the system is increased. Random stimulus can cover the design space faster than directed stimulus and requires less effort towards stimulus creation. Therefore, for stress testing there is more reliance on random stimulus than directed tests. Initially default configurations of the RIS tools are run and after a suitable number of verification cycles, the tools are re-configured to stress the specific IPs in the kit.
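One way to picture the move from a default to a focused configuration is as a change in the weights used to pick instruction classes. The sketch below is an assumption about how such a knob might look, not a description of the actual RIS tools: the class names, weights and helper functions are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical default mix of instruction classes.
DEFAULT_WEIGHTS = {"alu": 4, "load": 2, "store": 2, "barrier": 1, "branch": 1}


def focused(weights, **overrides):
    """Return a copy of `weights` with some classes re-weighted."""
    updated = dict(weights)
    updated.update(overrides)
    return updated


def sample_mix(weights, count, seed):
    """Draw `count` instruction classes according to `weights`."""
    rng = random.Random(seed)
    classes = list(weights)
    picks = rng.choices(classes, weights=[weights[c] for c in classes], k=count)
    return Counter(picks)


def memory_fraction(mix):
    total = sum(mix.values())
    return (mix["load"] + mix["store"]) / total


# Focus on the memory system by boosting loads and stores.
memory_focus = focused(DEFAULT_WEIGHTS, load=8, store=8)

default_mix = sample_mix(DEFAULT_WEIGHTS, count=10_000, seed=7)
focused_mix = sample_mix(memory_focus, count=10_000, seed=7)

print(f"default memory ops: {memory_fraction(default_mix):.2f}")
print(f"focused memory ops: {memory_fraction(focused_mix):.2f}")
```

The default run spreads effort across the design, while the focused run spends most of its cycles on the part of the system under scrutiny.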


RIS Soak

In the final phase of system validation the kit is soak tested on FPGAs. Though emulators are more debug friendly, FPGAs are faster and can provide a lot more validation cycles. Therefore, once the IPs are stable and mature, ARM soak tests them on FPGAs to find complex corner cases.



Metrics, Tracking, Coverage and Milestone Closure


The number of validation cycles run for every kit is one of the metrics that is tracked to ensure the target number of validation cycles has been met. This is especially useful to ensure the soak-testing cycle target has been met, increasing confidence in the quality of the IP in various applications. In addition to that, we quantify and track coverage using a statistical coverage method to ensure that the full design, including potential corner cases, has been exercised sufficiently.


The latest version of the ARM Juno test chip was subjected to a total validation run time of 6,130 hours, the equivalent of 8 and a half months of testing. This gives a unique perspective into corner cases within the system that makes ARM better able to support partners who are attempting to debug issues within their own design. Furthermore, the bugs that are found during the validation process are then fed back into the IP design teams who use the information to improve the quality of the IP at each release milestone, as well as guide next-generation products.





System complexity has increased in line with SoC performance capabilities, causing a significant growth in the amount of time and money spent on validation. ARM verifies its IP for interoperability before it is released to partners to make sure it is suitable for a wide range of applications. ARM’s IP teams are continuously designing at the leading edge, and are helped by the system validation team to ensure they work together in the systems our partners are building.

Frank Schirrmeister of Cadence Design Systems cites the validation of their tool interoperability as one benefit: “As an ARM ecosystem partner, Cadence relies on pre-verified ARM cores and subsystems that can be easily integrated into the designs that we use to validate our tool interoperability. ARM’s software-driven verification approach reflects the industry’s shift toward the portable stimulus specification and allows us to validate the integration and interoperability of ARM cores and subsystems on all Cadence System Development Suite engines, including simulation, emulation and FPGA-based prototyping engines.”


Due to the wide variety of applications that the ARM partnership designs for, it is necessary to ensure our IP is functional in many different systems. The multi-stage approach to system validation at ARM gives our partners the peace of mind that they can rely on our IP. Over time the validation methodology has evolved into one that tests several system components and stresses most IPs in the system. In the future we have plans to extend and further improve our test methods to ensure an even higher standard of excellence across ARM IP.


Hi all,


I was recently interviewed by Ann Mutschler at SEMICONDUCTOR ENGINEERING regarding Cache Coherency and Configurability. It's a useful summary of why we need hardware coherency to simplify software, and the importance of configurability of interconnect and system products to optimize the performance, area and cost of SoCs.



Coherency, Cache And Configurability


Coherency is gaining traction across a wide spectrum of applications as systems vendors begin leveraging heterogeneous computing to improve performance, minimize power, and simplify software development.


Coherency is not a new concept, but making it easier to apply has always been a challenge. This is why it has largely been relegated to CPUs with identical processor cores. But the approach is now being applied in many more places, from high-end datacenters to mobile phones, and it is being applied across more cores in more devices.


“Today, in the networking and server spaces, we’re seeing heterogeneous processing there,” said Neil Parris, senior product manager in the Systems and Software Group at ARM. “It’s really a mixture of, for example, ARM CPUs with maybe different sizes of CPUs, but other processors such as DSP engines, as well. The reason they want the cache coherency comes down to efficiency and performance. You want to share data between the processors, and if you have hardware cache coherency, then the software doesn’t need to think about it. It can share the data really easily.”

Without hardware coherency, it has to be written in software. “So every time you need to pass some data from one CPU to the next, you have to clean it out of one CPU cache into main memory,” Parris said. “You have to tell the next CPU if you’ve got any old copies of this data in the cache. You have to invalidate that and clean it out. Then you can read the new data. You can imagine that takes CPU cycles to do that. It takes unnecessary memory accesses and DRAM power to do that. So really, the hardware coherency is fundamental to improving the performance of the system.”



Read the rest at: Semiconductor Engineering .:. Coherency, Cache And Configurability



