
ARM Processors




At the Linley Processor Conference earlier this week, I had the opportunity to present the challenges facing architects who are building hardware for distributed cloud intelligence. I also discussed how you can address these challenges with ARM’s 3rd generation coherent backplane IP: the ARM CoreLink CMN-600 and ARM CoreLink DMC-620. The new on-chip network and memory controller IP has been optimized to boost SoC performance across a broad range of applications and markets, including networking, server, storage, HPC, automotive and industrial.


The need for an intelligent flexible cloud

Not only are we seeing significant growth in the number of connected devices, but we are also seeing evolving use cases. Virtual reality is hitting mainstream price points and requires a constant high-bandwidth stream of content. Autonomous vehicles are generating a lot of buzz, but we probably will not see truly autonomous vehicles on our streets until ultra-low latency car-to-car communication is deployed. These new use cases will require an intelligent flexible cloud where the applications and services are pushed to the edge of the network.


Blending compute and acceleration from edge to cloud

A new approach will be required to meet the demands of these evolving use cases. Today, system architects are trying to figure out how to maximize efficiency with heterogeneous computing and acceleration (e.g., GPU, DSP, FPGA) to optimize systems across a wide range of power and space constraints. During the presentation, I showed three example design points, each with different needs and constraints: the data center, maximizing compute density for a wide variety of workloads; the edge cloud, providing distributed services; and the small access point, keeping all the end points connected at all times.



New high performance, scalable architecture

These three heterogeneous design points illustrate the targets we set out to address with our 3rd generation coherent backplane IP architecture. Our goal was to maximize compute performance and throughput (a measure of both bandwidth and number of transactions), across a broad range of power and area constraints.


The result is our new CoreLink CMN-600 Coherent Mesh Network and CoreLink DMC-620 Dynamic Memory Controller.  Together they have been optimized to provide a fast, reliable on-chip connectivity and memory subsystem for heterogeneous SoCs that blend ARMv8-A processors, accelerators and IO.


Some of the key new capabilities and performance metrics include:

  • New scalable mesh network that can be tailored for SoCs from 1 to 32 clusters (up to 128 processors)
  • 5x higher throughput than the prior generation and capable of more than 1 TB/s of sustained bandwidth
  • Higher frequencies (exceeding 2.5 GHz) and 50 percent lower latency
  • New Agile System Cache with intelligent cache allocation to enhance sharing of data between processors, accelerators and IO
  • Supporting CCIX, the open industry standard for coherent multi-chip processor and accelerator connectivity
  • 1 to 8 channels of DDR4-3200 memory and 3D stacked DRAM for up to 1 TB of addressable memory per channel
  • End-to-end QoS and RAS (Reliability, Availability and Serviceability) supported by the combined CMN-600 and DMC-620 solution
  • In-built security with integrated ARM TrustZone Address Space memory protection
  • Automated SoC creation with ARM CoreLink Creator and Socrates DE tooling
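
To put the DDR4-3200 figure in the list above into perspective, here is a rough back-of-the-envelope sketch of peak theoretical DRAM bandwidth. It assumes a standard 64-bit (8-byte) data bus per DDR4 channel, which is not stated in the post, and the function name is purely illustrative. Sustained bandwidth in a real system will be lower than these peak numbers.

```python
# Peak theoretical DDR4-3200 bandwidth per channel count.
# Assumption (not from the post): each channel has a 64-bit (8-byte) data bus.
TRANSFERS_PER_SECOND = 3200e6   # DDR4-3200 means 3200 mega-transfers per second
BYTES_PER_TRANSFER = 8          # assumed 64-bit channel width


def peak_bandwidth_gb_s(channels: int) -> float:
    """Return peak theoretical bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9


for channels in (1, 2, 4, 8):
    print(f"{channels} channel(s): {peak_bandwidth_gb_s(channels):.1f} GB/s")
```

Under that assumption, one channel peaks at 25.6 GB/s and eight channels at 204.8 GB/s. This is off-chip memory bandwidth and is a separate number from the on-chip mesh throughput quoted in the same list.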


The following image illustrates how the technology could be used to build anything from a small access point, focused on efficient throughput, up to a data center, focused on maximizing compute density.



We are really excited to see the continued evolution of these new intelligent, distributed use cases and to see how SoC architects will deploy our new technology. Stay tuned: we’ll be discussing more of the capabilities in the coming months.


If you would like to find out more about the IP, please check out our developer pages below or attend my upcoming technical talk at ARM TechCon, Oct 25-27 2016 in Santa Clara, CA.



ARM processor

Posted by zoezz Sep 22, 2016

An ARM processor is one of a family of CPUs based on the RISC (reduced instruction set computer) architecture developed by Advanced RISC Machines (ARM).



ARM makes 32-bit and 64-bit RISC multi-core processors. RISC processors are designed to perform a smaller number of types of computer instructions so that they can operate at a higher speed, performing more millions of instructions per second (MIPS).  By stripping out unneeded instructions and optimizing pathways, RISC processors provide outstanding performance at a fraction of the power demand of CISC (complex instruction set computing) devices.



ARM processors are extensively used in consumer electronic devices such as smartphones, tablets, multimedia players and other mobile devices, such as wearables. Because of their reduced instruction set, they require fewer transistors, which enables a smaller die size for the integrated circuitry (IC). The ARM processor’s smaller size, reduced complexity and lower power consumption make it suitable for increasingly miniaturized devices.



ARM processor features include:

  • Load/store architecture.
  • An orthogonal instruction set.
  • Mostly single-cycle execution.
  • Enhanced power-saving design.
  • 64 and 32-bit execution states for scalable high performance.
  • Hardware virtualization support.

The simplified design of ARM processors enables more efficient multi-core processing and easier coding for developers. While they don't have the same raw compute throughput as the products of x86 market leader Intel, ARM processors sometimes exceed the performance of Intel processors for applications that exist on both architectures.



The head-to-head competition between the vendors is increasing as ARM finds its way into full-size notebooks. Microsoft, for example, offers ARM-based versions of Surface computers. The cleaner code base of Windows RT versus x86 versions may also be partially responsible: Windows RT is more streamlined because it doesn’t have to support a range of legacy hardware.



ARM is also moving into the server market,  a move that represents a large change in direction and a hedging of bets on performance-per-watt over raw compute power. AMD offers 8-core versions of ARM processors for its Opteron series of processors. ARM servers represent an important shift in server-based computing. A traditional x86-class server with 12, 16, 24 or more cores increases performance by scaling up the speed and sophistication of each processor, using brute force speed and power to handle demanding computing workloads.



In comparison, an ARM server uses perhaps hundreds of smaller, less sophisticated, low-power processors that share processing tasks among that large number instead of just a few higher-capacity processors. This approach is sometimes referred to as “scaling out,” in contrast with the “scaling up” of x86-based servers.



The ARM architecture was originally developed by Acorn Computers in the 1980s.

Across multiple markets, including automotive, industrial control and healthcare, electronic systems are becoming more complex. Vehicles are beginning to drive themselves, industrial robots are becoming increasingly collaborative, and medical systems are automated to assist with surgery or deliver medication. More of these systems are demanding functionally safe operation and requiring that functional safety be provided at a higher safety level than previous generations of systems demanded. The new ARM® Cortex®-R52 processor has been introduced to address the challenging needs of these types of system.


This rise in complexity can be seen in vehicles, where in-car compute is expected to rise by 100 times by 2020. For example, engine management systems continue to increase in complexity to meet ever more stringent emission controls and must safely control the engine to prevent damage or hazards like unintended acceleration. Vehicle electrification requires control of very powerful motors and sophisticated management of batteries with a huge amount of stored energy: the large 90 kWh lithium-ion battery pack in a Tesla contains the same amount of energy as 77 kg of TNT explosive, so the consequences of errors are significant. On the industrial side, factory automation is increasing, with autonomous robots using machine learning and vision systems to work more flexibly and with less direct control.
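
The battery comparison above can be checked with a quick calculation. This is only a sketch: it assumes the conventional "TNT equivalent" of 4.184 MJ per kilogram, which the post does not state, and the function name is illustrative.

```python
KWH_TO_MJ = 3.6          # 1 kWh = 3.6 MJ
TNT_MJ_PER_KG = 4.184    # assumed: conventional TNT-equivalent energy per kg


def tnt_equivalent_kg(battery_kwh: float) -> float:
    """Convert a battery capacity in kWh to kilograms of TNT equivalent."""
    return battery_kwh * KWH_TO_MJ / TNT_MJ_PER_KG


print(f"{tnt_equivalent_kg(90):.1f} kg of TNT")
```

With that convention, 90 kWh is 324 MJ, which works out to roughly 77 kg of TNT, in line with the figure quoted.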


Outside the factory, robotics will be used in environments too harsh for humans, such as the nuclear industry, where there is a need to maintain precise and assured operation. They can also be used in the medical operating theaters with remote surgery. In both areas functionally safe operation is critical.


Functional safety

It’s obvious that a car’s brakes need to work exactly when required in order to drive safely. Systems such as these require functional safety. Hazards or errors may still occur, however, so a functionally safe system must be capable of detecting them to avoid unsafe situations.


A functionally safe system has to be protected against two types of errors: random and systematic.


The impact of random errors, for example a memory bit flipping due to radiation, can be protected against through the inclusion of features in the processor. Cortex-R52 integrates the highest level of safety features of any ARM processor to guard against this type of error.



Systematic errors, on the other hand, are typically the result of software or design errors. Protection against these is provided by the use of appropriate processes and procedures at design time. Cortex-R52 has been developed from the ground up within a robust process to help protect it from these systematic issues. A comprehensive safety pack is available to SoC partners, which simplifies and reduces the effort needed in certifying the end system.


There are a number of different standards and guidelines related to functional safety. As an example, ISO 26262 was developed for the automotive industry in which four Automotive Safety Integrity Levels (ASIL) are defined, of which D is the highest level.


You can read more about functional safety in The Functional Safety Imperative in Automotive Design whitepaper.

The rise of autonomous systems

There is a range of different applications where functional safety and fast deterministic execution are necessary. In many real-time control systems the application can be managed either with a single Cortex-R52 processor or across multiple homogeneous processors. This might be typical in a conventional control system like an automotive engine management system or industrial controller.


As mentioned, more and more systems are moving towards autonomous behaviour. We can divide the functions found in an autonomous system into a set of stages: sense, perceive, decide, actuate.



  • Sense: a broad range of sensors are used to gather raw information
  • Perceive: data from the sensors is used along with complex algorithms such as machine learning to interpret more about the environment in which the system is operating
  • Decide: the outputs from the various systems are gathered and a decision made
  • Actuate: the decision is carried out or communicated


ARM enables all aspects of these autonomous systems with processors from across the Cortex-A, Cortex-R and Cortex-M families being used according to the need of each stage. The decide and actuate stages must be functionally safe. As an example, the decision stage can take inputs from the navigation system, speed sensors and all of the vision and radar systems and decide when to change lane or to get ready to exit the highway.
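
The four stages can be pictured as a simple pipeline. The toy sketch below is only an illustration: every name, threshold and sensor value in it is hypothetical, and it says nothing about how a production system partitions the work across processors. It does show the idea that the decide stage is a small deterministic function that falls back to a safe action when perception is not confident enough.

```python
from dataclasses import dataclass


@dataclass
class Perception:
    lane_offset_m: float   # hypothetical: lateral offset from the lane centre
    confidence: float      # hypothetical: 0.0 to 1.0


def sense():
    """Stand-in for raw sensor samples (made-up values)."""
    return [0.42, 0.38, 0.40]


def perceive(samples):
    """Stand-in for the throughput-heavy interpretation stage."""
    offset = sum(samples) / len(samples)
    return Perception(lane_offset_m=offset, confidence=0.9)


def decide(p, max_offset_m=0.3, min_confidence=0.8):
    """Deterministic decision with a safe fallback on low confidence."""
    if p.confidence < min_confidence:
        return "alert_driver"
    if abs(p.lane_offset_m) > max_offset_m:
        return "steer_to_centre"
    return "hold"


def actuate(command):
    """Stand-in for carrying out or communicating the decision."""
    return f"actuating: {command}"


print(actuate(decide(perceive(sense()))))
```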


Automotive is a prime example of the move to autonomous systems. We are already seeing driver assistance systems such as lane detection, where the driver is notified, moving to lane keeping, where action is taken. Vehicles are introducing functionality on the way to autonomy, such as automatic lane changing, that only experimental vehicles had previously supported.


The trend is also being seen in other areas. Conventional robotic production lines, where robots carry out a defined fixed task and are segregated from operators, are being replaced by collaborative industrial robots. These have unconstrained interaction with human operators, sensing their environment and taking action safely.  They may be capable of selecting and placing the correct component while working in conjunction with a human operator on the same assembly and avoiding a hazardous conflict. Surgical robots are also increasingly being used to help provide improved patient outcomes and future commercial autonomous drones are expected to be in need of these characteristics.


As with the previous real time control system there is a need to take inputs from sensors, decide what to do and then command action.


These autonomous systems need to apply another level of judgement by interpreting more about the environment in which they are operating. These tasks can be confidence based and require high levels of throughput to process large amounts of data. Such operations are well suited to Cortex-A class of processors.


These systems still need to be functionally safe with deterministic execution. When combined together in a heterogeneous processor, the Cortex-R52 can provide a safety island protecting the operation of the system.


In the case of an ADAS system, inputs can be gathered from sensors such as cameras, Radar and Lidar. This data is processed and combined by the Cortex-A processors to identify and classify targets.  This information can be passed to the Cortex-R52 to decide what action to take and perform the necessary checks on the operation to ensure safe operation.


Increasing software complexity

As the functionality of a system has evolved, the complexity of both hardware and software has also increased. Systems are now integrating more software from multiple sources and with multiple safety criticality needs. This is a complex integration challenge.


Safety critical software needs to be validated and certified; a time consuming and complex exercise. Because of the interaction between the software, the entire software stack would typically be safety certified, even if only a small proportion is safety critical. The more complex the system, the harder this becomes.


A better solution would be the ability to guarantee the independence of safety critical code.  This would simplify the development and integration of functional safety software,  with clear separation between different levels of software criticality. Safety code, critical safety code and non-safety code can each be validated and certified to their required level. Providing this independence means that changes to one module do not require wholesale re-certification of all of the software, thus saving time and effort.


For many of these systems it is important to remember that this separation must be achieved whilst still maintaining deterministic execution.


Cortex-R52 is unique in providing the hardware to support both isolation and real-time execution, and this is achieved through the addition of a new exception level and 2-stage MPU, introduced in the ARMv8-R architecture. This can be used by monitor or hypervisor software to manage access to resources and create sandboxes to protect each task. The design of the Cortex-R52 allows for fast switching between protected applications and maintains deterministic execution.
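
One way to picture the two-stage protection is as two independent region checks that must both pass. The sketch below is a simplified software model under that assumption, with one set of regions programmed by the guest software and one by the hypervisor. It is not the ARMv8-R register interface, and all names and addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    base: int        # first address covered
    limit: int       # last address covered (inclusive)
    writable: bool   # False means read-only


def permits(regions, addr, write):
    """True if any region covers addr and allows this kind of access."""
    return any(
        r.base <= addr <= r.limit and (r.writable or not write)
        for r in regions
    )


def access_allowed(guest_regions, hypervisor_regions, addr, write):
    """An access succeeds only if both stages permit it."""
    return permits(guest_regions, addr, write) and permits(
        hypervisor_regions, addr, write
    )


# Hypothetical sandbox: the hypervisor grants one task a 64 KiB read/write window.
hypervisor_regions = [Region(0x2000_0000, 0x2000_FFFF, writable=True)]
# The guest programs its own regions, one of which strays outside the sandbox.
guest_regions = [
    Region(0x2000_0000, 0x2000_0FFF, writable=True),
    Region(0x3000_0000, 0x3000_0FFF, writable=True),
]

print(access_allowed(guest_regions, hypervisor_regions, 0x2000_0010, write=True))
print(access_allowed(guest_regions, hypervisor_regions, 0x3000_0010, write=True))
```

The first access is inside both the guest region and the sandbox, so it is allowed. The second is permitted by the guest's own settings but falls outside the hypervisor's window, so it is refused. That is the property that lets a misbehaving task be contained without trusting its own configuration.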


At the same time as offering protection of software it also simplifies the integration of code together into a single processor. Through the use of a hypervisor, multiple operating systems can be supported more easily, thus enabling consolidation of applications.


Delivering real time performance

Many of the systems described above require deterministic operation, with the appropriate action being not only controlled but also performed at the right time and without significant delay, regardless of what else is happening in the system.


The Cortex-R family offers real-time processors with high-performance for embedded systems. Cortex-R52 is the first processor in the ARMv8-R architecture and further extends the capabilities of the Cortex-R5, both in terms of functional safety and increased performance.


Cortex-R52 delivers up to 35% higher single core performance over Cortex-R5, when running standard benchmarks. EEMBC has independently certified and published the results of their Automotive Industrial benchmark confirming the processor’s increased capability. Results were achieved using the Green Hills Compiler 2017.


This benchmark performance increase is enhanced by additional real-time performance gains. Through fast access and integration of the interrupt controller within the cluster, interrupt latency has been reduced to half that of the Cortex-R5. The improved Memory Protection Unit, with finer granularity and faster reconfiguration, significantly reduces context switching time, making it 14 times faster than on the Cortex-R5. System performance is further increased because twice as many Cortex-R52 cores as Cortex-R5 cores can be integrated within a cluster.


Cortex-R52 supports an adaptable memory architecture with deterministic Tightly Coupled Memories integrated within the processor. These enable assured memory latencies and they can be flexibly allocated to Instruction or Data and configured in a range of sizes to meet the application needs. The processor supports a rich set of interface ports around which the system can be built. Interfaces include a Low Latency Peripheral Port, AXI interfaces and a dedicated wide Flash memory interface to provide access to resources with managed arbitration.


Leveraging the power of ARM

The adoption of Cortex-R52 comes with a lot more than just the processor. The ARM architecture has amassed a broad following of adopters and developers within its ecosystem. With silicon partners delivering hardware to the market, it’s the number one architecture with, at the time of writing, more than 86 billion chips shipped.


Ecosystem partners provide the widest choice of software packages, drivers, stacks, operating systems and tools, simplifying development for users. Adopters of the Cortex-R52 can leverage this common architecture to reduce costs through the availability of multiple suppliers capable of addressing their requirements with the architecture. They can develop on a single platform, implement heterogeneous systems and port solutions between different platforms faster and with more reliable results. For more information check out ARM's software development tools for ARM Cortex-R.


Cortex-R52 addresses increased sophistication in safety applications

A high level of deterministic functional safety is needed in automotive, industrial, aerospace and medical markets (amongst others) where there is the need to devolve more autonomy in electronic systems. The Cortex-R52 processor has been designed to address the trend of increasing sophistication in safety applications which are driving a need for higher levels of performance, greater support for functional safety and an improved approach to software separation.

In concert with ARM's rollout today of the new ARM Cortex-R52, the first ARMv8-R processor, Synopsys also announced broad design solution support to enable the design of safety-critical systems for automotive, industrial and healthcare applications with this new processor.


jscobie wrote a good blog that explains how the new ARM processor supports development of safety-critical applications: New ARM Cortex-R52 enables autonomous systems with the highest functional safety standards


Designers can start Cortex-R52 designs today using Synopsys solutions, including:

In addition to following the links above to Synopsys solutions for Cortex-R52, you can learn more about Synopsys' automotive IC design and software development solutions, which are enabling safe, secure, smarter cars -- from silicon to software at

The Cortex-M7 has twice the DSP power of the M4 because it executes twice as many instructions simultaneously, and it also helps that the M7 can operate at a higher clock frequency than the M4. It’s backed by the Keil CMSIS DSP library and includes a single and double precision FPU.




It was developed to provide a low-cost platform that meets the needs of MCU implementation, with a reduced pin count and low power consumption, while delivering outstanding computational performance and low interrupt latency. You can also use two M7 cores in lock step running the same code, one following two cycles behind the other, so that glitches can be detected by external electronics if the two CPUs suddenly behave slightly differently.
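
The lock-step idea can be modelled in a few lines. The sketch below is a toy simulation, not a description of the actual hardware: the "core" is just an accumulator, the fault is an injected bit flip, and all names are made up for illustration. It shows why the leading core's outputs have to be held in a two-entry delay line before the comparator (standing in for the external electronics) checks them against the trailing core.

```python
from collections import deque

DELAY_CYCLES = 2  # the trailing core runs two cycles behind the leading core


def core_outputs(inputs, glitch_at=None):
    """Toy 'core': accumulates inputs; optionally flips a bit at one step."""
    acc = 0
    for step, value in enumerate(inputs):
        acc = (acc + value) & 0xFFFFFFFF
        if step == glitch_at:
            acc ^= 0x1  # simulated transient fault
        yield acc


def first_mismatch(inputs, glitch_lead=None, glitch_trail=None):
    """Return the wall-clock cycle at which the comparator fires, or None."""
    lead = list(core_outputs(inputs, glitch_lead))
    trail = list(core_outputs(inputs, glitch_trail))
    delay_line = deque()
    for cycle in range(len(inputs) + DELAY_CYCLES):
        if cycle < len(lead):
            delay_line.append(lead[cycle])      # leading core output enters the delay line
        if cycle >= DELAY_CYCLES:
            step = cycle - DELAY_CYCLES         # step the trailing core has just finished
            if delay_line.popleft() != trail[step]:
                return cycle
    return None


program = [1] * 8
print(first_mismatch(program))                  # no fault injected
print(first_mismatch(program, glitch_lead=3))   # fault in the leading core at step 3
```

With no fault the comparator never fires. With a fault at step 3 of the leading core, the mismatch is seen two cycles later, once the trailing core reaches the same step. Because the two cores are at different points in the program at any instant, a single disturbance is unlikely to corrupt both in the same way.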

Setting up Keil for Your First LED Blinking Program on STM32F7 Discovery Board – KGP Talkie


The STM32F745xx and STM32F746xx devices are based on the high-performance ARM® Cortex®-M7 32-bit RISC core operating at up to 216 MHz frequency. The Cortex®-M7 core features a single-precision floating point unit (SFPU) which supports all ARM® single-precision data-processing instructions and data types. It also implements a full set of DSP instructions and a memory protection unit (MPU) which enhances application security.

If you have been tracking ARM in servers and networking news closely, you will know it has been a busy summer. Most recently at Hot Chips, ARM Fellow and Lead ISA Architect Nigel Stephens disclosed details on our ARMv8-A SVE technology. While Nigel’s technology disclosure was primarily targeted to the HPC community, our next disclosure will have a much broader impact for the ARM server and networking ecosystems.


At the upcoming Linley Processor Conference, ARM Senior Product Manager Jeff Defilippi will introduce the next-generation of ARM coherent backplane IP designed to boost SoC performance in systems based on the ARMv8-A architecture from the edge of the network and into the Cloud. See below for the full description of Jeff's session:

If you are a member of the press and industry analyst community and would like more information ahead of the conference, please contact


Where: Linley Processor Conference 2016 at the Hyatt Regency, Santa Clara, CA

When: Sept. 27, 1:50 p.m. (Session 5 on the day one agenda entitled SoC Connectivity)


What's next for headsets?

Posted by lorenser Sep 12, 2016

The cat is out of the bag. There has been a lot of speculation around Apple’s plans to remove the headset jack for the iPhone 7. The recent announcement confirming this will now lead to innovation and new opportunities in the headset market. This will be driven by users’ demand for longer listening and talk time for battery-powered headsets and will require scalable platforms to add new features.


Next generation headsets demand scalable solutions


Audio algorithms and codecs cover both encoding and decoding of audio streams, which usually happens in stages. These stages range from MAC intensive modules, such as filters, to modules where control code is dominant. Hence each of these modules has specific system requirements if they are to be efficiently processed.
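
As a concrete picture of what "MAC intensive" means, here is a minimal sketch of a direct-form FIR filter. It is written in plain Python for readability rather than performance, and the function name is illustrative. The inner loop performs one multiply-accumulate per filter tap for every output sample, which is exactly the kind of work DSP instructions are designed to speed up.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR filter: each output is a sum of multiply-accumulates."""
    taps = len(coefficients)
    history = [0.0] * taps              # most recent sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]    # shift the new sample into the delay line
        acc = 0.0
        for h, c in zip(history, coefficients):
            acc += h * c                # one multiply-accumulate (MAC) per tap
        outputs.append(acc)
    return outputs


# A 4-tap moving average applied to a constant input ramps up to the input level.
print(fir_filter([4.0, 4.0, 4.0, 4.0], [0.25, 0.25, 0.25, 0.25]))
```

A filter with N taps costs N MACs per sample, so a longer filter or a higher sample rate scales the arithmetic load directly, while control-dominated stages of a codec do not benefit from MAC throughput in the same way.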


While the main use case of headsets is audio processing, the human ear is a great source for body diagnostics, too. For example, dedicated sensors in the ear canal could be used to measure heart rate. Adding more sensors into these Bluetooth-enabled devices will demand scalable platforms and drive the requirement for even more energy-efficient SoCs.


The ability to process sensor data, control and DSP code in a power and area optimized processor will be essential to enable innovation and consumer excitement. ARM®’s Cortex®-M processors are well positioned to enable scalable platforms to meet current and future requirements. Their ease of use and simple programmer’s model combined with the binary compatibility across the Cortex-M portfolio allow for scalable and future proof systems.


Low-power ARM IP for headset platforms


Cortex-M4 is ARM’s mainstream Digital Signal Controller and meets the high-performance requirements needed in these battery-powered devices. The highly efficient processing of control code and sensor data is well known in Cortex-M. However, one of the key features of Cortex-M4 is the addition of DSP extensions to the instruction set. This has a number of advantages:


  1. cost savings  - as it enables the integration of a single core instead of two cores
  2. reduced system complexity - by removing the need for shared memory and reducing software development costs


Hence Cortex-M4 is extensively used in audio applications including keyword spotting for voice-activated devices, audio encoding and decoding for phone calls or music playback. It is supported by a rich set of  voice and audio codecs that have been ported to Cortex-M4 including codecs from Adaptive Digital, Alango Technologies, Fraunhofer IIS, Ittiam and Picustech Software.


To make development of wireless systems even easier, the Cortex-M4 is a great combination with ARM’s sub-1V Cordio® radio IP for Bluetooth low-energy applications.


Watch out for my next blog, which will have more information on the signal processing capabilities of Cortex-M4 and Cortex-M7.


See also: Could removing the headphone jack mark the start of the Bluetooth low energy audio accessories market?



Today at Hot Chips in Cupertino, I had the opportunity to present the latest update to our ARMv8-A architecture, known as the Scalable Vector Extension or SVE. Before going into the technical details, key points about ARMv8-A SVE are:


  • ARM is significantly extending the vector processing capabilities associated with AArch64 (64-bit) execution in the ARM architecture, now and into the future, enabling implementation choices for vector lengths that scale from 128 to 2048 bits.

  • High Performance Scientific Compute provides an excellent focus for the introduction of this technology and its associated ecosystem development.

  • SVE features will enable advanced vectorizing compilers to extract more fine-grain parallelism from existing code and so reduce software deployment effort.


I’ll first provide some historical context. ARMv7 Advanced SIMD (aka the ARM NEON instructions) is ~12 years old, a technology originally intended to accelerate media processing tasks on the main processor. It operated on well-conditioned data in memory with fixed-point and single-precision floating-point elements in sixteen 128-bit vector registers.  With the move to AArch64, NEON gained full IEEE double-precision float, 64-bit integer operations, and grew the register file to thirty-two 128-bit vector registers. These evolutionary changes made NEON a better compiler target for general-purpose compute.  SVE is a complementary extension that does not replace NEON, and was developed specifically for vectorization of HPC scientific workloads.


Immense amounts of data are being collected today in areas such as meteorology, geology, astronomy, quantum physics, fluid dynamics, and pharmaceutical research. Exascale computing (the execution of a billion billion floating point operations, or exaFLOPs, per second) is the target that many HPC systems aspire to over the next 5-10 years. In addition, advances in data analytics and areas such as computer vision and machine learning are already increasing the demand for parallel program execution today and into the future.


Over the years, considerable research has gone into determining how best to extract more data level parallelism from general-purpose programming languages such as C, C++ and Fortran. This has resulted in the inclusion of vectorization features such as gather load & scatter store, per-lane predication, and of course longer vectors.


A key choice to make is the most appropriate vector length, where many factors may influence the decision:


  • Current implementation technology and associated power, performance and area tradeoffs.

  • The specific application program characteristics.

  • The market, which is HPC today; in common with general trends in computer architecture evolution, a growing need for longer vectors is expected in other markets in the future.


Rather than specifying a specific vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register.  SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length.  Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future.  This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.
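
The vector-length agnostic idea can be illustrated with a small software model. The sketch below is plain Python, not SVE code or intrinsics, and the helper names are made up for illustration. The loop never hard-codes the vector length: it asks how many 64-bit lanes are available, builds a per-lane predicate for each iteration, and lets inactive lanes do nothing, so no separate scalar tail loop is needed.

```python
def lanes_active(i, n, lanes):
    """Per-lane predicate: lane k is active while i + k < n."""
    return [i + k < n for k in range(lanes)]


def vla_add(a, b, vector_bits):
    """Element-wise add of 64-bit values, written without assuming a vector length."""
    lanes = vector_bits // 64           # lanes per vector for 64-bit elements
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        predicate = lanes_active(i, n, lanes)
        for k, active in enumerate(predicate):
            if active:                  # inactive lanes are skipped: no tail loop
                out[i + k] = a[i + k] + b[i + k]
        i += lanes                      # advance by whatever the hardware provides
    return out


a = list(range(7))
b = [10] * 7
for bits in (128, 256, 2048):
    print(bits, vla_add(a, b, bits))
```

The same function gives the same result whether the modelled vector is 128, 256 or 2048 bits wide; only the number of loop iterations changes. That is the property that lets one binary run unmodified on implementations with different vector lengths.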


Scientific workloads, mentioned earlier, have traditionally been carefully written to exploit as much data-level parallelism as possible with careful use of OpenMP pragmas and other source code annotations.  It’s therefore relatively straightforward for a compiler to vectorize such code and make good use of a wider vector unit. Supercomputers are also built with the wide, high-bandwidth memory systems necessary to feed a longer vector unit.


However, while HPC is a natural fit for SVE’s longer vectors, SVE also offers an opportunity to improve vectorizing compilers in ways that will be of general benefit over the longer term, as other systems scale to support increased data-level parallelism.


It is worth noting at this point that Amdahl’s law tells us the theoretical limit of a task’s speedup is governed by the amount of unparallelizable code. If you succeed in vectorizing 10% of your execution and make that code run 4 times faster (e.g. a 256-bit vector allows 4x64b parallel operations), then you've reduced 1000 cycles down to 925 cycles, providing a limited speedup for the power and area cost of the extra gates. Even if you could vectorize 50% of your execution infinitely (unlikely!) you've still only doubled the overall performance. You need to be able to vectorize much more of your program to realize the potential gains from longer vectors.
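
The two figures above follow directly from Amdahl’s law and can be reproduced with a short sketch. The function names are illustrative only.

```python
import math


def cycles_after(total_cycles, vector_fraction, vector_speedup):
    """Cycles remaining once a fraction of execution runs vector_speedup times faster."""
    serial = total_cycles * (1.0 - vector_fraction)
    vectorized = total_cycles * vector_fraction / vector_speedup
    return serial + vectorized


def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup predicted by Amdahl's law."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


print(cycles_after(1000, 0.10, 4))        # 10% of execution vectorized, 4x faster
print(amdahl_speedup(0.10, 4))
print(amdahl_speedup(0.50, math.inf))     # 50% vectorized with unlimited speedup
```

Vectorizing 10% of the execution at 4x leaves about 925 of the original 1000 cycles, an overall speedup of only around 1.08x, and even an infinitely fast vector unit applied to half the program cannot do better than 2x.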


So SVE also introduces novel features that begin to tackle some of the barriers to compiler vectorization. The general philosophy of SVE is to make it easier for a compiler to opportunistically vectorize code where it would not normally be possible or cost effective to do so.


What are the new features and the benefits of SVE compared to NEON?


  • Scalable vector length (VL): Increased parallelism while allowing implementation choice of VL
  • VL agnostic (VLA) programming: Supports a programming paradigm of write-once, run-anywhere scalable vector code
  • Gather-load & Scatter-store: Enables vectorization of complex data structures with non-linear access patterns
  • Per-lane predication: Enables vectorization of complex, nested control code containing side effects and avoidance of loop heads and tails (particularly for VLA)
  • Predicate-driven loop control and management: Reduces vectorization overhead relative to scalar code
  • Vector partitioning and SW managed speculation: Permits vectorization of uncounted loops with data-dependent exits
  • Extended integer and floating-point horizontal reductions: Allows vectorization of more types of reducible loop-carried dependencies
  • Scalarized intra-vector sub-loops: Supports vectorization of loops containing complex loop-carried dependencies


SVE is targeted at the A64 instruction set only, as a performance enhancement associated with 64-bit computing (known as AArch64 execution in the ARM architecture). A64 is a fixed-length instruction set, where all instructions are encoded in 32 bits. Currently 75% of the A64 encoding space is already allocated, making it a precious resource.  SVE occupies just a quarter of the remaining 25%, in other words one sixteenth of the A64 encoding space, as follows:


The variable length aspect of SVE is managed through predication, meaning that it does not require any encoding space. Care was taken with respect to predicated execution to constrain that aspect of the encoding space.  Load and store instructions are assigned half of the allocated SVE instruction space, limited by careful consideration of addressing modes. Nearly a quarter of this space remains unallocated and available for future expansion.


In summary, SVE opens a new chapter for the ARM architecture in terms of the scale and opportunity for increasing levels of vector processing on ARM processor cores. It is early days for SVE tools and software, and it will take time for SVE compilers and the rest of the SVE software ecosystem to mature. HPC is the current focus and catalyst for this compiler work, and creates development momentum in areas such as Linux distributions and optimized libraries for SVE, as well as in ARM and third party tools and software.


We are already engaging with key members of the ARM partnership, and will now broaden that engagement across the open-source community and wider ARM ecosystem to support development of SVE and the HPC market, enabling a path to efficient Exascale computing.


Stay tuned for more information


Following on from the announcement and the details provided, initial engagement with the open-source community will start with the upstreaming and review of tools support and associated standards.  General specification availability is expected in late 2016/early 2017.


Nigel Stephens is Lead ISA Architect and ARM Fellow

Today we have exciting news: ARM and Intel Custom Foundry have announced an agreement to accelerate the development and implementation of ARM SoCs on Intel’s 10nm process. Specifically, we are making ARM’s Artisan® Physical IP available on the process as part of an ongoing collaboration.


I’m excited about our collaboration with Intel Custom Foundry for several reasons including:

  • The benefits to our partners by expanding the ARM ecosystem to offer more manufacturing choices for premium mobile and consumer SoCs.
  • Intel Custom Foundry will give its customers access to world-class physical IP and ARM implementation solutions.
  • All the major foundries now offer Artisan platforms, further confirming it as the industry standard for physical IP.


Today’s announcement represents what we expect to be a long-term, mutually beneficial partnership with Intel Custom Foundry.


One of the strengths and differentiators of the Artisan platform is the availability of ARM core-optimized IP—what we call ARM POP™ technology. The value of POP technology for an ARM core on the Intel 10nm process is tremendous, as it will allow for quicker knowledge transfer, enabling customers to lower their risk in implementing the most advanced ARM cores on Intel’s leading-edge process technology. Additionally, POP technology enables silicon partners to accelerate the implementation and tape-outs of their ARM-based designs. The initial POP IP will be for two future advanced ARM Cortex-A processor cores designed for mobile computing applications in either ARM big.LITTLE™ or stand-alone configurations.


Today at the Intel Developer Forum (IDF), I had the pleasure of joining Intel Senior Fellow, Mark Bohr and Intel Custom Foundry Vice President Zane Ball’s Technical Insights session to announce our collaboration.  We discussed how the partnership will accelerate design enablement for future devices in the premium mobile market including smartphones and tablets. Read more about Zane’s perspective on our collaboration.


Ecosystem enablement

You probably glanced at the headline and thought “ARM and Intel collaborating…what?” Despite press stories, Intel and ARM have worked together for years to help enable the ecosystem, and this is just the latest milestone in that long-standing relationship. I see it as a natural evolution of the design ecosystem: ARM is a leader in processor and physical design, and  Intel Custom Foundry is a leading integrated device manufacturer. This combination is a win-win for customers.  It reinforces an ARM tenet throughout our 25-year history: To continuously enable choice and innovation inside the ARM ecosystem.


This agreement provides access to another key manufacturing source and expands the EDA and IP ecosystem to ensure interoperability and a shorter on-ramp for early leading-edge process technology.


I’ve enjoyed broad experience in this industry, working in semiconductors, EDA and now IP. I love the relentless competition but I also am wowed by moments of cooperation that redefine the industry landscape. This agreement is one example of that and will deliver immense value to the design ecosystem and ultimately to our partners. ARM is committed to Intel’s success as a world-class custom foundry at 10nm. We stand behind our mutual customers when they make that choice.


Let me know your thoughts in the comments section below!



Power management is important and has become increasingly complex. We recently created an application note on the topic; details are below. Hopefully, you will find it useful.


  • Provides high-level considerations for power management of a big.LITTLE system and helps you avoid some potential issues in your big.LITTLE design.

Intended Audience

  • It is written for hardware System on Chip (SoC) designers implementing power-down and power-up sequences for ARM processors.
  • It assumes that you have SoC design experience and are familiar with ARM products.


This application note focuses on the following processors and highlights important issues when powering up or powering down processor cores and clusters on an SoC.

  • Cortex®-A7.
  • Cortex®-A15.
  • Cortex®-A17.
  • Cortex®-A53.
  • Cortex®-A57.
  • Cortex®-A72.
  • Cortex®-A73.


This application note is organized into the following chapters:

  • Chapter 1 Introduction

Read this chapter for information about the purpose of the application note.

  • Chapter 2 Power-down and power-up considerations

Read this chapter for high-level considerations for power-down and power-up.

  • Chapter 3 Potential SoC integration issues

Read this chapter for potential issues when implementing power management for processor cores or clusters on a typical SoC.

  • Chapter 4 Hardware considerations

Read this chapter for general advice from the hardware perspective when implementing power-down and power-up sequences for big.LITTLE systems.

Your feedback

If you have any feedback about this document, please feel free to contact me. My email address is

See the attachment for details about the application note. Thanks.

SemiWiki recently published a book on FPGA-based prototyping titled “PROTOTYPICAL: The Emergence of FPGA-Based Prototyping for SoC Design.” Among other things, the book explores ARM’s role in FPGA prototyping technology. Below is an excerpt from the book. If you want to read the entire book, you can download it from the S2C web site at


“Developing for ARM Architecture

Since ARM introduced its Cortex strategy, with A cores for application processors, R cores for real-time processors, and M cores for microcontrollers, designers have been able to choose price/performance points – and migrate software between them. How do designers, who are often doing co-validation of SoC designs with production software, prototype with these cores?


Some teams elect to use ARM’s hard macro IP offering, with optimized implementations of cores. ARM has a mixed prototyping solution with their CoreTile Express and LogicTile Express products. CoreTile Express versions are available for the Cortex-A5, Cortex-A7, Cortex-A9, and Cortex-A15 MPCore processors, based on a dedicated chip with the hardened core and test features. The LogicTile Express comes in versions with a single Xilinx Virtex-5, dual Virtex-6, or single Virtex-7 FPGA, allowing loose coupling of peripheral IP.


Others try to attack the challenge entirely in software. Cycle-accurate and instruction-accurate models of ARM IP exist, which can be run in a simulator testbench along with other IP. With growing designs come growing simulation complexity, and with complexity comes drastic increases in execution time or required compute resources. Simulation supports test vectors well, but is not very good at supporting production software testing – a large operating system can take practically forever to boot in a simulated environment.

Full-scale hardware emulation has the advantage of accommodating very large designs, but at substantial cost. ARM has increased its large design prototyping efforts with the Juno SoC for ARMv8-A, betting on enabling designers with a production software-ready environment with a relatively inexpensive development board.


However, as we have seen SoC design is rarely about just the processor core; other IP must be integrated and verified. Without a complete pass at the full chip design with the actual software, too much is left to chance in committing to silicon. While useful, these other platforms do not provide a cost-effective end-to-end solution for development and debug with distributed teams. Exploration capability in a prototyping environment is also extremely valuable, changing out design elements in a search for better performance, power consumption, third-party IP evaluation, or other tradeoffs.


The traditional knock on FPGA-based prototyping has been a lack of capacity and the hazards of partitioning, which introduces uncertainty and potential faults. With bigger FPGAs and synthesizable RTL versions of ARM core IP, many of the ARM core offerings now fit in a single FPGA without partitioning. Larger members of the ARM Cortex-A core family have been successfully partitioned across several large FPGAs without extensive effort and adverse timing effects, running at speeds significantly higher than simulation but without the cost of full-scale hardware emulation.


A hybrid solution has emerged in programmable SoCs, typified by the Xilinx Zynq family. The Zynq UltraScale+ MPSoC has a quad-core ARM Cortex-A53 with a dual-core ARM Cortex-R5 and an ARM Mali-400MP GPU, plus a large complement of programmable logic and a full suite of I/O. If that is a similar configuration to the payload of the SoC under design, it may be extremely useful to jumpstart efforts and add peripheral IP as needed. If not, mimicking the target SoC design may be difficult.


True FPGA-based prototyping platforms offer a combination of flexibility, allowing any ARM core plus peripheral IP payload, and debug capability. Advanced FPGA synthesis tools provide platform-aware partitioning, automating much of the process, and are able to deal with RTL and packaged IP such as encrypted blocks. Debug features such as deep trace and multi-FPGA visibility and correlation speed the process of finding issues.


The latest FPGA-based prototyping technology adds co-simulation, using a chip-level interconnect such as AXI to download and control joint operations between a host-based simulator and the hardware-based logic execution. This considerably increases the speed of a traditional simulation and allows use of a variety of host-based verification tools. Using co-simulation allows faster turnaround and more extensive exploration of designs, with greater certainty in the implementation running in hardware.


Integration rollup is also an advantage of scalable FPGA-based prototyping systems. Smaller units can reside on the desk of a software engineer or IP block designer, allowing dedicated and thorough investigation. Larger units can support integration of multiple blocks or the entire SoC design. With the same synthesis, debug, and visualization tools, artifacts are reused from the lower level designs, speeding testing of the integrated solution and shortening the time-to-success.


Another consideration in ARM design is not all cores are stock. In many cases, hardware IP is designed using an architectural license, customized to fit specific needs. In these cases, FPGA-based prototyping is ideal to quickly experiment and modify designs, which may undergo many iterations. Turnaround time becomes very important and is a large productivity advantage for FPGA-based prototyping.”

The ISC16 event took place last week in Frankfurt, Germany. ISC stands for International Supercomputing Conference, and while ARM is known for its energy-efficient mobile CPU cores, we are beginning to make some waves in the arena of the world’s largest computers.


ARMv8-A brings out the real strength of Fujitsu’s microarchitecture

To kick off the week, our partner Fujitsu unveiled their plan for the next generation “Post-K” supercomputer to be based on ARMv8-A technology. It turns out ARM Research has been working hard for several years on a number of technical advantages that will give ARM partners an edge in the HPC market, and Fujitsu has taken note. At ISC16, both Fujitsu and RIKEN, the user of Japan’s fastest and current “K” supercomputer, presented their plans to collaborate on the ARM-based Post-K supercomputer. The significance of this announcement can’t be overstated, as this strategic project is seen as Japan’s stepping stone to the Exascale tier of supercomputing. Exascale requires roughly 10x the computing power of today’s fastest computers, yet must function within similar power envelopes. It is a lofty goal.


More will be divulged by ARM on its HPC technology at HotChips this August, but in the meantime, here are a few links to recent articles covering the Fujitsu announcement and others relating to ARM in HPC. The Next Platform article does a particularly good job of highlighting why Fujitsu and RIKEN see value in the ARMv8-A architecture and ARM server ecosystem:



Designed for HPC Applications

The last two articles linked above are interesting in that they seem to imply that ARM is delving into HPC based on “mobile chips”. This certainly isn’t the case. ARM and its partners are taking advantage of the architectural flexibility the ARM business model provides them. Fujitsu and others are designing CPUs from the ground up with HPC codes and end-user supercomputer applications fully in mind, while still benefiting from the energy efficiency of the ARMv8-A architecture. As noted in the slide shown above, Fujitsu’s own “Post-K” microarchitecture and their collaboration with RIKEN and ARM is a great example of this. We expect more to come from other ARM partners in the future, so stay tuned.

SAN FRANCISCO--In the decades since the open source software movement emerged, it’s always seemed to pick up momentum, never abating.

This year is no exception as we roll into Red Hat Summit, June 27-30, in San Francisco.

ARM and its ecosystem partners will be at the Moscone Center outlining how server, networking and storage applications are deploying today and how optimized ARM technology platforms provide scalable, power-efficient processing.

For a sneak peek at one of the interesting trends in open source, check out Jeff Underhill’s post on Ceph’s embrace of ARM (Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!).

Then join us in booth #129 with our partners Cavium, Linaro, Penguin, SoftIron, AppliedMicro and Western Digital to get the latest insights on open source software and hardware design.

Don’t miss the Thursday panel (3:30-4:30 p.m.) “Building an ARM ecosystem for the enterprise: Through the thorns to the stars,” moderated by Red Hat’s Jon Masters and featuring Underhill, Yan Fisher (Red Hat), Mark Orvek (Linaro), and Larry Wikelius (Cavium).


Related stories:

Ceph extends tentacles to embrace ARM in Jewel release - Visit us at Red Hat Summit (booth #129) to find out more!

The amount of data that consumers are producing is increasing at a phenomenal rate, and shows no signs of slowing down anytime soon. Cisco estimated last year that global mobile data traffic would multiply tenfold between 2014 and 2019, up to a rate of 24.3 exabytes per month. In order to support this continuing evolution, the infrastructure backbone of the cloud needs to stay ahead of the curve.



Cloud and server infrastructure needs to stay ahead of predicted usage trends


This requires large volume deployments of servers in the “cloud provider” space. Large Internet companies are building out datacenters at an unprecedented scale to manage all of this data. There is an insatiable appetite for more compute, with the caveat that it needs to be delivered at the highest compute density within given server constraints to minimize Total Cost of Ownership (TCO). Datacenters are replacing servers on a shorter cycle as well as evaluating new installations more often because workflow demands are constantly changing. There is a huge opportunity for server SoC vendors to innovate, with some aspects being critical to successfully building a server SoC:


  • Time-to-market
  • Performance/Watt
  • Higher levels of integration in costly advanced technology nodes




The explosion in computing applications brings opportunity for tailored server SoCs


To help more ecosystem partners enter the market, ARM has designed processors and System IP blocks (e.g. CoreLink™ Cache Coherent Network, Memory Controllers) that can meet the performance and reliability expectations for the mainstream server market. This helps our partners to develop SoCs for their target applications, which in turn enables OEMs and Datacenter providers to get the right performance within budget. ARM has now taken this a step further in enabling our silicon partners to deliver a mainstream server SoC by developing and delivering a server subsystem.



What is a Server Subsystem?


A server subsystem is a collection of ARM IP (processors and System IP) that has been architected and integrated together along with all the glue logic necessary to lay down the foundation for a server class SoC. The subsystem configuration specifically targets mainstream requirements, which cover roughly 80% of the server market. The subsystem allows a partner to quickly go from Register Transfer Level (RTL), a high-level hardware description used for defining digital circuits, to silicon. Even with ARM delivering the soft IP blocks to our partners, it can still take multiple months to architect the SoC and integrate all the IP correctly while meeting power and performance targets. In addition, the design then needs to be fully verified and validated prior to taping out. The ARM subsystem helps “short circuit” this process by delivering verified and validated top level RTL that has already integrated the various ARM IP blocks together in a mainstream server configuration. (Find out more about ARM’s system validation process in the whitepaper System Validation at ARM: Enabling our Partners to Build Better Systems). This can save our partners up to a year of effort. The silicon partner can take our subsystem and then add in the necessary Input/Output logic (e.g. PCIe, SATA, and USB) along with any of their own differentiating IP to complete the SoC design. Providing partners with the subsystem significantly reduces the effort of integrating, verifying and validating the IP together for this configuration, which reduces overall development time and allows silicon partners to focus resources on differentiation.


Accelerating the Path to Server SoCs


So, how does this server subsystem help our partners build a competitive server SoC with faster time to market? ARM has architected the server subsystem to provide enough CPU compute to allow partners to efficiently manage the majority of server workloads. The subsystem consists of up to 48 cores (12 Cortex®-A72 processor clusters, each with four CPU cores) attached to the CoreLink CCN-512 Cache Coherent Network along with four server class memory controllers (CoreLink DMC-520). Other ARM System IP has been integrated to perform specialized tasks within the subsystem for the kind of use cases expected:

  • CoreLink NIC-450 Network Interconnect provides a low power, low latency interconnect to the rest of the SoC for peripheral inputs such as PCIe.
  • CoreLink GIC-500 Generic Interrupt Controller performs the critical tasks of interrupt management, prioritization and routing, supporting virtualization and boosting processor efficiency.

The real value of the subsystem lies in the fact that all of the IP has been pre-integrated and pre-validated with ARM engineering “know-how” of our IP to ensure predictable performance with much less engineering resource or time required. By taking a holistic view of system performance, the integration teams were able to make the whole subsystem greater than the sum of its parts.


The picture below shows a high level view of the subsystem.



So what about Power, Performance, and Area?


In addition to the above, ARM provides a system analysis report along with the pre-configured and optimized RTL. The system analysis report gives the silicon partner data we collected on the performance, power, and area of the subsystem. It includes industry standard benchmark emulation results such as SPEC CPU2006, STREAM, and LMbench. Based on early analysis, we expect this subsystem to scale to the performance levels needed to win mainstream server deployments in large datacenters.

These benchmarks are key data points that an end customer uses to decide which hardware platform, based on an SoC leveraging the subsystem, they will buy and deploy in their datacenter. It is critical that our silicon partners have a good understanding of performance expectations well before they have actual silicon they can test. The investment to develop server SoCs is high, and reducing the likelihood of additional spins is key to time-to-market. In addition to the performance results, ARM also analyzes the power draw of the subsystem and includes this in the report. ARM’s physical design team also does preliminary floor planning and some timing constraint analysis for the target process technology. In effect, this helps our partners understand die size and cost implications, which ultimately helps ensure their design will meet customers’ expectations.



Reference Software and Documentation


In addition to giving our partners a head start on hardware design and understanding PPA (Performance, Power, and Area) targets, the subsystem also comes with a reference software package. The subsystem has been built to adhere to industry server standards (e.g. UEFI, ACPI). The reference software includes ARM Trusted Firmware and UEFI source code ported to the subsystem, ACPI tables populated for the subsystem, and any patches needed to run the latest Linux kernel, along with release documentation and a guide on how to use the software. In addition, a Fixed Virtual Platform (FVP) of the subsystem is available. The FVP is a fast, accurate platform built on fast models that helps our partners’ software development activities. The software developed for the subsystem is ready to run on the FVP. Delivering this reference software stack along with the optimized RTL allows silicon partners to more rapidly develop the necessary software to allow booting an OS as soon as silicon arrives. On the hardware side, the subsystem also includes a technical reference manual that describes the various pieces of the subsystem in detail, an implementation guide, and an integration manual. All of this documentation is delivered along with the RTL to help our partners quickly understand the subsystem. This is critical in enabling SoC designers to get up to speed fast and devote as much time and resource as possible to differentiating their design through proprietary IP, customized software, or a mixture of both.


ARM’s Server Subsystem Provides a Fast Start to SoC Development


As I mentioned previously, the ever increasing data processing requirements that are occurring due to the continued electronics revolution have big implications for datacenters. It means that mainstream server SoCs are becoming increasingly complex every year. In addition, companies are replacing their server platforms at an unprecedented rate. This requires our silicon partners to deliver more capable SoCs faster. ARM has been enabling our partners with ARM processors and System IP that can be leveraged to deliver server SoCs. The ARM subsystem now takes this enabling activity a step further by giving our partners a fast start to their SoC development. By providing a pre-integrated, pre-verified foundation, it reduces the entry barriers for the ARM ecosystem to enter the changing server market and develop optimized SoCs for their target applications. For more information please contact me directly here via the comments below or private message and I’ll make sure to get back to you.

Functional safety for Silicon IP used to be a niche activity, limited to an elite circle of chip and system developers in automotive, industrial, aerospace and similar markets. However, over the last few years that’s changed significantly. There’s now a more tangible vision of self-driving cars, with increasingly adventurous Advanced Driver Assistance Systems (ADAS) capturing people’s interest along with media-rich in-vehicle infotainment. Moreover, the emergence of drones in all shapes and sizes and the growing ubiquity of the industrial Internet of Things are also spreading the requirement for functional safety, all of which is relevant to ARM®.


Much like any technology market surrounded by ‘buzz’, these burgeoning applications require semiconductors to make them happen, and the fast pace of product innovation has attracted huge interest from ARM’s partners. In the IP community ARM leads the way with a broad portfolio of IP from ARM Cortex®-M0+ to the mighty Cortex-A72 and beyond. With a heritage in secure compute platforms and functional safety, ARM is well placed to enable the success of its silicon partners.



What’s functional safety all about?


In a nutshell, functional safety is what the name says: it’s about ensuring that products operate safely and continue to do so even when they go wrong. ISO 26262, the standard for automotive electronics, defines functional safety as:


“the absence of unreasonable risk due to hazards caused by malfunctioning behaviour of electrical/electronic systems”.



Standards for other markets, such as IEC 61508 for electrical and electronic systems and DO-254 for airborne electronic hardware, have their own definitions, although more importantly they also set their own expectations for engineering developments. Hence it’s important to identify the target markets before starting development and ensure suitable processes are followed – attempts to ‘retrofit’ development processes can be costly and ineffective, so are best avoided. Figure 1 illustrates a variety of standards applicable to Silicon IP.


Figure 1. Standards for functional safety of silicon IP



In practice, functionally safe means a system that is demonstrably safe to a skilled third-party assessor, behaving predictably in the event of a fault. It must fail safe, which could be with full functionality or with graceful degradation such as reduced functionality or a clean shutdown followed by a reset and restart. It's important to realize that not all faults will lead to hazardous events immediately. For example, a fault in a car's power steering might lead to incorrect sudden steering action. However, since the electronic and mechanical designs will have natural timing delays, faults can often be tolerated for a specific amount of time. In ISO 26262 this time is known as the fault tolerant time interval, and it depends on the potential hazardous event and the system design.



What’s at fault?


Failures can be systematic, such as due to human error in specifications and design, or due to the tools used. One way to reduce these errors is to have rigorous quality processes that include a range of plans, reviews and measured assessments. Being able to manage and track requirements is also important as is good planning and qualification of the tools to be used. ARM provides ARM Compiler 5 certified by TÜV SÜD to enable safety-related development without further compiler qualification.


Another class of failure is random hardware faults. These could be permanent faults such as a short or broken via, as illustrated by Figure 2, or soft errors caused by exposure to natural radiation. Such faults can be detected by counter measures designed into the hardware and software; system-level approaches are also important. For example, Logic Built-In-Self-Test can be applied at startup or shutdown in order to distinguish between soft and permanent faults. Error logging and reporting is also an essential part of any functionally safe system, although it’s important to remember that faults can occur in the safety infrastructure too.




Figure 2. Classes of fault



Selection of counter measures is the part of the process I enjoy the most. It relates strongly to my background as a platform and system architect, and often starts with a concept-level Failure Modes and Effects Analysis (FMEA). Available counter measures include diverse checkers, selective hardware and software redundancy, full lock-step replication (available for Cortex-R5), and the ‘old chestnut’ of error correcting codes, which we use to protect the memories of many ARM products.
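As a rough illustration of how an error correcting code can detect and repair a single flipped bit, here is a textbook Hamming(7,4) code in Python. This is an assumption chosen for brevity, not the scheme used in any ARM product (memory ECC typically protects much wider words), and the function names `encode` and `decode` are illustrative.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the corrected position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the single faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1                      # model a soft error in one stored bit
recovered, position = decode(stored)
print(recovered == word, position)
```

The syndrome doubles as a diagnostic: logging it is one simple way to feed the error reporting that a functionally safe system needs.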



Get the measure of functional safety


Faults that build up over time without effect are called latent faults, and ISO 26262 proposes that a system designated ASIL D, its highest Automotive Safety Integrity Level, should be able to detect at least 90% of all latent faults. As identified by Table 1, it also proposes a target of 99% diagnostic coverage of all single point failures and a probabilistic metric for random hardware failures of ≤10⁻⁸ per hour.



Table 1. ISO 26262 proposed metrics



These metrics are often seen as a normative requirement, although in practice they are a proposal, and developers can justify their own target metrics because the objective is to enable safe products, not add bullet points to a product datasheet.
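The ASIL D figures quoted above (99% single point coverage, 90% latent fault coverage, and at most 10⁻⁸ random hardware failures per hour) can be captured in a small check. This is only a sketch: the function names and the sample fault counts are hypothetical, and a real safety analysis involves far more than comparing three numbers.

```python
# Proposed ASIL D targets as quoted in the text above.
ASIL_D_TARGETS = {
    "single_point_coverage": 0.99,   # at least 99% diagnostic coverage
    "latent_fault_coverage": 0.90,   # at least 90% of latent faults detected
    "failure_rate_per_hour": 1e-8,   # at most 10^-8 per hour
}


def coverage(detected, total):
    """Fraction of faults detected, e.g. from a fault-injection tally."""
    if total <= 0:
        raise ValueError("total must be positive")
    return detected / total


def check_against_targets(single_point, latent, failure_rate,
                          targets=ASIL_D_TARGETS):
    """Report, per metric, whether a measured value meets the proposed target."""
    return {
        "single_point_coverage": single_point >= targets["single_point_coverage"],
        "latent_fault_coverage": latent >= targets["latent_fault_coverage"],
        "failure_rate_per_hour": failure_rate <= targets["failure_rate_per_hour"],
    }


# Hypothetical numbers, made up purely for illustration.
result = check_against_targets(
    single_point=coverage(9_950, 10_000),
    latent=coverage(880, 1_000),
    failure_rate=5e-9,
)
for name, ok in result.items():
    print(f"{name}: {'meets' if ok else 'misses'} proposed target")
```

Because the targets are passed in as a parameter, a developer who justifies different target metrics can reuse the same check with their own values.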


A question I often ask myself in respect of semi-autonomous driving is whether it’s safer to meet the standard’s proposed metrics for ASIL D with 10,000 DMIPS of processing or have 100,000 DMIPS with reduced diagnostic coverage and enable ‘smarter’ algorithms with better judgement? The answer is application specific, although in many cases a more capable performant system could save more lives than a more resilient system with basic functionality, so long as its failure modes are not wildly non-deterministic.


Irrespective of the diagnostic coverage achieved, it’s essential to follow suitable processes when targeting functionally safe applications – and this is where the standards really help. Even if you’re not targeting safety, more rigorous processes can improve overall quality.



Get it delivered


When developing for functional safety, an essential part of the product is the supporting documentation which needs to include a safety manual to outline the product’s safety case, covering aspects such as the assumptions of use, explanation of its fault detection and control capabilities and the development process followed.


Safety cases are hierarchical in use: the case for an IP is needed by chip developers to form part of their safety case, which then enables their customer, and so forth. Most licensable silicon IP will be developed as a Safety Element out of Context (SEooC), where its designers will have little or no idea how it will subsequently be utilised. Hence the safety manual must also capture insight from the IP developers about their expectations in order to avoid inappropriate use.


At ARM we support users of targeted IP with safety documentation packages, which always include a safety manual.


So in summary when planning for functional safety think PDS:

  • Process
  • Development
  • Safety documentation package
