
ARM Processors


How has the ARM architecture fostered differentiation through diversity?


Since 2014, there has been an ever-increasing number of devices shipping with ARMv8-A based Cortex processors – ranging from $65 smartphones to premium flagship devices. This wide range is evidence of the ways in which the transition to 64-bit continues the advance in system design and process technology in the mobile space, enabling a fresh wave of innovation on the ARM architecture. I thought now would be a good time to explore the degrees of freedom ARM partners have in building SoCs based on the ARM CPU architecture.


When designing a CPU, the ARM licensing model offers two levels of differentiation: a proprietary (custom) microarchitecture, or an ARM Cortex processor with system design and implementation choices. Both are fully compatible with the ARM architecture.



  • Proprietary microarchitecture

This allows our partners to license one of the architectures (e.g. ARMv8-A or ARMv7) and have their own implementation of the ARM ISA. The ISA remains unaltered in these cases but partners can choose their own approach to design a CPU from the ground up that complies with the ARM architecture specification.


ARM partners do this to target unique design points or features to address specific segments of the market, albeit at higher development cost. It is important to remember that independently developed, proprietary microarchitecture CPUs based on the ARM architecture have to pass an ARM-mandated compliance suite to ensure that they are 100% compatible with the ARM architecture. This ensures the ecosystem value of the ARM partnership is preserved and enhanced - code written for custom ARM architecture CPUs will run on other ARM CPUs.


  • ARM Cortex Processor

Partners license ARM designed implementations of the ARM Architecture, such as the ARM Cortex-A processors. At ARM, we are focused on sustaining and growing the largest ecosystem on the planet for efficient computing. Software developed for one ARM-based SoC will run on any other ARM-based SoC that uses the same or newer version of the ARM architecture.


When licensing any combination of Cortex Processors, partners configure the cores to suit their applications without modifying the microarchitecture. This retains the strong foundation of software compatibility. We take great care to ensure that no special modifications are made that could break this compatibility – it is extremely important that all ARM SoCs in a given profile (Cortex-A, Cortex-R, Cortex-M) are software compatible so that the ecosystem is as broad and deep as possible.


Innovation and differentiation within the ARM ecosystem


Even with a “standard” Cortex CPU, there are many ways that partners can in fact differentiate.


  • CPU configuration:

Partners who license ARM CPUs can choose the cache size (L1 and L2), bus interface (e.g. AMBA4 or AMBA5), number of cores in a cluster (1 to 4), and how many CPU clusters to use in the design (2 clusters in a big.LITTLE design, for example). We have seen partners build 2+4 big.LITTLE configurations with 2 high performance cores and 4 max efficiency cores for midrange and premium smartphone markets, and 4+4 topologies for higher end smartphone and tablet markets. Similarly, we have seen partners build 2 clusters of 4 LITTLE cores to deliver Octacore capabilities at low to mid-range price points.


L2 size is an important factor in performance on many benchmarks, so high-end designs often push L2 sizes to 2MB for the high performance CPU cluster; low-end and mid-range designs can sometimes play this trade-off differently, with a 1MB L2 for the high performance cluster, or 512kB L2 cache size for a high efficiency CPU cluster in a big.LITTLE SoC, trading off performance for cost savings. This range of configurability allows ARM partners to tailor the CPU capacity in their SoC to their target markets, while retaining full compatibility with the ARM architecture and full access to the benefits of the ARM ecosystem.
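The cluster choices above can be sketched as a small data model. The class and field names here are illustrative only, not part of any actual ARM configuration tooling:

```python
# Hypothetical sketch of the cluster-configuration space described above.
# Names and value ranges are illustrative, not ARM's actual config options.
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusterConfig:
    cores: int    # 1 to 4 cores per cluster
    l2_kib: int   # e.g. 512, 1024, or 2048 KiB
    bus: str      # e.g. "AMBA4" or "AMBA5"

    def __post_init__(self):
        assert 1 <= self.cores <= 4, "1 to 4 cores per cluster"
        assert self.l2_kib in (512, 1024, 2048), "typical L2 sizes"

# A 2+4 big.LITTLE layout: 2 high-performance cores with a 2 MB L2,
# and 4 high-efficiency cores with a 512 KB L2.
big_little_2_4 = [
    ClusterConfig(cores=2, l2_kib=2048, bus="AMBA5"),
    ClusterConfig(cores=4, l2_kib=512,  bus="AMBA5"),
]

total_cores = sum(c.cores for c in big_little_2_4)  # 6 cores in total
```

Swapping the cluster list for two 4-core LITTLE clusters would model the octacore low-cost topology mentioned above.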


  • Power domains:

Cortex-A CPU IP comes with optional power domains around each CPU core, the L2 subsystem, and other areas of the design. Partners can choose how to implement these voltage domains, and can choose to share or group some domains. Further to this, ARM introduced state retention modes for CPU cores and for the Advanced SIMD units in some of our more recent CPUs that partners can optionally use to offer finer grained power management in the SoC.

  • Peripherals:

There are of course numerous peripherals and interfaces beyond the CPU, GPU, and other processing subsystems that can differentiate an SoC. By taking standard Cortex-A CPUs, some partners choose to devote more of their engineering resources to optimizing and tuning specific peripherals and interfaces to differentiate their SoCs.


  • Memory system performance:

Although every Cortex-A CPU is equivalent to every other Cortex-A CPU of the same revision in terms of performance within the CPU, often CPU performance depends quite heavily on memory system performance, and we can observe two Cortex-A CPUs of the same type delivering significantly different performance as a result of this. As one example, the latency to L2 memory depends on the number of slices a partner uses to meet timing for their target frequency; a partner with lower latency to the L2 will have an advantage in performance benchmarks that spill outside the L1 instruction or data caches. As another example, the latency to main memory can differ a lot from one SoC to another - if one SoC has a memory latency of 100 cycles and the other 140 cycles, the 100 cycle latency memory system will be a big advantage in many (but not all) of the key benchmarks, and is often an observable advantage in terms of delivered performance on real-world workloads.


Often partners seek to differentiate on memory system performance, recognizing the large impact this has on overall performance even against other SoCs with the same Cortex-A CPU. One last point on the topic of memory system performance; CPU performance is very sensitive to latency to main memory, and GPU performance is more sensitive to bandwidth to main memory, so ARM partners will optimize and balance between latency and bandwidth in the design of the memory system for their target applications.
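As a rough illustration of why main-memory latency matters so much, a back-of-the-envelope average-memory-access-time (AMAT) model – with invented hit rates and latencies, not measured data – shows the effect:

```python
# Toy AMAT model: average cycles per access given L1/L2 hit rates and
# per-level latencies. All numbers are illustrative, not measurements.
def amat(l1_hit, l2_hit, l1_cyc=4, l2_cyc=21, mem_cyc=100):
    """L1 hit, else L2 hit, else go to main memory."""
    return (l1_hit * l1_cyc
            + (1 - l1_hit) * l2_hit * l2_cyc
            + (1 - l1_hit) * (1 - l2_hit) * mem_cyc)

fast = amat(0.95, 0.80, mem_cyc=100)  # SoC with 100-cycle DRAM latency
slow = amat(0.95, 0.80, mem_cyc=140)  # same CPU, 140-cycle DRAM latency
# Even though only 1% of accesses reach DRAM here, the slower memory
# system adds measurably to the average cost of every access.
```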


  • SoC level power management:

The way in which a given SoC manages power incorporates several different mechanisms to slow down or shut down components when under light or zero demand during different phases of use. With so many different components in an SoC design, ARM partners have a lot of ways in which they can manage power, and some partners differentiate on the power management mechanisms in the SoC, the big.LITTLE tuning and power management framework, or the software that organizes the management of component shutdown and presents it to the OS or middleware.


There are several system and implementation choices which further offer ways for ARM partners to differentiate when using Cortex-A standard CPUs:

  • Process node: ARM IP is shipped as synthesizable RTL that can be implemented on several different process nodes. Today (early 2015), partners at the highest premium end of the market are building with ARM IP on 16nm and 14nm, while many premium designs are being built and currently shipping on 20nm, with a range of designs targeting 28nm for lower-cost premium SoC platforms for the mid-range and entry level. The frequency and power characteristics can vary significantly for the same ARM CPU implemented on different process nodes, so the choice of process remains one of the main (and most obvious) ways that partners differentiate on ARM IP.


  • Physical implementation: The time and effort spent on physical placement, routing, and optimization of the logic and RAM arrays in a design can significantly differentiate one Cortex-A CPU from another. For example, investment in physical design can produce higher maximum frequency for the same design, lower power at the same maximum frequency, or some combination of the two. Also, partners sometimes iterate on the physical design of a CPU, such that the 2nd or 3rd generation of a product can be significantly improved in power, performance, and area (cost) characteristics due to improvements in the physical implementation of the same Cortex-A CPU, providing further differentiation for the partner. ARM POP IP has been a factor in improving the quality of results that can be achieved in physical design by partners, and also improves the next differentiation factor in this list… time to market.


  • Time to market: Release windows are critical in markets like high-end smartphones and premium tablets, where a delay of one month can mean missing a whole year design cycle for devices with an annual refresh. Some partners differentiate on being very fast to market based on designs with Cortex-A CPUs. Often in those fast markets, the initial SoC product will be followed with a revised version that improves on the original.


  • GPU, ISP, video and audio subsystems: In a modern mobile SoC, the performance of the chip is often influenced even more strongly by the performance of the graphics processor, the image processing, the video and audio subsystem, and of course the way these components all work together. ARM provides industry leading IP in the Mali GPU and video subsystem, but we allow our partners to mix and match between our IP, their own IP, and that of 3rd parties. This allows the ARM partnership to experiment with different combinations of IP, iterate rapidly, and compete for the best combination in each device generation. This competitive iteration has led to rapid innovation in smartphones and tablets and is a key benefit of the ARM ecosystem, a benefit that is now well established in networking markets, and making inroads into server markets, for example.


  • System design: The way in which the CPU, GPU, ISP, video subsystem, coherent interconnect, and memory system work together as a combined system is an increasingly important factor in modern SoCs, and a key way for partners to differentiate their chips. Examples of differentiation in the system design include the use and configuration of cache coherent interconnect, next level cache memories, dynamic memory controllers, and the software that configures the system and optimizes things like power down modes and operating points at run-time.


  • Software: Beyond the hardware IP and custom components in an SoC, there is of course the software that configures and operates the SoC. The key attribute we have been discussing is the compatibility of all ARM-based designs, so that the Linux kernel, application software, and middleware all run the same on ARM-based CPUs. ARM partners can differentiate along all of the dimensions listed above, and still maintain the full software compatibility that allows them to tap into the vast wealth of software written for the ARM architecture. The chip support and board support packages shipped with a given SoC can be a point of differentiation for ARM partners that invest there.


As a result of all of these opportunities for differentiation, any two Cortex-A57, Cortex-A72, or Cortex-A53-based processors can be quite different in their system, power, and performance characteristics, while still being identical from a software perspective.


A quick listing of ways the performance can differ (summarizing some of the points made above):

  • Max frequency (and max sustainable frequency - influenced by power)
  • Power (affects sustained frequency in a thermally constrained environment)
  • Latency to the L2
  • Latency to main memory
  • Bandwidth to main memory
  • L2 size (and L1 size for some ARM CPUs)
  • big.LITTLE topology - number of cores
  • big.LITTLE tuning and scheduling policy
  • Coherent interconnect


Beyond all these of course, our partners innovate around the core with their own IP blocks and design techniques.


To sum up, ARM prioritizes the value of the ecosystem - the ability to write code that runs on all ARM-based CPUs of a given architecture release - and offers partners two routes to building compatible CPUs: through a proprietary microarchitecture or by licensing standard ARM Cortex CPUs. There remain numerous important ways in which ARM partners can differentiate, and as our partners can and do differentiate along all of these dimensions, it is very important to analyze these characteristics when assessing one SoC based on an ARM Cortex-A core against another.


A benefit of this range of configurability and differentiation is that ARM CPU IP can scale to address a broad range of different markets, and the ARM partnership can respond quickly as new markets start to emerge. An example of this is the recent emergence of the wearables market. ARM partners have repurposed low-end smartphone SoCs, based on the very low-power Cortex-A7, to service the initial wave of watches, along with even lower-power Cortex-M CPUs (an order of magnitude less power) for fitness bands and other wearables that don’t require a UI, complex display, or MMU-based OS. Now we are starting to see Cortex-A7 based designs optimized specifically for wearable products, where the targeted physical implementation enables a low-power wearable implementation that runs under 10mW at 100MHz for a full apps core - this from the same Cortex-A CPU that is shipping in 8-core 2GHz versions for low-cost and mid-range smartphones.


Clearly it is important for OEMs to assess the many differentiating factors when choosing between ARM-based SoCs for devices, and it is even more critical for ARM partners to differentiate along each of these paths in the competitive market for SoCs in the ARM ecosystem. It is through this freedom of choice that the ARM partnership has innovated so rapidly and will continue to do so as the ARM ecosystem expands to more fully serve other markets.

In early 2015, ARM announced a suite of IP for Premium Mobile designs, with the ARM® Cortex®-A72 Processor delivering a 3.5x increase in sustained delivered performance over 28nm Cortex-A15 designs from just two years ago. At ARM we are focused on delivering IP that enables the most efficient computing, and the Cortex-A72 micro-architecture is the result of several enhancements that increase performance while simultaneously decreasing power. Last week, at the Linley Mobile Conference, I had the pleasure of introducing the audience to the micro-architectural details of this new processor, which I thought would be good to summarize here as there has been considerable interest.


From a CPU Performance view, we have seen a tremendous growth in performance: a 50x increase in the last five years (15x at the individual core level). The graph below zooms in on performance increases in single core workloads broken into floating point, memory, and integer performance. All points shown up to 2015 are measured from devices in market, and Cortex-A72 is projected based on lab measurements to give a preview of what is expected later in 2015 and in 2016. The micro-architectural improvements in Cortex-A72 result in a tremendous increase across all aspects – floating point, CPU memory and integer performance. For the next generation of mobile designs, Cortex-A72, particularly on 14nm/16nm process technology, is going to change the game – the combination of the performance and efficiency of this process node and CPU are extremely compelling.




The improvements shown here have come through improvements at the micro-architectural level coupled with increasing clock frequencies. But delivering peak performance alone isn’t the designers’ only challenge – mobile devices are characterized by the constrained thermal envelope that SoC designers have to operate within. Hence, to increase performance within the mobile power and thermal envelope, simply turning up the frequency or increasing the issue rate in the micro-architecture isn’t the answer – you have to improve power efficiency.




The Cortex-A72 micro-architectural improvements increase efficiency so much that it can deliver the same performance as Cortex-A15 at half the power even on 28nm, and at 75% less power on 14/16nm FinFET nodes. The performance of a Cortex-A15 CPU can be reproduced on the Cortex-A72 processor at reduced frequency and voltage, resulting in a dramatic power reduction. However, mobile apps often push the CPU to maximum performance rather than to a specific required level of performance. In this case, a 2.5GHz Cortex-A72 CPU consumes 30-35% less power than the 28nm Cortex-A15 processor while still delivering more than 2x the peak performance.
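The frequency/voltage trade behind this can be sketched with the classic dynamic-power relation P ∝ C·V²·f. The voltage and frequency points below are invented for illustration, not Cortex-A72 operating points:

```python
# Rough dynamic-power model, P = C * V^2 * f, illustrating why finishing the
# same work at lower frequency and voltage cuts power so sharply.
# All voltage/frequency values are invented for illustration.
def dynamic_power(v, f, c=1.0):
    return c * v * v * f

p_full   = dynamic_power(v=1.0, f=2.0)  # e.g. a core pushed to 2.0 GHz at 1.0 V
p_scaled = dynamic_power(v=0.8, f=1.4)  # a more efficient core matching that
                                        # performance at 1.4 GHz and 0.8 V
saving = 1 - p_scaled / p_full          # roughly 55% lower dynamic power
```

Because voltage enters quadratically, a modest voltage drop enabled by a more efficient microarchitecture compounds with the frequency reduction.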


Enhancements to the Cortex-A72 micro-architecture

Below is a simplified view of the micro-architecture. Those familiar with the Cortex-A57 pipeline will recognize that the Cortex-A72 CPU sports a similar 3-wide decode, 8-wide issue pipeline. However, in Cortex-A72 the dispatch unit has been widened to deliver up to 5 instructions (micro-ops) per cycle to the execution pipelines.



I list here some key changes and the difference they make (an exhaustive list would be too long!) that highlight the way in which the design of Cortex-A72 CPU was approached, beginning with the pipeline front end.

Pipeline front end

One of the most impactful changes in the Cortex-A72 micro-architecture is the move to a sophisticated new branch prediction unit. There is an interesting trade-off here - a larger branch predictor can cost more power, but for realistic workloads where branch misses occur, the new predictor’s reduction in branch miss rate more than pays for itself in reduction of mis-prediction and mis-speculation. This reduces overall power, while simultaneously improving performance across a broad range of benchmarks.
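The predictor trade-off can be illustrated with a toy energy model (all numbers invented): a bigger predictor costs more energy per lookup, but every avoided misprediction saves the energy of flushing and re-fetching mis-speculated work:

```python
# Toy energy model for the branch-predictor trade-off. All costs and miss
# rates are invented; units are arbitrary energy per 1000 branches.
def energy_per_1k_branches(lookup_cost, miss_rate, miss_cost=200.0):
    """Lookup energy for every branch, plus flush energy for each miss."""
    return 1000 * (lookup_cost + miss_rate * miss_cost)

small_pred = energy_per_1k_branches(lookup_cost=1.0, miss_rate=0.06)
big_pred   = energy_per_1k_branches(lookup_cost=1.5, miss_rate=0.03)
# The larger predictor wins overall despite its higher per-lookup cost,
# because mis-speculation dominates the energy bill.
```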


The instruction cache has been redesigned to optimize tag look-up such that the power of the 3-way cache is similar to the power of a direct mapped cache – doing early tag lookup in just one way of the data RAM instead of 3 ways. The TLBs and micro BTBs have been regionalized, so that the upper bits can be disabled for the common case when page lookups and branch targets are closer rather than farther away. Similarly, small-offset branch-target optimizations reduce power when your branch target is close. Suppression of superfluous branch predictor accesses will reduce power in large basic blocks of code – the A72 recognizes these and does not access the branch predictor during those loops to save power.


Decode/Rename block

Of the many changes in the decode block, the biggest change is in handling of microOps – the Cortex-A72 keeps them more complex up to the dispatch stages – this increases performance and reduces decode power. AArch64 instruction-fusion capability deepens the window for instruction level parallelism. In addition to this, the decode block has undergone extensive decoder power optimization, with buffer optimization and flow-control optimizations throughout the decode/rename unit.



In the dispatch/retire section of the pipeline, the effective dispatch bandwidth has increased to 5-wide dispatch, offering increased performance (through higher instruction throughput) while reducing decode power – decoding full instructions rather than microOps gets more work done per stage for those instructions. Cortex-A72 also features a power-optimized reorganization of the architectural and speculative register files, with significant reductions in ports and area, together with optimizations to the commit-queue and register-status FIFOs, arranging and organizing them in a more power-efficient manner.


One final example of the improvements in the dispatch/retire section is the suppression of superfluous register-file accesses - detecting cases where operand data is guaranteed to be in the forwarding network. Every time you avoid a read from the register file, you save power.


Floating Point Unit and Advanced SIMD

Here the biggest improvement is the introduction of new lower latency FP functional units. We’ve reduced latencies to:

  • 3-cycle FMUL unit (40% latency reduction)
  • 3-cycle FADD unit (25% latency reduction)
  • 6-cycle FMAC (33% latency reduction)
  • 2-cycle CVT units (50% latency reduction)

These are very fast floating point latencies, comparable with the latest high performance server and PC CPUs. Floating point latency is important in typical mobile and consumer use cases where there is commonly a mix of FP and integer work. In these settings, the latency between computation and result is critical. Shorter latencies mean integer instructions waiting on the results of those instructions are less likely to be stalled.
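A quick way to see why latency matters for dependent code: with back-to-back dependencies, a chain's total cycles are roughly chain length times unit latency, regardless of issue width. Using the FMAC latencies from the list above (9 cycles implied before, 6 cycles on Cortex-A72):

```python
# Minimal latency model for a dependent chain of fused multiply-accumulates.
# When each operation consumes the previous result, throughput cannot hide
# latency, so total cycles scale with the unit's latency.
def dependent_chain_cycles(n_ops, latency):
    return n_ops * latency

before = dependent_chain_cycles(100, 9)  # 9-cycle FMAC
after  = dependent_chain_cycles(100, 6)  # 6-cycle FMAC (33% reduction)
# The 33% latency reduction translates directly into 33% fewer cycles
# on this serial chain.
```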


This performance increase shows up in SPECfp and SPECfp2006 as an uplift of approximately 25%. This type of improvement matters less for high-performance compute applications where pure floating-point throughput is required; in mobile use cases, floating point shows up in combination with integer work. A good example of this combination is JavaScript code, where the native numeric type is a double-precision float. In addition, the divide unit has moved to a Radix-16 FP divider, doubling the throughput of divide instructions executed.


Other improvements in this area of the design include an improved issue-queue load-balancing algorithm, and multiple zero-cycle forwarding data paths resulting in improved performance and reduced power. Finally, the design features a source-reduction in the integer issue-queue which cuts power without performance loss.


Load/Store unit

The Load/Store unit features several key optimizations. The main improvement is the replacement of the pre-fetcher with a more sophisticated combined L1/L2 data prefetcher that recognizes more streams. The Load/Store unit also includes late-pipe power reduction with an L1 D-cache hit predictor. Performance tuning of Load/Store buffer structures improves both effective utilization and capacity. Increased parallelism in the MMU table-walker and a reduced-latency L2 main TLB improve performance on typical code scenarios where data is spread across many data pages, for example in Web browsing where data memory pages typically change frequently. Finally, there is power optimization in configurable pipeline support logic, and extensive power optimization in the L2-idle scenario. The combination of the new pre-fetcher and other changes enables an increase of more than 50% in CPU memory bandwidth.
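Stream recognition of the kind a data prefetcher performs can be sketched in a few lines. This toy detector tracks a single stream, whereas real hardware tracks many streams concurrently:

```python
# Simplified sketch of stride detection, the basic mechanism behind a
# streaming data prefetcher: watch successive miss addresses and, once a
# constant stride emerges, predict the next address to fetch ahead of time.
def detect_stride(addresses):
    """Return the stride if the address sequence is a steady stream, else None."""
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    return strides.pop() if len(strides) == 1 else None

stride = detect_stride([0x1000, 0x1040, 0x1080, 0x10C0])  # 64-byte stride
next_prefetch = 0x10C0 + stride if stride else None       # fetch before use
```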


In summary, the Cortex-A72 micro-architecture is a significant change from Cortex-A57. Several changes to the microarchitecture work in concert to produce a big step up in performance while consuming less power. The important takeaways for the ARM Cortex-A72 are:


  • Performance improvements in numerous key areas
  • Generational performance upside across all workload categories
  • Extensive power-efficiency improvements throughout the microarchitecture
  • Reduced-area, lower-cost solution


Cortex-A72 is truly the next generation of high-efficiency compute. A Cortex-A72 CPU based system increases performance and system bandwidth further when combined with ARM CoreLink CCI-500 Interconnect. A premium mobile subsystem will also contain the Cortex-A53 in a big.LITTLE CPU subsystem to reduce power and increase sustained performance in thermally constrained workloads such as gaming. Finally, the ARM Mali-T880 graphics processor can combine with Cortex-A72 and CCI-500 to create a premium mobile system. That’s all for this edition of the blog. What further features of the Cortex-A72 and the big.LITTLE premium mobile system are you interested in hearing about in future editions?


The modern SoC is a feat of engineering that continually squeezes greater performance from defined power and area constraints. However, the arch-nemesis of reliability is complexity.


“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

          — Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style


As SoC complexity continues to grow exponentially, it is only wise to build some advanced debug capability into the SoC. We’re all familiar with the saying “a stitch in time saves nine”, and it is particularly relevant to debugging: the later you find a bug, the more tedious, time-consuming and expensive it becomes to resolve. Visibility is a precious resource to system designers, as it gives them an opportunity to spot bugs early, and to make subtle changes that can alter and optimize an SoC’s performance. On-chip visibility acts as a screening process to identify any snags.

There are certain SoC bugs that tend to manifest themselves through either a data corruption or a system lock up which occurs only when a series of contributing factors align to cause the fault. Factors may be as diverse as manufacturing tolerances being exceeded, bit errors being introduced, complex real-world software exercising new unvalidated spaces, or race conditions between multiple out-of-order transactions.

So if your design does get hit by a rare, extremely difficult-to-reproduce and tricky-to-diagnose issue, it’s critical that you have tools to deploy to help you get to the bottom of the problem as fast as possible. Almost by definition, any bug found in silicon is not going to be found by a simple test case you can run on a simulator or emulator of parts of your design.




Diagnosing the problem

The complexity of multi-core processors and cache coherent interconnects mean much of what was previously visible through CoreSight Embedded Trace Macrocells (ETM), essentially the programmer’s view, is now hidden inside the IP blocks.

With this in mind, ARM has developed a new weapon to add to the CoreSight on-chip debug and trace armoury, called the CoreSight™ ELA-500 Embedded Logic Analyzer, in order to provide a more accurate diagnosis of system bugs. As the name suggests, this is a logic-analyzer-like IP block for embedding into your SoC, able to monitor up to 12 groups of 128 signals, generate triggers from assertion-like conditions, and collect a recent trace history of selected signals in a small embedded SRAM.




Example debug setup with ELA

So step one is to find out what state your system has got itself into and what illegal or suspicious condition has occurred. The trace aspect of debug is similar to a detective using CCTV footage to solve a crime. To help with this, the ELA-500 contains a way to set up complex multi-state conditional triggers such as:


  • Trace next 6 write requests plus cache attributes to address 0x12345678
  • Load request from core 0 to address A will advance to trigger_state_1 which
    will then trigger debug mode after core 1 read from address A

The ELA-500 provides a number of tools to discover such illegal or suspicious conditions:


  • A state machine with 4 trigger states programmable in any sequence including loops
  • Each trigger state can select one of the 12 signal groups as input for trigger conditions
  • Each trigger condition is programmable for comparisons to mask and match any
    combination of 128 signals:  =, !=,  >,  >=, <, <=
  • Each trigger state has a 32-bit counter input to count events, count clock cycles or act
    as watchdog timer



  Figure 1: Trigger set-up in the ELA-500
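A software model of the two-state trigger example above might look like the sketch below. The event format and field names are invented for illustration; the real ELA-500 is programmed through its debug APB registers:

```python
# Hypothetical model of a two-state trigger sequence: arm on a load from
# core 0 to address A, then fire a debug action when core 1 reads the same
# address. Event tuples are (core, operation, address) - an invented format.
def run_triggers(events, addr_a):
    state = "IDLE"
    for core, op, addr in events:
        if state == "IDLE" and core == 0 and op == "LOAD" and addr == addr_a:
            state = "TRIGGER_STATE_1"   # first condition met: advance state
        elif (state == "TRIGGER_STATE_1" and core == 1
              and op == "READ" and addr == addr_a):
            return "DEBUG_MODE"         # second condition met: fire action
    return state

events = [(0, "LOAD", 0x80), (1, "WRITE", 0x90), (1, "READ", 0x80)]
result = run_triggers(events, addr_a=0x80)  # reaches "DEBUG_MODE"
```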


Step 2 is to start looking at what happened around this suspect state or condition; which can be done by storing selected signal states to the ELA-500 dedicated SRAM, configurable between four and over one billion trace data entries, or by triggering another action outside the ELA-500.

Up to eight programmable output actions can be triggered for each trigger state, such as: stop clocks, enter debug state, start/stop signal trace, trigger another logic analyser or ETM, or assert a CPU interrupt.

From the information gleaned, new trigger conditions will likely be set to see what other unexpected conditions or states are occurring, repeating steps 1 and 2 to establish the chain of events leading to the error condition.

For really extreme cases, even further visibility may be required around the trigger condition – visibility available only through a scan-chain dump. For this, step 3 is to program a stop-clock action on the ELA-500 and then use a scan-chain dump, together with information on the SoC’s scan chains, to recover the exact state of any and all registers within the SoC on a scan chain. The ELA-500 here provides the precision on which scan-chain dumps to analyse, so less of this time-consuming exercise needs to be done.



Where to deploy an Embedded Logic Analyzer

The ELA-500 can monitor any signal you connect to its inputs. SoC designers will benefit from connecting up signals from ARM IP and from proprietary or third-party IP. A typical design might contain multiple ELA-500s deployed to monitor signals in different domains of the SoC, as shown in figure 2, with one per main processor cluster, one for the Cache Coherent Interconnect, and one for other signals selected by the SoC designer.



Figure 2: Example deployment of the ELA-500 in a system


Figure 2 shows the clock-stop requests (in red) running from each ELA to the Clock Controller, and the connectivity (in black) of trigger in/out to the CoreSight Cross Trigger Interfaces (CTI) and the Cross Trigger Matrix (CTM). The debug APB bus is used both to set up trigger conditions and to read back the contents of the ELA’s SRAM, as controlled by the debugging tool, such as the ARM® DS-5™ debug tool.




Connecting the ELA-500 to the Cortex-A72 processor

For connection to ARM IP a Logic Analyzer IP Kit (LAK-500A) is provided with a pre-selected set of signals for that IP. The first of these is available for the recently released Cortex®-A72 processor to ensure the ELA-500 can sample signals at the maximum operating frequency of the Cortex-A72 without any impact on the operation of the processor.

The LAK-500A Logic Analyzer IP Kit includes the following:


  • Documented debug signal list and organization into 12 signal groups of 128 debug signals
  • A port puncher script that takes the debug signal list and adds connection to the top level
    ports of the Cortex-A72 processor. The script also has an option to add a register slice to
    debug signals to ensure timing closure
  • A LEC script to ensure nothing but the debug ports changed in the Cortex-A72 processor

The observation interface signals provide debug visibility of: each core-to-L2 interface, power-management interfaces, and the L2 memory system power-management interface. The core-to-L2 interface provides visibility of the physical addresses of L1 misses to the L2, and the following transaction details:


  • Memory type: normal, device, or strongly ordered
  • Read or write
  • Fetches
  • DSB or DMB
  • AArch32 or AArch64
  • L1 set index
  • Byte transfer size
  • Last data received
  • Memory attributes: not shareable, inner shareable, or outer and inner shareable
  • Whether access is from privileged mode
  • Read type: read clean, read unique, icache, data cache, or TLB invalidate
  • Write type: eviction, device, unique, or streaming
  • Eviction has double bit ECC error
  • Signals that determine proper operation of the Load/Store L2 interface.
  • Core snoops,  including cache maintenance Instruction Cache Maintenance Operation
    (ICMO) and TLB Maintenance Operation (TMO)
  • L2 pre-fetch

Future support is planned for new ARM Cortex-A and Mali™ processors as well as the CoreLink™ CCI Cache Coherent Interconnects, where transactions in flight and snoop traffic can be observed.



CoreSight ELA-500 can find corner-case bugs

The CoreSight ELA-500 provides visibility into the states leading up to lock-ups and data corruption. It provides visibility of CPU loads, stores, speculative fetches, cache activity and transaction lifecycles: properties that are not visible with existing ETM instruction trace. This offers greater scope for finding corner-case bugs that could spell disaster if discovered too late.

The ELA-500 can monitor error states and hazard conditions across the SoC, giving the visibility needed to debug lock-ups without resorting to complex scan-chain dump analysis, and to catch invalid accesses to device memory. The ELA can spot data corruption early, whereas conventional timeouts occur too late and the causative events are often lost or overwritten. I go into even more detail on some of the use cases for the CoreSight ELA-500 in a video interview with silicon debug expert Mark LaVine.

All this ensures you have the fastest debug route available should your SoC suffer a catastrophic failure found only when the silicon comes back and full software is running on the device.



A full specification of the CoreSight ELA-500 can be found on the ARM Infocenter



You can find more information on the CoreSight ELA-500 webpage

On the list of activities that system designers enjoy doing, “debugging” is invariably near the bottom. That’s because it is often complicated, time-consuming and downright frustrating to track down and identify what is going wrong on the chip. However it goes without saying that debug is a critical part of SoC development. The ‘quality control’ that debugging provides means that OEMs can be assured of a high standard of functionality from a chip. The peace of mind this affords is invaluable, much in the same way you would be a lot more relaxed in the knowledge the new car you have just bought has had its brakes tested for quality assurance.


An effective debug strategy requires an experienced head and a good set of tools to get things done properly. When it comes to experience, Mark LaVine is an expert on the matter, having spent the last 15 years developing debug and trace solutions in order to minimize the frustrations that system designers feel when attempting to diagnose on-chip problems. Mark sat down recently with William Orme to talk about some of the common challenges related to silicon debug and some of the strategies available to overcome them.


In the video below he opens up about the topic of silicon debug and the major problems that surround this area, “today we’re looking at highly integrated products with very limited visibility”. He goes through some of the scenarios that lead to bugs being found in silicon, as well as the implications they can have, “usually it’s either a lock up or data corruption. Data corruption is the most difficult area to debug because typically it gets detected very late, from where the originating corruption occurred. To do experiments and trace back the original source can be very time-consuming”.






Mark has just finished working on the development of the brand new ARM® CoreSight™ ELA-500 Embedded Logic Analyzer, which is designed especially to diagnose and identify corner-case bugs. These are the type of bug that typically slips through the net of normal debug and trace protocols and only shows up later in the process, when it suddenly becomes a far more arduous task to get rid of them – not to mention the greater costs involved in removing bugs found in silicon. In the video below you can see Mark speak about some examples of how the ELA-500 could be used to provide greater visibility and detect these issues before it’s too late, including on the new Cortex®-A72 processor: “With the Cortex-A72 processor we provide visibility on the CPU to L2 interface, which is very useful for accesses that could go external. In the case of a hang or lock-out you could find out which accesses were going on prior to the lockup. For other things like data corruption you could get a trace of those instructions”.


ELA-500 diagram.png

An example of the ELA-500 being deployed in a system




If you have any questions for Mark on the subject of silicon debug then please leave them in the comments section below and we will do our best to answer them here or with a follow up video.


My colleague William Orme has also written a blog that goes into more detail on the ELA-500 and how it succeeds in Taking the fear out of silicon debug.


For more information on the CoreSight ELA-500

One of the things I have noticed during the year I have been working at ARM is the great interest people have in ARM’s history. After a quick Google search and multiple open tabs, I realized that there is much debate and commentary on the actual history of ARM.

You can easily find a timeline of ARM’s corporate milestones, but it doesn’t really tell the story of how ARM came into existence and how it rose to the top of its industry. It does, however, give you a full timeline of ARM licensees and the key moments in the company’s history. Please join the debate in the comments section if you feel there is more to add, or suggest topics you would like to see researched in further blog entries. Also, don’t be afraid to comment with corrections or extra information from the 1980–1997 era on which this Part 1 blog is based.

This blog will be posted over two entries – The History of ARM: Part 1 and The History of ARM: Part 2.

The Beginning: Acorn Computers Ltd

Any British person over the age of 30 will most likely remember Acorn Computers Ltd and the extremely popular BBC Micro (launched with a 6502 processor in 1981). The background of Acorn is a very interesting story in itself (and probably deserves its own blog), set in the booming computer industry of the 1980s. The founders of Acorn Computers Ltd were Christopher Curry and Hermann Hauser. Chris Curry was known for working very closely with Clive Sinclair of Sinclair Radionics Ltd for over 13 years. After some financial trouble Sinclair sought government help, but when he lost full control of Sinclair Radionics he started a new venture called Science of Cambridge Ltd, later known as Sinclair Research Ltd. Chris Curry was one of the main people in the new venture, but after a disagreement with Sinclair on the direction of the company, Curry decided to leave.

Curry soon partnered with Hermann Hauser, an Austrian with a PhD in Physics who had studied English in Cambridge at the age of 15 and liked it so much that he returned for his PhD. Together they set up CPU Ltd, which stood for Cambridge Processing Unit, whose products included microprocessor controllers for fruit machines that could stop crafty hackers from getting big payouts. They launched Acorn Computers as the trading name of CPU to keep the two ventures separate. Apparently the reasoning behind the name Acorn was to appear ahead of Apple Computer in the telephone directory!

Fast forward a few years and they landed a fantastic opportunity to produce the BBC Micro, part of a government initiative to put a computer in every classroom in Britain. Sophie Wilson and Steve Furber were two talented computer scientists from the University of Cambridge who were given the wonderful task of coming up with the microprocessor design for Acorn’s own 32-bit processor, with little to no resources. The design therefore had to be good, but simple: Sophie developed the instruction set for the ARM1 and Steve worked on the chip design. The first ever ARM design was modelled in 808 lines of BASIC; to quote Sophie from a Telegraph interview, ‘We accomplished this by thinking about things very, very carefully beforehand’. Development on the Acorn RISC Machine didn’t start until late 1983 or early 1984. The first chip was delivered to Acorn (then in the building we now know as ARM2) on 26th April 1985, which makes this year the architecture’s 30th birthday! The Acorn Archimedes, released in 1987, was the first RISC-based home computer.

If there is enough interest I will do a full blog on the history of Acorn Computers Ltd, but for now you can find a great BBC TV movie called Micro Men – watch out for the Sophie Wilson cameo appearance! (Credit to the BBC – source here for British iPlayer users)

Micro Men - A BBC Movie


ARM is founded.

ARM back then stood for ‘Advanced RISC Machines’, but to answer the age-old question asked by many people these days: today it actually doesn’t stand for anything. The machines it was named after are long since outdated, yet ARM kept the name – which, funnily enough, now means nothing! It does have a cool logo though!


ARM Logo (2015)



The company was founded in November 1990 as Advanced RISC Machines Ltd and structured as a joint venture between Acorn Computers, Apple Computer (now Apple Inc.) and VLSI Technology. This was because Apple wanted to use ARM technology but didn’t want to base a product on Acorn IP, as Acorn was at the time considered a competitor. Apple invested the cash, VLSI Technology provided the tools, and Acorn provided the 12 engineers – and with that ARM was born, along with its luxury office in Cambridge: a barn!

Fig_ARM_Headquarters.jpgARM headquarters



While at Motorola, Robin Saxby had supplied chips to Hermann Hauser’s earlier venture, CPU. Robin was interviewed and offered the job of CEO around 1991. In 1993 the Apple Newton was launched on the ARM architecture. Anyone who has ever used an Apple Newton will know it wasn’t the best piece of technology; unfortunately Apple overreached beyond the technology available at the time, and the Newton had flaws that vastly lowered its usability. Because of this, ARM realized it could not sustain success on single products, and Sir Robin introduced the IP licensing business model, which wasn’t common at the time. The ARM processor was licensed to many semiconductor companies for an upfront license fee, followed by royalties on production silicon. This made ARM a partner to all of these companies, speeding up their time to market in a way that benefited both ARM and its partners. For me personally, this model was never taught to us in school and doesn’t show its head in the business world much, but it creates a fantastic model of using the ARM architecture in a large ecosystem – one that effectively helps everyone in the industry towards a common goal: creating and producing cutting-edge technology.

TI, ARM7, and Nokia

The crucial break for ARM came in 1993 with Texas Instruments (TI). This was the break that gave ARM credibility and proved the viability of the company’s novel licensing business model. The deal drove ARM to formalize its licensing model and to make more cost-effective products. Subsequent deals with Samsung and Sharp proved that networking within the industry was crucial in generating enthusiastic support for ARM’s products and in gaining new licensing deals, which in turn led to new opportunities for the development of the RISC architecture. ARM’s relatively small size and dynamic culture gave it a response-time advantage in product development. ARM’s big break came in 1994, during the mobile revolution, when realistically small mobile devices were becoming a reality. The stars aligned and ARM was in the right place at the right time. Nokia was advised to use an ARM-based system design from TI for its upcoming GSM mobile phone, but due to memory concerns was initially against using ARM because of the overall system cost. This led ARM to create a custom instruction set with 16-bit instructions (what became Thumb) that lowered the memory demands, and this was the design that was licensed by TI and sold to Nokia. The first ARM-powered GSM phone was the Nokia 6110, and it was a massive success. The ARM7 became ARM’s flagship mobile design and has since been used by over 165 licensees, who have produced over 10 billion chips since 1994.


mtnok61g.jpgNokia 6110 - the first ARM powered GSM phone (You may remember playing hours of the game snake!)

Going Public

By the end of 1997, ARM had grown into a £26.6m private business with £2.9m net income, and the time had come to float the company. Although the company had been preparing to float for three years, the tech sector was in a bubble at the time and everyone involved was apprehensive, but they felt it was the right move to capitalize on the massive investment flowing into the sector.

On April 17th, 1998, ARM Holdings PLC completed a joint listing on the London Stock Exchange and NASDAQ, with an IPO at £5.75. The reason for the joint listing was twofold. First, NASDAQ was the market through which ARM believed it would gain the valuation it deserved in the tech bubble of the time, which was mainly centred in the States. Second, the two major shareholders of ARM were American and English, and ARM wished to allow existing Acorn shareholders in the UK to have continued involvement. Going public caused the stock to soar and turned the small British semiconductor design company into a billion-dollar company in a matter of months!


mo_052008f.jpgARM Holdings was publicly listed in early 1998



Keep an eye out for Part two which covers the last 20 years of ARM’s history (1997-2015). Please leave comments and feedback for items you would like to see discussed!

Credit to Markus Levy, Convergence Promotions – see here for more detailed information on the technologies used during those early years.

Credit also to the internal help received by many during the writing of this blog.

Last week, several of our partners unveiled new Chrome OS devices powered by Cortex-A17-based processors. These new products include two very competitively priced Chromebooks from Haier and HiSense, a convertible laptop-tablet called the Chromebook Flip, and a brand new kind of HDMI dongle called the Chromebit, the latter two both from Asus.




Following Cortex-A17’s top score in Antutu’s “Best Performance Android Smartphones 2014”, these new devices re-affirm the capability of the Cortex-A17 CPU, in combination with the ARM Mali-T760 GPU, to provide a high-performance computing experience in devices such as tablets through highly cost-effective implementations.

The announcement of these new devices is a very good opportunity to review the characteristics of the Cortex-A17 that have made it a success in many popular consumer products – smartphones, tablets and OTT devices – that require the highest performance in thermally constrained form factors.

Cortex-A17 - A Balanced design for premium performance and cost efficiency


The Cortex-A17 is the third generation of ARMv7-A out-of-order processors, following successful products such as the Cortex-A9 and Cortex-A15. The Cortex-A17 processor was designed to meet some very aggressive power, performance and area (PPA) goals, including:

  1. Provide a significant boost in performance over the current generation of CPUs through improved branch prediction and out-of-order issue capabilities
  2. Maintain an optimal power and area profile that fits thermally constrained form factors, in particular by keeping a two-way superscalar architecture
  3. Build a micro-architecture that is tuned and optimized for mobile workloads, for instance through better use of the memory system

This enables the Cortex-A17 to provide the best single-thread performance for 32-bit applications of any ARMv7-A core.




Single-thread performance is critical to the user experience, as it is at the heart of key applications such as the user interface and, above all, web browsing. While the Cortex-A17 and Cortex-A15 have similar SPECint2000 results, the Cortex-A17 exceeds Cortex-A15 performance for web browsing, enabling the new Chrome OS devices to score better than the successful devices of 2014.



Source: arstechnica.com


The Cortex-A17 achieves higher performance on benchmarks representative of today's complex and demanding real-world web applications running in mobile and desktop browsers, such as Kraken, Octane and SunSpider. This is achieved through a combination of design optimizations, especially around the memory system and streaming performance. These optimizations fit within an optimal power and area profile, resulting in better power efficiency. Better power efficiency allows the maximum frequency to be sustained for longer before hitting the SoC's thermal limits, and so translates directly into a performance uplift. Area also matters, as it contributes to silicon cost as well as leakage power. The Cortex-A17 has been extensively tuned, and is considerably more area- and power-efficient than the Cortex-A15 and similar to the Cortex-A9.
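
The frequency-versus-thermal-budget argument can be made concrete with a toy model. The numbers and the linear power-per-GHz model below are illustrative assumptions, not Cortex-A17 data.

```python
# Toy model (illustrative numbers only): within a fixed sustained thermal
# budget, a more power-efficient core can hold a higher clock frequency,
# which translates directly into sustained performance.

def sustained_freq(budget_w, watts_per_ghz, f_max_ghz):
    """Highest frequency (GHz) sustainable under the given thermal budget."""
    return min(f_max_ghz, budget_w / watts_per_ghz)

THERMAL_BUDGET_W = 2.0   # hypothetical sustained budget for the CPU cluster

# Hypothetical efficiency figures: core B needs fewer watts per GHz than core A.
f_a = sustained_freq(THERMAL_BUDGET_W, watts_per_ghz=1.25, f_max_ghz=2.0)
f_b = sustained_freq(THERMAL_BUDGET_W, watts_per_ghz=1.00, f_max_ghz=2.0)

print(f_a, f_b)  # prints 1.6 2.0: only the more efficient core sustains 2.0 GHz
```

The same reasoning is why area and leakage matter: a smaller, lower-leakage core leaves more of the thermal budget available for dynamic switching power.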


This power efficiency enables our partners to optimize the Cortex-A17, especially in a mature and cost-efficient node like 28nm. The Cortex-A17 has broad support from ARM Physical IP at 28nm, such as ARM Artisan POP IP, which enables system design with low risk and fast time-to-market.


An optimized software ecosystem is fundamental for a great user experience. Today’s mobile world is based around the ARMv7-A architecture which supports over one million applications across many device categories. The Cortex-A17 processor leverages the popular applications and libraries that are specifically optimized for performance and efficiency on this architecture. New ecosystems around ARMv8-A are being built, and these complement the ARMv7-A ecosystem, particularly where a 64-bit instruction set is a necessity such as in server and enterprise applications.


What’s coming next for Cortex-A17?


We are very happy to see our partners introducing innovative new devices and enabling access to premium performance at a very attractive price. In the coming months, the Cortex-A17 will continue to be at the heart of a great number of new mid-range devices, while the Cortex-A57 powers high-end products. The Cortex-A17 is today's choice for 32-bit devices that require the highest performance in thermally constrained form factors, so we expect to see more and more Cortex-A17 devices, from smartphones to smart TVs and set-top boxes, as well as in markets with similar technology constraints such as home networking, industrial applications and high-end wearables.

Which new Cortex-A17 devices will your imagination build?

While TI has slowly shrunk the Tiva (formerly Luminary Micro) families, a few weeks ago it sampled a Cortex-M4F part. Guess which top-level family the Cortex-M4F has landed in? Find out in our latest post...

As part of my last assignment, I had the opportunity to spend three months in Taipei. The focus was to work closely with ARM partners in the APAC region and gather valuable insights and feedback on ARM System IP specifically and ARM in general. Spending this time “on the ground” helped me understand our partners’ needs over a sustained period. Even though global communication is instant these days, there is still no substitute for meeting people face-to-face and immersing yourself in their business flows and culture. I built some great relationships in the process and learnt a lot.


We all have our impressions of and views on the tech industry in the APAC region, so re-using some of the terms you may come across in the media, I will try to share my take. Let me say that there is nothing like first-hand experience!


  1. Low-cost innovations: The key observation here was that, as IP product developers, we should consider not only innovations in the technical features of our products but also innovations in the overall cost-effectiveness of using that IP to build end products. By treating the “cost to our partners” as an evaluation criterion in product development, many innovations can be achieved in the IP. It brings to mind the famous Toyota Production System of the 1980s, where eliminating waste from the system enabled Toyota to become a world leader in automotive manufacturing. The next big milestone in the technology revolution is connecting ‘the next billion’ people, who are not necessarily from Western Europe or the US. The low-cost innovations provided by our partners in APAC are absolutely critical to making technology accessible to the majority of the world’s population.
  2. Break-neck pace of development: Partners in the APAC region have a very large and fast-growing market on their doorstep. This leads to increased competition, with many players and numerous “new” opportunities. Being able to turn products around quickly is key to grabbing a good share of the market, and an important strategy. A quote that sums this up: “In normal business situations, managers make decisions with 80% of the information. Due to the fast pace of the market in China, managers need to make decisions with only 50–60% of the available information”. Hence, products that partners can have a high degree of confidence in, and that are easy to design with, are essential.
  3. Language barrier: There is no denying that the language barrier causes many challenges – not so much on the engineering side of things, but definitely on the product-marketing side. The challenge runs both ways: relaying the product value proposition and collecting invaluable requirements and feedback. I relied on a meeting style where I shared my material a week in advance, opened up an email thread well before the meeting, and used the meeting itself for whiteboard-style discussions with support from the brilliant local ARM FAE and sales teams. This meant the key messages were understood on both sides and the meetings were effective.


If you have had any similar or different experiences when working in APAC or with APAC partners please leave a comment below, I’m very interested to hear others’ feedback. Or indeed, some of our community members from APAC can tell us about the challenges you face when working in Europe or the US! In conclusion, business in APAC is challenging and by adapting to this we have a great opportunity to build technology that has the potential to reach the masses. There is a reason why we see innovative products coming out of APAC both on the high-end side and also on the low-end mass market side and this is incredibly exciting stuff with existing mobile technology and IoT on the horizon!


The good news is that we at ARM are closely aligned with what our partners are requesting, and I believe we are on the right track to co-innovate the technology needed to bring tangible benefits to many. The System and Software Group (SSG) at ARM is responsible for delivering System IP and key software that enable partners to design best-in-class systems with ARM processors and technology. SSG’s mantra is not only to bring system-level innovations to our partners but also to "help" them develop with ARM. This fits in very nicely with my second point on making sure that our products make our partners’ lives easier and make it a simple decision to continue designing SoCs with ARM technology.


Add to this the wonderful and highly capable ARM FAE and Sales teams dedicated to partner support around the world. In APAC itself, ARM has offices in Shanghai, Beijing, Shenzhen, Bangalore, Noida, Yokohama, Seoul, Taipei and Hsinchu providing local points of contact for partners. And the local teams are great fun! (see picture above from Taiwan end of year fancy dress party)


If you are in the region then do come along to the ARM System IP technology seminar taking place in Hsinchu, Taiwan on 16th April and it would be a great opportunity for you to learn about ARM System IP and provide feedback to ARM product managers face-to-face.


2015 ARM System IP Technology Seminar

Venue: Sheraton Hsinchu Hotel, 5F, Chapel

Location: Hsinchu, Taiwan

Date: 16 April 2015


If you managed to read down this far and heard of "alchemy" in Taipei 101 region then let me know

Automobile manufacturers are constantly improving the design of cars, creating an experience that is safer and more comfortable with each new model. In recent years we have seen a defined move towards more technology in our cars, designed to make them easier to drive. This year’s CES was notable not only for the number of automobile manufacturers in attendance, but also for the fact that nearly all of them had a version of a self-driving car, prompting speculation that such cars will appear on our roads within the next five years. My first experience of a driverless car came while watching the movie Demolition Man back in 1993 (although I’m sure there are earlier examples in sci-fi movies). The movie, set in 2032, also had very accurate depictions of an iPad and Skype, as can be seen in this video compilation. The only surprise is that the director was too conservative in his estimate of when these technologies would be developed!



Science fiction is always an interesting barometer of predicting the future, introducing concepts ahead of their time. It’s exciting to see these visions turn into a reality, but there is also a more pressing need for their development. In August of 2012, KPMG and the Center for Automotive Research published a comprehensive report on Self-driving cars: The next Revolution. It included some powerful statistics on the dangers involved in driving a car:


In 2010 there were approximately six million vehicle crashes, of which 93 percent were attributable to human error.”


ADAS can save lives

Therefore there are huge social benefits to making cars safer through applying new technologies. As technology enthusiasts and engineers, we are generally more concerned with how to make the automotive experience safer. Much has been publicised about the safety benefits of autonomous vehicles, and the path to achieving this is via advanced driver-assistance systems (ADAS). ADAS is a combination of technologies located in the car designed to enhance vehicle systems for safety and better driving, as Soshun Arai explains. Certain features like cruise control, rear-view cameras or automated lighting have been included as standard options for many years now. Some of the features we are currently seeing are things like traffic warnings, keeping the driver in the correct lane, providing visibility of blind spots and automated braking. The rate of consumer acceptance is generally slower than the rate of development, which may be why it takes time to see safety features appear in consumer automobiles.





ADAS can be based upon different systems, including vision/camera, sensor technology, car data networks, vehicle-to-vehicle, or vehicle-to-infrastructure systems. The Connected Car of the future will increasingly utilise wireless networks to interact with other vehicles and the highway, providing extra safety and more up-to-date information to passengers.


One of the fascinating things about working at ARM is seeing how our partners develop SoCs to make the quantum leap of turning concepts into reality. The bottom line is that self-driving cars will be controlled by an SoC instead of a person. At Embedded World 2015 Xilinx announced the new UltraScale+ family of FPGAs, 3D ICs and MPSoCs, which included the Zynq UltraScale+ MPSoC (multi-processor system-on-chip). You can find the full press release here. It is leading in the area of heterogeneous MPSoCs, with the All Programmable UltraScale SoC architecture providing processor scalability from 32 to 64 bits with support for virtualization, a combination of soft and hard engines for real time control, graphics/video processing, advanced power management, and technology enhancements that deliver multi-level security, safety and reliability. All of these improvements have powerful implications for next-generation driver assistance systems. You can see ARM’s Phil Burr speaking to Larry Getman of Xilinx at Embedded World about the new release.




Zynq MPSoC gives users peace of mind by undergoing rigorous validation

The new Zynq MPSoC has gone through a rigorous planning and validation process to ensure no compromises have been made on security, safety and reliability. The processing sub-system includes a dual-core ARM® Cortex®-R5 real-time processor for deterministic operation, ensuring responsiveness, high throughput and low latency for the highest levels of safety and reliability. A separate security unit enables military-class security solutions such as secure boot, key and vault management, and anti-tamper capabilities – standard requirements for machine-to-machine communication and industrial IoT applications. Every millisecond counts on the road, and ARM’s 16FF memory compilers generate fast cache instances that ease access to the most critical data and allow the system to perform even faster. In addition, ARM CoreSight debug and trace technology was implemented in the chip’s development to provide the on-chip visibility that enables fast diagnosis of bugs and performance analysis. Amongst other things, CoreSight helps the chip meet the high quality standards required by ISO 26262.


There are significant enhancements over the previous generation Zynq-7000, with performance increases and power savings. Along with the advantage that comes with multiple processors, it has also integrated the host controller to become the primary computing system for driver safety ECUs. The system intelligence of a more integrated vehicle ECU expands the functional safety capabilities that next-generation automobiles will be able to provide.


The new Zynq UltraScale+ MPSoCs deploy new UltraScale+ FPGA technologies, including enhancements to DSP and transceivers and new features such as UltraRAM memory and interconnect optimisation. This is in addition to providing an unprecedented level of heterogeneous multi-processing, deploying ‘the right engines for the right tasks’. At the centre of the processing sub-system is the 64-bit quad-core ARM Cortex-A53 processor, capable of hardware virtualization, asymmetric processing and full ARM TrustZone® support. The new MPSoC delivers approximately 4X system-level performance per watt relative to previous alternatives.

Zynq performance increase.png

Source: Xilinx



One of the key aspects of increasing driver safety involves showing what is happening around the vehicle in real time with the use of cameras. Displaying 3D surround view with flying camera requires efficient 3D graphics rendering. For complete graphics acceleration and video compression/decompression, the Zynq incorporates an ARM Mali™-400MP dedicated graphics processor as well as a H.265 video codec unit, combined with support for Displayport, MIPI D-PHY and HDMI.


The combined power of the Xilinx FPGA with the Cortex-R5 and Cortex-A53 processors, along with optimized HW/SW partitioning, allows the Zynq to perform features such as adaptive cruise control, forward collision warnings and autonomous braking for cyclists or pedestrians. Cyclists will be delighted to hear that car manufacturers keep them in mind when designing safety features! Finally, a dedicated platform and power management unit (PMU) has been added that supports system monitoring, system management and dynamic power gating of each of the processing engines.


Technology is making our roads safer

I will conclude with another statistic from the KPMG report mentioned above: “The economic impact of crashes is also significant. According to research from the American Automobile Association (AAA), traffic crashes cost Americans $299.5 billion annually”. Viewed from this perspective, the onset of ADAS and autonomous cars can’t come quickly enough. Thankfully, the current rate of technology development makes it likely they will appear much earlier than the year 2032 imagined in Demolition Man by the sci-fi directors of the 90s. In fact, the Xilinx Zynq is a major step in that direction, as it greatly increases the features that next-generation ADAS will provide.


Read more about ARM's commitment to ADAS


For more information on the Zynq UltraScale+ MPSoC please visit the Xilinx website 

Hardware virtualization has been a long-established technology in the computing industry and is ubiquitous in the server market, providing resource management while maintaining performance and security benefits. In the mobile and embedded space, virtualization also enables hardware to run with less memory and fewer chips on a smaller scale, reducing BOM costs and further increasing energy efficiency. It can also help to address safety and security challenges, and can reduce software development and porting costs significantly.


Virtualization benefits.png


I’ll go through some of the different meanings that virtualization has in different areas.


In general terms, virtualization of CPU, memory and I/O enables multiple system images or guests to be supported for the deployment of applications, system libraries and OSes, and allows right-sizing of system resources by partitioning one large system into smaller virtual systems.


A much-simplified explanation: one or more guest OSes run on top of hypervisor software, which communicates directly with the CPU and the underlying I/O on behalf of the guest(s). The hypervisor arbitrates access to the hardware so that each OS ‘believes’ it is talking directly to the CPU and I/O.
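As a toy illustration of that arbitration (the class and method names here are purely hypothetical, not any real hypervisor API), the sketch below shows two guests that each appear to own a device while the hypervisor multiplexes their accesses onto one physical resource:

```python
# Toy model of hypervisor arbitration. Each guest "believes" it writes
# directly to the hardware; in reality every access traps into the
# hypervisor, which multiplexes a single physical device between guests.

class Device:
    """One physical device, e.g. a UART."""
    def __init__(self):
        self.log = []

    def write(self, data):
        self.log.append(data)

class Hypervisor:
    def __init__(self, device):
        self.device = device

    def trap_io_write(self, guest_id, data):
        # Arbitrate: tag each access so guests cannot interfere.
        self.device.write(f"guest{guest_id}: {data}")

class GuestOS:
    def __init__(self, guest_id, hypervisor):
        self.guest_id = guest_id
        self.hv = hypervisor

    def print_line(self, text):
        # Looks like a direct hardware access from the guest's point
        # of view; actually traps into the hypervisor.
        self.hv.trap_io_write(self.guest_id, text)

uart = Device()
hv = Hypervisor(uart)
guest1, guest2 = GuestOS(1, hv), GuestOS(2, hv)
guest1.print_line("hello")
guest2.print_line("world")
print(uart.log)  # → ['guest1: hello', 'guest2: world']
```

Both guests shared the one device without being aware of each other, which is the essence of the arbitration described above.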

A benefit of virtualizing a single guest is that the same software binary remains compatible with multiple hardware configurations, saving development and deployment costs. A benefit of running multiple guests is that a single computer can serve multiple users at once, as in a data-centre environment – this is particularly useful where users mainly run applications with relatively low processing demands, such as online commerce.


For networking it can mean:

  • Virtualization of networking equipment (NFV) to replace physical network functions with virtual functions that run on more generic hardware. The shift places more importance on software for tasks like firewalling and load balancing.
  • Virtualization of the network itself creating overlays of virtual connections over the physical network. The marriage of hardware and software here provides greater cost efficiency and resource management across the network.


For Mobile Client designs virtualization hardware can also provide:

  • Premium Content separation from client apps
  • Reduced bill of materials (BOM) cost by more efficient use of resources such as physical memory
  • Support for the enterprise market: today, many consumers carry two mobile phones, one for business use and another for personal use. With mobile virtualization, a single phone can support multiple domains/operating systems on the same hardware, so that the enterprise IT department securely manages one domain (in a virtual machine) while the mobile operator separately manages the other



Mobile virtualization can reduce the amount of hardware you carry around by running multiple OSes on the same device


ARM is committed to reducing the total cost of SoC development by designing IP blocks that support on-chip processor and I/O virtualization. ARM has developed System MMU and GIC architectures for the above use cases; these were developed alongside the ARMv8-A architecture and scale efficiently to hundreds of CPUs and thousands of end devices. ARM has also developed silicon-proven System MMU and GIC implementations, which are in turn designed and tested with ARMv8-A architecture CPUs and with ARM Interconnect and Dynamic Memory Controller products.


ARM SMMU and GIC implementations combine to provide industry-leading IP that helps efficiently accelerate virtualization in ARM-based Server, Networking and Mobile Client designs.


According to one IT infrastructure provider, "Ultimately, virtualization dramatically improves the efficiency and availability of resources and applications in an organization. Instead of relying on the old model of “one server, one application” that leads to underutilized resources, virtual resources are dynamically applied to meet business needs without any excess fat".



If you found this interesting, we’ll be going into more detail on virtualization at the ARM System IP Technology Seminar on 16th April at the Sheraton Hotel, Hsinchu, Taiwan. If you are in the region, do come along – it is a great opportunity to learn about ARM System IP and provide feedback to ARM product managers face-to-face.


Attendance is free, you can register for attendance via this link: 2015 ARM System IP Technology Seminar


2015 ARM System IP Technology Seminar


Sheraton Hsinchu Hotel, 5F, Chapel


Hsinchu, Taiwan


16 April 2015

1. Does the AXI bus support read data interleaving? The AMBA AXI specification does not seem to describe read data interleaving.

2. I have an example of calculating wrapping burst addresses:

     - Start_address = 16 (decimal)

     - Burst_size = 4

     - Burst_length = 8

This is my result:


Address_1 = Start_address = 16

Address_2 = Aligned_address + (N-1) x Number_bytes = 16 + (2-1) x 4 = 20

Address_3 = Aligned_address + (N-1) x Number_bytes = 16 + (3-1) x 4 = 24

Address_4 = Aligned_address + (N-1) x Number_bytes = 16 + (4-1) x 4 = 28

Address_5 = Aligned_address + (N-1) x Number_bytes = 16 + (5-1) x 4 = 32

--> Because Address_5 = Wrap_boundary + (Number_bytes x Burst_length) = 0 + (4 x 8) = 32

--> Address_5 = Wrap_boundary = 0

--> Address_6 = Aligned_address + (N-1) x Number_bytes = 16 + (5-1) x 4 = 32 (this result follows the AMBA AXI specification PDF)

--> Address_6 = Wrap_boundary + (N-1) x Number_bytes = 0 + (5-1) x 4 = 20 (this result follows http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/index.html )


Can anyone help me calculate Address_6 in this example?

Thank you very much, and good luck to all!
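For reference, the address sequence for a WRAP burst can be computed with a short script. This is my reading of the AXI WRAP rules (address increments by the transfer size each beat and wraps back to the wrap boundary when it reaches wrap_boundary + size x length), offered as a sketch rather than an authoritative answer:

```python
# Sketch of AXI WRAP burst address generation.
# wrap_boundary = start aligned down to (size_bytes * burst_length);
# each beat the address advances by size_bytes, wrapping back to the
# boundary when it reaches boundary + total transfer bytes.

def wrap_burst_addresses(start_address, size_bytes, burst_length):
    total_bytes = size_bytes * burst_length
    wrap_boundary = (start_address // total_bytes) * total_bytes
    addr = start_address          # must already be aligned to size_bytes
    addresses = []
    for _ in range(burst_length):
        addresses.append(addr)
        addr += size_bytes
        if addr == wrap_boundary + total_bytes:
            addr = wrap_boundary  # wrap back to the boundary
    return addresses

# The example from the question: start 16, 4-byte transfers, length 8.
print(wrap_burst_addresses(16, 4, 8))
# → [16, 20, 24, 28, 0, 4, 8, 12]
```

Under this reading, the fifth beat lands on the wrap boundary (0) and the sixth beat continues from there (4).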

The memory sub-system is one of the most complex systems in a SoC, critical for overall performance of the chip. Recent years have witnessed explosive growth in the memory market with high-speed parts (DDR4/3 with/without DIMM support, LPDDR4/3) gaining momentum in mobile, consumer and enterprise systems. This has not only resulted in increasingly complex memory controllers (MC) but also PHYs that connect the memory sub-system to the external DRAM. Due to the high-speed transfer of data between the SoC and DRAM, it is necessary to perform complex training of memory interface signals for best operation.


Traditionally, MC and PHY integration was considered a significant challenge, especially if the two IP blocks originated from different vendors. The key reasons were the rapid evolution of memory protocols, and the DFI interface boundary between controller and PHY being incompletely specified, or in some cases ambiguous, with respect to the requirements for MC-PHY training.

DMC Blog Picture.png

Why is MC-PHY integration NOT such a big issue now?

I’ll try to shed some light on this topic.  Recently, with the release of the DFI 4.0 draft specification for MC-PHY interface, things certainly seem to be heading in the right direction. For folks unfamiliar with DFI, this is an industry standard that defines the boundary signals and protocol between any generic MC and PHY. Since the inception of DFI 1.0 back in 2006, the specification has steadily advanced to cover all aspects of MC-PHY operation encompassing all relevant DRAM technology requirements. The DFI 4.0 specification is more mature compared to previous releases and specifically focuses on backwards compatibility and MC-PHY interoperability.


But that’s not the only reason why MC-PHY integration has gotten easier. To understand this better, we need to examine how MC and PHY interact during training. There are two fundamental ways that training of memory signals can happen:


  1. PHY evaluation mode or DFI Training mode  - This training mode can be initiated either by the PHY or the MC. Regardless of which side initiates the training, the MC sets up the DRAM for gate/read data eye/write/CA training and periodically issues training commands such as reads or writes. The PHY is responsible for determining the correct delay programming for each operation but the MC has to assist by enabling and disabling the leveling logic in the DRAMs and the PHY, and by generating the necessary read, mode register read or write strobe commands. The DFI training mode thus imposes a significant effort on the MC, and was mandatory in the earlier DFI 3.1 specification. However, in DFI 4.0, this training mode has become optional for MC.
  2. PHY independent mode – This is a mode where the PHY performs DRAM training with little involvement from the MC. The PHY generates all the read/write commands and programs the delays for each operation while the MC waits patiently for ‘done’ status. In DFI 3.1, the PHY independent mode could be initiated through a DFI update request protocol or through a special non-DFI training mode. In DFI 4.0, the non-DFI training mode has been replaced with a more generalized PHY master interface that enables the PHY to train the memory signals along with other low power operations.
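The PHY-independent handshake in item 2 can be sketched as a toy control-flow model. The real DFI 4.0 PHY master interface uses dedicated signals (such as dfi_phymstr_req/ack) and carries memory-state information that this simplified sketch omits; the class and method names below are hypothetical:

```python
# Toy model of PHY-independent training: the PHY requests ownership of
# the memory interface, trains with its own DRAM commands, then hands
# control back to the memory controller (MC).

class MemoryController:
    def __init__(self):
        self.owns_bus = True

    def grant_phy_master(self):
        # Put the DRAM into the agreed state (e.g. banks precharged),
        # then hand the interface over to the PHY.
        self.owns_bus = False
        return True

    def training_done(self):
        # PHY hands back control; the MC resumes normal traffic.
        self.owns_bus = True

class PHY:
    def __init__(self):
        self.trained = False

    def request_training(self, mc):
        if mc.grant_phy_master():   # ~ assert req, wait for ack
            self.run_training()     # PHY issues its own read/write commands
            mc.training_done()      # release the interface back to the MC

    def run_training(self):
        self.trained = True         # delay programming would happen here

mc = MemoryController()
phy = PHY()
phy.request_training(mc)
print(phy.trained, mc.owns_bus)  # → True True
```

The key point the sketch captures is the division of labour: the MC only quiesces memory and grants access, while the PHY owns the training algorithm.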


Interestingly, PHY IP providers have decided to take ownership of training by implementing support for PHY independent mode in their IP, retaining control over optimizing the training algorithms for their PHY architecture. With PHY complexity growing and the challenge of closing timing at high DDR speeds, support for PHY independent mode training is a valuable differentiator for PHY IP providers.



What is the memory controller’s role during PHY-independent mode training?

With the PHY doing most of the heavy lifting during the training, the MC only needs to focus on two questions:

  1. When does a request for training happen?
  2. In what state does memory need to be when handing control to the PHY for training, and in what state will memory be when PHY hands the control back to the MC?


The MC thus deals with the PHY’s request for independent-mode training as an interrupt: something it needs to schedule along with the multitude of other things it does for best memory operation. Training thus becomes a Quality-of-Service (QoS) exercise for the controller, with a different set of parameters to optimize. The upside is that QoS is essentially what a good MC does very well.
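One way to picture training-as-QoS is as a priority arbiter: the pending training request is serviced ahead of bulk traffic but behind latency-critical real-time requests. The request classes and priority values below are hypothetical, chosen only to illustrate the idea:

```python
# Toy QoS arbiter: the MC services pending requests in priority order,
# treating a PHY training request as one request class among others.

import heapq

# Hypothetical priority classes (lower value = serviced first).
PRIO = {"realtime": 0, "training": 1, "bulk": 2}

def schedule(requests):
    """Return requests in the order the arbiter would service them.

    Ties are broken by arrival order (the index), so equal-priority
    requests are served first-come-first-served.
    """
    heap = [(PRIO[kind], i, kind) for i, kind in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

order = schedule(["bulk", "training", "realtime", "bulk"])
print(order)  # → ['realtime', 'training', 'bulk', 'bulk']
```

A real controller's arbitration is far richer (deadlines, bank state, reordering for efficiency), but the shape of the decision is the same: training is just another request with a priority.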



But what about proof of working silicon for MC-PHY integration?

With clarity at the DFI interface, silicon proof is really a burden on the PHY, because it has to train correctly at high speeds and provide a good data eye. The risk of critical MC bugs that can only be found through silicon proof is low, and a strong architecture and design/verification methodology can help eliminate it. So the demands on the MC have shifted away from MC-PHY interoperability and toward performance (memory bandwidth and latency).


Does your MC operate with the best performance in a realistic system scenario with guaranteed QoS, shortest path to memory, guaranteed response to real-time traffic, etc.?

I am leaving that as the topic of my next blog.


ARM is building state-of-the-art memory controllers with an emphasis on CPU-to-memory performance, supporting the DFI-based PHY solutions available in the market today. We have set up partnerships with third-party PHY providers to ensure that integration at the DFI boundary is seamless for the end customer. ARM’s controllers support all the different training modes used by different PHYs, giving customers flexibility in choosing the best overall solution for their memory sub-system deployment.


Thanks for reading my blog, I welcome your feedback.


Join the ARM Server Segment next week at the Open Compute Project’s yearly main event in San Jose, CA. Our team will be on hand in support of our ARM partners and also to support OCP’s mission of providing ever more efficient data center server solutions. OCP Summit is one of the best-attended data center oriented shows each year, bringing together all tiers of the server supply chain and showcasing a wide variety of compute technologies for Fortune 500 customers. We are excited to be a sponsor of this event.



We will be located in Booth #D14, tucked in along the back wall of the exhibition floor, as shown below.


ARM plays an important role in driving an ever broader, more competitive, more efficient, and more robust ecosystem from which data center customers can choose. That said, it is never just about ARM, but about our partners, their designs, and their inspirations that leverage the underlying ARM architecture and ecosystem to deliver optimized and differentiated products into the data center space. Please visit the booths of ARM partners such as HP, Wiwynn, Linaro, Microsoft, AppliedMicro and Cavium.


Our booth will feature server products from our partners SoftIron, Wiwynn, and Prodrive as well as SoC partners AMD and TI.

2015 is an exciting year for ARM in the datacenter as we expect to see AMD and Cavium join the ranks of Applied Micro in shipping production ARM-based server and networking solutions.

Highlighting our ecosystem enablement work of the past few years and the emergence of 64-bit ARMv8 chips in the marketplace, we will be raffling off 96Boards.org developer reference boards which sport a 64-bit ARM-based Quad A53 SoC, as shown below.

These boards are the size of a business card, or perhaps a bit smaller, and we are excited to have low-cost 64-bit solutions on hand for our software ecosystem partners in 2015.


Again, please stop by our booth at OCP Summit 2015, or track down one of the ARM server segment attendees (Lakshmi Mandyam, Jeff Underhill, or myself). We’d love to hear your ideas and feedback on how we can partner to deliver efficient ARM-based solutions into the data center in 2015. Safe travels!


OCP Week Press Summary

We will be tracking any and all ARM partner related announcements and activity, and posting the associated links here in this blog space as OCP week unfolds.


* Applied Micro & Gigabyte: https://www.***.com/news/gigabyte-and-appliedmicro-announce-commercial-availability-of-gigabyte-mp30/

* Cavium & Hyve Solutions:  http://www.cavium.com/newsevents-Hyve-Solutions-and-Cavium-Collaborate-to-Deliver-64-bit-ARM-based-Server-Solutions.html

* Cavium & Stack Velocity: http://www.cavium.com/newsevents-StackVelocity-and-Cavium-Partner-to-Bring-Advanced-ARM-Processor-Efficiency-to-Open-Compute-Project.html

* Datacentered & CodeThink host OpenStack powered cloud: http://datacentred.co.uk/datacentred-world-first-openstack-public-cloud-on-64-bit-arm-servers/
