In early 2015, ARM announced a suite of IP for Premium Mobile designs, with the ARM® Cortex®-A72 Processor delivering a 3.5x increase in sustained delivered performance over 28nm Cortex-A15 designs from just two years ago. At ARM we are focused on delivering IP that enables the most efficient computing, and the Cortex-A72 micro-architecture is the result of several enhancements that increase performance while simultaneously decreasing power. Last week, at the Linley Mobile Conference, I had the pleasure of introducing the audience to the micro-architectural detail of this new processor, which I thought would be worth summarizing here as there has been considerable interest.
From a CPU performance view, we have seen tremendous growth: a 50x increase in the last five years (15x at the individual core level). The graph below zooms in on single-core performance, broken into floating point, memory, and integer workloads. All points shown up to 2015 are measured from devices in market; Cortex-A72 is projected based on lab measurements to give a preview of what is expected later in 2015 and in 2016. The micro-architectural improvements in Cortex-A72 result in a tremendous increase across all three aspects – floating point, CPU memory, and integer performance. For the next generation of mobile designs, Cortex-A72, particularly on 14nm/16nm process technology, is going to change the game – the combination of the performance and efficiency of this process node and CPU is extremely compelling.
The improvements shown here come from changes at the micro-architectural level coupled with increasing clock frequencies. But delivering peak performance alone isn't the designer's challenge – mobile devices are characterized by the constrained thermal envelope that SoC designers have to operate within. Hence, to increase performance within the mobile power and thermal envelope, simply turning up the frequency or increasing the issue rate in the micro-architecture isn't the answer – you have to improve power efficiency.
The Cortex-A72 micro-architectural improvements increase efficiency so much that it can deliver the same performance as Cortex-A15 at half the power even on 28nm, and at 75% less power on 14/16nm FinFET nodes. The performance of a Cortex-A15 CPU can be reproduced on the Cortex-A72 processor at reduced frequency and voltage, resulting in a dramatic power reduction. However, mobile apps often push the CPU to maximum performance rather than to a specific required level of performance. In that case, a 2.5GHz Cortex-A72 CPU consumes 30-35% less power than the 28nm Cortex-A15 processor while still delivering more than 2x the peak performance.
Enhancements to the Cortex-A72 micro-architecture
Below is a simplified view of the micro-architecture. Those familiar with the Cortex-A57 pipeline will recognize that the Cortex-A72 CPU sports a similar 3-wide decode, 8-wide issue pipeline. However, in Cortex-A72 the dispatch unit has been widened to deliver up to five instructions (micro-ops) per cycle to the execution pipelines.
I list here some key changes and the difference they make (an exhaustive list would be too long!) that highlight the way in which the design of Cortex-A72 CPU was approached, beginning with the pipeline front end.
Pipeline front end
One of the most impactful changes in the Cortex-A72 micro-architecture is the move to a sophisticated new branch prediction unit. There is an interesting trade-off here - a larger branch predictor can cost more power, but for realistic workloads where branch misses occur, the new predictor’s reduction in branch miss rate more than pays for itself in reduction of mis-prediction and mis-speculation. This reduces overall power, while simultaneously improving performance across a broad range of benchmarks.
The instruction cache has been redesigned to optimize tag lookup such that the power of the 3-way cache is similar to that of a direct-mapped cache – an early tag lookup means only one way of the data RAM is read instead of all three. The TLBs and micro-BTBs have been regionalized, so that the upper bits can be disabled for the common case where page lookups and branch targets are nearby rather than far away. Similarly, small-offset branch-target optimizations reduce power when the branch target is close. Suppression of superfluous branch-predictor accesses reduces power in large basic blocks of code – the Cortex-A72 recognizes these and does not access the branch predictor while executing them.
Decode/Rename block
Of the many changes in the decode block, the biggest is in the handling of micro-ops – the Cortex-A72 keeps instructions in their more complex form up to the dispatch stages, which increases performance and reduces decode power. AArch64 instruction-fusion capability deepens the window for instruction-level parallelism. In addition, the decode block has undergone extensive decoder power optimization, with buffer and flow-control optimizations throughout the decode/rename unit.
Dispatch/Retire
In the dispatch/retire section of the pipeline, the effective dispatch bandwidth has increased to 5-wide dispatch, offering increased performance (by increasing instruction throughput) while reducing decode power – handling full instructions rather than micro-ops gets more work done per stage. Cortex-A72 also features a power-optimized reorganization of the architectural and speculative register files, with significant reductions in ports and area, along with optimizations to the commit queue and register-status FIFOs, arranging them in a more power-efficient manner.
One final example of the improvements in the dispatch/retire section is the suppression of superfluous register-file accesses - detecting cases where operand data is guaranteed to be in the forwarding network. Every time you avoid a read from the register file, you save power.
Floating Point Unit and Advanced SIMD
Here the biggest improvement is the introduction of new, lower-latency FP functional units. The resulting latencies are very fast, comparable with the latest high-performance server and PC CPUs. Floating point latency is important in typical mobile and consumer use cases, where there is commonly a mix of FP and integer work and the latency between computation and result is critical. Shorter latencies mean integer instructions waiting on FP results are less likely to be stalled.
This performance increase shows up in SPECfp2000 and SPECfp2006 as an uplift of approximately 25%. This type of improvement matters less for high-performance compute applications, where pure floating point throughput is what counts; in mobile use cases, floating point shows up in combination with integer work. A good example of this combination is JavaScript code, where the native numeric type is a double-precision float. In addition, the divide unit has moved to a radix-16 FP divider, doubling the throughput of divide instructions.
Other improvements in this area of the design include an improved issue-queue load-balancing algorithm, and multiple zero-cycle forwarding data paths resulting in improved performance and reduced power. Finally, the design features a source-reduction in the integer issue-queue which cuts power without performance loss.
Load/Store unit
The Load/Store unit features several key optimizations. The main improvement is the replacement of the prefetcher with a more sophisticated combined L1/L2 data prefetcher – it is more advanced and recognizes more streams. The Load/Store unit also includes late-pipe power reduction with an L1 D-cache hit predictor. Performance tuning of the load/store buffer structures improves both effective utilization and capacity. Increased parallelism in the MMU table walker and a reduced-latency L2 main TLB improve performance on typical code where data is spread across many pages – for example in web browsing, where data memory pages typically change frequently. Finally, there is power optimization in the configurable pipeline support logic and extensive power optimization in L2-idle scenarios. The combination of the new prefetcher and other changes enables an increase of more than 50% in CPU memory bandwidth.
In summary, the Cortex-A72 micro-architecture is a significant change from Cortex-A57: several changes to the micro-architecture work in concert to produce a big step up in performance while consuming less power. The important takeaways for the ARM Cortex-A72 are:
Cortex-A72 is truly the next generation of high-efficiency compute. A Cortex-A72-based system increases performance and system bandwidth further when combined with the ARM CoreLink CCI-500 interconnect. A premium mobile subsystem will also contain the Cortex-A53 in a big.LITTLE CPU subsystem to reduce power and increase sustained performance in thermally constrained workloads such as gaming. Finally, the ARM Mali-T880 graphics processor can combine with Cortex-A72 and CCI-500 to create a premium mobile system. That's all for this edition of the blog. What further features of the Cortex-A72 and the big.LITTLE premium mobile system are you interested in hearing about in future editions?
Well, if I understand you right, that's an easy one. None of the ARM processor cores implements multi-threading yet. If they do implement it, I guess it will be in servers, where one is not quite so concerned with saving energy when the processor has only one task – they get a continuous stream of things to do, and if the workload goes down one can switch off a bank at a time.
Thanks for the extremely informative write up!
How effective is the Cortex A72 (and for that matter, the A57) at hiding pipeline stalls when confronted with multiple threads and instructions that have a multi-cycle latency? For example, if a SQRT operation is dispatched, can the operation (which as I understand can take 10s of cycles) be "hidden" with other operations in a multi-threaded workload, or does the processor stall until the calculation has been completed, holding up subsequent calls from other threads?