In early 2015, ARM announced a suite of IP for Premium Mobile designs, with the ARM® Cortex®-A72 Processor delivering a 3.5x increase in sustained delivered performance over 28nm Cortex-A15 designs from just two years ago. At ARM we are focused on delivering IP that enables the most efficient computing, and the Cortex-A72 micro-architecture is the result of several enhancements that increase performance while simultaneously decreasing power. Last week, at the Linley Mobile Conference, I had the pleasure of introducing the audience to the micro-architectural detail of this new processor, which I thought would be worth summarizing here given the interest it has generated.
From a CPU Performance view, we have seen a tremendous growth in performance: a 50x increase in the last five years (15x at the individual core level). The graph below zooms in on performance increases in single core workloads broken into floating point, memory, and integer performance. All points shown up to 2015 are measured from devices in market, and Cortex-A72 is projected based on lab measurements to give a preview of what is expected later in 2015 and in 2016. The micro-architectural improvements in Cortex-A72 result in a tremendous increase across all aspects – floating point, CPU memory and integer performance. For the next generation of mobile designs, Cortex-A72, particularly on 14nm/16nm process technology, is going to change the game – the combination of the performance and efficiency of this process node and CPU are extremely compelling.
The improvements shown here have come through improvements at the micro-architectural level coupled with increasing clock frequencies. But delivering peak performance alone isn’t the only challenge for designers – mobile devices are characterized by the constrained thermal envelope that SoC designers have to operate within. Hence, to increase performance within the mobile power and thermal envelope, turning up the frequency or increasing the issue rate in the micro-architecture isn’t the answer – you have to improve power efficiency.
The Cortex-A72 micro-architectural improvements increase efficiency so much that it can deliver the same performance as Cortex-A15 at half the power even on 28nm, and at 75% less power on 14/16nm FinFET nodes. The performance of a Cortex-A15 CPU can be reproduced on the Cortex-A72 processor at reduced frequency and voltage, resulting in a dramatic power reduction. However, mobile apps often push the CPU to maximum performance rather than a specific absolute required level of performance. In this case, a 2.5GHz Cortex-A72 CPU consumes 30–35% less power than the 28nm Cortex-A15 processor, while still delivering more than 2x the peak performance.
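The power reduction from running at a lower frequency and voltage follows from the standard dynamic-power relation P = C·V²·f. Here is a minimal sketch of that scaling; the capacitance constant and the voltage/frequency design points are hypothetical numbers chosen only to illustrate the effect, not measured Cortex-A72 values.

```python
# Illustrative model of dynamic CPU power: P = C * V^2 * f.
# All numeric design points below are hypothetical, chosen only to
# show how reducing frequency (and hence voltage) cuts power.

def dynamic_power(c_eff, voltage, freq_ghz):
    """Dynamic power in arbitrary units: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_ghz

# Baseline design point (hypothetical numbers).
p_base = dynamic_power(c_eff=1.0, voltage=1.0, freq_ghz=2.0)

# Delivering the same performance at 20% lower frequency allows a
# lower supply voltage too; power falls with V^2 * f.
p_scaled = dynamic_power(c_eff=1.0, voltage=0.8, freq_ghz=1.6)

print(p_scaled / p_base)  # ~0.51: roughly half the power
```

Because voltage enters quadratically, even a modest frequency reduction that unlocks a voltage reduction produces an outsized power saving – which is why matching Cortex-A15 performance at a lower operating point halves the power.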
Enhancements to the Cortex-A72 micro-architecture
Below is a simplified view of the micro-architecture. Those familiar with the Cortex-A57 pipeline will recognize that the Cortex-A72 CPU sports a similar 3-wide decode, 8-wide issue pipeline. However, in Cortex-A72 the dispatch unit has been widened to deliver up to 5 instructions (micro-ops) per cycle to the execution pipelines.
Here I list some key changes and the difference they make (an exhaustive list would be too long!), highlighting the way in which the design of the Cortex-A72 CPU was approached, beginning with the pipeline front end.
Pipeline front end
One of the most impactful changes in the Cortex-A72 micro-architecture is the move to a sophisticated new branch prediction unit. There is an interesting trade-off here - a larger branch predictor can cost more power, but for realistic workloads where branch misses occur, the new predictor’s reduction in branch miss rate more than pays for itself in reduction of mis-prediction and mis-speculation. This reduces overall power, while simultaneously improving performance across a broad range of benchmarks.
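To make the trade-off concrete, here is a minimal sketch of the classic 2-bit saturating-counter scheme that underlies dynamic branch prediction. This is a generic textbook structure, not the actual Cortex-A72 predictor (whose internal design is not public); it simply shows why even a small predictor captures loop branches well, and why harder, irregular patterns motivate a larger, smarter unit.

```python
# A minimal 2-bit saturating-counter branch predictor: a generic
# textbook sketch, NOT the actual Cortex-A72 predictor design.

class TwoBitPredictor:
    def __init__(self, table_size=1024):
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Start every entry at 1 (weakly not-taken).
        self.counters = [1] * table_size

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch: taken 9 times, then not-taken once on loop exit.
p = TwoBitPredictor()
misses = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x40) != taken:
        misses += 1
    p.update(0x40, taken)

print(misses)  # 2: the first iteration and the final loop exit
```

Each avoided misprediction saves the energy of flushing and re-fetching the speculated instructions, which is how a more capable predictor can pay for its own power budget.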
The instruction cache has been redesigned to optimize tag look-up such that the power of the 3-way cache is similar to the power of a direct mapped cache – doing early tag lookup in just one way of the data RAM instead of 3 ways. The TLBs and micro BTBs have been regionalized, so that the upper bits can be disabled for the common case when page lookups and branch targets are closer rather than farther away. Similarly, small-offset branch-target optimizations reduce power when your branch target is close. Suppression of superfluous branch predictor accesses will reduce power in large basic blocks of code – the A72 recognizes these and does not access the branch predictor during those loops to save power.
Of the many changes in the decode block, the biggest change is in handling of microOps – the Cortex-A72 keeps them more complex up to the dispatch stages – this increases performance and reduces decode power. AArch64 instruction-fusion capability deepens the window for instruction level parallelism. In addition to this, the decode block has undergone extensive decoder power optimization, with buffer optimization and flow-control optimizations throughout the decode/rename unit.
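The idea behind instruction fusion is that certain adjacent instruction pairs can travel the pipeline as a single micro-op, occupying one slot in the out-of-order window instead of two. The sketch below illustrates the concept with a compare-and-branch pair; the specific pairing rule and mnemonics here are a simplified illustration, not ARM's documented fusion cases for Cortex-A72.

```python
# Sketch of instruction fusion: an adjacent CMP + conditional-branch
# pair is merged into one micro-op. The fusion rule shown is a
# simplified illustration, not ARM's documented pairing for Cortex-A72.

def fuse(instructions):
    """Merge CMP immediately followed by B.<cond> into one micro-op."""
    fused, i = [], 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i].startswith("CMP")
                and instructions[i + 1].startswith("B.")):
            fused.append(instructions[i] + " + " + instructions[i + 1])
            i += 2
        else:
            fused.append(instructions[i])
            i += 1
    return fused

stream = ["ADD x0, x0, #1", "CMP x0, #10", "B.NE loop", "RET"]
print(fuse(stream))  # 4 instructions become 3 micro-ops
```

By packing more architectural work into each micro-op, the effective instruction window deepens without growing the underlying structures – more instruction-level parallelism for the same hardware.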
In the dispatch/retire section of the pipeline, the effective dispatch bandwidth has increased to 5-wide dispatch, offering increased performance (by increasing instruction throughput) while reducing decode power – decoding full instructions rather than micro-ops gets more work done per stage for those instructions. Cortex-A72 also features a power-optimized reorganization of the architectural and speculative register files, with a significant reduction in ports and area. It also features optimizations to the commit-queue and register-status FIFOs, organizing them in a more power-efficient manner.
One final example of the improvements in the dispatch/retire section is the suppression of superfluous register-file accesses - detecting cases where operand data is guaranteed to be in the forwarding network. Every time you avoid a read from the register file, you save power.
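The mechanism can be illustrated with a small model: if an operand was produced by an instruction still within the bypass (forwarding) window, its value is available on the forwarding network and the register-file read port can stay idle. The window depth and instruction encoding below are illustrative assumptions, not Cortex-A72 specifics.

```python
# Sketch of suppressing superfluous register-file reads: operands
# produced within the last few instructions are still on the
# forwarding (bypass) network, so no register-file read is needed.
# The window depth and instruction format are illustrative assumptions.

FORWARD_WINDOW = 2  # hypothetical bypass depth, in instructions

def regfile_reads(instrs):
    """Count register-file reads, skipping operands found in the bypass.

    Each instruction is (dest_reg, [source_regs]).
    """
    reads = 0
    recent_dests = []                 # destinations of recent instructions
    for dest, srcs in instrs:
        for s in srcs:
            if s not in recent_dests: # not forwardable: must read the RF
                reads += 1
        recent_dests = (recent_dests + [dest])[-FORWARD_WINDOW:]
    return reads

# x1 = f(x0); x2 = f(x1); x3 = f(x2, x1) -- back-to-back dependent ops.
chain = [("x1", ["x0"]), ("x2", ["x1"]), ("x3", ["x2", "x1"])]
print(regfile_reads(chain))  # only 1 read (x0); x1 and x2 are bypassed
```

Dependent code like this is common, so detecting guaranteed-forwardable operands eliminates a large fraction of register-file read energy.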
Floating Point Unit and Advanced SIMD
Here the biggest improvement is the introduction of new FP functional units with significantly reduced instruction latencies.
These are very fast floating-point latencies, comparable with the latest high-performance server and PC CPUs. Floating-point latency is important in typical mobile and consumer use cases, where there is commonly a mix of FP and integer work. In these settings, the latency between computation and result is critical: shorter latencies mean instructions waiting on those FP results are less likely to be stalled.
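A toy timing model makes the effect visible: the completion time of a chain of dependent FP operations scales directly with the unit latency, whereas independent operations pipeline behind one another. The latency values below are illustrative, not the actual Cortex-A72 numbers.

```python
# Toy timing model: why FP latency matters for dependent code.
# Latency values are illustrative, not actual Cortex-A72 figures.

def chain_cycles(n_ops, latency):
    """Each op waits on the previous result: n * latency cycles."""
    return n_ops * latency

def independent_cycles(n_ops, latency, issue_per_cycle=1):
    """Independent ops pipeline: issue time plus one final latency."""
    return (n_ops - 1) // issue_per_cycle + latency

print(chain_cycles(8, latency=5))        # 40 cycles at 5-cycle latency
print(chain_cycles(8, latency=3))        # 24 cycles at 3-cycle latency
print(independent_cycles(8, latency=5))  # 12 cycles when independent
```

Real code sits between the two extremes, but mixed FP/integer workloads contain plenty of short dependent chains, which is exactly where cutting a cycle or two of latency shows up as delivered performance.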
Other improvements in this area of the design include an improved issue-queue load-balancing algorithm, and multiple zero-cycle forwarding data paths resulting in improved performance and reduced power. Finally, the design features a source-reduction in the integer issue-queue which cuts power without performance loss.
The Load/Store unit features several key optimizations. The main improvement is the replacement of the prefetcher with a more sophisticated combined L1/L2 data prefetcher – it is more advanced and recognizes more streams. The Load/Store unit also includes late-pipe power reduction with an L1 D-cache hit predictor. Performance tuning of the Load/Store buffer structures improves both effective utilization and capacity. Increased parallelism in the MMU table-walker and a reduced-latency L2 main TLB improve performance in typical code scenarios where data is spread across many pages, for example in web browsing, where data memory pages typically change frequently. Finally, there is power optimization in the configurable pipeline support logic, and extensive power optimization in the L2-idle scenario. The combination of the new prefetcher and other changes enables an increase of more than 50% in CPU memory bandwidth.
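The core idea behind a stream prefetcher is stride detection: once the same address delta repeats, the hardware can start fetching ahead of demand. Here is a minimal sketch of that detection step; the confirmation threshold and addresses are illustrative, and the real Cortex-A72 prefetcher tracks many such streams concurrently across L1 and L2.

```python
# Sketch of stride-based stream detection, the core idea behind a
# hardware data prefetcher. Thresholds and addresses are illustrative,
# not Cortex-A72 specifics.

def detect_stride(addresses, confirmations=2):
    """Return the stride once it has repeated `confirmations` times."""
    stride, seen = None, 0
    for prev, cur in zip(addresses, addresses[1:]):
        delta = cur - prev
        if delta == stride:
            seen += 1
            if seen >= confirmations:
                return stride      # stream confirmed: start prefetching
        else:
            stride, seen = delta, 1
    return None                    # no steady stream detected

print(detect_stride([0x1000, 0x1040, 0x1080, 0x10C0]))  # 64-byte stride
print(detect_stride([0x1000, 0x2345, 0x0FFF, 0x7000]))  # None: irregular
```

Once a stream is confirmed, the prefetcher issues requests for the next few predicted addresses, so by the time the demand load arrives the data is already in the cache – which is how prefetching contributes to the memory-bandwidth gain described above.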
In summary, the Cortex-A72 micro-architecture is a significant change from Cortex-A57. Several changes to the micro-architecture work in concert to produce a big step up in performance while consuming less power. The important takeaways for the ARM Cortex-A72 are:
Cortex-A72 is truly the next generation of high-efficiency compute. A Cortex-A72 CPU based system increases performance and system bandwidth further when combined with ARM CoreLink CCI-500 Interconnect. A premium mobile subsystem will also contain the Cortex-A53 in a big.LITTLE CPU subsystem to reduce power and increase sustained performance in thermally constrained workloads such as gaming. Finally, the ARM Mali-T880 graphics processor can combine with Cortex-A72 and CCI-500 to create a premium mobile system. That’s all for this edition of the blog. What further features of the Cortex-A72 and the big.LITTLE premium mobile system are you interested in hearing about in future editions?
I think that this is a big part of what I'm asking, and I suppose I need to learn a lot about the way that modern CPU architectures operate! Thank you for considering my question, and for being most helpful!
You mean instruction scheduling, as in out-of-order execution (Out-of-order execution - Wikipedia, the free encyclopedia)?
There are some documents that give an outline of that; for the A72, for instance, you can find the following in the documentation section under self-service resources.
If the processor is waiting for some data to be loaded and it isn't in the cache, then other operations that don't depend on the result are executed. It can also speculatively execute instructions when it predicts a branch, throwing the results away if the prediction was wrong – but with a few loads outstanding it will eventually just stall until it can do something useful.
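That behaviour can be sketched with a tiny scheduling model: each cycle, the oldest instruction whose source operands are ready is issued, so independent work proceeds in the shadow of an outstanding load. The instruction format and latencies below are simplified illustrations, not a model of any real core.

```python
# Sketch of out-of-order issue: each cycle, issue the oldest
# instruction whose sources are ready. Format and latencies are
# simplified illustrations, not a model of any real core.

def schedule(instrs, latencies):
    """Greedy OoO scheduler; returns the issue order of instructions.

    Each instruction is (name, dest_reg, [source_regs]).
    """
    ready_at = {}                  # register -> cycle its value is ready
    done, order, cycle = set(), [], 0
    while len(done) < len(instrs):
        for i, (name, dest, srcs) in enumerate(instrs):
            if i in done:
                continue
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                order.append(name)
                ready_at[dest] = cycle + latencies[name]
                done.add(i)
                break              # one issue per cycle in this sketch
        cycle += 1
    return order

# "use" depends on the slow load; "add1"/"add2" are independent and
# issue in the shadow of the load instead of stalling behind it.
instrs = [("load", "x0", []), ("use", "x1", ["x0"]),
          ("add1", "x2", []), ("add2", "x3", [])]
print(schedule(instrs, {"load": 4, "use": 1, "add1": 1, "add2": 1}))
```

In program order, "use" would stall everything behind the 4-cycle load; out of order, the two independent adds complete during the load's latency, and only "use" waits for the data. Once no independent work remains, the core stalls just as described above.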
Apologies... My question was worded poorly, as I understood your previous message. What I meant to ask (more specifically) for is how modern ARM CPU cores and compilers handle stall-situations in multi-threading (and single-threaded) scenarios with instructions that have a greater-than-one-cycle latency (eg. SQRT, Load-stores, etc)?
There seems to be some misunderstanding. This article on Wikipedia about multithreading seems fairly reasonable:
Multithreading (computer architecture) - Wikipedia, the free encyclopedia
This is the type of multithreading I said current ARM cores do not support. Do you have something else in mind?
Thanks for the prompt reply, it is very much appreciated!
Do you have links to literature offering deeper clarity into the way modern ARM Cortex-A cores handle multi-threading?