In early 2015, ARM announced a suite of IP for Premium Mobile designs, with the ARM® Cortex®-A72 Processor delivering a 3.5x increase in sustained delivered performance over 28nm Cortex-A15 designs from just two years ago. At ARM we are focused on delivering IP that enables the most efficient computing, and the Cortex-A72 micro-architecture is the result of several enhancements that increase performance while simultaneously decreasing power. Last week, at the Linley Mobile Conference, I had the pleasure of introducing the audience to the micro-architectural detail of this new processor, which I thought would be worth summarizing here given the interest it has generated.
From a CPU Performance view, we have seen a tremendous growth in performance: a 50x increase in the last five years (15x at the individual core level). The graph below zooms in on performance increases in single core workloads broken into floating point, memory, and integer performance. All points shown up to 2015 are measured from devices in market, and Cortex-A72 is projected based on lab measurements to give a preview of what is expected later in 2015 and in 2016. The micro-architectural improvements in Cortex-A72 result in a tremendous increase across all aspects – floating point, CPU memory and integer performance. For the next generation of mobile designs, Cortex-A72, particularly on 14nm/16nm process technology, is going to change the game – the combination of the performance and efficiency of this process node and CPU are extremely compelling.
The improvements shown here have come through improvements at the micro-architectural level coupled with increasing clock frequencies. But delivering peak performance alone isn’t the only challenge for designers – mobile devices are characterized by the constrained thermal envelope that SoC designers have to operate within. Hence, to increase performance within the mobile power and thermal envelope, turning up the frequency or increasing the issue rate in the micro-architecture isn’t the answer – you have to improve power efficiency.
The Cortex-A72 micro-architectural improvements increase efficiency so much that it can deliver the same performance as Cortex-A15 at half the power even on 28nm, and at 75% less power on 14/16nm FinFET nodes. The performance of a Cortex-A15 CPU can be reproduced on the Cortex-A72 processor at reduced frequency and voltage, resulting in a dramatic power reduction. However, mobile apps often push the CPU to maximum performance rather than a specific absolute required level of performance. In this case, a 2.5GHz Cortex-A72 CPU consumes 30–35% less power than the 28nm Cortex-A15 processor, while still delivering more than 2x the peak performance.
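The power reduction from running at a lower frequency and voltage follows from the standard dynamic-power relation P = C·V²·f. Here is a minimal sketch of that scaling; the capacitance constant and the voltage/frequency design points are hypothetical numbers chosen only to illustrate the effect, not measured Cortex-A72 values.

```python
# Illustrative model of dynamic CPU power: P = C * V^2 * f.
# All numeric design points below are hypothetical, chosen only to
# show how reducing frequency (and hence voltage) cuts power.

def dynamic_power(c_eff, voltage, freq_ghz):
    """Dynamic power in arbitrary units: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_ghz

# Baseline design point (hypothetical numbers).
p_base = dynamic_power(c_eff=1.0, voltage=1.0, freq_ghz=2.0)

# Delivering the same performance at 20% lower frequency allows a
# lower supply voltage too; power falls with V^2 * f.
p_scaled = dynamic_power(c_eff=1.0, voltage=0.8, freq_ghz=1.6)

print(p_scaled / p_base)  # ~0.51: roughly half the power
```

Because voltage enters quadratically, even a modest frequency reduction that unlocks a voltage reduction produces an outsized power saving – which is why matching Cortex-A15 performance at a lower operating point halves the power.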
Enhancements to the Cortex-A72 micro-architecture
Below is a simplified view of the micro-architecture. Those familiar with the Cortex-A57 pipeline will recognize that the Cortex-A72 CPU sports a similar 3-wide decode, 8-wide issue pipeline. However, in Cortex-A72 the dispatch unit has been widened to deliver up to 5 instructions (micro-ops) per cycle to the execution pipelines.
Here I list some key changes and the difference they make (an exhaustive list would be too long!), highlighting the way in which the design of the Cortex-A72 CPU was approached, beginning with the pipeline front end.
Pipeline front end
One of the most impactful changes in the Cortex-A72 micro-architecture is the move to a sophisticated new branch prediction unit. There is an interesting trade-off here - a larger branch predictor can cost more power, but for realistic workloads where branch misses occur, the new predictor’s reduction in branch miss rate more than pays for itself in reduction of mis-prediction and mis-speculation. This reduces overall power, while simultaneously improving performance across a broad range of benchmarks.
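To make the trade-off concrete, here is a minimal sketch of the classic 2-bit saturating-counter scheme that underlies dynamic branch prediction. This is a generic textbook structure, not the actual Cortex-A72 predictor (whose internal design is not public); it simply shows why even a small predictor captures loop branches well, and why harder, irregular patterns motivate a larger, smarter unit.

```python
# A minimal 2-bit saturating-counter branch predictor: a generic
# textbook sketch, NOT the actual Cortex-A72 predictor design.

class TwoBitPredictor:
    def __init__(self, table_size=1024):
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Start every entry at 1 (weakly not-taken).
        self.counters = [1] * table_size

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch: taken 9 times, then not-taken once on loop exit.
p = TwoBitPredictor()
misses = 0
for taken in [True] * 9 + [False]:
    if p.predict(0x40) != taken:
        misses += 1
    p.update(0x40, taken)

print(misses)  # 2: the first iteration and the final loop exit
```

Each avoided misprediction saves the energy of flushing and re-fetching the speculated instructions, which is how a more capable predictor can pay for its own power budget.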
The instruction cache has been redesigned to optimize tag look-up such that the power of the 3-way cache is similar to the power of a direct mapped cache – doing early tag lookup in just one way of the data RAM instead of 3 ways. The TLBs and micro BTBs have been regionalized, so that the upper bits can be disabled for the common case when page lookups and branch targets are closer rather than farther away. Similarly, small-offset branch-target optimizations reduce power when your branch target is close. Suppression of superfluous branch predictor accesses will reduce power in large basic blocks of code – the A72 recognizes these and does not access the branch predictor during those loops to save power.
Of the many changes in the decode block, the biggest change is in handling of microOps – the Cortex-A72 keeps them more complex up to the dispatch stages – this increases performance and reduces decode power. AArch64 instruction-fusion capability deepens the window for instruction level parallelism. In addition to this, the decode block has undergone extensive decoder power optimization, with buffer optimization and flow-control optimizations throughout the decode/rename unit.
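The idea behind instruction fusion is that certain adjacent instruction pairs can travel the pipeline as a single micro-op, occupying one slot in the out-of-order window instead of two. The sketch below illustrates the concept with a compare-and-branch pair; the specific pairing rule and mnemonics here are a simplified illustration, not ARM's documented fusion cases for Cortex-A72.

```python
# Sketch of instruction fusion: an adjacent CMP + conditional-branch
# pair is merged into one micro-op. The fusion rule shown is a
# simplified illustration, not ARM's documented pairing for Cortex-A72.

def fuse(instructions):
    """Merge CMP immediately followed by B.<cond> into one micro-op."""
    fused, i = [], 0
    while i < len(instructions):
        if (i + 1 < len(instructions)
                and instructions[i].startswith("CMP")
                and instructions[i + 1].startswith("B.")):
            fused.append(instructions[i] + " + " + instructions[i + 1])
            i += 2
        else:
            fused.append(instructions[i])
            i += 1
    return fused

stream = ["ADD x0, x0, #1", "CMP x0, #10", "B.NE loop", "RET"]
print(fuse(stream))  # 4 instructions become 3 micro-ops
```

By packing more architectural work into each micro-op, the effective instruction window deepens without growing the underlying structures – more instruction-level parallelism for the same hardware.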
In the dispatch/retire section of the pipeline, the effective dispatch bandwidth has increased to 5-wide dispatch, offering increased performance (by increasing instruction throughput) while reducing decode power – decoding full instructions rather than micro-ops gets more work done per stage for those instructions. Cortex-A72 also features a power-optimized reorganization of the architectural and speculative register files, with a significant reduction in ports and area. It also features optimizations to the commit-queue and register-status FIFOs, organizing them in a more power-efficient manner.
One final example of the improvements in the dispatch/retire section is the suppression of superfluous register-file accesses - detecting cases where operand data is guaranteed to be in the forwarding network. Every time you avoid a read from the register file, you save power.
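The mechanism can be illustrated with a small model: if an operand was produced by an instruction still within the bypass (forwarding) window, its value is available on the forwarding network and the register-file read port can stay idle. The window depth and instruction encoding below are illustrative assumptions, not Cortex-A72 specifics.

```python
# Sketch of suppressing superfluous register-file reads: operands
# produced within the last few instructions are still on the
# forwarding (bypass) network, so no register-file read is needed.
# The window depth and instruction format are illustrative assumptions.

FORWARD_WINDOW = 2  # hypothetical bypass depth, in instructions

def regfile_reads(instrs):
    """Count register-file reads, skipping operands found in the bypass.

    Each instruction is (dest_reg, [source_regs]).
    """
    reads = 0
    recent_dests = []                 # destinations of recent instructions
    for dest, srcs in instrs:
        for s in srcs:
            if s not in recent_dests: # not forwardable: must read the RF
                reads += 1
        recent_dests = (recent_dests + [dest])[-FORWARD_WINDOW:]
    return reads

# x1 = f(x0); x2 = f(x1); x3 = f(x2, x1) -- back-to-back dependent ops.
chain = [("x1", ["x0"]), ("x2", ["x1"]), ("x3", ["x2", "x1"])]
print(regfile_reads(chain))  # only 1 read (x0); x1 and x2 are bypassed
```

Dependent code like this is common, so detecting guaranteed-forwardable operands eliminates a large fraction of register-file read energy.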
Floating Point Unit and Advanced SIMD
Here the biggest improvement is the introduction of new FP functional units with significantly reduced instruction latencies.
These are very fast floating-point latencies, comparable with the latest high-performance server and PC CPUs. Floating-point latency is important in typical mobile and consumer use cases, where there is commonly a mix of FP and integer work. In these settings, the latency between computation and result is critical: shorter latencies mean instructions waiting on those FP results are less likely to be stalled.
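A toy timing model makes the effect visible: the completion time of a chain of dependent FP operations scales directly with the unit latency, whereas independent operations pipeline behind one another. The latency values below are illustrative, not the actual Cortex-A72 numbers.

```python
# Toy timing model: why FP latency matters for dependent code.
# Latency values are illustrative, not actual Cortex-A72 figures.

def chain_cycles(n_ops, latency):
    """Each op waits on the previous result: n * latency cycles."""
    return n_ops * latency

def independent_cycles(n_ops, latency, issue_per_cycle=1):
    """Independent ops pipeline: issue time plus one final latency."""
    return (n_ops - 1) // issue_per_cycle + latency

print(chain_cycles(8, latency=5))        # 40 cycles at 5-cycle latency
print(chain_cycles(8, latency=3))        # 24 cycles at 3-cycle latency
print(independent_cycles(8, latency=5))  # 12 cycles when independent
```

Real code sits between the two extremes, but mixed FP/integer workloads contain plenty of short dependent chains, which is exactly where cutting a cycle or two of latency shows up as delivered performance.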
Other improvements in this area of the design include an improved issue-queue load-balancing algorithm, and multiple zero-cycle forwarding data paths resulting in improved performance and reduced power. Finally, the design features a source-reduction in the integer issue-queue which cuts power without performance loss.
The Load/Store unit features several key optimizations. The main improvement is the replacement of the prefetcher with a more sophisticated combined L1/L2 data prefetcher – it is more advanced and recognizes more streams. The Load/Store unit also includes late-pipe power reduction with an L1 D-cache hit predictor. Performance tuning of the Load/Store buffer structures improves both effective utilization and capacity. Increased parallelism in the MMU table-walker and a reduced-latency L2 main TLB improve performance in typical code scenarios where data is spread across many pages, for example in web browsing, where data memory pages typically change frequently. Finally, there is power optimization in the configurable pipeline support logic, and extensive power optimization in the L2-idle scenario. The combination of the new prefetcher and other changes enables an increase of more than 50% in CPU memory bandwidth.
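The core idea behind a stream prefetcher is stride detection: once the same address delta repeats, the hardware can start fetching ahead of demand. Here is a minimal sketch of that detection step; the confirmation threshold and addresses are illustrative, and the real Cortex-A72 prefetcher tracks many such streams concurrently across L1 and L2.

```python
# Sketch of stride-based stream detection, the core idea behind a
# hardware data prefetcher. Thresholds and addresses are illustrative,
# not Cortex-A72 specifics.

def detect_stride(addresses, confirmations=2):
    """Return the stride once it has repeated `confirmations` times."""
    stride, seen = None, 0
    for prev, cur in zip(addresses, addresses[1:]):
        delta = cur - prev
        if delta == stride:
            seen += 1
            if seen >= confirmations:
                return stride      # stream confirmed: start prefetching
        else:
            stride, seen = delta, 1
    return None                    # no steady stream detected

print(detect_stride([0x1000, 0x1040, 0x1080, 0x10C0]))  # 64-byte stride
print(detect_stride([0x1000, 0x2345, 0x0FFF, 0x7000]))  # None: irregular
```

Once a stream is confirmed, the prefetcher issues requests for the next few predicted addresses, so by the time the demand load arrives the data is already in the cache – which is how prefetching contributes to the memory-bandwidth gain described above.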
In summary, the Cortex-A72 micro-architecture is a significant change from Cortex-A57. Several changes to the micro-architecture work in concert to produce a big step up in performance while consuming less power. The important takeaways for the ARM Cortex-A72 are:
Cortex-A72 is truly the next generation of high-efficiency compute. A Cortex-A72 CPU based system increases performance and system bandwidth further when combined with ARM CoreLink CCI-500 Interconnect. A premium mobile subsystem will also contain the Cortex-A53 in a big.LITTLE CPU subsystem to reduce power and increase sustained performance in thermally constrained workloads such as gaming. Finally, the ARM Mali-T880 graphics processor can combine with Cortex-A72 and CCI-500 to create a premium mobile system. That’s all for this edition of the blog. What further features of the Cortex-A72 and the big.LITTLE premium mobile system are you interested in hearing about in future editions?
I think that this is a big part of what I'm asking, and I suppose I need to learn a lot about the way that modern CPU architectures operate! Thank you for considering my question, and for being most helpful!
You mean instruction scheduling, as in out-of-order execution (Out-of-order execution - Wikipedia, the free encyclopedia)?
There are some documents that give an outline of that; for the A72, for instance, you can find the following in the documentation section under self-service resources.
If the processor is waiting for some data to be loaded and it isn't in the cache, then other operations that don't depend on the result are executed. It can also speculatively execute instructions when it predicts a branch, throwing the results away if the prediction was wrong – but with a few loads outstanding it will eventually just stall until it can do something useful.
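That behaviour can be sketched with a tiny scheduling model: each cycle, the oldest instruction whose source operands are ready is issued, so independent work proceeds in the shadow of an outstanding load. The instruction format and latencies below are simplified illustrations, not a model of any real core.

```python
# Sketch of out-of-order issue: each cycle, issue the oldest
# instruction whose sources are ready. Format and latencies are
# simplified illustrations, not a model of any real core.

def schedule(instrs, latencies):
    """Greedy OoO scheduler; returns the issue order of instructions.

    Each instruction is (name, dest_reg, [source_regs]).
    """
    ready_at = {}                  # register -> cycle its value is ready
    done, order, cycle = set(), [], 0
    while len(done) < len(instrs):
        for i, (name, dest, srcs) in enumerate(instrs):
            if i in done:
                continue
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                order.append(name)
                ready_at[dest] = cycle + latencies[name]
                done.add(i)
                break              # one issue per cycle in this sketch
        cycle += 1
    return order

# "use" depends on the slow load; "add1"/"add2" are independent and
# issue in the shadow of the load instead of stalling behind it.
instrs = [("load", "x0", []), ("use", "x1", ["x0"]),
          ("add1", "x2", []), ("add2", "x3", [])]
print(schedule(instrs, {"load": 4, "use": 1, "add1": 1, "add2": 1}))
```

In program order, "use" would stall everything behind the 4-cycle load; out of order, the two independent adds complete during the load's latency, and only "use" waits for the data. Once no independent work remains, the core stalls just as described above.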
Apologies... My question was worded poorly, as I understood your previous message. What I meant to ask (more specifically) for is how modern ARM CPU cores and compilers handle stall-situations in multi-threading (and single-threaded) scenarios with instructions that have a greater-than-one-cycle latency (eg. SQRT, Load-stores, etc)?
There seems to be some misunderstanding. This article on Wikipedia about multithreading seems fairly reasonable:
Multithreading (computer architecture) - Wikipedia, the free encyclopedia
This is the type of multithreading I said current ARM cores do not support. Do you have something else in mind?
Thanks for the prompt reply, it is very much appreciated!
Do you have links to literature offering deeper clarity into the way modern ARM Cortex-A cores handle multi-threading?