The ARM Cortex-A mobile application processor product line spans several generations and three main product tiers. Developers and SoC designers experienced with one or more of the newer ARM processors benefit from an awareness of how the product line has evolved from a single high-performance general-purpose CPU design to three distinct product lines that target the high end, mid-range, and entry level of the mobile device SoC market.
ARM introduced the Cortex-A8 processor to the market in 2005 as the first processor supporting the upgraded ARMv7-A architecture. ARMv7 incorporated three key elements: the NEON single instruction multiple data (SIMD) unit, ARM TrustZone security extensions, and the Thumb-2 instruction set for reduced code size via a mix of 16-bit and 32-bit instructions. Cortex-A8 implemented the extended ISA in the first fully superscalar design from ARM: it implemented a full dual-issue pipeline, which meant the Cortex-A8 could simultaneously issue any two instructions that occurred sequentially in the instruction stream whose arguments didn’t have unresolved dependencies. It could not, however, issue or retire instructions out of the order they appeared in the compiled assembly block – that feature would come in future designs.
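To make the in-order dual-issue rule concrete, here is a minimal sketch in Python. It is a toy model, not a description of the real Cortex-A8 pairing rules: the `Instr` type, the register names, and the idea that only register dependencies matter are all assumptions made for illustration, and structural limits (such as which pipeline each instruction needs) are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical, simplified instruction: one destination, some sources."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """Return True if `second` may issue in the same cycle as `first`."""
    if first.dest is not None:
        # Read-after-write: `second` needs a result `first` has not produced yet.
        if first.dest in second.srcs:
            return False
        # Write-after-write: both instructions target the same register.
        if first.dest == second.dest:
            return False
    return True


def issue_in_order(program: List[Instr]) -> List[List[str]]:
    """Group instructions into issue cycles, strictly in program order."""
    cycles = []
    i = 0
    while i < len(program):
        if i + 1 < len(program) and can_dual_issue(program[i], program[i + 1]):
            cycles.append([program[i].name, program[i + 1].name])
            i += 2
        else:
            # Never look past the next instruction: no out-of-order issue.
            cycles.append([program[i].name])
            i += 1
    return cycles


program = [
    Instr("add", dest="r0", srcs=("r1", "r2")),
    Instr("sub", dest="r3", srcs=("r0", "r4")),  # depends on the add
    Instr("mov_a", dest="r5", srcs=("r6",)),
    Instr("mov_b", dest="r7", srcs=("r8",)),
]
print(issue_in_order(program))  # [['add'], ['sub', 'mov_a'], ['mov_b']]
```

In this toy model, `add` issues alone because `sub` depends on it. An out-of-order design could find more pairing opportunities by looking further ahead in the stream, which is the capability later cores added.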
When we introduced the Cortex-A8, many partners thought it was overkill for mobile phones – the natural refrain was, “Customers browsing the internet on their phones? Unlikely.” However, working with some key thought leaders in the industry, we progressed towards what customers would want – especially with the advent of high-bandwidth wireless connectivity (3G) and larger screens in mobile devices by the time the Cortex-A8 reached mass production in 2008. The innovative mobile industry made full use of it: the Cortex-A8 introduction coincided with the tremendous ramp in smartphone volumes.
Shortly after the introduction of Cortex-A8, ARM introduced our first multi-core ARMv7 CPU, the Cortex-A9. The Cortex-A9 made use of a hardware block to manage cache coherency among one to four cores in a CPU cluster, with an external L2 cache. The external L2 cache in theory would allow customers to design a smaller version of Cortex-A9 that didn’t incorporate an L2, and the design allowed configurations that didn’t include the coherency logic for smaller single-core designs. In practice, however, most designs opted for two or more cores and an L2 cache. Furthermore, once multiple cores were available to mobile SoC designers, the push to increase performance through core count rather than raw MHz drove mobile SoCs to begin with dual-core topologies, then quickly migrate to quad-core Cortex-A9 as the flagship high-end mobile CPU in late 2011 and early 2012.
Besides enabling multi-core performance, each Cortex-A9 processor delivered about 25% higher instruction throughput per clock cycle than the Cortex-A8. It achieved this improvement, in a similar power and area footprint, by moving to an out-of-order design on a shorter pipeline, with the NEON SIMD and floating point capability integrated at an earlier stage of the pipeline.
As the smartphone market began to accelerate, ARM again saw the performance demanded by mobile systems moving upward, and we defined a processor with a more substantial performance increase that would define a new high-end tier of the mobile market. With the Cortex-A15, ARM would enable a >50% increase in performance over the already powerful Cortex-A9. Additionally, the Cortex-A15 adopted a set of architectural extensions that allowed for larger physical address space, hardware virtualization support, and extended coherency. Larger physical address space is important as devices move above 2GB of RAM, in systems where the 32-bit address space is divided into 2GB of device memory and 2GB of normal memory. Virtualization is being explored in mobile systems for separate business and user OSes in bring your own device (BYOD) and other similar deployment scenarios. Extended coherency is useful in big.LITTLE processing technology as a means of reducing average power consumption and tuning for maximum delivered performance in power-constrained settings.
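The address-space arithmetic behind that 2GB limit can be sketched as follows. The 2GB device / 2GB normal split comes from the text above; the 40-bit width used for the extended case is the physical address size defined by the ARMv7 Large Physical Address Extension, and the helper names are hypothetical.

```python
GIB = 1 << 30  # bytes in one gibibyte


def addressable_bytes(address_bits: int) -> int:
    """Total bytes reachable with a physical address of the given width."""
    return 1 << address_bits


def ram_fits(ram_gib: int, address_bits: int, device_gib: int = 2) -> bool:
    """Check whether RAM fits once `device_gib` is reserved for device memory."""
    available = addressable_bytes(address_bits) - device_gib * GIB
    return ram_gib * GIB <= available


print(addressable_bytes(32) // GIB)  # 4 (GiB in a 32-bit physical address space)
print(ram_fits(2, address_bits=32))  # True: 2GB of RAM fills the normal-memory half
print(ram_fits(3, address_bits=32))  # False: no room left beside 2GB of device space
print(ram_fits(3, address_bits=40))  # True with 40-bit physical addressing
```

With 32-bit physical addresses and half the map reserved for devices, anything above 2GB of RAM does not fit, which is why the wider physical addressing matters well before 64-bit virtual addressing does.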
The Cortex-A15 cluster combines an integrated snoop control unit (SCU) for hardware coherency, one to four CPU cores in a cluster, and an integrated L2 cache controller – the topology shared by all of the ARM Cortex-A CPUs after Cortex-A15.
Pushing the bounds consistently higher in a mobile envelope
Comparing the performance of the high end of the Cortex-A series processors shows just how much the performance bar has moved up since the first devices based on a 1GHz Cortex-A8 shipped in 2008.
The extended coherency mechanism, ACE, enables big.LITTLE SoCs like the one shown in the diagram below. In a big.LITTLE system, the “big” CPU cluster is typically implemented and tuned for peak performance, while the “LITTLE” CPU cluster is tuned for power efficiency. In typical workloads, the LITTLE processors can handle most of the work, with the “big” CPU cluster active less than 10% of the time, and in many cases less than 1% of CPU runtime. The CoreLink CCI-400 cache coherent interconnect enables each CPU cluster to snoop into the caches of the other cluster, enabling a fast software transition of threads from one side to the other.
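As a rough illustration of why the “big” cluster can stay idle most of the time, the sketch below models a load-threshold migration policy with hysteresis. This is not the actual big.LITTLE scheduling software: the function names, the 0.85 and 0.30 thresholds, and the synthetic load trace are all assumptions chosen for the example.

```python
from typing import Iterable


def choose_cluster(current: str, load: float,
                   up: float = 0.85, down: float = 0.30) -> str:
    """Pick the cluster for the next interval from the measured load (0.0-1.0).

    `up` and `down` are hypothetical thresholds; the gap between them
    provides hysteresis so a task does not bounce between clusters.
    """
    if current == "LITTLE" and load > up:
        return "big"
    if current == "big" and load < down:
        return "LITTLE"
    return current


def big_time_fraction(loads: Iterable[float], start: str = "LITTLE") -> float:
    """Fraction of sampling intervals spent on the big cluster."""
    cluster = start
    big_samples = 0
    total = 0
    for load in loads:
        cluster = choose_cluster(cluster, load)
        big_samples += cluster == "big"
        total += 1
    return big_samples / total if total else 0.0


# Synthetic trace: mostly light work with one short burst of heavy load.
trace = [0.10] * 95 + [0.95, 0.90, 0.50, 0.20, 0.10]
print(big_time_fraction(trace))
```

For this made-up trace the big cluster is selected for only 3 of the 100 intervals. Real policies work per thread and rely on the coherent interconnect so that a migrated thread finds its data still available through the other cluster's caches.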
With the explosive growth of the smartphone market, SoC vendors and OEMs are segmenting the market into a flagship high-end tier, a mid-range tier, and a low-cost entry-level tier. As these segments have emerged, ARM has been defining processors to specifically target the three tiers of the market. The Cortex-A12 is a new CPU, built on a new microarchitecture, specifically targeting the fast-growing mid-range mobile segment. The graph below shows the size of the segments of the market, and ARM’s Cortex-A products for those segments:
The design of the Cortex-A12 targets the area and power budget for mid-range mobile SoCs. It uses an out-of-order dual-issue pipeline that delivers greater than 40% more performance than the Cortex-A9, which is currently used in many mid-range mobile SoCs. Cortex-A12 was released to market in the middle of 2013, and is expected to reach production in 2014. It is a 32-bit processor with the same physical addressing extensions and architectural feature support found in the Cortex-A15.
Cortex-A12 is able to deliver performance that is close to Cortex-A15 in many, but not all use cases. The Cortex-A12 is also optimized for mid-range mobile design, omitting some enterprise features and using a slightly simpler pipeline, so the Cortex-A15 is found in high-end devices across multiple markets while Cortex-A12 more squarely targets mid-range mobile designs.
The flagship CPU from the ARM CPU portfolio for 2013, 2014, and 2015 design starts is the Cortex-A57. It delivers 64-bit capability through the ARMv8-A architecture, again with full backward compatibility with the ARMv7 architecture through the AArch32 execution state. While 64-bit capability is not truly necessary for mobile systems until more than 4GB of RAM becomes common, and even then it can be addressed with extended physical addressing, the early introduction of 64-bit capability allows a longer and smoother software transition, and enables high-performance apps to take advantage of larger virtual address ranges for content creation applications like video editing, photo manipulation, and augmented reality, to name a few. The architecture allows a 64-bit operating system to power the system, while a mix of 32-bit and 64-bit applications seamlessly operate on top. The ARMv8 architecture allows easy transition from one state to the other.
In addition to the architectural benefits of ARMv8, the Cortex-A57 also increases performance per cycle by 20% to 40% over the high-performance Cortex-A15 CPU. It also improves power efficiency through changes to the L2 design and other elements of the memory system. The Cortex-A57 will deliver unprecedented levels of power-efficient performance to mobile systems, and with big.LITTLE SoCs will do so at very low average power levels.
As the smartphone market took off, the first segment that appeared was the entry-level segment. In emerging markets, mobile devices are not subsidized by wireless carriers, so individuals typically pay full price for mobile devices and pay for service off contract, month to month. The price range for emerging markets is below $150 and is rapidly falling below $75 – a different class of SoC design is required to support these markets. Shortly after the launch of the Cortex-A9, ARM sought to create a processor to support this market: something that was the same size and power as a feature phone processor like the venerable ARM926, but with more performance than the ARM11 family that was used in the first smartphones. In 2009 we launched the Cortex-A5, a design that achieved these goals through an in-order, single-issue, 8-stage pipeline. The simple pipeline design allowed the power to be quite low, and simplifications in the feature set allowed the efficiency (performance per mW) to be the highest ARM had ever delivered for an applications processor.
The Cortex-A7 processor built on the success of the Cortex-A5, which is now shipping in high volume in entry-level smartphones and has defined a vibrant category of the smartphone market. With the success of the Cortex-A5, the next goal was to create a similar processor that was capable of matching the architectural feature set of Cortex-A15 and thereby combine with it in a big.LITTLE pair, while also increasing the performance over Cortex-A5 at the same level of power efficiency and a similar power and area footprint. The Cortex-A7 delivered 20% more performance per cycle by adding partial dual-issue, enlarging the TLB and memory structures, and integrating the level 2 cache.
The latest entry in the high efficiency CPU product line leverages the same 8-stage in-order pipeline, but increases performance significantly through a full dual-issue approach, wider internal buses, increased floating point and SIMD throughput capacity, bigger TLBs, and other improvements in the memory system. The Cortex-A53 includes optional ECC protection on the internal RAMs, and offers a choice of external bus options that allow it to be deployed in mobile and enterprise applications.
In addition to the microarchitecture performance improvements, the Cortex-A53 adds support for the ARMv8 architecture, which brings 64-bit capability into standalone entry-level mobile designs, into scalable enterprise applications with multiple Cortex-A53 clusters, and into high-end mobile systems that combine Cortex-A53 with its big brother, the Cortex-A57, in a big.LITTLE subsystem design.
The graph below shows the performance between the successive generations of high efficiency Cortex-A CPUs. With the latest design, the Cortex-A53 is able to deliver more performance than the flagship CPU from just a few years ago (the Cortex-A9). Note that the performance comparison shown below is given at the same frequency. In physical implementation, the 8 stage pipelines of the Cortex-A53, Cortex-A7, and Cortex-A5 achieve frequencies within about 15% of those achieved by the longer pipelines in the larger Cortex-A CPUs. Actual production SoC frequencies vary a great deal based on process options and back end design, and we’ve seen Cortex-A7 pushed to frequencies of 1.2GHz, 1.5GHz, and above in 28nm technology.
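Because the comparison is made at equal frequency, delivered performance in a real SoC also depends on the clock each core actually reaches. A minimal sketch of that scaling follows, using the 20% per-cycle gain quoted earlier for Cortex-A7 over Cortex-A5. The function name and the clock speeds in the second example are hypothetical, and the model assumes performance scales linearly with frequency, which ignores memory-system effects.

```python
def relative_performance(per_clock_ratio: float,
                         freq_new_ghz: float,
                         freq_old_ghz: float) -> float:
    """Estimate new-vs-old performance as per-clock gain times clock ratio."""
    return per_clock_ratio * (freq_new_ghz / freq_old_ghz)


# Same frequency: only the per-cycle improvement shows up.
same_clock = relative_performance(1.20, freq_new_ghz=1.0, freq_old_ghz=1.0)

# Hypothetical clocks: the newer core also runs 50% faster.
faster_clock = relative_performance(1.20, freq_new_ghz=1.5, freq_old_ghz=1.0)

print(f"{same_clock:.2f}x at the same frequency")
print(f"{faster_clock:.2f}x with the higher clock")
```

Under these assumptions the per-cycle gain and the frequency gain multiply, so a 20% per-cycle improvement combined with a 50% higher clock gives roughly 1.8x.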
For more on the high efficiency product line from ARM, see Kinjal Dave's blog - High efficiency, midrange or high performance Cortex-A - What is the difference?
For more on the Cortex-A53, see my earlier blog on the latest high efficiency Cortex-A CPU - The Top 5 Things to Know about Cortex-A53
A comprehensive mobile roadmap
Combining all of these processors in a single diagram, the roadmap below shows the high performance tier, the mid-range tier, and the entry-level tier of mobile applications processors from ARM and the supporting coherent interconnect.
The roadmap above highlights the three product tiers in the ARM mobile CPU roadmap for application processors, which we expect to continue in our development of future products. Focusing our design efforts specifically on the high-end, mid-range, and highest efficiency tiers allows us to deliver bespoke CPU offerings that uniquely address these three segments of the smartphone and tablet market.
Finding the right processor for the right task doesn’t have to be “either or”
ARM’s big.LITTLE technology is designed to give consumers the best overall user experience – performance on demand, better energy efficiency, and a “cool” device that lasts. Initial products in the market include the Samsung GS4 (international versions) and the Samsung Note 3 (international versions).
The diagram below showcases the high-end mobile CPU subsystem for future designs taping out in 2013 and 2014, for devices in 2014 and 2015. It features big.LITTLE power management, the added performance of the Cortex-A57, and a Cache Coherent Interconnect (CCI) capable of supporting IO coherency for GPU compute.
For more detail on big.LITTLE, see the presentation I gave at ARM TechCon 2013 on measured results from big.LITTLE platforms, showcasing performance improvements and power savings - big.LITTLE technology moves towards fully heterogeneous Global Task Scheduling - Techcon Presentation
Also, you may want to have a look at my earlier blog that addresses the key points of the technology - Ten Things to Know About big.LITTLE
The system diagram above represents the state of the art for mobile CPU design, featuring the Cortex-A57 and Cortex-A53 combined with our latest Mali-T760 GPU. Note that it features two big cores - from the performance measurements we have been doing, two big cores seem a very good match to current workloads. The diagram would look very similar with the Cortex-A15 and Cortex-A7 CPUs that are now being featured in the highest-end mobile SoCs; the Cortex-A50 series processors represent a view toward the future topology for mobile SoCs that will begin to arrive in 2014… but ARM is not stopping there. We continue to develop new innovations in low power CPU, GPU, and system design that fuel the innovation in the smartphone, tablet, and emerging new device categories for power-efficient mobile computing that improves the lives of billions of people worldwide.
Thanks Brian for the overview and roadmap of the Cortex-A profile.
Samsung Electronics