
Arm Cortex-A55: Efficient performance from edge to cloud

Govind Wathan
May 29, 2017
12 minute read time.

Have you heard? There are a couple of new CPUs in town... and they pack a punch! Of course, I am talking about the Arm Cortex-A75 and Cortex-A55, the first Cortex-A processors based on the recently announced DynamIQ technology. In this blog, we will discuss the Cortex-A55: a processor destined to play a foundational role in tomorrow’s digital world. Here is why.

A proven pedigree

 Diagram of Cortex-A75 and Cortex-A55 processors

First DynamIQ processors debut with the Arm Cortex-A75 and Cortex-A55

 

To understand the true potential of the Cortex-A55, let us briefly revisit its predecessor: the Arm Cortex-A53. This CPU has shipped in over 1.5 billion devices and is still the highest-shipping 64-bit Cortex-A CPU in the industry today. Launched in 2012, the Cortex-A53 offered a unique blend of performance, power efficiency and area scalability, combined with a versatile feature set, that allowed it to be deployed across markets: from premium smartphones to network infrastructure, automotive infotainment and advanced driver assistance systems (ADAS), digital TVs, entry-level mobile and consumer devices, and even satellites.

However, since 2012, a lot has changed in the world around us. Emerging trends that we see today show great promise for an always-connected, intelligence-everywhere digital world. From fully autonomous, self-driving cars to intelligent apps on our devices, artificial intelligence (AI) and machine learning (ML) are set to become truly embedded in our everyday lives. The prevalent application of Internet-of-Things (IoT) will mean an explosion of ‘things’ constantly producing, consuming and reacting to data. Augmented, Virtual and Mixed Reality (AR, VR and MR) are set to transform how we interact with each other and with our devices, merging the real world with the digital.

For the last two years, engineers at Arm have been working tirelessly on the successor to the Cortex-A53, to meet the demands of such emerging technologies. The premise was to build a CPU with significantly more performance and power efficiency, enhanced scalability and packed with advanced features for future applications, from the edge to the cloud. And they did it.

Performance gains across the board

 Performance diagram for Arm Cortex-A55

Higher performance across the board with Cortex-A55

 

The Cortex-A55 implements the latest Armv8.2 architecture and builds on the success of its predecessor. It pushes the boundaries on performance while maintaining the same levels of power consumption as the Cortex-A53. We pulled out all the stops in improving on the Cortex-A53, achieving:

  • Up to 2x more memory performance than Cortex-A53 at iso-frequency, iso-process
  • Up to 15% better power efficiency than Cortex-A53 at iso-frequency, iso-process
  • More than 10x more scalability than Cortex-A53

This was achieved by rethinking and challenging existing concepts in the design of the Cortex-A53.

  • We overhauled the branch predictor by incorporating neural network elements in its algorithm to improve prediction. Zero-cycle branch predictors were also added to further reduce bubbles in the pipeline. This cumulatively reduced idle time between instructions.
  • We made the L2 cache private to each CPU, which resulted in a reduction of memory access time to the L2 cache by more than 50% when compared to the Cortex-A53. The L2 cache has also been designed to run at the same frequency as the CPU. This, combined with the lower latency, offers a sizeable increase in performance across a wide range of benchmarks.
  • We introduced an L3 cache, which is shared across all the Cortex-A55 CPUs within the cluster. This allows DynamIQ clusters to benefit from enhanced memory capacity situated closer to the CPU, thus improving performance and reducing system power. The L3 cache is a part of a new functional unit in DynamIQ processors called the DynamIQ Shared Unit (DSU).
  • 8-bit integer matrix multiplication accounts for over 85% of neural network performance. New architectural instructions were added to the Cortex-A55 NEON pipeline, allowing it to perform sixteen 8-bit integer operations per cycle. These new instructions also allow eight 16-bit floating-point operations per cycle, plus rounding double MAC instructions that are beneficial for colour space conversion. A minimal code sketch of how these 8-bit operations are used follows this list.
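To make the 8-bit integer operations concrete, here is a minimal sketch (not taken from the original post) of a quantized dot-product inner loop written with NEON intrinsics. It assumes the Armv8.2 dot-product extension (the SDOT instruction, exposed as the vdotq_s32 intrinsic) is available and that the code is built with a flag such as -march=armv8.2-a+dotprod; each vdotq_s32 call performs sixteen 8-bit multiplies accumulated into four 32-bit lanes.

```c
/* Sketch: int8 dot product using the Armv8.2 dot-product extension.
 * Assumes the extension is present and the code is built with, e.g.,
 * -march=armv8.2-a+dotprod. Illustrative only. */
#include <arm_neon.h>
#include <stdint.h>

/* Dot product of two int8 vectors of length n (n a multiple of 16),
 * accumulated in 32 bits: the core operation of quantized matrix multiply. */
int32_t dot_s8(const int8_t *a, const int8_t *b, int n)
{
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        /* 16 x (int8 * int8) multiplies, accumulated into 4 x int32 lanes */
        acc = vdotq_s32(acc, va, vb);
    }
    return vaddvq_s32(acc);  /* horizontal add of the four 32-bit lanes */
}
```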

Significantly more efficient than the Cortex-A53

Cortex-A55 v Cortex-A53 diagram

Continued leadership in power and thermal efficiency

The improvements to the branch predictor, NEON and FP units described above, as well as the reduced latency to memory, are only a few of the reasons for the Cortex-A55’s impressive performance gains. It achieves these gains while maintaining similar power consumption to the Cortex-A53. Overall, the Cortex-A55 offers a significant 15% improvement in power efficiency. In designs where power is more important than performance, it can be tuned to deliver the same level of performance as the Cortex-A53 at a whopping 30% lower power!

The Cortex-A55 delivers sustained performance for a significantly longer duration than today’s Cortex-A53 solutions. This is critical for user experiences in markets such as AR, VR and MR that are expected to dominate the future mobile landscape. These use cases are highly threaded and have strict latency requirements: motion-to-photon latency, per industry research, needs to stay consistently at 20ms or lower so as not to cause nausea and dizziness. While CPUs today have achieved the level of performance needed for 20ms latency, thermal limits mean they cannot sustain it for very long. With the Cortex-A55, we present a solution for sustained performance over longer periods in future VR devices.

 Cortex-A55 features and performance diagram

Advanced features and higher performance for infrastructure markets

 

This market-leading efficiency enables the Cortex-A55 to excel in infrastructure markets. Applications such as Power over Ethernet (PoE) wireless access points and thermally constrained, rear-view-mirror-mounted automotive solutions can take advantage of the thermally efficient Cortex-A55 to deliver the highest performance within a given thermal budget. The Cortex-A55 is also able to maximize networking throughput for a given power budget in 5G remote radio heads (RRH).

Scalable from the edge to the cloud and everything in between

 Cortex-A55 scalability diagram

Right size compute for any need

 

In addition to performance and efficiency, the Cortex-A55 has been designed to be highly scalable in physical die area and compute performance. To that end, multiple RTL configuration options were included, making it 10x more configurable than the Cortex-A53. In fact, it has over 3000 unique configurations, making it the most scalable Cortex-A CPU ever designed.

The Cortex-A55 maintains the flexibility of the Cortex-A53, with options such as NEON, Crypto and ECC (Error Correcting Codes), but also introduces new, practical configuration options. For example, the private L2 cache can be configured in size from 64KB up to 256KB, giving a performance uplift of up to 10%. While the private L2 cache brings a good degree of added performance, and will undoubtedly be the default option for many markets, it is optional, so that die area can be reduced further in area-sensitive markets such as IoT.

Diagram of new features in the DynamIQ Shared Unit

High-level view of the new features in the DynamIQ Shared Unit

 

The DSU, which is common to both the Cortex-A55 and Cortex-A75, contains further configuration options that allow it to be customized to your application. For example, the L3 cache, which is shared across the CPUs, is scalable from 0KB to a maximum size of 4MB. It also supports versatile interface options to the wider system through AMBA 5 ACE or CHI. The Accelerator Coherency Port (ACP) and a low-latency peripheral port (PP) are also integrated into the DSU, enabling closely coupled accelerators to connect to the Cortex-A55 for general compute. These features, alongside the ML capabilities of the Cortex-A55, enable more compute to happen closer to the ‘edge’ in IoT gateway applications.
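Because the L2 and L3 sizes are configuration choices made by the SoC designer, software should discover the cache hierarchy rather than assume it. The following sketch (an illustration added here, not from the original post) uses the architectural CLIDR_EL1, CSSELR_EL1 and CCSIDR_EL1 registers to enumerate the data/unified cache levels that are present and their sizes. It must run at EL1 or higher (for example, in a kernel or bare-metal context), and printf simply stands in for whatever logging is available there.

```c
/* Sketch: enumerate the Armv8-A cache hierarchy (L1/L2/L3 as configured).
 * Must run at EL1 or higher; these system registers trap from user space.
 * Field decoding below assumes the original (non-CCIDX) CCSIDR layout. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_clidr(void)
{
    uint64_t v;
    __asm__ volatile("mrs %0, CLIDR_EL1" : "=r"(v));
    return v;
}

static inline uint64_t read_ccsidr(uint64_t level)
{
    uint64_t csselr = level << 1;            /* select data/unified cache */
    uint64_t v;
    __asm__ volatile("msr CSSELR_EL1, %0; isb" : : "r"(csselr));
    __asm__ volatile("mrs %0, CCSIDR_EL1" : "=r"(v));
    return v;
}

void print_cache_levels(void)
{
    uint64_t clidr = read_clidr();
    for (uint64_t level = 0; level < 7; level++) {
        unsigned type = (clidr >> (3 * level)) & 0x7;    /* Ctype<n> */
        if (type == 0)
            break;                                       /* no more caches */
        if (type >= 2) {                                 /* data or unified */
            uint64_t ccsidr = read_ccsidr(level);
            unsigned line = 1u << ((ccsidr & 0x7) + 4);            /* bytes */
            unsigned ways = (unsigned)((ccsidr >> 3) & 0x3FF) + 1;
            unsigned sets = (unsigned)((ccsidr >> 13) & 0x7FFF) + 1;
            printf("L%u: %u KB (%u-way, %u-byte lines)\n",
                   (unsigned)level + 1, line * ways * sets / 1024, ways, line);
        }
    }
}
```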

Packed with advanced features for a diversity of emerging applications

 Diagram of improvements for Arm architecture Cortex-A55

Accelerating AI adoption everywhere

 

It is no secret that AI is set to become a ubiquitous part of our daily lives and, by extension, so are the ML workloads that run on our devices. There are multiple ways of handling ML on a chip; however, CPUs have a distinct advantage here. CPUs are used for general compute and are therefore already present in the chips that perform AI today. Moreover, many of today’s ML workloads and AI applications are in a state of constant evolution, which makes fixed-function hardware both an expensive and easily outdated solution for ML.

The improvements to the NEON pipeline of the Cortex-A55, and the addition of the 8-bit integer architectural instructions, mean that the Cortex-A55 can deliver significantly more ML performance in matrix multiplication operations than the Cortex-A53. The recently announced collection of low-level software functions optimized for Arm Cortex-A NEON and Mali GPU IP also applies to the Cortex-A55 NEON pipeline, further boosting this performance advantage.
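For readers less familiar with why matrix multiplication dominates ML workloads, the sketch below shows a plain scalar int8 matrix-multiply kernel (a generic reference, not code from any particular Arm library). It is exactly this innermost multiply-accumulate loop that the dot-product intrinsics shown earlier, and optimized NEON library routines, vectorize.

```c
#include <stdint.h>

/* Reference (scalar) int8 matrix multiply: C[MxN] = A[MxK] * B[KxN],
 * accumulating into 32 bits, as quantized neural-network inference does.
 * Generic illustration only; optimized libraries vectorize the k loop. */
void gemm_s8_ref(const int8_t *A, const int8_t *B, int32_t *C,
                 int M, int N, int K)
{
    for (int m = 0; m < M; m++) {
        for (int n = 0; n < N; n++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++)
                acc += (int32_t)A[m * K + k] * (int32_t)B[k * N + n];
            C[m * N + n] = acc;
        }
    }
}
```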

DynamIQ safety-critical ASIL D application diagram

Safer autonomous systems with Cortex-A55

The Cortex-A55 also includes advanced Reliability, Availability and Serviceability (RAS) features that allow it to serve a wide range of markets, such as infrastructure and automotive. For automotive, the level of safety has been extended in the Cortex-A55. It offers optional ECC and parity on every level of cache memory, and supports data poisoning, a method of deferring detected, non-correctable errors for more resilient systems. It is also the first Cortex-A CPU to undergo a new design flow for systematic fault avoidance, making it suitable for ASIL D applications when paired with the Cortex-R52.

Advanced power management deeply embedded

 Arm DynamIQ advanced power features

Advanced power management features for increased power savings

The Cortex-A55 comes with many new power features, such as faster, hardware-controlled state transitions from ON to OFF. The Cortex-A55 is also capable of autonomously powering down the L3 cache depending on the application it is running. For heavy applications that require more memory, such as VR, the L3 cache is fully powered on; for light applications that are fully L1 and L2 resident, such as music playback, the L3 cache is powered off. A further two power modes cover anything in between.

It is also now possible to place individual CPUs, or groups of CPUs, in their own independent voltage domains within a cluster, so that voltage and frequency can be scaled dynamically at a finer granularity. This has two main benefits: firstly, it gives designers further levers for tuning their systems for the best performance and power efficiency; secondly, it means that DynamIQ systems can more closely match a device’s varying thermal envelope and therefore extract the maximum amount of performance available.
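On a Linux-based device, these independent frequency domains surface through the standard cpufreq sysfs interface. The sketch below (an illustration added here, assuming the usual policyN sysfs layout) prints, for each cpufreq policy, which CPUs share the frequency domain and its current frequency; on a DynamIQ big.LITTLE design you would typically expect separate policies for the big and LITTLE groups, or even per-CPU policies.

```c
/* Sketch: list Linux cpufreq policies - the CPUs sharing each frequency
 * domain and that domain's current frequency. Assumes the standard
 * /sys/devices/system/cpu/cpufreq/policyN layout. */
#include <stdio.h>
#include <string.h>

static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    return 0;
}

int main(void)
{
    char path[128], cpus[64], freq[64];
    for (int p = 0; p < 16; p++) {    /* probe policy0..policy15 */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpufreq/policy%d/related_cpus", p);
        if (read_line(path, cpus, sizeof cpus) != 0)
            continue;                 /* policy does not exist */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpufreq/policy%d/scaling_cur_freq", p);
        if (read_line(path, freq, sizeof freq) != 0)
            strcpy(freq, "unknown");
        printf("policy%d: CPUs [%s], current frequency %s kHz\n", p, cpus, freq);
    }
    return 0;
}
```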

A new age for big.LITTLE processing

Since big.LITTLE technology was introduced to the world in 2011, it has become a household name for heterogeneous processing. So much so that two out of every three Android Armv8-based devices shipped today rely on big.LITTLE for power and performance optimization. DynamIQ big.LITTLE is the next generation of heterogeneous computing for systems built with DynamIQ technology.

It enables a fully integrated solution with the Cortex-A75 ‘big’ and Cortex-A55 ‘LITTLE’ CPUs physically located in a single CPU cluster. All software thread migrations, and the resulting cache snoops between big and LITTLE CPUs, now occur within the cluster. The Cortex-A75 CPU can be designed for higher frequencies than the Cortex-A73 while maintaining a continuous DVFS curve with the Cortex-A55, an important design requirement for big.LITTLE systems. Together, they deliver substantially greater peak performance, higher sustained performance and more intelligent capabilities compared to previous generations of big.LITTLE.

big.LITTLE v DynamIQ big.LITTLE features

Richer user experience with DynamIQ big.LITTLE

Today’s mid-range mobile and consumer markets are brimming with quad and octa Cortex-A53 based solutions. However, as advanced use cases such as AI and VR trickle down from premium markets to the mid-range, so does the need to deliver greater performance and intelligent capabilities at a lower cost. DynamIQ big.LITTLE addresses this need by introducing new heterogeneous CPU configurations, such as 1xCortex-A75 + 3xCortex-A55 (1b+3L) and 1xCortex-A75 + 7xCortex-A55 (1b+7L). These new configurations deliver more than 2x the single-thread performance at a similar die area to quad and octa Cortex-A55 designs respectively.

System-on-chip (SoC) design guidance for the new DynamIQ processors

Designing the SoC includes being able to implement quickly to your Performance-Power-Area (PPA) targets. Along with the new suite of system IP, Arm offers POP technology that supports the Cortex-A75 and Cortex-A55 on the process technologies that matter most to our customers. The Cortex-A75 POP IP for TSMC 16FFC offers the fastest performance in one of the most cost-effective process technologies available. For customers looking for leading-edge process technologies, Cortex-A75 and Cortex-A55 POP IP for TSMC 7FF will also be available by Q4 2017. In addition to helping meet PPA targets, Arm POP IP can help customers accelerate the implementation cycle and take advantage of the flexibility of DynamIQ big.LITTLE. The Cortex-A75 and Cortex-A55 POP IP offers the most common configurations for SoC designs focused on applications from the edge to the cloud.

In addition, Arm has a long-standing investment in validating our IP in example SoC designs. As the Arm IP portfolio has grown, so has the complexity and scope of these example systems. This work includes everything from SoC architecture to detailed pre-silicon analysis. Arm is delivering this knowledge as 'System Guidance'.

Alongside the new CPUs there is a range of new system guidance deliverables covering both mobile and infrastructure systems:

  • CoreLink SGM-775 System Guidance for Mobile has been designed and optimized with Cortex-A75, Cortex-A55 and Mali-G72
  • CoreLink SGI-775 System Guidance for Infrastructure describes the types of infrastructure SoC that can be built using the new Arm IP

Both deliverables come with documentation, models and software, and are available for free to Arm partners. To learn more about implementing mobile and infrastructure systems, visit our System Guidance page.

When can I expect to see Cortex-A55 based devices?

It’s exciting to finally lift the curtain on the Cortex-A55. The monumental step up in performance, power efficiency and scalability that the Cortex-A55 brings sets it up nicely to become the next highest-shipping Cortex-A CPU from Arm. However, the excitement doesn’t stop here. A host of Arm partners in the ecosystem have already licensed the Cortex-A55, and I cannot wait to hear about the next wave of intelligent computing solutions that they will be announcing in the coming months. While it is impossible to predict the shapes and forms of devices in which the Cortex-A55 will be used, what is certain is that it is going to be a thrilling 2018 and beyond!

Related content

Accelerating AI experiences from edge to cloud

Arm Cortex-A75: ground-breaking performance for intelligent solutions

Mali-G72: Enabling tomorrow's technology today

How to start developing software for Arm Cortex-A55 and Cortex-A75 processors now

Comments

zhifeng, over 7 years ago:

Great and insightful write up Govind!

Phil Dworsky, over 7 years ago:

Congratulations on the launch, Govind, and also to our mutual customers who've already taped out this new ARM IP using Synopsys' Design and Verification Continuum Platforms.
https://community.arm.com/processors/f/discussions/8587/cortex-a75-cortex-a55-mali-g72-customers-have-already-taped-out-using-synopsys-design-and-verification-tools