At Vision Day in March 2021, Arm introduced the new Armv9 architecture. This was a momentous occasion in our recent history, and Armv9 will be the foundation of the next decade of compute, bringing more performance, machine learning (ML), DSP, and security to empower our partners to deliver best-in-class solutions for all workloads and applications across all markets. Building on this bold vision for the future of compute, we are delighted to announce the first Armv9 Cortex CPUs targeting a wide range of consumer devices for a variety of workloads and use cases. These CPUs are designed to push the limits of performance and efficiency and are tuned to deliver exceptional user experiences.
The new Armv9 Cortex CPUs are the foundation of our wider Total Compute strategy. This takes a holistic system approach to SoC design to ensure our solutions can seamlessly and securely handle ever more complex and compute-intensive workloads and use cases. This is particularly relevant with the proliferation of more consumer devices, use cases and ‘killer apps’ that require more powerful compute resources. The Total Compute strategy focuses on three pillars: accelerating compute performance, expanding security through greater protections across the ecosystem, and improving developer access to more performant software and tools.
In 2020, we launched the Cortex-X Custom (CXC) program as part of our relentless drive for peak performance. The CXC program enables customization and differentiation beyond the traditional roadmap of Arm Cortex products, offering our partners a way to deliver the ultimate performance required for their specific use cases. This year, as part of the CXC program, we are announcing a second-generation Cortex-X CPU, the Arm Cortex-X2, which is designed for ultimate performance, making it our most performant Armv9 CPU.
We are also announcing the Arm Cortex-A710. This is our first Armv9 generation “big” CPU, and it provides the best balance of performance and efficiency. Accompanying the “big” Cortex-A710 is the first Armv9 high-efficiency “LITTLE” CPU, the Arm Cortex-A510, which is the successor to the highly popular Arm Cortex-A55 CPU.
All of these CPUs can bind together in different CPU cluster configurations through the brand new DynamIQ Shared Unit-110 (DSU-110). The versatility of this configurable cluster approach serves diverse market needs from premium smartphones and laptops to DTVs and wearables. This forms the backbone of our new Total Compute solutions that offer the latest Armv9 features at different levels of performance, efficiency, and scalability across multiple consumer device markets and use cases.
Before delving into the different Armv9 CPUs, it is important to reflect on the Arm design philosophies behind each CPU series. For all three Armv9 CPUs, we are building on three distinct micro-architecture paths that are unmatched in the industry. Essentially this means we are building the right CPU for the right workload.
The Cortex-X series is designed to maximize performance on single-threaded and “bursty” workloads. The pipeline in the microarchitecture is structured and provisioned to push IPC performance improvements. The Cortex-A700 series is prioritized for sustained multiprocessor workloads, with the best balance of efficiency and performance for workloads that require sustained performance within a thermally constrained envelope. Finally, the Cortex-A500 series is focused on lightweight workloads, with an efficiency-first design focus. These “LITTLE” cores are inspired, in part, by some of our previous bigger cores, with pre-fetch and predication now adopted as key microarchitecture features. The central theme across all three microarchitecture paths is bringing next-generation architectural features and use-case driven optimizations to our Armv9 CPUs.
As we said back in 2020, partners who sign up to the CXC program work in close collaboration with Arm engineering teams to shape a final CPU product that meets their specific market demands. This means partners can define their own performance points outside of the usual Cortex-A PPA design envelope.
The first Cortex-X custom CPU – the Arm Cortex-X1 – has been a tremendous success, representing a major shift in our performance trajectory. As discussed in this blog, it has been implemented in silicon as part of a tri-cluster CPU configuration in Samsung LSI's Exynos 2100 SoC and Qualcomm's Snapdragon 888 5G mobile platform. This ensures the delivery of peak performance on premium smartphone devices, including the new Samsung Galaxy S21 smartphone. There is now a further step-change in performance through the new flagship Cortex-X2.
Cortex-X2 represents our most performant Armv9 CPU and is scalable across premium smartphones and laptops. This is our second generation of Cortex-X class microarchitecture and continues to deliver double-digit IPC performance improvements. When combined with the latest process nodes and appropriate system configuration, Cortex-X2 is capable of delivering 30 percent single-threaded performance improvements over today’s best Android flagship smartphones¹. We have seen great momentum this past year in the laptop market for both Windows and Chrome laptops that are based on Arm technology. This drive for performance in the laptop market is reflected in the Cortex-X2, which is capable of delivering 40 percent single-threaded performance improvements over 2020 mainstream laptop devices². These improvements in Cortex-X2 performance are amplified by performance scaling from the DSU-110, which enables up to 8 Cortex-X2 cores in a single DSU cluster and larger L3 cache support of up to 16MB. This means partners can adjust CPU configurations for different market needs.
Cortex-A710 elevates our performance and efficiency leadership to new levels, providing uncompromised scalability and performance across multiple form factors. This means Cortex-A710 can target a broad range of consumer devices, from premium smartphones and laptops to smart home devices and smart TVs.
We are very aware of the need to balance performance, power, and area (PPA) in our “big” CPU designs. The new Cortex-A700 series CPUs are all about prioritizing sustained performance for demanding workloads, while also maximizing battery life. Cortex-A710 provides a 10 percent uplift in performance at the same power envelope as the previous generation Arm Cortex-A78 CPU (ISO process)³. Through these performance uplifts, users have enhanced experiences while running demanding applications on smartphones, such as AAA gaming. There is also up to 30 percent improvement in energy efficiency over the Cortex-A78 (ISO process)³. This extends battery life across all mobile devices and reduces thermal throttling events while running applications.
Cortex-A510 provides the highest performance of all the Arm “LITTLE” CPUs, bringing a 35 percent performance increase on the previous generation Cortex-A55 (ISO process)⁴. These performance improvements are important, as they give Cortex-A510 a bigger operating range and raise the ‘performance floor’ to meet the growing performance demands across multiple consumer device markets. This means workloads can run longer on the “LITTLE” CPUs before switching to the “big” CPUs. This, in turn, boosts the overall efficiency in the CPU cluster, as fewer compute workloads need to run on the bigger cores. When combined with our latest interconnect technologies, there is a further boost in system performance, particularly in lower end systems.
However, as with all of Arm’s “LITTLE” CPUs, efficiency is still king. Not only does Cortex-A510 boost power efficiency by up to 20 percent (ISO process)⁴ through the 3-wide in-order design, but it also provides industry-leading area efficiency. An innovation that makes this possible is the merged-core microarchitecture. This allows two Cortex-A510 CPUs to be grouped into a complex, with multiple complexes per CPU cluster. The result is increased area efficiency at a higher performance point. The merged-core microarchitecture also offers a wide configuration range for scalability across different consumer devices.
Arm’s new DSU-110 is the backbone of the DynamIQ CPU cluster. It binds together different Armv9 CPUs in cluster configurations that address diverse market segments across various PPA points. As we mentioned earlier, the maximum CPU cluster configuration is 8x Cortex-X2; however, there is a range of different CPU cluster configurations for different market needs. For example, the high-performance CPU configuration of 4x Cortex-X2 and 4x Cortex-A710 is targeted at premium laptop devices. We then have a 1+3+4 configuration for premium smartphones. This combines a single Cortex-X2 for high performance and a better, faster user experience with three Cortex-A710s and four Cortex-A510s for sustained use cases like AAA gaming. The range stretches right across to the ultra-area-efficient solution consisting of four Cortex-A510s, which deliver 35 percent performance improvements for entry markets like Home and AR and VR wearable devices.
Micro-architectural improvements in the redesigned DSU-110 provide increased bandwidth (up to 5x), improved multiprocessor performance, and greater scalability across all device markets, along with power reductions. The higher frequency capabilities of DSU-110 bring a combination of bandwidth, latency and power improvements, which can be tuned around different requirements. For example, this could enable higher bandwidth and lower latencies or reduced power at existing frequencies. The multiprocessor performance improvements are made possible by the larger L3 cache sizes, which are up to 16MB, and support for up to 8 Cortex-X2 cores. On power, the DSU-110 reduces the leakage of power from the CPU clusters for improved ‘days of use’ on consumer devices. Even when configured for a higher bandwidth, this is a lower power leakage than the previous generation. Moreover, low intensity workloads can still run when the DSU-110 is partially powered down, which is ideal for ‘screen off’ scenarios. The DSU-110 also brings advanced power management features through a new integrated Power Policy Unit (PPU) and multiple power-saving modes.
Machine learning (ML) and security are two areas where all the Armv9 CPUs have made significant improvements. Both are fundamental to most user experiences associated with next-generation consumer devices, from everyday web scrolling to advanced video and camera modes.
As ML performance becomes a requirement for all consumer devices, Arm’s Cortex CPUs are being increasingly used for ML computations. This is largely due to their pervasiveness and ease of programming. As mentioned at Vision Day, we are bringing multiple improvements in our A-profile architecture to enable future ML, DSP, and XR use cases.
Through the new Armv9 CPUs, we are bringing in new architectural features such as support for the BFloat16 (BF16) format, matrix multiply (Matmul) instructions for Int8 and BF16 data, and SVE2 (more on that later). These enable newer use cases at improved performance. For example, thanks to the new Matmul instruction support, Cortex-X2 doubles ML performance compared to the Cortex-X1. The same is true of Cortex-A710 compared with the Cortex-A78. Meanwhile, Cortex-A510 has an ML uplift of 3x compared to Cortex-A55.
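To make the two data formats more concrete, here is a minimal Python sketch. It is a software model for illustration only, not Arm code and not a description of how the hardware is implemented. The first pair of functions shows that BFloat16 is the top 16 bits of an IEEE 754 single-precision value (1 sign bit, 8 exponent bits, 7 fraction bits), here with round-to-nearest-even. The second function models an Int8 matrix multiply-accumulate of the shape I understand the Armv8.6-A/Armv9 `SMMLA` instruction to use: a 2x8 block times an 8x2 block, accumulated into a 2x2 block of 32-bit integers. The function names are hypothetical, and 32-bit wraparound is not modeled.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Encode x as 16 BFloat16 bits using round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep the sign and exponent, force a quiet-NaN fraction bit.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(h: int) -> float:
    """Decode 16 BFloat16 bits by placing them in the top half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def int8_mmla(acc, a, b):
    """Model of an Int8 matrix multiply-accumulate (assumed SMMLA-like shape).

    acc: 2x2 list of integers (the accumulator block).
    a:   2 rows of 8 signed 8-bit values (the 2x8 left operand).
    b:   2 rows of 8 signed 8-bit values, each row holding one COLUMN
         of the 8x2 right operand.
    Returns a new 2x2 list: acc + a x b.
    """
    for row in list(a) + list(b):
        if len(row) != 8 or any(not -128 <= v <= 127 for v in row):
            raise ValueError("operands must be rows of 8 signed 8-bit values")
    return [
        [acc[i][j] + sum(a[i][k] * b[j][k] for k in range(8)) for j in range(2)]
        for i in range(2)
    ]
```

For example, `float_to_bf16_bits(1.0)` gives `0x3F80`, and 3.14159 round-trips through BF16 as 3.140625, which shows the precision traded away for a float32-sized exponent range. Each call to `int8_mmla` performs 32 multiply-accumulates, which is the kind of density that makes dedicated matrix instructions attractive for ML inference.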
Security threats are becoming increasingly sophisticated and commonplace as more consumer devices come to market with more advanced compute capabilities and, as a result, larger attack surfaces. At the same time, the quantity and value of personal content and data becoming available through these always-connected devices are increasing all the time. Therefore, it is vital that we offer trustworthy and easily deployable security capabilities to enable our partners to build more secure SoCs, which ultimately gives the end user a safe and secure digital experience.
As part of the Total Compute solutions, we are raising the bar on security. We have built a range of new and existing security features into the Armv9 architecture to improve security across all consumer market segments. This means that our partners can achieve better value from software investment in security measures, leading to a more standardized and scalable security solution that can address a diversity of security challenges.
Secure-EL2 provides a standard secure isolation mechanism for trusted services and enables an easier way to maintain security in devices. Memory Tagging Extension (MTE) detects and prevents memory safety vulnerabilities across the entire ecosystem, providing performance and time-to-market benefits for a range of Arm partners, from the silicon vendor addressing bugs in the SoC to the OSVs and application developers using MTE-enabled devices to find buffer overflows and heap corruption in their own code. We are already working with Google on the adoption of MTE on Android after its Chromium Project team stated that 70 percent of all serious security bugs are memory safety issues. We are also addressing Control Flow Integrity with two new built-in features: Pointer Authentication (PAC) and Branch Target Identification (BTI). These two hardware mechanisms provide strong protection against Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks. Based on our studies from enabling these two features, the number of gadgets available to an attacker in Glibc reduces by about 98 percent, with a code size increase of only around 2 percent. For more detailed information about PAC and BTI and preventing ROP and JOP attacks, I recommend reading this blog.
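The idea behind MTE can be illustrated with a small software model. The sketch below is a simplified Python illustration, not Arm code: real MTE performs these checks in hardware on every tagged access. It relies on two architectural facts, that tags are 4 bits wide and that memory is tagged in 16-byte granules, with the pointer's tag carried in address bits [59:56]. The class and exception names (`TaggedMemory`, `TagCheckFault`) are hypothetical.

```python
GRANULE_BYTES = 16   # MTE assigns one tag per 16-byte granule of memory
TAG_BITS = 4         # each tag is 4 bits wide
TAG_SHIFT = 56       # the pointer's tag is carried in address bits [59:56]
TAG_MASK = (1 << TAG_BITS) - 1


class TagCheckFault(Exception):
    """Hypothetical stand-in for the fault reported on a tag mismatch."""


class TaggedMemory:
    """Toy byte-addressable memory with one tag per 16-byte granule."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * ((size + GRANULE_BYTES - 1) // GRANULE_BYTES)

    def tag_region(self, addr, length, tag):
        """Tag every granule covering [addr, addr + length) and return a
        pointer to addr that carries the same tag in its top bits."""
        first = addr // GRANULE_BYTES
        last = (addr + length - 1) // GRANULE_BYTES
        for granule in range(first, last + 1):
            self.tags[granule] = tag & TAG_MASK
        return addr | ((tag & TAG_MASK) << TAG_SHIFT)

    def _checked_address(self, ptr):
        """Strip the tag from ptr and compare it with the memory tag."""
        addr = ptr & ((1 << TAG_SHIFT) - 1)
        ptr_tag = (ptr >> TAG_SHIFT) & TAG_MASK
        mem_tag = self.tags[addr // GRANULE_BYTES]
        if ptr_tag != mem_tag:
            raise TagCheckFault(
                f"pointer tag {ptr_tag:#x} != memory tag {mem_tag:#x} "
                f"at address {addr:#x}"
            )
        return addr

    def load(self, ptr):
        return self.data[self._checked_address(ptr)]

    def store(self, ptr, value):
        self.data[self._checked_address(ptr)] = value
```

In this model, a 16-byte allocation tagged `0x3` can be read and written through its own pointer, but `load(p + 16)` steps into the next granule, whose tag differs, and raises `TagCheckFault`: a buffer overflow caught at the faulting access. Retagging the region on free makes any stale pointer fail in the same way, which is how use-after-free bugs are detected.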
We have further extended existing security support through adding Crypto instructions in NEON and SVE2 space. This accelerates cryptography algorithms relevant to a broad range of consumer devices. Finally, our Armv9 CPUs support speculation barriers with micro-architectural built-in defenses to mitigate side-channel attacks.
A compulsory part of the new Armv9 CPUs is SVE2, the Scalable Vector Extension version 2 architecture. SVE2 was announced at Vision Day, and we see it as an evolution of our Advanced SIMD architecture, bringing many useful features beyond those already provided by Neon.
For developers who are optimizing new code for consumer devices based on Armv9 CPUs, their code will be simpler, shorter, and easier to maintain through SVE2. This is because SVE2 has predicate-driven loop control and management for clearer code. Also, the removal of scalar tail code through SVE2 means developers can spend less time debugging and more time optimizing for performance. We see the benefits of SVE2 bringing improved performance for many popular consumer applications in ML and Computer Vision, as well as improved DSP capabilities and advanced imaging and video processing.
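The difference between a fixed-width loop with a scalar tail and a predicate-driven loop can be sketched in plain Python. This is a conceptual model only: it does not use SVE2 intrinsics, and the function names and the default lane count of 4 are assumptions made for illustration. The helper `whilelt` is named after the SVE `WHILELT` instruction, which builds a predicate whose lanes are active while the element index is below the loop limit.

```python
def whilelt(start, limit, lanes):
    """Build a predicate: lane k is active while start + k < limit."""
    return [start + lane < limit for lane in range(lanes)]


def add_with_scalar_tail(a, b, lanes=4):
    """Fixed-width style: a full-vector main loop, then a scalar tail loop."""
    n = len(a)
    out = [0] * n
    i = 0
    while i + lanes <= n:            # main loop handles full vectors only
        for lane in range(lanes):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    while i < n:                     # separate tail for leftover elements
        out[i] = a[i] + b[i]
        i += 1
    return out


def add_predicated(a, b, lanes=4):
    """Predicated style: one loop, with inactive lanes simply skipped."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)  # the final iteration is partly inactive
        for lane, active in enumerate(pred):
            if active:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out
```

Both functions return the same result for any array length, but the predicated version has a single loop body and no tail to write or debug. It also gives the same answer whether `lanes` is 4, 8, or 16, which mirrors the way SVE2 code is written without assuming a particular hardware vector length.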
Our work on SVE2 supports the developer access pillar of the Total Compute strategy, where we aim to create the easiest and most efficient way for developers to build their applications. As SVE2 is a better auto-vectorization target than Neon, more code is generated by the compiler for a wider range of algorithms, with less need for handwritten assembler. This means developers can write the code once, optimize once, and then deploy it many times across a broad range of consumer devices, essentially allowing their applications to reach more users. For more information, visit our SVE2 developer page.
Through Total Compute, we are focusing on bringing together powerful assets from across Arm’s range of highly performant and efficient IP to deliver flexible solutions. An important component in making this a reality is our Physical IP. Translating RTL into silicon can prove challenging when adopting the latest CPUs at the advanced process nodes. Arm’s new approach to POP IP for Total Compute solutions combines the most advanced node physical implementation focused on Cortex-A710 and Cortex-X2. This enables industry-leading performance within a power envelope that uses memory and logic optimized for the applicable Arm RTL. This new approach to POP IP is key to getting the most out of a given process technology, and solves the many challenges of advanced node designs to bring products to volume faster.
The new Armv9 CPUs are the foundation of our Total Compute solutions, providing uncompromised performance and efficiency. Beyond our continuous drive for PPA, we are broadening the dimensions of performance delivery to provide significant uplifts in ML and other real-world workloads. We are raising the bar on security across the industry by introducing new features like MTE, which enables a more standardized and scalable security solution across all consumer markets. Finally, we are providing an improved developer experience through introducing SVE2 as the SIMD architecture, so developers can code quicker for an improved time-to-market and more time to focus on performance. With all of these new features and enhancements available, the Armv9 CPUs offer the most comprehensive and holistic compute across the broadest range of markets and consumer devices. It is the heart of the Total Compute experience that will digitally empower people to do more with their favorite devices.