In 2019, we introduced our Total Compute strategy, taking a holistic, solution-focused approach to SoC design. We are moving beyond individual IP elements to designing and optimizing the system to create use case driven solutions to power the next decade of compute innovation. The 2021 Total Compute solutions launch is the first realization of this vision, with solutions for all consumer device markets and different performance and efficiency tiers.
At the heart of these system-wide optimized solutions are the three pillars of the Total Compute strategy. First, we aim to expand the dimensions of compute performance beyond general-purpose workloads to specialized processing, such as AAA gaming. Second, we are building on robust security foundations, while minimizing fragmentation, cost and the performance impact of new security capabilities. Third, we are building our solutions with greater developer accessibility in mind, making it easier to develop, debug, deploy, optimize, and port applications across a broad range of consumer devices.
Each solution offers different levels of performance, efficiency, and scalability to deliver specialized compute across multiple consumer device markets. The premium solutions are designed for top performance and connected user experiences on high-end smartphone and laptop devices. The performance solutions address a wide range of requirements across multiple segments, including mid-range smartphones, Chromebooks, and high-end Smart TVs. Finally, the efficiency solutions offer ultra-scalability to achieve best-in-class cost efficiency across entry-level smartphones, AR and VR wearables, mid-range and entry-level DTVs and set-top boxes, and smartwatches. Providing the backbone of these new solutions are hardware IP (including the latest Armv9 CPU cores, Mali GPUs and System IP), physical IP, software, tools, and standards.
This new solutions approach is needed because consumer demands across the spectrum of consumer devices are leading to more complex compute requirements. In the laptop segment, consumers want devices that increase their productivity, enable rich 4K (and even 8K) video content, and support video meetings throughout the day without needing a charge. In the smartphone segment, 5G is expected to move rapidly across all smartphone tiers: premium, mainstream, and entry-level. Consumers want to do more on their smartphones regardless of cost. They want more gaming content and more camera capabilities with higher resolutions and frame rates. In the home segment, consumers want to view 4K and 8K content on their TVs, with smoother experiences and quicker app loading times. Finally, in the emerging wearables and XR segment, consumers want immersive experiences that last longer while untethered. Covering all these different segments is the foundational requirement of robust security.
Our Total Compute solutions expand the dimensions of performance beyond general-purpose workloads to specialized workloads. One key workload that we have examined and optimized as part of the new Total Compute solutions is gaming. Gaming continues to push the limits of mobile technology, not just on the GPU but across CPUs and System IP too.
Gaming is a great example of how Arm’s Total Compute solutions deliver tangible benefits to complex, real-world, and specialized workloads. To accelerate the CPU performance for gaming workloads, we focused on optimizations across Armv9 CPUs, Mali GPUs and their software drivers running on the CPU cluster. We also combined the microarchitectural innovations in the CPU with the introduction of new GPU features like Command Stream Frontend (CSF) to reduce the CPU load. This results in improved CPU performance for gaming workloads. Alongside new feature improvements in the Arm Mali-G710 GPU, such as the redesigned texture unit and execution engines, we can also tackle more demanding gaming content. At the same time, we achieve better overall frames per second (FPS). Finally, our brand-new interconnect, CoreLink CI-700, supports a System Level Cache (SLC). This, in combination with new features introduced in the GPU, reduces latency and system power consumption for a variety of gaming content running in the system.
We have measured the performance and efficiency benefits of these cross IP system and software optimizations for different gaming content. This gave an average 27 percent improvement across a range of Mali-DDK workloads for different gaming content running on the Arm Cortex-A710 CPU. It also delivered 20 percent improvements in performance and efficiency across different gaming content on Mali-G710 compared to the previous generation Arm Mali-G78 GPU. Furthermore, by enabling core features in the system, such as SLC, FP16, AFBC, and CSF, we see a 15 percent system efficiency improvement.
As discussed in this blog on the new suite of Mali GPUs, we are not just relying on IP improvements for gaming. We are investing in the gaming ecosystem and working with leading game engines and companies to ensure that their gaming content can be optimized for Arm IP. Arm Mobile Studio is a great example of a tooling platform that supports developers to optimize their gaming content and unlock further performance and efficiency benefits. It is a suite of free-to-use performance analysis tools that analyze the CPU activity, GPU activity and content metrics of games. This means game developers can quickly identify and fix any problems that might cause the game to run slowly, overheat the device or quickly drain battery life.
Alongside gaming, we are also addressing the explosion of AI and machine learning (ML) use cases across all consumer devices. This poses a unique set of performance challenges that can be addressed by our Total Compute solutions. Compute intensive use cases, such as AI camera, often require real time and concurrent processing for ML algorithms under certain timing constraints. Therefore, performance efficiency is vital. Then, there is the need to support a broad diversity of ML algorithms in a single SoC. This requires the system to support different data formats across different compute elements. Finally, the increasing use of ML algorithms for security sensitive use cases and high value assets, such as face unlock and mobile banking, demands strong protection.
With more consumer devices being equipped with very powerful neural processors, it can be easy to think that NPUs are able to solve these complex problems on their own. In reality, there are multiple stages in the compute pipeline where different specialized processing units are needed. For example, with portrait mode in smartphone cameras (a common ML workload), the CPU can be used for pre- and post-processing the image, followed by the GPU or NPU for extracting the depth map and segmentation. This is then passed again to the CPU for bokeh, and finally to the NPU for super resolution. This illustrates how one common AI workload places diverse computational demands on a single SoC, and how system-wide optimizations can improve it dramatically.
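The portrait-mode flow described above can be sketched as a simple mapping from pipeline stages to processing units. This is only an illustrative model: the `Stage` type, the stage names, and the `handoffs` and `units_used` helpers are hypothetical and do not correspond to any Arm API. The sketch covers the four stages whose order the text states and leaves out post-processing, whose position in the flow is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    units: tuple  # processing units that can run this stage, per the text


# Stage order as described in the text (hypothetical names).
PORTRAIT_PIPELINE = (
    Stage("pre-processing", ("CPU",)),
    Stage("depth map and segmentation", ("GPU", "NPU")),
    Stage("bokeh", ("CPU",)),
    Stage("super resolution", ("NPU",)),
)


def units_used(pipeline):
    """Return every processing unit that appears anywhere in the pipeline."""
    return sorted({unit for stage in pipeline for unit in stage.units})


def handoffs(pipeline, choose=lambda units: units[0]):
    """Count how often work moves from one processing unit to a different one.

    `choose` picks a unit for stages that can run on more than one.
    """
    chosen = [choose(stage.units) for stage in pipeline]
    return sum(1 for a, b in zip(chosen, chosen[1:]) if a != b)
```

In this model every stage boundary is also a handoff between processing units, whichever unit is chosen for the depth stage, which is one way to see why data movement and system-level optimization matter as much as the speed of any single block.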
All of the IP in the Total Compute solutions provide specialized and scalable AI compute capabilities. The new Armv9 CPU cores advance DSP and ML workloads with SVE2, Matmul, and BFloat16 support. The Mali GPUs offer mixed precision capability for image enhancement and the Ethos NPUs are highly efficient neural processors with multiprocessor support for high throughput AI video processing. Then, there are also the Arm Cortex-M55 CPU and Arm Ethos-U65 NPU that are both specialized for always-on ML use cases.
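To make the BFloat16 format mentioned above more concrete: bfloat16 keeps the sign bit and the 8 exponent bits of a 32-bit float but only 7 mantissa bits, so its bit pattern is the upper half of the float32 encoding. The sketch below illustrates that relationship in plain Python. The function names are hypothetical, and the conversion simply truncates the lower 16 bits for clarity; hardware and libraries generally apply a rounding mode instead.

```python
import struct


def float_to_bfloat16_bits(x: float) -> int:
    """Return a 16-bit bfloat16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


if __name__ == "__main__":
    pattern = float_to_bfloat16_bits(3.141592653589793)
    # The round trip keeps the range of float32 but only a few digits of precision.
    print(hex(pattern), bfloat16_bits_to_float(pattern))
```

The trade-off is range over precision: bfloat16 covers the same exponent range as float32 while halving storage and bandwidth, which suits many ML workloads that tolerate reduced mantissa precision.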
The importance of AI and ML workloads on consumer devices is exactly why Arm Total Compute solutions provide the broadest range of specialized, scalable AI. Across all the IP in the solutions, there are significant ML performance uplifts, ranging from 35 percent on the Mali-G710 GPU to 10x performance on Cortex-M55 for power-constrained use cases like keyword spotting. This cross-IP, system-wide focus on ML capabilities enables our partners to execute specialized AI and ML workloads for a variety of use cases under different power and silicon cost constraints.
As discussed in the Armv9 CPUs blog, we are rethinking the security architecture, moving it from diverse and expensive mitigations to a standardized and scalable solution. By incorporating security into the foundational layers of the architecture, a broad range of consumer devices are better protected against security threats. This approach also minimizes fragmentation, cost, and the performance impact of introducing these new security capabilities, and it provides significant performance improvements compared with software-only security solutions.
Through the Armv9 CPUs, we introduce new security features and technologies, as well as enhancing existing features and support, to address a diversity of security issues across multiple consumer devices. These are all in response to growing security threats and attack surfaces.
Secure-EL2 provides standard secure isolation for trusted services. Memory Tagging Extension (MTE), which was co-designed with Google from definition to development to deployment, makes it easier and more efficient to detect memory safety violations, a common vulnerability in existing C/C++ code. Finally, Pointer Authentication (PAC) and Branch Target Identification (BTI) mitigate Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks that target complex software stacks.
This approach to security is not just useful for silicon partners and device manufacturers. Developers also benefit by being able to efficiently deploy reliable, stable, and secure applications to their customers without getting bogged down in the complexity of security. For developers, MTE, PAC, and BTI are particularly useful. PAC and BTI protect against ROP and JOP attacks that many developers are not familiar with. Meanwhile, MTE allows developers to quickly track down memory safety bugs in languages like C and C++, improving time-to-market.
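The idea behind memory tagging can be illustrated with a small software model. MTE associates a 4-bit tag with each 16-byte granule of memory and carries a tag in the upper bits of each pointer; an access whose pointer tag does not match the memory tag is reported. The sketch below is a toy simulation under those assumptions, not how hardware or a real allocator implements it: `TaggedHeap`, `TagMismatch`, and the `(tag, address)` pointer representation are all hypothetical, and tags are assigned deterministically here where a real allocator would typically pick them at random.

```python
GRANULE = 16      # MTE tags memory in 16-byte granules
NUM_TAGS = 16     # a 4-bit tag gives 16 possible values


class TagMismatch(Exception):
    """Raised when a pointer's tag differs from the tag of the memory it touches."""


class TaggedHeap:
    """Toy bump allocator with one tag per granule (illustration only)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * (size // GRANULE)
        self.sizes = {}       # start address -> rounded allocation size
        self.next_free = 0
        self.last_tag = 0

    @staticmethod
    def _different_tag(tag):
        # Cycling through 1..15 keeps the sketch deterministic and
        # never returns the tag that was passed in.
        return tag % (NUM_TAGS - 1) + 1

    def _set_tags(self, addr, size, tag):
        for granule in range(addr // GRANULE, (addr + size) // GRANULE):
            self.tags[granule] = tag

    def malloc(self, size):
        rounded = -(-size // GRANULE) * GRANULE   # round up to whole granules
        if self.next_free + rounded > len(self.data):
            raise MemoryError("toy heap exhausted")
        addr = self.next_free
        tag = self._different_tag(self.last_tag)  # differs from the neighbour's tag
        self._set_tags(addr, rounded, tag)
        self.sizes[addr] = rounded
        self.next_free += rounded
        self.last_tag = tag
        return (tag, addr)                        # a "pointer" is (tag, address)

    def free(self, ptr):
        _, addr = ptr
        size = self.sizes.pop(addr)
        # Retag on free so that stale pointers no longer match.
        self._set_tags(addr, size, self._different_tag(self.tags[addr // GRANULE]))

    def _check(self, ptr, offset):
        tag, addr = ptr
        target = addr + offset
        if self.tags[target // GRANULE] != tag:
            raise TagMismatch(f"pointer tag {tag} does not match memory tag")
        return target

    def load(self, ptr, offset):
        return self.data[self._check(ptr, offset)]

    def store(self, ptr, offset, value):
        self.data[self._check(ptr, offset)] = value
```

In this model, reading past the end of one allocation into its neighbour, or reading through a pointer after `free`, raises `TagMismatch`. An overflow that stays inside the same 16-byte granule is not caught, because the tag is tracked per granule rather than per byte.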
MTE, in particular, has been warmly received by our partners. Here is what Chris Rohlf, a Security Engineer at Facebook, had to say:
“Security engineering teams at Facebook believe the adoption of Arm’s Memory Tagging Extension (MTE) technology in Armv9 CPU cores can help our industry find and eradicate critical memory safety security vulnerabilities from Android devices.”
Physical implementation is crucial when extending the Total Compute solutions’ stack into silicon. Arm POP IP for Total Compute addresses this solution diversity with unique implementation schemes, dependent on the design requirements and selected process node. This uniqueness is key to getting the best silicon performance results for a given process technology. It provides solutions to the many challenges of advanced node designs and enables our partners to bring their products to volume faster.
Alongside support for application developers, we continue to invest in a range of tools that enable SoC developers to extract stunning performance and reduce risk when developing leading-edge designs. At the earliest development stage, our partners can experiment through virtual prototyping of the latest IP using Arm Fast Models and evaluate system-level performance through the CoreLink NI-700 advanced design and verification tooling. In addition, a reference open-source software stack is now available for the first version of the Total Compute Fixed Virtual Platform (FVP), named TC0. This enables product software development and the seamless integration of a variety of Arm Partner product and software solutions. For profiling, Arm Development Studio provides heterogeneous performance analysis for complex systems and full workloads. This allows developers to interrogate hardware counters and further optimize systems across CPU, GPU, and NPU resources. Finally, we have collaborated with key partners to ensure users can take advantage of the new SVE2 and MTE features when using our toolchains, with new architecture support provided in LLVM9 and GNU10. All of these tools are built into or support the Total Compute solutions for the ultimate developer experience.
This is a truly exciting time for compute. Through the realization of our Total Compute strategy, we are providing more performant, secure, efficient, scalable and developer friendly solutions that will power the next generation of consumer devices. This lays the foundation for true innovation in the ecosystem, providing the compute experiences that will transform our digital lives.
Read the Arm.com Newsblog: https://www.arm.com/company/news/2021/05/arm-total-compute-solutions-and-armv9-to-the-broadest-range-of-client-devices
Learn more about the Total Compute Solutions: https://www.arm.com/solutions/mobile-computing