Today’s mobile devices act as a hub for everything we do, with thousands of different use-cases and millions of apps. The modern smartphone, for instance, does far more than make phone calls and send messages. It emails, takes photos, records and streams videos, plays games, and makes instant payments. It also acts as a personal assistant, responding to and interacting with your voice. It even interfaces with your home and other connected devices around the office, school, or retail environment. As a result, we are now living in a world of digital immersion through our mobile devices.
New use-cases and experiences on devices will only continue to advance in the future, at an even grander scale. We are engineering even more intelligent devices that provide richer, quicker, more immersive, more fulfilling, more convenient experiences, completely customized around the end user. However, this move towards greater digital immersion will also lead to more advanced, complex, and demanding multi-domain compute workloads, especially for the new XR (augmented reality (AR) and virtual reality (VR)), gaming, viewing, and AI-based experiences. This presents two challenges. Firstly, future Arm IP will require high performance for these compute-intensive workloads. Secondly, this high performance will need to be packed into the small power envelope of SoCs on the devices of the future.
To meet these conflicting requirements of high performance and efficiency, there needs to be a total shift in approaches to SoC design. Arm plans to achieve this through an approach we are calling Total Compute. We are moving beyond optimizing individual IP and taking a system-level solution view of the entire SoC, focusing on the use-cases and experiences of next-generation devices, and ensuring that the entire system works together seamlessly to provide the maximum performance and efficiency that digital immersion demands.
Interconnectivity on the system
Constantly improving compute performance is one of the things we do best at Arm. Each year we release new products that push the limits of performance while respecting the required power-efficiency envelope of mobile devices. Total Compute is no different, only this time we are taking a solution-based approach to accelerating these performance gains. This means looking at performance across the entire system through a deep analysis of the workloads, which determines how data movement and compute are best deployed across the different IP blocks and compute domains.
Ultimately, more complex use-cases require greater performance. The challenge with integrating diverse blocks of IP into the SoC is that increasing the active die area puts more pressure on thermal and power budgets. That is why the emphasis on the entire system is needed, so each IP block is developed with a common underlying architectural approach to performance, efficiency, and data exchange. This means that all components work together seamlessly and can be accessed simply by developer tools.
This also builds intelligence into the system beyond the individual blocks of compute. It is not just about individual IP; it is about each IP block interconnecting effectively across the system. The result is best-in-class performance and efficiency to enable the use-cases and experiences of the future on next-generation devices.
Augmented Reality in the future
AI capabilities, such as AI camera and computer vision, and Augmented Reality (AR) experiences, such as multi-user AR gaming, are complex use-cases that Total Compute will power. However, focusing specifically on AR, you can see why the Total Compute approach is needed.
For the different AR use-cases and experiences, many compute elements need to come together to make them work seamlessly on devices. The CPU drives performance in a power-efficient manner. The GPU drives the graphics. AI is used for detection – from the user’s location to specific objects and landmarks. Then, we need to bring this IP together to work seamlessly in the system. This is where System IP – which includes our interconnects, security IP, and controllers – adds huge value, helping to build better systems focused on low-power constraints and high security protections. Finally, there needs to be a super-fast, high-bandwidth, low-latency network connection to ensure all these capabilities work while the user is on the move (more on 5G later).
Moreover, this compute needs to happen within a future form factor that is likely to be even more lightweight and smaller than today’s. For example, the AR smart glasses of the future have a limited SoC area and power budget. Therefore, the high performance will need to take place in an even smaller power envelope than that of today’s average premium smartphone. With all these different elements, you can already see why being able to optimize across the entire system is so important. This ensures that all the components work together cohesively.
Machine Learning on the device
An area where Total Compute pushes performance is through machine learning (ML). The ML performance of our Cortex CPU products has gradually increased year-on-year. However, to enable the range of digital immersion use-cases and experiences through Total Compute, ML performance needs to be pushed to an even higher level. At the 2019 TechCon, I talked about how Arm will be adding Matrix Multiply (MatMul) to our next-generation CPU, codenamed “Matterhorn”. This will effectively double ML performance over previous generations, representing a significant leap that helps to enable a range of new AI-based use-cases and experiences.
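To see why a dedicated matrix-multiply capability matters so much for ML, note that the dominant operation in neural-network inference is matrix multiplication: a dense layer is essentially one matmul plus a bias add. The sketch below is a pure-Python illustration of that structure (our example, not Arm code); real kernels would map the inner multiply-accumulate loop onto hardware MatMul instructions.

```python
# Illustration: a dense neural-network layer reduces to one matrix multiply,
# which is why hardware MatMul support targets the hot loop of ML inference.
# Pure-Python sketch; production kernels use vector/matrix instructions.

def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def dense_layer(x, weights, bias):
    """y = x @ W + b for a batch of input rows."""
    y = matmul(x, weights)
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]

x = [[1.0, 2.0]]                  # one input row with two features
w = [[0.5, -1.0], [0.25, 2.0]]    # 2 x 2 weight matrix
b = [0.1, 0.2]
print(dense_layer(x, w, b))       # [[1.1, 3.2]]
```

Because nearly all inference time is spent inside loops like the one in `matmul`, doubling the throughput of that operation translates almost directly into doubled ML performance.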
However, it is not just the CPU that is seeing these ML performance boosts. We are investing in ML performance improvements across all our compute domains. The latest Premium (Mali-G77) and Mainstream (Mali-G57) GPUs offer significant ML performance uplifts. Both GPUs provide mobile devices with capabilities to perform ML tasks faster through a 60 percent performance density improvement. Meanwhile, the latest Premium (Ethos-N77) and Mainstream (Ethos-N57) NPUs provide the ML performance and efficiency to unleash AI across the ecosystem. For example, Ethos-N77 delivers up to four TOPS of performance, which then scales to hundreds of TOPS in multiprocessor deployments.
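As a rough back-of-envelope check on that scaling claim (our arithmetic for illustration, not Arm's sizing guidance), reaching a multiprocessor target from 4-TOPS building blocks is a simple division:

```python
# Back-of-envelope scaling: how many 4-TOPS NPU instances are needed
# to reach a given multiprocessor target? Illustrative arithmetic only.
PER_NPU_TOPS = 4

def npus_needed(target_tops):
    # Ceiling division: partial NPU instances do not exist.
    return -(-target_tops // PER_NPU_TOPS)

print(npus_needed(100))  # 25 instances for 100 TOPS
print(npus_needed(200))  # 50 instances for 200 TOPS
```

So "hundreds of TOPS" implies deployments on the order of dozens of NPU instances working together, which is exactly where system-level interconnect and data-movement design becomes critical.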
On top of all this, there is Arm NN, a common software framework and API that maximizes ML performance across all Arm IP. Our performance analysis shows how the implementation of Arm NN led to a performance uplift of up to 9.2x over a period of just six months. This uplift was seen across a big Cortex-A CPU, a LITTLE Cortex-A CPU, and a Mali GPU. This continued commitment to ML improvements across all compute domains lends itself perfectly to a future Total Compute solution.
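The value of a common layer like Arm NN is that the application targets one interface while execution is routed to whichever compute domain is present and preferred. The sketch below captures that dispatch idea in miniature; the class and method names are invented for illustration and are not the real Arm NN API.

```python
# Hypothetical sketch of the idea behind a common ML interface such as
# Arm NN: one front-end call, with execution dispatched to whichever
# backend (NPU, GPU, CPU) is available and preferred. All names here
# are illustrative inventions, NOT the actual Arm NN API.

class Backend:
    def __init__(self, name, available):
        self.name, self.available = name, available

    def run(self, model, inputs):
        # A real backend would compile and execute the network here.
        return f"{model} executed on {self.name}"

class Runtime:
    """Picks the first available backend from a preference list."""
    def __init__(self, backends):
        self.backends = backends

    def execute(self, model, inputs, preference):
        for name in preference:
            for b in self.backends:
                if b.name == name and b.available:
                    return b.run(model, inputs)
        raise RuntimeError("no available backend")

# On this hypothetical device the NPU is absent, so work falls to the GPU.
rt = Runtime([Backend("NPU", False), Backend("GPU", True), Backend("CPU", True)])
print(rt.execute("mobilenet", [0.0], ["NPU", "GPU", "CPU"]))
# → "mobilenet executed on GPU"
```

The design point is that the application code is identical whether the workload lands on an NPU, GPU, or CPU, which is how one framework can deliver uplifts across all three domains at once.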
5G on the SoC
Total Compute’s drive for greater compute performance will be supported by the brand-new wave of 5G connectivity. 5G promises to be a transformative technology for the entire mobile ecosystem. It provides far higher network speeds – already up to ten times faster than 4G – alongside far lower latency. These huge advancements in connectivity enable a new wave of digital immersion through new applications and experiences. At the same time, existing applications, use-cases, and experiences advance, becoming far quicker, more immersive, and more convenient for the user while they are on the move. The challenge is that 5G will lead to far more data and information being captured on the device, adding to the already complex and compute-intensive workloads of the future. The combination of high data and performance demands from 5G makes a Total Compute solution even more necessary for future designs.
Total Compute is a system-wide approach to design that will enable the next wave of digital immersion. The approach will accelerate compute performance, helping to realize the enormous potential of all the exciting use-cases and experiences in the future. For the user, this means richer, quicker, more fulfilling, more convenient, more immersive, and more intelligent experiences on their devices customized entirely around them.
Our commitment to improved performance across all our compute domains provides developers with the capabilities to design more immersive applications for the mobile ecosystem. However, being able to program these applications across the entire system is a challenge. The second pillar of Total Compute – developer access – will solve this conundrum. In the next blog, I will explain how we will address the critical points for improved developer access, enabling developers to access and unleash all the performance in the Total Compute system across the CPU, GPU, and NPU.
Read the Total Compute Dummies Guide