At Arm, we’re often asked by partners, developers and other interested parties across the vast and complex machine learning (ML) ecosystem which processors are best at performing specific ML tasks on different devices. As described in this Arm white paper, the CPU is still the common denominator for ML experiences from edge to cloud. It remains central to all ML systems, whether it handles ML tasks entirely on its own or partners with other processors, such as GPUs or NPUs. However, the IP used for ML will vary based on the requirements of the device, the use case and the specific workloads.
There is no ‘one-size-fits-all’ ML solution; there are plenty of different versions and deployment choices. In this blog, I’ll take you through a selection of ML use cases, from face unlock on smartphones and PCs to content recommendations on smart TVs, and explain which processors – CPU, GPU or NPU – typically carry out the different ML workloads in each case.
Before delving into the use cases, it’s worth taking a general look at the advantages of ML compute on the CPU, GPU and NPU. As the CPU sits at the center of the compute system, it has the flexibility to run any type of ML workload and is often the first-choice ML processor for mobile computing. Although the GPU’s primary function is graphics processing, its parallel data processing capability makes it well suited to running ML workloads. Finally, the NPU is designed for specialized, hyper-efficient and highly task-specific ML compute.
However, ‘Tiny ML’ tasks on power-constrained devices can be handled by MCUs. For example, there is increasing demand for always-on use cases, such as keyword detection, natural language processing, and always-on cameras for object detection and recognition. Such tasks can be performed on the MCU thanks to continuous performance and efficiency improvements in these processors. Moreover, MCUs are now equipped with ML functionality, so they can handle workloads locally without spending power, bandwidth and time sending data to the cloud.
Nowadays, face unlock is a common security feature on many smartphones and PCs. The face recognition algorithm scans for unique facial features – such as the distance between the eyes, the shape of the face, facial hair density and eye color – and stores them on the device as a reference template. If a new scan matches that template, the device unlocks. Face recognition is very well suited to ML. Even though this use case sounds like a complicated process better suited to an NPU, most face recognition tasks can be performed on the CPU. This operation can run on an Arm Cortex-A CPU using low-level, optimized software functions provided by the Arm Compute Library.
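To make the matching step concrete, here is a minimal sketch in Kotlin of the comparison a face unlock pipeline might perform once a face recognition model has produced a feature embedding. The `enrolledEmbedding` and `liveEmbedding` names and the 0.8 threshold are illustrative assumptions, not Arm's implementation.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two face embeddings (feature vectors).
// The embeddings would come from a face recognition model running on the CPU.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Embeddings must have the same length" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Unlock if the live embedding is close enough to the enrolled one.
// The 0.8 threshold is a placeholder; real systems tune this carefully.
fun shouldUnlock(enrolledEmbedding: FloatArray, liveEmbedding: FloatArray): Boolean =
    cosineSimilarity(enrolledEmbedding, liveEmbedding) > 0.8f
```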
Real-time AR emojis are a fun feature on modern communication apps, such as Snapchat, overlaying AR cartoon features onto your face when you take a photo or video. Real-time AR emojis use the CPU and GPU. The CPU detects the face, as it does with face unlock. The GPU then detects the emotion on the face to auto-select the appropriate emoji. Face tracking within a photo or video frame can be done by either the CPU or the GPU. For fairly basic AR emojis, the CPU should suffice; the NPU is needed for more complex, full-body tracking.
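As a rough illustration of the face detection and emotion stages, the sketch below uses Google's ML Kit face detector with classification enabled and maps the smiling probability to an emoji. The `pickEmoji` helper and the 0.7 threshold are hypothetical; a production AR pipeline does far more than this, and where the model actually executes depends on the device.

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.face.FaceDetection
import com.google.mlkit.vision.face.FaceDetectorOptions

// Detect a face and use its smiling probability to choose a simple emoji.
fun pickEmoji(bitmap: Bitmap, onEmoji: (String) -> Unit) {
    val options = FaceDetectorOptions.Builder()
        .setClassificationMode(FaceDetectorOptions.CLASSIFICATION_MODE_ALL)
        .build()
    val detector = FaceDetection.getClient(options)
    val image = InputImage.fromBitmap(bitmap, 0)

    detector.process(image)
        .addOnSuccessListener { faces ->
            val face = faces.firstOrNull() ?: return@addOnSuccessListener
            // smilingProbability is null if classification was not computed.
            val smiling = face.smilingProbability ?: 0f
            onEmoji(if (smiling > 0.7f) "😄" else "😐")
        }
        .addOnFailureListener { it.printStackTrace() }
}
```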
Camera auto-mode uses the CPU, GPU and NPU. The CPU and GPU perform the same task: detecting the region of interest when the camera is in use. The NPU then performs the classification – for example, whether the scene contains flowers, people or food. Once these steps are complete, the phone automatically sets the appropriate camera mode. Similarly, the AI Camera feature on smartphones runs AI algorithms across the CPU, GPU and NPU, with multiple algorithms often running concurrently.
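As a hedged sketch of the classification step, the snippet below runs ML Kit's on-device image labeler over a camera frame and maps the top label to a camera mode. The `chooseCameraMode` function and the label-to-mode mapping are made up for illustration; the NPU-accelerated pipeline in a real camera app is considerably more involved.

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.label.ImageLabeling
import com.google.mlkit.vision.label.defaults.ImageLabelerOptions

// Classify the scene in a camera frame and pick a matching camera mode.
fun chooseCameraMode(frame: Bitmap, onMode: (String) -> Unit) {
    val labeler = ImageLabeling.getClient(ImageLabelerOptions.DEFAULT_OPTIONS)
    val image = InputImage.fromBitmap(frame, 0)

    labeler.process(image)
        .addOnSuccessListener { labels ->
            // Take the highest-confidence label and map it to a mode.
            val top = labels.maxByOrNull { it.confidence }?.text ?: "Unknown"
            val mode = when (top) {
                "Flower", "Plant" -> "Macro"
                "Food" -> "Food"
                "Person" -> "Portrait"
                else -> "Auto"
            }
            onMode(mode)
        }
        .addOnFailureListener { it.printStackTrace() }
}
```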
In portrait video mode, the CPU, GPU and NPU are used for different ML workloads. Directional audio recording (for example, during video recording or a video call) is typically handled by Cortex-M CPUs due to their greater energy efficiency. The main tasks for the NPU or GPU are semantic segmentation and depth sensing, which are important when building a virtual representation of the environment, especially on mobile.
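To give a flavor of the segmentation step, here is a minimal sketch using ML Kit's selfie segmentation API, which returns a per-pixel foreground-confidence mask an app could use to separate the subject from the background. Whether that model ultimately executes on the GPU or NPU depends on the device, and the `blurBackground` callback is a hypothetical placeholder.

```kotlin
import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.segmentation.Segmentation
import com.google.mlkit.vision.segmentation.SegmentationMask
import com.google.mlkit.vision.segmentation.selfie.SelfieSegmenterOptions

// Separate the subject from the background in a video frame.
fun segmentFrame(frame: Bitmap, blurBackground: (SegmentationMask) -> Unit) {
    val options = SelfieSegmenterOptions.Builder()
        .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE) // per-frame video use
        .build()
    val segmenter = Segmentation.getClient(options)
    val image = InputImage.fromBitmap(frame, 0)

    segmenter.process(image)
        .addOnSuccessListener { mask ->
            // mask.buffer holds one foreground-confidence value per pixel,
            // laid out as mask.width x mask.height.
            blurBackground(mask)
        }
        .addOnFailureListener { it.printStackTrace() }
}
```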
Away from smartphones, smart TVs have important ML use cases in content recommendation and image-quality improvement. The NPU is the key component here, as it recognizes the content being played on a content platform. The information is then sent to the cloud to match that particular piece of content with others, so a recommendation can be made to the user. On smart TVs, the CPU and GPU recognize the user through audio ID or face recognition. A similar approach supports child protection: the NPU recognizes the user, allowing the TV to show only appropriate content, and if the content being played is not suitable for the audience, playback stops.
Today, many APIs for mobile developers with ML workloads run on the CPU. This could change in the future, but most non-specialist mobile developers may not need to look beyond the CPU for processing ML workloads. These APIs often fall into two ML categories: vision and natural language processing. For vision – which analyzes and interprets images and video streams – examples of base APIs include barcode, text, face and object detection. For natural language processing – which analyzes and generates text, speech and other kinds of language data – examples include language identification, on-device translation and smart reply. On iOS, mobile developers use Core ML when building ML models into their apps, while on Android they use ML Kit.
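For the natural language processing side, a minimal ML Kit example in Kotlin might look like the following sketch, which identifies the language of a string entirely on-device (typically on the CPU). The `handleLanguage` callback is just an illustrative placeholder.

```kotlin
import com.google.mlkit.nl.languageid.LanguageIdentification

// Identify the language of a piece of text on-device.
fun identifyLanguage(text: String, handleLanguage: (String) -> Unit) {
    val languageIdentifier = LanguageIdentification.getClient()
    languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener { languageCode ->
            // "und" means the language could not be determined.
            handleLanguage(if (languageCode == "und") "unknown" else languageCode)
        }
        .addOnFailureListener { it.printStackTrace() }
}
```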
ML was a key theme throughout the recent Premium IP launch. Due to a 20 percent IPC performance improvement over the previous Cortex-A76 CPU, the new Arm Cortex-A77 CPU delivers advanced ML experiences on devices. Meanwhile, the new Arm Mali-G77 GPU brings a 60 percent improvement to ML performance, significantly boosting inference and neural net (NN) performance for advanced on-device intelligence. On top of all this, the Arm ML processor – our NPU – is designed to unleash high ML performance across the ecosystem.
However, the IP for ML is only one part of the story, as its true power can only be realized with good software. The ML software sits on top of the IP, optimizing how ML workloads are distributed and executed across the different compute elements on a device. The end result is that the key ML use cases outlined previously run faster, with better performance and greater energy efficiency. Our own open-source Arm NN software development kit has been incredibly successful so far, already shipping in more than 250 million Android devices.
The Arm ML processor and Arm NN, alongside the latest Arm premium Cortex-A CPUs and Mali GPUs, the Arm Compute Library and CMSIS-NN, all form part of Project Trillium, Arm’s own heterogeneous ML compute platform. This represents a suite of Arm products that gives device-makers all the hardware and software choices they need for ML.
The CPU remains the “workhorse” for ML tasks, either taking these on itself or deciding which processor should carry out specific actions. While NPUs are going to be important for future devices, plenty of ML actions and workloads can already take place on the CPU. In fact, utilizing existing CPUs can democratize the ML experience across cost-sensitive devices, such as mid- to low-tier smartphones.
The key thing for OEMs and silicon vendors is understanding the different ML workloads on devices and at what point they can stay on the CPU or be moved to the GPU or NPU. For developers and their base ML APIs, the CPU will remain the main processor for their ML workloads. At the same time, ML software cannot be forgotten, as it drives the enhancement and optimization of ML workloads on devices. Utilizing the different processors, and deciding when to use one or the other, will be vital for devices that are increasingly performing ML at the edge across a range of features and applications.
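To illustrate what that decision can look like in practice for an Android developer, here is a hedged sketch using TensorFlow Lite from Kotlin: a model runs on the CPU by default, or can be offloaded via the GPU delegate or the NNAPI delegate (which vendor drivers such as Arm NN can back in order to reach an NPU). The `Accelerator` enum and the model file name are assumptions for illustration.

```kotlin
import java.io.File
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.nnapi.NnApiDelegate

enum class Accelerator { CPU, GPU, NNAPI }

// Build a TensorFlow Lite interpreter that runs on the chosen processor.
fun buildInterpreter(modelFile: File, accelerator: Accelerator): Interpreter {
    val options = Interpreter.Options()
    when (accelerator) {
        Accelerator.CPU -> options.setNumThreads(4)                // default CPU path
        Accelerator.GPU -> options.addDelegate(GpuDelegate())      // offload to the GPU
        Accelerator.NNAPI -> options.addDelegate(NnApiDelegate())  // NPU via a vendor NNAPI driver
    }
    return Interpreter(modelFile, options)
}

// Usage (illustrative):
// val interpreter = buildInterpreter(File("model.tflite"), Accelerator.NNAPI)
// interpreter.run(inputBuffer, outputBuffer)
```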
[CTAToken URL="https://www.arm.com/solutions/mobile-computing" target="_blank" text="Learn about Arm's mobile computing solutions" class="green"]