Arm Ethos-N ML Inference Processors: Powering Exciting User Experiences on Edge Devices

Ian Forsyth
May 27, 2019
5 minute read time.

OK. Quick survey: How many connected devices do you own?

Whether you’re a gadget addict or just an average Josephine, I’m not sticking my neck out too far if I guess that you own more today than you did five years ago. From smartphones and tablets to personal fitness trackers, smart asthma inhalers and smart doorbells, we’re all busily increasing our connectivity year on year – along with our own personal data explosion. According to a recent report, in the last ten years, the global average of connected devices per capita has leapt from less than two to a projected 6.58 by 2020. That’s an awful lot of devices creating an awful lot of data.

Until recently, that data was routinely shipped to the cloud for processing. But as data volumes and device numbers grow exponentially, it’s just not practical – not to mention secure or cost-effective – to keep shifting all that data back and forth.

Fortunately, recent advances in machine learning (ML) mean that more processing, and pre-processing, can now be done on-device than ever before. This brings a range of benefits, from increased safety and security, thanks to the reduced risk of data exposure, to cost and power savings. Infrastructure to transmit data to the cloud and back doesn’t come cheap, so the more processing that can be done on-device, the better.

Power and Efficiency Across the Performance Curve

On-device ML starts with the CPU, which acts as an adept ‘traffic controller’, either single-handedly managing entire ML workloads or distributing selected tasks to a specialized processor such as an Ethos-N NPU.

Arm CPUs – and GPUs – are already powering thousands of ML use cases across the performance curve, not least for mobile, where edge ML is already driving features that consumers have come to expect as standard. (Bunny ear selfie, anyone?)

As these processors get ever-more powerful and efficient, they drive even higher performance, which enables more on-device compute power for secure ML at the edge. (See the launch of the third-generation DynamIQ ‘big’ core Arm Cortex-A77 CPU, for example, which can manage compute-intensive tasks without impacting battery life, and the Arm Mali-G77 GPU, which delivers a 60 percent performance improvement for ML.)

But while CPUs and GPUs are ML powerhouses in their own right, the most intensive workloads can outstrip what they can deliver efficiently. For these tasks, the might of a dedicated neural processing unit (NPU), such as the Arm Ethos-N77, comes into its own, delivering the highest throughput and most efficient processing for ML inference at the edge.

NPU Drives New, Exciting User Experiences

So, what makes the Ethos-N77 so special? Well, it’s based on a brand-new architecture, targeting connected devices such as smartphones, smart cameras, augmented and virtual reality (AR/VR) devices and drones, as well as medical and consumer electronics. If you’re interested in how it stacks up numbers-wise, you can’t fail to be impressed by its outstanding performance of up to 4 TOP/s, enabling new use cases that were previously impossible due to limited battery life or thermal constraints. This enables developers to create new user experiences such as 3D face unlock or advanced portrait modes featuring depth control or portrait lighting.

Of course, superb performance is great – but not if it requires you to charge your device every couple of hours or drag a power bank with you wherever you go. To set users free from the tyranny of the charging cable, the ML processor boasts an industry-leading power efficiency of 5 TOPs/W – achieved through state-of-the-art optimizations, such as weight and activation compression, as well as Winograd convolution.

Winograd enables 225% greater performance on key convolution filters compared to other NPUs, in a smaller footprint, driving efficient performance while reducing the number of components required in any given design. This in turn lowers cost and power requirements without compromising on user experience.
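That 225% figure reflects the arithmetic of Winograd’s minimal filtering algorithms: for a 3x3 filter producing a 2x2 output tile, F(2x2, 3x3) needs 16 multiplications where direct convolution needs 36, a 2.25x reduction. The 1D building block F(2,3) below illustrates the idea; this is a sketch of the standard textbook algorithm, not Ethos-N internals.

# Winograd minimal filtering, F(2,3): two outputs of a 3-tap
# convolution from 4 multiplications instead of 6. Nested in 2D,
# F(2x2, 3x3) uses 16 multiplies instead of 36 per tile: 2.25x.

def conv_direct(d, g):
    # Direct 1D valid convolution: 6 multiplies for 2 outputs.
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def conv_winograd_f23(d, g):
    # The three filter-side factors depend only on g, so a compiler
    # can precompute them once per network.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
assert conv_direct(d, g) == conv_winograd_f23(d, g)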

The architecture consists of fixed-function engines, for the efficient execution of convolution layers, and programmable layer engines, for executing non-convolution layers and implementing selected primitives and operators. These natively supported functionalities are closely aligned with common neural frameworks to reduce network deployment costs, allowing for a faster time to market.

The Arm Ethos-N77 premium ML inference processor contains 16 compute engines:
  • Efficiency: Provides a massive uplift over CPUs, GPUs, DSPs and accelerators, with up to 5 TOPs/W
  • Network support: Processes a variety of popular neural networks, including convolutional (CNNs) and recurrent (RNNs), for classification, object detection, image enhancement, speech recognition and natural language understanding
  • Security: Executes with a minimal attack surface, built on the foundation of the Arm TrustZone architecture
  • Scalability: Scales, via multicore, up to eight NPUs and 32 TOP/s in a single cluster, or 64 NPUs in a mesh configuration
  • Neural framework support: Integrates closely with existing frameworks: TensorFlow, TensorFlow Lite, Caffe, Caffe2 and others via ONNX
  • Winograd convolution: Accelerates common filters by 225% compared to other NPUs, allowing more performance in less area
  • Memory compression: Minimizes system memory bandwidth through a variety of compression technologies
  • Heterogeneous ML compute: Optimized for use with Arm Cortex-A CPUs and Arm Mali GPUs
  • Enabled by open-source software: Supported by Arm NN to reduce cost and avoid lock-in (a usage sketch follows this list)
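To make that framework integration concrete, here is a minimal sketch of running a TensorFlow Lite model through the Arm NN delegate in Python. The model file, delegate library path and backend list ("EthosNAcc" for an Ethos-N NPU, "CpuAcc" for the CPU) are assumptions to adapt for your platform: operators the delegate supports run on the listed backends in priority order, and anything else falls back to the stock TFLite CPU kernels.

# Assumed paths and backend names; adjust for your platform.
import numpy as np
import tflite_runtime.interpreter as tflite

# Ops supported by the delegate run on the listed backends in
# priority order (NPU first, then CPU); unsupported ops fall back
# to the default TFLite CPU kernels automatically.
armnn_delegate = tflite.load_delegate(
    library="libarmnnDelegate.so",
    options={"backends": "EthosNAcc,CpuAcc", "logging-severity": "info"})

interpreter = tflite.Interpreter(
    model_path="mobilenet_v2.tflite",            # hypothetical model file
    experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"],
                       np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])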

Futureproof and Versatile

To make life easy for developers, the Ethos-N77 has an integrated network control unit and a DMA engine, which manage the overall execution and traversal of the network and move data in and out of main memory in the background.
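The value of that background DMA is classic double buffering: while the compute engines work on one tile, the next is already in flight, so memory latency hides behind useful work. A conceptual Python sketch (illustration only, not Ethos-N firmware; the tile functions are made up):

from concurrent.futures import ThreadPoolExecutor

def dma_fetch(tile_id):              # stand-in for a background DMA transfer
    return f"tile-{tile_id}"

def compute(tile):                   # stand-in for the compute engines
    return f"processed {tile}"

with ThreadPoolExecutor(max_workers=1) as dma:
    pending = dma.submit(dma_fetch, 0)        # prefetch the first tile
    for i in range(1, 4):
        tile = pending.result()               # wait only if the DMA is behind
        pending = dma.submit(dma_fetch, i)    # start the next transfer...
        print(compute(tile))                  # ...while computing this one
    print(compute(pending.result()))          # drain the last tile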

Onboard memory allows central storage for weights and feature maps, reducing the traffic to external memory and so increasing battery life – another nod to the superlative user experience that consumers have come to expect as standard.

Crucially, the Ethos-N77 is flexible enough to support use cases with higher requirements, running more – and larger – concurrent features: up to eight cores can be configured in a single cluster, achieving 32 TOP/s of performance (eight cores at 4 TOP/s each), or up to 64 NPUs in a mesh configuration.

Ultimately, the Ethos-N77 boosts performance, drives efficiency, reduces network deployment costs and – through tight coupling of fixed-function and programmable engines – futureproofs the design, allowing firmware to be updated as new features are developed.

Through this combination of power, efficiency and flexibility, the Ethos-N77 is defining the future of ML inference at the edge, empowering developers to meet the requirements of tomorrow’s use cases whilst creating today’s optimal user experience.

Learn more about Ethos-N processors
