The Arm ML processor, designed to deliver the highest throughput and most efficient processing for on-device inference, is based on a brand new architecture. Arm's Dylan Zika explains how the development team set about defining requirements and building an ML powerhouse from the ground up.
You’ve got a great little edge device, and you’re keen to add machine learning capabilities to assist local decision-making. So, what do you do next?
Possibly the simplest course of action is to repurpose a CPU, GPU or DSP. A continual drive to improve performance and efficiency has seen the CPU evolve into a kind of mission control for ML, either single-handedly managing entire ML workloads or distributing selected tasks to specialized ML processors. GPUs offer significant performance but less flexibility, while DSPs are often cited as having an immature programming environment¹.
But where you need a high level of responsiveness or power efficiency, these processors may struggle to meet requirements, and a dedicated neural processing unit (NPU) – such as the Arm ML processor – may be the most appropriate IP to integrate into your heterogeneous solution.
Before we began to spec out the ML processor, we did A LOT of research, one element of which was a survey among chip and AI product designers in the global Arm ecosystem. Respondents were drawn from a range of sectors using AI-enabled technologies, including IoT (54 percent), industrial (27 percent), automotive (25 percent) and mobile computing (16 percent).
In one question, respondents were asked, “Thinking about future products or design projects, where do you think AI/ML functionality will be best computed for your device or app?” As the chart below shows, the majority of responses were split across CPU, GPU and a dedicated ML processor, with a slight overall preference for the latter.
[Chart: survey responses to “Thinking about future products or design projects, where do you think AI/ML functionality will be best computed for your device or app?”]
This corroborated our other research to both validate the holistic approach we’d taken with Project Trillium, Arm’s ML platform – examining how flexible solutions can address use cases on a variety of IP – and underline the need for a dedicated ML processor to address the most challenging applications.
Our next step was to deepen our discussion with the ecosystem. We took time to understand exactly what developers were looking for from an NPU, and we found that the majority of use cases fell into three broad groups: vision, voice and vibration.
In many cases, the aim was to drive an exhilarating user experience: How can we help users capture breathtaking memories with real-time photo Bokeh, or provide more accurate and responsive face unlock? How can we untether personal assistants from the cloud and deliver a truly personalized experience?
Other goals were more industrial in nature, ranging from automatic detection of poor operating behavior using anomalies in sensor data to the development of IP suitable for multiple market segments.
Working closely with our partners, we distilled these high-level use cases into requirements, supporting the neural frameworks of choice through open-source software, and identifying key architectures and operators for the processor’s feature set. We developed semi-fixed function hardware to accelerate these operators and included programmable hardware to “futureproof” the design, allowing the firmware to be updated as new features are developed.
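To make the idea of semi-fixed-function acceleration plus CPU fallback more concrete, here is a minimal, purely illustrative sketch of an operator-partitioning pass. The operator names and the “supported” set are hypothetical and do not reflect the ML processor’s actual operator coverage or Arm’s driver stack; the point is simply that a compiler or runtime groups supported operators onto the NPU and lets everything else fall back to the CPU.

```python
# Illustrative only: a toy operator-partitioning pass.
# The operator names and NPU_SUPPORTED set below are hypothetical.

NPU_SUPPORTED = {"conv2d", "depthwise_conv2d", "relu", "pooling", "fully_connected"}

def partition(ops):
    """Split a linear operator list into runs that stay on the NPU and
    runs that fall back to the CPU, preserving execution order."""
    segments = []  # list of (target, [ops]) pairs
    for op in ops:
        target = "npu" if op in NPU_SUPPORTED else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments

# Example: a small network containing one operator the accelerator doesn't (yet) handle.
model = ["conv2d", "relu", "custom_activation", "conv2d", "pooling", "fully_connected"]
for target, run in partition(model):
    print(f"{target}: {run}")
```

In this sketch, a firmware update that teaches the NPU a new operator simply means adding it to the supported set, which is the “futureproofing” role the programmable hardware plays.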
Naturally, security is also an essential part of system design. We designed the ML processor to allow several implementation choices to address multiple risk profiles. We also used industry-proven Arm microcontroller technology with standard privilege levels and firmware that clears the SRAMs, making it easier to audit. No other solutions have these security features built in from the start.
Our deep-dive analysis led us to the sweet spot of performance vs power vs area: a processor that achieves a baseline 4 TOP/s in a single instantiation. For more demanding use cases, running a number of features concurrently, performance can be scaled up through multi-processing. Up to eight cores can be configured in a single cluster, achieving 32 TOP/s of performance, or a maximum of 64 cores in a mesh configuration, to reach over 250 TOP/s.
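The scaling figures quoted above follow from simple multiplication, assuming (as an idealisation) perfectly linear scaling per core; real-world throughput will depend on the workload and how well it parallelises. A quick back-of-the-envelope check:

```python
# Idealised linear scaling of the quoted baseline figure.
BASELINE_TOPS = 4  # TOP/s for a single ML processor instantiation

for cores in (1, 8, 64):
    print(f"{cores:>2} core(s): {cores * BASELINE_TOPS} TOP/s")
# 1 core  ->   4 TOP/s (single instantiation)
# 8 cores ->  32 TOP/s (one cluster)
# 64 cores -> 256 TOP/s (mesh configuration, i.e. over 250 TOP/s)
```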
Of course, high performance is great… but not if it’s draining your device’s battery every time you venture away from the wireless charging mat. For performance to truly be a benefit, it needs to be coupled with efficiency. That’s why the ML processor provides an industry-leading power efficiency of 5 TOPs/W, achieved through state-of-the-art optimizations such as neural compilation, efficient convolutions and bandwidth-reduction mechanisms. This helps to lower cost and power requirements without compromising on user experience.
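As a rough envelope implied by those headline figures (not a measured number, and heavily dependent on workload, process node and frequency): at 5 TOPs/W, a single 4 TOP/s instantiation would draw on the order of 0.8 W.

```python
# Back-of-the-envelope power estimate from the headline efficiency figure.
# Real power consumption depends on workload, process node and frequency.
EFFICIENCY_TOPS_PER_W = 5

def power_watts(throughput_tops):
    return throughput_tops / EFFICIENCY_TOPS_PER_W

for tops in (4, 2, 32):
    print(f"{tops} TOP/s -> ~{power_watts(tops):.1f} W")
# 4 TOP/s -> ~0.8 W, 2 TOP/s -> ~0.4 W, 32 TOP/s -> ~6.4 W
```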
We’re immensely proud of the ML processor. Its optimized design delivers a massive uplift in efficiency compared to CPUs, GPUs and DSPs, and its scalable architecture delivers the computational determinism required for real-time responses – without compromising on flexibility.
If you’d like to dig deeper into the techniques we’ve used to bring this state-of-the-art processor to life, just click on the link below to download our white paper, Powering the Edge: Driving Optimal Performance with the Arm ML Processor.
[CTAToken URL = "https://pages.arm.com/machine-learning-processor-white-paper?utm_source=arm&utm_medium=blog&utm_campaign=2019_ai-ml-platform_mk09-3_na-&utm_term=ml-processor&utm_content=whitepaper" target="_blank" text="Download Whitepaper" class ="green"]
¹ Machine Learning at Facebook: Understanding Inference at the Edge