Powering the Edge: How Will YOU Do ML?

Dylan Zika
July 30, 2019
4 minute read time.

The Arm ML processor, designed to deliver the highest throughput and most efficient processing for on-device inference, is based on a brand-new architecture. Arm's Dylan Zika explains how the development team set about defining requirements and building an ML powerhouse from the ground up.

You’ve got a great little edge device, and you’re keen to add machine learning capabilities to assist local decision-making. So, what do you do next?

Possibly the simplest course of action is to repurpose a CPU, GPU or DSP. A continual drive to improve performance and efficiency has seen the CPU evolve into a kind of mission control for ML, either single-handedly managing entire ML workloads or distributing selected tasks to specialized ML processors. GPUs offer significant performance but less flexibility, while DSPs are often cited as having an immature programming environment¹.

But where you need a high level of responsiveness or power efficiency, these processors may struggle to meet requirements, and a dedicated neural processing unit (NPU) – such as the Arm ML processor – may be the most appropriate IP to integrate into your heterogeneous solution.
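
To make this pattern concrete, here's a minimal sketch of the delegation idea: the same TensorFlow Lite model runs on the CPU by default, and an accelerator delegate routes supported operations to dedicated hardware. The model path and the delegate library name are illustrative assumptions, not details from this post; check your platform's documentation for the actual delegate to load.

```python
import numpy as np
import tensorflow as tf

# Default path: run inference on the CPU ("repurpose a CPU").
# "model.tflite" is a placeholder path.
interpreter = tf.lite.Interpreter(model_path="model.tflite")

# Accelerated path: hand supported ops to an NPU via an external delegate.
# The library name below is an assumption (Arm NN ships a TFLite delegate,
# but the exact filename varies by build and platform):
# npu = tf.lite.experimental.load_delegate("libarmnnDelegate.so")
# interpreter = tf.lite.Interpreter(model_path="model.tflite",
#                                   experimental_delegates=[npu])

interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
result = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
```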

Chart: How much juice do you need to do ML?

The Future is Heterogeneous

Before we began to spec out the ML processor, we did A LOT of research, one element of which was a survey among chip and AI product designers in the global Arm ecosystem. Respondents were drawn from a range of sectors using AI-enabled technologies, including IoT (54 percent), industrial (27 percent), automotive (25 percent) and mobile computing (16 percent).

In one question, respondents were asked, “Thinking about future products or design projects, where do you think AI/ML functionality will be best computed for your device or app?” As the chart below shows, the majority of responses were split across CPU, GPU and a dedicated ML processor, with a slight overall preference for the latter.

Chart: Where will your AI be computed?

This corroborated our other research, both validating the holistic approach we’d taken with Project Trillium, Arm’s ML platform – examining how flexible solutions can address use cases across a variety of IP – and underlining the need for a dedicated ML processor for the most challenging applications.

Building a Powerhouse from the Ground Up

Our next step was to deepen our discussion with the ecosystem. We took time to understand exactly what developers were looking for from an NPU, and we found that the majority of use cases fell into three broad groups: vision, voice and vibration.

In many cases, the aim was to drive an exhilarating user experience: How can we help users capture breathtaking memories with real-time photo bokeh, or provide more accurate and responsive face unlock? How can we untether personal assistants from the cloud and deliver a truly personalized experience?

Other goals were more industrial in nature, ranging from automatic detection of poor operating behavior using anomalies in sensor data to the development of IP suitable for multiple market segments.

Working closely with our partners, we distilled these high-level use cases into requirements, supporting the neural frameworks of choice through open-source software, and identifying key architectures and operators for the processor’s feature set. We developed semi-fixed function hardware to accelerate these operators and included programmable hardware to “futureproof” the design, allowing the firmware to be updated as new features are developed.
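
As a rough illustration of that operator-centric design, here's a hedged sketch of how a runtime might partition a network: operators on the NPU's accelerated list run there, and everything else falls back to the host CPU. The operator names and the supported set are invented for illustration; they are not the ML processor's actual feature set.

```python
# Hypothetical list of ops the NPU's semi-fixed-function hardware accelerates.
NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "MAX_POOL_2D"}

def partition(graph_ops):
    """Split an ordered op list into contiguous NPU/CPU segments."""
    segments = []
    for op in graph_ops:
        target = "NPU" if op in NPU_SUPPORTED else "CPU"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)      # extend the current segment
        else:
            segments.append((target, [op])) # start a new segment
    return segments

print(partition(["CONV_2D", "MAX_POOL_2D", "SOFTMAX", "FULLY_CONNECTED"]))
# [('NPU', ['CONV_2D', 'MAX_POOL_2D']), ('CPU', ['SOFTMAX']), ('NPU', ['FULLY_CONNECTED'])]
```

Programmable hardware plus updatable firmware means this supported set can grow over the processor's lifetime, which is the "futureproofing" referred to above.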

Naturally, security is also an essential part of system design. We designed the ML processor to allow several implementation choices to address multiple risk profiles. We also used industry-proven Arm microcontroller technology with standard privilege levels and firmware that clears the SRAMs, making it easier to audit. No other solutions have these security features built in from the start. 

Targeting Performance AND Efficiency

Our deep-dive analysis led us to the sweet spot of performance vs power vs area: a processor that achieves a baseline 4 TOP/s in a single instantiation. For more demanding use cases running a number of features concurrently, performance can be scaled up through multi-processing: up to eight cores can be configured in a single cluster, achieving 32 TOP/s, or up to 64 cores in a mesh configuration, reaching over 250 TOP/s.

Of course, high performance is great… but not if it’s draining your device’s battery every time you venture away from the wireless charging mat. For performance to truly be a benefit, it needs to be coupled with efficiency. That’s why the ML processor provides an industry-leading power efficiency of 5 TOPs/W, achieved through state-of-the-art optimizations such as neural compilation, efficient convolutions and bandwidth-reduction mechanisms. This helps to lower cost and power requirements without compromising on user experience.
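
As a quick sanity check, the arithmetic behind those numbers works out as follows, taking the quoted 4 TOP/s per-core baseline and 5 TOPs/W efficiency at face value:

```python
# Back-of-the-envelope check of the figures quoted above.
base_tops = 4.0                   # single instantiation, TOP/s
cluster_tops = 8 * base_tops      # 8-core cluster -> 32 TOP/s
mesh_tops = 64 * base_tops        # 64-core mesh   -> 256 TOP/s ("over 250")
core_power_w = base_tops / 5.0    # at 5 TOPs/W, one core draws roughly 0.8 W
print(cluster_tops, mesh_tops, core_power_w)  # 32.0 256.0 0.8
```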

Taking a Deeper Dive

We’re immensely proud of the ML processor. Its optimized design delivers a massive uplift in efficiency compared to CPUs, GPUs and DSPs, and its scalable architecture delivers the computational determinism required for real-time responses – without compromising on flexibility.

If you’d like to dig deeper into the techniques we’ve used to bring this state-of-the-art processor to life, just click on the link below to download our white paper, Powering the Edge: Driving Optimal Performance with the Arm ML Processor.

Download Whitepaper

¹ Machine Learning at Facebook: Understanding Inference at the Edge
