The Arm ML processor, designed to deliver the highest throughput and most efficient processing for on-device inference, is based on a brand new architecture. Arm's Dylan Zika explains how the development team set about defining requirements and building an ML powerhouse from the ground up.
You’ve got a great little edge device, and you’re keen to add machine learning capabilities to assist local decision-making. So, what do you do next?
Possibly the simplest course of action is to repurpose a CPU, GPU or DSP. A continual drive to improve performance and efficiency has seen the CPU evolve to become a kind of mission control for ML, either single-handedly managing entire ML workloads or distributing selected tasks to specialized ML processors. GPUs offer significant performance but less flexibility, while DSPs are often cited as having an immature programming environment¹.
But where you need a high level of responsiveness or power efficiency, these processors may struggle to meet requirements, and a dedicated neural processing unit (NPU) – such as the Arm ML processor – may be the most appropriate IP to integrate into your heterogeneous solution.
Before we began to spec out the ML processor, we did A LOT of research, one element of which was a survey among chip and AI product designers in the global Arm ecosystem. Respondents were drawn from a range of sectors using AI-enabled technologies, including IoT (54 percent), industrial (27 percent), automotive (25 percent) and mobile computing (16 percent).
In one question, respondents were asked, “Thinking about future products or design projects, where do you think AI/ML functionality will be best computed for your device or app?” As the chart below shows, the majority of responses were split across CPU, GPU and a dedicated ML processor, with a slight overall preference for the latter.
[Chart: survey responses on where AI/ML functionality will be best computed]
This corroborated our other research on two counts. It validated the holistic approach we’d taken with Project Trillium, Arm’s ML platform – examining how flexible solutions can address use cases on a variety of IP – and it underlined the need for a dedicated ML processor to address the most challenging applications.
Our next step was to deepen our discussion with the ecosystem. We took time to understand exactly what developers were looking for from an NPU, and we found that the majority of use cases fell into three broad groups: vision, voice and vibration.
In many cases, the aim was to drive an exhilarating user experience: How can we help users capture breathtaking memories with real-time photo bokeh, or provide more accurate and responsive face unlock? How can we untether personal assistants from the cloud and deliver a truly personalized experience?
Other goals were more industrial in nature, ranging from automatic detection of poor operating behavior using anomalies in sensor data to the development of IP suitable for multiple market segments.
Working closely with our partners, we distilled these high-level use cases into requirements, supporting the neural network frameworks of choice through open-source software, and identifying key architectures and operators for the processor’s feature set. We developed semi-fixed function hardware to accelerate these operators and included programmable hardware to “futureproof” the design, allowing the firmware to be updated as new features are developed.
Naturally, security is also an essential part of system design. We designed the ML processor to allow several implementation choices to address multiple risk profiles. We also used industry-proven Arm microcontroller technology with standard privilege levels and firmware that clears the SRAMs, making it easier to audit. No other solutions have these security features built in from the start.
Our deep-dive analysis led us to the sweet spot of performance vs power vs area: a processor that achieves a baseline 4 TOP/s in a single instantiation. For more demanding use cases, running a number of features concurrently, performance can be scaled up through multi-processing. Up to eight cores can be configured in a single cluster, achieving 32 TOP/s of performance, or a maximum of 64 cores in a mesh configuration, to reach over 250 TOP/s.
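The scaling figures above follow from simple multiplication, which the short Python sketch below reproduces. It assumes ideal, perfectly linear scaling across cores, which is what the quoted peak numbers imply; real workloads may well scale less cleanly. The function name `peak_tops` and the constants are illustrative and not part of any Arm API.

```python
# Figures taken from the text above.
BASELINE_TOPS = 4        # TOP/s for a single ML processor instantiation
MAX_CLUSTER_CORES = 8    # cores in a single cluster
MAX_MESH_CORES = 64      # cores in a mesh configuration


def peak_tops(cores: int, per_core_tops: float = BASELINE_TOPS) -> float:
    """Return the ideal peak throughput in TOP/s.

    Assumption: throughput scales linearly with the number of cores.
    """
    if not 1 <= cores <= MAX_MESH_CORES:
        raise ValueError(f"cores must be between 1 and {MAX_MESH_CORES}")
    return cores * per_core_tops


if __name__ == "__main__":
    for cores in (1, MAX_CLUSTER_CORES, MAX_MESH_CORES):
        print(f"{cores:>2} core(s): {peak_tops(cores)} TOP/s")
```

Under this assumption, 64 cores come to 256 TOP/s, consistent with the "over 250 TOP/s" quoted for the mesh configuration.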
Of course, high performance is great… but not if it’s draining your device’s battery every time you venture away from the wireless charging mat. For performance to truly be a benefit, it needs to be coupled with efficiency. That’s why the ML processor provides an industry-leading power efficiency of 5 TOPs/W, achieved through state-of-the-art optimizations such as neural compilation, efficient convolutions and bandwidth reduction mechanisms. This helps to lower cost and power requirements without compromising on user experience.
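To put the efficiency figure in context, the back-of-the-envelope Python sketch below converts throughput and efficiency into an implied power draw. It rests on two assumptions the text does not state: that "TOPs/W" means tera-operations per second per watt, and that the 5 TOPs/W figure holds at the 4 TOP/s baseline. The function name `implied_power_watts` is illustrative, and the result is an estimate, not a published specification.

```python
def implied_power_watts(throughput_tops: float, efficiency_tops_per_watt: float) -> float:
    """Estimate power draw as throughput divided by efficiency.

    Assumption: the efficiency figure applies at the given throughput.
    """
    if efficiency_tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_tops / efficiency_tops_per_watt


if __name__ == "__main__":
    # Baseline figures quoted in the text: 4 TOP/s at 5 TOPs/W.
    watts = implied_power_watts(4, 5)
    print(f"Implied power at baseline: {watts:.1f} W")  # prints 0.8 W
```

If those assumptions hold, a single instantiation running at its peak would draw a little under one watt.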
We’re immensely proud of the ML processor. Its optimized design delivers a massive uplift in efficiency compared to CPUs, GPUs and DSPs, and its scalable architecture delivers the computational determinism required for real-time responses – without compromising on flexibility.
If you’d like to dig deeper into the techniques we’ve used to bring this state-of-the-art processor to life, just click on the link below to download our white paper, Powering the Edge: Driving Optimal Performance with the Arm ML Processor.
¹ Machine Learning at Facebook: Understanding Inference at the Edge