OK. Quick survey: How many connected devices do you own?
Whether you’re a gadget addict or just an average Josephine, I’m not sticking my neck out too far if I guess that you own more today than you did five years ago. From smartphones and tablets to personal fitness trackers, smart asthma inhalers and smart doorbells, we’re all busily increasing our connectivity year on year – along with our own personal data explosion. According to a recent report, in the last ten years, the global average of connected devices per capita has leapt from less than two to a projected 6.58 by 2020. That’s an awful lot of devices creating an awful lot of data.
Until recently, that data was routinely shipped to the cloud for processing. But as the volume of data and the number of devices grow exponentially, it’s just not practical – not to mention secure or cost-effective – to keep shifting all that data back and forth.
Fortunately, recent advances in machine learning (ML) mean that more processing, and pre-processing, can now be done on-device than ever before. This brings a range of benefits, from increased safety and security, thanks to the reduced risk of data exposure, to cost and power savings. Infrastructure to transmit data to the cloud and back doesn’t come cheap, so the more processing that can be done on-device, the better.
On-device ML starts with the CPU, which acts as an adept ‘traffic controller’, either single-handedly managing entire ML workloads or distributing selected tasks to specialized ML processors.
Arm CPUs – and GPUs – are already powering thousands of ML use cases across the performance curve, not least for mobile, where edge ML is already driving features that consumers have come to expect as standard. (Bunny ear selfie, anyone?)
As these processors get ever-more powerful and efficient, they drive even higher performance, which enables more on-device compute power for secure ML at the edge. (See the launch of the third-generation DynamIQ ‘big’ core Arm Cortex-A77 CPU, for example, which can manage compute-intensive tasks without impacting battery life, and the Arm Mali-G77 GPU, which delivers a 60 percent performance improvement for ML.)
But while CPUs and GPUs are ML powerhouses in their own right, where the most intensive and efficient performance is required, they can struggle to meet requirements. For these tasks, the might of a dedicated neural processing unit (NPU), such as the Arm ML processor, comes into its own, delivering the highest throughput and most efficient processing for ML inference at the edge.
So, what makes the ML processor so special? Well, it’s based on a brand-new architecture, targeting connected devices such as smartphones, smart cameras, augmented and virtual reality (AR/VR) devices and drones, as well as medical and consumer electronics. If you’re interested in how it stacks up numbers-wise, you can’t fail to be impressed by its outstanding performance of up to 4 TOP/s, enabling new use cases that were previously impossible due to limited battery life or thermal constraints. This enables developers to create new user experiences such as 3D face unlock or advanced portrait modes featuring depth control or portrait lighting.
Of course, superb performance is great – but not if it requires you to charge your device every couple of hours or drag a power bank with you wherever you go. To set users free from the tyranny of the charging cable, the ML processor boasts an industry-leading power efficiency of 5 TOPs/W – achieved through state-of-the-art optimizations, such as weight and activation compression, as well as Winograd convolution.
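Putting the two headline figures together gives a back-of-the-envelope sense of what that efficiency means in practice. (A sketch only: it assumes peak throughput and peak efficiency hold simultaneously, which real workloads rarely sustain.)

```python
# Back-of-the-envelope power estimate from the headline figures.
# Assumes peak throughput and peak efficiency coincide - an
# idealization, not a measured datapoint.

PEAK_THROUGHPUT_TOPS = 4.0   # TOP/s (headline figure)
EFFICIENCY_TOPS_PER_W = 5.0  # TOPs/W (headline figure)

power_w = PEAK_THROUGHPUT_TOPS / EFFICIENCY_TOPS_PER_W
print(f"Estimated power at peak: {power_w:.2f} W")  # 0.80 W
```

Under a watt at full tilt is what makes always-on, battery-powered inference plausible in the first place.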
Winograd enables 225 percent greater performance on key convolution filters compared to NPUs that lack it, in a smaller footprint, driving efficient performance while reducing the number of components required in any given design. This in turn lowers cost and power requirements without compromising on user experience.
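To see where Winograd’s savings come from, here’s a minimal sketch of the 1-D F(2,3) Winograd transform – the same idea the processor applies to 2-D convolutions. It produces two outputs of a 3-tap convolution with four multiplications instead of six. This illustrates the general algorithm, not Arm’s specific implementation.

```python
def conv_direct(d, g):
    """Direct 1-D convolution: two outputs of a 3-tap filter (6 multiplies)."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1

def conv_winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with only 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return m1 + m2 + m3, m2 - m3 - m4

d = [1.0, 2.0, 3.0, 4.0]   # input tile
g = [0.5, -1.0, 0.25]      # filter weights
print(conv_direct(d, g))        # (-0.75, -1.0)
print(conv_winograd_f23(d, g))  # (-0.75, -1.0) - same result, fewer multiplies
```

The filter-side transforms (the halved terms) can be precomputed once per filter, so in steady state the multiplication saving translates almost directly into throughput on convolution-heavy networks.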
The architecture consists of fixed-function engines, for the efficient execution of convolution layers, and programmable layer engines, for executing non-convolution layers and implementing selected primitives and operators. These natively supported functions are closely aligned with common neural network frameworks, reducing network deployment costs and allowing a faster time to market.
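Conceptually, that split means each layer of a network gets routed to whichever engine suits it. The sketch below is purely hypothetical – the layer names and dispatch rule are invented for illustration and are not taken from Arm’s driver stack.

```python
# Hypothetical sketch of routing network layers to the two engine types.
# Layer names and the dispatch rule are illustrative only.
FIXED_FUNCTION_LAYERS = {"conv2d", "depthwise_conv2d"}

def dispatch(network):
    """Assign each layer to the fixed-function or programmable engine."""
    plan = []
    for layer in network:
        engine = "fixed-function" if layer in FIXED_FUNCTION_LAYERS else "programmable"
        plan.append((layer, engine))
    return plan

net = ["conv2d", "relu", "depthwise_conv2d", "pool", "softmax"]
for layer, engine in dispatch(net):
    print(f"{layer:16s} -> {engine} engine")
```

The appeal of this division is that convolutions – the bulk of the compute in most vision networks – run on hardwired datapaths, while everything else stays flexible enough to absorb new operators via firmware.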
To make life easy for developers, the ML processor has an integrated network control unit and DMA engine, which manage the overall execution and traversal of the network and move data in and out of main memory in the background.
Onboard memory allows central storage for weights and feature maps, reducing the traffic to external memory and so increasing battery life – another nod to the superlative user experience that consumers have come to expect as standard.
Crucially, the ML processor is flexible enough to support use cases with higher requirements, handling a greater number and size of concurrent feature maps: up to eight cores can be configured in a single cluster, achieving 32 TOP/s of performance, or up to 64 NPUs in a mesh configuration.
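The cluster arithmetic is straightforward – a sketch assuming ideal linear scaling, which interconnect and memory bandwidth will erode somewhat in any real design.

```python
PER_CORE_TOPS = 4.0  # TOP/s per NPU core (headline figure)

def cluster_throughput(cores):
    """Ideal aggregate throughput, assuming perfect linear scaling."""
    return cores * PER_CORE_TOPS

print(cluster_throughput(8))   # 32.0 TOP/s - a full single cluster
print(cluster_throughput(64))  # 256.0 TOP/s - a 64-NPU mesh, in the ideal case
```

The 64-NPU figure is an extrapolation from the per-core number, not a quoted specification; the source states only the 8-core, 32 TOP/s cluster configuration.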
Ultimately, the ML processor boosts performance, drives efficiency, reduces network deployment costs and – through tight coupling of fixed-function and programmable engines – futureproofs the design, allowing firmware to be updated as new features are developed.
Through this combination of power, efficiency and flexibility, the ML processor is defining the future of ML inference at the edge, empowering developers to meet the requirements of tomorrow’s use cases whilst creating today’s optimal user experience.
Download the ML processor datasheet