Machine Learning Silicon Isn’t One Size Fits All

These days, just about everyone in the technology industry is talking about Artificial Intelligence (AI) and Machine Learning (ML). There’s a huge amount of excitement and a rush to be the first to get it right. You might have noticed that almost everyone in this conversation treats big, powerful neural network accelerators as an essential part of bringing ML to life on your device – and whilst it’s true that they have a significant role to play, they’re just one part of the story.

Early ML was performed in the cloud with very large data sets, which made significant processing power essential, but today – particularly in the mobile and smart device sectors – the focus is shifting to what can be achieved at the edge.

There are a number of reasons for this shift, not least latency, reliability and responsiveness – factors that are of considerable importance to the consumer. Edge compute can provide the always-on, always-available usability that we’ve come to expect from our devices, and removing the need to go back and forth to the cloud significantly reduces both latency and bandwidth use. Security – a high-profile topic in the industry at the moment – is another excellent justification for performing ML in the palm of your hand, rather than sending your data across the ether, with all the increased potential for security breaches that implies.

Achieving ML at the Edge

So, if ML at the edge is your goal, how can you make it happen? A System on Chip (SoC) contains multiple processors, each suited to different kinds of task. People often ask which is best for running ML, but the simple answer is… it depends on what you’re trying to achieve. There’s a spectrum of compute, with varying degrees of power and area, and different combinations of IP can achieve the same results, so the processor you choose comes down to the trade-offs you’re prepared to make.

The current trend toward ever smaller workloads, for example, makes super-area-efficient Cortex-M processors ideal for simple tasks like voice activation. Speech processing requires more processing power, so you might choose a slightly larger CPU to handle it. Image processing requires more still, and the wide execution engines of the GPU might be the most appropriate choice.

However, the launch of Project Trillium, Arm’s ML platform, brings with it a further, exciting proposition that enables a new era of ultra-efficient inference at the edge. The platform – comprising the Arm ML processor and the second-generation Arm Object Detection (OD) processor, along with Arm NN open-source software – provides a new class of highly scalable processors that have been specifically designed for machine learning and neural network workloads.

Arm ML system story

The first-generation ML processor is optimized for the mobile and smart camera markets. Designed from the ground up, it offers the highest performance per mm² available today, typically delivering over 4.6 trillion operations per second (TOPs), with additional optimization providing a further uplift of 2x to 4x in real-world use cases. It’s also extremely energy efficient, providing 3 TOPs per watt – a factor that’s hugely important for mobile and its thermal- and cost-constrained environments.
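To put those two headline figures side by side, here is a back-of-the-envelope sketch. It assumes, which the text above does not state, that the throughput and efficiency figures apply at the same operating point; the function name is purely illustrative.

```python
# Back-of-the-envelope power estimate from the headline figures quoted above.
# Assumption (not stated in the article): both figures hold at the same
# operating point, so power = throughput / efficiency.

def implied_power_watts(throughput_tops: float, efficiency_tops_per_watt: float) -> float:
    """Return the power draw implied by a throughput and an efficiency figure."""
    if efficiency_tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_tops / efficiency_tops_per_watt

THROUGHPUT_TOPS = 4.6            # quoted throughput
EFFICIENCY_TOPS_PER_WATT = 3.0   # quoted efficiency

power = implied_power_watts(THROUGHPUT_TOPS, EFFICIENCY_TOPS_PER_WATT)
print(f"Implied power: {power:.2f} W")
```

Under that assumption the implied power draw is roughly a watt and a half, which is the kind of budget a thermally constrained mobile device can plausibly accommodate.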

If your area of interest is object detection, the second-generation OD processor gives a 10 per cent improvement over its predecessor, delivering real-time detection on full HD video at 60 frames per second. It can identify objects as small as 50x60 pixels and can detect a virtually unlimited number of objects per frame. Each frame can be analyzed to detect objects or people, including gestures, poses and the direction they’re facing.

When used as a pre-processor to detect regions of interest, the OD processor can be combined with Arm Cortex CPUs, Arm Mali GPUs or the Arm Machine Learning processor for additional local processing, significantly reducing the overall compute requirement.
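A rough sketch of why this matters: if a downstream processor only has to look at the detected regions, it handles far fewer pixels than the full frame. The example assumes that downstream cost scales with pixel count and that regions do not overlap; the boxes are hypothetical and not taken from any Arm material.

```python
# Rough illustration of how region-of-interest (ROI) pre-processing shrinks
# the amount of image data passed to a downstream processor.
# Assumptions (not from the article): downstream cost scales with pixel count,
# and the hypothetical example boxes do not overlap.

FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080  # full HD

def roi_pixel_fraction(boxes, frame_width=FRAME_WIDTH, frame_height=FRAME_HEIGHT):
    """Return the fraction of frame pixels covered by (width, height) boxes."""
    roi_pixels = sum(w * h for w, h in boxes)
    return roi_pixels / (frame_width * frame_height)

# Three hypothetical detections; the first is the minimum object size quoted above.
boxes = [(50, 60), (200, 400), (320, 240)]
fraction = roi_pixel_fraction(boxes)
print(f"Downstream processor sees {fraction:.1%} of the frame")
```

With these made-up boxes, the downstream processor would touch well under a tenth of each frame; real savings depend entirely on the scene and the model being run.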

Flexible, scalable, futureproof

To tackle the challenges of multiple markets with a range of performance requirements, these processors are based on a new, highly scalable architecture. Future derivatives of this architecture will meet an enormous range of performance requirements, scaling from as low as 2 giga operations per second (GOPs) for IoT and always-on devices to over 150 TOPs for server-type applications.

In fact, the Project Trillium architecture is the only complete, heterogeneous compute platform for ML. And the beauty of it is that it’s compatible with existing Arm IP, so you can now select a comprehensive Arm ML solution that’s tailored to your requirements, from Arm Cortex-M processors for smart, connected embedded applications to Arm Mali-G72 for demanding on-device use cases, or the ML and OD processors themselves. This flexibility to address all use cases is unique to Arm.

[Figure: Project Trillium Arm ML diagram]

AI, powered by ML, is well on its way to becoming the biggest disruptor the tech industry – and, indeed, the world – has seen for decades, and it has been influencing the way we design all of our products for some time. Cortex-A processors have been gaining support for ML workloads over the last few iterations, notably with last year’s launch of the DynamIQ flexible architecture. Mali, too, is seeing great improvements in ML capability across the tiers, from mainstream to premium GPUs, with the ML-optimized Mali-G72 GPU winning Linley’s award for Processor of the Year 2017.

So, whether your focus is end usability, silicon cost, or integration effort, there is an Arm processor, or a combination of them, for any ML workload.

Learn more about Project Trillium