It would be amazing to have a personal assistant in my hand that is actually smart: one that truly understands my words and responds intelligently to handle day-to-day tasks. Recent advances in Machine Learning (ML) make me optimistic that such a day is not far away. ML has quickly moved from identifying cat pictures to solving real-world problems well beyond the mobile market, in areas such as healthcare, retail, automotive and servers.
Now, the major challenge is to move this power to the edge and solve the privacy, security, bandwidth and latency issues that exist today. The Arm ML processor is a huge step in this direction.
The ML processor is a brand-new design for the mobile and adjacent markets – such as smart cameras, AR/VR, drones, medical and consumer electronics – offering 4.6 TOPs of performance at an efficiency of 3 TOPs/W. Additional compute and memory optimizations deliver further performance gains across different networks.
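For context, those two headline numbers pin down a rough power envelope. This is a simple back-of-the-envelope calculation from the figures above, not a number from the announcement:

```latex
P \approx \frac{\text{throughput}}{\text{efficiency}}
  = \frac{4.6\ \text{TOPs}}{3\ \text{TOPs/W}}
  \approx 1.5\ \text{W}
```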
The architecture consists of fixed-function engines, which execute convolution layers, and programmable layer engines, which execute non-convolution layers and implement selected primitives and operators. A network control unit manages the overall execution and traversal of the network, while a DMA engine moves data in and out of main memory. On-board memory provides central storage for weights and feature maps, reducing traffic to external memory and, therefore, power.
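To make that division of labour concrete, here is a minimal, purely illustrative Python model of the dispatch flow described above. Every name in it (`Layer`, `FixedFunctionEngine`, `NetworkControlUnit` and so on) is an assumption made for this sketch; the real processor exposes no such API.

```python
# Illustrative software model of the ML processor's layer dispatch.
# All names here are hypothetical: they model the architecture as
# described in the text, not a real driver interface.
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str        # e.g. "conv2d", "pooling", "activation"
    weights_kb: int  # weight footprint, assumed held in on-board memory


class FixedFunctionEngine:
    """Executes convolution layers."""
    def run(self, layer: Layer) -> None:
        print(f"[fixed-function] {layer.name}")


class ProgrammableLayerEngine:
    """Executes non-convolution layers and selected primitives/operators."""
    def run(self, layer: Layer) -> None:
        print(f"[programmable  ] {layer.name}")


class NetworkControlUnit:
    """Traverses the network and dispatches each layer to an engine.

    Weights and feature maps are assumed to sit in on-board memory,
    with a DMA engine (not modelled here) staging data from DRAM.
    """
    def __init__(self) -> None:
        self.ffe = FixedFunctionEngine()
        self.ple = ProgrammableLayerEngine()

    def execute(self, network: list[Layer]) -> None:
        for layer in network:
            engine = self.ffe if layer.kind == "conv2d" else self.ple
            engine.run(layer)


if __name__ == "__main__":
    net = [
        Layer("conv1", "conv2d", 64),
        Layer("relu1", "activation", 0),
        Layer("pool1", "pooling", 0),
        Layer("conv2", "conv2d", 128),
    ]
    NetworkControlUnit().execute(net)
```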
Thanks to this combination of fixed-function and programmable engines, the ML processor delivers raw performance and high efficiency, together with the flexibility to execute different neural networks effectively and to adapt to future workloads.
To tackle the challenges of multiple markets, with a wide range of performance requirements – from a few GOPs for IoT to tens of TOPs for servers – the ML processor is based on a new, scalable architecture.
The architecture can be scaled down to approximately 2 GOPs of performance for IoT or embedded-level applications, or scaled up to 150 TOPs of performance for ADAS, 5G, or server-type applications. These configurations can achieve many times the efficiency of existing solutions.
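To put those endpoints in perspective (again a simple calculation from the stated numbers, not a figure from the announcement), the same architecture spans nearly five orders of magnitude of throughput:

```latex
\frac{150\ \text{TOPs}}{2\ \text{GOPs}}
  = \frac{150{,}000\ \text{GOPs}}{2\ \text{GOPs}}
  = 75{,}000\times
```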
The architecture is compatible with existing Arm CPUs, GPUs and other IP, forming a complete, heterogeneous system, and will also be accessible through popular ML frameworks such as TensorFlow, TensorFlow Lite, Caffe and Caffe2.
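The post does not show what that framework access looks like, but as a hedged sketch, a developer might export a model with the standard TensorFlow Lite converter (a real, public TensorFlow 2.x API); how the resulting model is then mapped onto the ML processor is the job of the on-device driver stack and is assumed here, not shown.

```python
# Sketch: exporting a toy Keras model to the TensorFlow Lite format.
# The converter calls below are standard TensorFlow 2.x APIs; the
# mapping of the resulting .tflite model onto the ML processor is
# handled by the device's driver stack and is assumed, not shown.
import tensorflow as tf

# A toy convolutional classifier standing in for a real network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Convert the in-memory Keras model to a TFLite flat buffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Write the model out for deployment to the device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```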
As more and more workloads move to ML, compute requirements will take on a wide variety of forms. Many ML use cases already run on Arm, with our enhanced CPUs and GPUs providing a range of performance and efficiency levels. With the introduction of the Arm Machine Learning platform, we aim to extend that choice, providing a heterogeneous environment with the flexibility required to meet every use case, enabling intelligent systems at the edge… and perhaps even the personal assistant I dream of.