Ubiquitous on-device artificial intelligence (AI) is the next step in transforming the myriad of mobile computing devices in our everyday lives into a new class of truly “smart” devices, capable of constantly observing, learning, and adapting to their environment. These intelligent devices can make our lives safer and the world around us more energy-efficient.
The second On-Device Intelligence Workshop, held April 9 in conjunction with MLSys 2021, brought researchers and practitioners together to discuss key issues, share new research results, and present practical tutorial material. Made possible by our colleagues at the Arm ML Research Lab, Harvard, Google, and Facebook, the workshop was organized around four primary questions.
We are taking a look back at the presentations given at the workshop.
Today’s AI is too big: deep neural networks (DNNs) demand extraordinary levels of compute and power for training and inference, which limits the practical deployment of AI in edge devices. Song aims to improve the efficiency of deep learning. He presents MCUNet, a framework that brings deep learning to Internet of Things (IoT) devices by jointly designing an efficient neural architecture (TinyNAS) and a light-weight inference engine (TinyEngine). He also discusses TinyTL, which enables on-device transfer learning while reducing the memory footprint by 7-13x. Finally, he describes Differentiable Augmentation, which enables data-efficient GAN training: photo-realistic images can be generated from only 100 training images, a task that previously required tens of thousands. Song hopes such TinyML techniques can make AI greener, faster, and more sustainable.
Download slides
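To make the differentiable-augmentation idea concrete, here is a minimal PyTorch sketch, not the official DiffAugment implementation: the `diff_augment` function and its transforms are illustrative stand-ins, and the key point is that the same differentiable augmentation is applied to both real and generated images in every discriminator and generator update.

```python
import torch

def diff_augment(x, brightness=0.3, shift_frac=0.125):
    """Illustrative differentiable augmentations (brightness jitter + random shift).
    All ops are differentiable w.r.t. x, so gradients can flow back to the generator."""
    # Random per-image brightness jitter
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * brightness
    # Random spatial shift via rolling (kept simple; the real method uses padded translation)
    max_shift = int(x.size(3) * shift_frac)
    if max_shift > 0:
        sh = torch.randint(-max_shift, max_shift + 1, (2,))
        x = torch.roll(x, shifts=(int(sh[0]), int(sh[1])), dims=(2, 3))
    return x

def d_step(D, G, real, z, opt_d, loss_fn):
    # Key idea: the discriminator only ever sees augmented images,
    # and the SAME augmentation function is applied to real and fake batches.
    fake = G(z).detach()
    logits_real = D(diff_augment(real))
    logits_fake = D(diff_augment(fake))
    loss = loss_fn(logits_real, torch.ones_like(logits_real)) + \
           loss_fn(logits_fake, torch.zeros_like(logits_fake))
    opt_d.zero_grad(); loss.backward(); opt_d.step()
    return loss

def g_step(D, G, z, opt_g, loss_fn):
    # The generator is trained through the augmentation, which is why it must be differentiable.
    logits = D(diff_augment(G(z)))
    loss = loss_fn(logits, torch.ones_like(logits))
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss

# Quick check that the augmentation runs on an image batch.
x_aug = diff_augment(torch.randn(4, 3, 32, 32))
```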
Venkatesh proposes a novel method for federated learning, customized to the objective of a given edge device. In the proposed method, a server trains a global meta-model by collaborating with devices without actually sharing data. The trained global meta-model is then customized locally by each device to meet its specific objective. Unlike the conventional federated learning setting, training customized models for each device is hindered by the inherent data biases of the various devices and by the requirements imposed by the federated architecture. Venkatesh presents an algorithm that locally de-biases model updates while leveraging distributed data, so that each device can be effectively customized towards its objective. The method is fully agnostic to device heterogeneity and imbalanced data, scales to a massive number of devices, and allows for arbitrary partial participation. It has built-in convergence guarantees, and experiments on benchmark datasets demonstrate that it outperforms other state-of-the-art methods.
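As a rough illustration of the setting Venkatesh describes, the sketch below shows the generic pattern of federated averaging with local customization: the server aggregates updates from a subset of devices, and each device fine-tunes the resulting meta-model on its private data. This is plain FedAvg plus local adaptation, not Venkatesh's de-biasing algorithm, and all names are illustrative.

```python
import copy
import torch

def local_update(global_model, dataloader, loss_fn, epochs=1, lr=0.01):
    """One device: start from the global meta-model and train on private data only."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in dataloader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict()

def fed_avg(global_model, device_loaders, loss_fn, rounds=10, participation=0.3):
    """Server: aggregate updates from a random subset of devices (partial participation)."""
    for _ in range(rounds):
        k = max(1, int(participation * len(device_loaders)))
        chosen = torch.randperm(len(device_loaders))[:k].tolist()
        states = [local_update(global_model, device_loaders[i], loss_fn) for i in chosen]
        avg = {name: torch.stack([s[name].float() for s in states]).mean(0)
               for name in states[0]}
        global_model.load_state_dict(avg)
    # Each device would then further fine-tune (personalize) the returned meta-model locally.
    return global_model
```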
Apache TVM is a complete deep learning compilation framework that automatically generates fast binary code for any model, on any device, by exploring a large search space of potential optimizations. TVM itself uses ML to guide its code synthesis process, saving months of engineering time. The generated code can be many times faster than hand-optimized libraries, in some cases exceeding a 30x speedup over hand-tuned code.
In his talk, Thierry gives an overview of Apache TVM and how it is used at OctoML to enable model deployment on mobile and IoT devices. He highlights recent efforts on microTVM, TVM’s solution for deploying ML on microcontrollers.
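For readers new to TVM, a minimal compile-and-run flow with TVM's Python API looks roughly like the following; exact APIs vary across TVM releases, and the ONNX file path, input name, and shapes here are placeholders.

```python
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load a trained model (placeholder path) and import it into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}  # input name/shape depend on the model
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for a target; "llvm" targets the host CPU, while other targets cover GPUs,
# Arm CPUs (e.g. "llvm -mtriple=aarch64-linux-gnu"), and microcontrollers via microTVM.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```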
Deploying ML models on edge devices poses a big challenge as capabilities and numeric behavior can differ on each device. Eric discusses the development of the Tensor Operator Set Architecture (TOSA), a set of base operators that serve as the building blocks for complex operations. TOSA operators define the functional and numeric behavior, ensuring that deployed networks behave consistently across a variety of devices.
Deep learning-based models have revolutionized many Natural Language Processing (NLP) tasks (for example, Translation, Conversational AI, Language Modeling). There is a growing need to perform these tasks on low-resource electronic devices (such as mobile phones, tablets, wearables) for privacy and latency reasons. However, the large computational and memory demands of DNNs make it difficult to deploy them on-device as-is; they usually require significant optimizations, and sometimes major model architecture changes, to fit under tight memory and compute budgets.
In this talk, Ahmed and Kshitiz share the work that Facebook is doing to bring these NLP models to user devices. They talk about efficient building blocks and model architectures that find the right balance between model quality and compute/memory requirements on multiple NLP tasks. Finally, they outline the biggest challenges and open problems in shipping on-device NLP models at Facebook scale.
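As one example of the kind of optimization such deployments rely on (and not Facebook's specific approach), post-training dynamic quantization in PyTorch stores the weights of LSTM and Linear layers in int8 and dequantizes them on the fly; the toy classifier below is a placeholder for a real on-device NLP model.

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Toy stand-in for a larger on-device NLP model."""
    def __init__(self, vocab=30000, dim=128, classes=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, 256, batch_first=True)
        self.fc = nn.Linear(256, classes)

    def forward(self, tokens):
        x = self.emb(tokens)
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])

model = TinyTextClassifier().eval()

# Post-training dynamic quantization: LSTM/Linear weights are stored in int8 and
# dequantized on the fly, shrinking the model and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

tokens = torch.randint(0, 30000, (1, 16))
print(quantized(tokens).shape)  # same interface, smaller and faster model
```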
Compression of neural network models has become an important systems problem for practical ML workflows. While various compression mechanisms and algorithms have been proposed to address the issue, many solutions rely on highly specialized procedures and require substantial domain knowledge to use effectively. To make compression accessible to a large body of users, Yerlan proposes an extensible open-source library based on the ideas of the learning-compression (LC) algorithm: the LC toolkit. The software is written in Python using PyTorch, and currently supports multiple forms of pruning, quantization, and low-rank compression. These can be applied to parts of the model individually or in combination to reduce the model’s size, computational requirements, or on-device inference time. The toolkit’s versatility comes from the separation of model learning from model compression in the LC algorithm: once the learning (L) step is given, any compression (C) step can be used with the model.
"A Flexible, Extensible Software Framework for Model Compression Based on the LC Algorithm", Y. Idelbayev, M. A. Carreira-Perpinan, 2nd On-Device Intelligence Workshop, 2021.
Quantization is a popular technique for accelerating and compressing neural networks by using low-bit arithmetic to represent weights and activations. It remains a hot area for research, with continued work on closing the accuracy gap between full- and low-precision models. Researchers in this area tend to rely on custom implementations rather than the approaches built into popular ML libraries, which are not sufficiently flexible to enable research. Shyam and his team are open-sourcing TorchQuant, an MIT-licensed library that builds on PyTorch by providing researchers with modular components and implementations that accelerate their research and provide the community with consistent baselines. Using the library, they show how to quickly evaluate a research hypothesis: the “range-precision” trade-off for quantization-aware training. The library can be found here.
"TorchQuant: A Hackable Quantization Library for Researchers, By Researchers", S. A. Tailor, M. Alizadeh, N. D. Lane, 2nd On-Device Intelligence Workshop, 2021.
In autonomous driving, 3D object detection is essential, providing basic knowledge about the environment. However, as deep learning-based 3D detection methods are usually computation-intensive, it is challenging to support real-time 3D detection on edge-computing devices with limited computation and memory resources. To address this, Pu proposes a compiler-aware pruning search framework that achieves real-time inference of 3D object detection on resource-limited mobile devices. Specifically, a generator samples pruning proposals from the search space, and an evaluator assesses the performance of each sampled proposal using Bayesian optimization. He demonstrates that the pruning search framework can achieve real-time 3D object detection on a mobile device (a Samsung Galaxy S20 phone) with state-of-the-art detection performance.
"Towards Real-Time 3D Object Detection with Pruning Search and Edge Devices", P. Zhao, W. Niu, G. Yuan, Y, Cai, B. Ren, Y. Wang, X. Lin, 2nd On-Device Intelligence Workshop, 2021.
With the increasing demand to efficiently deploy DNNs on mobile edge devices, it is ever more important to reduce unnecessary computation and increase execution speed. Prior methods towards this goal, including model compression and neural architecture search (NAS), are largely performed independently and do not fully consider compiler-level optimization, which is essential for mobile acceleration. In this work, Yanyu proposes NPS, a compiler-aware unified network pruning search, together with comprehensive compiler optimizations supporting different DNNs and pruning schemes, bridging the gap between weight pruning and NAS. The framework achieves 6.7 millisecond (ms), 5.9ms, and 3.9ms ImageNet inference times with 77%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone, consistently outperforming prior work.
"A Compiler-Aware Framework of Network Pruning Search (NPS) Achieving Beyond Real-Time Mobile Acceleration", Y. Li, G. Yuan, Z. Li, W. Niu, P. Zhao, P. Dong, Y. Cai, X. Shen, Z. Zhan, Z. Kong, Q. Jin, B. Ren, Y. Wang, X. In, 2nd On-Device Intelligence Workshop, 2021.
Federated learning allows edge devices to collaboratively learn a shared prediction model while keeping their training data on the device, decoupling the ability to do ML from the need to store data in the cloud. Despite the algorithmic advancements in federated learning, support for on-device training on edge devices remains poor. Akhil presents one of the first explorations of on-device federated learning on various smartphones and embedded devices, using the Flower framework. He evaluates the system costs of on-device federated learning, and discusses how this quantification could be used to design more efficient algorithms.
"On-Device Federated Learning with Flower", A. Mathur, D. J. Beutel, P. P. B. de Gusmao, J. Fernandez-Marques, T. Topal, X. Qiu, T. Parcollet, Y. Gao, N. D. Lane, 2nd On-Device Intelligence Workshop, 2021.
Laser-induced breakdown spectroscopy (LIBS) is a popular, fast elemental analysis technique used to determine the chemical composition of target samples, such as in industrial analysis of metals or in space exploration. Recently, there has been a rise in the use of ML techniques for LIBS data processing. However, applying ML to LIBS is challenging because models deployed on remote instruments must be retrained on-device as new data arrives. This on-device retraining should not only be fast, but also unsupervised, due to the absence of new labeled data in remote LIBS systems. Kshitij introduces a lightweight multi-layer perceptron (MLP) model for LIBS that can be adapted on-device without requiring labels for new input data. It achieves 89.3% average accuracy during data streaming, up to 2.1% better than an MLP model that does not support adaptation. Finally, Kshitij characterizes the inference and retraining performance of the model on a Google Pixel 2 phone.
"Semi-Supervised On-Device Neural Network Adaption for Remote and Portable Laser-Induced Breakdown Spectroscopy", K. Bhardwaj, 2nd On-Device Intelligence Workshop, 2021.
The ML and systems community strives to achieve higher energy efficiency through custom DNN accelerators and model compression techniques. This has led to a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while providing accurate and fast power, performance, and area models. In this work, Ahmet presents QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. The framework can facilitate future research on design space exploration of DNN accelerators across design choices including bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Ahmet’s results show that different bit precisions and processing element types lead to significant differences in performance per area and per unit energy. Specifically, his proposed lightweight processing elements achieve up to 4.9x better performance per area and energy when compared to an INT16-based implementation.
"QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators", A. Inci, S. G. Virupaksha, A. Jain, V. V. Thallam, R. Ding, D. Marculescu, 2nd On-Device Intelligence Workshop, 2021.
Igor reviews the challenges associated with designing models that can be run on memory and compute constrained devices. He then summarizes some of the model design techniques which are particularly useful for TinyML applications, including pruning, quantization, and black-box / gradient-based neural architecture search.
Colby dives deeper into how to deploy neural network models to microcontrollers (MCUs) using TensorFlow Lite for Microcontrollers, and how to profile their latency and memory consumption.
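The usual path to TensorFlow Lite for Microcontrollers starts with converting a trained Keras model into a fully int8-quantized .tflite flatbuffer, which is then embedded in the firmware. A minimal sketch follows; the model and representative dataset are placeholders.

```python
import numpy as np
import tensorflow as tf

# Tiny placeholder model; a real TinyML model would be trained on the target task.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data():
    # A few hundred representative samples let the converter calibrate activation ranges.
    for _ in range(100):
        yield [np.random.rand(1, 32).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer quantization so the model runs on integer-only MCU kernels.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
# On the MCU, this flatbuffer is typically embedded as a C array (e.g. via `xxd -i`)
# and executed with the TensorFlow Lite for Microcontrollers interpreter.
```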
Find out more about how we are expanding applications for ML through research, including all the resources you will need to learn about our latest work.
Arm ML Research Lab