Arm Community – Research Collaboration and Enablement – Research Articles
Neural network architectures for deploying TinyML applications on commodity microcontrollers

Colby Banbury
June 29, 2021
7 minute read time.

What is TinyML?

Data transmission tends to dominate the energy consumption of an Internet of Things (IoT) device [1], yet a significant portion of the data transmitted is never used. The reality is that the vast majority of data is not that interesting, but we have to keep looking so we do not miss the moments when something significant does happen [2]. If IoT systems could be smart enough to transmit only the interesting data, they could last far longer on battery power and would be less likely to flood the network with uninteresting traffic. Enter ‘TinyML’, which seeks to deploy Machine Learning (ML) algorithms on ultra-low-power systems so that devices can intelligently select which data to transmit, improving energy efficiency.

Challenges of TinyML

Deploying ML inference tasks on TinyML devices comes with a unique set of challenges. Arm Cortex-M class microcontrollers (MCUs) often have severely limited Flash and static random-access memory (SRAM) in which to store model weights and activations. Even size-efficient models designed for mobile devices, like MobileNets [3], require an order of magnitude more memory than is typically available on MCUs. Furthermore, IoT applications often have to keep pace with the rate at which data is collected, despite MCUs having relatively low computational resources. Finally, the faster we run the model, the more time we can spend in ‘sleep mode’, where we shut down the MCU to preserve energy and extend battery life. To effectively perform ML on MCUs, we need to design optimal neural network architectures that fit the constraints of SRAM memory, Flash memory, and latency. MicroNets [4], our recent work published at MLSys 2021, tackles this challenge via neural architecture search.
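The three constraints above can be checked independently for any candidate model. The sketch below shows the shape of such a check, with illustrative budgets and a hypothetical runtime overhead figure; real limits depend on the target part and the inference framework.

```python
# Sketch: screening a candidate model against MCU budgets.
# All numbers are illustrative, not measurements from the paper.
SRAM_BUDGET = 320 * 1024      # bytes, e.g. an STM32F746-class MCU
FLASH_BUDGET = 1024 * 1024    # bytes
LATENCY_BUDGET_MS = 100.0     # real-time deadline for the application

def fits_constraints(peak_activation_bytes, weight_bytes, est_latency_ms,
                     runtime_overhead_bytes=50 * 1024):
    """Return True if the model fits SRAM, Flash, and latency budgets.

    runtime_overhead_bytes is a hypothetical allowance for the inference
    framework and system code, which the article notes is typically small
    and can be accounted for statically.
    """
    sram_ok = peak_activation_bytes + runtime_overhead_bytes <= SRAM_BUDGET
    flash_ok = weight_bytes + runtime_overhead_bytes <= FLASH_BUDGET
    latency_ok = est_latency_ms <= LATENCY_BUDGET_MS
    return sram_ok and flash_ok and latency_ok

print(fits_constraints(200 * 1024, 400 * 1024, 60.0))   # fits all budgets
print(fits_constraints(400 * 1024, 400 * 1024, 60.0))   # exceeds SRAM
```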

Differentiable neural architecture search

We use differentiable neural architecture search (DNAS) to discover accurate neural network architectures while satisfying SRAM, Flash, and latency constraints. DNAS uses gradient descent to rapidly produce optimized models, but prefers constraints to be closed form functions. To accurately model our constraints, we performed an in-depth characterization of neural network performance on MCUs.
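Because DNAS works by gradient descent, resource constraints are folded into the training objective as differentiable penalty terms computed from closed-form resource models. The sketch below shows one common way to do this, a soft hinge penalty on each resource; the penalty form and weights are illustrative, not the paper's exact formulation.

```python
# Sketch of a DNAS-style objective: task loss plus penalties on
# closed-form resource estimates. Illustrative, not the paper's loss.
def constrained_loss(task_loss, est_sram, est_flash, est_ops,
                     sram_budget, flash_budget, ops_budget,
                     penalty_weight=1.0):
    def hinge(value, budget):
        # Zero while within budget; grows linearly once exceeded, so
        # gradients push the search back inside the feasible region.
        return max(0.0, value / budget - 1.0)

    penalty = (hinge(est_sram, sram_budget)
               + hinge(est_flash, flash_budget)
               + hinge(est_ops, ops_budget))
    return task_loss + penalty_weight * penalty

print(constrained_loss(1.0, 100, 100, 100, 200, 200, 200))  # in budget: 1.0
print(constrained_loss(1.0, 400, 100, 100, 200, 200, 200))  # SRAM 2x over
```

In a real search the estimates would be differentiable functions of the architecture parameters (for example, `max` replaced by a smooth relaxation), so the penalty steers the search rather than merely rejecting models after the fact.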

Hardware characterization 

To characterize the performance of TinyML models, we tested hundreds of models and related their on-device metrics to their architecture. We deployed the neural networks to an STM32F746ZG MCU using the TensorFlow Lite for Microcontrollers (TFLite Micro) inference framework.

SRAM and Flash memory

We determined that the SRAM consumption of the model is dominated by the size of the buffer for intermediate tensors used during neural network execution, with some additional capacity consumed by persistent buffers that must be maintained between inferences. These factors can be determined directly from the architecture of the model. The SRAM overhead of the TFLite Micro framework and of the underlying bare-metal system is typically minimal.
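For a simple sequential network, the intermediate-tensor buffer must hold a layer's input and output at the same time, so its peak size follows directly from the architecture. A minimal sketch, assuming int8 tensors and made-up layer shapes:

```python
# Sketch: estimating peak activation SRAM for a sequential model.
# tensor_sizes holds the element count of the input tensor followed by
# each layer's output tensor; shapes below are illustrative.
def peak_activation_bytes(tensor_sizes, bytes_per_element=1):
    """Peak of (input + output) bytes across consecutive layers (int8)."""
    peak = 0
    for inp, out in zip(tensor_sizes, tensor_sizes[1:]):
        peak = max(peak, (inp + out) * bytes_per_element)
    return peak

# e.g. a 49x10 spectrogram input, two 25x5x64 feature maps, a 12-way output
sizes = [49 * 10 * 1, 25 * 5 * 64, 25 * 5 * 64, 12]
print(peak_activation_bytes(sizes))  # → 16000
```

Real memory planners (including TFLite Micro's arena allocator) can do better than this worst case by reusing buffers, but the estimate is computable from the architecture alone, which is what the search needs.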

The Flash consumption is a similar story: the weights and biases of the model account for, on average, the majority of the Flash consumption, with a significant portion of the remainder coming from the quantization parameters. Both of these can be determined from the model architecture. TFLite Micro and the system code are, again, overheads that can be accounted for statically.
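A Flash estimate can likewise be computed layer by layer. The sketch below assumes int8 weights, int32 biases, and per-channel quantization parameters (a float32 scale plus an int32 zero point per output channel); the layer shapes are illustrative.

```python
# Sketch: estimating Flash usage from the model architecture, assuming
# int8 weights and per-channel quantization. Shapes are illustrative.
def flash_bytes(layers):
    """layers: list of (weight_count, bias_count, out_channels) tuples."""
    total = 0
    for weights, biases, channels in layers:
        total += weights * 1          # int8 weights, 1 byte each
        total += biases * 4           # int32 biases
        total += channels * 8         # float32 scale + int32 zero point
    return total

# two 3x3 conv layers: 1->64 channels, then 64->64 channels
layers = [(3 * 3 * 1 * 64, 64, 64), (3 * 3 * 64 * 64, 64, 64)]
print(flash_bytes(layers))  # → 38976
```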

Figure 1: Breakdown of SRAM and Flash memory usage of an example Keyword Spotting model.

Latency

We observed an interesting relationship between operation (Op) count and latency. Models with significantly different architectures can differ in Op throughput, which means measuring actual on-device latency is the only way to accurately compare two distinct models. However, we observed that models sampled from the same backbone have very similar Op throughputs. Therefore, we can use op count to predict latency, as long as we sample from a single backbone, as we already do with DNAS. Op count is simple to calculate from the model architecture, and therefore, as with SRAM and Flash usage, we can predict latency during the architecture search.
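In practice this amounts to fitting a single throughput figure per backbone from a handful of on-device measurements, then predicting latency as op count divided by throughput. A minimal sketch with made-up measurement numbers:

```python
# Sketch: a per-backbone linear latency model. Throughput is fitted from
# a few on-device measurements of models sampled from the same backbone;
# the measurement values below are made up for illustration.
def fit_throughput(measurements):
    """measurements: list of (op_count, measured_latency_s) pairs.

    Least-squares fit of a line through the origin for the model
    latency ~= op_count / throughput; returns throughput in ops/s.
    """
    num = sum(ops * ops for ops, _ in measurements)
    den = sum(ops * lat for ops, lat in measurements)
    return num / den

def predict_latency(op_count, throughput):
    return op_count / throughput

meas = [(10e6, 0.05), (20e6, 0.10), (40e6, 0.20)]  # a 200 MOps/s device
tp = fit_throughput(meas)
print(predict_latency(30e6, tp))  # → 0.15
```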

Figure 2: The latency of models sampled from two different backbones, measured on two MCUs.

MLPerf Tiny use cases

We target three use cases from MLPerf Tiny, an MLCommons benchmark for TinyML systems: visual wake words [5], keyword spotting [6], and anomaly detection [7].

Visual wake words is a binary image classification dataset where the task is to determine if a person is in the image or not. An example of this use case would be a smart doorbell, where the camera can notify you when a person is at the door. 

Keyword spotting is widely used in commercial IoT devices, with households around the world using “Alexa” or “OK Google” to wake their devices multiple times a day. The dataset that we used has 12 labels: 10 target words, plus silence and unknown categories.

Anomaly detection is an unsupervised audio task where the goal is to identify faults or abnormal behavior in a machine through the sounds it produces. This use case has wide applicability in industrial settings, where an inexpensive microcontroller can be used to automatically detect faults in machinery for predictive maintenance.

Results

Using DNAS, we produced a set of MicroNets that achieve state-of-the-art performance in all metrics on all three target use cases. Our models trade off accuracy and size to best meet the requirements of a given application, and all of them achieve real-time performance on their target MCU. The results are shown below and compared against a number of baseline models.

The models are open source and available here.

Figure 3: Keyword Spotting MicroNets Results.

Figure 4: Visual Wake Words MicroNets Results. TFLM refers to the stock example model provided by TensorFlow Lite for Microcontrollers.

Figure 5: Anomaly Detection MicroNets Results.

Conclusion

TinyML has the potential to revolutionize IoT and democratize AI, but the hardware constraints of microcontrollers make it difficult to deploy accurate models. The Arm ML Research Lab has been working on this topic for a number of years, to develop compact and accurate models that run efficiently on MCUs [8][9][10] and also to enable distributed learning [11]. If you’re interested in these topics, please get in touch!  

MicroNets demonstrates that differentiable neural architecture search can be used to rapidly find state-of-the-art models that can be easily deployed to commodity MCUs. This technique can be extended to other TinyML use cases and enable more efficient deployment of IoT applications. We plan to conduct future research on efficient TinyML model and system design.

The MicroNets paper was published at MLSys 2021 and is available here. You can watch the presentation from the conference here.

Questions? Contact Paul Whatmough 

Want to reference the work? 

“MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers”, C. Banbury, C. Zhou, I. Fedorov, R. M. Navarro, U. Thakker, D. Gope, V. J. Reddi, M. Mattina, P. Whatmough, MLSys, 2021.

References

[1] Bouguera, Taoufik et al. “Energy Consumption Model for Sensor Nodes Based on LoRa and LoRaWAN.” Sensors (Basel, Switzerland) vol. 18,7 2104. 30 Jun. 2018, doi:10.3390/s18072104

[2] McKinsey Global Institute. “The Internet of Things: Mapping the Value Beyond the Hype.” mckinsey.com

[3] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861, 2017.

[4] Banbury, Colby, et al. "Micronets: Neural network architectures for deploying tinyml applications on commodity microcontrollers." Proceedings of Machine Learning and Systems 3, 2021.

[5] Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. Visual wake words dataset. arXiv preprint arXiv:1906.05721, 2019.

[6] Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

[7] Purohit, H., Tanabe, R., Ichige, K., Endo, T., Nikaido, Y., Suefusa, K., and Kawaguchi, Y. MIMII dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. arXiv preprint arXiv:1909.09347, 2019.

[8] I. Fedorov, R. P. Adams, M. Mattina, P. N. Whatmough, “SpArSe: Sparse Architecture Search for CNNs on Resource Constrained Microcontrollers”, Advances in Neural Information Processing Systems (NeurIPS), 2019

[9] I. Fedorov, M. Stamenovic, C. Jensen, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, P. N. Whatmough, “TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids”, InterSpeech, 2020

[10] U. Thakker, P. N. Whatmough, Z. Liu, M. Mattina, J. Beu, “Doping: A technique for Extreme Compression of LSTM Models using Sparse Structured Additive Matrices”, Machine Learning and Systems (MLSys), 2021

[11] D. A. E. Acar, Y. Zhao, R. M.  Navarro, M. Mattina, P. N. Whatmough, V. Saligrama, “Federated learning based on dynamic regularization”, International Conference on Learning Representations (ICLR), 2021

Comments

Ankit Tiwari, over 1 year ago:

    Hi,
    We are currently living in a world surrounded by Machine Learning models. TinyML is a field of study in Machine Learning and Embedded Systems that explores the types of models you can run on small, low-powered devices like microcontrollers. It enables low-latency, low-power, and low-bandwidth model inference at edge devices. Hope this will help.