By Oscar Andersson, Sebastian Larsson, Igor Fedorov, and Patrik Laurell
Only a few years ago, it was unthinkable to run image recognition software with high accuracy on an edge device with less than a megabyte of memory. The rapid development of TensorFlow Lite for Microcontrollers (TFLM) [1], increased hardware capabilities such as Arm’s Ethos-U55 and Cortex-M55, and specialized models for tiny devices have made the unthinkable not only possible but easy. It is therefore our pleasure to introduce this new and improved image recognition demo for microcontrollers. By combining Arm’s Cortex-M processors with TFLM, CMSIS-NN (Common Microcontroller Software Interface Standard - Neural Networks) optimizations, a compact neural network model, and Arm’s open-source Mbed OS, we have created a demo that is easy to follow and learn from. It was also an opportunity to become early adopters of the new and improved Mbed command-line interface, Mbed CLI 2.
In this project, we used the STMicroelectronics Discovery STM32F746NG board [2], which is powered by Arm’s Cortex-M7 processor and equipped with a 4.3” LCD and a 1.3 MP camera [3].
Arm’s open-source operating system Mbed OS offers everything you need to deploy an Internet-of-Things (IoT) project to a Cortex-M processor. It is easy to use and contains the features an IoT project needs, including security, storage, and drivers for sensors and I/O devices [4]. In November 2020, Mbed released version 6.5 along with a new command-line tool, Mbed CLI 2, succeeding Mbed CLI 1. Mbed CLI 2, also known as Mbed tools, uses Ninja and CMake instead of the custom build system used previously [5]. CMake has the advantage of being compiler- and platform-independent, which makes it easier for teams to develop on different platforms [6]. Senior Product Manager Andy Powers says, “An additional benefit of using Ninja and CMake is improvements in build time of up to 40% compared to Mbed CLI 1”.
Machine learning on microcontroller units (MCUs) is challenging because of the many constraints that the hardware/software system places on the machine learning model. The main constraints are memory usage, latency, and power consumption. The MCU used in this project has 1 MB of flash and 340 kB of SRAM, which severely limits both the number of parameters the machine learning model can use and the size of any intermediate tensors the model generates during inference. We focus our attention on one specific kind of machine learning model: the neural network. Neural networks achieve state-of-the-art results on many real-world tasks such as visual wake words, audio keyword spotting, and anomaly detection [7]. At the same time, many off-the-shelf neural networks [8][9][10] are parameterized by millions of weights and produce giant feature maps, making them incompatible with resource-constrained MCUs.
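To make the memory constraint concrete, the sketch below shows how TFLM surfaces it in application code: all of a model's intermediate tensors are planned into a single statically allocated arena that must fit in SRAM. This is a minimal illustration, not the demo's actual source; the symbol `g_model_data`, the arena size, and the op list are assumptions.

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Model weights compiled into flash as a C array (symbol name assumed).
extern const unsigned char g_model_data[];

// Illustrative size only; the real requirement depends on the model, but
// it is bounded by the 340 kB of on-chip SRAM.
constexpr int kTensorArenaSize = 130 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

static tflite::MicroErrorReporter error_reporter;

tflite::MicroInterpreter* SetUpInterpreter() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the ops the model actually uses to keep flash usage down.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddMaxPool2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(
      model, resolver, tensor_arena, kTensorArenaSize, &error_reporter);

  // AllocateTensors() fails up front if the arena cannot hold the model's
  // activation tensors, so the SRAM budget is checked at startup.
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return nullptr;
  }
  return &interpreter;
}
```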
Arm ML Research Group is active in enabling so-called TinyML across a number of applications, including audio [11] and image [12] use cases. For this project, we used a new technique for automatically generating compact, performant neural networks called Differentiable Neural Architecture Search (DNAS) [12]. Designing new hardware-friendly neural networks is typically a time-consuming manual process; DNAS allows us to leverage machine learning itself to traverse the search space and find the optimal balance of classification accuracy, memory usage, and inference latency.
The model was trained on the CIFAR-10 dataset [13] and classifies images into 10 classes, {plane, car, bird, cat, deer, dog, frog, horse, ship, truck}, with an accuracy of 89%. It is worth noting that CIFAR-10 test set accuracy does not necessarily reflect real-world usage, as the conditions are not the same.
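As a rough illustration of how the model's 10-way output maps back to these labels, the hypothetical helper below runs one inference and picks the highest-scoring class. It assumes an int8-quantized output tensor and an interpreter set up as in the earlier sketch; the demo's actual code may differ.

```cpp
#include "tensorflow/lite/micro/micro_interpreter.h"

static const char* kLabels[10] = {"plane", "car",  "bird",  "cat",  "deer",
                                  "dog",   "frog", "horse", "ship", "truck"};

// Assumes the input tensor has already been filled with a camera frame.
const char* ClassifyImage(tflite::MicroInterpreter& interpreter) {
  if (interpreter.Invoke() != kTfLiteOk) {
    return "invoke failed";
  }
  // Pick the class with the highest score.
  TfLiteTensor* output = interpreter.output(0);
  int best = 0;
  for (int i = 1; i < 10; ++i) {
    if (output->data.int8[i] > output->data.int8[best]) {
      best = i;
    }
  }
  return kLabels[best];
}
```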
CMSIS-NN implements performance optimizations of common neural network functions such as 2D convolution and matrix multiplication for fully connected layers. To compare the impact of CMSIS-NN against the reference kernels in TFLM, we deployed the demo first with the reference kernels and then with the CMSIS-NN kernels. The comparison showed a 3.1x speedup in inference time when using CMSIS-NN, as can be seen in Table 1. Using Arm Compiler (armclang) was faster than the GNU Compiler Collection (GCC), both with and without CMSIS-NN, and achieved a similar speedup of 2.9x.
Table 1: Performance in inferences per second using CMSIS-NN optimized kernels vs. reference kernels on Arm Cortex-M7.
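Notably, the CMSIS-NN kernels are drop-in replacements chosen when the library is built (for example via the OPTIMIZED_KERNEL_DIR=cmsis_nn option in upstream TFLM), so the application code is identical in both configurations. A Table 1-style measurement could be reproduced with a sketch along these lines, using Mbed OS's Timer API; the helper itself is our own illustration, not the demo's benchmarking code:

```cpp
#include <chrono>

#include "mbed.h"
#include "tensorflow/lite/micro/micro_interpreter.h"

float InferencesPerSecond(tflite::MicroInterpreter& interpreter, int runs) {
  mbed::Timer timer;
  timer.start();
  for (int i = 0; i < runs; ++i) {
    interpreter.Invoke();  // same call for reference and CMSIS-NN kernels
  }
  timer.stop();
  // elapsed_time() returns a std::chrono duration in Mbed OS 6.
  const float seconds =
      std::chrono::duration<float>(timer.elapsed_time()).count();
  return runs / seconds;
}
```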
To deploy this image recognition demo (a sketch of the typical commands follows this list):
1) Clone the repo https://github.com/ARM-software/ML-examples
2) Follow the steps in README.md, inside the tflm-cmsisnn-mbed-image-recognition folder
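For orientation, those steps typically boil down to something like the following Mbed CLI 2 invocation. The flags and target name here are assumptions based on the Mbed tools documentation (DISCO_F746NG is the Mbed target for this Discovery board, and GCC_ARM can be swapped for ARM to use armclang); README.md remains the authoritative reference.

```sh
git clone https://github.com/ARM-software/ML-examples.git
cd ML-examples/tflm-cmsisnn-mbed-image-recognition
mbed-tools deploy                       # fetch Mbed OS and other dependencies
mbed-tools compile -m DISCO_F746NG -t GCC_ARM --flash
```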
Even if you do not have the Discovery board model used in this project, you can still run the model test on any Mbed-enabled device that can fit the final binary in memory. The test classifies 50 random images from the CIFAR-10 dataset and reports accuracy as well as layer-by-layer information about the model. You can find the instructions for this in README.md.
Looking ahead, we can expect a dramatic increase in machine learning processing power from devices that pair a microcontroller with a microNPU (micro Neural Processing Unit). Arm’s Cortex-M55, coupled with the Ethos-U55, offers up to a 480x performance increase compared to previous microcontrollers.
Arm’s new compute technologies extend the performance of Arm’s AI for endpoint devices, offering silicon providers a more diverse range of hardware choices and empowering developers to deliver this next revolution in computing. For more information on Cortex-M55 and Ethos-U55, please read this blog post.
Special thanks to Fredrik Knutsson, Felix Johnny Thomasmathibalan, Andy Powers, Jaeden Amero, and Paul Whatmough for contributing their precious time and vast knowledge.
[1] https://www.tensorflow.org/lite/microcontrollers
[2] https://www.st.com/en/evaluation-tools/32f746gdiscovery.html#overview
[3] https://www.element14.com/community/docs/DOC-67585?ICID=knode-STM32F4-cameramore
[4] https://os.mbed.com/mbed-os/
[5] https://os.mbed.com/docs/mbed-os/v6.12/build-tools/index.html
[6] https://cmake.org/
[7] https://community.arm.com/developer/research/b/articles/posts/neural-network-architectures-for-deploying-tinyml-applications-on-commodity-microcontrollers
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR, abs/1704.04861.
[9] K. He, X. Zhang, S. Ren, and J. Sun (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.
[10] S. Zagoruyko and N. Komodakis (2016). Wide Residual Networks. CoRR, abs/1605.07146.
[11] I. Fedorov, M. Stamenovic, C. Jensen, L.-C. Yang, A. Mandell, Y. Gan, M. Mattina, and P. N. Whatmough, “TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids”, InterSpeech, 2020.
[12] C. R. Banbury, C. Zhou, I. Fedorov, R. M. Navarro, U. Thakker, D. Gope, V. J. Reddi, M. Mattina, and P. N. Whatmough, “MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers”, MLSys, 2021.
[13] https://www.cs.toronto.edu/~kriz/cifar.html