Neural networks are becoming increasingly popular in always-on IoT edge devices, performing data analytics right at the source and reducing both latency and the energy cost of data communication. CMSIS-NN is a collection of efficient neural network kernels developed to maximize performance and minimize the memory footprint of neural networks on Arm Cortex-M processor cores targeted at intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency over a baseline implementation.
The CMSIS-NN library consists of two parts: NNFunctions and NNSupportFunctions. NNFunctions implement popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected (i.e. inner-product), pooling and activation; application code uses these functions to implement neural network inference. The kernel APIs are kept simple, so that they can be easily retargeted by any machine learning framework. NNSupportFunctions include utility functions, such as data conversion and activation function tables, which are used by NNFunctions. Application code can also use these utilities to construct more complex NN modules, e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU).
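As a rough illustration, the sketch below shows how application code might invoke the kernels directly, using the original q7 fixed-point CMSIS-NN API. The layer dimensions and quantization shift values are illustrative assumptions, and exact function signatures can vary with the library version.

```c
#include "arm_nnfunctions.h"

/* Illustrative dimensions: a 128-input, 10-output fully-connected layer. */
#define DIM_VEC  128
#define NUM_ROWS 10

static q7_t  input[DIM_VEC];              /* q7 activations from the previous layer */
static q7_t  weights[DIM_VEC * NUM_ROWS]; /* q7 weight matrix                       */
static q7_t  bias[NUM_ROWS];              /* q7 bias vector                         */
static q7_t  output[NUM_ROWS];            /* q7 layer output                        */
static q15_t vec_buffer[DIM_VEC];         /* scratch buffer required by the kernel  */

void run_fc_layer(void)
{
    /* NNFunction: fully-connected layer. The bias_shift and out_shift values
     * come from the model's quantization; the ones here are placeholders. */
    arm_fully_connected_q7(input, weights, DIM_VEC, NUM_ROWS,
                           0 /* bias_shift */, 7 /* out_shift */,
                           bias, output, vec_buffer);

    /* Activation: in-place ReLU on the layer output. */
    arm_relu_q7(output, NUM_ROWS);
}
```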
For some kernels (e.g. fully-connected and convolution), multiple versions of the kernel functions are implemented. A basic version works universally, 'as-is', for any layer parameters. We have also implemented other versions with further optimizations, but these either require transformed inputs or impose some limitations on the layer parameters. Ideally, a simple script can parse the network topology and automatically determine the appropriate functions to use.
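For example, the fast q7 convolution variant requires the input channel count to be a multiple of 4 and the output channel count a multiple of 2 (per the CMSIS-NN documentation; check your library version). A dispatch helper along the following lines, with a hypothetical convolve_auto wrapper, sketches how the fallback to the universal basic kernel could work:

```c
#include "arm_nnfunctions.h"

/* Hypothetical helper: pick the fastest applicable q7 convolution kernel,
 * falling back to the universal basic version when the fast kernel's
 * channel-count constraints are not met. */
arm_status convolve_auto(const q7_t *im_in, uint16_t dim_im_in, uint16_t ch_im_in,
                         const q7_t *wt, uint16_t ch_im_out, uint16_t dim_kernel,
                         uint16_t padding, uint16_t stride,
                         const q7_t *bias, uint16_t bias_shift, uint16_t out_shift,
                         q7_t *im_out, uint16_t dim_im_out,
                         q15_t *buffer_a, q7_t *buffer_b)
{
    if ((ch_im_in % 4 == 0) && (ch_im_out % 2 == 0)) {
        return arm_convolve_HWC_q7_fast(im_in, dim_im_in, ch_im_in, wt,
                                        ch_im_out, dim_kernel, padding, stride,
                                        bias, bias_shift, out_shift,
                                        im_out, dim_im_out, buffer_a, buffer_b);
    }
    return arm_convolve_HWC_q7_basic(im_in, dim_im_in, ch_im_in, wt,
                                     ch_im_out, dim_kernel, padding, stride,
                                     bias, bias_shift, out_shift,
                                     im_out, dim_im_out, buffer_a, buffer_b);
}
```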
We tested the CMSIS-NN kernels on a convolutional neural network (CNN) trained on the CIFAR-10 dataset, which consists of 60,000 32x32 color images divided into 10 output classes. The network topology is based on the built-in example provided in Caffe, with three convolution layers and one fully-connected layer. The layer parameters and the detailed runtime results using the CMSIS-NN kernels are shown in the table below. The runtime is measured on an STMicroelectronics NUCLEO-F746ZG mbed board with an Arm Cortex-M7 core running at 216 MHz.
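For a flavor of what the inference code looks like, here is a sketch of the first stage of such a network (convolution, max pooling, ReLU) using the original CMSIS-NN q7 kernels. The layer dimensions follow the Caffe CIFAR-10 example; the quantization shifts and the unused-scratch arguments are illustrative assumptions that depend on the quantized model and the library version.

```c
#include "arm_nnfunctions.h"

/* First stage of the CIFAR-10 network: conv1 (5x5, pad 2, stride 1,
 * 3 -> 32 channels on a 32x32 RGB input), 3x3/2 max pooling, ReLU. */
#define CONV1_IM_DIM  32
#define CONV1_IM_CH   3
#define CONV1_OUT_CH  32
#define CONV1_KER_DIM 5
#define CONV1_PAD     2
#define CONV1_STRIDE  1
#define CONV1_OUT_DIM 32
#define POOL1_KER_DIM 3
#define POOL1_STRIDE  2
#define POOL1_OUT_DIM 16

static q7_t  image[CONV1_IM_DIM * CONV1_IM_DIM * CONV1_IM_CH];
static q7_t  conv1_wt[CONV1_IM_CH * CONV1_KER_DIM * CONV1_KER_DIM * CONV1_OUT_CH];
static q7_t  conv1_bias[CONV1_OUT_CH];
static q7_t  conv1_out[CONV1_OUT_DIM * CONV1_OUT_DIM * CONV1_OUT_CH];
static q7_t  pool1_out[POOL1_OUT_DIM * POOL1_OUT_DIM * CONV1_OUT_CH];
/* im2col scratch: 2 * ch_im_in * dim_kernel * dim_kernel, per the kernel docs */
static q15_t buffer_a[2 * CONV1_IM_CH * CONV1_KER_DIM * CONV1_KER_DIM];

void run_first_stage(void)
{
    /* Dedicated RGB convolution kernel for the 3-channel input image;
     * the shift values are placeholders for the model's quantization. */
    arm_convolve_HWC_q7_RGB(image, CONV1_IM_DIM, CONV1_IM_CH,
                            conv1_wt, CONV1_OUT_CH, CONV1_KER_DIM,
                            CONV1_PAD, CONV1_STRIDE,
                            conv1_bias, 0 /* bias_shift */, 9 /* out_shift */,
                            conv1_out, CONV1_OUT_DIM, buffer_a, NULL);

    /* 3x3 max pooling with stride 2 (scratch unused by this kernel in the
     * versions we have seen), followed by in-place ReLU. */
    arm_maxpool_q7_HWC(conv1_out, CONV1_OUT_DIM, CONV1_OUT_CH,
                       POOL1_KER_DIM, 0 /* padding */, POOL1_STRIDE,
                       POOL1_OUT_DIM, NULL, pool1_out);
    arm_relu_q7(pool1_out, POOL1_OUT_DIM * POOL1_OUT_DIM * CONV1_OUT_CH);
}
```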
The entire image classification takes about 99.1 ms per image (equivalent to 10.1 images per second), and the compute throughput of the CPU is about 249 MOps per second for this network. The pre-quantized network achieves an accuracy of 80.3% on the CIFAR-10 test set; the 8-bit quantized network running on the Arm Cortex-M7 core achieves 79.9%. The maximum memory footprint using the CMSIS-NN kernels is ~133 KB: to save memory, convolutions are implemented with a partial im2col followed by matrix multiplication. Without partial im2col the footprint would be ~332 KB, and the neural network would not fit on the board.
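To make that trade-off concrete, here is a conceptual sketch of partial im2col; this is not the actual CMSIS-NN internals, and the helper plus the assumption that input patches are precomputed as flattened rows are purely illustrative. The point is that only a small window of im2col columns is materialized at a time and immediately consumed by a matrix multiplication, so the scratch buffer stays small and independent of image size.

```c
#include <stdint.h>
#include <string.h>

/* Number of im2col columns kept live at once; a full im2col would need
 * num_patches * patch_len bytes of scratch, this needs only
 * COLS_PER_STEP * patch_len. */
#define COLS_PER_STEP 2

/* Hypothetical helper: copy one flattened input patch into dst. Real code
 * would gather the patch from the image using kernel size, stride and pad. */
static void expand_patch(const int8_t *patches, int patch_len, int p, int8_t *dst)
{
    memcpy(dst, patches + (size_t)p * patch_len, (size_t)patch_len);
}

void conv_partial_im2col(const int8_t *patches, const int8_t *weights,
                         int32_t *output, int num_patches, int patch_len,
                         int num_filters,
                         int8_t *col_buf /* COLS_PER_STEP * patch_len bytes */)
{
    for (int p0 = 0; p0 < num_patches; p0 += COLS_PER_STEP) {
        int cols = (num_patches - p0 < COLS_PER_STEP) ? num_patches - p0
                                                      : COLS_PER_STEP;
        /* Expand only a small window of columns... */
        for (int c = 0; c < cols; c++)
            expand_patch(patches, patch_len, p0 + c, col_buf + c * patch_len);

        /* ...and consume them immediately: (num_filters x patch_len) matrix
         * times (patch_len x cols) columns, one output pixel per column. */
        for (int f = 0; f < num_filters; f++)
            for (int c = 0; c < cols; c++) {
                int32_t acc = 0;
                for (int i = 0; i < patch_len; i++)
                    acc += weights[f * patch_len + i] * col_buf[c * patch_len + i];
                output[(p0 + c) * num_filters + f] = acc;
            }
    }
}
```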
To quantify the benefits of the CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a 1D convolution function (arm_conv from CMSIS-DSP), Caffe-like pooling and ReLU. For the CNN application, the table below summarizes the comparison between the baseline functions and the CMSIS-NN kernels. The CMSIS-NN kernels achieve a 2.6X to 5.4X improvement in runtime/throughput over the baseline functions, and the energy efficiency improvement is in line with the throughput improvement.
Efficient NN kernels are key to enabling inference on Arm Cortex-M based CPUs. CMSIS-NN provides optimized functions to accelerate key NN layers, such as convolution, pooling and activations. In addition, CMSIS-NN helps to reduce the memory footprint, which is critical for memory-constrained microcontrollers. More details are in our whitepaper, which you can download from the Cornell University Library site using the button below.
The CMSIS-NN kernels are available on GitHub. Application code can use these kernels directly to implement neural network algorithms on Arm Cortex-M CPUs. Alternatively, machine learning frameworks can use them as primitives to deploy trained models.
For further resources, detailed product information and tutorials to help tackle the challenges of ML at the edge, please visit our new Machine Learning developer site.
[CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs White Paper](https://arxiv.org/abs/1801.06601)
What exciting news! Are there any performance benchmarks for the M7/M4/M3 at different clock speeds?
Also, one big issue in an embedded system is memory space, both code and data. Usually, an SoC integrates a large amount of flash, so program memory is not a problem, but fitting data into a small amount of SRAM is quite challenging. Typically, do you think this can fit into an SoC with 256 KB of SRAM? When talking about the memory footprint of CMSIS-NN you said: "Maximum memory footprint using the CMSIS-NN kernels is ~133 KB" — is this the standalone footprint of CMSIS-NN only, or the total data footprint of your given example?