Neural networks are becoming increasingly popular in always-on IoT edge devices that perform data analytics right at the source, reducing latency as well as energy consumption for data communication. CMSIS-NN is a collection of efficient neural network kernels developed to maximize the performance and minimize the memory footprint of neural networks on Arm Cortex-M processor cores targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency.
The CMSIS-NN library consists of two parts: NNFunctions and NNSupportFunctions. NNFunctions include the functions that implement popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected (i.e. inner-product), pooling and activation. These functions are used by the application code to implement neural network inference applications. The kernel APIs are also kept simple, so that they can be easily retargeted for any machine learning framework. NNSupportFunctions include different utility functions, such as data conversion and activation function tables, which are used in NNFunctions. These utility functions can also be used by the application code to construct more complex NN modules, e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU).
For some kernels (e.g. fully-connected and convolution), different versions of the kernel functions are implemented. A basic version is provided that works universally ‘as-is’ for any layer parameters. We have also implemented other versions which include further optimization techniques but with either transformed inputs or with some limitations on the layer parameters. Ideally, a simple script can be used to parse the network topology and automatically determine the appropriate functions to be used.
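A minimal sketch of such a selection step for the 8-bit convolution kernels is shown here. It assumes the three variants arm_convolve_HWC_q7_basic, arm_convolve_HWC_q7_RGB and arm_convolve_HWC_q7_fast, and it assumes the fast version requires an input channel count that is a multiple of 4 and an output channel count that is a multiple of 2, while the RGB version requires exactly 3 input channels. These constraints should be checked against the documentation of the library version in use. The enum and helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the convolution variants considered in this sketch. */
typedef enum {
    CONV_KERNEL_BASIC, /* works for any layer parameters */
    CONV_KERNEL_RGB,   /* assumed to require exactly 3 input channels */
    CONV_KERNEL_FAST   /* assumed to require ch_in % 4 == 0 and ch_out % 2 == 0 */
} conv_kernel_t;

/* Pick a convolution variant from the layer's channel counts. */
conv_kernel_t select_conv_kernel(uint16_t ch_in, uint16_t ch_out)
{
    assert(ch_in > 0 && ch_out > 0);

    if (ch_in % 4 == 0 && ch_out % 2 == 0) {
        return CONV_KERNEL_FAST;
    }
    if (ch_in == 3) {
        return CONV_KERNEL_RGB;
    }
    return CONV_KERNEL_BASIC; /* universal fallback */
}

/* Map the label to the kernel function name assumed in this sketch. */
const char *conv_kernel_name(conv_kernel_t kernel)
{
    switch (kernel) {
    case CONV_KERNEL_FAST: return "arm_convolve_HWC_q7_fast";
    case CONV_KERNEL_RGB:  return "arm_convolve_HWC_q7_RGB";
    default:               return "arm_convolve_HWC_q7_basic";
    }
}
```

For example, a first layer with 3 input channels would map to the RGB variant, and a layer with 32 input and 16 output channels would map to the fast variant. A layer that meets neither constraint falls back to the basic version.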
We tested the CMSIS-NN kernels on a convolutional neural network (CNN), trained on the CIFAR-10 dataset, consisting of 60,000 32x32 color images divided into 10 output classes. The network topology is based on the built-in example provided in Caffe, with three convolution layers and one fully-connected layer. The layer parameters and the detailed runtime results using the CMSIS-NN kernels are shown in the table below. The runtime is measured while running on an STMicroelectronics NUCLEO-F746ZG mbed board with an Arm Cortex-M7 core running at 216 MHz.
The entire image classification takes about 99.1 ms per image (the equivalent of 10.1 images per second). The compute throughput of the CPU is about 249 MOps per second for running this network. The pre-quantized network achieves an accuracy of 80.3% on the CIFAR-10 test set. The 8-bit quantized network running on the Arm Cortex-M7 core achieves 79.9% accuracy. The maximum memory footprint using the CMSIS-NN kernels is ~133 KB, where convolutions are implemented with partial im2col to save memory, followed by matrix multiplication. The memory footprint without partial im2col would be ~332 KB, and the neural network would not fit on the board.
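To illustrate why partial im2col saves memory, the sketch here compares the scratch buffer needed when every output position is expanded at once with the buffer needed when only a few columns are expanded at a time. The function names, the layer dimensions in the usage note, the 2-column window and the 16-bit element size are all assumptions for illustration. They are not the parameters behind the 133 KB and 332 KB figures above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to expand ALL output positions into im2col columns.
 * Each column holds kernel_dim * kernel_dim * ch_in elements. */
size_t im2col_full_bytes(size_t out_h, size_t out_w, size_t kernel_dim,
                         size_t ch_in, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return out_h * out_w * kernel_dim * kernel_dim * ch_in * elem_bytes;
}

/* Bytes needed when only `columns` im2col columns are held at a time. */
size_t im2col_partial_bytes(size_t columns, size_t kernel_dim,
                            size_t ch_in, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    return columns * kernel_dim * kernel_dim * ch_in * elem_bytes;
}
```

For a hypothetical layer with a 32x32 output, a 5x5 kernel, 3 input channels and 16-bit elements, the full expansion needs 153,600 bytes, while a 2-column window needs 300 bytes. The trade-off is that the matrix multiplication has to be interleaved with the expansion instead of running once over the whole expanded matrix.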
To quantify the benefits of CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a 1D convolution function (arm_conv from CMSIS-DSP), Caffe-like pooling and ReLU. For the CNN application, the table below summarizes the comparison results of the baseline functions and the CMSIS-NN kernels. The CMSIS-NN kernels achieve 2.6X to 5.4X improvement in runtime/throughput over the baseline functions. The energy efficiency improvement is also in line with the throughput improvement.
Efficient NN kernels are key in enabling inference on Arm Cortex-M based CPUs. CMSIS-NN provides optimized functions to accelerate key NN layers, such as convolution, pooling and activations. In addition, CMSIS-NN helps to reduce the memory footprint, which is key for memory-constrained microcontrollers. More details are in our whitepaper, which you can download from the Cornell University Library site using the button below.
The CMSIS-NN kernels are available on GitHub. The application code can directly use these kernels to implement neural network algorithms on Arm Cortex-M CPUs. Alternatively, these kernels can be used as primitives by machine learning frameworks to deploy trained models.
For further resources, detailed product information and tutorials to help tackle the challenges of ML at the edge, please visit our new Machine Learning developer site.
CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs White Paper
Hello to all,
I found this link very helpful: https://github.com/ARM-software/CMSIS_5/issues/325
There is a Python script there that converts images to the format expected by the CMSIS example program, along with some information about the format.
I have written a simple program on an STM32F769I-DISCO to run the CMSIS cifar-10 neural network example.
The program I made reads the "test_batch.bin" file from the SD card, runs the "cifar" function on each image, and then displays the image on the screen. A green border around the image means that the image was recognized by the software.
My program assumes that the input image format for the cifar function is the same as the binary format of the "test_batch.bin" file. Precisely: 1024 bytes for each color, starting with the red channel, followed by green and then blue.
But the program's match rate is below 10 percent, so there is a problem! When I visualize the data used by the example (the array "input_data"), I get an image with the wrong colors. My assumption was wrong!
So, please, which format should I use to feed the cifar function?
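One possible explanation, offered as an assumption rather than a confirmed answer, is a layout mismatch. The CIFAR-10 binary file stores each image as three separate color planes (all red bytes, then all green, then all blue), while kernels with HWC in their names expect the channels of each pixel to be interleaved (R, G, B, R, G, B, ...). A minimal sketch of the reordering follows. The function and macro names are hypothetical, and the example may apply further preprocessing (such as mean subtraction) that this sketch does not cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IMG_DIM    32                   /* CIFAR-10 images are 32x32 */
#define IMG_CH     3                    /* red, green, blue */
#define IMG_PIXELS (IMG_DIM * IMG_DIM)  /* 1024 pixels per channel plane */

/* Reorder one image from planar layout (all R, then all G, then all B)
 * to interleaved HWC layout (R, G, B for pixel 0, then pixel 1, ...).
 * Both buffers hold IMG_PIXELS * IMG_CH bytes and must not overlap. */
void cifar_planar_to_hwc(const uint8_t *planar, uint8_t *hwc)
{
    assert(planar != NULL && hwc != NULL && planar != hwc);

    for (size_t pixel = 0; pixel < IMG_PIXELS; pixel++) {
        for (size_t ch = 0; ch < IMG_CH; ch++) {
            hwc[pixel * IMG_CH + ch] = planar[ch * IMG_PIXELS + pixel];
        }
    }
}
```

When reading "test_batch.bin", each record starts with one label byte followed by the 3072 image bytes, so the planar pointer passed in should point just past the label.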
Hi Vikas Chandra, I am very glad to see NN running on Cortex-M MCUs. I tried the example in the Keil simulator, and it works wonderfully.
My question is:
How can I test the NN with a new image?
I explored the code and found that the input image is defined as follows:
q7_t input_data[CONV1_IM_CH * CONV1_IM_DIM * CONV1_IM_DIM] = IMG_DATA;
IMG_DATA is defined in a .h file as an array of 32*32*3 = 3072 int8 values.
How can I convert an image (32 * 32 * 3) to this array? Which tool or algorithm should I use, so that I can write a Python script to convert the image?
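As a rough sketch of the per-value conversion, the function here maps an unsigned 8-bit pixel to a signed 8-bit value by subtracting a mean, applying a rounding right shift, and saturating to the int8 range. The function name, the mean and the shift are hypothetical parameters. The values actually required depend on how the network was trained and quantized, which this post does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one 8-bit pixel to a signed 8-bit (q7-style) value.
 * `mean` and `right_shift` are assumptions of this sketch and must match
 * the preprocessing used when the network was trained and quantized. */
int8_t pixel_to_q7(uint8_t pixel, int mean, unsigned right_shift)
{
    assert(right_shift < 16);

    int diff = (int)pixel - mean;
    int magnitude = diff < 0 ? -diff : diff;

    /* Round to nearest on the magnitude so the result is symmetric
     * and does not rely on right-shifting a negative value. */
    if (right_shift > 0) {
        magnitude = (magnitude + (1 << (right_shift - 1))) >> right_shift;
    }

    int value = diff < 0 ? -magnitude : magnitude;

    /* Saturate to the int8 range [-128, 127]. */
    if (value > 127) {
        value = 127;
    }
    if (value < -128) {
        value = -128;
    }
    return (int8_t)value;
}
```

Applying this to each of the 3072 bytes of an image produces an array that can be pasted into a header in the same way as IMG_DATA, provided the channel ordering also matches what the network expects.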
Our code significantly leverages the DSP/SIMD instructions in M4, M7 and M33. The CMSIS-NN library can still be compiled for M0, M3, M23, etc., but it will run slowly due to the lack of SIMD instructions.
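To give a sense of what the SIMD instructions contribute, the sketch here emulates in portable C a dual 16-bit multiply-accumulate of the kind provided by the SMLAD instruction: two 16-bit products are computed from packed 32-bit operands and added to an accumulator in one step. The function names are hypothetical, and the emulation does not model overflow behavior. On cores without the DSP extension, the same work takes several separate instructions.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (lo in bits 0..15). */
uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable emulation of a dual 16-bit multiply-accumulate:
 * acc + x.lo * y.lo + x.hi * y.hi.
 * The narrowing casts assume two's-complement wrap-around, which holds
 * on common compilers. Accumulator overflow is not modeled here. */
int32_t dual_mac_emulated(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(y >> 16);

    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```

For instance, with x holding (2, 3), y holding (4, 5) and an accumulator of 10, the result is 2*4 + 3*5 + 10 = 33.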
Hi Vikas Chandra,
As you mentioned, you have tested CMSIS-NN on an STMicroelectronics NUCLEO-F746ZG mbed board with an Arm Cortex-M7 core running at 216 MHz. The readme of the GitHub repository you posted (https://github.com/ARM-software/CMSIS_5) says it supports the Armv8-M Architecture (Mainline and Baseline) as well as Cortex-M23 and Cortex-M33 devices. Could you give me some accurate information on the hardware support for CMSIS-NN?