Deploying a Convolutional Neural Network on Cortex-M with CMSIS-NN


This blog is for embedded software developers who want to apply Machine Learning (ML) on Arm Cortex-M processors. We will show you how to deploy a trained Neural Network (NN) model (using Caffe as an example) on these constrained platforms with the Arm CMSIS-NN software library. We will walk through the following steps:

  1. Basic concept of NN
  2. CMSIS-NN introduction
  3. CIFAR-10
  4. Quantization
  5. Embedded code implementation
  6. Summary

Basic concept of NN

ML is moving to the edge. People want edge computing capability on embedded devices to provide more advanced services, like voice recognition for smart speakers and face detection for surveillance cameras. Convolutional Neural Networks (CNNs) are one of the main ways to do image recognition and image classification. CNNs use a variation of the multilayer perceptron that requires minimal pre-processing, thanks to their shared-weights architecture and translation invariance.

Input image

Above is an example that shows the original 256x256 image input on the left-hand side and how it progresses through each layer to calculate the probability on the right-hand side. I encourage you to read the beginner's guide to understanding convolutional neural networks for additional detail.

In this blog, we will focus on a popular CNN as an example of how to deploy on a Cortex-M platform.

CMSIS-NN introduction

The Arm Cortex-M processor family is a range of scalable, energy-efficient and easy-to-use processors that meet the needs of smart and connected embedded applications. One of the real benefits of Cortex-M is the software ecosystem. Cortex Microcontroller Software Interface Standard (CMSIS) is a vendor-independent hardware abstraction layer for the Cortex-M processor series and defines generic tool interfaces. CMSIS-DSP (Digital Signal Processing)* is an important component that provides a DSP library collection with more than 60 functions for various data types: fixed-point (fractional q7, q15, q31) and single precision floating-point (32-bit). The library is optimized for the SIMD instruction set, and programmers can focus on high-level algorithms and rely on the library for audio/image/communication, or any DSP-related low-level firmware implementation.

The embedded world is putting more and more intelligence into end devices, such as smart speakers and surveillance cameras. “Always-on” hardware can provide solutions on the edge without involving cloud services, which avoids concerns about the availability of an internet connection and about personal privacy. Building on the Cortex-M Digital Signal Processing (DSP) capabilities, the new CMSIS-NN software framework gives ML workloads about a 5x boost on the Cortex-M platform. If you'd like to know more about it, you can read this paper on CMSIS-NN for Arm Cortex-M CPUs.

Now, we will show you how easy it is to adapt your NN model for the Cortex-M platform.


CIFAR-10

To demonstrate how CMSIS-NN works, let’s start with today’s CNN example, CIFAR-10. The CIFAR-10 dataset (named after the Canadian Institute For Advanced Research) is a collection of images commonly used to train ML and computer vision algorithms. It is one of the most widely used datasets for ML research.

Neural Network Model Definition

Follow this tutorial, and you will easily be able to train a model on the dataset on your PC. Above is an example model from Caffe.

In this case, the neural network consists of three convolution layers, interspersed with ReLU activation and max pooling layers, followed by a fully-connected layer at the end. The input of the network is a 32x32 pixel color image, which will be classified into one of the 10 output classes.

Neural network convolution layers

Logically, you can imagine each layer as a software framework API. The programmer needs to pick the appropriate CMSIS-NN API, and apply suitable data and parameters. We will describe how to convert the parameters later.
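To make "suitable parameters" concrete, here is a minimal sketch that walks the image size through the three convolution and pooling stages. The 5x5 convolution kernel matches the model used in this post; the padding of 2, the stride of 1, the 3x3 pooling window with stride 2, and the Caffe-style round-up for pooling are assumptions borrowed from the usual Caffe CIFAR-10 example, so check them against your own model definition. The function names are hypothetical.

```python
import math

def conv_out_dim(in_dim, kernel, pad, stride):
    """Output width/height of a convolution (rounds down)."""
    return (in_dim + 2 * pad - kernel) // stride + 1

def pool_out_dim(in_dim, kernel, pad, stride):
    """Output width/height of a pooling layer (assumed to round up, as Caffe does)."""
    return math.ceil((in_dim + 2 * pad - kernel) / stride) + 1

dim = 32          # 32x32 input image
dims = []
for _ in range(3):
    dim = conv_out_dim(dim, kernel=5, pad=2, stride=1)   # convolution keeps the size
    dims.append(dim)
    dim = pool_out_dim(dim, kernel=3, pad=0, stride=2)   # pooling halves it
    dims.append(dim)

print(dims)  # [32, 16, 16, 8, 8, 4]
```

These are the values that end up in the dimension, padding and stride arguments of each CMSIS-NN call, so a mismatch here usually shows up as a wrong output size.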


Quantization

Typically, ML models are trained with floating-point data on GPUs or servers, but using the same precision on more constrained platforms, like embedded devices, is a challenge. Fortunately, several research papers have shown that quantizing the data into integers can usually be done with little or no loss of accuracy. Quantizing a model means quantizing both the weights and the activation data (the layer inputs and outputs), which helps it run faster and use less power.

In this guide, we quantize the floating-point weights and activation data to Qm.n format, in which m and n are fixed within a layer but can vary across different network layers. Here is a Python example that quantizes the weights in three steps:

  • Find weight min/max
  • Find the Qm.n format
  • Quantize the weight data

import numpy as np

# weight: NumPy array holding one layer's trained floating-point weights
min_wt = weight.min()
max_wt = weight.max()

# find number of integer bits to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))
frac_bits = 7 - int_bits  # remaining bits are fractional bits (1 bit for sign)

# floating-point weights are scaled, rounded and clipped to [-128,127], which are
# used in the fixed-point operations on the actual hardware (i.e., microcontroller)
quant_weight = np.clip(np.round(weight * (2.0**frac_bits)), -128, 127)

# To quantify the impact of quantized weights, scale them back to the
# original range to run inference using quantized weights
weight = quant_weight / (2.0**frac_bits)
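As a worked example of the three steps, the sketch below applies the same recipe to a small hand-made weight list using only the Python standard library. The helper name `quantize_q7` and the sample values are illustrative assumptions, not part of the original script.

```python
import math

def quantize_q7(weights):
    """Quantize floats to 8-bit Qm.n values; returns (int list, frac_bits)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0 for _ in weights], 7       # all-zero layer: any format works
    int_bits = math.ceil(math.log2(max_abs))  # integer bits needed for the range
    frac_bits = 7 - int_bits                  # one bit is reserved for the sign
    scale = 2.0 ** frac_bits
    quant = [max(-128, min(127, round(w * scale))) for w in weights]
    return quant, frac_bits

weights = [-0.40, 0.05, 0.12, 0.30]           # hypothetical layer weights
quant, frac_bits = quantize_q7(weights)
dequant = [q / 2.0 ** frac_bits for q in quant]
max_err = max(abs(w - d) for w, d in zip(weights, dequant))

print(quant, frac_bits)  # [-102, 13, 31, 77] 8
```

Because the largest magnitude here is below 0.5, no integer bit is needed and the format gains an extra fractional bit (8 instead of 7), which keeps the rounding error below half a quantization step.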

You can easily follow the same flow to quantize the activation data to Qm.n format. Each operand keeps its own Qm.n format, and the total computation flow will be:

  • Weight Qm.n * Activation Qm.n + Bias Qm.n -> Output Qm.n
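Because each operand can have its own number of fractional bits, the integer arithmetic needs two alignment shifts: the bias is shifted left to match the product of weight and activation, and the accumulator is shifted right to reach the output format. The sketch below shows this with plain Python integers. It is our reading of how the `*_BIAS_LSHIFT` and `*_OUT_RSHIFT` parameters in the `run_nn()` code later in this post are derived; the function name, the rounding term and the sample formats are assumptions.

```python
def q7_mac(inputs, weights, bias, in_frac, wt_frac, bias_frac, out_frac):
    """Integer-only dot product with bias, mimicking one q7 layer output."""
    # A product of two fixed-point values has in_frac + wt_frac fractional bits.
    bias_lshift = in_frac + wt_frac - bias_frac
    out_rshift = in_frac + wt_frac - out_frac
    if bias_lshift < 0 or out_rshift < 0:
        raise ValueError("formats would need negative shifts")
    acc = bias << bias_lshift            # align the bias with the products
    acc += (1 << out_rshift) >> 1        # assumed round-to-nearest term
    for x, w in zip(inputs, weights):
        acc += x * w
    acc >>= out_rshift                   # rescale to the output format
    out = max(-128, min(127, acc))       # saturate to 8 bits
    return out, bias_lshift, out_rshift

# Hypothetical formats: activations with 7, weights with 8, bias with 7
# and outputs with 5 fractional bits.
out, bias_lshift, out_rshift = q7_mac(
    inputs=[64, -32], weights=[64, 100], bias=16,
    in_frac=7, wt_frac=8, bias_frac=7, out_frac=5)

print(out, bias_lshift, out_rshift)  # 5 8 10
```

In floating point the same sum is 0.5 * 0.25 - 0.25 * 0.390625 + 0.125 = 0.1523, and the integer result 5 with 5 fractional bits represents 0.15625.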

Once the weights and activation data have been quantized, the next step is to export the data into a header file for compiling the embedded code.

#define CONV1_WT {-9,-1,2,6,-4,6,4,-11,8, ...}

#define CONV1_BIAS {-49,-18,-7,-20,-12,-15, ...}

#define CONV2_WT {-3,-9,-16,-14,8,-17, ...}

#define CONV2_BIAS {55,50,34,43,-37,35, ...}

#define CONV3_WT {15,10,3,1,-20,-11,5, ...}

#define CONV3_BIAS {18,36,-46,-45,64,8, ...}

#define IP1_WT {38,-13,5,-20,15,-4,-3, ...}

#define IP1_BIAS {30,-121,-51,77,40,20, ...}
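The export step can be sketched as a small formatter that turns each quantized array into one `#define` line like the ones above. The helper name `to_c_define` and the three-element arrays are hypothetical, and the sketch assumes the values are already flattened in the order the CMSIS-NN kernels expect.

```python
def to_c_define(name, values):
    """Format a flat list of q7 values as a one-line C #define initializer."""
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError("{}: {} does not fit in q7".format(name, v))
    body = ",".join(str(int(v)) for v in values)
    return "#define {} {{{}}}".format(name, body)

# Toy arrays for illustration only; real layers have thousands of values.
arrays = {
    "TOY_WT": [-9, -1, 2],
    "TOY_BIAS": [-49, -18, -7],
}
header_text = "\n\n".join(to_c_define(n, v) for n, v in arrays.items()) + "\n"

print(header_text)
```

Writing `header_text` to a `.h` file then gives the embedded project a header it can include directly.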

In this example, the model implementation needs 32.3 KB to store weights, 40 KB for activations and 3.1 KB for storing the im2col data.
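The activation and im2col figures can be reproduced with a little arithmetic. This is a sketch under stated assumptions: two ping-pong activation buffers sized for the largest convolution output (32 x 32 x 32) and the largest pooling output (16 x 16 x 32), one byte per q7 value, and an im2col buffer holding two columns of 16-bit values for a 5 x 5 kernel with 32 input channels. The sizing rule for the im2col buffer is our understanding of the CMSIS-NN convolution functions, not something stated in this post.

```python
def activation_bytes(dim, channels):
    """Bytes needed for a dim x dim x channels q7 tensor (1 byte per value)."""
    return dim * dim * channels

buffer1 = activation_bytes(32, 32)   # largest convolution output
buffer2 = activation_bytes(16, 32)   # largest pooling output
activation_total = buffer1 + buffer2

# Assumed im2col sizing: 2 columns * in_channels * kernel * kernel q15 values,
# each q15 value taking 2 bytes.
im2col_bytes = 2 * 32 * 5 * 5 * 2

print(activation_total / 1024)  # 40.0
print(im2col_bytes / 1024)      # 3.125
```

The 32768-byte offset between `buffer1` and `buffer2` in the `run_nn()` code later in this post matches the first buffer size computed here.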

You can learn more details from a similar Python script.


Embedded code implementation

Now that we have trained the network and quantized the weights and activations, it’s time to deploy our network.

If you're familiar with developing on Cortex-M microcontrollers, just pick your favorite development environment and codebase, and make sure you add the CMSIS-NN header file to your project. If you aren’t familiar with coding on Cortex-M, I suggest you read this guide: Getting Started with MDK.

Now, let’s map each network layer to the CMSIS-NN API.


Layer 1

Use arm_convolve_HWC_q7_RGB(), a dedicated API for an input tensor with 3 channels (RGB). The four groups of parameters are: input, filter kernel, bias and output. Make sure the kernel size, padding and stride match the trained model.


Layer 2

This layer uses two APIs, arm_maxpool_q7_HWC() and arm_relu_q7(). As with layer 1, check that the pooling kernel settings match the trained model.
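The order of the two calls does not affect the result: ReLU and max pooling are both monotonic, so pooling first and then clamping gives the same values as clamping first, while ReLU then runs on fewer elements. The sketch below checks this on a single row of q7 values. The 1-D pooling helper and the sample row are simplifications made up for illustration; they are not the CMSIS-NN implementation.

```python
def relu(values):
    """Clamp negative values to zero."""
    return [max(0, v) for v in values]

def maxpool_1d(values, kernel, stride):
    """Maximum over each window of `kernel` values, moving by `stride`."""
    return [max(values[i:i + kernel])
            for i in range(0, len(values) - kernel + 1, stride)]

row = [-9, 4, -3, -7, -1, -2, 6, 5]          # hypothetical q7 activations
pool_then_relu = relu(maxpool_1d(row, kernel=2, stride=2))
relu_then_pool = maxpool_1d(relu(row), kernel=2, stride=2)

print(pool_then_relu)  # [4, 0, 0, 6]
print(relu_then_pool)  # [4, 0, 0, 6]
```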


Layer 3

Use arm_convolve_HWC_q7_fast() when the number of input channels is a multiple of 4, as this makes good use of the SIMD32 read and swap behavior on 8-bit operations. This function also expects an even number of output channels.
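The choice between the convolution variants can be written down as a simple rule. The helper below is a hypothetical illustration that returns the name of a CMSIS-NN function for a given layer shape. It assumes that the fast variant needs an input channel count divisible by 4 and an even output channel count, and that arm_convolve_HWC_q7_basic() is the general fallback; check the CMSIS-NN documentation for the version you use.

```python
def pick_conv_api(in_channels, out_channels):
    """Suggest a q7 convolution function name for a square HWC layer."""
    if in_channels == 3:
        return "arm_convolve_HWC_q7_RGB"      # dedicated RGB input path
    if in_channels % 4 == 0 and out_channels % 2 == 0:
        return "arm_convolve_HWC_q7_fast"     # SIMD-friendly channel counts
    return "arm_convolve_HWC_q7_basic"        # no channel constraints

print(pick_conv_api(3, 32))    # arm_convolve_HWC_q7_RGB
print(pick_conv_api(32, 32))   # arm_convolve_HWC_q7_fast
print(pick_conv_api(5, 7))     # arm_convolve_HWC_q7_basic
```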


Layer 4

Use arm_avepool_q7_HWC(). Unlike max pooling, this function needs a non-zero bufferA scratch buffer.


Layer 5

Use the convolution API arm_convolve_HWC_q7_fast() again.


Layer 6

Use arm_avepool_q7_HWC() again.


Layer 7

Use the arm_fully_connected_q7_opt() API. This optimized function is designed to work with an interleaved weight matrix; check this article for details.

The final code will look like this example on GitHub.

void run_nn() {
	// Two ping-pong activation buffers carved out of one scratch area
	q7_t* buffer1 = scratch_buffer;
	q7_t* buffer2 = buffer1 + 32768;
	// Layers 1-2: convolution on the RGB input, max pooling, ReLU
	arm_convolve_HWC_q7_RGB(input_data, CONV1_IN_DIM, CONV1_IN_CH, conv1_wt, CONV1_OUT_CH, CONV1_KER_DIM, CONV1_PAD, CONV1_STRIDE, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT, buffer1, CONV1_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_maxpool_q7_HWC(buffer1, POOL1_IN_DIM, POOL1_IN_CH, POOL1_KER_DIM, POOL1_PAD, POOL1_STRIDE, POOL1_OUT_DIM, col_buffer, buffer2);
	arm_relu_q7(buffer2, RELU1_OUT_DIM*RELU1_OUT_DIM*RELU1_OUT_CH);
	// Layers 3-4: convolution, ReLU, average pooling
	arm_convolve_HWC_q7_fast(buffer2, CONV2_IN_DIM, CONV2_IN_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM, CONV2_PAD, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, buffer1, CONV2_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU2_OUT_DIM*RELU2_OUT_DIM*RELU2_OUT_CH);
	arm_avepool_q7_HWC(buffer1, POOL2_IN_DIM, POOL2_IN_CH, POOL2_KER_DIM, POOL2_PAD, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, buffer2);
	// Layers 5-6: convolution, ReLU, average pooling
	arm_convolve_HWC_q7_fast(buffer2, CONV3_IN_DIM, CONV3_IN_CH, conv3_wt, CONV3_OUT_CH, CONV3_KER_DIM, CONV3_PAD, CONV3_STRIDE, conv3_bias, CONV3_BIAS_LSHIFT, CONV3_OUT_RSHIFT, buffer1, CONV3_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU3_OUT_DIM*RELU3_OUT_DIM*RELU3_OUT_CH);
	arm_avepool_q7_HWC(buffer1, POOL3_IN_DIM, POOL3_IN_CH, POOL3_KER_DIM, POOL3_PAD, POOL3_STRIDE, POOL3_OUT_DIM, col_buffer, buffer2);
	// Layer 7: the fully-connected layer writes the class scores to output_data
	arm_fully_connected_q7_opt(buffer2, ip1_wt, IP1_IN_DIM, IP1_OUT_DIM, IP1_BIAS_LSHIFT, IP1_OUT_RSHIFT, ip1_bias, output_data, (q15_t*)col_buffer);
}

Now you can add your camera hardware or other image driver to your code. Here is a live demo on an STMicroelectronics Cortex-M7 board.
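Once inference runs on live images, the application still has to turn the ten values in output_data into a label. A minimal sketch of that last step is shown below: it picks the index of the largest score and maps it to the standard CIFAR-10 class order. The score values are made up for illustration, and `top_class` is a hypothetical helper name.

```python
CIFAR10_LABELS = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]

def top_class(scores):
    """Return (index, label) of the highest q7 score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, CIFAR10_LABELS[best]

scores = [-12, 3, 40, 97, 5, 61, -8, 0, -30, 11]   # hypothetical output_data
index, label = top_class(scores)

print(index, label)  # 3 cat
```

On the microcontroller the same selection is a short loop over the output array in C.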

The table below breaks down the operation count of each layer at run-time. The convolution layers consume 97% of the computing resource.

Layer   | Type            | Filter Shape    | Output Shape      | Ops (k)
Layer 1 | Convolution     | 5 x 5 x 3 x 32  | 32 x 32 x 32      | 4900
Layer 2 | Max Pooling     | N/A             | 16 x 16 x 32      | 73.7
Layer 3 | Convolution     | 5 x 5 x 32 x 32 | 16 x 16 x 32      | 13100
Layer 4 | Max Pooling     | N/A             | 8 x 8 x 32        | 18.4
Layer 5 | Convolution     | 5 x 5 x 32 x 64 | 8 x 8 x 64        | 6600
Layer 6 | Max Pooling     | N/A             | 4 x 4 x 64        | 9.2
Layer 7 | Fully Connected | 4 x 4 x 64 x 10 | 10                | 20
Total   |                 | 87 KB weights   | 55 KB activations | 24700
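The operation counts follow from the layer shapes. The sketch below reproduces the total of about 24700k by counting two operations (one multiply and one add) per weight per output position for the convolution and fully-connected layers, and one operation per window element for pooling. The 3 x 3 pooling window is an assumption taken from the usual Caffe CIFAR-10 example, and the function names are hypothetical.

```python
def conv_ops(kernel, in_ch, out_ch, out_dim):
    """Multiply and add for every weight at every output position."""
    return 2 * kernel * kernel * in_ch * out_ch * out_dim * out_dim

def pool_ops(kernel, out_dim, channels):
    """One operation per window element for every output value."""
    return kernel * kernel * out_dim * out_dim * channels

ops = {
    "conv1": conv_ops(5, 3, 32, 32),
    "pool1": pool_ops(3, 16, 32),
    "conv2": conv_ops(5, 32, 32, 16),
    "pool2": pool_ops(3, 8, 32),
    "conv3": conv_ops(5, 32, 64, 8),
    "pool3": pool_ops(3, 4, 64),
    "ip1": 2 * 4 * 4 * 64 * 10,      # fully-connected layer
}
total_k = sum(ops.values()) / 1000

print(ops["conv1"], ops["conv2"], ops["conv3"])  # 4915200 13107200 6553600
print(round(total_k))                            # 24698
```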


Summary

CNN inference gets about a 5x boost on the Cortex-M platform using the CMSIS-NN software framework. Please refer to the Arm Developer link below for more information on Arm ML solutions, and don’t hesitate to comment below if you have any further questions.

Read about Arm ML solutions

*: The library is available for all Cortex-M cores. Implementations optimized for the SIMD instruction set are available for Arm Cortex-M4, Cortex-M7, and Cortex-M33.