Deploying a Convolutional Neural Network on Cortex-M with CMSIS-NN

July 24, 2018

7 minute read time.

Overview

This blog is for embedded software developers who want to apply Machine Learning (ML) on Arm Cortex-M. We will show you how to deploy a trained Neural Network (NN) model (using Caffe as an example) on those constrained platforms with the Arm CMSIS-NN software library. We will walk through the following steps:

Basic concept of NN
CMSIS-NN introduction
CIFAR-10
Quantization
Embedded code implementation
Summary

Basic concept of NN

ML is moving to the edge. People want edge computing capability on embedded devices to provide more advanced services, like voice recognition for smart speakers and face detection for surveillance cameras. Convolutional Neural Networks (CNNs) are one of the main approaches to image recognition and image classification. CNNs use a variation of the multilayer perceptron that requires minimal pre-processing, thanks to their shared-weights architecture and translation invariance.
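To make the shared-weights idea concrete, here is a minimal NumPy sketch (not part of CMSIS-NN). The helper name `conv2d_valid`, the 2x2 kernel and the 6x6 image are illustrative assumptions. It slides one small kernel over an image and shows that moving the input feature simply moves the response, so the peak response is unchanged.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1).

    Like most NN frameworks, this is technically a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

kernel = np.array([[1.0, 2.0], [3.0, 4.0]])  # the same 4 weights are reused at every position
image = np.zeros((6, 6))
image[1, 1] = 1.0                            # a single bright pixel
shifted = np.roll(image, shift=(2, 2), axis=(0, 1))  # the same pixel moved to (3, 3)

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# The response pattern moves with the input; its peak value stays the same.
print(response.max(), response_shifted.max())
```

A pooling stage that keeps only the strongest response then gives the same result for both inputs, which is the intuition behind translation invariance.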

Input image

Above is an example that shows the original 256x256 image input on the left-hand side and how it progresses through each layer to calculate the probability on the right-hand side. I encourage you to read the beginner's guide to understanding convolutional neural networks for additional detail.

In this blog, we will focus on a popular CNN as an example of how to deploy on a Cortex-M platform.

CMSIS-NN introduction

The Arm Cortex-M processor family is a range of scalable, energy-efficient and easy-to-use processors that meet the needs of smart and connected embedded applications. One of the real benefits of Cortex-M is the software ecosystem. Cortex Microcontroller Software Interface Standard (CMSIS) is a vendor-independent hardware abstraction layer for the Cortex-M processor series and defines generic tool interfaces. CMSIS-DSP (Digital Signal Processing)* is an important component that provides a DSP library collection with more than 60 functions for various data types: fixed-point (fractional q7, q15, q31) and single precision floating-point (32-bit). The library is optimized for the SIMD instruction set, and programmers can focus on high-level algorithms and rely on the library for audio/image/communication, or any DSP-related low-level firmware implementation.

The embedded world is putting more and more intelligence into end devices, such as smart speakers and surveillance cameras. “Always-on” hardware can provide solutions on the edge without the involvement of cloud services, avoiding concerns about the availability of an internet connection and about personal privacy. Building on the Cortex-M Digital Signal Processing (DSP) capabilities, the new CMSIS-NN software framework gives NN inference a proven boost of about 5x on the Cortex-M platform. If you'd like to know more about it, you can read this paper on CMSIS-NN for Arm Cortex-M CPUs.

Now, we will show you how easy it is to adapt your NN model for the Cortex-M platform.

CIFAR-10

To demonstrate how CMSIS-NN works, let’s start with today’s CNN example, CIFAR-10. The CIFAR-10 dataset (named after the Canadian Institute For Advanced Research) is a collection of images commonly used to train ML and computer vision algorithms. It is one of the most widely used datasets for ML research.

Neural Network Model Definition

Follow this tutorial, and you will easily be able to train a model on the dataset on your PC. Above is an example model from Caffe.

In this case, the neural network consists of three convolution layers, interspersed with ReLU activation and max pooling layers, followed by a fully-connected layer at the end. The input of the network is a 32x32 pixel color image, which will be classified into one of the 10 output classes.
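As a rough sketch of how the tensor shapes evolve through such a network, the snippet below applies the usual output-size formulas. The hyperparameters are assumptions made for illustration: 5x5 convolutions with padding 2 and stride 1, 3x3 pooling with stride 2, and 32, 32 and 64 output channels (the channel counts match the operation table later in this post). Check them against your own trained model.

```python
import math

def conv_out_dim(in_dim, kernel, pad, stride):
    """Spatial output size of a convolution (floor rounding, as in Caffe)."""
    return (in_dim + 2 * pad - kernel) // stride + 1

def pool_out_dim(in_dim, kernel, pad, stride):
    """Spatial output size of a pooling layer (ceil rounding, as in Caffe)."""
    return math.ceil((in_dim + 2 * pad - kernel) / stride) + 1

# Assumed hyperparameters: 5x5 convolutions (pad 2, stride 1),
# 3x3 pooling (pad 0, stride 2).
dim, channels = 32, 3  # 32x32 RGB input
shapes = []
for out_channels in (32, 32, 64):
    dim = conv_out_dim(dim, kernel=5, pad=2, stride=1)
    dim = pool_out_dim(dim, kernel=3, pad=0, stride=2)
    channels = out_channels
    shapes.append((dim, dim, channels))

fc_inputs = dim * dim * channels  # flattened input of the fully-connected layer
print(shapes, fc_inputs)
```

Under these assumptions the feature maps shrink from 32x32 to 16x16, 8x8 and 4x4, and the fully-connected layer sees 1024 inputs.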

Neural network convolution layers

Conceptually, each layer maps to one API call in the software framework. The programmer picks the appropriate CMSIS-NN API and supplies suitable data and parameters. We will describe how to convert the parameters later.

Quantization

Typically, ML models are trained with floating-point data on GPUs or servers, but using similar precision on more constrained platforms, like embedded devices, is more of a challenge. Fortunately, several research papers have shown that quantizing the data into integers can usually be done with little or no loss of accuracy. Quantizing a model means quantizing both the weights and the activation data (the layer inputs/outputs), which helps the model run faster and use less power.

In this guide, we quantize the floating-point weights/activation data to Qm.n format, in which m and n are fixed within a layer but can vary across different network layers. Here is a Python example that quantizes the weights in three steps:

Find weight min/max
Find the Qm.n format
Quantize the weight data

import numpy as np

# weight: NumPy array holding one layer's floating-point weights
min_wt = weight.min()
max_wt = weight.max()

# find the number of integer bits needed to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))
frac_bits = 7 - int_bits  # remaining bits are fractional bits (1 bit for sign)

# floating-point weights are scaled, rounded and saturated to [-128, 127],
# the values used in the fixed-point operations on the actual hardware
# (i.e., the microcontroller)
quant_weight = np.clip(np.round(weight * (2.0 ** frac_bits)), -128, 127)

# To quantify the impact of quantized weights, scale them back to the
# original range and run inference using the quantized weights
weight = quant_weight / (2.0 ** frac_bits)
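As a small worked example of the three steps, the sketch below wraps them in a helper and runs it on three made-up weights. The function name `quantize_q7` and the sample values are assumptions for illustration, and the input is assumed to contain at least one non-zero value.

```python
import numpy as np

def quantize_q7(values):
    """Quantize a float array to 8-bit Qm.n; returns (int8 array, frac_bits)."""
    max_abs = np.max(np.abs(values))           # step 1: find the range
    int_bits = int(np.ceil(np.log2(max_abs)))  # step 2: find the Qm.n format
    frac_bits = 7 - int_bits
    # step 3: scale, round and saturate to the int8 range
    q = np.clip(np.round(values * 2.0 ** frac_bits), -128, 127)
    return q.astype(np.int8), frac_bits

weights = np.array([-0.62, 0.10, 0.31])  # made-up sample weights
q_weights, frac_bits = quantize_q7(weights)

# Scale back to floating point to see how much precision was lost.
restored = q_weights / 2.0 ** frac_bits
max_error = np.max(np.abs(restored - weights))
print(q_weights, frac_bits, max_error)
```

Here every weight has magnitude below 1, so no integer bits are needed and all 7 remaining bits go to the fraction. The rounding error per weight is then at most half of one quantization step.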

You can follow the same flow to quantize the activation data to Qm.n format. The total computation flow then becomes:

Weight Qm.n * Activation Qm.n + Bias Qm.n -> Output Qm.n
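Because each tensor can have its own number of fractional bits, the bias and the accumulator have to be shifted so that everything lines up. The sketch below shows that arithmetic in plain Python. The formats chosen (7, 5, 7 and 4 fractional bits), the helper name `q7_dot` and the sample values are all assumptions for illustration, and rounding before the right shift is left out for brevity.

```python
import numpy as np

def q7_dot(weights_q, acts_q, bias_q, bias_lshift, out_rshift):
    """Integer dot product plus bias, rescaled and saturated to int8."""
    acc = int(bias_q) << bias_lshift  # bring the bias to the product's scale
    acc += int(np.dot(weights_q.astype(np.int32), acts_q.astype(np.int32)))
    acc >>= out_rshift                # rescale to the output format (no rounding here)
    return max(-128, min(127, acc))   # saturate to the q7 range

# Hypothetical number of fractional bits for each tensor.
wt_frac, act_frac, bias_frac, out_frac = 7, 5, 7, 4
bias_lshift = wt_frac + act_frac - bias_frac  # products carry wt_frac + act_frac bits
out_rshift = wt_frac + act_frac - out_frac

weights_q = np.array([64, -32], dtype=np.int8)  # 0.5 and -0.25 with 7 fractional bits
acts_q = np.array([32, 32], dtype=np.int8)      # 1.0 and 1.0 with 5 fractional bits
bias_q = 16                                     # 0.125 with 7 fractional bits

result_q = q7_dot(weights_q, acts_q, bias_q, bias_lshift, out_rshift)
result = result_q / 2 ** out_frac
print(bias_lshift, out_rshift, result_q, result)
```

The floating-point equivalent is 0.5 * 1.0 - 0.25 * 1.0 + 0.125 = 0.375, which the integer path reproduces exactly in this case. The two shift values play the same role as the `*_BIAS_LSHIFT` and `*_OUT_RSHIFT` parameters that appear in the embedded code later in this post.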

Once the weights/activation data have been quantized, the next step is to export the data into a header file that is compiled with the embedded code.

#define CONV1_WT {-9,-1,2,6,-4,6,4,-11,8, ...}

#define CONV1_BIAS {-49,-18,-7,-20,-12,-15, ...}

#define CONV2_WT {-3,-9,-16,-14,8,-17, ...}

#define CONV2_BIAS {55,50,34,43,-37,35, ...}

#define CONV3_WT {15,10,3,1,-20,-11,5, ...}

#define CONV3_BIAS {18,36,-46,-45,64,8, ...}

#define IP1_WT {38,-13,5,-20,15,-4,-3, ...}

#define IP1_BIAS {30,-121,-51,77,40,20, ...}
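A minimal sketch of how such a line could be generated from a quantized NumPy array is shown below. The helper name `to_c_define` is hypothetical, and the three bias values are just the first entries of the example above. A real export script also has to reorder the weights into the memory layout the kernels expect, which this sketch does not attempt.

```python
import numpy as np

def to_c_define(name, values):
    """Format an integer array as a one-line C macro initializer."""
    flat = np.asarray(values).flatten()
    body = ",".join(str(int(v)) for v in flat)
    return "#define {} {{{}}}".format(name, body)

conv1_bias = np.array([-49, -18, -7], dtype=np.int8)  # truncated sample data
line = to_c_define("CONV1_BIAS", conv1_bias)
print(line)  # this string could be written to a header file with open(path, "w")
```

The macro can then be used as an array initializer in the embedded code, for example for a `q7_t` bias array.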

In this example, the model implementation needs 32.3 KB to store weights, 40 KB for activations and 3.1 KB for storing the im2col data.
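The im2col figure can be sanity-checked with a short calculation. The sketch below assumes that the convolution kernels expand two columns at a time into a scratch buffer of 16-bit values, which is our reading of the CMSIS-NN buffer requirements, and that the largest convolution has 32 input channels and a 5x5 kernel. The helper name is hypothetical.

```python
def im2col_buffer_bytes(in_channels, kernel_dim, bytes_per_element=2):
    """Scratch size for two im2col columns of 16-bit values."""
    return 2 * in_channels * kernel_dim * kernel_dim * bytes_per_element

size = im2col_buffer_bytes(in_channels=32, kernel_dim=5)
size_kb = size / 1024
print(size, size_kb)
```

Under these assumptions the buffer takes 3200 bytes, which is consistent with the roughly 3.1 KB quoted above.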

You can learn more details from a similar Python script.

Implementation

Now that we have trained the network and quantized the weights/activations, it’s time to deploy our network.

If you're familiar with developing on Cortex-M microcontrollers, just pick your favorite development environment and codebase, and make sure you add the CMSIS-NN header file to your project. If you aren’t familiar with coding on Cortex-M, I suggest you read this guide: Getting Started with MDK.

Now, let’s map each network layer to the corresponding CMSIS-NN API.

Layer 1

Use arm_convolve_HWC_q7_RGB(), a dedicated API for inputs with three channels (RGB). The parameters fall into four groups: input, filter kernel, bias and output. Please make sure the kernel size, padding and stride match those of the trained model.

Layer 2

This layer uses two APIs, arm_maxpool_q7_HWC() and arm_relu_q7(). As with layer 1, please check that the pooling kernel settings match the trained model.

Layer 3

Use arm_convolve_HWC_q7_fast() when the number of input channels is a multiple of 4, as this makes good use of the SIMD32 read and swap behavior on 8-bit operations.
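The choice between the convolution variants can be written down as a small decision helper. This is only a sketch: the function `pick_conv_kernel` is hypothetical, and the extra condition on the output channels (a multiple of 2) and the fallback to arm_convolve_HWC_q7_basic() reflect our understanding of the CMSIS-NN documentation, so please verify them against the version of the library you use.

```python
def pick_conv_kernel(in_channels, out_channels):
    """Suggest a CMSIS-NN q7 convolution function for a layer (sketch)."""
    if in_channels == 3:
        return "arm_convolve_HWC_q7_RGB"    # dedicated 3-channel input version
    if in_channels % 4 == 0 and out_channels % 2 == 0:
        return "arm_convolve_HWC_q7_fast"   # SIMD-friendly channel counts
    return "arm_convolve_HWC_q7_basic"      # assumed general-purpose fallback

# (input channels, output channels) of the three convolution layers
layers = [(3, 32), (32, 32), (32, 64)]
choices = [pick_conv_kernel(cin, cout) for cin, cout in layers]
print(choices)
```

For the network in this post, the first convolution gets the RGB version and the other two qualify for the fast version.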

Layer 4

Use arm_avepool_q7_HWC() here. This function needs a scratch buffer (bufferA) of non-zero size.

Layer 5

Use the convolution API arm_convolve_HWC_q7_fast() again.

Layer 6

Use arm_avepool_q7_HWC() again.

Layer 7

Use the arm_fully_connected_q7_opt() API. This optimized function is designed to work with an interleaved weight matrix; check this article for details.

The final code will look like this example on GitHub.

void run_nn() {
	// Two ping-pong activation buffers carved out of one scratch area
	q7_t* buffer1 = scratch_buffer;
	q7_t* buffer2 = buffer1 + 32768;
	// Layer 1: convolution on the RGB input
	arm_convolve_HWC_q7_RGB(input_data, CONV1_IN_DIM, CONV1_IN_CH, conv1_wt, CONV1_OUT_CH, CONV1_KER_DIM, CONV1_PAD, CONV1_STRIDE, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT, buffer1, CONV1_OUT_DIM, (q15_t*)col_buffer, NULL);
	// Layer 2: max pooling followed by ReLU
	arm_maxpool_q7_HWC(buffer1, POOL1_IN_DIM, POOL1_IN_CH, POOL1_KER_DIM, POOL1_PAD, POOL1_STRIDE, POOL1_OUT_DIM, col_buffer, buffer2);
	arm_relu_q7(buffer2, RELU1_OUT_DIM*RELU1_OUT_DIM*RELU1_OUT_CH);
	// Layer 3: convolution followed by ReLU
	arm_convolve_HWC_q7_fast(buffer2, CONV2_IN_DIM, CONV2_IN_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM, CONV2_PAD, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, buffer1, CONV2_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU2_OUT_DIM*RELU2_OUT_DIM*RELU2_OUT_CH);
	// Layer 4: average pooling
	arm_avepool_q7_HWC(buffer1, POOL2_IN_DIM, POOL2_IN_CH, POOL2_KER_DIM, POOL2_PAD, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, buffer2);
	// Layer 5: convolution followed by ReLU
	arm_convolve_HWC_q7_fast(buffer2, CONV3_IN_DIM, CONV3_IN_CH, conv3_wt, CONV3_OUT_CH, CONV3_KER_DIM, CONV3_PAD, CONV3_STRIDE, conv3_bias, CONV3_BIAS_LSHIFT, CONV3_OUT_RSHIFT, buffer1, CONV3_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU3_OUT_DIM*RELU3_OUT_DIM*RELU3_OUT_CH);
	// Layer 6: average pooling
	arm_avepool_q7_HWC(buffer1, POOL3_IN_DIM, POOL3_IN_CH, POOL3_KER_DIM, POOL3_PAD, POOL3_STRIDE, POOL3_OUT_DIM, col_buffer, buffer2);
	// Layer 7: fully-connected layer producing the class scores
	arm_fully_connected_q7_opt(buffer2, ip1_wt, IP1_IN_DIM, IP1_OUT_DIM, IP1_BIAS_LSHIFT, IP1_OUT_RSHIFT, ip1_bias, output_data, (q15_t*)col_buffer);
}
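The code above alternates between two activation buffers, so the scratch area only has to hold the largest tensor written to each of them. The sketch below estimates that size in Python. The output shapes are assumptions taken from the layer table in this post, and each q7_t element is one byte.

```python
def elements(shape):
    """Number of elements in an (H, W, C) tensor."""
    h, w, c = shape
    return h * w * c

# Output shapes (H, W, C), assumed from the layer table in this post.
conv_outputs = [(32, 32, 32), (16, 16, 32), (8, 8, 64)]  # written to buffer1
pool_outputs = [(16, 16, 32), (8, 8, 32), (4, 4, 64)]    # written to buffer2

buffer1_bytes = max(elements(s) for s in conv_outputs)  # one byte per q7_t
buffer2_bytes = max(elements(s) for s in pool_outputs)
scratch_bytes = buffer1_bytes + buffer2_bytes
print(buffer1_bytes, buffer2_bytes, scratch_bytes / 1024)
```

With these shapes, buffer1 needs 32768 bytes, which explains the offset used for buffer2 in run_nn(), and the total of 40 KB lines up with the activation memory quoted earlier.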

Now you can add your camera hardware or other image driver to your code. Here is a real demo on an STMicroelectronics Cortex-M7 board.

The table below breaks down the operation count of each layer at run time. The convolution layers clearly dominate, accounting for more than 97% of the operations.

Layer	Layer Type	Filter Shape	Output Shape	Ops (k)
Layer 1	Convolution	5 x 5 x 3 x 32	32 x 32 x 32	4900
Layer 2	Max Pooling	N/A	16 x 16 x 32	73.7
Layer 3	Convolution	5 x 5 x 32 x 32	16 x 16 x 32	13100
Layer 4	Max Pooling	N/A	8 x 8 x 32	18.4
Layer 5	Convolution	5 x 5 x 32 x 64	8 x 8 x 64	6600
Layer 6	Max Pooling	N/A	4 x 4 x 64	9.2
Layer 7	Fully Connected	4 x 4 x 64 x 10	10	20
Total		87 KB weights	55 KB activations	24700
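The operation counts can be approximated from the layer shapes alone. The sketch below assumes two operations per multiply-accumulate for convolution and fully-connected layers, and one operation per element of an assumed 3x3 pooling window. The helper names are hypothetical.

```python
def conv_ops(kernel, in_ch, out_ch, out_dim):
    """Two operations (multiply and add) per multiply-accumulate."""
    return 2 * kernel * kernel * in_ch * out_ch * out_dim * out_dim

def pool_ops(window, channels, out_dim):
    """One operation per input element visited in each pooling window."""
    return window * window * channels * out_dim * out_dim

def fc_ops(inputs, outputs):
    """Two operations per weight of the fully-connected layer."""
    return 2 * inputs * outputs

ops = {
    "conv1": conv_ops(5, 3, 32, 32),
    "pool1": pool_ops(3, 32, 16),
    "conv2": conv_ops(5, 32, 32, 16),
    "pool2": pool_ops(3, 32, 8),
    "conv3": conv_ops(5, 32, 64, 8),
    "pool3": pool_ops(3, 64, 4),
    "ip1": fc_ops(4 * 4 * 64, 10),
}
total = sum(ops.values())
conv_share = sum(v for k, v in ops.items() if k.startswith("conv")) / total

for name, count in ops.items():
    print("{:6s} {:10.1f} k".format(name, count / 1000))
print("total  {:10.1f} k, convolution share {:.1%}".format(total / 1000, conv_share))
```

Under these assumptions the total comes to about 24.7 million operations, and the three convolution layers make up almost all of it.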

Summary

With the CMSIS-NN software framework, Convolutional Neural Network inference gets a proven boost of about 5x on the Cortex-M platform. Please refer to the Arm Developer link below for more information on Arm ML solutions, and don’t hesitate to comment below if you have any further questions.

Read about Arm ML solutions

*: The library is available for all Cortex-M cores. Implementations optimized for the SIMD instruction set are available for Arm Cortex-M4, Cortex-M7, and Cortex-M33.

bob_tyson over 3 years ago

I am trying to run a basic convolution with more than 2 channels, but the outputs do not match my hand calculation. Can someone please help?

My post explains the problem: Failed to match conv layer outputs with hand calculation

Thanks in advance.
vaibhav541 over 5 years ago

Can we train our network too using this?
William UK over 5 years ago

Dear odinlmshen, Good day. We are planning to implement image classification in an actual industrial application. Is there any possibility of connecting a better camera with higher resolution to the M7 board for better image capture results? BR. William Ho
Hongchuang over 5 years ago

Hello,

Can you help me with the problem? Thanks!

https://github.com/ARM-software/ML-examples/issues/30
rafakath over 5 years ago

hi,
While converting .pkl file into header file by running script code_gen.py with MNIST dataset I am getting below error.

Traceback (most recent call last):
File "code_gen.py", line 362, in <module>
generate_parameters(my_model, cmd_args.out_dir+'/parameter.h')
File "code_gen.py", line 94, in generate_parameters
f.write("#define "+layer.upper()+"_OUT_DIM "+str(caffe_model.layer_shape[layer][2])+"\n\n")
IndexError: tuple index out of range

I am new to Caffe, kindly give any suggestion.
