Deploying a Convolutional Neural Network on Cortex-M with CMSIS-NN

odinlmshen
July 24, 2018

Overview

This blog is for embedded software developers who want to apply Machine Learning (ML) on Arm Cortex-M. We will show you how to deploy a trained Neural Network (NN) model (using Caffe as an example) on these constrained platforms with the Arm CMSIS-NN software library. We will walk through the following steps:

  1. Basic concept of NN
  2. CMSIS-NN introduction
  3. CIFAR-10
  4. Quantization
  5. Embedded code implementation
  6. Summary

Basic concept of NN

ML is moving to the edge. People want edge computing capability on embedded devices to provide more advanced services, like voice recognition for smart speakers and face detection for surveillance cameras. Convolutional Neural Networks (CNNs) are one of the main approaches to image recognition and image classification. CNNs use a variation of multilayer perceptrons that requires minimal pre-processing, thanks to their shared-weights architecture and translation-invariance characteristics.

[Figure: a 256x256 input image progressing through the CNN layers]

Above is an example showing the original 256x256 image input on the left-hand side and how it progresses through each layer to produce the class probabilities on the right-hand side. I encourage you to read the beginner's guide to understanding convolutional neural networks for additional detail.

In this blog, we will use a popular CNN as an example of how to deploy a network on a Cortex-M platform.

CMSIS-NN introduction

The Arm Cortex-M processor family is a range of scalable, energy-efficient and easy-to-use processors that meet the needs of smart and connected embedded applications. One of the real benefits of Cortex-M is its software ecosystem. The Cortex Microcontroller Software Interface Standard (CMSIS) is a vendor-independent hardware abstraction layer for the Cortex-M processor series that also defines generic tool interfaces. CMSIS-DSP (Digital Signal Processing)* is an important component: it provides a DSP library with more than 60 functions for various data types: fixed-point (fractional q7, q15, q31) and single-precision floating-point (32-bit). The library is optimized for the SIMD instruction set, so programmers can focus on high-level algorithms and rely on the library for audio/image/communication or any other DSP-related low-level firmware implementation.
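For a flavor of the library, here is a minimal sketch, assuming CMSIS-DSP is on the include path; arm_dot_prod_q7() is a real CMSIS-DSP function, while the vector contents are arbitrary illustration values:

#include "arm_math.h"   /* CMSIS-DSP types and function prototypes */

/* Two small q7 (8-bit fixed-point) vectors; the values are arbitrary. */
static const q7_t srcA[8] = { 12, -34, 56, 7, -8, 90, 21, -3 };
static const q7_t srcB[8] = { 4, 25, -6, 17, 88, -9, 10, 2 };

void dot_product_example(void)
{
    q31_t result;   /* wide accumulator to avoid overflow */
    arm_dot_prod_q7(srcA, srcB, 8, &result);
    /* result now holds the accumulated dot product in a wider Q format */
}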

The embedded world is putting more and more intelligence into end devices, such as smart speakers and surveillance cameras. "Always-on" hardware can provide solutions on the edge without the involvement of cloud services - avoiding concerns about the availability of an internet connection and about personal privacy. Building on the Cortex-M Digital Signal Processing (DSP) capabilities, ML workloads have shown a roughly 5x performance boost on the Cortex-M platform with the new CMSIS-NN software framework. If you'd like to know more, you can read this paper on CMSIS-NN for Arm Cortex-M CPUs.

Now, we will show you how easy it is to adapt your NN model for the Cortex-M platform.

CIFAR-10

To demonstrate how CMSIS-NN works, let's start with a classic CNN example: CIFAR-10. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train ML and computer-vision algorithms, and it is one of the most widely used datasets for ML research.

[Figure: neural network model definition in Caffe]

Follow this tutorial and you will easily be able to train a model on this dataset on your PC. Above is an example model definition from Caffe.

In this case, the neural network consists of three convolution layers, interspersed with ReLU activation and max pooling layers, followed by a fully-connected layer at the end. The input of the network is a 32x32 pixel color image, which will be classified into one of the 10 output classes.
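As a quick sanity check on those shapes, the standard output-size formulas can be evaluated in a few lines of C; the 5x5/pad-2/stride-1 convolution and 3x3/stride-2 pooling parameters are assumptions taken from the Caffe cifar10_quick model this tutorial is based on:

#include <stdio.h>

/* convolution: out = (in + 2*pad - kernel) / stride + 1 */
static int conv_out_dim(int in, int kernel, int pad, int stride)
{
    return (in + 2 * pad - kernel) / stride + 1;
}

/* Caffe-style pooling rounds up (ceil mode) */
static int pool_out_dim(int in, int kernel, int pad, int stride)
{
    return (in + 2 * pad - kernel + stride - 1) / stride + 1;
}

int main(void)
{
    printf("conv1: %d\n", conv_out_dim(32, 5, 2, 1));  /* 32 -> 32 */
    printf("pool1: %d\n", pool_out_dim(32, 3, 0, 2));  /* 32 -> 16 */
    return 0;
}

These match the 32x32x32 and 16x16x32 shapes in the operation-count table at the end of this post.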

[Figure: the network's convolution, pooling, and fully-connected layers]

Logically, you can think of each layer as a software framework API: the programmer picks the appropriate CMSIS-NN API and supplies suitable data and parameters. We will describe how to convert the parameters later.

Quantization

Typically, ML models are trained with floating-point data on GPUs or servers, but using the same precision on more constrained platforms, like embedded devices, is a challenge. Fortunately, several research papers have shown that the data can usually be quantized to integers without significant loss of accuracy. Quantizing a model means quantizing both the weights and the activation data (the layer inputs/outputs), which helps it run faster and use less power.

In this guide, we quantize the floating-point weights/activation data to Qm.n format, in which m and n are fixed within a layer but can vary across network layers. Here is a Python example that performs the quantization of weights in three steps:

  • Find the weight min/max
  • Find the Qm.n format
  • Quantize the weight data

import numpy as np

min_wt = weight.min()
max_wt = weight.max()

# find the number of integer bits needed to represent this range
int_bits = int(np.ceil(np.log2(max(abs(min_wt), abs(max_wt)))))
frac_bits = 7 - int_bits  # remaining bits are fractional (1 bit for the sign)

# floating-point weights are scaled, rounded and clipped to [-128, 127],
# which is what the fixed-point operations on the actual hardware
# (i.e., the microcontroller) consume
quant_weight = np.clip(np.round(weight * (2 ** frac_bits)), -128, 127)

# to quantify the impact of quantization, scale the weights back to the
# original range and run inference with them
weight = quant_weight / (2 ** frac_bits)

You can follow the same flow to quantize the activation data to Qm.n format. The overall computation flow is then (as illustrated in the sketch below):

  • Weight Qm.n * Activation Qm.n + Bias Qm.n -> Output Qm.n (each tensor may use its own m and n)
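To make this concrete, here is an illustrative C fragment of a single fixed-point multiply-accumulate. It is not the CMSIS-NN source (the real kernels also add a rounding term before the final shift), but it shows the roles of the bias left shift and output right shift, which is exactly what the CONV*_BIAS_LSHIFT and CONV*_OUT_RSHIFT parameters in the code later in this post control:

#include <stdint.h>

typedef int8_t  q7_t;
typedef int32_t q31_t;

/* Saturate a 32-bit accumulator back to the 8-bit q7 range. */
static q7_t sat_q7(q31_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (q7_t)v;
}

/* One output value: sum(weight * activation) + bias, in Qm.n fixed point.
 * bias_lshift aligns the bias with the products' Q format; out_rshift
 * scales the accumulator down to the output's Q format. */
static q7_t mac_q7(const q7_t *wt, const q7_t *act, int len,
                   q7_t bias, int bias_lshift, int out_rshift)
{
    q31_t acc = (q31_t)bias << bias_lshift;
    for (int i = 0; i < len; i++) {
        acc += (q31_t)wt[i] * (q31_t)act[i];
    }
    return sat_q7(acc >> out_rshift);
}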

Once the weights and activation data have been quantized, the next step is to export them into a header file for compiling into the embedded code.

#define CONV1_WT {-9,-1,2,6,-4,6,4,-11,8, ...}
#define CONV1_BIAS {-49,-18,-7,-20,-12,-15, ...}
#define CONV2_WT {-3,-9,-16,-14,8,-17, ...}
#define CONV2_BIAS {55,50,34,43,-37,35, ...}
#define CONV3_WT {15,10,3,1,-20,-11,5, ...}
#define CONV3_BIAS {18,36,-46,-45,64,8, ...}
#define IP1_WT {38,-13,5,-20,15,-4,-3, ...}
#define IP1_BIAS {30,-121,-51,77,40,20, ...}
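These macros initialize the static weight and bias arrays that the CMSIS-NN calls consume. A minimal sketch follows; the header names (parameter.h, weights.h) and the exact size expression are assumptions modeled on the CMSIS-NN CIFAR-10 example:

#include "arm_math.h"   /* q7_t */
#include "parameter.h"  /* generated dimension macros (CONV1_IN_CH, ...); assumed name */
#include "weights.h"    /* generated data macros (CONV1_WT, CONV1_BIAS, ...); assumed name */

/* Weights and biases live in flash as q7 arrays, initialized from the
 * generated macros; the array size mirrors the layer's filter shape. */
static const q7_t conv1_wt[CONV1_IN_CH * CONV1_KER_DIM * CONV1_KER_DIM *
                           CONV1_OUT_CH] = CONV1_WT;
static const q7_t conv1_bias[CONV1_OUT_CH] = CONV1_BIAS;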

In this example, the model needs 32.3 KB to store the weights, 40 KB for the activations, and 3.1 KB for the im2col scratch data.

You can learn more details from a similar Python script.

Implementation

Now that we have trained the network and quantized the weights and activations, it's time to deploy it.

If you're familiar with developing on Cortex-M microcontrollers, just pick your favorite development environment and codebase, and make sure you add the CMSIS-NN header files to your project. If you aren't familiar with coding on Cortex-M, I suggest you read this guide: Getting Started with MDK.

Now, let’s map the network layer with the CMSIS-NN API.

Layer 1

arm_convolve_HWC_q7_RGB() is a dedicated API for an input tensor whose channel dimension equals 3 (RGB). Its parameters fall into four groups: input, filter kernel, bias, and output. Make sure the kernel size, padding, and stride match the trained model.
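For reference, here is the function's prototype as it appears in the CMSIS-NN arm_nnfunctions.h header (the legacy HWC q7 API), with comments marking the four parameter groups:

arm_status arm_convolve_HWC_q7_RGB(
    const q7_t *Im_in,          /* input:  image buffer                      */
    const uint16_t dim_im_in,   /*         input image dimension             */
    const uint16_t ch_im_in,    /*         input channels (must be 3)        */
    const q7_t *wt,             /* kernel: filter weights                    */
    const uint16_t ch_im_out,   /*         number of filters                 */
    const uint16_t dim_kernel,  /*         kernel size                       */
    const uint16_t padding,     /*         padding size                      */
    const uint16_t stride,      /*         convolution stride                */
    const q7_t *bias,           /* bias:   bias values                       */
    const uint16_t bias_shift,  /*         left shift applied to the bias    */
    const uint16_t out_shift,   /*         right shift applied to the output */
    q7_t *Im_out,               /* output: output buffer                     */
    const uint16_t dim_im_out,  /*         output image dimension            */
    q15_t *bufferA,             /* scratch buffer for im2col                 */
    q7_t *bufferB);             /* unused here; the example passes NULL      */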

Layer 2

This layer uses two APIs: arm_maxpool_q7_HWC() and arm_relu_q7(). As with layer 1, check that the pooling kernel settings match the trained model.

Layer 3

Use arm_convolve_HWC_q7_fast() when the number of input channels is a multiple of 4 (the function also expects the number of output channels to be a multiple of 2), as this makes good use of SIMD 32-bit reads and swaps on 8-bit data.

Layer 4

Use arm_avepool_q7_HWC(), passing a non-zero-sized scratch buffer (bufferA, the col_buffer in the code below) for its working memory.

Layer 5

Use the convolution API arm_convolve_HWC_q7_fast() again.

Layer 6

Use arm_avepool_q7_HWC() again.

Layer 7

Use the arm_fully_connected_q7_opt() API. This optimized function is designed to work with an interleaved weight matrix (the weights must be reordered accordingly); check this article for details.

The final code will look like this example on GitHub.

void run_nn() {
	/* two halves of scratch_buffer used as ping-pong activation buffers */
	q7_t* buffer1 = scratch_buffer;
	q7_t* buffer2 = buffer1 + 32768;
	/* layer 1: convolution on the RGB input */
	arm_convolve_HWC_q7_RGB(input_data, CONV1_IN_DIM, CONV1_IN_CH, conv1_wt, CONV1_OUT_CH, CONV1_KER_DIM, CONV1_PAD, CONV1_STRIDE, conv1_bias, CONV1_BIAS_LSHIFT, CONV1_OUT_RSHIFT, buffer1, CONV1_OUT_DIM, (q15_t*)col_buffer, NULL);
	/* layer 2: max pooling, then ReLU */
	arm_maxpool_q7_HWC(buffer1, POOL1_IN_DIM, POOL1_IN_CH, POOL1_KER_DIM, POOL1_PAD, POOL1_STRIDE, POOL1_OUT_DIM, col_buffer, buffer2);
	arm_relu_q7(buffer2, RELU1_OUT_DIM*RELU1_OUT_DIM*RELU1_OUT_CH);
	/* layer 3: fast convolution, then ReLU */
	arm_convolve_HWC_q7_fast(buffer2, CONV2_IN_DIM, CONV2_IN_CH, conv2_wt, CONV2_OUT_CH, CONV2_KER_DIM, CONV2_PAD, CONV2_STRIDE, conv2_bias, CONV2_BIAS_LSHIFT, CONV2_OUT_RSHIFT, buffer1, CONV2_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU2_OUT_DIM*RELU2_OUT_DIM*RELU2_OUT_CH);
	/* layer 4: average pooling */
	arm_avepool_q7_HWC(buffer1, POOL2_IN_DIM, POOL2_IN_CH, POOL2_KER_DIM, POOL2_PAD, POOL2_STRIDE, POOL2_OUT_DIM, col_buffer, buffer2);
	/* layer 5: fast convolution, then ReLU */
	arm_convolve_HWC_q7_fast(buffer2, CONV3_IN_DIM, CONV3_IN_CH, conv3_wt, CONV3_OUT_CH, CONV3_KER_DIM, CONV3_PAD, CONV3_STRIDE, conv3_bias, CONV3_BIAS_LSHIFT, CONV3_OUT_RSHIFT, buffer1, CONV3_OUT_DIM, (q15_t*)col_buffer, NULL);
	arm_relu_q7(buffer1, RELU3_OUT_DIM*RELU3_OUT_DIM*RELU3_OUT_CH);
	/* layer 6: average pooling */
	arm_avepool_q7_HWC(buffer1, POOL3_IN_DIM, POOL3_IN_CH, POOL3_KER_DIM, POOL3_PAD, POOL3_STRIDE, POOL3_OUT_DIM, col_buffer, buffer2);
	/* layer 7: fully-connected layer with interleaved (reordered) weights */
	arm_fully_connected_q7_opt(buffer2, ip1_wt, IP1_IN_DIM, IP1_OUT_DIM, IP1_BIAS_LSHIFT, IP1_OUT_RSHIFT, ip1_bias, output_data, (q15_t*)col_buffer);
}
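Here is a hedged sketch of how run_nn() might be driven end to end. The buffer sizes follow the memory figures quoted above (40 KB activations, ~3.1 KB im2col), arm_softmax_q7() is the CMSIS-NN softmax used by the GitHub example, and capture_image() is a hypothetical camera hook:

#include "arm_math.h"
#include "arm_nnfunctions.h"

#define NUM_CLASSES 10

/* Buffers shared with run_nn(); sizes follow the memory figures above. */
q7_t scratch_buffer[32 * 32 * 10 * 4];   /* 40 KB ping-pong activation buffer */
q7_t col_buffer[2 * 5 * 5 * 32 * 2];     /* ~3.1 KB im2col working memory     */
q7_t input_data[32 * 32 * 3];            /* 32x32 RGB input image             */
q7_t output_data[NUM_CLASSES];

extern void run_nn(void);
extern void capture_image(q7_t *buf);    /* hypothetical camera/driver hook */

int main(void)
{
    capture_image(input_data);
    run_nn();
    /* normalize the 10 raw outputs into probability-like scores */
    arm_softmax_q7(output_data, NUM_CLASSES, output_data);
    return 0;
}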

Now you can add your camera hardware or another image driver to your code. Here is a real demo running on an STMicro Cortex-M7 board.

The table below breaks down the operation count of each layer at run time: the convolution layers clearly consume about 97% of the compute.

Layer   | Type            | Filter Shape    | Output Shape | Ops (k)
Layer 1 | Convolution     | 5 x 5 x 3 x 32  | 32 x 32 x 32 | 4,900
Layer 2 | Max Pooling     | N/A             | 16 x 16 x 32 | 737
Layer 3 | Convolution     | 5 x 5 x 32 x 32 | 16 x 16 x 32 | 13,100
Layer 4 | Average Pooling | N/A             | 8 x 8 x 32   | 18.4
Layer 5 | Convolution     | 5 x 5 x 32 x 64 | 8 x 8 x 64   | 6,600
Layer 6 | Average Pooling | N/A             | 4 x 4 x 64   | 9.2
Layer 7 | Fully Connected | 4 x 4 x 64 x 10 | 10           | 20
Total   |                 | 87 KB           | 55 KB        | 24,700


Summary

Convolutional Neural Network inference has shown a roughly 5x performance boost on the Cortex-M platform using the CMSIS-NN software framework. Please refer to the Arm Developer link below for more information on Arm ML solutions, and don't hesitate to comment below if you have any further questions.

Read about Arm ML solutions

*: The library is available for all Cortex-M cores. Implementations optimized for the SIMD instruction set are available for Arm Cortex-M4, Cortex-M7, and Cortex-M33.

Comments
  • vaibhav541, 5 months ago

    Can we train our network too using this?

  • William UK, over 1 year ago

    Dear odinlmshen, good day. We are planning to implement image classification in an actual industrial application. Is there any possibility of connecting a better camera with higher resolution to the M7 board for better image-capture results? BR, William Ho

  • Hongchuang, over 1 year ago

    Hello,

    Can you help me with the problem? Thanks!

    https://github.com/ARM-software/ML-examples/issues/30

  • rafakath, over 1 year ago

    Hi,
    While converting the .pkl file into a header file by running the script code_gen.py with the MNIST dataset, I get the error below.

        Traceback (most recent call last):
          File "code_gen.py", line 362, in <module>
            generate_parameters(my_model, cmd_args.out_dir+'/parameter.h')
          File "code_gen.py", line 94, in generate_parameters
            f.write("#define "+layer.upper()+"_OUT_DIM "+str(caffe_model.layer_shape[layer][2])+"\n\n")
        IndexError: tuple index out of range

    I am new to Caffe; kindly give any suggestions.

  • qwb, over 1 year ago

    Hello. Before the input reaches the first layer there is a pre-processing step; how is that done in the code in nnexample_cifar10? How can I define INPUT_MEAN_SHIFT and INPUT_RIGHT_SHIFT?
