New CMSIS-NN Neural Network Kernels Boost Efficiency in Microcontrollers by ~5x

Vikas Chandra
January 23, 2018
3 minute read time.

Neural networks are becoming increasingly popular in always-on IoT edge devices that perform data analytics right at the source, reducing both latency and the energy spent on data communication. CMSIS-NN is a collection of efficient neural network kernels developed to maximize performance and minimize the memory footprint of neural networks on Arm Cortex-M processor cores targeted at intelligent IoT edge devices. Neural network inference based on the CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency over a baseline implementation built from existing CMSIS-DSP functions.

CMSIS-NN Library

The CMSIS-NN library consists of two parts: NNFunctions and NNSupportFunctions. NNFunctions implement popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected (i.e. inner-product), pooling and activation. Application code calls these functions to implement neural network inference. The kernel APIs are kept simple, so that they can easily be retargeted to any machine learning framework. NNSupportFunctions include utility functions, such as data conversion and activation function tables, which are used by NNFunctions. Application code can also use these utilities to construct more complex NN modules, e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells.
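
To make the split concrete, here is a minimal sketch of application code calling NNFunctions to run a fully-connected layer followed by softmax with the 8-bit (q7) kernels. The layer dimensions, weight and bias data, and the fixed-point shift values are placeholders chosen only for illustration, and the exact parameter lists should be checked against the CMSIS-NN release you are using.

#include "arm_nnfunctions.h"   /* NNFunctions: convolution, pooling, fully-connected, activation */

/* Placeholder layer dimensions for illustration only. */
#define FC_DIM_IN   1024
#define FC_DIM_OUT  10

/* q7_t (8-bit fixed-point) weights, bias and activations; values are dummies. */
static const q7_t fc_weights[FC_DIM_OUT * FC_DIM_IN] = { 0 };
static const q7_t fc_bias[FC_DIM_OUT]                = { 0 };
static q7_t  fc_in[FC_DIM_IN];
static q7_t  fc_out[FC_DIM_OUT];
static q15_t scratch[FC_DIM_IN];   /* working buffer required by the kernel */

void run_classifier(void)
{
    /* Fully-connected (inner-product) layer; bias_shift and out_shift carry the
       fixed-point scaling chosen during quantization (assumed values here). */
    arm_fully_connected_q7(fc_in, fc_weights, FC_DIM_IN, FC_DIM_OUT,
                           1 /* bias_shift */, 7 /* out_shift */,
                           fc_bias, fc_out, scratch);

    /* Softmax over the output classes. */
    arm_softmax_q7(fc_out, FC_DIM_OUT, fc_out);
}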

For some kernels (e.g. fully-connected and convolution), different versions of the kernel functions are implemented. A basic version is provided that works universally, as-is, for any layer parameters. We have also implemented other versions that include further optimizations, but either require transformed inputs or place some constraints on the layer parameters. Ideally, a simple script can parse the network topology and automatically determine the appropriate functions to use, as sketched below.
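
As an example of that selection, the hypothetical helper below picks between the universal basic q7 convolution kernel and the faster variant. The constraints shown (input channels a multiple of 4 and output channels a multiple of 2 for arm_convolve_HWC_q7_fast) and the argument order are assumptions to verify against the CMSIS-NN documentation for the release you build against.

#include "arm_nnfunctions.h"

/* Hypothetical wrapper: choose a convolution kernel variant for a square,
   HWC-layout q7 layer. Assumption: arm_convolve_HWC_q7_fast requires the
   input channel count to be a multiple of 4 and the output channel count to
   be a multiple of 2; otherwise fall back to the universal basic kernel. */
static arm_status conv_layer(const q7_t *im_in, uint16_t dim_im_in, uint16_t ch_im_in,
                             const q7_t *wt, uint16_t ch_im_out, uint16_t dim_kernel,
                             uint16_t padding, uint16_t stride,
                             const q7_t *bias, uint16_t bias_shift, uint16_t out_shift,
                             q7_t *im_out, uint16_t dim_im_out,
                             q15_t *buffer_a, q7_t *buffer_b)
{
    if ((ch_im_in % 4 == 0) && (ch_im_out % 2 == 0)) {
        return arm_convolve_HWC_q7_fast(im_in, dim_im_in, ch_im_in, wt, ch_im_out,
                                        dim_kernel, padding, stride, bias, bias_shift,
                                        out_shift, im_out, dim_im_out, buffer_a, buffer_b);
    }
    return arm_convolve_HWC_q7_basic(im_in, dim_im_in, ch_im_in, wt, ch_im_out,
                                     dim_kernel, padding, stride, bias, bias_shift,
                                     out_shift, im_out, dim_im_out, buffer_a, buffer_b);
}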

[Figure: Neural network application code diagram]

Testing on a convolutional neural network

We tested the CMSIS-NN kernels on a convolutional neural network (CNN) trained on the CIFAR-10 dataset, which consists of 60,000 32x32 color images divided into 10 output classes. The network topology is based on the built-in example provided in Caffe, with three convolution layers and one fully-connected layer. The layer parameters and the detailed runtime results using the CMSIS-NN kernels are shown in the table below. The runtime is measured on an STMicroelectronics NUCLEO-F746ZG Mbed board with an Arm Cortex-M7 core running at 216 MHz.

[Table: Layer parameters and detailed runtime results]

The entire image classification takes about 99.1 ms per image (the equivalent of 10.1 images per second). The compute throughput of the CPU is about 249 MOps per second for running this network. The pre-quantized network achieves an accuracy of 80.3% on the CIFAR-10 test set. The 8-bit quantized network running on Arm Cortex-M7 core achieves 79.9% accuracy.  Maximum memory footprint using the CMSIS-NN kernels is ~133 KB, where convolutions are implemented with partial im2col to save memory, followed by matrix-multiplication. Memory footprint without partial im2col would be ~332 KB and the neural network would not fit on the board.
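
For readers unfamiliar with the technique, partial im2col expands only a small batch of output pixels' input patches into a reusable buffer before each matrix multiplication, instead of materializing the full im2col matrix for every output pixel at once. The plain-C sketch below illustrates the idea only; it is not the CMSIS-NN implementation, and the HWC layout, two-column batch size and 32-bit accumulators are assumptions made for clarity.

#include <stdint.h>

/* Conceptual sketch of convolution with partial im2col (illustrative names and
   sizes). The scratch buffer holds only 2 columns of K = k*k*ch_in values and
   is reused for every pair of output pixels, instead of K * dim_out * dim_out
   values for a full im2col expansion. */
void conv_partial_im2col(const int8_t *in, int dim_in, int ch_in,
                         const int8_t *wt, int ch_out, int k, int pad, int stride,
                         int32_t *out, int dim_out, int8_t *col_buf /* 2*K values */)
{
    const int K = k * k * ch_in;   /* length of one im2col column */

    for (int p = 0; p < dim_out * dim_out; p += 2) {
        int cols = (p + 1 < dim_out * dim_out) ? 2 : 1;

        /* im2col step: gather the input patch seen by each of these output pixels. */
        for (int c = 0; c < cols; c++) {
            int oy = (p + c) / dim_out, ox = (p + c) % dim_out;
            int8_t *col = &col_buf[c * K];
            int idx = 0;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    for (int ch = 0; ch < ch_in; ch++) {
                        int iy = oy * stride - pad + ky;
                        int ix = ox * stride - pad + kx;
                        int inside = (iy >= 0 && iy < dim_in) && (ix >= 0 && ix < dim_in);
                        col[idx++] = inside ? in[(iy * dim_in + ix) * ch_in + ch] : 0;
                    }
        }

        /* Matrix multiplication: (ch_out x K) weights times (K x cols) columns. */
        for (int c = 0; c < cols; c++)
            for (int o = 0; o < ch_out; o++) {
                int32_t acc = 0;
                for (int i = 0; i < K; i++)
                    acc += (int32_t)wt[o * K + i] * col_buf[c * K + i];
                out[(p + c) * ch_out + o] = acc;
            }
    }
}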

To quantify the benefits of the CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a 1D convolution function (arm_conv from CMSIS-DSP), Caffe-like pooling and ReLU. For the CNN application, the table below summarizes the comparison between the baseline functions and the CMSIS-NN kernels. The CMSIS-NN kernels achieve a 2.6X to 5.4X improvement in runtime/throughput over the baseline functions, and the energy efficiency improvement is in line with the throughput improvement.

[Table: Runtime comparison of baseline functions vs. CMSIS-NN kernels]

Summary

Efficient NN kernels are key to enabling inference on Arm Cortex-M based CPUs. CMSIS-NN provides optimized functions to accelerate key NN layers such as convolution, pooling and activation. CMSIS-NN also helps to reduce the memory footprint, which is essential on memory-constrained microcontrollers. More details are in our whitepaper, which you can download from the Cornell University Library site using the link below.

The CMSIS-NN kernels are available on our GitHub page. Application code can use these kernels directly to implement neural network algorithms on Arm Cortex-M CPUs. Alternatively, machine learning frameworks can use these kernels as primitives to deploy trained models.

For further resources, detailed product information and tutorials to help tackle the challenges of ML at the edge, please visit our new Machine Learning developer site. 

CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs White Paper

Comments
  • franklin over 7 years ago

    Hi Vikas Chandra,

    As you mentioned, you tested CMSIS-NN on an STMicroelectronics NUCLEO-F746ZG Mbed board with an Arm Cortex-M7 core running at 216 MHz. The readme of your GitHub repository (https://github.com/ARM-software/CMSIS_5) says it supports the Armv8-M architecture (Mainline and Baseline) as well as the Cortex-M23 and Cortex-M33 devices. Could you give me some accurate information on the hardware support of CMSIS-NN?

    Thanks.

    Franklin
  • Vikas Chandra over 7 years ago in reply to franklin

    Hi Franklin,

    Our code significantly leverages the DSP/SIMD instructions in the M4, M7 and M33. The CMSIS-NN library can still be compiled for the M0, M3, M23, etc., but it will run more slowly due to the lack of SIMD instructions.

    Thanks,
    Vikas
