New CMSIS-NN Neural Network Kernels Boost Efficiency in Microcontrollers by ~5x

January 23, 2018

Neural networks are becoming increasingly popular in always-on IoT edge devices that perform data analytics right at the source, which reduces both latency and the energy spent on data communication. CMSIS-NN is a collection of efficient neural network kernels developed to maximize the performance and minimize the memory footprint of neural networks on Arm Cortex-M processor cores targeted at intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency.

CMSIS-NN Library

The CMSIS-NN library consists of two parts: NNFunctions and NNSupportFunctions. NNFunctions includes the functions that implement popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected (i.e. inner-product), pooling and activation. The application code uses these functions to implement neural network inference. The kernel APIs are kept simple, so that they can be easily retargeted for any machine learning framework. NNSupportFunctions includes utility functions, such as data conversion and activation function tables, which are used by NNFunctions. The application code can also use these utility functions to construct more complex NN modules, e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU).
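To give a feel for what a fully-connected kernel followed by an activation computes on 8-bit data, here is a minimal Python sketch. It assumes signed 8-bit (q7) values with power-of-two scaling, where the bias is shifted left to match the accumulator format and the result is shifted right and saturated back to 8 bits. The helper names (`saturate_q7`, `fully_connected_q7`, `relu_q7`) and the shift parameters are illustrative stand-ins, not the library's C API, and rounding is omitted for brevity.

```python
def saturate_q7(value):
    """Clamp an integer to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, value))


def fully_connected_q7(inputs, weights, biases, bias_shift, out_shift):
    """Inner product per output neuron with fixed-point scaling (assumed scheme).

    inputs:  list of q7 activations
    weights: one list of q7 weights per output neuron
    biases:  one q7 bias per output neuron
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias << bias_shift                  # align bias with the accumulator
        acc += sum(w * x for w, x in zip(row, inputs))
        outputs.append(saturate_q7(acc >> out_shift))  # rescale and saturate to 8 bits
    return outputs


def relu_q7(values):
    """ReLU activation: negative values become zero."""
    return [max(0, v) for v in values]


# Illustrative values only.
hidden = fully_connected_q7(
    inputs=[10, -20, 30],
    weights=[[1, 2, 3], [-4, 5, -6]],
    biases=[4, -8],
    bias_shift=1,
    out_shift=2,
)
print(relu_q7(hidden))
```

Application code for a small network is essentially a chain of such calls, one per layer, with the output buffer of one layer passed as the input of the next.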

For some kernels (e.g. fully-connected and convolution), different versions of the kernel functions are implemented. A basic version is provided that works universally ‘as-is’ for any layer parameters. We have also implemented other versions which include further optimization techniques but with either transformed inputs or with some limitations on the layer parameters. Ideally, a simple script can be used to parse the network topology and automatically determine the appropriate functions to be used.
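As a rough sketch of what such a selection script might look like, the Python snippet below picks a convolution kernel from the layer's channel counts. It assumes the constraints documented for the library's q7 convolution kernels: `arm_convolve_HWC_q7_RGB` for exactly 3 input channels, `arm_convolve_HWC_q7_fast` when the input channel count is a multiple of 4 and the output channel count is a multiple of 2, and `arm_convolve_HWC_q7_basic` otherwise. The `ConvLayer` type, the `pick_conv_kernel` helper and the channel counts are hypothetical, and the constraints should be checked against the library version in use.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    """Hypothetical description of one convolution layer."""
    name: str
    in_channels: int
    out_channels: int


def pick_conv_kernel(layer: ConvLayer) -> str:
    """Return the name of the q7 convolution kernel assumed to fit this layer."""
    if layer.in_channels == 3:
        # Specialized version for 3-channel (RGB) inputs.
        return "arm_convolve_HWC_q7_RGB"
    if layer.in_channels % 4 == 0 and layer.out_channels % 2 == 0:
        # Optimized version with restrictions on the channel counts.
        return "arm_convolve_HWC_q7_fast"
    # Basic version that works for any layer parameters.
    return "arm_convolve_HWC_q7_basic"


# Illustrative channel counts, not taken from the benchmarked network.
layers = [
    ConvLayer("conv1", 3, 32),
    ConvLayer("conv2", 32, 32),
    ConvLayer("conv3", 30, 64),
]
for layer in layers:
    print(layer.name, "->", pick_conv_kernel(layer))
```

A real script would also have to handle the fully-connected variants, where the optimized version expects the weights to be reordered ahead of time.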

Neural Network Application Code diagram

Testing on a convolutional neural network

We tested the CMSIS-NN kernels on a convolutional neural network (CNN) trained on the CIFAR-10 dataset, which consists of 60,000 32x32 color images divided into 10 output classes. The network topology is based on the built-in example provided in Caffe, with three convolution layers and one fully-connected layer. The layer parameters and the detailed runtime results using the CMSIS-NN kernels are shown in the table below. The runtime is measured on an STMicroelectronics NUCLEO-F746ZG mbed board with an Arm Cortex-M7 core running at 216 MHz.

Detailed runtime results

The entire image classification takes about 99.1 ms per image (the equivalent of 10.1 images per second). The compute throughput of the CPU is about 249 MOps per second when running this network. The pre-quantized network achieves an accuracy of 80.3% on the CIFAR-10 test set. The 8-bit quantized network running on the Arm Cortex-M7 core achieves 79.9% accuracy. The maximum memory footprint using the CMSIS-NN kernels is ~133 KB, where convolutions are implemented with partial im2col to save memory, followed by matrix multiplication. The memory footprint without partial im2col would be ~332 KB, and the neural network would not fit on the board.
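The memory saving from partial im2col can be illustrated with a small pure-Python sketch. Full im2col expands every convolution window into a column before the matrix multiplication, which multiplies the input size by roughly the kernel area. The partial variant expands only a few windows at a time into a small scratch buffer and multiplies them with the filters right away. The function names, the stride-1 unpadded layout and the choice of `cols_per_pass=2` below are illustrative assumptions, not the library's implementation.

```python
def im2col_patch(image, row, col, k):
    """Flatten the k x k x C window whose top-left corner is (row, col)."""
    return [v
            for r in range(row, row + k)
            for c in range(col, col + k)
            for v in image[r][c]]


def conv_partial_im2col(image, weights, k, cols_per_pass=2):
    """Stride-1, unpadded convolution using a small im2col scratch buffer.

    image:   H x W x C nested lists
    weights: one flattened k*k*C filter per output channel
    Returns the output feature map and the peak scratch buffer size in elements.
    """
    height, width = len(image), len(image[0])
    out_h, out_w = height - k + 1, width - k + 1
    positions = [(r, c) for r in range(out_h) for c in range(out_w)]
    output = [[None] * out_w for _ in range(out_h)]
    peak_buffer = 0

    for start in range(0, len(positions), cols_per_pass):
        batch = positions[start:start + cols_per_pass]
        # Expand only this batch of windows, not the whole image.
        buffer = [im2col_patch(image, r, c, k) for r, c in batch]
        peak_buffer = max(peak_buffer, sum(len(column) for column in buffer))
        # Matrix multiplication: every filter against every expanded column.
        for (r, c), column in zip(batch, buffer):
            output[r][c] = [sum(w * x for w, x in zip(filt, column))
                            for filt in weights]
    return output, peak_buffer


# 4x4 single-channel image holding 0..15, one 3x3 filter of ones.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
weights = [[1] * 9]
output, peak = conv_partial_im2col(image, weights, k=3)
print(output)  # [[[45], [54]], [[81], [90]]]
print(peak)    # 18 elements, versus 36 if all four windows were expanded at once
```

The scratch buffer size depends only on the kernel size, channel count and batch of columns, not on the image size, which is why the approach keeps the peak footprint low on larger layers.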

To quantify the benefits of CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a 1D convolution function (arm_conv from CMSIS-DSP), Caffe-like pooling and ReLU. For the CNN application, the table below compares the baseline functions with the CMSIS-NN kernels. The CMSIS-NN kernels achieve a 2.6X to 5.4X improvement in runtime/throughput over the baseline functions. The energy efficiency improvement is in line with the throughput improvement.

Baseline v new kernel runtime

Summary

Efficient NN kernels are key to enabling inference on Arm Cortex-M based CPUs. CMSIS-NN provides optimized functions to accelerate key NN layers, such as convolution, pooling and activations. In addition, CMSIS-NN helps to reduce the memory footprint, which is key for memory-constrained microcontrollers. More details are in our whitepaper, which you can download from the Cornell University Library site using the button below.

The CMSIS-NN kernels are available on GitHub. The application code can use these kernels directly to implement neural network algorithms on Arm Cortex-M CPUs. Alternatively, machine learning frameworks can use these kernels as primitives to deploy trained models.

For further resources, detailed product information and tutorials to help tackle the challenges of ML at the edge, please visit our new Machine Learning developer site.

CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs White Paper

Top Comments

Vikas Chandra over 7 years ago in reply to franklin +1

Hi Franklin, Our code significantly leverages the DSP/SIMD functions in M4, M7, M33. The CMSIS-NN library can still be compiled for M0, M3, M23 etc but it will run slowly due to lack of SIMD instructions...
