New CMSIS-NN Neural Network Kernels Boost Efficiency in Microcontrollers by ~5x

Vikas Chandra
January 23, 2018
3 minute read time.

Neural networks are becoming increasingly popular in always-on IoT edge devices that perform data analytics right at the source, reducing both latency and the energy spent on data communication. CMSIS-NN is a collection of efficient neural network kernels developed to maximize performance and minimize the memory footprint of neural networks on Arm Cortex-M processor cores targeted at intelligent IoT edge devices. Neural network inference based on the CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency over a baseline implementation built from existing CMSIS-DSP functions.

CMSIS-NN Library

The CMSIS-NN library consists of two parts: NNFunctions and NNSupportFunctions. NNFunctions implement popular neural network layer types, such as convolution, depthwise separable convolution, fully-connected (i.e. inner-product), pooling and activation. Application code calls these functions to implement neural network inference. The kernel APIs are kept simple, so that they can easily be retargeted to any machine learning framework. NNSupportFunctions include utility functions, such as data conversion and activation function tables, which are used by NNFunctions. Application code can also use these utilities to construct more complex NN modules, e.g. Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells.
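
To make the split concrete, here is a minimal sketch of application code calling NNFunctions to run a fully-connected layer followed by softmax with the 8-bit (q7) kernels. The layer dimensions, weight and bias data, and the fixed-point shift values are placeholders chosen only for illustration, and the exact parameter lists should be checked against the CMSIS-NN release you are using.

#include "arm_nnfunctions.h"   /* NNFunctions: convolution, pooling, fully-connected, activation */

/* Placeholder layer dimensions for illustration only. */
#define FC_DIM_IN   1024
#define FC_DIM_OUT  10

/* q7_t (8-bit fixed-point) weights, bias and activations; values are dummies. */
static const q7_t fc_weights[FC_DIM_OUT * FC_DIM_IN] = { 0 };
static const q7_t fc_bias[FC_DIM_OUT]                = { 0 };
static q7_t  fc_in[FC_DIM_IN];
static q7_t  fc_out[FC_DIM_OUT];
static q15_t scratch[FC_DIM_IN];   /* working buffer required by the kernel */

void run_classifier(void)
{
    /* Fully-connected (inner-product) layer; bias_shift and out_shift carry the
       fixed-point scaling chosen during quantization (assumed values here). */
    arm_fully_connected_q7(fc_in, fc_weights, FC_DIM_IN, FC_DIM_OUT,
                           1 /* bias_shift */, 7 /* out_shift */,
                           fc_bias, fc_out, scratch);

    /* Softmax over the output classes. */
    arm_softmax_q7(fc_out, FC_DIM_OUT, fc_out);
}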

For some kernels (e.g. fully-connected and convolution), different versions of the kernel functions are implemented. A basic version is provided that works universally, as-is, for any layer parameters. We have also implemented other versions that include further optimizations, but either require transformed inputs or place some constraints on the layer parameters. Ideally, a simple script can parse the network topology and automatically determine the appropriate functions to use, as sketched below.
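
As an example of that selection, the hypothetical helper below picks between the universal basic q7 convolution kernel and the faster variant. The constraints shown (input channels a multiple of 4 and output channels a multiple of 2 for arm_convolve_HWC_q7_fast) and the argument order are assumptions to verify against the CMSIS-NN documentation for the release you build against.

#include "arm_nnfunctions.h"

/* Hypothetical wrapper: choose a convolution kernel variant for a square,
   HWC-layout q7 layer. Assumption: arm_convolve_HWC_q7_fast requires the
   input channel count to be a multiple of 4 and the output channel count to
   be a multiple of 2; otherwise fall back to the universal basic kernel. */
static arm_status conv_layer(const q7_t *im_in, uint16_t dim_im_in, uint16_t ch_im_in,
                             const q7_t *wt, uint16_t ch_im_out, uint16_t dim_kernel,
                             uint16_t padding, uint16_t stride,
                             const q7_t *bias, uint16_t bias_shift, uint16_t out_shift,
                             q7_t *im_out, uint16_t dim_im_out,
                             q15_t *buffer_a, q7_t *buffer_b)
{
    if ((ch_im_in % 4 == 0) && (ch_im_out % 2 == 0)) {
        return arm_convolve_HWC_q7_fast(im_in, dim_im_in, ch_im_in, wt, ch_im_out,
                                        dim_kernel, padding, stride, bias, bias_shift,
                                        out_shift, im_out, dim_im_out, buffer_a, buffer_b);
    }
    return arm_convolve_HWC_q7_basic(im_in, dim_im_in, ch_im_in, wt, ch_im_out,
                                     dim_kernel, padding, stride, bias, bias_shift,
                                     out_shift, im_out, dim_im_out, buffer_a, buffer_b);
}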

[Figure: Neural network application code diagram]

Testing on a convolutional neural network

We tested the CMSIS-NN kernels on a convolutional neural network (CNN) trained on the CIFAR-10 dataset, which consists of 60,000 32x32 color images divided into 10 output classes. The network topology is based on the built-in example provided in Caffe, with three convolution layers and one fully-connected layer. The layer parameters and the detailed runtime results using the CMSIS-NN kernels are shown in the table below. The runtime is measured on an STMicroelectronics NUCLEO-F746ZG Mbed board with an Arm Cortex-M7 core running at 216 MHz.

[Table: Layer parameters and detailed runtime results]

The entire image classification takes about 99.1 ms per image (the equivalent of 10.1 images per second). The compute throughput of the CPU is about 249 MOps per second for running this network. The pre-quantized network achieves an accuracy of 80.3% on the CIFAR-10 test set. The 8-bit quantized network running on Arm Cortex-M7 core achieves 79.9% accuracy.  Maximum memory footprint using the CMSIS-NN kernels is ~133 KB, where convolutions are implemented with partial im2col to save memory, followed by matrix-multiplication. Memory footprint without partial im2col would be ~332 KB and the neural network would not fit on the board.
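
For readers unfamiliar with the technique, partial im2col expands only a small batch of output pixels' input patches into a reusable buffer before each matrix multiplication, instead of materializing the full im2col matrix for every output pixel at once. The plain-C sketch below illustrates the idea only; it is not the CMSIS-NN implementation, and the HWC layout, two-column batch size and 32-bit accumulators are assumptions made for clarity.

#include <stdint.h>

/* Conceptual sketch of convolution with partial im2col (illustrative names and
   sizes). The scratch buffer holds only 2 columns of K = k*k*ch_in values and
   is reused for every pair of output pixels, instead of K * dim_out * dim_out
   values for a full im2col expansion. */
void conv_partial_im2col(const int8_t *in, int dim_in, int ch_in,
                         const int8_t *wt, int ch_out, int k, int pad, int stride,
                         int32_t *out, int dim_out, int8_t *col_buf /* 2*K values */)
{
    const int K = k * k * ch_in;   /* length of one im2col column */

    for (int p = 0; p < dim_out * dim_out; p += 2) {
        int cols = (p + 1 < dim_out * dim_out) ? 2 : 1;

        /* im2col step: gather the input patch seen by each of these output pixels. */
        for (int c = 0; c < cols; c++) {
            int oy = (p + c) / dim_out, ox = (p + c) % dim_out;
            int8_t *col = &col_buf[c * K];
            int idx = 0;
            for (int ky = 0; ky < k; ky++)
                for (int kx = 0; kx < k; kx++)
                    for (int ch = 0; ch < ch_in; ch++) {
                        int iy = oy * stride - pad + ky;
                        int ix = ox * stride - pad + kx;
                        int inside = (iy >= 0 && iy < dim_in) && (ix >= 0 && ix < dim_in);
                        col[idx++] = inside ? in[(iy * dim_in + ix) * ch_in + ch] : 0;
                    }
        }

        /* Matrix multiplication: (ch_out x K) weights times (K x cols) columns. */
        for (int c = 0; c < cols; c++)
            for (int o = 0; o < ch_out; o++) {
                int32_t acc = 0;
                for (int i = 0; i < K; i++)
                    acc += (int32_t)wt[o * K + i] * col_buf[c * K + i];
                out[(p + c) * ch_out + o] = acc;
            }
    }
}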

To quantify the benefits of the CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a 1D convolution function (arm_conv from CMSIS-DSP), Caffe-like pooling and ReLU. For the CNN application, the table below summarizes the comparison between the baseline functions and the CMSIS-NN kernels. The CMSIS-NN kernels achieve a 2.6X to 5.4X improvement in runtime/throughput over the baseline functions, and the energy efficiency improvement is in line with the throughput improvement.

[Table: Runtime comparison of baseline functions vs. CMSIS-NN kernels]

Summary

Efficient NN kernels are key to enabling inference on Arm Cortex-M based CPUs. CMSIS-NN provides optimized functions to accelerate key NN layers such as convolution, pooling and activation. CMSIS-NN also helps to reduce the memory footprint, which is essential on memory-constrained microcontrollers. More details are in our whitepaper, which you can download from the Cornell University Library site using the link below.

The CMSIS-NN kernels are available on our GitHub page. Application code can use these kernels directly to implement neural network algorithms on Arm Cortex-M CPUs. Alternatively, machine learning frameworks can use these kernels as primitives to deploy trained models.

For further resources, detailed product information and tutorials to help tackle the challenges of ML at the edge, please visit our new Machine Learning developer site. 

CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs White Paper

Comments
  • franklin over 7 years ago

    Hi Vikas Chandra,

    As you mentioned, you tested CMSIS-NN on an STMicroelectronics NUCLEO-F746ZG Mbed board with an Arm Cortex-M7 core running at 216 MHz. The readme of your GitHub repository (https://github.com/ARM-software/CMSIS_5) says it supports the Armv8-M architecture (Mainline and Baseline) as well as the Cortex-M23 and Cortex-M33 devices. Could you give me some accurate information on the hardware support of CMSIS-NN?

    Thanks.

    Franklin
  • Vikas Chandra over 7 years ago in reply to franklin

    Hi Franklin,

    Our code significantly leverages the DSP/SIMD instructions in the M4, M7 and M33. The CMSIS-NN library can still be compiled for the M0, M3, M23, etc., but it will run more slowly due to the lack of SIMD instructions.

    Thanks,
    Vikas
