How to Achieve High-Accuracy Keyword Spotting on Cortex-M Processors

Vikas Chandra
January 18, 2018
4 minute read time.

It IS possible to optimize neural network architectures to fit within the memory and compute constraints of microcontrollers – without sacrificing accuracy. We explain how, and explore the potential of depthwise separable convolutional neural networks for implementing keyword spotting on Cortex-M processors.

Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy to ensure a good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech-processing algorithms.

Keyword spotting neural network pipeline
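The front end of this pipeline converts raw audio into spectral features (typically MFCCs) that the neural network then classifies. As a rough illustration, here is a minimal NumPy sketch of such a feature extractor; the frame length, hop, filterbank size and coefficient count are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def mfcc_features(signal, sample_rate=16000, frame_len=640, hop=320,
                  n_mels=40, n_coeffs=10):
    """Toy MFCC front end: frame -> window -> FFT -> mel filterbank -> log -> DCT.
    All sizes here are illustrative, not the paper's settings."""
    # Split the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank mapping FFT bins to mel bands.
    n_bins = spectrum.shape[1]
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    mel_pts = np.linspace(0, mel_max, n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bin_pts = np.floor((frame_len + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        l, c, r = bin_pts[m - 1], bin_pts[m], bin_pts[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then a DCT to decorrelate -> cepstral coefficients.
    log_mel = np.log(spectrum @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1)) / (2 * n_mels))
    return log_mel @ dct.T   # shape: (n_frames, n_coeffs)
```

One second of 16 kHz audio with these settings yields a 49 x 10 feature matrix, which becomes the 2D "image" that the classifier consumes.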

Due to its always-on nature, the KWS application has a highly constrained power budget. Although KWS can, of course, run on a dedicated DSP or a powerful CPU, it is more than capable of running on an Arm Cortex-M microcontroller, which is frequently already present at the IoT edge to handle other tasks. Running KWS there also helps to minimize cost.

However, deploying a neural-network-based KWS on Cortex-M-based microcontrollers comes with the following challenges:

  1. Limited memory footprint

    Typical Cortex-M systems have a maximum of a few hundred KB of available memory. The entire neural network model, including input/output, weights and activations, has to fit within this minimal memory budget.

  2. Limited compute resources

    Since KWS is always on, the real-time requirement limits the total number of operations per neural network inference.

These are the typical neural network architectures for KWS inference:

  • Deep Neural Network (DNN)

A DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers.

  • Convolutional Neural Network (CNN)

    One main drawback of a DNN-based KWS is that it fails to efficiently model the local, temporal and spectral correlation in the speech features. CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2D convolution operations over it.

  • Recurrent Neural Network (RNN)

RNNs have shown superior performance in many sequence-modeling tasks, especially speech recognition, language modeling and translation. RNNs not only exploit the temporal relations within the input signal, but also capture long-term dependencies using a ‘gating’ mechanism.

  • Convolutional Recurrent Neural Network (CRNN)

A convolutional recurrent neural network is a hybrid of a CNN and an RNN that exploits the local temporal/spatial correlation. A CRNN model starts with a convolutional layer, followed by an RNN to encode the signal, and a dense fully-connected layer.

  • Depthwise Separable Convolutional Neural Network (DS-CNN)

    Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3D convolution operation and has been used to achieve compact network architectures for computer vision.

DS-CNN first convolves each channel in the input feature map with a separate 2D filter and then uses pointwise (i.e. 1x1) convolutions to combine the outputs in the depth dimension. By decomposing the standard 3D convolution into 2D and then 1D operations, the number of parameters and operations is reduced, making deeper and wider architectures possible even on resource-constrained microcontrollers.
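The parameter savings from this factorization are easy to quantify. The sketch below compares the parameter count of a standard convolution layer with its depthwise separable equivalent; the layer dimensions are illustrative, not taken from the paper's models.

```python
def conv_params(k, c_in, c_out):
    # Standard 2D convolution: one k x k x c_in kernel per output channel.
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel,
    # then pointwise step: 1x1 convolutions mixing channels.
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 kernels, 64 -> 64 channels (not the paper's exact sizes).
std = conv_params(3, 64, 64)     # 36864 parameters
ds = ds_conv_params(3, 64, 64)   # 576 + 4096 = 4672 parameters
print(std, ds, round(std / ds, 1))  # roughly 7.9x fewer parameters
```

The same ratio applies to multiply-accumulate operations per output position, which is what makes deeper DS-CNNs viable on a microcontroller.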

Memory footprint and execution time are the two most important factors when running keyword spotting on Cortex-M processors, and they should be considered when designing and optimizing neural networks for that purpose. Based on typical Cortex-M system configurations, here are three sets of constraints for neural networks (shown below) targeting small, medium and large Cortex-M systems.

Neural network (NN) classes for KWS models, assuming 10 inferences per second and 8-bit weights/activations
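Given the table's assumptions of 10 inferences per second and 8-bit weights/activations, a candidate model can be sanity-checked against a system budget. The budget numbers below are hypothetical placeholders for illustration, not the table's actual class limits.

```python
def model_footprint(n_params, n_ops_per_inference, inferences_per_sec=10):
    """Memory in KB assuming 8-bit (1-byte) weights, and sustained ops/second
    at the given inference rate (assumptions taken from the figure caption)."""
    mem_kb = n_params / 1024
    ops_per_sec = n_ops_per_inference * inferences_per_sec
    return mem_kb, ops_per_sec

def fits(n_params, n_ops, mem_budget_kb, ops_budget_per_sec):
    # True if the model satisfies both the memory and the throughput budget.
    mem_kb, ops_sec = model_footprint(n_params, n_ops)
    return mem_kb <= mem_budget_kb and ops_sec <= ops_budget_per_sec

# Hypothetical "small system" budget (placeholder numbers, not from the table):
# 80 KB of memory and 60 M ops/s of sustained compute.
print(fits(70_000, 5_000_000, mem_budget_kb=80, ops_budget_per_sec=60_000_000))
```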

To tune the models to fit within the memory and compute constraints, a hyperparameter search needs to be performed. The table below shows the neural network architectures and the corresponding hyperparameters to be optimized.

Neural network hyperparameter search space

An exhaustive search of feature extraction and NN model hyperparameters, followed by manual selection to narrow down the search space, is iteratively performed. The final best performing models for each neural network architecture, along with their memory requirements and operations, are summarized in the figure below. The DS-CNN architecture provides the best accuracy while requiring significantly lower memory and compute resources.

Memory vs. ops/inference of the best NN models
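The exhaustive portion of such a search can be sketched as a plain grid search. The hyperparameter names, ranges and toy evaluation function below are illustrative stand-ins, not the paper's actual search space or training procedure.

```python
import itertools

# Hypothetical search space (illustrative names and ranges only).
search_space = {
    "n_layers":   [2, 3, 4],
    "n_features": [64, 128, 256],
    "frame_hop":  [20, 40],   # ms
}

def grid_search(space, evaluate):
    """Exhaustively try every combination and keep the best-scoring one."""
    best_score, best_cfg = float("-inf"), None
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)   # in practice: validation accuracy after training
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Stand-in evaluator that favors more features and penalizes depth (toy only).
toy_eval = lambda c: c["n_features"] - 10 * c["n_layers"]
cfg, score = grid_search(search_space, toy_eval)
print(cfg)  # {'n_layers': 2, 'n_features': 256, 'frame_hop': 20}
```

In the actual workflow, each evaluation is a full training run, so the grid is pruned manually between iterations rather than enumerated blindly.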

The KWS application was deployed on a Cortex-M7-based STM32F746G-DISCO development board (shown below), using a DNN model with 8-bit weights and 8-bit activations, with KWS running at 10 inferences per second. Each inference – including memory copying, MFCC feature extraction and DNN execution – takes about 12 ms. To save power, the microcontroller can be put into Wait-for-Interrupt (WFI) mode for the remaining time. The entire KWS application occupies ~70 KB memory, including ~66 KB for weights, ~1 KB for activations and ~2 KB for audio I/O and MFCC features.

Deployment of KWS on Cortex-M7 development board
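The timing figures in the text imply a comfortable compute margin, which a quick back-of-the-envelope calculation makes explicit:

```python
# Duty cycle from the numbers in the text:
# 10 inferences per second, ~12 ms of work per inference.
inference_ms = 12
rate_hz = 10
busy_ms_per_s = inference_ms * rate_hz   # 120 ms of compute per second
duty_cycle = busy_ms_per_s / 1000        # CPU active ~12% of the time
print(f"{duty_cycle:.0%} active, {1 - duty_cycle:.0%} available for WFI sleep")
```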

In summary, Cortex-M processors can achieve state-of-the-art accuracy on the keyword spotting application when the network architecture is tuned to fit their memory and compute limits. Among the architectures evaluated, DS-CNN provides the best accuracy while requiring significantly lower memory and compute resources.

The code, model definitions and pre-trained models are available on GitHub.

Our new Machine Learning developer site provides a one-stop repository of resources, detailed product information and tutorials to help tackle the challenges of ML at the edge.

This blog is based on the whitepaper Hello Edge: Keyword Spotting on Microcontrollers, which was originally hosted on the Cornell University Library site. To download a copy of the Arm whitepaper, please use the button below.

Download The Power of Speech whitepaper

  • eera_l over 5 years ago

Thanks for the interesting article! How did you run inferences on live audio on the Cortex-M7?

    I deployed the model on the board following the instructions on the GitHub repo, but the screen of the Cortex has gone blank and I can't seem to be able to run any inferences on it or to get a GUI like the one in the last picture of the article.

    Thanks :)

  • wayne175 over 6 years ago in reply to Vikas Chandra

How can I change the model? There are only two models (DNN and DS-CNN) in the code link.

  • Vikas Chandra over 7 years ago in reply to Dennix

    You can find the KWS deployment code here: github.com/.../Deployment

  • Dennix over 7 years ago

Good article, but where can I find a full example (KWS application) for the microcontroller?
