How to Achieve High-Accuracy Keyword Spotting on Cortex-M Processors

January 18, 2018

4 minute read time.

It IS possible to optimize neural network architectures to fit within the memory and compute constraints of microcontrollers – without sacrificing accuracy. We explain how, and explore the potential of depthwise separable convolutional neural networks for implementing keyword spotting on Cortex-M processors.

Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy to ensure a good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech-processing algorithms.

Keyword spotting neural network pipeline

Keyword spotting neural network pipeline

Due to its always-on nature, the KWS application has a highly constrained power budget. Although it is, of course, able to run on a dedicated DSP or a powerful CPU, it’s more than capable of running on an Arm Cortex-M microcontroller, which is frequently used at the IoT edge to handle other tasks. This also helps to minimize cost.

However, the deployment of a neural-network-based KWS on Cortex-M-based microcontrollers comes with following challenges:

Limited memory footprint

Typical Cortex-M systems have a maximum of a few hundred KB of available memory. The entire neural network model, including input/output, weights and activations, has to fit within this minimal memory budget.
Limited compute resources
Since KWS is always on, the real-time requirement limits the total number of operations per neural network inference.

These are the typical neural network architectures for KWS inference:

Deep Neural Network (DNN)
A DNN is a standard, feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers
Convolutional Neural Network (CNN)
One main drawback of a DNN-based KWS is that it fails to efficiently model the local, temporal and spectral correlation in the speech features. CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2D convolution operations over it.
Recurrent Neural Network (RNN)
RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition, language modeling and translation. RNNs not only exploit the temporal relation between the input signal, but also capture the long-term dependencies, using a ‘gating’ mechanism.
Convolutional Recurrent Neural Network (CRNN)
A convolutional recurrent neural network is a hybrid of a CNN and an RNN that exploits the local temporal/spatial correlation. A CRNN models starts with a convolution layer, followed by an RNN to encode the signal and a dense fully-connected.
Depthwise Separable Convolutional Neural Network (DS-CNN)
Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3D convolution operation and has been used to achieve compact network architectures for computer vision.

DS-CNN first convolves each channel in the input feature map with a separate 2D filter and then uses pointwise convolutions (i.e. 1x1) to combine the outputs in the depth dimension. By decomposing the standard 3D convolutions to 2D and then 1D, the number of parameters and operations are reduced, making deeper and wider architecture a possibility, even in resource-constrained microcontroller devices.

Memory footprint and execution time are the two most important factors when running keyword spotting on Cortex-M processors, and they should be considered when designing and optimizing neural networks for that purpose. Based on typical Cortex-M system configurations, here are three sets of constraints for neural networks (shown below) targeting small, medium and large Cortex-M systems.

Neural network classes for KWS models

Neural network (NN) classes for KWS models, assuming 10 inferences per second and 8-bit weights/activations

To tune the models to fit within the memory and compute constraints, a hyperparameter search needs be performed. The table below shows the neural network architectures and corresponding hyperparameters which need to be optimized.

Neural network hyperparameters search space

An exhaustive search of feature extraction and NN model hyperparameters, followed by manual selection to narrow down the search space, is iteratively performed. The final best performing models for each neural network architecture, along with their memory requirements and operations, are summarized in the figure below. The DS-CNN architecture provides the best accuracy while requiring significantly lower memory and compute resources.

Memory vs ops/inference_best NN models

Memory vs. ops/inference of the best NN models

The KWS application was deployed on a Cortex-M7-based STM32F746G-DISCO development board (shown below), using a DNN model with 8-bit weights and 8-bit activations, with KWS running at 10 inferences per second. Each inference – including memory copying, MFCC feature extraction and DNN execution – takes about 12 ms. To save power, the microcontroller can be put into Wait-for-Interrupt (WFI) mode for the remaining time. The entire KWS application occupies ~70 KB memory, including ~66 KB for weights, ~1 KB for activations and ~2 KB for audio I/O and MFCC features.

Deployment of KWS on Cortex-M7 dev board

Deployment of KWS on Cortex-M7 development board

In summary, Arm Cortex-M processors achieve state-of-the-art accuracies on the keyword spotting application by tuning the network architecture to limit the memory and compute requirements. The DS-CNN architecture provides the best accuracy while requiring significantly lower memory and compute resources.

The code, model definitions and pre-trained models are available on GitHub.

Our new Machine Learning developer site provides a one-stop repository of resources, detailed product information and tutorials to help tackle the challenges of ML at the edge.

This blog is based on the whitepaper Hello Edge: Keyword Spotting on Microcontrollers, which was originally hosted on the Cornell University Library site. To download a copy of the Arm whitepaper, please use the button below.

Download The Power of Speech whitepaper

4 comments
0 members are here

Architectures and Processors blog

Introducing GICv5: Scalable and secure interrupt management for Arm

Christoffer Dall

Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
- April 28, 2025
Getting started with AARCHMRS Features.json using Python

Joh

A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
- April 8, 2025
Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

Samer El-Haj-Mahmoud

Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
- January 28, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

How to Achieve High-Accuracy Keyword Spotting on Cortex-M Processors

Introducing GICv5: Scalable and secure interrupt management for Arm

Getting started with AARCHMRS Features.json using Python

Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC