How to Achieve High-Accuracy Keyword Spotting on Cortex-M Processors

January 18, 2018

It IS possible to optimize neural network architectures to fit within the memory and compute constraints of microcontrollers – without sacrificing accuracy. We explain how, and explore the potential of depthwise separable convolutional neural networks for implementing keyword spotting on Cortex-M processors.

Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy to ensure a good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech-processing algorithms.

Keyword spotting neural network pipeline

Because it is always on, a KWS application has a tightly constrained power budget. It can run on a dedicated DSP or a powerful CPU, but it is also well within the capabilities of an Arm Cortex-M microcontroller, which is frequently used at the IoT edge to handle other tasks. Running KWS there also helps to minimize cost.

However, deploying a neural-network-based KWS on Cortex-M-based microcontrollers comes with the following challenges:

Limited memory footprint
Typical Cortex-M systems have a maximum of a few hundred KB of available memory. The entire neural network model, including input/output, weights and activations, has to fit within this minimal memory budget.
Limited compute resources
Since KWS is always on, the real-time requirement limits the total number of operations per neural network inference.

These are the typical neural network architectures for KWS inference:

Deep Neural Network (DNN)
A DNN is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers.
Convolutional Neural Network (CNN)
One main drawback of a DNN-based KWS is that it fails to efficiently model the local temporal and spectral correlation in the speech features. CNNs exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2D convolution operations over it.
Recurrent Neural Network (RNN)
RNNs have shown superior performance in many sequence modeling tasks, especially speech recognition, language modeling and translation. RNNs not only exploit the temporal relations within the input signal, but also capture long-term dependencies using a ‘gating’ mechanism.
Convolutional Recurrent Neural Network (CRNN)
A convolutional recurrent neural network is a hybrid of a CNN and an RNN that exploits the local temporal/spatial correlation. A CRNN model starts with a convolution layer, followed by an RNN to encode the signal and a dense fully-connected layer.
Depthwise Separable Convolutional Neural Network (DS-CNN)
Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3D convolution operation and has been used to achieve compact network architectures for computer vision.

A DS-CNN first convolves each channel in the input feature map with a separate 2D filter and then uses pointwise (i.e. 1x1) convolutions to combine the outputs in the depth dimension. Decomposing the standard 3D convolution into a 2D convolution followed by a 1D convolution reduces the number of parameters and operations, which makes deeper and wider architectures possible even on resource-constrained microcontrollers.
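To make the saving concrete, here is a minimal sketch that counts weights and multiply-accumulate operations (MACs) for a standard convolution and for its depthwise separable equivalent. The layer shape is a hypothetical example chosen for illustration, not one of the models discussed in this post, and biases are ignored.

```python
def standard_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs of a standard convolution layer (biases ignored)."""
    params = k_h * k_w * c_in * c_out   # one 3D filter per output channel
    macs = params * out_h * out_w       # every weight is used at each output position
    return params, macs


def ds_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs of a depthwise separable convolution (biases ignored)."""
    depthwise = k_h * k_w * c_in        # one 2D filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    params = depthwise + pointwise
    macs = params * out_h * out_w
    return params, macs


# Hypothetical layer shape, for illustration only.
shape = dict(k_h=3, k_w=3, c_in=64, c_out=64, out_h=25, out_w=5)

std_params, std_macs = standard_conv_cost(**shape)
ds_params, ds_macs = ds_conv_cost(**shape)

print(f"standard:  {std_params} weights, {std_macs} MACs")
print(f"separable: {ds_params} weights, {ds_macs} MACs")
print(f"reduction: {std_params / ds_params:.2f}x")
```

For this shape the separable layer needs 4,672 weights instead of 36,864, roughly a 7.9x reduction, and the MAC count shrinks by the same factor because both layers produce the same number of output positions.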

Memory footprint and execution time are the two most important factors when running keyword spotting on Cortex-M processors, and both should be considered when designing and optimizing neural networks for that purpose. Based on typical Cortex-M system configurations, the table below defines three sets of constraints for neural networks targeting small, medium and large Cortex-M systems.

Neural network (NN) classes for KWS models, assuming 10 inferences per second and 8-bit weights/activations
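The two assumptions in the caption translate directly into arithmetic: with 8-bit values, each weight or activation costs one byte, and at 10 inferences per second the per-second operation count is ten times the per-inference count. The sketch below shows one way such a check could be written. The function names, the model figures and the budget figures are all hypothetical placeholders, not values taken from the table.

```python
INFERENCES_PER_SECOND = 10  # from the caption
BITS_PER_VALUE = 8          # 8-bit weights and activations, from the caption


def model_requirements(num_weights, peak_activations, ops_per_inference):
    """Return (memory in bytes, operations per second) for one model."""
    bytes_per_value = BITS_PER_VALUE // 8
    memory_bytes = (num_weights + peak_activations) * bytes_per_value
    ops_per_second = ops_per_inference * INFERENCES_PER_SECOND
    return memory_bytes, ops_per_second


def fits(requirements, memory_budget_bytes, ops_budget_per_second):
    """True if the model stays within both the memory and the compute budget."""
    memory_bytes, ops_per_second = requirements
    return (memory_bytes <= memory_budget_bytes
            and ops_per_second <= ops_budget_per_second)


# Hypothetical model and hypothetical budget, for illustration only.
req = model_requirements(num_weights=38_000,
                         peak_activations=2_000,
                         ops_per_inference=500_000)
print(req)
print(fits(req, memory_budget_bytes=80 * 1024, ops_budget_per_second=6_000_000))
```

A model has to satisfy both limits at once: a network that fits in memory but exceeds the operation budget cannot keep up with real-time audio.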

To tune the models to fit within the memory and compute constraints, a hyperparameter search needs to be performed. The table below shows the neural network architectures and the corresponding hyperparameters that need to be optimized.

Neural network hyperparameters search space

The search is performed iteratively: an exhaustive search of the feature extraction and NN model hyperparameters is followed by manual selection to narrow down the search space. The best-performing models for each neural network architecture, along with their memory requirements and operation counts, are summarized in the figure below. The DS-CNN architecture provides the best accuracy while requiring significantly less memory and compute.

Memory vs. ops/inference of the best NN models

The KWS application was deployed on a Cortex-M7-based STM32F746G-DISCO development board (shown below), using a DNN model with 8-bit weights and 8-bit activations, with KWS running at 10 inferences per second. Each inference – including memory copying, MFCC feature extraction and DNN execution – takes about 12 ms. To save power, the microcontroller can be put into Wait-for-Interrupt (WFI) mode for the remaining time. The entire KWS application occupies ~70 KB of memory, including ~66 KB for weights, ~1 KB for activations and ~2 KB for audio I/O and MFCC features.
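As a quick sanity check on these figures, the short sketch below works out how much of each second the processor is busy and how much time is left for WFI. It uses only the approximate numbers quoted above; the variable names are illustrative.

```python
inference_time_ms = 12       # approximate time per inference, from the text
inferences_per_second = 10   # from the text

# Time spent computing in each second, and the resulting duty cycle.
active_ms_per_second = inference_time_ms * inferences_per_second
duty_cycle = active_ms_per_second / 1000

# Time left over in each inference period, available for WFI.
idle_ms_per_inference = 1000 / inferences_per_second - inference_time_ms

# Approximate memory breakdown in KB, from the text.
memory_kb = {"weights": 66, "activations": 1, "audio I/O and MFCC features": 2}
total_kb = sum(memory_kb.values())

print(f"active: {active_ms_per_second} ms per second ({duty_cycle:.0%} duty cycle)")
print(f"idle per inference period: {idle_ms_per_inference:.0f} ms")
print(f"itemized memory: ~{total_kb} KB")
```

At roughly 12 ms of work in every 100 ms period, the core is active about 12% of the time and can wait in WFI for the other 88 ms or so. The itemized memory figures sum to about 69 KB, which is consistent with the quoted ~70 KB total given that each figure is rounded.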

Deployment of KWS on Cortex-M7 development board

In summary, state-of-the-art keyword spotting accuracy can be achieved on Arm Cortex-M processors by tuning the network architecture to limit its memory and compute requirements. The DS-CNN architecture provides the best accuracy while requiring significantly less memory and compute.

The code, model definitions and pre-trained models are available on GitHub.

Our new Machine Learning developer site provides a one-stop repository of resources, detailed product information and tutorials to help tackle the challenges of ML at the edge.

This blog is based on the whitepaper Hello Edge: Keyword Spotting on Microcontrollers, which was originally hosted on the Cornell University Library site. To download a copy of the Arm whitepaper, please use the button below.

Download The Power of Speech whitepaper: https://developer.arm.com/technologies/machine-learning-on-arm/developer-material/white-papers/the-power-of-speech

eera_l over 5 years ago

Thanks for the interesting article! How did you run inferences on live audio on the Cortex-M7?

I deployed the model on the board following the instructions on the GitHub repo, but the screen of the Cortex has gone blank and I can't seem to be able to run any inferences on it or to get a GUI like the one in the last picture of the article.

Thanks :)