It IS possible to optimize neural network architectures to fit within the memory and compute constraints of microcontrollers – without sacrificing accuracy. We explain how, and explore the potential of depthwise separable convolutional neural networks for implementing keyword spotting on Cortex-M processors.
Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy to ensure a good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech-processing algorithms.
Keyword spotting neural network pipeline
Due to its always-on nature, the KWS application has a highly constrained power budget. Although it can run on a dedicated DSP or a powerful CPU, it is also well suited to an Arm Cortex-M microcontroller, which is frequently used at the IoT edge to handle other tasks. Running KWS there also helps to minimize cost.
However, the deployment of a neural-network-based KWS on Cortex-M-based microcontrollers comes with the following challenges:
Limited memory footprint
Typical Cortex-M systems have a maximum of a few hundred KB of available memory. The entire neural network model, including input/output, weights and activations, has to fit within this minimal memory budget.
Limited compute resources
Since KWS is always on, the real-time requirement limits the total number of operations per neural network inference.
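To make the two constraints concrete, here is a minimal sketch that estimates the weight memory, peak activation memory and operation count of a stack of fully-connected layers. The function name `dnn_cost` and the layer sizes are hypothetical, chosen only for illustration, and the activation estimate assumes that only one layer's input and output buffers are live at a time.

```python
def dnn_cost(layer_sizes, bytes_per_value=1):
    """Estimate memory and compute for a stack of fully-connected layers.

    layer_sizes is [n_in, hidden_1, ..., n_out]; bytes_per_value=1 models
    8-bit weights and activations.
    """
    weight_bytes = 0
    ops = 0
    peak_activation_bytes = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weight_bytes += (n_in * n_out + n_out) * bytes_per_value  # weights + biases
        ops += 2 * n_in * n_out  # one multiply and one add per weight
        # Assumption: a layer's input and output buffers are live together.
        peak_activation_bytes = max(
            peak_activation_bytes, (n_in + n_out) * bytes_per_value
        )
    return weight_bytes, peak_activation_bytes, ops


# Hypothetical layer sizes, used only to exercise the function.
weights, activations, ops = dnn_cost([250, 144, 144, 144, 12])
ops_per_second = ops * 10  # assuming 10 inferences per second
print(weights, activations, ops, ops_per_second)
```

The weights dominate the footprint in this example, which is why reducing the parameter count matters so much on these devices.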
These are the typical neural network architectures for KWS inference:
A deep neural network (DNN) is a standard feed-forward neural network made of a stack of fully-connected layers and non-linear activation layers.
One main drawback of a DNN-based KWS is that it fails to efficiently model the local temporal and spectral correlation in the speech features. Convolutional neural networks (CNNs) exploit this correlation by treating the input time-domain and spectral-domain features as an image and performing 2D convolution operations over it.
Recurrent neural networks (RNNs) have shown superior performance in many sequence modeling tasks, especially speech recognition, language modeling and translation. RNNs not only exploit the temporal relationships within the input signal, but also capture long-term dependencies using a 'gating' mechanism.
A convolutional recurrent neural network (CRNN) is a hybrid of a CNN and an RNN that exploits the local temporal/spatial correlation. A CRNN model starts with a convolution layer, followed by an RNN to encode the signal and a dense fully-connected layer.
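As a rough illustration of that structure, the following sketch runs a forward pass through a stack of fully-connected layers in plain Python. The choice of ReLU as the non-linearity and the helper names are assumptions made for this example, not details taken from the article.

```python
def relu(values):
    """Element-wise rectified linear unit (assumed activation)."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully-connected layer; weights holds one row per output unit."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def dnn_forward(x, layers):
    """Apply each (weights, bias) pair, with ReLU after all but the last layer."""
    for index, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias)
        if index < len(layers) - 1:
            x = relu(x)
    return x


# Tiny hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
print(dnn_forward([3.0, 1.0], layers))  # [2.5]
```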
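The 2D convolution mentioned here can be sketched as follows: a small kernel slides over a time-by-frequency feature map, so each output value depends only on a local patch. This is a single-channel, stride-1, no-padding version written for clarity; the function name and the toy values are illustrative.

```python
def conv2d_valid(feature_map, kernel):
    """Single-channel 2D convolution (cross-correlation), stride 1, no padding.

    feature_map is a list of rows (e.g. time frames), each a list of
    feature values (e.g. spectral coefficients).
    """
    rows, cols = len(feature_map), len(feature_map[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += feature_map[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


# Toy 3x3 "spectrogram" and a 2x2 kernel.
features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(conv2d_valid(features, [[1, 0], [0, 1]]))  # [[6.0, 8.0], [12.0, 14.0]]
```

Because the same kernel is reused at every position, the layer needs far fewer weights than a fully-connected layer over the same input.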
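The article does not say which gated cell is used, so the sketch below assumes a GRU-style cell with a single hidden unit and scalar parameters. It shows how the update gate `z` decides how much of the previous state is kept, which is what lets such cells carry information across many time steps. All parameter names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_step(x, h_prev, p):
    """One step of a single-unit GRU-style cell; p maps names to scalars."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Convention assumed here: z close to 0 keeps the old state.
    return (1.0 - z) * h_prev + z * h_candidate


def run_sequence(inputs, p, h0=0.0):
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in inputs:
        h = gated_step(x, h, p)
    return h


# With a strongly negative update-gate bias the cell remembers its initial state.
params = {"wz": 0.0, "uz": 0.0, "bz": -50.0,
          "wr": 1.0, "ur": 1.0, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}
print(run_sequence([1.0, -2.0, 3.0], params, h0=0.7))
```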
Recently, depthwise separable convolution has been proposed as an efficient alternative to the standard 3D convolution operation and has been used to achieve compact network architectures for computer vision.
A DS-CNN first convolves each channel in the input feature map with a separate 2D filter and then uses pointwise convolutions (i.e. 1x1) to combine the outputs in the depth dimension. By decomposing the standard 3D convolutions into 2D and then 1D, the number of parameters and operations is reduced, making deeper and wider architectures possible even in resource-constrained microcontroller devices.
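The saving can be checked with a short calculation. The sketch below counts weights and multiply-accumulate operations (MACs) for a standard convolution and for its depthwise separable equivalent, ignoring biases and assuming both produce the same output size. The layer dimensions are hypothetical.

```python
def standard_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs for a standard convolution layer (biases ignored)."""
    params = k_h * k_w * c_in * c_out
    macs = params * out_h * out_w
    return params, macs


def ds_conv_cost(k_h, k_w, c_in, c_out, out_h, out_w):
    """Weights and MACs for a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = k_h * k_w * c_in   # one 2D filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution across channels
    params = depthwise + pointwise
    macs = params * out_h * out_w
    return params, macs


# Hypothetical layer: 3x3 kernel, 64 -> 64 channels, 25x5 output map.
std_params, std_macs = standard_conv_cost(3, 3, 64, 64, 25, 5)
ds_params, ds_macs = ds_conv_cost(3, 3, 64, 64, 25, 5)
print(std_params, ds_params, round(std_params / ds_params, 2))  # 36864 4672 7.89
```

In general the reduction factor is 1 / (1/c_out + 1/(k_h * k_w)), so it approaches the kernel area as the number of output channels grows.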
Memory footprint and execution time are the two most important factors when running keyword spotting on Cortex-M processors, and they should be considered when designing and optimizing neural networks for that purpose. Based on typical Cortex-M system configurations, here are three sets of constraints for neural networks (shown below) targeting small, medium and large Cortex-M systems.
Neural network (NN) classes for KWS models, assuming 10 inferences per second and 8-bit weights/activations
To tune the models to fit within the memory and compute constraints, a hyperparameter search needs to be performed. The table below shows the neural network architectures and corresponding hyperparameters which need to be optimized.
Neural network hyperparameters search space
An exhaustive search of feature extraction and NN model hyperparameters, followed by manual selection to narrow down the search space, is iteratively performed. The final best performing models for each neural network architecture, along with their memory requirements and operations, are summarized in the figure below. The DS-CNN architecture provides the best accuracy while requiring significantly lower memory and compute resources.
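One part of such a search, discarding candidates that cannot fit the target device before any training is done, might look like the sketch below. It enumerates a small grid of fully-connected network shapes and keeps those within a memory and compute budget. The grid, the budget values and the function names are illustrative assumptions and are not taken from the article's tables. Training and accuracy evaluation of the surviving candidates are outside the scope of this example.

```python
from itertools import product


def dense_stack_cost(n_in, width, depth, n_out):
    """Weight bytes (8-bit) and ops per inference for a stack of dense layers."""
    sizes = [n_in] + [width] * depth + [n_out]
    pairs = list(zip(sizes, sizes[1:]))
    weight_bytes = sum(a * b + b for a, b in pairs)  # weights + biases, 1 byte each
    ops = sum(2 * a * b for a, b in pairs)           # multiply + add per weight
    return weight_bytes, ops


def feasible_configs(widths, depths, n_in, n_out,
                     max_bytes, max_ops_per_s, inferences_per_s=10):
    """Return the grid points that satisfy both budgets."""
    keep = []
    for width, depth in product(widths, depths):
        weight_bytes, ops = dense_stack_cost(n_in, width, depth, n_out)
        if weight_bytes <= max_bytes and ops * inferences_per_s <= max_ops_per_s:
            keep.append({"width": width, "depth": depth,
                         "bytes": weight_bytes, "ops": ops})
    return keep


# Illustrative budget only: 80 KB of weights and 6 million ops per second.
candidates = feasible_configs(
    widths=[64, 144, 512], depths=[1, 3], n_in=250, n_out=12,
    max_bytes=80 * 1024, max_ops_per_s=6_000_000,
)
for config in candidates:
    print(config)
```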
Memory vs. ops/inference of the best NN models
The KWS application was deployed on a Cortex-M7-based STM32F746G-DISCO development board (shown below), using a DNN model with 8-bit weights and 8-bit activations, with KWS running at 10 inferences per second. Each inference – including memory copying, MFCC feature extraction and DNN execution – takes about 12 ms. To save power, the microcontroller can be put into Wait-for-Interrupt (WFI) mode for the remaining time. The entire KWS application occupies ~70 KB memory, including ~66 KB for weights, ~1 KB for activations and ~2 KB for audio I/O and MFCC features.
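The figures above imply how much of each period the core is busy. The short calculation below uses only the numbers quoted in the paragraph; the function name is illustrative.

```python
def duty_cycle(inference_ms, inferences_per_s):
    """Fraction of each second the core spends on inference."""
    return inference_ms * inferences_per_s / 1000.0


period_ms = 1000 / 10                # 10 inferences per second -> 100 ms period
idle_ms = period_ms - 12             # time available for Wait-for-Interrupt
active_fraction = duty_cycle(12, 10)
memory_kb = 66 + 1 + 2               # weights + activations + audio I/O and MFCC

print(idle_ms, active_fraction, memory_kb)  # 88.0 0.12 69
```

So the core is active for roughly 12% of the time, and the itemized memory adds up to about 69 KB, consistent with the quoted total of ~70 KB.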
Deployment of KWS on Cortex-M7 development board
In summary, state-of-the-art keyword spotting accuracy can be achieved on Arm Cortex-M processors by tuning the network architecture to stay within their memory and compute limits. The DS-CNN architecture provides the best accuracy while requiring significantly lower memory and compute resources.
The code, model definitions and pre-trained models are available on GitHub.
Our new Machine Learning developer site provides a one-stop repository of resources, detailed product information and tutorials to help tackle the challenges of ML at the edge.
This blog is based on the whitepaper Hello Edge: Keyword Spotting on Microcontrollers, which was originally hosted on the Cornell University Library site.
Thanks for the interesting article! How did you run inferences on live audio on the Cortex-M7?
I deployed the model on the board following the instructions on the GitHub repo, but the screen of the Cortex has gone blank and I can't seem to be able to run any inferences on it or to get a GUI like the one in the last picture of the article.