Fast and accurate keyword spotting using Transformers

Axel Berg
January 10, 2022
3 minute read time.

Smart voice assistants like Google Assistant and Siri are interesting applications of low-footprint machine learning, where neural networks are one component of the computational workload. A voice assistant pipeline, an example of which is shown in Figure 1, consists of several stages. The first stage is typically some form of wake-word detection, where the assistant listens for a trigger phrase. Once the trigger phrase is detected, the next stage is to detect the presence of keywords that belong to a small dictionary. Examples of keywords are common phrases such as "play" or "pause", which give specific instructions, for example when listening to music. Since these two tasks are less compute-intensive than general automatic speech recognition (ASR), they can be performed on-device with low latency. If no keyword is detected, the voice data is typically sent to a server where ASR is performed, although on-device ASR is now becoming feasible for devices with enough processing power. On-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.

Figure 1: An example speech processing pipeline for smart assistants.
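
To make the cascade concrete, here is a minimal Python sketch of the dispatch logic in Figure 1. The detector objects, method names, and confidence threshold are illustrative assumptions for this post, not part of any Arm or vendor API.

```python
def process_utterance(audio, wake_word_detector, keyword_spotter, server_asr):
    """Route an audio buffer through a cascaded voice-assistant pipeline."""
    # Stage 1: cheap, always-on trigger-phrase detection.
    if not wake_word_detector.is_triggered(audio):
        return None  # stay in low-power listening mode

    # Stage 2: on-device keyword spotting over a small dictionary.
    keyword, confidence = keyword_spotter.classify(audio)
    if keyword != "non-keyword" and confidence > 0.9:
        return keyword  # for example "play" or "pause", handled locally

    # Stage 3: no keyword found, fall back to full ASR (typically server-side).
    return server_asr.transcribe(audio)
```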

Transformer networks 

Previous state-of-the-art keyword spotting models have relied on convolutional or recurrent neural networks optimized for low latency. However, Transformer networks are becoming increasingly popular for tackling problems in both natural language processing and computer vision. The Transformer is a neural network architecture that leverages self-attention, which means that features are computed dynamically by allowing different parts of the input to attend to each other. In a recent paper, presented at Interspeech 2021, we investigate to what extent the Transformer is suitable for the keyword spotting task, with regard to both accuracy and latency.
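
As a refresher, the sketch below implements plain scaled dot-product self-attention in NumPy. The shapes (98 time slots, 64-dimensional features) are illustrative, chosen only to echo the keyword spotting setting described below.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of tokens.

    x: (num_tokens, d_model); every token attends to every other token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v                               # attention-weighted mixture

rng = np.random.default_rng(0)
x = rng.normal(size=(98, 64))                        # 98 tokens, d_model = 64
w_q, w_k, w_v = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # (98, 64) updated features
```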

Figure 2: Our proposed approach to keyword spotting.

Time-domain attention is all you need 

Our keyword spotting pipeline is shown in Figure 2. The raw audio waveform is pre-processed by dividing the signal into a set of time slots and extracting the mel-frequency cepstrum coefficients (MFCCs) for each slot. Each set of MFCCs is then treated as an input token to the Transformer model, which computes self-attention between the tokens. This allows the model to extract audio features based on how different time slots interact with each other, and makes the features more descriptive than those of the traditional neural networks previously used for keyword spotting. The Transformer outputs a global feature vector that is fed into a multi-layer perceptron (MLP), which classifies the audio as one of the keywords in the dictionary or as a non-keyword. We have named our model the Keyword Transformer (KWT) and present three versions, KWT-1, KWT-2, and KWT-3, where the number indicates increasing model complexity, allowing a trade-off between accuracy and latency.
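
The Keras sketch below captures the overall shape of this pipeline: MFCC tokens, a stack of self-attention encoder blocks, and a classification head. All layer sizes are illustrative, and the published KWT models additionally use a learnable class token and positional embeddings, which are simplified away here; see the paper and code for the exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_kwt_style_model(n_slots=98, n_mfcc=40, d_model=64, n_heads=4,
                          n_layers=2, n_classes=12):
    """A simplified KWT-style classifier (not the exact KWT-1/2/3 configs)."""
    mfcc = layers.Input(shape=(n_slots, n_mfcc))   # one token per time slot
    x = layers.Dense(d_model)(mfcc)                # linear token embedding
    for _ in range(n_layers):                      # Transformer encoder blocks
        attn = layers.MultiHeadAttention(n_heads, d_model // n_heads)(x, x)
        x = layers.LayerNormalization()(x + attn)  # self-attention + residual
        ff = layers.Dense(4 * d_model, activation="gelu")(x)
        x = layers.LayerNormalization()(x + layers.Dense(d_model)(ff))
    x = layers.GlobalAveragePooling1D()(x)         # global feature vector
    out = layers.Dense(n_classes, activation="softmax")(x)  # MLP head
    return tf.keras.Model(mfcc, out)

model = build_kwt_style_model()
probs = model(tf.random.normal((1, 98, 40)))       # (1, 12) class probabilities
```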

Figure 3: Latency and accuracy on a single thread on a mobile phone.

The future of tiny Machine Learning 

Experimental results show that KWT works better than initially expected for keyword spotting. It achieves state-of-the-art classification accuracy on the Google Speech Commands dataset: 98.6% and 97.7% on the 12- and 35-word tasks respectively, outperforming all previous methods. We have also converted our model to TensorFlow Lite format and measured the inference latency on a OnePlus 6 mobile phone based on the Snapdragon 845 (4x Arm Cortex-A75, 4x Arm Cortex-A55). Figure 3 shows the measured accuracy and latency for the 12- and 35-word tasks, for KWT and a set of other popular keyword spotting models. The results show that KWT is also competitive with regard to latency. This indicates that accelerating Transformers might soon become an important workload for keyword spotting and other edge applications, and there is the further possibility of making them even faster using sparse attention techniques and model compression.
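
For readers who want to reproduce this kind of measurement, the sketch below converts a Keras model to TensorFlow Lite and times single-threaded inference. It reuses the illustrative build_kwt_style_model from above and runs on a host machine; our reported numbers were measured on the phone itself, so treat this as a template rather than our exact benchmark setup.

```python
import time
import numpy as np
import tensorflow as tf

model = build_kwt_style_model()                    # illustrative model from above
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=1)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.random.randn(1, 98, 40).astype(np.float32))

interpreter.invoke()                               # warm-up run
start = time.perf_counter()
runs = 100
for _ in range(runs):
    interpreter.invoke()
print(f"mean latency: {(time.perf_counter() - start) / runs * 1e3:.2f} ms")
```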

If you would like to train a KWT model yourself or use one of our pre-trained models, the code is available on GitHub.  


Interested in Keyword Spotting?

Read a previous post on keyword spotting here: High Accuracy Keyword Spotting on Cortex-M Processors 
