Smart voice assistants like Google Assistant and Siri are interesting applications of low-footprint machine learning, where neural networks are one component of the computational workload. A voice assistant pipeline, an example of which is shown in Figure 1, consists of several stages. The first stage is typically wake-word detection, where the assistant listens for a trigger phrase. When the trigger phrase is detected, the next stage is to detect the presence of keywords that belong to a small dictionary. Examples of keywords are common phrases such as "play" or "pause", which give specific instructions, for example when listening to music. Since these two tasks are less compute-intensive than general automatic speech recognition (ASR), they can be performed on-device with low latency. If no keyword is detected, the voice data is typically sent to a server where ASR is performed, although on-device ASR is now becoming feasible for devices with enough processing power. On-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.
Figure 1: An example speech processing pipeline for smart assistants.
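To make the staged hand-off concrete, here is a minimal control-flow sketch of such a pipeline. The detector functions are trivial stubs introduced purely for illustration (they are not part of any real assistant SDK); real systems run a dedicated model at each stage.

```python
# Hedged sketch of the staged control flow described above.
# All detector functions are hypothetical stubs for illustration only.

def detect_wake_word(frame):
    return frame == "hey assistant"           # stub stage-1 wake-word detector

def spot_keyword(utterance, dictionary=("play", "pause")):
    return utterance if utterance in dictionary else None  # stub stage-2 keyword spotter

def server_asr(utterance):
    return f"<transcription of '{utterance}'>"              # stub stage-3 ASR fallback

def handle(frames):
    for i, frame in enumerate(frames):
        if not detect_wake_word(frame):
            continue                           # stage 1: keep listening
        utterance = frames[i + 1] if i + 1 < len(frames) else ""
        keyword = spot_keyword(utterance)      # stage 2: on-device keyword spotting
        return keyword if keyword else server_asr(utterance)  # stage 3: fall back to ASR

print(handle(["hey assistant", "pause"]))      # handled on-device -> "pause"
print(handle(["hey assistant", "call mom"]))   # no keyword -> sent to ASR
```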
Previous state-of-the-art keyword spotting models have relied on convolutional or recurrent neural networks optimized for low latency. However, Transformer networks are becoming increasingly popular for tackling problems in both natural language processing and computer vision. The Transformer is a neural network architecture built on self-attention, meaning that features are computed dynamically by allowing different parts of the input to attend to each other. In a recent paper, presented at Interspeech 2021, we investigate to what extent the Transformer is suited to the keyword spotting task, with regard to both accuracy and latency.
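To illustrate the mechanism, the sketch below implements single-head self-attention in plain NumPy. It is only an illustration of how every token attends to every other token, not the KWT implementation, and the dimensions are arbitrary.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise token interactions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over tokens
    return weights @ v                                  # attention-weighted features

rng = np.random.default_rng(0)
tokens = rng.standard_normal((98, 64))                  # e.g. 98 time slots, 64-dim each
w_q, w_k, w_v = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
features = self_attention(tokens, w_q, w_k, w_v)        # shape (98, 64)
```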
Figure 2: Our proposed approach to keyword spotting.
Our keyword spotting pipeline is shown in Figure 2. The raw audio waveform is pre-processed by dividing the signal into a set of time slots and extracting the mel-frequency cepstrum coefficients (MFCCs) for each slot. Each set of MFCCs is then treated as an input token to the Transformer model, which computes self-attention between the tokens. This allows the model to extract audio features based on how different time slots interact with each other, making the features more descriptive than those of the traditional neural networks previously used for keyword spotting. The Transformer outputs a global feature vector that is fed into a multi-layer perceptron (MLP), which classifies the audio as one of the keywords in the dictionary or as a non-keyword. We have named our model the Keyword Transformer (KWT) and present three versions, KWT-1, KWT-2, and KWT-3, where each number indicates increasing complexity, allowing a trade-off between accuracy and latency.
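As a rough illustration of the pre-processing step, the following sketch extracts per-slot MFCC tokens with librosa. The file name, window and hop sizes, and the choice of 40 coefficients are illustrative assumptions, not necessarily the settings used in the paper.

```python
import librosa

# Load a roughly 1-second clip at 16 kHz (placeholder file name).
audio, sr = librosa.load("keyword.wav", sr=16000, duration=1.0)

# Divide the signal into time slots and compute MFCCs per slot.
mfcc = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=40,
    n_fft=int(0.030 * sr),        # 30 ms analysis window (assumed)
    hop_length=int(0.010 * sr),   # 10 ms hop -> roughly 100 time slots
)

tokens = mfcc.T  # shape (num_time_slots, 40): one input token per time slot
# Each row of `tokens` is fed to the Transformer as a token; the MLP head then
# maps the resulting global feature vector to keyword / non-keyword classes.
```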
Figure 3: Latency and accuracy on a single thread on a mobile phone.
Experimental results show that KWT works better than initially expected for keyword spotting. It achieves state-of-the-art classification accuracy on the Google Speech Commands dataset: 98.6% and 97.7% on the 12- and 35-word tasks respectively, outperforming all previous methods. We have also converted our model to TensorFlow Lite format and measured the inference latency on a OnePlus 6 mobile device based on the Snapdragon 845 (4x Arm Cortex-A75, 4x Arm Cortex-A55). Figure 3 shows the measured accuracy and latency for the 12- and 35-word tasks, for KWT and a set of other popular keyword spotting models, and shows that KWT is also competitive with regard to latency. This indicates that accelerating Transformers may soon become an important workload for keyword spotting and other edge applications, with the possibility of making them even faster using sparse attention techniques and model compression.
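For reference, converting a trained Keras model to TensorFlow Lite for this kind of on-device measurement typically looks like the sketch below. The tiny model here is only a stand-in for a trained KWT network, and the single-threaded interpreter mirrors the measurement setup; it is not the exact conversion script used for the paper.

```python
import tensorflow as tf

# Stand-in for a trained KWT network (the real model comes from the training
# code on GitHub); any Keras model converts the same way.
inputs = tf.keras.Input(shape=(98, 40))              # (time slots, MFCCs per slot)
x = tf.keras.layers.Flatten()(inputs)
outputs = tf.keras.layers.Dense(12, activation="softmax")(x)  # 12-word task
kwt_model = tf.keras.Model(inputs, outputs)

# Convert to TensorFlow Lite for on-device benchmarking.
converter = tf.lite.TFLiteConverter.from_keras_model(kwt_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("kwt.tflite", "wb") as f:
    f.write(tflite_model)

# Single-threaded inference, mirroring the single-thread latency measurement.
interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=1)
interpreter.allocate_tensors()
```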
If you would like to train a KWT model yourself or use one of our pre-trained models, the code is available on GitHub.
Read a previous post on keyword spotting here: High Accuracy Keyword Spotting on Cortex-M Processors