Fast and accurate keyword spotting using Transformers

Axel Berg
January 10, 2022
3 minute read time.

Smart voice assistants like Google Assistant and Siri are interesting applications of low-footprint machine learning, where neural networks are one component of the computational workload. A voice assistant pipeline, an example of which is shown in Figure 1, consists of several stages. The first stage is typically some form of wake-word detection, where the assistant listens for a trigger phrase. Once the trigger phrase is detected, the next stage is to detect the presence of keywords that belong to a small dictionary. Examples of keywords are common phrases such as "play" or "pause", which give specific instructions, for example when listening to music. Since these two tasks are less compute-intensive than general automatic speech recognition (ASR), they can be performed on-device with low latency. If no keyword is detected, the voice data is typically sent to a server where ASR is performed, although on-device ASR is now becoming feasible for devices with enough processing power. On-device keyword spotting is also useful when no internet connection is available or when data privacy is a concern.

Figure 1: An example speech processing pipeline for smart assistants.
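
To make the cascade concrete, here is a minimal Python sketch of the dispatch logic in Figure 1. The detector objects, method names, and confidence threshold are illustrative assumptions for this post, not part of any Arm or vendor API.

```python
def process_utterance(audio, wake_word_detector, keyword_spotter, server_asr):
    """Route an audio buffer through a cascaded voice-assistant pipeline."""
    # Stage 1: cheap, always-on trigger-phrase detection.
    if not wake_word_detector.is_triggered(audio):
        return None  # stay in low-power listening mode

    # Stage 2: on-device keyword spotting over a small dictionary.
    keyword, confidence = keyword_spotter.classify(audio)
    if keyword != "non-keyword" and confidence > 0.9:
        return keyword  # for example "play" or "pause", handled locally

    # Stage 3: no keyword found, fall back to full ASR (typically server-side).
    return server_asr.transcribe(audio)
```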

Transformer networks 

Previous state-of-the-art keyword spotting models have relied on convolutional or recurrent neural networks optimized for low latency. However, Transformer networks are becoming increasingly popular for tackling problems in both natural language processing and computer vision. The Transformer is a neural network architecture that leverages self-attention, which means that features are computed dynamically by allowing different parts of the input to attend to each other. In a recent paper, presented at Interspeech 2021, we investigate to what extent the Transformer is suitable for the keyword spotting task, with regard to both accuracy and latency.
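
As a refresher, the sketch below implements plain scaled dot-product self-attention in NumPy. The shapes (98 time slots, 64-dimensional features) are illustrative, chosen only to echo the keyword spotting setting described below.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of tokens.

    x: (num_tokens, d_model); every token attends to every other token.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v                               # attention-weighted mixture

rng = np.random.default_rng(0)
x = rng.normal(size=(98, 64))                        # 98 tokens, d_model = 64
w_q, w_k, w_v = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)               # (98, 64) updated features
```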

Figure 2: Our proposed approach to keyword spotting.

Time-domain attention is all you need 

Our keyword spotting pipeline is shown in Figure 2. The raw audio waveform is pre-processed by dividing the signal into a set of time slots and extracting the mel-frequency cepstrum coefficients (MFCCs) for each slot. Each set of MFCCs is then treated as an input token to the Transformer model, which computes self-attention between the tokens. This allows the model to extract audio features based on how different time slots interact with each other, and makes the features more descriptive than those of the traditional neural networks previously used for keyword spotting. The Transformer outputs a global feature vector that is fed into a multi-layer perceptron (MLP), which classifies the audio as one of the keywords in the dictionary or as a non-keyword. We have named our model the Keyword Transformer (KWT) and present three versions, KWT-1, KWT-2, and KWT-3, where the number indicates increasing model complexity, allowing a trade-off between accuracy and latency.
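
The Keras sketch below captures the overall shape of this pipeline: MFCC tokens, a stack of self-attention encoder blocks, and a classification head. All layer sizes are illustrative, and the published KWT models additionally use a learnable class token and positional embeddings, which are simplified away here; see the paper and code for the exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_kwt_style_model(n_slots=98, n_mfcc=40, d_model=64, n_heads=4,
                          n_layers=2, n_classes=12):
    """A simplified KWT-style classifier (not the exact KWT-1/2/3 configs)."""
    mfcc = layers.Input(shape=(n_slots, n_mfcc))   # one token per time slot
    x = layers.Dense(d_model)(mfcc)                # linear token embedding
    for _ in range(n_layers):                      # Transformer encoder blocks
        attn = layers.MultiHeadAttention(n_heads, d_model // n_heads)(x, x)
        x = layers.LayerNormalization()(x + attn)  # self-attention + residual
        ff = layers.Dense(4 * d_model, activation="gelu")(x)
        x = layers.LayerNormalization()(x + layers.Dense(d_model)(ff))
    x = layers.GlobalAveragePooling1D()(x)         # global feature vector
    out = layers.Dense(n_classes, activation="softmax")(x)  # MLP head
    return tf.keras.Model(mfcc, out)

model = build_kwt_style_model()
probs = model(tf.random.normal((1, 98, 40)))       # (1, 12) class probabilities
```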

Figure 3: Latency and accuracy on a single thread on a mobile phone.

The future of tiny Machine Learning 

Experimental results show that KWT works better than initially expected for keyword spotting. It achieves state-of-the-art classification accuracy on the Google Speech Commands dataset: 98.6% and 97.7% on the 12- and 35-word tasks respectively, outperforming all previous methods. We have also converted our model to TensorFlow Lite format and measured the inference latency on a OnePlus 6 mobile phone based on the Snapdragon 845 (4x Arm Cortex-A75, 4x Arm Cortex-A55). Figure 3 shows the measured accuracy and latency for the 12- and 35-word tasks, for KWT and a set of other popular keyword spotting models. The results show that KWT is also competitive with regard to latency. This indicates that accelerating Transformers might soon become an important workload for keyword spotting and other edge applications, and there is the further possibility of making them even faster using sparse attention techniques and model compression.
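
For readers who want to reproduce this kind of measurement, the sketch below converts a Keras model to TensorFlow Lite and times single-threaded inference. It reuses the illustrative build_kwt_style_model from above and runs on a host machine; our reported numbers were measured on the phone itself, so treat this as a template rather than our exact benchmark setup.

```python
import time
import numpy as np
import tensorflow as tf

model = build_kwt_style_model()                    # illustrative model from above
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()

interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=1)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.random.randn(1, 98, 40).astype(np.float32))

interpreter.invoke()                               # warm-up run
start = time.perf_counter()
runs = 100
for _ in range(runs):
    interpreter.invoke()
print(f"mean latency: {(time.perf_counter() - start) / runs * 1e3:.2f} ms")
```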

If you would like to train a KWT model yourself or use one of our pre-trained models, the code is available on GitHub.  


Interested in Keyword Spotting?

Read a previous post on keyword spotting here: High Accuracy Keyword Spotting on Cortex-M Processors 
