***All content in this blog was written by Vikrant Tomar, Sam Myer, and Kirsten Joe of Fluent.ai***
The rise of deep learning and related Artificial Intelligence (AI) technologies in recent years has fueled a proliferation of voice user interfaces in the devices around us; examples include assistants and smart speakers such as Siri, Google Home, and Amazon Alexa. However, most of these devices are internet-connected, which raises privacy concerns. Processing speech data offline would let these devices respect user privacy while also minimizing latency. The work being done at Fluent.ai aims to achieve this by optimizing complex speech recognition and understanding models for low-power devices such as the Arm Cortex-M series.
Speech recognition technologies have traditionally relied on cloud computing because of their high computational requirements. These systems typically follow a two-step process: first, they transcribe user speech into text; then they use Natural Language Processing (NLP) to derive meaning from the text. While this approach brings the benefit of being able to search the Internet, it has significant shortcomings: privacy concerns about virtual assistants listening in on user conversations, the inability to use these technologies in environments without Internet access, and limited language and accent support. For example, there have been news reports of contract workers listening to the private voice data of internet-connected virtual assistant users. For these reasons, there is growing market demand for a more flexible, more secure speech recognition solution that works offline on small devices.
This is the problem that Fluent.ai wants to solve: how to take speech recognition off the cloud and embed it on small-footprint platforms, while still providing high accuracy and robustness for any language, any accent, and any environment. Fluent.ai is focused on the next wave of voice user interfaces, which will be led by low-power, not-always-connected devices. These devices are often battery powered, and the Arm Cortex-M series of MCUs provides an ideal platform for them thanks to its power- and cost-efficient implementations. The Fluent.ai low-footprint speech understanding algorithms work fully offline on embedded systems, including Arm Cortex-M systems. Using ground-breaking deep neural network-based speech-to-intent technology, Fluent.ai solutions map user speech directly to the intended action. This mapping completely removes the need for speech-to-text transcription and a separate NLP step.
This acoustic-only approach offers several advantages. Because the models are end-to-end optimized, they are smaller yet highly accurate and robust to noise. The approach also allows Fluent.ai to quickly develop models in any language, reducing time and cost to market for our partners. Finally, Fluent.ai has developed the only truly multilingual models in the industry that can recognize multiple languages concurrently, allowing users to switch seamlessly between languages with no need to configure language settings in between.
Fluent.ai offers two main product lines: Fluent.ai WakeWord, for wake phrase detection, and Fluent.ai Air, for automatic intent recognition. In summary, Fluent.ai's unique speech-to-intent Air models offer fully offline, low-power, and low-latency speech understanding that can be trained to recognize any language, accent, or combination of languages and accents in a single small-footprint software model.
Fluent µCore is Fluent.ai's proprietary inference engine, built on top of the Arm CMSIS-DSP and CMSIS-NN libraries and optimized for low-footprint devices and low latency. Fluent µCore includes several innovations, such as selective compilation for optimal code size, quantization of network weights from 32-bit floats to 8-bit integers, and real-time processing with streaming neural network computations.
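To make the quantization step concrete, the sketch below shows one common way 32-bit float weights can be mapped to 8-bit integers using a single per-tensor scale. It is illustrative only; the post does not describe the exact quantization scheme used in Fluent µCore, and the function name here is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: q = round(w / scale), scale = max|w| / 127.
// The int8 weights (plus one float scale) replace the float32 weights,
// cutting weight storage to roughly a quarter.
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float& scale) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        long v = std::lround(w[i] / scale);
        q[i] = static_cast<int8_t>(std::min(127L, std::max(-127L, v)));  // clamp to int8 range
    }
    return q;  // dequantized value is q[i] * scale
}
```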
Fluent µCore consists of the Fluent WakeWord (WW) engine and the Automatic Intent Recognition (Air) engine. Fluent WW continuously listens for one or more wake phrases: incoming audio is streamed through feature extraction and the WakeWord neural network to detect whether the audio contains a wake phrase. When a wake phrase is detected, µCore starts listening for the user's command or query, for example, “Turn off the lights”. During this stage, the input speech is evaluated by the intent recognition neural network. If the intent network detects that the user has spoken a valid intent, the system outputs a representation of the user's intent.
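This flow can be pictured as a small two-stage state machine in which the wake word network gates the intent network. The sketch below is a hypothetical illustration of that control flow; the type and function names are ours, not the Fluent µCore API, and the network steps are placeholders.

```cpp
#include <array>
#include <optional>

struct Frame { std::array<float, 40> features; };  // e.g. one frame of acoustic features

struct WakeWordNet {
    // Placeholder: a real engine streams the frame through the wake word network.
    bool step(const Frame&) { return false; }       // true when a wake phrase is detected
};

struct IntentNet {
    // Placeholder: returns an intent id once a valid intent has been decoded.
    std::optional<int> step(const Frame&) { return std::nullopt; }
};

enum class State { ListeningForWakeWord, ListeningForCommand };

class SpeechPipeline {
public:
    // Called once per incoming audio frame.
    std::optional<int> process(const Frame& f) {
        if (state_ == State::ListeningForWakeWord) {
            if (wake_.step(f)) state_ = State::ListeningForCommand;  // wake phrase heard
            return std::nullopt;
        }
        if (auto intent = intent_.step(f)) {        // e.g. “Turn off the lights” decoded
            state_ = State::ListeningForWakeWord;   // return to always-listening mode
            return intent;                          // representation of the user's intent
        }
        return std::nullopt;
    }

private:
    State state_ = State::ListeningForWakeWord;
    WakeWordNet wake_;
    IntentNet intent_;
};
```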
Fig 1. Fluent WakeWord engine (WW) and Automatic Intent Recognition (Air) engine
One of the main challenges in building an always-listening, low-power speech understanding system is processing speech in real time. During training, the entire utterance is available, and its length is finite and known ahead of time. During inference on a microcontroller, however, audio arrives as a stream, one frame at a time. This introduces unique issues. Decoding time must be minimized for a good user experience. When listening continuously, the audio has no fixed length or predetermined end, so neural networks must be computed over undetermined durations, often by applying the network in overlapping windows; any inference algorithm must do this efficiently while maintaining the same accuracy that is achieved during training. Furthermore, when training on GPUs, which are designed for batch processing, the activations for an entire utterance can be stored in memory. Memory is a scarce resource on microcontrollers, and it is not feasible to store activations for the entire network.
One potential way to deal with such unbounded temporal data is unidirectional Recurrent Neural Networks (RNNs), but RNNs can be computationally intensive. Convolutional networks are computationally more efficient, but the convolutions in existing libraries are designed primarily for image recognition, not for temporal data like speech. Fluent µCore has been designed to address these issues: it allows neural networks to be computed on streaming features. Fluent µCore takes advantage of the Arm CMSIS-NN library to use assembler optimizations on Arm platforms. Furthermore, Arm CMSIS-NN is open source under the Apache 2.0 license, making it an ideal candidate for other platforms too. Fluent µCore is built in a modular fashion, so that the underlying CMSIS library can easily be swapped for an updated version if needed.
Each layer in the neural network is represented as an object, with an array of weights, an activation buffer, and a processing function. On Arm Cortex chips, the weights are loaded from flash as needed. The processing function takes as input a vector for a single frame of audio, performs its calculations, updates its buffer, and emits an output vector. These layers can then be combined into a sequential network, as in the sketch below.
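As a rough illustration of this layer-as-object design, the following sketch shows how a layer might expose a per-frame process() function and how layers could be chained into a sequential network. It is our own simplified reading of the description above; the actual Fluent µCore types, requantization logic, and CMSIS-NN kernel dispatch are not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Vec = std::vector<int8_t>;

// Each layer owns its weights (resident in flash), keeps whatever internal buffer
// it needs, and processes one frame vector at a time.
struct Layer {
    virtual ~Layer() = default;
    virtual Vec process(const Vec& in) = 0;
};

// A fully connected layer over 8-bit weights. A production build would dispatch
// to an optimized kernel (for example from CMSIS-NN) instead of this plain loop.
struct FullyConnected : Layer {
    const int8_t* weights;  // out_dim x in_dim matrix, typically read from flash
    int in_dim, out_dim;

    FullyConnected(const int8_t* w, int in, int out) : weights(w), in_dim(in), out_dim(out) {}

    Vec process(const Vec& in) override {
        Vec out(out_dim);
        for (int o = 0; o < out_dim; ++o) {
            int32_t acc = 0;
            for (int i = 0; i < in_dim; ++i) acc += weights[o * in_dim + i] * in[i];
            int32_t scaled = acc >> 7;  // illustrative requantization back to int8
            out[o] = static_cast<int8_t>(std::max(-128, std::min(127, scaled)));
        }
        return out;
    }
};

// Layers combined into a sequential network: each audio frame flows layer by layer.
struct SequentialNet {
    std::vector<Layer*> layers;
    Vec process(const Vec& frame) {
        Vec x = frame;
        for (Layer* l : layers) x = l->process(x);
        return x;
    }
};
```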
Fig 2. Fluent µCore and CMSIS-NN
When performing windowed operations, like convolution or pooling, on streaming data, we only need to keep a buffer whose width equals the kernel width. This greatly reduces memory usage, because we keep features and activations only for the most recent frames rather than the entire utterance. It also removes the requirement that convolutional networks have a fixed input size. Normally, when applying a convolutional network to a time series, the entire network must be repeatedly applied to overlapping windows, resulting in redundant operations; streaming decoding eliminates these redundant calculations, saving CPU cycles. Because processing happens continuously while the user is speaking, this method also has lower decoding time. Our library can also generate C++ code that includes only the operations necessary for a given network, reducing the final binary size.
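The sketch below illustrates the buffering idea: a one-dimensional convolution streamed over time with a ring buffer whose width equals the kernel width. For clarity it uses floats and a single weight per time step shared across channels; the real engine operates on quantized activations and full convolution kernels, so treat this purely as a sketch of the scheme, not Fluent µCore code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class StreamingConv1D {
public:
    StreamingConv1D(std::vector<float> kernel, size_t channels)
        : kernel_(std::move(kernel)),
          channels_(channels),
          buffer_(kernel_.size() * channels, 0.0f) {}

    // Push one frame (size == channels). Returns the output for the window
    // ending at this frame once kernel_width frames have been seen.
    std::vector<float> push(const std::vector<float>& frame) {
        // Overwrite the oldest slot of the ring buffer with the new frame.
        std::copy(frame.begin(), frame.end(), buffer_.begin() + head_ * channels_);
        head_ = (head_ + 1) % kernel_.size();
        if (++frames_seen_ < kernel_.size()) return {};  // not enough history yet

        // Kernel tap k applies to the k-th oldest frame currently in the buffer.
        std::vector<float> out(channels_, 0.0f);
        for (size_t k = 0; k < kernel_.size(); ++k) {
            size_t slot = (head_ + k) % kernel_.size();  // walk from oldest to newest
            for (size_t c = 0; c < channels_; ++c)
                out[c] += kernel_[k] * buffer_[slot * channels_ + c];
        }
        return out;  // one output per frame, with no recomputation of earlier windows
    }

private:
    std::vector<float> kernel_;  // one weight per time step (simplified)
    size_t channels_;
    std::vector<float> buffer_;  // kernel_width * channels values: the only state kept
    size_t head_ = 0;            // index of the oldest slot (next to be overwritten)
    size_t frames_seen_ = 0;
};
```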
Fig 3. A comparison of the Fluent WakeWord model running on Fluent µCore and on TensorFlow Lite for Microcontrollers
To investigate the effectiveness of Fluent µCore and Arm CMSIS, we benchmarked µCore against TensorFlow Lite for Microcontrollers (tflite-micro) using a wake word recognition model. This prototype model, designed by Fluent, is a 150 kB, 8-bit quantized network with multiple convolutional layers. Fluent streaming decoding produces output every 80 ms. The results of this comparison are shown in Figure 3.
We tested tflite-micro in two ways: applying the window at 80 ms intervals, and applying a slightly larger window at 400 ms intervals. The larger interval requires less computation, but at the cost of higher latency: when the user speaks the wake word, they will have to wait up to 400 ms for the system to react. In either case, the RAM needed to store activations in µCore was a fraction of the RAM that tflite-micro needed. Even at the 400 ms interval, tflite-micro required over three times as many CPU cycles. In conclusion, the smaller memory and MIPS requirements of Fluent µCore make it possible to run more capable networks on memory- and CPU-constrained devices, reducing costs for both our OEM/ODM partners and end users.
The Fluent.ai embedded speech recognition software is an ideal solution for consumer electronic devices using Arm processors. The power efficiency of Fluent µCore and the inherently privacy-protecting nature of our machine learning algorithms make it a perfect fit for smart devices in the home and office. Smart watches, fitness trackers, smart home appliances such as microwaves, washing machines, and air conditioners, and factory automation robots are all prime examples of target applications for Fluent.ai speech understanding technology. Furthermore, Fluent.ai's ability to build multiple languages into a single model means that users can switch seamlessly between languages when interacting with their device, without needing to configure language settings in between. This also brings cost savings, ease of business, and market advantages to our OEM and ODM partners: not only can they use a single-SKU solution to address multiple markets, they can also effortlessly address markets with a high language density.
[CTAToken URL = "https://www.youtube.com/watch?v=QDo_tOyKqRw&feature=youtu.be" target="_blank" text="Watch Fluent.ai's speech recognition tech talk" class ="green"]