It was one thing when some of Amazon’s voice-enabled Alexa devices picked up children’s voices and then ordered goods online. It was another thing altogether when families watching television coverage of that story found that their Amazon devices ordered those same products because they heard the reference on the news report. Ah, the unintended consequences of powerful voice recognition and artificial intelligence!
This anecdote highlights the power of speech recognition technology, 60 years after Bell Labs’ Audrey device and 50 years after IBM showed off its Shoebox machine.
Speech recognition has vastly improved over the decades, thanks to electronics innovation and artificial intelligence advances. Yet, even amazing applications like voice-activated assistants are, in some ways, still in their adolescence, partly because of the complexity of human language and speech.
Consider this: speech is a fundamental form of human connection that allows us to communicate, articulate, vocalize, recognize, understand, and interpret. But here’s where the complexity comes in. There are thousands of languages and even more dialects, and each of us has a unique vocabulary. Researchers from an independent American-Brazilian research project found that native English-speaking adults understood an average of 22,000 to 32,000 vocabulary words and learned about one word a day. Non-native English-speaking adults knew an average of 11,000 to 22,000 English words and learned about 2.5 words a day.
While English speakers might use upwards of 30,000 words, most embedded speech-recognition systems use a vocabulary of fewer than 10,000 words. Accents and dialects increase the vocabulary size needed for a recognition system to be able to correctly capture and process a wide range of speakers within a single language.
Clearly, speech recognition and artificial intelligence still have some way to go to match human capability. To close that gap, we’ll be looking for advancements in voice recognition technologies that resolve existing accuracy and security issues and can operate fully as an embedded solution.
With the continually improving computing power and compact size of mobile processors, large-vocabulary engines that support natural speech are now available as an embedded option for OEMs. The footprint of such an engine has been reduced and optimized, making it an even more attractive option for OEMs as they begin to make greater use of artificial intelligence. Effective speaker recognition requires segmenting the audio stream, detecting and/or tracking speakers, and identifying those speakers. The recognition engine also provides fusion functionality, producing a single fused result on which decisions can be made more readily. For the engine to function at its full potential, and to let users speak naturally and be understood even in a noisy environment, pre-processing techniques are integrated to improve the quality of the audio input to the recognition system.
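To make the first of those steps concrete, here is a minimal sketch of audio-stream segmentation. It uses a simple frame-energy threshold, which is a stand-in chosen for illustration and not the method used by any engine described here; the function name `segment_by_energy`, the frame length, and the threshold value are all assumptions.

```python
from typing import List, Tuple


def segment_by_energy(
    samples: List[float],
    frame_len: int = 4,        # assumed frame size, in samples (illustrative only)
    threshold: float = 0.01,   # assumed mean-energy cutoff (illustrative only)
) -> List[Tuple[int, int]]:
    """Return (start, end) sample-index pairs for runs of high-energy frames.

    Hypothetical helper: a frame counts as "active" when its mean squared
    amplitude exceeds `threshold`. Consecutive active frames are merged
    into one segment. `end` is exclusive.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # start index of the segment currently being built, if any
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i  # an active run begins at this frame
        elif start is not None:
            segments.append((start, i))  # the active run ended before this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # audio ended during an active run
    return segments
```

In a fuller pipeline, each returned segment would be handed to the speaker detection and identification stages. A production system would likely use a more robust voice activity detector, especially in noisy environments.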
The other key to improved voice recognition technology is distributed computing. We’ve reached this amazing point in voice recognition (yes, even considering the accidental Amazon orders!) thanks to the cloud, but cloud technology has limitations when it is applied in a real-time enterprise environment that requires user privacy, security, and reliable connectivity. The world is moving quickly to a new model of collaborative embedded-cloud operation, called an embedded glue layer, that promotes uninterrupted connectivity and directly addresses emerging cloud challenges for the enterprise.
With an embedded glue layer, user voice or visual data can be captured and processed locally, without complete dependence on the cloud. In its simplest form, the glue layer acts as an embedded service and collaborates with the cloud-based service to provide native on-device processing. The glue layer allows mission-critical voice tasks (those where user or enterprise security, privacy, and protection are required) to be processed natively on the device, and it ensures continuous availability. Non-mission-critical tasks, such as natural language processing, can be processed in the cloud, with low-bandwidth textual data exchanged in both directions. The embedded recognition glue layer offers nearly the same scope as a cloud-based service, but as a native process.
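The division of labor described above can be sketched as a small routing rule. This is an illustrative sketch only, not the actual glue layer design: the names `VoiceTask`, `Target`, and `route` are hypothetical, and the fallback to on-device processing when the cloud is unreachable is an assumption drawn from the goal of continuous availability.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    DEVICE = "device"
    CLOUD = "cloud"


@dataclass(frozen=True)
class VoiceTask:
    """Hypothetical description of a unit of voice-related work."""
    name: str
    mission_critical: bool  # security, privacy, or protection is required
    payload_is_text: bool   # True if only textual data would leave the device


def route(task: VoiceTask, cloud_available: bool) -> Target:
    """Decide where a task runs under the assumed glue-layer policy."""
    if task.mission_critical:
        return Target.DEVICE  # sensitive work is processed natively
    if not task.payload_is_text:
        return Target.DEVICE  # assumption: only low-bandwidth text is sent to the cloud
    if not cloud_available:
        return Target.DEVICE  # assumption: fall back locally to stay available
    return Target.CLOUD
```

For example, `route(VoiceTask("speaker_verification", True, False), cloud_available=True)` stays on the device, while a text-only natural language processing task goes to the cloud whenever it is reachable.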
This approach to voice recognition technology will revolutionize not only applications but devices as well, and it’s on our doorstep, just like those packages.
This white paper from Recognition Technologies and Arm offers excellent technical insight into the architecture and design approach that’s making the gateway a more powerful, efficient place for voice recognition. And read more about Arm's artificial intelligence technologies.
Check out the Recognition Technologies and Arm white paper