AI’s role in next-generation voice recognition

August 15, 2017

It was one thing when some of Amazon’s voice-enabled Alexa devices picked up children’s voices and then ordered goods online. It was another thing altogether when families watching television coverage of that story found that their Amazon devices ordered those same products because they heard the reference on the news report. Ah, the unintended consequences of powerful voice recognition and artificial intelligence!

This anecdote highlights the power of speech recognition technology, more than 60 years after Bell Labs’ Audrey device and more than 50 years after IBM showed off its Shoebox machine.

Speech recognition has vastly improved over the decades, thanks to electronics innovation and artificial intelligence advances. Yet, even amazing applications like voice-activated assistants are, in some ways, still in their adolescence, partly because of the complexity of human language and speech.

Comprehending complexity

Consider this: speech is a fundamental form of human connection that allows us to communicate, articulate, vocalize, recognize, understand, and interpret. But here’s where the complexity comes in: there are thousands of languages and even more dialects, and each of us has a unique vocabulary. Researchers from an independent American-Brazilian research project found that native English-speaking adults understood an average of 22,000 to 32,000 vocabulary words and learned about one word a day. Non-native English-speaking adults knew an average of 11,000 to 22,000 English words and learned about 2.5 words a day.

While English speakers might use upwards of 30,000 words, most embedded speech-recognition systems use a vocabulary of fewer than 10,000 words. Accents and dialects increase the vocabulary size needed for a recognition system to be able to correctly capture and process a wide range of speakers within a single language.

Speech recognition and artificial intelligence clearly still have a way to go to match human capability. To close that gap, we’ll be looking for advancements in voice recognition technologies that resolve existing accuracy and security issues and can operate fully as an embedded solution.

Voice recognition meets artificial intelligence

With the continually improving computing power and compact size of mobile processors, large-vocabulary engines that support natural speech are now available as an embedded option for OEMs. The footprint of such an engine has been shrunk and optimized, making it an even more attractive option for OEMs as they start to make greater use of artificial intelligence.

Effective speaker recognition requires segmenting the audio stream, detecting and/or tracking speakers, and identifying those speakers. The recognition engine provides fusion functionality, producing a single fused result on which decisions can be made more readily. For the engine to function at its full potential and to let users speak naturally and be understood, even in a noisy environment, pre-processing techniques are integrated to improve the quality of the audio input to the recognition system.
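The article does not say how the engine fuses its results, so the following is only a minimal sketch of one common approach, weighted score-level fusion, to make the idea concrete. Everything in it is an assumption made for illustration: the names `EngineScore`, `fuse_scores`, and `accept`, the two example engines, the weights, and the 0.7 threshold. None of it is taken from the Recognition Technologies engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineScore:
    """Confidence (0.0 to 1.0) reported by one hypothetical recognition engine."""
    engine: str
    score: float
    weight: float  # how much this engine is trusted relative to the others


def fuse_scores(scores: list[EngineScore]) -> float:
    """Combine per-engine scores into one value using a weighted average."""
    total_weight = sum(s.weight for s in scores)
    if total_weight <= 0:
        raise ValueError("at least one engine needs a positive weight")
    return sum(s.score * s.weight for s in scores) / total_weight


def accept(scores: list[EngineScore], threshold: float = 0.7) -> bool:
    """Make a single decision from the fused score (threshold is illustrative)."""
    return fuse_scores(scores) >= threshold


# Example: a voice engine trusted twice as much as a visual engine.
scores = [
    EngineScore("voice", 0.9, 2.0),
    EngineScore("visual", 0.6, 1.0),
]
print(f"fused score: {fuse_scores(scores):.2f}, accepted: {accept(scores)}")
```

The point of the sketch is that a decision is made once, on the fused value, rather than separately on each engine's output. A real system might fuse at a different level or with learned weights.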

The other key to improved voice recognition technology is distributed computing. We’ve gotten to this amazing point in voice recognition (yes, even considering the accidental Amazon orders!) thanks to the cloud, but cloud technology has limitations in a real-time enterprise environment that requires user privacy, security, and reliable connectivity. The world is moving quickly to a new model of collaborative embedded-cloud operation, called an embedded glue layer, that promotes uninterrupted connectivity and directly addresses emerging cloud challenges for the enterprise.

With an embedded glue layer, capturing and processing user voice or visual data can be performed locally and without complete dependence on the cloud. In its simplest form, the glue layer acts as an embedded service and collaborates with the cloud-based service to provide native on-device processing. The glue layer allows mission-critical voice tasks, where user or enterprise security, privacy and protection are required, to be processed natively on the device, and it ensures continuous availability. Non-mission-critical tasks, such as natural language processing, can be processed in the cloud using low-bandwidth, textual data as the mode of two-way transmission. The embedded recognition glue layer provides nearly the same scope as a cloud-based service, albeit as a native process.
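The routing rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the actual glue layer: `VoiceTask`, `Target`, `route`, and `cloud_payload` are hypothetical names, and falling back to the device when the cloud is unreachable is one reading of "continuous availability" rather than something the article spells out.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    DEVICE = "device"
    CLOUD = "cloud"


@dataclass(frozen=True)
class VoiceTask:
    """A hypothetical unit of work handed to the glue layer."""
    name: str
    mission_critical: bool  # security, privacy, or protection is required


def route(task: VoiceTask, cloud_reachable: bool) -> Target:
    """Decide where a task runs.

    Mission-critical tasks always stay on the device. Other tasks go to the
    cloud when it is reachable and fall back to the device otherwise
    (the fallback is an assumption of this sketch).
    """
    if task.mission_critical or not cloud_reachable:
        return Target.DEVICE
    return Target.CLOUD


def cloud_payload(transcript: str) -> bytes:
    """Only low-bandwidth text is sent to the cloud, never raw audio."""
    return transcript.encode("utf-8")


verification = VoiceTask("speaker_verification", mission_critical=True)
nlp = VoiceTask("natural_language_processing", mission_critical=False)

print(route(verification, cloud_reachable=True).value)
print(route(nlp, cloud_reachable=True).value)
print(route(nlp, cloud_reachable=False).value)
```

Keeping the decision in one small function makes the privacy boundary easy to audit: audio is captured and recognized locally, and only the resulting text ever crosses to the cloud.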

This approach to voice recognition technology will revolutionize not only applications but devices as well, and it’s on our doorstep, just like those packages.

This white paper from Recognition Technologies and Arm offers excellent technical insight into the architecture and design approach that’s making the gateway a more powerful, efficient place for voice recognition. And read more about Arm's artificial intelligence technologies.

Check out the Recognition Technologies and Arm white paper
