We live in a world where voice-based products have significantly changed the way we communicate with technology. In the US alone, smart assistant devices in the home are predicted to grow by 1.6 billion units in 2022. Soon, you will be able to ask your robot to clean the house while your voice assistant books your favourite restaurant for dinner!
Many of today’s voice-based devices process the captured data in the cloud. Is this a necessity, or is it a temporary solution while we learn how to better leverage the compute power in the devices themselves?
Alessandro Grande, Ecosystem Manager at Arm, recently sat down with our new Innovator and Chief Technology Officer (CTO) at Snips, Joseph Dureau, to learn about Snips’ intelligent voice assistant technology and how they’re pioneering the future of voice-activated devices that process data on the device.
It will probably come as no surprise that we believe voice interfaces will have a role to play in almost any industry or organization because of their unique advantages: they’re intuitive, hands- and eyes-free, and accessible. With these characteristics comes a lot of potential. Today, however, existing voice assistants have a fundamental shortcoming: they are heavily centralized, doing most of their computing in the cloud, which raises critical concerns around privacy, security, bandwidth, and of course, dependence on cloud connectivity.
At Snips, we believe in an end-to-end, private-by-design solution. We run our entire Voice Platform on the device, instead of collecting user data and processing it in the cloud. In addition, our voice assistant technology is white-label, so our clients stay in complete control of their brand and experience.
Our journey started 18 months ago. Back then, we didn’t think we could go so far in terms of vocabulary size on standard IoT hardware, but the team kept pushing the limits. We now do better than the Google Cloud Speech API on one of the most complex use cases, on a Raspberry Pi.
There are two main ingredients to our secret sauce:
Our developer community plays a fundamental role in the development of our solution. Snips’ engineers work constantly to improve our product, incorporating feedback from the community of 30,000 developers who build with Snips. Whenever a developer builds a new voice app or integrates another technology with Snips, they give us further proof that voice assistants can run reliably on the device, at a fraction of the computing power of the cloud. These efforts also grow the set of examples and use cases for our applications, most of which are shared openly on the web.
In addition, thanks to our community, many prospects hear about Snips through word of mouth. They start experimenting with our solution right away, benefiting from the help and support of our vibrant community. By the time they contact us, they are already familiar with the Snips solution, they know it fits their needs, and they want to get straight to talking business. Our developer community is also a key enabler on the business front.
One of the best community projects I’ve seen using Snips is Project Alice by Laurent Chervet, our resident “supermaker.” Laurent earned this title for his unrelenting commitment to building with Snips and for being a source of inspiration, troubleshooting, and support for other developers who are getting started with our platform. Laurent’s house is now completely controlled by voice, and he is still shipping one new feature per day, always pushing the boundaries of what Snips can do.
Our Spoken Language Understanding engine can run on a wide variety of hardware. Our tiniest solution runs on Arm Cortex-M4 processors at 100 MHz, the lightest MCU platform we’ve been able to integrate on, and one that is quite prevalent in the small IoT space, along with the Cortex-M7 processor. This solution, called Snips Commands, can identify a wake word and understand voice commands like “play”, “pause”, or “heavy wash.”
On an application processor, our Snips Flow solution can understand queries expressed in natural language, like “it’s dark in here”, “give me a recipe for pasta and zucchini,” or “throw me some Aretha Franklin on the radio.” The minimal requirement for Snips Flow is a dual-core chip at 1.2 GHz. For large-vocabulary use cases, we typically require a quad-core Cortex-A53 processor, which corresponds to a Raspberry Pi 3 or an NXP i.MX8.
On this kind of hardware, we can achieve cloud-level performance, even on large vocabulary use cases, while keeping all the processing on the device. Last fall, we published a benchmark comparing Snips Flow running on a Raspberry Pi 3 to major cloud Speech APIs, for a music use case. The data revealed that Snips can achieve cloud-level accuracy on the device.
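To make “understanding a query” concrete: a spoken language understanding engine turns a transcribed sentence into a structured intent with typed slots that application code can act on. The sketch below is purely illustrative, with invented intent and slot names rather than Snips’ actual schema:

```python
# Purely illustrative: the intent and slot names are invented, not Snips'
# actual output schema. It shows the general shape an SLU engine produces
# for a query like the recipe example above.
slu_result = {
    "input": "give me a recipe for pasta and zucchini",
    "intent": {"intentName": "searchRecipe", "probability": 0.93},
    "slots": [
        {"rawValue": "pasta", "slotName": "ingredient"},
        {"rawValue": "zucchini", "slotName": "ingredient"},
    ],
}

# Application code dispatches on the intent and reads the slot values:
if slu_result["intent"]["intentName"] == "searchRecipe":
    ingredients = [s["rawValue"] for s in slu_result["slots"]
                   if s["slotName"] == "ingredient"]
    print("Searching recipes with: " + ", ".join(ingredients))
```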
We support our activity and growth by selling our software to any device manufacturer who wants to add a voice interface to their product. Nevertheless, this business model doesn’t prevent us from having an ambitious open source and publication strategy. We believe that open source is, and will remain, a key element of our success.
We started by fully releasing our Natural Language Understanding (NLU) engine under an Apache 2.0 license, as an open-source, private-by-design alternative to Dialogflow, Amazon Lex, and other cloud NLU services. This handful of providers has made cloud-based NLU a commodity, powering chatbots and voice assistants all over the planet. Our solution, by contrast, runs on-device or on-premise with the same or better performance and a minimal footprint, all while responding faster than a round-trip to the cloud. We made the decision to open source this technology both to capture a segment of this immense industry and to reduce the dependence of AI assistants worldwide on these centralized providers. The decision paid off: Snips’ NLU library quickly trended on GitHub, and we are regularly approached by companies who want our support to add a voice interface to their chatbot.
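For readers who want to try this, here is a minimal sketch using the open-source snips-nlu Python library; the dataset path and the sample utterance are placeholders, and the dataset format is documented in the project’s repository:

```python
import io
import json

from snips_nlu import SnipsNLUEngine
from snips_nlu.default_configs import CONFIG_EN

# Language resources must be installed once: python -m snips_nlu download en
# "dataset.json" is a placeholder for a training dataset describing your
# intents, example utterances, and entities.
with io.open("dataset.json") as f:
    dataset = json.load(f)

# Both training and parsing run locally: no cloud round-trip is involved.
engine = SnipsNLUEngine(config=CONFIG_EN)
engine.fit(dataset)

parsing = engine.parse("Turn the living room lights to twenty percent")
print(json.dumps(parsing, indent=2))
```

The result is a plain dictionary containing the detected intent and its slots, in the same spirit as the illustrative sketch above.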
More recently, we open sourced Tract, our embedded neural network inference library. It is a low-level component of our platform that powers our wake word detector, and soon our user identification engine. While there are existing open source alternatives out there, we see embedded neural network inference as a growing, yet still massively fragmented, field. We wanted to be open about how we approached the problem and about the specific issues related to voice that we had to face, and to be proactive in driving the community towards better solutions.
Beyond strict open source, making our technology freely available for anyone who wants to tinker with it is critical for us. It is a way to make our technology visible, a way to get feedback and to reach as many people as possible.
In the past year, we’ve tried to make it even easier for developers to experiment with Snips, launching two developer kits that contain all the necessary components to deploy a voice assistant, including a Raspberry Pi. The most recent kit is available for purchase in partnership with Seeed Studio and contains an extension satellite kit running on a Raspberry Pi Zero that we think will empower everyone to experiment with our technology all over their homes.
I love learning from people who are more knowledgeable than me. I have great memories from my time in academia, first at the NASA Jet Propulsion Laboratory working on climate modelling, then at the London School of Economics studying statistics, and, since then, from working on artificial intelligence with my amazing co-workers at Snips.
What drives me today is putting my knowledge to use in building products that everyone uses on a daily basis, in the fascinating space of artificial intelligence, while pushing the boundaries of what’s possible.
Flying kites makes me happy, cargo boats fascinate me, and I always fall for a good Paris-Brest!
This is a very vibrant and dynamic landscape. Artificial intelligence is changing the way we interact with our surroundings through voice. The next generation of voice interfaces will process data locally because it’s the best way to build a trusted, transparent and intimate relationship between humans and their devices. Interfaces will also be distributed and multimodal; some devices will have the ability to sense, some to understand, some others to act and provide feedback. They will need to learn to work together. At Snips, we deeply believe that voice interfaces will be personal. It’s only by being aware of context, identity, and a user’s past activity that interactions will become fluid and natural. This personal aspect will also need to be pervasive, so our environment seamlessly adapts to us wherever we go.
On the other end of the spectrum, there is a lot of work that is pushing the limits of what can be done on a microcontroller. Our next generation of products for microcontrollers will be a miniaturized version of what we run on application processors, while still understanding natural language. I’m excited to see the progress we’ll make over the coming years.
To keep up-to-date with Innovator-based projects, and the ways you can benefit from their work, sign up to the Innovator Program newsletter.
To learn about how Arm and its partners, including Snips, are enabling a step-change increase in on-device processing capability, read our white paper, How to migrate intelligence from the cloud to embedded devices at the edge, by clicking the link below.
Download white paper
Arm recently announced Arm Helium technology, an architectural extension that will open up even more possibilities for future voice and sound devices. Helium will deliver up to a 15x performance uplift for machine learning applications running on future Cortex-M processors, expanding the potential for companies like Snips to pioneer. Learn more about this new technology and what it means for the industry and voice-based devices.