From driver-assistance features keeping us safe on the road to asking our phones to set a reminder, neural networks power an increasing share of our everyday interactions with computers.
One drawback of these neural network models, however, is that they are typically ‘one size fits all’. A neural network model learns to minimize error over all the data, but for many use cases the only error that matters to me as a user is the error on my data. If you are part of the majority you might never notice a problem, but neural networks frequently perform worse for minorities, resulting in a particularly insidious form of technological inequality.
We have recently begun investigating ways to solve this problem. Fundamentally, the capacity of a neural network model is a finite resource. We try to train models to represent data from a wide range of potential users equally.
In practice, however, many deployments will mostly be used by one user (or with one microphone, or in one location).
No dataset can be perfectly fair or balanced and minorities will always be under-represented. However, when I talk to my phone, I want it to recognize my accent. If it does so at the expense of being less accurate on a strong Australian accent, that’s fine with me. The same is true of many kinds of image, audio and video tasks.
What if there were a way to adapt a model to devote more of its capacity to minimizing the error on the examples it actually sees in real-world use? Could it learn to be more accurate?
This might result in a different version of the model for each user.
Alternatively, instead of increasing the accuracy, could we achieve higher model compression if we knew more about the real-world distribution of inputs?
There are many ways to attempt to solve this problem. We recently completed some research into one approach based on edge distillation.
The biggest challenge to learning on the edge is that nobody wants to sit down and provide a written transcript of every command they give their phone, or sort through their entire photo album, tagging every family member by hand. In most situations, there are no “correct” labels available to learn from.
We looked at a technique that side-steps this issue by using an on-device teacher. The principle here is to deploy not one, but two neural network models onto the device:
Because the teacher model is less capacity-constrained than the runtime model, it can better capture data from all users and not just the majority.
During normal use, the device uses the runtime model to give real-time feedback to the user and saves samples of its inputs locally.
During downtime (for example, while charging or overnight), the device uses the teacher model to generate more accurate predictions for these sampled inputs. It then trains the real-time model to match the teacher’s predictions. All the data remains on-device, ensuring privacy and eliminating the need for an active internet connection.
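For illustration, here is a minimal sketch of what that downtime step could look like in PyTorch. The function name, the temperature, the optimizer settings and the KL-divergence loss are assumptions made for the sake of the example rather than the exact recipe from our experiments; the essential idea is simply that the teacher’s soft predictions stand in for the missing ground-truth labels.

```python
# Sketch of the downtime distillation step (PyTorch). `student` and `teacher`
# can be any pair of modules that map the same inputs to the same set of
# classes; the hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F

def distil_on_device(student, teacher, saved_inputs, steps=100, lr=1e-3, temperature=2.0):
    """Fit the small runtime (student) model to the teacher's soft predictions."""
    teacher.eval()
    student.train()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    with torch.no_grad():
        # The teacher labels the inputs that were sampled during normal use.
        soft_targets = F.softmax(teacher(saved_inputs) / temperature, dim=-1)

    for _ in range(steps):
        optimizer.zero_grad()
        student_log_probs = F.log_softmax(student(saved_inputs) / temperature, dim=-1)
        # Pull the student's output distribution towards the teacher's soft labels.
        loss = F.kl_div(student_log_probs, soft_targets, reduction="batchmean")
        loss.backward()
        optimizer.step()
    return student
```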
We evaluated this using keyword recognition on the Google Speech Commands dataset. Three baseline neural networks were evaluated to investigate how well different architectures respond to this approach. In each case, the teacher was a much larger and more accurate recurrent neural network model, unsuitable for real-time use in this low-power, always-on application.
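To give a feel for the capacity gap involved, the snippet below defines a hypothetical pair of models of the kind described above: a tiny always-on student and a much larger recurrent teacher. These are illustrative stand-ins, not the actual baseline architectures from our evaluation.

```python
# Illustrative stand-ins for the two on-device models (PyTorch). The class
# count, feature size and layer widths are assumptions, not the evaluated
# baselines; they only show the size asymmetry between student and teacher.
import torch.nn as nn

NUM_KEYWORDS = 10   # assumed number of keyword classes
NUM_MFCC = 40       # assumed number of acoustic features per frame

class Student(nn.Module):
    """Small runtime model: cheap enough for always-on, real-time inference."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_MFCC, 64), nn.ReLU(), nn.Linear(64, NUM_KEYWORDS)
        )

    def forward(self, x):
        # x: (batch, time, features); average over time, then classify.
        return self.net(x.mean(dim=1))

class Teacher(nn.Module):
    """Larger recurrent model: more accurate, but too heavy for real-time use."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(NUM_MFCC, 256, num_layers=2, batch_first=True)
        self.head = nn.Linear(256, NUM_KEYWORDS)

    def forward(self, x):
        out, _ = self.rnn(x)          # x: (batch, time, features)
        return self.head(out[:, -1])  # classify from the final hidden state
```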
Across all speakers, using on-device adaptation reduced error rates by an average of 15%. For some speakers this was far higher, suggesting some users benefit from this much more than others.
Crucially, very little data was needed for each speaker: as few as 20 samples in total (two from each class), and these benefits were attained after ~100 training steps.
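Stringing the two sketches above together, an overnight adaptation run on that kind of budget might look roughly like this. The shapes and random tensors are purely illustrative; in a real deployment both models would be pre-trained before shipping, and the saved inputs would be real audio features captured during use.

```python
# Hypothetical end-to-end use of the sketches above: adapt the runtime model
# overnight using a handful of utterances saved during the day.
import torch

saved_inputs = torch.randn(20, 101, NUM_MFCC)  # ~20 saved utterances, 101 frames each
student, teacher = Student(), Teacher()        # in practice, the deployed (pre-trained) models
student = distil_on_device(student, teacher, saved_inputs, steps=100)
```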
Because so little data and computation are needed, for an always-on application such as speech recognition the real-time model can run on an Arm Cortex-M CPU or Ethos NPU, while the training can happen on an attached Arm Cortex-A CPU, making use of its floating-point units to perform power-efficient training and optimization.
The seamless interaction and cooperation between Arm IP blocks required for this is directly enabled by Arm’s Total Compute approach, which maximizes utilization of accelerator IP while providing the on-device training and optimization capabilities needed to continually improve the deployed solution.
There are many other approaches to on-device model adaptation and we are following up on several promising leads, so expect to hear more from us on this topic soon.
Whichever approach turns out to be the best, on-device learning is here to stay – and Arm Total Compute provides the flexibility and performance to implement that in whatever form it may take.
Learn more about Total Compute and Arm Research ML Lab. Please do reach out to me (Mark O'Connor) if you have any questions.