Adapting Models to the Real World: On-Device Training for Edge Model Adaptation

Mark O'Connor
July 15, 2020
4 minute read time.

From driver-assistance features keeping us safe on the road to asking our phones to set a reminder, neural networks are powering an increasing share of our everyday interactions with computers.

One drawback of all these neural network models, however, is that they are typically ‘one size fits all’. A neural network learns to minimize error over all of its training data. But for many use cases, the only error that matters to me as a user is the error on my data. If you are part of the majority, you might never notice a problem, but neural networks frequently perform worse for minorities, resulting in a particularly insidious form of technological inequality.

We have recently begun investigating ways to solve this problem. Fundamentally, the capacity of a neural network model is a finite resource. We try to train models to equally represent data from a wide range of potential users:

[Diagram: a model trained to represent data from a wide range of potential users equally.]

In practice, however, many deployments will mostly be used by one user (or with one microphone, or in one location):

[Diagram: in practice, many deployments are mostly used by a single user.]

No dataset can be perfectly fair or balanced, and minorities will always be under-represented. However, when I talk to my phone, I want it to recognize my accent. If it does so at the expense of being less accurate on a strong Australian accent, that’s fine with me. The same is true of many kinds of image, audio, and video tasks.

What if there were a way to adapt a model to devote more of its capacity to minimizing the error on the examples it actually sees in real-world use? Could it learn to be more accurate?

[Diagram: adapting a model to devote more of its capacity to minimizing error on the examples it actually sees in real-world use.]

This might result in a different version of the model for each user:

[Diagram: a different adapted version of the model for each user.]

Alternatively, instead of increasing the accuracy, could we achieve higher model compression if we knew more about the real-world distribution of inputs?

[Diagram: trading accuracy gains for higher model compression, given knowledge of the real-world input distribution.]

There are many ways to attempt to solve this problem. We recently completed some research into one approach based on edge distillation.

Learning without Labels

The biggest challenge to learning on the edge is that nobody wants to sit down and provide a written transcript of every command they give their phone, or sort through their entire photo album tagging every family member by hand. In most situations, there are no “correct” labels available to learn from.

We looked at a technique that side-steps this issue by using an on-device teacher. The principle here is to deploy not one, but two neural network models onto the device:

  1. The runtime model, which is highly optimized for low-latency inference.
  2. A teacher model, which is larger and more accurate but much too slow to run in real-time.
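
To make this concrete, here is a minimal PyTorch sketch of such a pair of models for a keyword-spotting task. The architectures, layer sizes, and input dimensions below are illustrative assumptions for this article, not the exact networks from our experiments:

```python
import torch.nn as nn

NUM_CLASSES = 12   # e.g. ten keywords plus "silence" and "unknown"
FEATURE_DIM = 40   # e.g. 40 MFCC coefficients per audio frame
NUM_FRAMES = 98    # roughly one second of audio

# Runtime (student) model: small and cheap enough for real-time inference.
runtime_model = nn.Sequential(
    nn.Flatten(),  # (batch, frames, features) -> (batch, frames * features)
    nn.Linear(NUM_FRAMES * FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_CLASSES),
)

# Teacher model: a larger recurrent network, more accurate but too slow
# to run on every input in real time.
class Teacher(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(FEATURE_DIM, 256, num_layers=2, batch_first=True)
        self.head = nn.Linear(256, NUM_CLASSES)

    def forward(self, x):               # x: (batch, frames, features)
        out, _ = self.gru(x)
        return self.head(out[:, -1])    # classify from the final time step

teacher_model = Teacher()
```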

Because the teacher model is less capacity-constrained than the runtime model, it can better capture data from all users and not just the majority.

During normal use, the device uses the runtime model to give real-time feedback to the user and saves samples of its inputs locally.

During downtime (for example, while charging or overnight), the device uses the teacher model to generate more accurate predictions for these sampled inputs. It then trains the runtime model to match the teacher’s predictions. All the data remains on-device, ensuring privacy and eliminating the need for an active internet connection.
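
That downtime pass is essentially a standard knowledge-distillation step: the runtime model is trained to match the teacher’s softened predictions, with no ground-truth labels involved. A minimal sketch, assuming a PyTorch setup like the one above; the softmax temperature and the buffer of saved inputs are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distill_step(runtime_model, teacher_model, saved_inputs, optimizer,
                 temperature=2.0):
    """One training step matching the runtime model to the teacher.

    saved_inputs: a batch of unlabeled inputs sampled during normal use.
    The teacher's soft predictions serve as the training target.
    """
    with torch.no_grad():                    # the teacher stays fixed
        teacher_logits = teacher_model(saved_inputs)
    student_logits = runtime_model(saved_inputs)

    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```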

How Well Does This Work?

We evaluated this approach using keyword recognition on the Google Speech Commands dataset. Three baseline neural networks were tested to investigate how robust different architectures are to this approach. In each case, the teacher was a much larger and more accurate recurrent neural network, far too slow to run in real time for this low-power, always-on application.

[Diagram: relative error reduction per speaker using on-device adaptation.]

Across all speakers, on-device adaptation reduced error rates by an average of 15%. For some speakers the reduction was far higher, suggesting that some users benefit from this approach much more than others.

Adaptive Learning Enabled by Total Compute

Crucially, very little data was needed for each speaker: as few as 20 samples in total (two from each class), and these benefits were attained after roughly 100 training steps.
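
Using the distill_step sketch above, the whole adaptation pass is correspondingly small. The sample and step counts mirror the figures from our experiments; the optimizer choice and learning rate are illustrative assumptions:

```python
import torch

# saved_inputs: ~20 locally buffered examples (hypothetical tensor of
# shape (20, NUM_FRAMES, FEATURE_DIM) collected during normal use).
optimizer = torch.optim.Adam(runtime_model.parameters(), lr=1e-4)
for step in range(100):   # ~100 training steps sufficed in our tests
    distill_step(runtime_model, teacher_model, saved_inputs, optimizer)
```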

This means that for an always-on application such as speech recognition, the runtime model can run on an Arm Cortex-M CPU or Ethos NPU, while the training happens on an attached Arm Cortex-A CPU, making use of its floating-point units to perform power-efficient training and optimization.

The seamless cooperation between these Arm IP blocks is directly enabled by Arm’s Total Compute approach, which maximizes utilization of accelerator IP while providing the on-device training and optimization capabilities required to continually improve the deployed solution.

There are many other approaches to on-device model adaptation, and we are following up on several promising leads, so expect to hear more from us on this topic soon.

Whichever approach turns out to be the best, on-device learning is here to stay – and Arm Total Compute provides the flexibility and performance to implement that in whatever form it may take.

Learn more about Total Compute and Arm Research ML Lab. Please do reach out to me if you have any questions.

Contact Mark O'Connor 
