
Adapting Models to the Real World: On-Device Training for Edge Model Adaptation

Mark O'Connor
July 15, 2020

From driver-assistance features keeping us safe on the road to asking our phones to set a reminder, neural networks power an increasing share of our everyday interactions with computers.

One drawback of all these neural network models, however, is that they are typically ‘one size fits all’. A neural network learns to minimize error over all of its training data. But for many use cases, the only error that matters to me as a user is the error on my data. If you are part of the majority you might never notice a problem, but neural networks frequently perform worse for minorities, resulting in a particularly insidious form of technological inequality.

We have recently begun investigating ways to solve this problem. Fundamentally, the capacity of a neural network model is a finite resource. We try to train models to equally represent data from a wide range of potential users:

[Diagram: a model trained to represent data equally from a wide range of potential users]

In practice, however, many deployments will mostly be used by one user (or with one microphone, or in one location):

[Diagram: in practice, many deployments are mostly used by a single user]

No dataset can be perfectly fair or balanced, and minorities will always be under-represented. However, when I talk to my phone, I want it to recognize my accent. If it does so at the expense of being less accurate on a strong Australian accent, that’s fine with me. The same is true of many kinds of image, audio, and video tasks.
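To make the trade-off concrete, it can be stated as a change of training objective (the notation here is ours, not from the original research). Standard training minimizes the expected loss over the whole population of users, while an adapted model minimizes it over a single user's own input distribution:

    min_θ  E_{(x,y) ~ D_population} [ loss(f_θ(x), y) ]    (one size fits all)
    min_θ  E_{(x,y) ~ D_user}       [ loss(f_θ(x), y) ]    (adapted to one user)

With finite model capacity, the second objective can be met with lower error (or a smaller model), because D_user covers far less variation than D_population.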

What if there were a way to adapt a model to devote more of its capacity to minimizing the error on the examples it actually sees in real-world use? Could it learn to be more accurate?

[Diagram: adapting a model to devote more of its capacity to the examples it actually sees in real-world use]

This might result in a different version of the model for each user:

[Diagram: a different adapted version of the model for each user]

Alternatively, instead of increasing the accuracy, could we achieve higher model compression if we knew more about the real-world distribution of inputs?

[Diagram: using knowledge of the real-world input distribution for higher model compression instead of higher accuracy]

There are many ways to attempt to solve this problem. We recently completed some research into one approach based on edge distillation.

Learning without Labels

The biggest challenge to learning on the edge is that nobody wants to sit down and provide a written transcript of every command they give their phone, or sort through their entire photo album, tagging every family member by hand. In most situations, there are no “correct” labels available to learn from.

We looked at a technique that side-steps this issue by using an on-device teacher. The principle here is to deploy not one, but two neural network models onto the device:

  1. The runtime model, which is highly optimized for low-latency inference.
  2. A teacher model, which is larger and more accurate but much too slow to run in real-time.
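As a rough illustration, a minimal PyTorch sketch of such a pairing might look like the following. The architectures, sizes, and feature shapes are illustrative assumptions, not the models used in the research; the post only states that the teacher was a larger recurrent network.

    import torch
    import torch.nn as nn

    class RuntimeModel(nn.Module):
        """Small student model, cheap enough for low-latency, always-on inference."""
        def __init__(self, n_mels=40, n_classes=12):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(32, n_classes),
            )

        def forward(self, x):            # x: (batch, time, n_mels)
            return self.net(x.transpose(1, 2))

    class TeacherModel(nn.Module):
        """Larger, more accurate recurrent model; too slow for real-time use."""
        def __init__(self, n_mels=40, n_classes=12, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, x):            # x: (batch, time, n_mels)
            out, _ = self.rnn(x)
            return self.fc(out[:, -1])   # classify from the final hidden state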

Because the teacher model is less capacity-constrained than the runtime model, it can better capture data from all users and not just the majority.

During normal use, the device uses the runtime model to give real-time feedback to the user and saves samples of its inputs locally.

During downtime (for example, while charging or overnight), the device uses the teacher model to generate more accurate predictions for these sampled inputs. It then trains the real-time model to match the teacher’s predictions. All the data remains on-device, ensuring privacy and eliminating the need for an active internet connection.
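Continuing the sketch above, the downtime training step is ordinary knowledge distillation on unlabeled, locally saved inputs: the teacher's soft predictions stand in for the missing labels. The temperature and loss scaling below are conventional distillation choices, not details taken from the research.

    import torch
    import torch.nn.functional as F

    def distill_step(student, teacher, saved_inputs, optimizer, temperature=4.0):
        """One distillation step on a batch of inputs saved during normal use."""
        teacher.eval()
        with torch.no_grad():
            # The slow-but-accurate teacher labels the sampled inputs offline.
            soft_targets = F.softmax(teacher(saved_inputs) / temperature, dim=-1)

        student.train()
        log_probs = F.log_softmax(student(saved_inputs) / temperature, dim=-1)
        # Match the student's output distribution to the teacher's; the
        # temperature**2 factor keeps gradients comparable across temperatures.
        loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Nothing here leaves the device: saved_inputs is simply a local buffer of feature tensors captured while the runtime model was serving the user.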

How Well Does This Work?

We evaluated this using keyword recognition on the Google Speech Commands dataset. Three baseline neural networks were tested to investigate how robust different architectures are to this approach. In each case, the teacher was a much larger and more accurate recurrent neural network, too slow for real-time use in this low-power, always-on application.

[Diagram: relative error reduction using on-device adaptation]

Across all speakers, using on-device adaptation reduced error rates by an average of 15%. For some speakers the reduction was far higher, suggesting that some users benefit from this much more than others.

Adaptive Learning Enabled by Total Compute

Crucially, very little data was used for each speaker: as little as 20 samples in total (two from each class), and these benefits were attained after only ~100 training steps.
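Tying that scale back to the earlier sketch, the whole adaptation run is small enough to write out directly (the optimizer, learning rate, and input shapes are assumptions for illustration):

    # Hypothetical adaptation budget matching the scale described above:
    # a buffer of ~20 saved samples, trained for ~100 distillation steps.
    student, teacher = RuntimeModel(), TeacherModel()
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    saved_inputs = torch.randn(20, 101, 40)   # 20 clips, 101 frames, 40 mel bins
    for step in range(100):
        loss = distill_step(student, teacher, saved_inputs, optimizer)

A training budget this small is what makes overnight, on-device adaptation plausible in the first place.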

This means that for an always-on application such as speech recognition, the real-time model can run on an Arm Cortex-M CPU or Ethos NPU, while the training happens on an attached Arm Cortex-A CPU, making use of its floating-point units to perform power-efficient training and optimization.

The seamless cooperation between Arm IP that this requires is directly enabled by Arm’s Total Compute approach, which maximizes utilization of the accelerator IP while providing the on-device training and optimization capabilities needed to continually improve the deployed solution.

There are many other approaches to on-device model adaptation and we are following up on several promising leads, so expect to hear more from us on this topic soon.

Whichever approach turns out to be the best, on-device learning is here to stay – and Arm Total Compute provides the flexibility and performance to implement that in whatever form it may take.

Learn more about Total Compute and Arm Research ML Lab. Please do reach out to me if you have any questions.

Contact Mark O'Connor 
