Mobile computing is on the rise, and is currently moving into some exciting new applications and form factors – augmented reality (AR) glasses, unmanned aerial vehicles (UAVs), advanced driver assistance systems (ADAS) in automobiles, and more.
See Arm Community articles exploring different applications of mobile computing, such as:
Enabling Augmented Reality Mobile Apps through Low Power Machine Learning
Inside Microsoft's Hololens 2
Not just droning on! The rise of Kinibi-M
Advances in ADAS – Getting Closer to the Self-Driving Car
One interesting trait that these applications tend to share is a ‘real-time’ performance requirement. ‘Real-time’ means that the computing hardware must guarantee a response within a specified time period. For example, in the case of AR glasses, the vision system needs to meet a minimum frame-rate in order to provide a convincing experience as the user moves their head around. Or, in the case of ADAS applications in the automotive industry, latency must be extremely low so that any change in the environment, such as another car overtaking, is quickly conveyed to the system. To make matters worse, on top of the real-time performance constraint, most of these platforms are also heavily energy constrained. For example, the power budget for the real-time vision system in AR glasses could be as low as 1W.
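To get a feel for how tight these budgets really are, here is a quick back-of-envelope calculation. The 1W power budget comes from the AR glasses example above; the 60 fps frame-rate is an assumed target for illustration, not a figure from any specific product.

```python
# Back-of-envelope real-time budget calculation.
# Assumptions: 60 fps is a hypothetical minimum frame-rate for a convincing
# AR experience; the 1 W power budget is the AR glasses figure quoted above.
FRAME_RATE_HZ = 60
POWER_BUDGET_W = 1.0

# Latency budget per frame: the whole vision pipeline must finish within this.
frame_time_ms = 1000.0 / FRAME_RATE_HZ

# Energy budget per frame: power budget multiplied by the per-frame time.
energy_per_frame_mj = POWER_BUDGET_W * frame_time_ms

print(f"Per-frame latency budget: {frame_time_ms:.1f} ms")
print(f"Per-frame energy budget:  {energy_per_frame_mj:.1f} mJ")
```

At these assumed numbers, the entire vision system gets roughly 16.7 ms and 16.7 mJ per frame – every layer of every network has to fit inside that envelope.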
One of the biggest challenges we face in meeting these real-time throughput and energy constraints comes from computer vision (CV) algorithms. In recent years, CV workloads have become heavily reliant on machine learning algorithms such as neural networks (NNs), which are now prevalent in emerging mobile computing applications. In fact, NNs have become such an important workload that Arm have introduced the ML Processor, a dedicated hardware processor designed specifically to accelerate NN workloads and reduce their power consumption. For more details on the Arm ML Processor, I’d highly recommend Ian Bratt’s excellent talk at the Hot Chips conference.
Figure 1: Visualization of the low-level features typically learnt by CNNs trained on natural images. Reproduced from Yosinski et al., 2014
At the Arm ML Research Lab, we are focused on enabling NN workloads on constrained hardware platforms, including real-time and low-energy systems. One option for improving hardware efficiency is to design fixed-function hardware that performs inference on a single network for a single application. This approach has severe limitations: although it drastically increases efficiency, we lose flexibility, and the hardware is unlikely to be useful on new applications and datasets in the future. This tension between efficiency and flexibility is a common theme in computer architecture.
In grappling with the challenges of hardware specialization, we recently took inspiration from the machine learning community, in the form of transfer learning – an interesting property of NNs. Transfer learning shows that it is possible to reuse the early layers of a network trained on task A in a different network trained on task B. There are some limitations: task A and task B must come from a similar problem domain, for example both being image classification problems. Even with this caveat, transfer learning is a powerful concept. A simple interpretation is that the front layers of vision NNs are very similar. For example, Figure 1 shows a visualization of the filters learnt by the early layers of a convolutional neural network (CNN). These features are extremely common to CNNs trained on natural images. If we circle back to the specialization discussion earlier, I hope it becomes clear that there is an opportunity to specialize the hardware that processes these early layers, without losing the flexibility to tackle new datasets.
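To make the transfer-learning pattern concrete, here is a minimal NumPy sketch: a frozen ‘front-end’ feature extractor is shared unchanged across two toy tasks, and only a small task-specific ‘back-end’ is fit for each. The random-projection front-end, the least-squares linear head, and all shapes and data are illustrative assumptions for this sketch – not the networks or training setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Front-end": a fixed, shared feature extractor. In a real vision CNN these
# would be pretrained early convolutional layers; a random projection + ReLU
# stands in for them here purely for illustration.
W_FRONTEND = rng.standard_normal((16, 8))

def extract_features(x):
    """Shared front-end: frozen weights, identical for every task."""
    return np.maximum(x @ W_FRONTEND, 0.0)  # linear layer + ReLU

def train_backend(x, y):
    """Task-specific back-end: a linear head fit by least squares."""
    feats = extract_features(x)
    w, *_ = np.linalg.lstsq(feats, y, rcond=None)
    return w

def predict(x, w_backend):
    """Full model = shared frozen front-end + per-task back-end."""
    return extract_features(x) @ w_backend

# Two different toy regression tasks reuse the exact same front-end;
# only the back-end weights differ between them.
x_a, y_a = rng.standard_normal((100, 16)), rng.standard_normal(100)
x_b, y_b = rng.standard_normal((100, 16)), rng.standard_normal(100)
w_a = train_backend(x_a, y_a)  # back-end trained for task A
w_b = train_backend(x_b, y_b)  # back-end trained for task B
```

The key point is structural: `W_FRONTEND` is never updated, so anything we do to optimize it – including freezing it into hardware – is shared by every downstream task.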
Figure 2: A simplified FixyNN concept
One of our recent technical focuses has been jointly co-designing the NN model architecture and the hardware architecture. Traditionally, one team designs the NN architecture and another team designs the hardware. We found that considering both together at the system level delivered some interesting results. FixyNN is an example of what is possible with this co-design approach.
Let’s dive in and take a look at Figure 2, which shows the simplified FixyNN concept. The CNN is split into two pieces: a fixed front-end feature extractor, which is shared by all tasks, and a programmable back-end, which is trained specifically for each task. In this arrangement, the hardware that implements the common front-end layers can be heavily optimized – the weights are fixed in hardware and no longer need to be loaded from main memory. The result is that the shared front-end becomes very fast, whilst remaining low energy!
Please do check out our paper for more details. My co-authors are Chuteng Zhou, Patrick Hansen, Shreyas Kolala Venkataramanaiah, Jae-sun Seo and Matthew Mattina. We show that the FixyNN architecture can achieve nearly 2× greater energy efficiency than a conventional programmable CNN accelerator of the same silicon area. On top of this, we demonstrate that flexibility is not sacrificed: we were able to train a suite of six datasets via transfer learning with an accuracy loss of < 1%. If you’re interested in exploring this further, we’ve also open-sourced DeepFreeze, our tool for automatically generating hardware for fixed neural networks.
Read the Paper
Access DeepFreeze Tools
I’ll be presenting more details of FixyNN at the SysML conference this week. SysML is a new conference providing a venue for systems research in the area of machine learning, and I’m really excited to see what’s going on in the field!