Augmented reality (AR) is a technology where virtual content is overlaid on top of the real world. If you’ve ever played with Snapchat “Lenses” on your mobile phone to transform a selfie with a cartoon monocle or dog ears, you will probably know what I’m talking about, and will have enjoyed the entertaining results!
In this case, AR is a fun, playful element in a social media app. However, AR also has broad practical uses. For instance, AR systems have been used to assist disabled shoppers and to boost assembly line workers' efficiency while reducing physical strain, and, as Figure 1 shows, AR can also have more serious uses in medical applications. In this blog, I’m going to take a look at the technological challenges posed by AR on mobile devices and briefly describe a paper from the Arm Research Machine Learning Group that seeks to address them.
Figure 1: Augmented reality can also have more serious medical applications
Computer vision (CV) is the fundamental technology underlying AR. In the Snapchat example I mentioned, computer vision techniques are used to recognize facial features in the video stream, so that the virtual content can be overlaid (composited) appropriately. For the effect to be convincing, the object detection needs to be very accurate, or else the overlays will appear in the wrong place on the image and spoil the illusion. It is also necessary not only to detect objects, but also to track them in real time as they move about the frame.
The best approach we currently know of for accurate object detection and tracking is to use convolutional neural networks (CNNs). This is probably not a surprise to anyone who’s been following the rise of Machine Learning (ML) in recent years. The well-known “YOLO” CNN architecture is a good example of this approach: it can perform object detection in images in real time. Unfortunately, CNNs require a huge amount of compute and a lot of memory, and are therefore power hungry. As a consequence, real-time object detection and tracking is very difficult on mobile/IoT devices, which are battery powered and therefore have a limited power budget to spend on compute. In fact, Figure 2 shows that the more accurate CNN approaches to object detection, such as Faster R-CNN, YOLO and SSD, are not feasible within the power budget of a mobile device. Older “hand-crafted” approaches fit within the power budget, but do not achieve sufficient accuracy to be useful.
Figure 2: For object detection in images, there is a trade-off between the achieved accuracy and the number of compute operations required per second. Given the 1 W power budget of a mobile device, only the lower-accuracy approaches are practical.
At Arm’s Machine Learning Research Lab in Boston, MA, we’ve been working on reducing the power consumption of CNNs in mobile SoCs. Recent work in collaboration with Prof. Yuhao Zhu at the University of Rochester has resulted in a number of promising advances. Arm Research is enthusiastic about sharing advances with the ecosystem where possible, so as a result of this work, we will be presenting a paper entitled “Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision” at the International Symposium on Computer Architecture (ISCA) this summer. The paper shows how we can perform real-time object detection and tracking within the power budget of mobile devices, which is the key enabling technology for future mobile AR apps!
I often find that the most exciting breakthrough moments in a research project come after thinking very broadly about the problem at hand. In this case, instead of focusing on trying to make CNNs themselves more power efficient, we started to examine the whole imaging pipeline involved in an end-to-end AR application. This involves a large number of components and processing stages, from the camera sensor itself, to the image signal processor (ISP), to the CNN accelerator, and then on to the CPU, as shown in Figure 3. In particular, the algorithms in the ISP have become increasingly sophisticated in recent years, and although they don’t perform object detection per se, they do calculate pixel-level motion between video frames (see the simple example in Figure 4). This metadata is used by the ISP to implement advanced de-noising techniques.
Intuitively, it seems that this pixel-level motion information should also be useful after the ISP stage, where the computer vision is performed. For object detection and tracking, a stationary or nearly stationary video frame will exhibit the same object region-of-interest (ROI) coordinates as the previous frame. And even if there is some motion between frames, as long as the frame rate is sufficiently high, we can extrapolate the object ROIs instead of running an expensive CNN inference. These observations can lead to very significant savings in battery power, because the motion information is essentially free: it has already been calculated in the ISP.
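To make the idea concrete, here is a minimal Python sketch of how a previous-frame ROI could be shifted by the average motion inside it, instead of re-running CNN inference. The function name and the layout of the motion-vector array are my own illustrative choices, not an interface from the paper:

```python
import numpy as np

def extrapolate_roi(roi, motion_vectors):
    """Shift a bounding box by the mean motion of the pixels it covers.

    roi            -- (x, y, width, height) detected in the previous frame
    motion_vectors -- H x W x 2 array of per-pixel (dx, dy) motion,
                      of the kind the ISP already computes for de-noising
    """
    x, y, w, h = roi
    # Average the motion vectors that fall inside the previous ROI.
    dx, dy = motion_vectors[y:y + h, x:x + w].reshape(-1, 2).mean(axis=0)
    # Predict the new position; the box size is kept unchanged here.
    return (int(round(x + dx)), int(round(y + dy)), w, h)
```

In practice the paper uses a more careful extrapolation than this simple average, but the principle is the same: reuse motion information the ISP has already paid for.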
Figure 3: Typical imaging pipeline in a mobile device.
Figure 4: Simple illustration of motion vectors in an image of a person walking.
I explained previously that it’s feasible to reduce the power consumption of real-time object detection and tracking by exploiting motion information generated in the ISP. But how exactly can this be implemented in a practical mobile SoC? Well, this is where my favorite computer architecture concept comes in: co-design. This is a somewhat fashionable term in the computer architecture field, but to me, the notion of hardware/software co-design is absolutely nothing more than a reminder that hardware design should always be driven by the needs of the actual workloads and algorithms we want to execute.
The basic algorithmic approach proposed in our ISCA’18 paper is to exploit motion as illustrated in Figure 5. In essence, the ROIs associated with an object are initially generated by (power-hungry) CNN inference on the input image, referred to as an I-frame. For the following frames, we switch to a much cheaper approach: a simple affine transform of the ROIs based on the motion information; these are the E-frames. You can think of this as a prediction of where an object has moved, based on how the pixels have moved between the previous frame and the current one. An I-frame can typically be extrapolated over a number of subsequent frames without significantly degrading the accuracy of the ROI positions. The number of E-frames estimated before another “ground-truth” I-frame is called the extrapolation window (EW).
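In Python, the scheme could look something like the sketch below. The detect_cnn and get_motion callables are stand-ins for the CNN accelerator and the ISP's motion output, and the E-frame update reuses the simple extrapolate_roi sketch from earlier rather than the paper's full extrapolation algorithm:

```python
def detect_and_track(frames, detect_cnn, get_motion, extrapolation_window=4):
    """Alternate between CNN inference on I-frames and motion-based
    extrapolation on E-frames (a sketch of the scheme in Figure 5)."""
    rois = []
    for i, frame in enumerate(frames):
        if i % extrapolation_window == 0:
            # I-frame: run the expensive CNN to get fresh, "ground-truth" ROIs.
            rois = detect_cnn(frame)
        else:
            # E-frame: cheaply shift each ROI using the ISP's motion vectors.
            motion_vectors = get_motion(frame)
            rois = [extrapolate_roi(roi, motion_vectors) for roi in rois]
        yield rois
```

With extrapolation_window=4 (EW-4), only one frame in four pays for a full CNN inference; the other three cost little more than a handful of additions per ROI.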
The performance of the motion-extrapolation approach is excellent, even though we are using much less compute power. Figure 6 shows a couple of results from the ISCA’18 paper. In terms of average accuracy, a motion-extrapolation over 2 or 4 frames (EW-2/EW-4) shows very similar performance to the state-of-the-art YOLOv2 CNN. And, in terms of operations/frame and memory bandwidth/frame, motion-extrapolation shows a huge benefit in efficiency. More concretely, by incorporating motion information, we are able to double the achieved video frame rate at a 45% energy saving, with a negligible accuracy loss of only 0.58% for object detection tasks.
Finally, the co-design bit comes in. We need to make some changes to the SoC architecture: the motion metadata computed by the ISP is ordinarily discarded after it has been used internally, so instead we expose this motion information at the system level and introduce a new component, the motion controller, which reads the motion data via the frame buffer in DRAM. The motion controller calculates the extrapolation of ROIs for E-frames, and sequences the CNN accelerator during I-frames. Figure 7 illustrates all of this, but for the gory details, I encourage you to check out the ISCA’18 paper, which presents results from a detailed simulation of this SoC architecture.
Download the full paper
Figure 5: The object Region of Interest (ROI) is initially detected using CNN inference during an I-frame. The ROI in the E-frames is extrapolated from the previous frame using motion information.
Figure 6: Motion-extrapolation with moderate extrapolation window sizes (EW-2 to EW-4) compares very favorably with state-of-the-art CNNs such as YOLOv2, in terms of accuracy (left) at a fraction of the compute and memory requirements (right).
Figure 7: Block diagram of the augmented vision system in a mobile SoC, with motion metadata shared via the frame buffer in DRAM.
Dedicated hardware accelerators for CNNs are all the rage right now, as they allow us to significantly reduce power consumption compared to executing the same CNN computation on a general-purpose CPU or GPU on the SoC. In this work, I needed to accurately model the power consumption, latency, circuit area and memory bandwidth of running different CNN models on a hardware accelerator. To meet this requirement, we wrote our own CNN accelerator simulator, called SCALE-Sim, in collaboration with PhD student Ananda Samajdar (Georgia Tech). SCALE-Sim allows anyone to easily generate these metrics for any CNN model in the hugely popular TensorFlow framework. The architectural parameters of the accelerator can be tuned to suit a wide range of applications, from datacenter down to IoT devices. Arm has recently open-sourced SCALE-Sim on GitHub, so I encourage you to give it a go!
Download SCALE-Sim on GitHub
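To give a flavor of what such a simulator computes, here is a deliberately simplified, back-of-envelope Python sketch of per-layer metrics (MAC count, cycles, utilization) for a convolution mapped onto a weight-stationary systolic array. To be clear, this is not SCALE-Sim's interface or its cost model, just an illustration of the kind of first-order estimate involved; the 32x32 array size and the fill/drain assumptions are my own:

```python
import math

def conv_layer_estimate(h_out, w_out, c_in, c_out, k,
                        array_rows=32, array_cols=32):
    """Rough first-order work estimate for one conv layer on an
    R x C weight-stationary systolic array (illustrative only)."""
    macs = h_out * w_out * c_out * k * k * c_in   # total multiply-accumulates

    # View the convolution as a GEMM (im2col): an (M x K') activation
    # matrix multiplied by a (K' x N) weight matrix.
    m = h_out * w_out        # number of output pixels
    k_dim = k * k * c_in     # reduction dimension
    n = c_out                # number of output channels

    # The weight matrix is covered by this many array-sized tiles ("folds").
    folds = math.ceil(k_dim / array_rows) * math.ceil(n / array_cols)

    # Per fold: ~array_rows cycles to load weights, then stream M activation
    # rows through the array, plus pipeline fill/drain latency.
    cycles = folds * (array_rows + m + array_rows + array_cols - 2)

    utilization = macs / (cycles * array_rows * array_cols)
    return {"macs": macs, "cycles": cycles, "utilization": utilization}

# Example: a 3x3 convolution, 256 -> 256 channels, on a 56x56 feature map.
print(conv_layer_estimate(56, 56, 256, 256, 3))
```

A real simulator like SCALE-Sim models the dataflow, SRAM sizes and DRAM traffic in far more detail, but this is the kind of per-layer accounting that lets you compare accelerator configurations quickly.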
Moving forward, low-power neural networks are taking center stage in mobile computing systems and are rapidly becoming a cornerstone of a vast number of application domains, across product segments including IoT, automotive and datacenter. The Arm Research ML Group is currently expanding to meet the broad ML needs of the Arm ecosystem. If you are a motivated machine learning or computer architecture researcher who is interested in working on solving real problems, then please get in touch. For more details, please visit our careers website.