Maximizing the System Efficiency of Augmented Reality Devices

Chris Szabo
June 1, 2020

Future augmented reality (AR) devices will execute workloads that create immersive experiences - a synergy between the real and virtual worlds. Just as with our current mobile devices, performing a single task efficiently is not enough: a collection of well-designed features will be required to make AR part of our daily lives.

Common questions we ask today when buying a new product for our homes are: does it fit in the available space, and do its colors blend in well with the overall design? This is where AR already performs well. Some of us have tried the AR apps of this kind available on the market and found them simple yet effective; IKEA Place is one example of many.

The IKEA Place Android app in use.

Let us look further into the future and imagine a more enhanced immersive experience using AR glasses. It should feel natural, just like two human beings talking or interacting with their surroundings. To translate these human actions into AR events, the device needs to perform tasks such as: voice recognition, gesture classification, hand and body pose tracking, eye tracking, virtual to real world occlusion, object classification and segmentation. All these features, when enabled on an AR device, make the interaction appear more natural. Currently, one of the most complete AR devices on the market is the HoloLens 2.

Microsoft HoloLens 2

AR Device System

There are several industry challenges that must be overcome for an AR device to become mainstream. These are:

  • A lightweight form factor for comfortable all-day wear
  • A display with high color contrast to produce realistic 3D holograms
  • Energy efficiency that allows hours of continuous use

Historically, the cloud was the only place where intensive computation could take place, but innovation in mobile technology has dramatically increased the compute capacity of mobile devices. Performing computation on mobile devices has the following advantages:

  • Data protection, with sensor data not leaving the device
  • Network bandwidth savings, with no sensor data transfer required
  • Lower latency response, with no network round trip involved
  • High-quality data, with no lossy compression required

However, fitting this workload onto a mobile System on Chip (SoC) efficiently is a challenge. This is because a modern device contains a multitude of compute elements (CPU, GPU, NPU, DSP), all with different performance, power consumption and programmability characteristics.

An AR use-case that renders visual effects onto human faces can be described in the following steps:

  1. Obtain a color image of the environment (camera sensor)
  2. Detect faces (CPU)
  3. Predict the 3D face mesh (NPU)
  4. Render visual effects onto the mesh geometry (GPU)
  5. Send the final image to the display (DPU)

The AR emoji render use-case

We need to achieve a target rate of 60 FPS for this use-case. A popular approach is to pipeline the execution of the tasks, which leads to more efficient hardware utilization. A pipelined version of the previous flow is shown in the following:

Pipelined execution of tasks per processing unit
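The pipelined flow above can be sketched in code. The following is a minimal, illustrative Python model, not real AR code: each stage (a stand-in for the camera, CPU, NPU, GPU, and DPU work) runs in its own thread and passes frames downstream through queues, so several frames are in flight at once. The stage names and the frame-budget constant are assumptions for illustration.

```python
# Illustrative model of the pipelined AR flow: one thread per processing
# unit, connected by queues, so frame N+1 can start face detection while
# frame N is still being rendered.
import queue
import threading

FRAME_BUDGET_MS = 1000 / 60  # ~16.7 ms per frame at the 60 FPS target

def run_stage(name, inbox, outbox):
    """Consume frames from inbox, 'process' them, and pass them on."""
    while True:
        frame = inbox.get()
        if frame is None:        # sentinel: shut the pipeline down
            outbox.put(None)
            break
        frame.append(name)       # stand-in for the real workload
        outbox.put(frame)

stages = ["capture", "face_detect", "mesh_predict", "render", "display"]
queues = [queue.Queue() for _ in range(len(stages) + 1)]
threads = [
    threading.Thread(target=run_stage, args=(s, queues[i], queues[i + 1]))
    for i, s in enumerate(stages)
]
for t in threads:
    t.start()

for frame_id in range(3):        # feed three frames through the pipeline
    queues[0].put([frame_id])
queues[0].put(None)

results = []
while (item := queues[-1].get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # each frame has passed through every stage, in order
```

Because the stages overlap, the steady-state frame rate is set by the slowest stage rather than by the sum of all stages, which is why pipelining helps meet the 16.7 ms budget.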

Digitizing Human Perception

At Arm, we are researching how the human senses (touch, vision, hearing) can be digitized. Just like human perception, an AR device uses digital sensors to capture images of the environment and of human anatomy, and to record sounds. The input data captured by the sensors is passed to the relevant processing tasks to obtain a high-level understanding of the surroundings and of the people, as in the following image.

Digitizing human perception

In other words, data covering the human senses is represented in a digital form that an AR device can understand. A few questions an AR device needs to answer are: Where are the hands located? What are the eyes looking at? Which language is the person speaking? How far away are the surrounding objects? This logic is applied while processing the sensors' data. Examples of processing tasks for each category:

Vision Processing:

  • Depth Prediction and Completion, Object Detection, Hand Tracking and Gesture Recognition, Person Detection, Eye Tracking, Face Detection, Text Detection, Image Segmentation

Audio Processing:

  • Speech Recognition, Voice Commands, Language Detection, Voice Translation, Sound Classification, Audio Fingerprinting, Voice Separation, Speech Transcription

Motion Processing:

  • Simultaneous Localization and Mapping, Hand Gesture, Body and Hand 3D Pose, Face Mesh, Body Skeleton, Eye Direction, Bounding Box Orientation, Object Tracking and Detection

The modern approach to solving these problems is with Machine Learning (ML), a sub-class of Artificial Intelligence (AI), which can effectively deal with partial or noisy data from real-world sensors.

For some tasks, the required input comes from two or more sensors (for example, localization with a SLAM system requires input from both a color camera and an inertial measurement unit). Data access and storage, handled by the system memory, play an important role in ensuring efficient data transfer between the different processors.
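As an illustration of the multi-sensor pairing described above, the following sketch associates each camera frame with the IMU samples that arrived since the previous frame. The timestamps and rates are invented for illustration; a real SLAM front end would also interpolate samples and synchronize sensor clocks.

```python
# Pair each camera frame with the IMU samples captured since the previous
# frame, using binary search over sorted timestamps (milliseconds).
from bisect import bisect_right

def imu_between(imu_ts, t_prev, t_now):
    """Return the IMU timestamps in the half-open interval (t_prev, t_now]."""
    lo = bisect_right(imu_ts, t_prev)
    hi = bisect_right(imu_ts, t_now)
    return imu_ts[lo:hi]

camera_ts = [0, 33, 66]           # ~30 FPS camera frames
imu_ts = list(range(0, 70, 5))    # IMU sampled every 5 ms (a real IMU is denser)

pairs = []
t_prev = -1
for t in camera_ts:
    pairs.append((t, imu_between(imu_ts, t_prev, t)))
    t_prev = t
print(pairs)
```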

To deliver a great experience on AR devices, these tasks must execute in real time. The objective is to maximize the number of tasks that the compute system can execute in parallel, so that the user receives a low-latency response.

AR Device Bottlenecks

Execution time is typically a good metric for measuring the performance of an algorithm. However, on an AR device, which has multiple processing units and tasks with complex dependencies, it is hard to determine whether the execution is efficient just by observing the total time. To make better decisions about algorithm improvements or task scheduling, it helps to have a low-level view of how the tasks execute across the mobile SoC.

Using Arm Streamline, we can quickly determine whether the performance bottleneck relates to CPU processing, GPU rendering or NPU inference. In the following example, interactive charts and comprehensive data visualizations show us how busy the CPU and GPU are with millisecond resolution.

Arm Streamline Performance Analyzer: counter visualizations and interactive charts

On a system with a CPU and GPU, we usually want to make sure that neither unit's work is stalling the other. This can be identified by looking at the time a processing unit spends in active and stalled states during a given time window. The following image shows that the bottleneck is the CPU: the GPU is sometimes idle, while the CPU is continually busy.

CPU and GPU workload utilization
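The active-versus-idle reasoning above can be sketched as a small script. This is an illustrative example with invented sample data, not actual Streamline output: given per-window busy/idle samples for the CPU and GPU, it flags the unit that stays saturated while the other idles.

```python
# Identify the bottleneck in a time window from sampled busy/idle states
# (1 = busy, 0 = idle). The sample data below is invented for illustration.

def utilization(samples):
    """Fraction of samples in which the unit was busy."""
    return sum(samples) / len(samples)

cpu = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # CPU continually busy
gpu = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]   # GPU sometimes idle, waiting on the CPU

cpu_util, gpu_util = utilization(cpu), utilization(gpu)
if cpu_util > 0.95 and gpu_util < cpu_util:
    verdict = "CPU-bound: the GPU is starved while the CPU saturates"
elif gpu_util > 0.95 and cpu_util < gpu_util:
    verdict = "GPU-bound: the CPU is waiting on GPU work"
else:
    verdict = "no single dominant bottleneck in this window"

print(f"CPU {cpu_util:.0%}, GPU {gpu_util:.0%} -> {verdict}")
```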

With the recent launch of the Arm Ethos-N78 NPU, Arm Streamline has added support for the Ethos NPU hardware counters, which are available together with the CPU and GPU counters in the same profiling session. This makes performance triage possible for the full system. In the following example, we have selected counters to measure NPU memory bandwidth, and we can easily select CPU counters to be captured at the same time.

Ethos-N78 counter configuration in Arm Streamline
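As a hedged illustration of what such memory-bandwidth counters represent, the following sketch converts a raw bus-beat count over a sampling window into MB/s. The beat size, counter values, and window length are assumptions for illustration, not the actual Ethos-N78 counter interface.

```python
# Convert a bus-beat count sampled over an interval into a bandwidth figure
# in MB/s, the kind of value a profiler charts. All inputs are hypothetical.

def bandwidth_mb_s(beats, bytes_per_beat, interval_s):
    """Bandwidth in MB/s from a beat count over a sampling interval."""
    return beats * bytes_per_beat / interval_s / 1e6

read_beats = 2_000_000    # hypothetical read beats in the sample window
interval = 0.1            # 100 ms sampling window
print(bandwidth_mb_s(read_beats, 16, interval), "MB/s reads")
```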

To demonstrate how Ethos NPUs fit seamlessly into bottleneck identification workflows, let us look at how CPUs and NPUs work together when running neural network inference. The following image shows the memory usage of a neural network executing on the NPU.

Streamline counter capture of a MobileNet model running on the CPU and Ethos-N78

Using a single tool to profile the whole SoC, across the CPU, GPU, and NPU, allows us to see in great detail where an algorithm is running. This highlights the areas where we should focus potential optimization efforts.

Conclusion

Immersive augmented reality requires a mobile SoC that can handle a wide variety of compute-intensive workloads, from application logic to graphics rendering and ML inference. The latest CPUs, GPUs, and NPUs can provide the compute needed to power future AR applications that engage the user in an immersive experience. However, making efficient use of these complex systems is a challenge.

To deploy those experiences as efficiently as possible, we need tools that provide full-system insight into how efficiently applications use the compute available to them. This allows AR developers to identify performance bottlenecks, potential optimizations, and new scheduling algorithms to explore. The result is smoother, smarter experiences that consume less power.

Streamline is Arm’s flagship tool for full-system performance analysis, available with support for the latest Arm technology as part of Arm Development Studio.

Learn more about Arm Streamline

Learn more about Ethos NPUs
