Future augmented reality (AR) devices execute workloads that create immersive experiences - a synergy between real and virtual worlds. Just like our current mobile devices, it is not enough to perform just a single task efficiently, so a collection of well-designed features will be required to make AR part of our lives.
Common questions we ask today when buying a new product for our homes are: does it fit in the space available, and do the colors blend in with the overall design? This is where AR performs well today. Some of us have already tried AR apps of this kind that are available on the market and found them simple, yet effective. IKEA Place is one example of many.
An image showing IKEA Place in use.
Let us look further into the future and imagine a more enhanced immersive experience using AR glasses. It should feel natural, just like two human beings talking or interacting with their surroundings. To translate these human actions into AR events, the device needs to perform tasks such as voice recognition, gesture classification, hand and body pose tracking, eye tracking, virtual-to-real-world occlusion, and object classification and segmentation. All these features, when enabled on an AR device, make the interaction appear more natural. Currently, one of the most complete AR devices on the market is the HoloLens 2.
Microsoft HoloLens 2
There are several industry challenges that must be overcome for an AR device to become mainstream.
Historically, the cloud was the only place where intensive computation could take place, but innovation in mobile technology has dramatically increased the compute capacity of mobile devices. Performing computation on mobile devices brings significant advantages.
However, fitting this workload onto a mobile System on Chip (SoC) efficiently is a challenge. This is because a modern device contains a multitude of compute elements (CPU, GPU, NPU, DSP), all with different performance, power consumption and programmability characteristics.
An AR use-case that renders visual effects for human faces can be broken down into a series of steps, shown in the following figure:
The AR use-case
We need to achieve a target rate of 60 FPS for this use-case. A popular approach is to pipeline the execution of tasks, which leads to more efficient hardware utilization. A pipelined version of the previous flow is presented below:
Pipeline execution of tasks per processing unit
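To see why pipelining matters for the 60 FPS target (a budget of roughly 16.7 ms per frame), consider a simple throughput model: when stages run back to back, one frame occupies the sum of all stage times, but in a steady-state pipeline the slowest stage sets the pace. The stage names and timings below are illustrative assumptions, not measurements from a real device.

```python
# Sketch: frame-time budget under sequential vs pipelined execution.
# Stage times are illustrative assumptions, not measured values.
STAGES_MS = {"camera+detect": 6.0, "ml_inference": 7.0, "render": 5.0}

sequential_ms = sum(STAGES_MS.values())   # one frame passes through every unit in turn
pipelined_ms = max(STAGES_MS.values())    # steady state: slowest stage sets the pace

def fps(frame_time_ms):
    """Frames per second for a given per-frame time in milliseconds."""
    return 1000.0 / frame_time_ms

print(f"sequential: {sequential_ms:.1f} ms/frame -> {fps(sequential_ms):.0f} FPS")
print(f"pipelined : {pipelined_ms:.1f} ms/frame -> {fps(pipelined_ms):.0f} FPS")
```

With these assumed timings, the sequential schedule misses the 60 FPS target while the pipelined one clears it comfortably, at the cost of one extra frame of latency per pipeline stage.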
At Arm, we are researching how the human senses (that is, touch, vision, hearing) can be digitized. Just like human perception, an AR device uses digital sensors to capture images of the environment and of human anatomy, and to record sounds. The input data is captured by sensors, then passed to the relevant processing tasks to obtain a high-level understanding of the surroundings and of the people, as per the following image.
Digitizing human perception
In other words, data covering the human senses is represented in a digital form that an AR device can understand. A few questions that AR devices need to answer are: Where are the hands located? What are the eyes looking at? Which language is the person speaking? How far away are the surrounding objects? This logic lives in the block that processes the sensors’ data, and each category of sensor data has its own set of processing tasks.
The modern approach to solving these problems is with Machine Learning (ML), a sub-class of Artificial Intelligence (AI), which can effectively deal with partial or noisy data from real-world sensors.
For some tasks, the required input comes from two or more sensors (for example, localization with a SLAM system requires input from a color camera and an inertial measurement unit). Data access and storage, handled by the system memory, play an important role in ensuring efficient data transfer between the different processors.
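One practical detail in fusing multiple sensors is that the streams arrive at different rates, so each camera frame must be paired with the nearest sample from the faster sensor before fusion. The sketch below shows one simple way to do this by timestamp; the 30 Hz camera and 200 Hz IMU rates are assumptions for illustration, not a prescription for a real SLAM front end.

```python
# Sketch: pairing each camera frame with the nearest IMU sample by timestamp,
# as a SLAM front end might do before fusing the two streams.
import bisect

imu_ts = [i * 5 for i in range(200)]   # assumed 200 Hz IMU -> one sample every 5 ms
cam_ts = [i * 33 for i in range(10)]   # assumed ~30 Hz camera -> one frame every 33 ms

def nearest(sorted_ts, t):
    """Return the timestamp in sorted_ts closest to t (sorted_ts must be sorted)."""
    i = bisect.bisect_left(sorted_ts, t)
    candidates = sorted_ts[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda s: abs(s - t))

pairs = [(t, nearest(imu_ts, t)) for t in cam_ts]
print(pairs[:3])   # (camera timestamp, matched IMU timestamp)
```

With a 5 ms IMU period, the matched sample is never more than 2.5 ms away from the frame timestamp, which bounds the error this alignment introduces.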
To deliver a great experience on AR devices, these tasks must execute in real time. The objective is to maximize the number of tasks that the compute system can execute in parallel, so that the user receives a low-latency response.
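Which tasks can run in parallel is constrained by their data dependencies: hand tracking cannot start before the camera frame exists, and rendering cannot start before its inputs are ready. One way to expose the available parallelism is to group the task graph into topological levels, where every task in a level can run concurrently. The graph below is a made-up example loosely based on the sensing tasks above.

```python
# Sketch: grouping dependent tasks into parallel batches (topological levels)
# using the standard-library graphlib. The task graph is illustrative only.
from graphlib import TopologicalSorter

deps = {
    "hand_pose": {"camera"},
    "eye_tracking": {"camera"},
    "slam": {"camera", "imu"},
    "occlusion": {"slam"},
    "render": {"hand_pose", "eye_tracking", "occlusion"},
}

ts = TopologicalSorter(deps)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())   # everything whose inputs are complete
    batches.append(ready)
    ts.done(*ready)
print(batches)
```

Each inner list is a set of tasks a scheduler could dispatch to different processing units at the same time; the number of levels is a lower bound on the critical-path length of one frame.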
Execution time is typically a good metric for measuring the performance of an algorithm. However, on an AR device, which has multiple processing units and tasks with complex dependencies, it is hard to determine if the execution is efficient just by observing total time. To make better decisions for algorithm improvement or task scheduling, it is beneficial to have a low-level view of how the tasks are executed across the mobile SoC.
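Before reaching for a full profiler, a first step toward that low-level view is simply attributing wall-clock time to individual tasks rather than to the whole frame. The sketch below uses a context manager to accumulate per-task times; the task names and workloads are placeholders, not the actual AR pipeline.

```python
# Sketch: a minimal per-task timer, giving a task-level breakdown rather than
# a single end-to-end number. Task names and workloads are placeholders.
import time
from collections import defaultdict
from contextlib import contextmanager

task_ms = defaultdict(float)

@contextmanager
def timed(name):
    """Accumulate the wall-clock time of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        task_ms[name] += (time.perf_counter() - start) * 1000.0

with timed("inference"):
    sum(i * i for i in range(100_000))   # stand-in for real work
with timed("render"):
    time.sleep(0.005)                    # stand-in for real work

for name, ms in sorted(task_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {ms:7.2f} ms")
```

This only measures where CPU-visible time goes; it cannot tell you whether the GPU or NPU was stalled in the meantime, which is exactly where a system-wide profiler such as Streamline becomes necessary.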
Using Arm Streamline, we can quickly determine whether the performance bottleneck relates to CPU processing, GPU rendering or NPU inference. In the following example, interactive charts and comprehensive data visualizations show us how busy the CPU and GPU are with millisecond resolution.
Arm Streamline Performance Analyzer
On a system with a CPU and a GPU, we usually want to make sure that neither processor’s work is stalling the other. This can be identified by looking at the time each processing unit spends in active and stalled states during a given time window. The following image shows that the bottleneck is the CPU, because the GPU is sometimes idle while the CPU is continually busy.
CPU and GPU workload utilization
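The reasoning behind that judgement can be made mechanical: sample each unit’s busy/idle state over a window, compute utilization, and flag the unit that stays saturated while others idle. The sample data below is made up to mirror the situation in the image; it is not Streamline output.

```python
# Sketch: inferring the bottleneck from busy/idle samples per processing unit.
# One boolean per 1 ms tick: True = busy. Sample data is invented for illustration.
samples = {
    "CPU": [True] * 50,                    # saturated for the whole window
    "GPU": [True] * 35 + [False] * 15,     # idle ~30% of the window
}

util = {unit: sum(s) / len(s) for unit, s in samples.items()}
bottleneck = max(util, key=util.get)
print({unit: f"{u:.0%}" for unit, u in util.items()}, "->", bottleneck)
```

A unit at 100% utilization while its consumers sit idle is the natural first target for optimization or for moving work to another processor.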
With the recent launch of the Arm Ethos-N78 NPU, Arm Streamline has added support for the Ethos NPU: its hardware counters are now available alongside the CPU and GPU counters in the same profiling session. This makes performance triage possible for the full system. In the example below, we have selected counters to measure the memory bandwidth of the NPU, and we can easily capture CPU counters at the same time.
Ethos-N78 counter configuration in Arm Streamline
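Raw bandwidth counters typically report a cumulative byte (or beat) count, so a bandwidth figure is derived from the delta between two capture points divided by the sampling interval. The counter values below are hypothetical and the function is a generic sketch, not the actual Ethos-N78 counter interface.

```python
# Sketch: turning cumulative byte counters from two capture points into an
# average bandwidth. Counter values here are hypothetical, not real Ethos data.
def bandwidth_gb_s(bytes_start, bytes_end, interval_s):
    """Average bandwidth over the sampling interval, in GB/s."""
    return (bytes_end - bytes_start) / interval_s / 1e9

# e.g. suppose the NPU read counter advanced by 120 MB over a 100 ms window:
read_bw = bandwidth_gb_s(0, 120_000_000, 0.1)
print(f"NPU read bandwidth: {read_bw:.2f} GB/s")
```

Comparing such derived read and write bandwidth against the SoC’s available memory bandwidth is one quick way to tell whether an inference workload is compute-bound or memory-bound.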
To demonstrate how Ethos NPUs fit seamlessly into bottleneck-identification workflows, let us look at how the CPU and NPU work together when running neural network inference. The following image shows the memory usage of a neural network executed on the NPU.
Streamline counters capture of MobileNet model
Using a single tool to profile the whole SoC, across the CPU, GPU, and NPU, lets us see in great detail where an algorithm is running. This highlights the areas on which to focus potential optimization efforts.
Immersive augmented reality requires a mobile SoC that can handle a wide variety of compute-intensive workloads, from application logic to graphics rendering and ML inference. The latest CPUs, GPUs, and NPUs can provide the compute needed to power future AR applications that draw the user into an immersive experience. However, making efficient use of these complex systems is a challenge.
To deploy those experiences as efficiently as possible, we need tools that provide full-system insights into how efficiently those applications are using the compute available to them. This allows AR developers to identify potential optimizations, performance bottlenecks and opportunities to explore with new scheduling algorithms. The results are smoother and smarter experiences that consume less power.
Streamline is Arm’s flagship tool for full-system performance analysis, available with support for the latest Arm technology as part of Arm Development Studio.
Learn more about Arm Streamline
Learn more about Ethos NPUs