Modern augmented reality (AR) applications are built on machine learning (ML), which requires running heavy computational workloads. In the previous blog, we described our experience building such an app.
On mobile, it is common practice to run neural network (NN) inference on-device, without sending any data to a remote server. This has a few advantages, such as lower latency, the ability to work offline, and keeping user data on the device.
When processing a single image and saving it to the gallery, the time taken is rarely a problem. With real-time processing, however, performance becomes critical, and developers look for any opportunity to save an extra millisecond per frame. In this blog, we explain some common approaches to getting better performance for NN inference on mobile.
The most effective way to speed up the workload is to reduce the amount of computation that needs to be done.
For encoder-decoder architectures, it is worth trying an existing, well-optimized encoder (for example, MobileNet). MobileNet and other popular networks use clever optimizations, such as depthwise separable convolutions. Using an existing pre-trained encoder can also save time during training, as you only need to train the decoder for your particular task.
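To illustrate the idea, here is a minimal Keras sketch (the input and layer sizes are arbitrary) comparing a standard convolution with a depthwise separable one; the separable version needs far fewer weights and multiplications:

import tensorflow as tf

# Arbitrary input: a 128 x 128 feature map with 32 channels.
inputs = tf.keras.Input(shape=(128, 128, 32))

# Standard 3x3 convolution with 64 output features: 3*3*32*64 = 18,432 weights (plus biases).
standard = tf.keras.layers.Conv2D(64, 3, padding="same")(inputs)

# Depthwise separable convolution: a 3x3 depthwise step (3*3*32 = 288 weights)
# followed by a 1x1 pointwise step (32*64 = 2,048 weights).
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")(inputs)

print(tf.keras.Model(inputs, standard).count_params())   # 18,496
print(tf.keras.Model(inputs, separable).count_params())  # 2,400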
The depth of the model is another parameter to experiment with. If you are building your own model, or modifying an existing one, it makes sense to reduce the depth and check whether the accuracy is still acceptable. Another thing to try is making the operations themselves lighter by reducing the number of output features in convolutional layers. For example, in Keras some models like MobileNet are configurable: arguments such as alpha and depth_multiplier allow you to make the model smaller.
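As a hedged example, a slimmer MobileNet could be instantiated in Keras roughly like this; the alpha value of 0.5 is only an illustration, and the right value depends on your accuracy requirements:

import tensorflow as tf

# alpha scales the number of filters in every layer, which shrinks both
# the model size and the inference cost.
small_mobilenet = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3),
    alpha=0.5,              # half the filters of the standard model
    depth_multiplier=1,
    include_top=False,      # keep only the encoder part
    weights="imagenet",
)

small_mobilenet.summary()   # prints per-layer shapes and the total parameter count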
Once the model is designed and trained, and good accuracy is achieved, it is worth trying to prune it: getting rid of insignificant layer weights (weight pruning) or connections (channel pruning). This helps to make the model both smaller and faster. You can find more info about both pruning methods here.
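For weight pruning, one possible tool is the TensorFlow Model Optimization Toolkit. The sketch below assumes a trained Keras model called model and some training data called train_data; the 50% sparsity target and the schedule are placeholders:

import tensorflow_model_optimization as tfmot

# Wrap the trained model so that the smallest-magnitude weights are gradually
# zeroed out during a short fine-tuning run (here, up to 50% sparsity).
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,
        final_sparsity=0.5,
        begin_step=0,
        end_step=2000,
    ),
)

pruned_model.compile(optimizer="adam", loss="categorical_crossentropy")
pruned_model.fit(train_data, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before exporting the final, smaller model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)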
To compare different models or see the difference after an optimization, it is necessary to estimate how heavy the network is.
Let us say we have a convolutional layer with input size H × W × C (height, width, number of channels, Fig. 1a) and kernel size N × N × C × F (height, width, number of input channels, number of output features, Fig. 1b).
Then, assuming a stride of 1 and no padding, the output has a size of (H − N + 1) × (W − N + 1), as in Fig. 1c. This is also the number of convolutions, because each convolution produces a single pixel of the output image.
During each convolution, we need to loop through an N-by-N region and compute a weighted sum over all C channels. This must be done for each of the F output features, so the number of multiplications per output pixel is N × N × C × F.
That gives us the total number of multiplications:
Nmul = (H − N + 1) × (W − N + 1) × N × N × C × F
Note that strides and padding affect the size of the output and, therefore, the number of multiplications, so these parameters must be considered as well.
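The counting logic above can be written as a small helper function; this is just a sketch of the standard convolution arithmetic, and the example numbers are arbitrary:

def conv_multiplications(h, w, c, n, f, stride=1, padding=0):
    # Output spatial size with the usual convolution arithmetic.
    out_h = (h + 2 * padding - n) // stride + 1
    out_w = (w + 2 * padding - n) // stride + 1
    # Each output pixel needs N*N*C multiplications for each of the F output features.
    return out_h * out_w * n * n * c * f

# Stride 1 and no padding reproduce the formula above: (H - N + 1) * (W - N + 1) * N * N * C * F.
print(conv_multiplications(h=224, w=224, c=3, n=3, f=32))                       # ~42.6 million
print(conv_multiplications(h=224, w=224, c=3, n=3, f=32, stride=2, padding=1))  # ~10.8 million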
In our project, we used MOPS (millions of operations) to estimate the size of the segmentation model. After we reduced its depth, this number decreased from ~4000 to ~600, which demonstrates the impact very well. Other popular metrics are MACs (also known as MADDs), the number of multiply-accumulate operations, and the number of parameters (weights and biases) in the model. For example, in Keras the model.summary() method shows how many parameters the model has.
Most frameworks for mobile NN inference (for example, TensorFlow Lite) allow you to choose the device that is used for computation.
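With TensorFlow Lite's Python API, for example, the choice can look roughly like the sketch below; the GPU delegate library name is an assumption and depends on the platform and build (on Android, the GPU delegate is usually added through the Java/Kotlin or C++ API instead):

import tensorflow as tf

# CPU inference: the interpreter can use several threads.
cpu_interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
cpu_interpreter.allocate_tensors()

# GPU inference: load a GPU delegate (library name assumed here).
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
gpu_interpreter = tf.lite.Interpreter(model_path="model.tflite",
                                      experimental_delegates=[gpu_delegate])
gpu_interpreter.allocate_tensors()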
Sometimes, the GPU is preferred, especially when working with convolutional neural networks. For more detail on making the most of Arm NN for GPU inference and the different performance-boosting options that are available, I would recommend reading two previous blogs, about the OpenCL tuner and about FP16 and FastMath.
Tensors can be represented internally as textures or buffers, and the same operation (like convolution) is applied to multiple units of data independently in separate threads. ML workloads typically allow this independence and, therefore, scale well to highly multi-threaded execution.
Other NN models show better performance when the CPU is used for inference. It is worth trying both options, measuring the inference time, and then deciding. The CPU can be a good choice when the model is small and the actual workload is comparable to the overhead of transferring input and output data between devices. In some cases, it is also necessary to use a custom implementation of a certain layer. If the framework allows this only on the CPU (using C++, for example), then additional data transfers take place (which will not necessarily affect performance significantly, but it is something to consider).
Inference on the CPU also relies on multiple threads and can use SIMD instructions. This aligns well with image processing (in a convolution, for example, multiple weight values can be multiplied by multiple input values at once). One of the backends in the Unity Barracuda inference engine uses Unity's Burst Compiler technology, which utilizes SIMD functionality and shows very good results. In the Arm Compute Library, one of the available backends allows you to use Arm Neon technology, which is also SIMD. You can find more information about CPU inference in the blogs in the ML section of Arm Community.
Another thing to consider is using different processing units to execute different workloads in parallel. For example, if two neural networks are executed each frame, you can run one of them on the CPU and the other on the GPU. Alternatively, ML workloads can be delegated to the CPU while the GPU is busy with another kind of workload.
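As a rough illustration, two TensorFlow Lite interpreters, one on the CPU and one behind a GPU delegate (the cpu_interpreter and gpu_interpreter from the earlier sketch), could be driven concurrently for each frame:

from concurrent.futures import ThreadPoolExecutor

def run(interpreter, frame):
    # Copy the frame into the input tensor, run inference, and read the result back.
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, frame)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)

executor = ThreadPoolExecutor(max_workers=2)

def process_frame(frame):
    # Start both networks at once and wait for both results.
    gpu_future = executor.submit(run, gpu_interpreter, frame)
    cpu_future = executor.submit(run, cpu_interpreter, frame)
    return gpu_future.result(), cpu_future.result()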
AR is a set of concurrent tasks (NN inference, image processing, graphics rendering), so the overall performance depends on how the entire SoC works together, rather than just the CPU or GPU. Arm's new Total Compute approach is particularly important for AR, with many compute elements needing to come together to allow different AR use cases and workloads to run seamlessly on devices. The CPU drives performance in a power-efficient manner. The GPU drives the graphics. AI is used for detection, from the user's location to specific objects and landmarks. Then, we need to bring this IP together to work seamlessly in the system, with Arm's interconnects, security IP, and controllers adding huge value and helping to build better systems focused on low-power constraints and strong security protections.
And it is not just the performance and security elements of Total Compute that are important for AR. This approach also designs and delivers technologies that give developers access to high performance across multiple platforms, so they can create the most exciting, engaging, and immersive applications. In addition, we are providing developers with frameworks for programming, debugging, and analyzing across all our IP: CPU, GPU, and NPU.
The right choice of mobile ML framework may not only make the development process easier, but also improve performance depending on your target devices.
In our mobile AR filter app project, we needed to execute three neural networks each frame. So, we decided to put some effort into integrating Arm NN into our project, to get maximum performance on Arm devices and have more control over available optimization options.
Arm NN is a low-level machine learning framework and is normally not used by app developers directly, but in our case it was worth it. Apart from being well optimized for Arm CPUs and GPUs, Arm NN also provides options that allow developers to get even more performance and to reduce memory bandwidth and footprint: for example, by using 16-bit floating-point numbers, or the OpenCL workgroup size tuner. You can find more info about these features here.
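As one hedged example of how such options are exposed, Arm NN also provides a TensorFlow Lite delegate; the library name and option keys below are assumptions that may differ between Arm NN versions, so check the documentation for your platform:

import tensorflow as tf

armnn_delegate = tf.lite.experimental.load_delegate(
    "libarmnnDelegate.so",                  # assumed delegate library name
    options={
        "backends": "GpuAcc,CpuAcc",        # prefer the GPU backend, fall back to the CPU
        "reduce-fp32-to-fp16": "true",      # run eligible layers in 16-bit floating point
    },
)

interpreter = tf.lite.Interpreter(model_path="model.tflite",
                                  experimental_delegates=[armnn_delegate])
interpreter.allocate_tensors()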
Of course, the choice also depends on how portable the framework is and how easy it is to use and to adapt your NN models for it. But if performance is the most important factor, then using a low-level option like Arm NN or Core ML may be beneficial.
Apart from speeding up NN inference, other parts of the pipeline may need to be optimized as well.
Pre-processing of the input images is required in most cases. It can be scaling, cropping, or normalization. These operations may be costly, especially if the source image is high resolution, as with images streamed from the mobile camera or loaded from the gallery. One way to make this faster is to use existing, well-optimized libraries like OpenCV (a CPU-side sketch is shown after the shader below). But performing these operations on the GPU may be even more beneficial. It requires writing custom shaders for OpenGL ES, Vulkan, or Metal, or using other tools like Android RenderScript. In our project, we used Unity shaders, which allowed us to make both pre- and post-processing a few times faster.
float4 frag(v2f_img i) : COLOR
{
    // Sample the current camera frame and the new background image.
    float4 foreground = tex2D(_MainTex, i.uv);
    float4 background = tex2D(_Background, i.uv * _BackgroundOffsetScale.zw + _BackgroundOffsetScale.xy);
    // Sample the segmentation mask with a flipped V coordinate.
    float maskValue = tex2D(_Mask, float2(i.uv.x, 1.0 - i.uv.y)).r;
    // Blend the two images using the mask value.
    float4 result = foreground * maskValue + background * (1.0 - maskValue);
    return result;
}
A custom Unity shader, which blends two textures using a mask and applies an offset to create a parallax effect
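For comparison, the CPU-side scaling and normalization mentioned above could look like this OpenCV sketch; the target size and the [0, 1] normalization are placeholders for whatever the model expects:

import cv2
import numpy as np

def preprocess(frame_bgr, size=(256, 256)):
    # Resize the high-resolution camera frame down to the network's input size.
    resized = cv2.resize(frame_bgr, size, interpolation=cv2.INTER_LINEAR)
    # Convert BGR (OpenCV's default) to RGB and normalize to [0, 1].
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    normalized = rgb.astype(np.float32) / 255.0
    # Add the batch dimension expected by most inference engines.
    return np.expand_dims(normalized, axis=0)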
Training the model with input images in the format used by the phone camera (for example, YUV) can sometimes be beneficial, as it removes the need to convert inputs to RGB.
If the input image is represented by a texture (after some processing on the GPU, or if you are using an Android SurfaceTexture) and the inference is running on the GPU, then it is much better to supply it directly to the inference engine (if possible). Reading the contents of the texture back on the CPU before passing it as input (Fig. 2a) is redundant and will affect performance.
It is also worth trying to improve quality without having a big impact on performance. Certain tricks may be used to hide imperfections of the NN model. For example, if human segmentation is used for a background replacement filter, the way the original frame is blended with the new background image may be important. If the edges around the person look good, then even a low-resolution neural network may be enough, and the lower the resolution, the better the inference time.
When you are dealing with a continuous stream of frames, there is another way to improve the quality of segmentation neural networks that produce a grayscale mask. If the inference is running on the GPU, it makes sense to use the three color channels of the input RGBA texture for the current frame data, and the fourth (alpha) channel for the mask from the previous frame. The model must be trained accordingly.
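A minimal sketch of how such an input could be packed (shown with NumPy purely for illustration; frame is a placeholder for the current camera image, and in practice this would be done on the GPU):

import numpy as np

def pack_input(rgb_frame, previous_mask):
    # rgb_frame: H x W x 3, previous_mask: H x W with values in [0, 1].
    # The current frame fills the RGB channels; the previous frame's
    # segmentation mask goes into the alpha channel.
    return np.dstack([rgb_frame, previous_mask]).astype(np.float32)

# On the very first frame there is no previous mask yet, so an empty one is used.
first_input = pack_input(frame, np.zeros(frame.shape[:2], dtype=np.float32))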
Measuring each step of the pipeline is crucial when you are trying to achieve better performance. It may not only help you find the bottleneck, but also show how different parts of the pipeline can be rearranged and performed in parallel to avoid CPU or GPU stalls.
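A simple way to start is to time each stage explicitly; the stage names in the commented example are hypothetical placeholders for your own pipeline:

import time

timings = {}

def timed(name, fn, *args):
    # Measure a single pipeline stage and accumulate the result in milliseconds.
    start = time.perf_counter()
    result = fn(*args)
    timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start) * 1000.0
    return result

# Per frame, for example:
# tensor = timed("preprocess", preprocess, frame)
# mask = timed("inference", run_inference, tensor)
# output = timed("postprocess", postprocess, frame, mask)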
Profiling the neural network itself can also be extremely useful. It shows which layers take the most time, how certain optimizations like pruning have helped, and which parts of the architecture are less worth optimizing. The latest version of Arm Streamline allows users to see individual NN layer execution times when Arm NN is used for inference.
Figure 3 - Arm NN timeline trace in Streamline
AR involves executing computation-heavy workloads. Even with more powerful mobile CPUs, GPUs, and NPUs, it is still a challenge to get real-time performance. However, the popularity of mobile AR, together with the continuous evolution of neural network models and ML frameworks, allows developers to achieve maximum performance with minimal effort.
This blog has shown some common approaches to optimizing AR mobile performance. However, the choice of neural networks and technologies very much depends on the particular use case and target platform.