Heavy computational workloads, such as ML inference, make optimization and profiling essential, and doing them well requires both the right approach and the right tools.
This tutorial demonstrates how to use Arm Streamline Performance Analyzer to profile an ML-based Android application.
Streamline is a tool that allows you to profile programs running on Arm-based mobile devices. It provides both CPU and GPU counters; the GPU counters are especially useful for GPU ML inference or for combined ML and graphics applications. The most important metrics we can examine with Streamline include CPU activity, GPU usage, and memory bandwidth.
Streamline is included in Arm Mobile Studio, which can be downloaded from our developer portal.
In one of our previous blogs, we described an AR filter project. Let’s take a look at how Streamline helped us analyze the app’s performance.
In Streamline, we can see the list of devices connected through adb.
After the device is selected, we can see the Android packages available for profiling.
The package must be ‘debuggable’ to be profiled with Streamline.
Note for Unity developers: to make the application package ‘debuggable’, enable the “Development Build” option in the Build Settings.
Finally, we can press “Configure Counters” to select the counters we want to see in the capture, and then start the capture itself (the application will be launched automatically).
In our case, the application uses GPU ML inference, so we are interested in GPU counters. The easiest way to configure them is to select an existing template (in this example, the one for the Arm Mali-G78 GPU):
Once the capture has finished, we can select a region using callipers and see the data specific to that region. In the image below, the selected area corresponds to a single frame of our AR application.
The values for each counter (such as CPU Activity, Mali GPU Usage, and Mali Memory Bandwidth) are calculated for the range we have selected and displayed to its left.
In the AR Filter app, we execute three neural network models for each frame:
Note how the periods of high non-fragment queue activity correspond to these three networks.
Most ML frameworks use compute shaders or OpenCL kernels for GPU inference. On Mali, this kind of GPU workload is scheduled to the non-fragment queue. In combined graphics and ML applications, this is how we can distinguish inference from graphics rendering, which relies on both the fragment and non-fragment queues: graphics workloads also show fragment shader activity straight after the vertex (non-fragment) stage.
In our AR project, we used ArmNN for neural network inference. It provides good performance on Arm-based mobile devices, and we can also benefit from using ArmNN and Streamline together: if the ArmNN runtime is configured for profiling, we can see each neural network execution, and even individual layers, on the timeline.
ArmNN must be built with the -DPROFILING_BACKEND_STREAMLINE=1 flag to add support for this functionality. You also need to enable it in code when initializing the ArmNN runtime:
armnn::IRuntime::CreationOptions options;
options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
auto runtime = armnn::IRuntime::Create(options);
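For context, here is a minimal end-to-end sketch of running one model through ArmNN with profiling enabled. The model file name ("segmentation.tflite") and the tensor names ("input", "output") are hypothetical, and the parser API differs slightly between ArmNN releases, so treat this as a sketch rather than a drop-in implementation:

#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>
#include <vector>

// Create the runtime with Streamline profiling enabled, as above.
armnn::IRuntime::CreationOptions options;
options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

// Parse the model (file name is hypothetical).
auto parser = armnnTfLiteParser::ITfLiteParser::Create();
armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("segmentation.tflite");

// Binding points for the (hypothetical) input and output tensor names.
auto inputBinding  = parser->GetNetworkInputBindingInfo(0, "input");
auto outputBinding = parser->GetNetworkOutputBindingInfo(0, "output");

// Optimize for the Mali GPU backend and load the network into the runtime.
armnn::IOptimizedNetworkPtr optNet = armnn::Optimize(
    *network, {armnn::Compute::GpuAcc}, runtime->GetDeviceSpec());
armnn::NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optNet));

// Size the I/O buffers from the binding info.
std::vector<float> inputData(inputBinding.second.GetNumElements());
std::vector<float> outputData(outputBinding.second.GetNumElements());
armnn::TensorInfo inputInfo = inputBinding.second;
inputInfo.SetConstant(true); // required for ConstTensor in recent ArmNN versions

armnn::InputTensors inputTensors =
    {{inputBinding.first, armnn::ConstTensor(inputInfo, inputData.data())}};
armnn::OutputTensors outputTensors =
    {{outputBinding.first, armnn::Tensor(outputBinding.second, outputData.data())}};

// Each EnqueueWorkload call shows up on the Arm NN timeline in Streamline.
runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);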
The capture is configured and recorded as usual. Then we can select the “Arm NN timeline” view:
In the following capture, you can see how the three models are executed one after another in each frame. The selected range corresponds to the first model (Background Segmentation). Note how it matches the high non-fragment queue usage on the Mali GPU.
We can expand each of the models and look at individual layer executions and links between them.
As you can see, the first model is the real bottleneck in our pipeline. We optimized the model by reducing the number of filters in the first decoder layer from 512 to 128.
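To see why this helps, consider a rough multiply-accumulate (MAC) estimate for a standard convolution layer. The spatial and kernel dimensions below are hypothetical and chosen only to illustrate the effect of cutting the filter count:

#include <cstdint>
#include <cstdio>

// Rough MAC count of a standard 2D convolution:
// outputHeight * outputWidth * kernelH * kernelW * inChannels * outChannels.
constexpr uint64_t ConvMacs(uint64_t outH, uint64_t outW,
                            uint64_t kH, uint64_t kW,
                            uint64_t inC, uint64_t outC)
{
    return outH * outW * kH * kW * inC * outC;
}

int main()
{
    // Hypothetical decoder-layer shape: 32x32 output, 3x3 kernel, 512 input channels.
    const uint64_t before = ConvMacs(32, 32, 3, 3, 512, 512);
    const uint64_t after  = ConvMacs(32, 32, 3, 3, 512, 128);
    std::printf("MACs: %llu -> %llu (%.0fx fewer)\n",
                static_cast<unsigned long long>(before),
                static_cast<unsigned long long>(after),
                static_cast<double>(before) / static_cast<double>(after));
    return 0;
}

Reducing the output channels also shrinks the input channel count of the following layer, so the saving compounds beyond the modified layer itself.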
The overall model execution time decreased from 37ms to 20ms:
And the duration of the layer called “DepthwiseConv2D:0:17” decreased from 71 microseconds to 51 microseconds:
Another useful feature in Streamline is annotations. You can mark specific parts of your code and see them in the Streamline capture. For example, we can mark the start and end of each neural network execution:
ANNOTATE_COLOR(ANNOTATE_PURPLE, "NN Inference");
runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
ANNOTATE_END();
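These macros come from the streamline_annotate.h header shipped with Streamline, and the annotation channel needs a one-time setup call (ANNOTATE_SETUP) at startup. As a sketch, a small RAII wrapper (our own helper, not part of the Streamline API) keeps the start and end calls paired even when the code between them returns early or throws:

#include "streamline_annotate.h"

// Hypothetical helper: ends the annotated region automatically when the
// object goes out of scope, so ANNOTATE_END() is never forgotten.
class ScopedAnnotation
{
public:
    ScopedAnnotation(uint32_t color, const char* name)
    {
        ANNOTATE_COLOR(color, name);
    }
    ~ScopedAnnotation()
    {
        ANNOTATE_END();
    }
};

// Usage, with the runtime objects created earlier in this post:
{
    ScopedAnnotation region(ANNOTATE_PURPLE, "NN Inference");
    runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
}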
If we expand the main thread entry in the Heat Map, we can see each execution on the timeline and select the corresponding range using callipers:
We have covered the process of configuring Streamline and getting a profiling capture of an Android application.
Profiling ML-based applications using Streamline allows you to measure CPU and GPU load, distinguish ML inference from graphics rendering, and examine the execution of each neural network down to individual layers. This will help you find bottlenecks, optimize your application, and get better performance.