Using Arm Streamline for profiling ML workloads

Pavel Rudko
November 18, 2021
4 minute read time.

When working with computationally heavy workloads, such as ML inference, optimization and profiling are essential, and both require the right approach and the right tools.

This tutorial demonstrates how to use Arm Streamline Performance Analyzer to profile an ML-based Android application.

Streamline

Streamline is a tool that lets you profile programs running on Arm-based mobile devices. It provides both CPU and GPU counters. GPU counters are especially useful for GPU ML inference or combined ML and graphics applications. The most important metrics we can estimate using Streamline are:

  • CPU usage (including activity for each core)
  • GPU usage (including utilization for fragment and non-fragment queues)
  • GPU memory bandwidth

Streamline is included in Arm Mobile Studio, which can be downloaded from our developer portal.

Practical example

In a previous blog, we described an AR filter project. Let’s take a look at how Streamline helped us analyze the app’s performance.

In Streamline, we can see the list of devices connected through adb.

Arm Streamline image 1

After the device is selected, we can see the Android packages available for profiling.

Arm Streamline image 2

The package must be ‘debuggable’ to be profiled with Streamline.

Note for Unity developers: to make the application package ‘debuggable’, enable the “Development Build” option in the Build Settings.

Arm Streamline image 3
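For non-Unity apps, the package is typically made debuggable through the Android manifest (or by using a debug build variant). A minimal sketch of the relevant manifest attribute:

```xml
<!-- AndroidManifest.xml: a debuggable build lets Streamline attach to the package -->
<application android:debuggable="true">
    <!-- activities and other components go here -->
</application>
```

Release builds normally strip this attribute, so profile a debug variant of your application.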

Finally, we can press “Configure Counters” to select the counters we want in the capture, and then start the capture itself (the application is launched automatically).

Arm Streamline image 4

In our case, the application uses GPU ML inference, so we are interested in GPU counters. The easiest way to configure them is to select an existing template (in this example, the Arm Mali-G78 GPU):

Arm Streamline image 5

Once the capture is finished, we can select a region using the callipers and see the data specific to that region. In the image below, the selected area corresponds to a single frame in our AR application.

Arm Streamline image 6

The values for each counter (such as CPU Activity, Mali GPU Usage, and Mali Memory Bandwidth) are calculated for the selected range and displayed to the left of the timeline.

In the AR Filter app, we execute three neural network models for each frame:

  • Background Segmentation
  • Face Detection
  • Face Landmarks Recognition

Note how periods of high non-fragment queue activity correspond to these three networks.

Most ML frameworks use compute shaders or OpenCL kernels for GPU inference. On Mali, this kind of GPU workload is scheduled to the non-fragment queue. In combined graphics and ML applications, this is how we can distinguish inference from graphics rendering (which relies on both the fragment and non-fragment queues). Graphics workloads will also show fragment shader activity straight after the vertex (non-fragment) stage.

Using Streamline with Arm NN

In our AR project, we used Arm NN for neural network inference. It provides good performance on Arm-based mobile devices, but we can also benefit from using Arm NN and Streamline together. If the Arm NN runtime is configured for profiling, we can see each neural network execution, and even individual layers, on the timeline.

Arm NN must be built with the -DPROFILING_BACKEND_STREAMLINE=1 flag to add support for this functionality. You also need to enable it in code when initializing the Arm NN runtime:

armnn::IRuntime::CreationOptions options;
options.m_ProfilingOptions.m_EnableProfiling = true;  // turn on the profiling service
options.m_ProfilingOptions.m_TimelineEnabled = true;  // emit timeline events for networks and layers
auto runtime = armnn::IRuntime::Create(options);
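For reference, the build-time flag mentioned above is passed when configuring the Arm NN build. A sketch assuming a CMake-based build (toolchain and backend options for your platform are omitted and must be added):

```shell
# Configure Arm NN with the Streamline profiling backend enabled (sketch;
# platform-specific toolchain and backend options go on the same command line)
cmake .. -DPROFILING_BACKEND_STREAMLINE=1
make -j4
```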

The capture is configured and recorded as usual. We can then select “Arm NN timeline”:

Arm Streamline image 7

In the following capture, you can see how the three models are executed one by one in each frame. The selected range represents the first model (Background Segmentation). Note how it matches the high non-fragment queue usage on the Mali GPU.

Arm Streamline image 8

We can expand each of the models and look at individual layer executions and links between them.

Arm Streamline image 9

As you can see, the first model is the real bottleneck in our pipeline. We optimized the model, reducing the number of filters in the first decoder layer from 512 to 128.

The overall model execution time decreased from 37 ms to 20 ms:

Arm Streamline image 10

And the duration of the layer called “DepthwiseConv2D:0:17” decreased from 71 microseconds to 51 microseconds:

Arm Streamline image 11

Streamline annotations

Another useful feature in Streamline is annotations. You can mark specific parts of your code and see them in the Streamline capture. For example, we can mark the start and end of each neural network execution:

ANNOTATE_COLOR(ANNOTATE_PURPLE, "NN Inference");  // start a colored, named annotation region
runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
ANNOTATE_END();  // close the annotation region

If we expand the main thread entry in the Heat Map, we will be able to see each execution on the timeline and select the corresponding range using callipers:

Arm Streamline image 12

Summary

We have covered the process of configuring Streamline and getting a profiling capture of an Android application.

Profiling ML-based applications using Streamline allows you to:

  • Find bottlenecks in ML workloads
  • See if there are any gaps between neural network executions that can be reduced
  • Check if CPUs or GPUs are being utilized efficiently
  • See how much a certain model optimization has helped to reduce inference time or memory bandwidth

This will help you to find ways to optimize your application and get better performance.

Further reading

Profiling Arm NN Machine Learning applications running on Linux with Streamline
