Profiling Arm NN Machine Learning applications running on Linux with Streamline

Florent Lebeau
January 20, 2021

The Arm Streamline performance analyzer provides system performance metrics, software tracing, and statistical profiling to help engineers get the most performance from hardware and find important bottlenecks in software.

Streamline v7.4 has added support for the Arm NN machine learning inference engine for CPUs, GPUs, and NPUs. Arm NN bridges the gap between existing frameworks and the underlying hardware. Streamline can automatically profile an Arm NN application and provide useful performance information about inference runtime.

The example discussed here is a Linux application running ML inference on Arm. We have previously trained a neural network on the MNIST data set to recognize handwritten digits. Using Arm NN and Streamline, we want to understand the efficiency of our model and how to optimize it further.

Enable Arm NN profiling in an application

To enable Streamline profiling in an application, Arm NN must be compiled with the Streamline profiling flag. This is done with a define:

-DPROFILING_BACKEND_STREAMLINE=1
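If you configure the Arm NN build manually with CMake rather than using the build script, the define can be passed at configure time. This is a sketch only; the exact source path and other cache options depend on your setup:

```shell
# Hypothetical manual configure step from an Arm NN build directory;
# the build-armnn.sh script used later handles this automatically.
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DPROFILING_BACKEND_STREAMLINE=1
make -j$(nproc)
```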

Additionally, the application itself must enable the profiling options, typically in its source code:

options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
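In context, these options belong on the runtime's creation options before the runtime is created. A minimal sketch, assuming Arm NN's C++ headers and library are available on the build machine:

```cpp
#include <armnn/ArmNN.hpp>

int main()
{
    // Enable external profiling so Gator can collect Arm NN timeline data.
    armnn::IRuntime::CreationOptions options;
    options.m_ProfilingOptions.m_EnableProfiling = true;
    options.m_ProfilingOptions.m_TimelineEnabled = true;

    // Create the runtime with profiling enabled; networks loaded into
    // this runtime will report events that Streamline can display.
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);
    return 0;
}
```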

Let us see how to enable the profiling and capture a Streamline trace of an application running on N1 SDP and display the ML inference information.

Setup an ML application to profile

First, we are going to build Gator (Arm Streamline's profiling agent), its dependency libraries, and Arm NN with the define specified previously. This can be done easily with one of the scripts from the Arm Tool-Solutions repository on GitHub. Install the required packages and run the script to build Arm NN in ${HOME}/armnn-devenv.

$ sudo apt install cmake unzip curl scons
$ cd ${HOME}
$ git clone https://github.com/ARM-software/Tool-Solutions.git
$ cd Tool-Solutions/ml-tool-examples/build-armnn/
$ bash ./build-armnn.sh

To build the ML application, go to the example directory and compile it:

$ cd ${HOME}/Tool-Solutions/ml-tool-examples/mnist-demo/
$ ARMNN_LIB=/home/ubuntu/armnn-devenv/armnn/build/ \
  ARMNN_INC=/home/ubuntu/armnn-devenv/armnn/include \
  make

We have already added the options to enable Arm NN profiling. We are now ready to profile the application with Gator and connect Streamline:

$ ~/gatord --app ./mnist_tf_convol 0 80

To visualize the Arm NN events in the timeline, the Arm NN_Runtime events need to be enabled in the counter configuration menu so that Gator collects them. Note that the application needs to be run once beforehand for these events to appear in the menu, as Arm NN must first register its ML-related counters with Gator. In addition, we have also enabled the instruction PMU events to understand SIMD performance.

Arm NN events in the counter configuration menu

Analyze Arm NN activity with Streamline

When opening the results in Streamline, extra information appears in the Timeline tab:

  • The Arm NN_Runtime chart displays backend registrations (a backend is an abstraction that maps the layers of a network graph to the hardware responsible for executing those layers: Arm Cortex-A CPUs, Mali GPUs, or Arm ML processors), network loads, and inferences. This chart helps monitor inferences across different models running on multiple backends. In our example, we are running a single inference on the Neoverse N1 CPU.

Arm NN runtime chart

  • In the details panel of the Timeline view, the “Arm NN timeline” can be selected to display information about the NN pipeline.

Arm NN timeline

The “Arm NN timeline” displays the different layers of the neural network, similar to what TensorBoard can display. In our case, we have trained a convolutional neural network (CNN), which is very efficient at recognizing and classifying images for computer vision.

Tensorboard ML pipeline

CNNs have two main parts:

  • A convolution and pooling mechanism that divides the input images into features.
  • A fully connected layer that processes the output of the convolution and pooling mechanism and assigns a classification label.

The Arm NN timeline allows you to understand when the transformations for the different layers are executed, how they are connected, and which backend is used. For example, you can see that the convolution “conv2D” takes a significant amount of time on the CpuRef backend. Clicking on a layer in the timeline helps visualize where it fits in the CNN layer diagram, thanks to the parent/child connections.

Arm NN parent-child connections

To conclude with the performance analysis, the profile indicates that the application does not take advantage of all the CPU resources:

  • the CPU activity is 25% on average: the inference runs on only 1 N1 core while the N1SDP has 4 cores;
  • very few “Advanced SIMD” instructions are executed, despite the large number of instructions issued when performing convolutions. The application is not taking advantage of vectorization.

Maximize Arm NN efficiency

In our example, switching the first argument of the application from 0 to 1 enables a much faster version of the application:

$ ~/gatord --app ./mnist_tf_convol 1 80

This argument sets the optimization mode: 0 uses the portable CpuRef backend and 1 uses the CpuAcc backend, which works on Neoverse, Cortex-A, and some Cortex-R processors.
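In Arm NN terms, such a switch maps to the backend preference list passed when optimizing the network. A hedged sketch of how an application might select CpuAcc, assuming `network` (a parsed `armnn::INetworkPtr`) and `runtime` (an `armnn::IRuntimePtr`) already exist:

```cpp
// Sketch only: 'network' and 'runtime' are assumed to exist already.
// Prefer the accelerated CpuAcc backend, falling back to CpuRef for
// any layer CpuAcc cannot execute.
std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc,
                                           armnn::Compute::CpuRef };
armnn::IOptimizedNetworkPtr optNet =
    armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

// Load the optimized network into the runtime for inference.
armnn::NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optNet));
```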

When opening the result in Streamline, we can directly correlate the PMU metrics with the different steps of the inference.

Arm NN example profile with CpuAcc

We can see that the CpuAcc backend enables Advanced SIMD instructions as well as multithreading to take advantage of all the compute resources of the Neoverse N1 CPU. This results in an important speedup, particularly for the convolution operation.

Summary

This covers the steps to profile and optimize a machine learning application running inference on Linux with Arm NN. To learn more about how Streamline can help your development of ML and Arm NN-based applications:

Try Streamline Performance Analyzer
