Profiling Arm NN Machine Learning applications running on Linux with Streamline

Florent Lebeau
January 20, 2021
4 minute read time.

The Arm Streamline performance analyzer provides system performance metrics, software tracing, and statistical profiling to help engineers get the most performance from hardware and find important bottlenecks in software.

Streamline v7.4 has added support for the Arm NN machine learning inference engine for CPUs, GPUs, and NPUs. Arm NN bridges the gap between existing frameworks and the underlying hardware. Streamline can automatically profile an Arm NN application and provide useful performance information about the inference at runtime.

The example discussed here is a Linux application running ML inference on Arm. We have previously trained a neural network on the MNIST data set to recognize handwritten digits. Using Arm NN and Streamline, we want to understand the efficiency of our model and how to optimize it further.

Enable Arm NN profiling in an application

To enable Streamline profiling in an application, Arm NN must be compiled with the Streamline profiling flag, set with the following define:

-DPROFILING_BACKEND_STREAMLINE=1
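
If you are building Arm NN manually with CMake, the define is passed on the configure line. A hypothetical invocation (the build script used in the next section takes care of this for you):

$ cd ${HOME}/armnn-devenv/armnn/build
$ cmake .. -DPROFILING_BACKEND_STREAMLINE=1
$ make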

Additionally, the application must enable the Arm NN profiling options, typically in its source code:

options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
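
For context, here is a minimal sketch of where these options fit when creating the Arm NN runtime; the option names come from the public armnn/IRuntime.hpp header, though details may vary slightly between Arm NN versions:

#include <armnn/IRuntime.hpp>

// Create the Arm NN runtime with Streamline profiling enabled.
armnn::IRuntime::CreationOptions options;
options.m_ProfilingOptions.m_EnableProfiling = true; // stream profiling data to Gator
options.m_ProfilingOptions.m_TimelineEnabled = true; // emit timeline (layer) events
armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);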

Let us see how to enable profiling, capture a Streamline trace of an application running on the Neoverse N1 SDP, and display the ML inference information.

Setup an ML application to profile

First, we are going to build Gator (Arm Streamline’s profiling agent), the dependency libraries, and Arm NN with the define specified previously. This can be done easily with one of the scripts from the Arm Tool-Solutions repository on GitHub. Install the required packages and run the script to build Arm NN in ${HOME}/armnn-devenv.

$ sudo apt install cmake unzip curl scons
$ cd ${HOME}
$ git clone https://github.com/ARM-software/Tool-Solutions.git
$ cd Tool-Solutions/ml-tool-examples/build-armnn/
$ bash ./build-armnn.sh

To install the ML application, go to the example directory in the same repository and compile it:

$ cd ${HOME}/Tool-Solutions/ml-tool-examples/mnist-demo/
$ ARMNN_LIB=/home/ubuntu/armnn-devenv/armnn/build/ \
  ARMNN_INC=/home/ubuntu/armnn-devenv/armnn/include \
  make

We have already added the options to enable Arm NN profiling. We are now ready to profile the application with Gator and connect Streamline:

$ ~/gatord --app ./mnist_tf_convol 0 80

To visualize the Arm NN events in the timeline, the Arm NN_Runtime events need to be enabled in the counter configuration menu so that Gator collects them. Note that the application must be run once before these events appear in the menu, as Arm NN needs to register its ML-related counters with Gator. In addition, we have also enabled the instruction PMU counters to understand SIMD performance.

Arm NN events in the counter configuration menu

Analyze Arm NN activity with Streamline

When opening the results in Streamline, extra information appears in the Timeline tab:

  • The Arm NN_Runtime chart displays backend registration (a backend is an abstraction that maps the layers of a network graph to the hardware responsible for executing those layers: Arm Cortex-A CPUs, Mali GPUs, or Arm ML processors), network loads, and inferences. This chart helps monitor inferences across different models running on multiple backends. In our example, we are running a single inference on the Neoverse N1 CPU.

Arm NN runtime chart

  • In the details panel of the Timeline view, the “Arm NN timeline” can be selected to display information about the NN pipeline.

Arm NN timeline details

Arm NN timeline

The “Arm NN timeline” displays the different layers of the neural network, similar to what TensorBoard can display. In our case, we have trained a convolutional neural network (CNN), which is very efficient at recognizing and classifying images for computer vision.

TensorBoard ML pipeline

CNNs have two main parts:

  • A convolution and pooling mechanism that extracts features from the input images.
  • A fully connected layer that processes the output of the convolution and pooling mechanism and assigns a classification label.

The Arm NN timeline lets you understand when the transformations for the different layers are executed, how they are connected, and which backend is used. For example, you can see that the convolution “conv2D” takes a significant amount of time on the CpuRef backend. Clicking on a layer in the timeline helps visualize where it fits in the CNN layer diagram, thanks to the parent/child connections.

Arm NN parent-child connections

To conclude with the performance analysis, the profile indicates that the application does not take advantage of all the CPU resources:

  • CPU activity is 25% on average: the inference runs on only one N1 core, while the N1 SDP has four;
  • “Advanced SIMD” instruction counts are very low, despite the large number of instructions issued when performing convolutions: the application is not taking advantage of vectorization.

Maximize Arm NN efficiency

In our example, switching the first argument from 0 to 1 enables a much faster version of the application:

$ ~/gatord --app ./mnist_tf_convol 1 80

This argument sets the optimization mode: 0 uses the portable CpuRef backend, and 1 uses the CpuAcc backend, which works on Neoverse, Cortex-A, and some Cortex-R processors.
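
Under the hood, this kind of switch presumably maps to the list of preferred backends passed to the Arm NN optimizer. A minimal sketch, assuming the network and runtime objects created earlier, with CpuRef kept as a fallback:

#include <armnn/ArmNN.hpp>

// Prefer the NEON-accelerated CpuAcc backend, falling back to CpuRef.
std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc,
                                           armnn::Compute::CpuRef };
armnn::IOptimizedNetworkPtr optNet =
    armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

// Load the optimized network into the runtime before running inference.
armnn::NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optNet));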

When opening the result in Streamline, we can naturally correlate the PMU metrics with the different steps of the inference.

Arm NN example profile with CpuAcc

We can see that the CpuAcc backend enables Advanced SIMD instructions as well as multithreading to take advantage of all the compute resources of the Neoverse N1 CPU. This results in a significant speedup, particularly for the convolution operation.

Summary

This covers the steps to profile and optimize a machine learning application running inference on Linux with Arm NN. To learn more about how Streamline can help you develop ML and Arm NN-based applications:

Try Streamline Performance Analyzer
