Profiling Arm NN Machine Learning applications running on Linux with Streamline

Florent Lebeau
January 20, 2021

The Arm Streamline performance analyzer provides system performance metrics, software tracing, and statistical profiling to help engineers get the most performance from hardware and find important bottlenecks in software.

Streamline v7.4 has added support for the Arm NN machine learning inference engine for CPUs, GPUs, and NPUs. Arm NN bridges the gap between existing frameworks and the underlying hardware. Streamline can automatically profile an Arm NN application and provide useful performance information about inference runtime.

The example discussed here is a Linux application running ML inference on Arm. We have previously trained a neural network on the MNIST data set to recognize handwritten digits. Using Arm NN and Streamline, we want to understand the efficiency of our model and how to optimize it further.

Enable Arm NN profiling in an application

To enable Streamline profiling in an application, Arm NN must be compiled with the Streamline profiling flag. This is done with a define:

-DPROFILING_BACKEND_STREAMLINE=1
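If you configure the Arm NN build manually with CMake rather than using the build script, the define can be passed at configure time. This is a sketch only; the exact source path and other cache options depend on your setup:

```shell
# Hypothetical manual configure step from an Arm NN build directory;
# the build-armnn.sh script used later handles this automatically.
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DPROFILING_BACKEND_STREAMLINE=1
make -j$(nproc)
```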

Additionally, the application itself must enable the profiling options, typically in its source code:

options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
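In context, these options belong on the runtime's creation options before the runtime is created. A minimal sketch, assuming Arm NN's C++ headers and library are available on the build machine:

```cpp
#include <armnn/ArmNN.hpp>

int main()
{
    // Enable external profiling so Gator can collect Arm NN timeline data.
    armnn::IRuntime::CreationOptions options;
    options.m_ProfilingOptions.m_EnableProfiling = true;
    options.m_ProfilingOptions.m_TimelineEnabled = true;

    // Create the runtime with profiling enabled; networks loaded into
    // this runtime will report events that Streamline can display.
    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);
    return 0;
}
```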

Let us see how to enable the profiling and capture a Streamline trace of an application running on N1 SDP and display the ML inference information.

Setup an ML application to profile

First, we are going to build Gator (Arm Streamline's profiling agent), its dependency libraries, and Arm NN with the define specified previously. This can be done easily with one of the scripts from the Arm Tool-Solutions repository on GitHub. Install the required packages and run the script to build Arm NN in ${HOME}/armnn-devenv.

$ sudo apt install cmake unzip curl scons
$ cd ${HOME}
$ git clone https://github.com/ARM-software/Tool-Solutions.git
$ cd Tool-Solutions/ml-tool-examples/build-armnn/
$ bash ./build-armnn.sh

To build the ML application, go to the example directory and compile it:

$ cd ${HOME}/Tool-Solutions/ml-tool-examples/mnist-demo/
$ ARMNN_LIB=/home/ubuntu/armnn-devenv/armnn/build/ \
  ARMNN_INC=/home/ubuntu/armnn-devenv/armnn/include \
  make

We have already added the options to enable Arm NN profiling. We are now ready to profile the application with Gator and connect Streamline:

$ ~/gatord --app ./mnist_tf_convol 0 80

To visualize the Arm NN events in the timeline, the Arm NN_Runtime events need to be enabled in the counter configuration menu so that Gator collects them. Note that the application needs to be run once beforehand for these events to appear in the menu, as Arm NN must first register its ML-related counters with Gator. In addition, we have also enabled the instruction PMU events to understand SIMD performance.

Arm NN events in the counter configuration menu

Analyze Arm NN activity with Streamline

When opening the results in Streamline, extra information appears in the Timeline tab:

  • The Arm NN_Runtime chart displays backend registrations (a backend is an abstraction that maps the layers of a network graph to the hardware responsible for executing those layers: Arm Cortex-A CPUs, Mali GPUs, or Arm ML processors), network loads, and inferences. This chart helps monitor inferences across different models running on multiple backends. In our example, we are running a single inference on the Neoverse N1 CPU.

Arm NN runtime chart

  • In the details panel of the Timeline view, the “Arm NN timeline” can be selected to display information about the NN pipeline.

Arm NN timeline

The “Arm NN timeline” displays the different layers of the neural network, similar to what TensorBoard can display. In our case, we have trained a convolutional neural network (CNN), which is very efficient at recognizing and classifying images for computer vision.

Tensorboard ML pipeline

CNNs have two main parts:

  • A convolution and pooling mechanism that divides the input images into features.
  • A fully connected layer that processes the output of the convolution and pooling mechanism and assigns a classification label.

The Arm NN timeline allows you to understand when the transformations for the different layers are executed, how they are connected, and which backend is used. For example, you can see that the convolution “conv2D” takes a significant amount of time on the CpuRef backend. Clicking on a layer in the timeline helps visualize where it fits in the CNN layer diagram, thanks to the parent/child connections.

Arm NN parent-child connections

To conclude with the performance analysis, the profile indicates that the application does not take advantage of all the CPU resources:

  • the CPU activity is 25% on average: the inference runs on only 1 N1 core while the N1SDP has 4 cores;
  • very few “Advanced SIMD” instructions are executed, despite the large number of instructions issued when performing convolutions. The application is not taking advantage of vectorization.

Maximize Arm NN efficiency

In our example, switching the first argument of the application from 0 to 1 enables a much faster version of the application:

$ ~/gatord --app ./mnist_tf_convol 1 80

This argument sets the optimization mode: 0 uses the portable CpuRef backend and 1 uses the CpuAcc backend, which works on Neoverse, Cortex-A, and some Cortex-R processors.
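In Arm NN terms, such a switch maps to the backend preference list passed when optimizing the network. A hedged sketch of how an application might select CpuAcc, assuming `network` (a parsed `armnn::INetworkPtr`) and `runtime` (an `armnn::IRuntimePtr`) already exist:

```cpp
// Sketch only: 'network' and 'runtime' are assumed to exist already.
// Prefer the accelerated CpuAcc backend, falling back to CpuRef for
// any layer CpuAcc cannot execute.
std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc,
                                           armnn::Compute::CpuRef };
armnn::IOptimizedNetworkPtr optNet =
    armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

// Load the optimized network into the runtime for inference.
armnn::NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optNet));
```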

When opening the result in Streamline, we can directly correlate the PMU metrics with the different steps of the inference.

Arm NN example profile with CpuAcc

We can see that the CpuAcc backend enables Advanced SIMD instructions as well as multithreading to take advantage of all the compute resources of the Neoverse N1 CPU. This results in an important speedup, particularly for the convolution operation.

Summary

This covers the steps to profile and optimize a machine learning application running inference on Linux with Arm NN. To learn more about how Streamline can help your development of ML and Arm NN-based applications:

Try Streamline Performance Analyzer
