The Arm Streamline performance analyzer provides system performance metrics, software tracing, and statistical profiling to help engineers get the most performance from hardware and find important bottlenecks in software.
Streamline v7.4 adds support for the Arm NN machine learning inference engine for CPUs, GPUs, and NPUs. Arm NN bridges the gap between existing ML frameworks and the underlying hardware. Streamline can automatically profile an Arm NN application and report useful performance information about the inference runtime.
The example discussed here is a Linux application running ML inference on Arm. We have previously trained a neural network on the MNIST data set to recognize handwritten digits. Using Arm NN and Streamline, we want to understand the efficiency of our model and how to optimize it further.
To enable Streamline profiling in an application, Arm NN must be compiled with the Streamline profiling flag. This is done using a define at build time:
-DPROFILING_BACKEND_STREAMLINE=1
Additionally, the application must enable the Arm NN profiling options, which is typically done in the application source code:
options.m_ProfilingOptions.m_EnableProfiling = true;
options.m_ProfilingOptions.m_TimelineEnabled = true;
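For context, here is a minimal sketch of where these options fit. It assumes the standard Arm NN C++ API headers are available and that `options` is the runtime's `CreationOptions`:

```cpp
// Sketch only: assumes the Arm NN C++ headers and library are installed.
#include <armnn/IRuntime.hpp>

int main()
{
    // Create the Arm NN runtime with Streamline profiling enabled
    armnn::IRuntime::CreationOptions options;
    options.m_ProfilingOptions.m_EnableProfiling = true;  // emit profiling data
    options.m_ProfilingOptions.m_TimelineEnabled = true;  // emit timeline (layer) events

    armnn::IRuntimePtr runtime = armnn::IRuntime::Create(options);

    // ... construct, optimize, load, and run the network as usual ...
    return 0;
}
```

With these two options set, the runtime emits the counter and timeline data that Gator picks up during a capture.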
Let us see how to enable the profiling, capture a Streamline trace of an application running on the N1 SDP (Neoverse N1 System Development Platform), and display the ML inference information.
First, we are going to build Gator (Arm Streamline’s profiling agent), the dependency libraries, and Arm NN with the define specified previously. This can be done easily with one of the scripts from the Arm Tool-Solutions repository on GitHub. Install the required packages and run the script to build Arm NN in ${HOME}/armnn-devenv.
$ sudo apt install cmake unzip curl scons
$ cd ${HOME}
$ git clone https://github.com/ARM-software/Tool-Solutions.git
$ cd Tool-Solutions/ml-tool-examples/build-armnn/
$ bash ./build-armnn.sh
To build the ML application, go to the mnist-demo directory of the same repository and compile the example:
$ cd ${HOME}/Tool-Solutions/ml-tool-examples/mnist-demo/
$ ARMNN_LIB=/home/ubuntu/armnn-devenv/armnn/build/ \
  ARMNN_INC=/home/ubuntu/armnn-devenv/armnn/include \
  make
We have already added the options to enable Arm NN profiling. We are now ready to profile the application with Gator and connect Streamline:
$ ~/gatord --app ./mnist_tf_convol 0 80
To visualize the Arm NN events in the timeline, the ArmNN_Runtime events need to be enabled in the counter configuration menu for Gator to collect them. Note that the application needs to be run once beforehand for these events to appear in the menu, because Arm NN must first register its ML-related counters and pass them off to Gator. In addition, we have also enabled the instruction PMU events to understand SIMD performance.
When opening the results in Streamline, extra information appears in the Timeline tab:
The “Arm NN timeline” can be selected to display information about the NN pipeline.
The “Arm NN timeline” displays the different layers of the neural network, similar to what TensorBoard can display. In our case, we have trained a convolutional neural network (CNN), which is very efficient at recognizing and classifying images in computer vision tasks.
CNNs have two main parts: a feature-extraction part, built from convolution and pooling layers, and a classification part, built from fully connected layers.
The Arm NN timeline helps you understand when the transformations for the different layers are executed, how they are connected, and which backend is used. For example, you can see that the convolution “conv2D” takes a significant amount of time on the CpuRef backend. Clicking on a layer in the timeline shows where it fits in the CNN layer diagram, thanks to the parent/child connections.
To conclude with the performance analysis, the profile indicates that the application does not take advantage of all the CPU resources:
In our example, switching the first argument of the application from 0 to 1 enables a much faster version of the application:
$ ~/gatord --app ./mnist_tf_convol 1 80
This argument sets the optimization mode: 0 selects the portable CpuRef backend and 1 selects the CpuAcc backend, which works on Neoverse, Cortex-A, and some Cortex-R processors.
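Inside the application, this choice maps to the backend preference list passed to the Arm NN optimizer. A hedged sketch, assuming the standard Arm NN C++ API and an already-constructed `network` and `runtime` (both hypothetical names here):

```cpp
// Sketch only: 'network' (armnn::INetworkPtr) and 'runtime' (armnn::IRuntimePtr)
// are assumed to exist already.
// Backend preference order: CpuAcc (Neon-accelerated) first, CpuRef as fallback.
std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc,
                                           armnn::Compute::CpuRef };

armnn::IOptimizedNetworkPtr optimizedNet =
    armnn::Optimize(*network, backends, runtime->GetDeviceSpec());

// Load the optimized network into the runtime before running inference
armnn::NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optimizedNet));
```

Layers that the preferred backend cannot handle fall back to the next backend in the list, which is why keeping CpuRef as a fallback is a common pattern.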
When opening the result in Streamline, we can naturally correlate the PMU metrics with the different steps of the inference.
We can see that the CpuAcc backend enables Advanced SIMD instructions as well as multithreading to take advantage of all the compute resources on the Neoverse N1 CPU. This results in a significant speedup, particularly for the convolution operation.
This covers the steps to profile and optimize a machine learning application running inference on Linux with Arm NN. To learn more about how Streamline can help you develop ML and Arm NN-based applications:
[CTAToken URL = "https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer" target="_blank" text="Try Streamline Performance Analyzer" class ="green"]