Optimizing a NVIDIA CUDA ML Inference Application with Arm Forge

David Lecomber
November 1, 2019
6 minute read time.

With NVIDIA’s recent announcement of upcoming support for standalone NVIDIA GPUs on Arm servers, the Arm Forge team is excited to bring its leading developer tools to this platform too.

In advance of the full release, we preview an example of what Arm Forge will do for server developers using Arm architecture CPUs and NVIDIA GPUs. Arm Forge is our cross-platform application profiling and debugging tool for Linux developers. It is widely used by NVIDIA CUDA developers too, particularly those working on large multi-server HPC systems.

By way of example, we are using an NVIDIA Tegra device, the NVIDIA Jetson Nano. Its software environment is sufficiently similar to the upcoming Arm server-enabled release for us to demonstrate tuning and optimizing an ML inference application.

Preparing the NVIDIA Jetson Nano

To reproduce the steps in this blog, you’ll need the NVIDIA Jetson Nano developer kit with the microSD card image installed. These resources and instructions are on the Getting started with Jetson Nano pages.

After the first boot, head to the Jetson Inference examples (https://github.com/dusty-nv/jetson-inference). Follow the instructions for downloading and building from source. When presented with choices in “Downloading Models”, select “all models”; in “Installing PyTorch”, select both versions, as we may need these later.

Image Classification

The example we will explore uses a pre-trained neural network, trained using the ImageNet dataset, to classify a given example image – dog_0.jpg.

$ cd ~/jetson-inference/build/aarch64/bin
$ ./imagenet-console --network=resnet-50 dog_0.jpg
..
..
imagenet-console:  'dog_0.jpg' -> 34.40443% class #199 (Scotch terrier, Scottish terrier, Scottie)

The first time it runs, it downloads the pre-trained network (around 100 MB). Subsequent runs load the locally cached copy and take only a few seconds. Here we see it has identified a dog, and the breed as most likely a Scottish Terrier.

Classifying a photo collection

Classifying one photograph was interesting, but my objective is to categorize and tag a collection of 3,500 vacation photos. I could use a batch script, with each image taking about 5 seconds. That’s 5 hours.

We can see from the text output of the program that there is a lot of time at the start loading the trained network. 

We make a small modification to the application so that the trained network loads once and is used for the whole list of images.

~/jetson-inference/build/aarch64/bin/imagenet-console --network=resnet-50 *.jpg
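The shape of the change can be sketched as follows. This is a minimal illustration, not the actual jetson-inference code: the `Classifier` type here is a hypothetical stand-in for the `imageNet` class, with the expensive load hoisted out of the per-image loop.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for imageNet: loading is expensive,
// per-image classification is comparatively cheap.
struct Classifier {
    int loads = 0;
    int classified = 0;
    void load() { ++loads; }                       // models the ~5-6 s network load
    int classify(const std::string&) { ++classified; return 0; }
};

// Load the network once, then classify every image in the list.
// A batch script would have paid the load cost once per image instead.
int classifyAll(Classifier& net, const std::vector<std::string>& images) {
    net.load();
    for (const auto& img : images)
        net.classify(img);
    return net.classified;
}
```

The point is simply that the load cost is now amortized over the whole photo collection rather than paid per invocation.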

Running through 12 photographs takes around 45 seconds. Faster than the 60 seconds a batch script would have delivered, but is it good? Let’s take a look.

Profiling the image classification program

We use Arm Forge’s profiling tool (MAP) to explore how the application is spending its time. MAP is an application performance profiler supporting C, C++, Fortran, Python, and NVIDIA CUDA. Evaluation licences and package downloads are available from https://www.arm.com/products/development-tools/server-and-hpc/forge.

Once installed, let’s start MAP and get profiling:

$ ~/arm/forge/bin/map ./jetson-inference/build/aarch64/bin/imagenet-console --network resnet-50 ~/images/*.jpg

MAP shows the profile for the application: the timeline identifies CPU time (green) and time waiting for the GPU (purple).

Arm MAP profiling an NVIDIA CUDA application

At the bottom of the screen is the stack view: a top-down view of code executing over time. Our initial run time is 42 seconds: about 5-6 seconds (12%) to load the network (imageNet::create), then 36 seconds classifying (classifyImage) the 12 images. That works out at 3 seconds per image, or 3 hours for all those holiday snaps.

Most of the timeline (top) is green. In fact, about 90% of the time. 

This is significant. That the timeline is green tells us that the CPUs don’t spend much time waiting for the GPU. This immediately tells us that there is no point optimizing the GPU time: the CPUs are not being held up by it. To put it another way, the CPU is the performance bottleneck for this application, not the GPU.

Over 83% of the time is spent classifying images. Let’s dig deeper and explore inside classifyImage.

Arm MAP showing stacks over time

The application spends most of its time loading images (61.3% of the time) rather than on actual classification. This is an obvious candidate for optimization.

A first easy optimization

The image reader used in the Jetson inference suite is STBI (stb_image), a single-file library for image reading. Perhaps we could do better with other, more specialized JPEG tools?

Some brief internet research later, the consensus is that a recent GCC at the “-O3” optimization level makes STBI almost as quick as the fastest open implementations. Let’s do that and see if it helps.

$ cmake -DCMAKE_CXX_FLAGS="-g -O3" ../
$ make

It does. Run time is halved to 20 seconds: 6 seconds to load, then just over a second per image.

We’re down to 1 hour for the photos now.

So, is this sufficient and did a simple compile flag do everything? Let's profile again with MAP.

After compiler optimization of image loading

The timeline shows that, although GPU usage is proportionately higher now that the CPU loads images faster, we still spend under 20% of the time waiting on the GPU. Time is still dominated by loading images on the CPU cores, so optimization still needs to target the CPU portion to make a difference.

Using threads for parallelization

The Jetson Nano has a 128-core Maxwell GPU and 4 Arm Cortex-A57 cores, but the application is single-threaded. That means we’re only using one CPU core.

If we use all the CPU cores to feed images to the GPU, we should get more done faster.

We use C++11’s thread class and create a pool of threads to handle images (one thread per core). Each thread is responsible for an image: allocating its memory, loading the image, and invoking the trained network on the GPU with it. As we have no promises about the thread safety of the underlying imageNet class, we use a mutex around some key GPU sections to play safe and prevent any unfortunate bugs. We know from inspection that the JPEG reader is safe.
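The threading scheme just described can be sketched like this. The names are illustrative rather than the real jetson-inference API: `classifyOnGpu` is a hypothetical stand-in for the image-load-plus-classify work, and the mutex serializes only the GPU call, leaving the CPU-heavy image decode to run in parallel.

```cpp
#include <atomic>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex gpuMutex;  // serializes access to the (assumed non-thread-safe) network

int classifyOnGpu(const std::string&) { return 0; }  // stand-in for the GPU inference call

void classifyImages(const std::vector<std::string>& images,
                    std::vector<int>& results,
                    unsigned numThreads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t) {
        pool.emplace_back([&] {
            // Each worker claims the next unprocessed image.
            for (std::size_t i = next++; i < images.size(); i = next++) {
                // Image decode (the expensive CPU part) would run here, in parallel.
                std::lock_guard<std::mutex> lock(gpuMutex);
                results[i] = classifyOnGpu(images[i]);
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

With one thread per Cortex-A57 core, four image decodes proceed concurrently while the GPU section remains protected.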

This time we are down to 12 seconds. 

We take another look with MAP to show us how our four cores were used, by selecting the “View threads” display option in the GUI.

With this viewing option on, the “Application activity” bar has a height corresponding to the four physical cores. It now shows the initial network-loading phase (one core with an active thread in green, and one with a daemon thread in grey), followed by a very busy section in which all four CPU cores are fully active, with a small amount of GPU waiting time (purple). The lighter green parts show worker threads running over the (up to) four cores; the darker green is when the master thread (mostly waiting in thread join) is active.

Profiling the multithreaded application

A final top-down look at the code shows that loading images is still significant, along with some time in thread synchronization (mutex locks), so we could probably still do better.

But at 6 seconds to load the network, and then just 6 seconds to handle 12 photographs, or 0.5 seconds each, we’ve done enough.

With 3,500 photos to go, it’ll be finished in 30 minutes.

For further information about application profiling with Arm Forge, visit Arm Developer.

Download and evaluate Arm Forge
