With NVIDIA’s recent announcement of upcoming support for standalone NVIDIA GPUs on Arm servers, the Arm Forge team is excited to bring its leading developer tools to this platform too.
In advance of the full release, we preview an example of what Arm Forge will do for server developers using Arm architecture CPUs and NVIDIA GPUs. Arm Forge is our cross-platform application profiling and debugging tool for Linux developers. It is widely used by NVIDIA CUDA developers too, particularly those working on large multi-server HPC systems.
By way of example, we are using an NVIDIA Tegra device, the NVIDIA Jetson Nano. Its software environment is sufficiently similar to the upcoming Arm server-enabled release to let us demonstrate tuning and optimizing an ML inference application.
To reproduce the steps in this blog, you’ll need the NVIDIA Jetson Nano developer kit with the MicroSD card image installed. These resources and instructions are on the Getting started with Jetson Nano pages.
After the first boot, head to the Jetson Inference examples (https://github.com/dusty-nv/jetson-inference) and follow the instructions for downloading and building from source. When presented with choices in “Downloading Models”, select “all models”; in “Installing PyTorch”, select both options, as we may need these later.
The example we will explore uses a pre-trained neural network, trained using the ImageNet dataset, to classify a given example image – dog_0.jpg.
$ cd ~/jetson-inference/build/aarch64/bin
$ ./imagenet-console --network=resnet-50 dog_0.jpg
imagenet-console: 'dog_0.jpg' -> 34.40443% class #199 (Scotch terrier, Scottish terrier, Scottie)
The first time it runs, it downloads the pre-trained network (around 100MB), but subsequent runs load the locally cached copy and take only a few seconds. Here we see it has identified a dog, and the breed as most likely a Scottish Terrier.
Classifying one photograph was interesting but my objective is to categorize and tag a collection of 3,500 vacation photos. I could use a batch script, with each image taking about 5 seconds. That’s 5 hours.
We can see from the text output of the program that there is a lot of time at the start loading the trained network.
We make a small modification to the application so that the trained network is loaded once and reused for the whole list of images.
~/jetson-inference/build/aarch64/bin/imagenet-console --network=resnet-50 *.jpg
Running through 12 photographs takes around 45 seconds. That is faster than the 60 seconds a batch script would have delivered, but is it good? Let’s take a look.
We use Arm Forge’s profiling tool (MAP) to explore how the application is spending its time. MAP is an application performance profiler supporting C, C++, Fortran, Python, and NVIDIA CUDA. Evaluation licences and package downloads are available from https://www.arm.com/products/development-tools/server-and-hpc/forge.
Once installed, let’s start MAP and get profiling:
$ ~/arm/forge/bin/map ./jetson-inference/build/aarch64/bin/imagenet-console --network resnet-50 ~/images/*.jpg
MAP shows the profile for the application: the timeline identifies CPU time (green) and time waiting for the GPU (purple).
At the bottom of the screen is the stack view: a top-down view of code executing over time. Our initial run time is 42 seconds: about 5-6 seconds (12%) to load the network (imageNet::create), then 36 seconds classifying (classifyImage) the 12 images. That works out at 3 seconds per image, or 3 hours for all those holiday snaps.
Most of the timeline (top) is green: in fact, about 90% of the time. This is significant. A green timeline tells us that the CPUs don’t spend much time waiting for the GPU, so there is no point optimizing the GPU time: the CPUs are not being held up by it. To put it another way, the CPU is the performance bottleneck for this application, not the GPU.
With over 83% of the time spent classifying images, let’s dig deeper and explore inside classifyImage. We are spending most of the time loading each image (61.3% of the total) rather than on actual classification. This is an obvious candidate for optimization.
The image reader used in the Jetson inference suite is STBI, a single-file library for image reading. With other, more specialized JPEG tools, perhaps we could do better?
Some brief internet research later, the consensus is that a recent GCC with “-O3” level of optimization will make STBI almost as quick as the fastest open implementations. Let’s do that and see if it helps.
$ cmake -DCMAKE_CXX_FLAGS="-g -O4" ../
It does. Run time is halved to 20 seconds: 6 seconds to load, and just over a second per image.
We’re down to 1 hour for the photos now.
So, is this sufficient and did a simple compile flag do everything? Let's profile again with MAP.
The timeline shows that although GPU usage is proportionately higher, as the CPU is now loading images faster, we are still spending under 20% of the time waiting on the GPU. Time is still dominated by loading images on the CPU cores, so optimization still needs to target the CPU portion to make a difference.
The Jetson Nano has a 128-core Maxwell GPU and four Arm Cortex-A57 cores, but the application is single-threaded. That means we’re only using one CPU core.
If we use all the CPU cores to feed images to the GPU, we should get more done faster.
We use C++11’s thread class and create a pool of threads to handle images, one thread per core. Each thread is responsible for an image: allocating its memory, loading the image, and invoking the trained network on the GPU with it. As we have no promises about the thread safety of the underlying imageNet class, we use a mutex around the key GPU sections to play safe and prevent any unfortunate bugs. We know from inspection that the JPEG reader is safe.
This time we are down to 12 seconds.
We take another look with MAP to see how our four cores were used, by selecting the “View threads” display option in the GUI.
With this viewing option on, the “Application activity” bar has a height corresponding to the four physical cores. It now shows the initial network-loading phase (one core with an active thread in green, and one with a daemon thread in grey), followed by a very busy section in which all four CPU cores are fully active, with a small amount of GPU waiting time (purple). The lighter green parts show where worker threads are running over the (up to) four cores; the darker green is when the master thread (mostly waiting in thread join) is active.
A final top-down look at the code shows that loading images is still significant, along with some time in thread synchronization (mutex locks), so we could probably still do better.
But at 6 seconds to load the network, and then just 6 seconds to handle 12 photographs, or 0.5 seconds each, we’ve done enough.
With 3,500 photos to go, it’ll be finished in 30 minutes.
For further information about application profiling with Arm Forge, visit Arm Developer.
Download and evaluate Arm Forge