Following the ISC 2019 announcement that NVIDIA would be bringing its full stack of AI and HPC software to the Arm ecosystem, work is already underway to expand tooling support for developing applications for NVIDIA GPUs on Arm hardware.
For many years, Arm Forge has enabled developers to debug and profile their NVIDIA GPU-enabled applications on x86-64 and ppc64le architectures. Now we're looking to enable the same functionality for AArch64.
Below is a sneak peek of the Arm DDT debugger in Forge, running on an NVIDIA Jetson Nano Developer Kit. The $99 kit features a quad-core Arm Cortex-A57 CPU, accelerated by a 128-core Maxwell GPU with CUDA 10.0, allowing a miniaturized preview of CUDA on Arm before HPC-grade hardware is made available.
We start by launching a simple GPU application that applies a 2D convolution filter to an image.
Note that we enable "Detect invalid accesses" to pick up potential memory errors on the GPU.
Our application starts up and is initially paused as we're presented with the DDT user interface. Here we can see the source code, the files used in the application, and various other views - some of which we'll touch on below.
Now would be a good time to add breakpoints to our application (by clicking in the margin to the left of the code viewer), although it isn't necessary in this case.
Clicking the "play" button in the top-left allows the application to continue, and reveals an issue.
While the code runs to completion outside of the debugger, with DDT's invalid read/write detection enabled we see the following error while executing a CUDA kernel:
Clicking the "Pause" button dismisses the dialog, showing us the GPU code responsible for the error.
Near the top of the screen, we can see the current CUDA thread (as well as the CUDA grid and block size):
This indicates the thread that triggered the error. (It also allows us to select another CUDA block/thread.)
The selected CUDA thread determines the values displayed in other views, such as "Locals", "Current Line(s)", and "Evaluate".
Viewing the "Locals" for this thread allows us to see the computed index for the input, output, and convolution arrays.
To see the location of this and other CUDA threads in our program, we can glance at the parallel stacks view.
Here, the selected CUDA thread corresponds to the selected item. This shows 992 CUDA threads (including the one triggering the error) in conv2d_global at edge.cu:121.
Selecting an item here will also select a CUDA thread associated with that item. (In the case of MPI programs, a corresponding process would also be selected).
We can also see where other CUDA threads are by glancing at the code viewer.
Here the blue highlights show where other CUDA threads were paused.
From the error message and the "Stacks" view, we know that the error was triggered on line 121, while reading our input array:
uchar4 element = in[x+cx + (y+cy)*width];
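For context, the sketch below shows the rough shape such a kernel (plus a minimal host launcher) might take. Only conv2d_global, uchar4, in, x, y, cx, cy and width come from the walkthrough above; the 3x3 filter, the image size, the launch configuration and the rest of the code are illustrative assumptions rather than the application's actual source.

#include <cuda_runtime.h>
#include <cstdio>

// Sketch of a naive global-memory 2D convolution. The filter radius of 1
// (a 3x3 filter) and the missing bounds check are assumptions chosen to
// reproduce the kind of invalid read reported at edge.cu:121.
__global__ void conv2d_global(const uchar4 *in, uchar4 *out,
                              const float *filter, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float3 sum = make_float3(0.f, 0.f, 0.f);
    for (int cy = -1; cy <= 1; ++cy)
    {
        for (int cx = -1; cx <= 1; ++cx)
        {
            // Nothing stops x+cx or y+cy from leaving the image; in the top
            // row (y == 0, cy == -1) the index becomes negative and the read
            // falls outside the allocation.
            uchar4 element = in[x+cx + (y+cy)*width];
            float w = filter[(cy + 1) * 3 + (cx + 1)];
            sum.x += w * element.x;
            sum.y += w * element.y;
            sum.z += w * element.z;
        }
    }
    out[x + y*width] = make_uchar4((unsigned char)sum.x,
                                   (unsigned char)sum.y,
                                   (unsigned char)sum.z, 255);
}

int main()
{
    const int width = 512, height = 512;            // illustrative image size
    const size_t bytes = width * height * sizeof(uchar4);

    uchar4 *d_in = nullptr, *d_out = nullptr;
    float *d_filter = nullptr;
    float h_filter[9];
    for (int i = 0; i < 9; ++i)
        h_filter[i] = 1.f / 9.f;                    // simple blur as a stand-in filter

    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_filter, sizeof(h_filter));
    cudaMemset(d_in, 0, bytes);                     // a real application would copy an image here
    cudaMemcpy(d_filter, h_filter, sizeof(h_filter), cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    conv2d_global<<<grid, block>>>(d_in, d_out, d_filter, width, height);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_filter);
    return 0;
}

With "Detect invalid accesses" enabled, DDT pauses on exactly this kind of unguarded read.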
The next step is determining the computed index into this array.
While we could compute this ourselves from the information in the "Locals" view, a more convenient method is to select the expression, right-click, and "Add to Evaluations".
Viewing the value in the "Evaluate" panel reveals the source of our issue - a negative array index!
Additional expressions allow us to further narrow this down to the y-component of the calculation.
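To make that concrete, here is the index arithmetic with some hypothetical values for a thread in the top row of the image (all of the numbers, including the 512-pixel width, are assumptions for illustration only):

x = 3, y = 0, cx = -1, cy = -1, width = 512

x + cx + (y + cy) * width
  = 3 + (-1) + (0 + (-1)) * 512
  = 2 - 512
  = -510

The x-component only ever drifts by the filter radius, but as soon as y + cy dips below zero it is multiplied by the full image width, which is why the evaluations point at the y-component of the calculation.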
Glancing back at the code, we can see that while applying our convolution filter, we're allowing our x and y components to become negative (when adding cx and cy).
Adding a simple bounds check here is enough to fix our issue, and the memory error disappears!
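Applied to the sketch above, one plausible form of that check simply skips filter taps that fall outside the image (the application's actual fix may differ, for example by clamping coordinates to the image edge instead):

    for (int cy = -1; cy <= 1; ++cy)
    {
        for (int cx = -1; cx <= 1; ++cx)
        {
            int sx = x + cx;
            int sy = y + cy;
            // Skip any filter tap that would read outside the image.
            if (sx < 0 || sx >= width || sy < 0 || sy >= height)
                continue;
            uchar4 element = in[sx + sy*width];
            // ... accumulate the weighted element exactly as before ...
        }
    }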
The above is a quick walkthrough of debugging an NVIDIA GPU application with Arm DDT.
While this example runs on the relatively modest AArch64 NVIDIA Jetson Nano, users can expect a very similar experience across architectures, scaling up to the world's largest supercomputers.
GPU debugging is already available on x86-64 and OpenPOWER platforms, with support for AArch64 coming soon.
Useful links: