Bringing NVIDIA GPU Debugging to AArch64 with Arm DDT

Nick Forrington
July 24, 2019

Following the ISC 2019 announcement that NVIDIA would be bringing its full stack of AI and HPC software to the Arm ecosystem, work is already underway to expand tooling support for developing for NVIDIA GPUs on Arm hardware.

For many years, Arm Forge has enabled developers to debug and profile their NVIDIA GPU-enabled applications on x86-64 and ppc64le architectures. Now we're looking to enable the same functionality for AArch64.

Below is a sneak peek of the Arm DDT debugger in Forge, running on an NVIDIA Jetson Nano Developer Kit. The $99 kit features a quad-core Arm Cortex-A57 CPU, accelerated by a 128-core Maxwell GPU with CUDA 10.0, allowing a miniaturized preview of CUDA on Arm before HPC-grade hardware is made available.

Starting the Application

We start by launching a simple GPU application that applies a 2D convolution filter to an image.

Note that we enable "Detect invalid accesses" to pick up potential memory errors on the GPU.

DDT Run Dialog

Our application starts up and is initially paused while we're presented with the DDT user interface. Here we can see the source code, the files used in the application, and various other views, some of which we'll touch on below.

Now would be a good time to add breakpoints to our application (by clicking in the margin to the left of the code viewer), although in this case this isn't necessary.

DDT Post Launch

Clicking the "play" button in the top-left allows the application to continue, and reveals an issue.

The Bug

While the code runs to completion outside of the debugger, when invalid read/write detection is enabled with DDT, we see the following error when executing a CUDA kernel:

DDT Error Message

Clicking the "Pause" button dismisses the dialog, showing us the GPU code responsible for the error.

DDT Post Error

Exploring with DDT

Near the top of the screen, we can see the current CUDA thread (as well as the CUDA grid and block size):

DDT GPU thread selector

This indicates the thread that triggered the error. (It also allows us to select another CUDA block/thread.)
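To make the link between the selected block/thread and a pixel concrete, here is a minimal host-side sketch of the conventional 2D CUDA mapping. It is an assumption: the post does not show how edge.cu derives x and y, and `Index3`, `PixelCoord`, and `pixel_for_thread` are hypothetical stand-ins for the built-in `blockIdx`, `blockDim`, and `threadIdx` values a kernel would read.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's built-in 3-component index types.
struct Index3 { unsigned x, y, z; };

// Pixel coordinate handled by one thread.
struct PixelCoord { int x, y; };

// Conventional 2D mapping (an assumption, not taken from edge.cu):
// global coordinate = block index * block size + thread index within the block.
inline PixelCoord pixel_for_thread(Index3 block_idx, Index3 block_dim, Index3 thread_idx) {
    // A thread index is always smaller than the block size in that dimension.
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y);
    return PixelCoord{
        static_cast<int>(block_idx.x * block_dim.x + thread_idx.x),
        static_cast<int>(block_idx.y * block_dim.y + thread_idx.y)};
}
```

With a hypothetical 32x8 block, thread (5, 3) of block (2, 1) would map to pixel (69, 11) under this scheme, so selecting a different block or thread in DDT changes which pixel the other views describe.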

The selected CUDA thread determines the values displayed in other views, such as "Locals", "Current Line(s)", and "Evaluate".

DDT Locals View

Viewing the "Locals" for this thread allows us to see the computed index for the input, output, and convolution arrays.

To see the location of this, and other CUDA threads in our program, we can glance at the parallel stacks view.

DDT Parallel Stack View

Here, the selected CUDA thread corresponds to the selected item. This shows 992 CUDA threads (including the one triggering the error) in conv2d_global at edge.cu:121.

Selecting an item here will also select a CUDA thread associated with that item. (In the case of MPI programs, a corresponding process would also be selected.)

We can also see where other CUDA threads are by glancing at the code viewer.

DDT Source Code Editor

Here the blue highlights show where other CUDA threads were paused.

From the error message and the "Stacks" view, we know that the error was triggered on line 121, while reading our input array:

uchar4 element = in[x+cx + (y+cy)*width];

The next step is determining the computed index into this array.

While we could compute this ourselves from the information in the "Locals" view, a more convenient method is to select the expression, right-click, and "Add to Evaluations".

Eureka!

Viewing the value in the "Evaluate" panel reveals the source of our issue - a negative array index!

DDT Evaluate View

Additional expressions allow us to further narrow this down to the y-component of the calculation.
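The arithmetic behind that negative index can be reproduced on the host. The sketch below mirrors the index expression from line 121; the function name `flat_index` and the sample values in the comments (a 640-pixel-wide image, a thread on the top row) are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Flat row-major index of the tap at (x + cx, y + cy), written the same way
// as the index expression on line 121 of the walkthrough.
inline int flat_index(int x, int cx, int y, int cy, int width) {
    assert(width > 0);
    return x + cx + (y + cy) * width;
}

// Hypothetical values for a 640-pixel-wide image:
//   flat_index(5, 0, 0, -1, 640)  ->  5 + (0 - 1) * 640 = -635  (negative index)
//   flat_index(0, -1, 1, 0, 640)  ->  -1 + 1 * 640      =  639  (valid index, wrong pixel)
```

The second case is worth noting: a negative x-component alone does not necessarily produce a negative index, because it can land on the last pixel of the previous row instead. That kind of read stays inside the array, so an invalid-access check would not flag it, which is why checking x and y separately is safer than checking the final index.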

Glancing back at the code, we can see that while applying our convolution filter, we're allowing our x and y components to become negative (when adding cx and cy).

Adding a simple bounds check here is enough to fix our issue, and the memory error disappears!
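As a rough idea of what such a check might look like, here is a host-side sketch of the work one thread does for its pixel. It is not the code from edge.cu: it assumes a square filter looped over with cx and cy, uses `float` pixels instead of `uchar4`, and simply skips taps that fall outside the image (clamping to the nearest edge pixel would be another option). The names `in_bounds` and `convolve_pixel` are hypothetical; inside a kernel the same helpers would need to be marked `__device__`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when (px, py) lies inside a width x height image.
inline bool in_bounds(int px, int py, int width, int height) {
    return px >= 0 && px < width && py >= 0 && py < height;
}

// Hypothetical host-side stand-in for one kernel thread: filters pixel (x, y)
// with a (2 * radius + 1)^2 filter, skipping taps that fall outside the image.
inline float convolve_pixel(const std::vector<float>& in,
                            const std::vector<float>& filter,
                            int x, int y, int width, int height, int radius) {
    const int side = 2 * radius + 1;
    assert(in.size() == static_cast<std::size_t>(width) * height);
    assert(filter.size() == static_cast<std::size_t>(side) * side);

    float sum = 0.0f;
    for (int cy = -radius; cy <= radius; ++cy) {
        for (int cx = -radius; cx <= radius; ++cx) {
            // The bounds check: never form an index from an out-of-image tap.
            if (!in_bounds(x + cx, y + cy, width, height)) continue;
            const float weight = filter[(cx + radius) + (cy + radius) * side];
            sum += weight * in[(x + cx) + (y + cy) * width];
        }
    }
    return sum;
}
```

Checking the x and y components separately, rather than the final flat index, also rejects taps that would otherwise wrap around to a neighbouring row.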

Summary

The above shows a quick walkthrough of debugging an NVIDIA GPU with Arm DDT.

While this example runs on the relatively modest AArch64 NVIDIA Jetson Nano, users can expect a very similar experience across architectures, scaling up to the world's largest supercomputers.

GPU debugging is already available on x86-64 and OpenPOWER platforms, with support for AArch64 coming soon.

Useful links:

  • Arm Forge help and tutorials
  • Take a free trial of Arm Forge