Bringing NVIDIA GPU Debugging to AArch64 with Arm DDT

Nick Forrington
July 24, 2019

Following the ISC 2019 announcement that NVIDIA would be bringing its full stack of AI and HPC software to the Arm ecosystem, work is already underway to expand tooling support for developing for NVIDIA GPUs on Arm hardware.

For many years, Arm Forge has enabled developers to debug and profile their NVIDIA GPU-enabled applications on x86-64 and ppc64le architectures. Now we're looking to enable the same functionality for AArch64.

Below is a sneak peek of the Arm DDT debugger in Forge, running on an NVIDIA Jetson Nano Developer Kit. The $99 kit features a quad-core Arm Cortex-A57 CPU, accelerated by a 128-core Maxwell GPU with CUDA 10.0, allowing a miniaturized preview of CUDA on Arm before HPC-grade hardware is made available.

Starting the Application

We start by launching a simple GPU application that applies a 2D convolution filter to an image.

Note that we enable "Detect invalid accesses" to pick up potential memory errors on the GPU.

DDT Run Dialog

Our application starts up and is initially paused as we're presented with the DDT user interface. Here we can see the source code, the files used in the application, and various other views, some of which we'll touch on below.

Now would be a good time to add breakpoints to our application (by clicking in the margin to the left of the code viewer), although in this case this isn't necessary.

DDT Post Launch

Clicking the "play" button in the top-left allows the application to continue, and reveals an issue.

The Bug

While the code runs to completion outside of the debugger, when invalid read/write detection is enabled with DDT, we see the following error when executing a CUDA kernel:

DDT Error Message

Clicking the "Pause" button dismisses the dialog, showing us the GPU code responsible for the error.

DDT Post Error

Exploring with DDT

Near the top of the screen, we can see the current CUDA thread (as well as the CUDA grid and block size):

DDT GPU thread selector

This indicates the thread that triggered the error. (It also allows us to select another CUDA block/thread.)

The selected CUDA thread determines the values displayed in other views, such as "Locals", "Current Line(s)", and "Evaluate".

DDT Locals View

Viewing the "Locals" for this thread allows us to see the computed index for the input, output, and convolution arrays.
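As a rough illustration of where such per-thread indices usually come from, here is a host-side C++ sketch of the common CUDA idiom of deriving a pixel coordinate from the block index, block size, and thread index. The source of edge.cu is not shown in this post, so the names `Dim2`, `global_coord`, and `linear_index`, and the assumption of a 2D launch over a row-major image, are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's blockIdx / blockDim / threadIdx (x and y only).
struct Dim2 {
    int x;
    int y;
};

// Common CUDA idiom, emulated on the host:
//   global = blockIdx * blockDim + threadIdx   (per dimension)
// Assumption: the kernel uses a 2D launch; the post does not confirm this.
Dim2 global_coord(Dim2 block_idx, Dim2 block_dim, Dim2 thread_idx) {
    assert(block_dim.x > 0 && block_dim.y > 0);
    return {block_idx.x * block_dim.x + thread_idx.x,
            block_idx.y * block_dim.y + thread_idx.y};
}

// Row-major linear index of pixel p in an image that is `width` pixels wide.
int linear_index(Dim2 p, int width) {
    assert(width > 0);
    return p.x + p.y * width;
}
```

For example, block (2, 1) with block size (32, 8) and thread (5, 3) maps to pixel (69, 11), which is linear index 7109 in a 640-pixel-wide image. Selecting a different CUDA block or thread in DDT changes these inputs, which is why the values in "Locals" change with the selection.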

To see the location of this and other CUDA threads in our program, we can glance at the parallel stacks view.

DDT Parallel Stack View

Here, the selected CUDA thread corresponds to the selected item. This shows 992 CUDA threads (including the one triggering the error) in conv2d_global at edge.cu:121.

Selecting an item here will also select a CUDA thread associated with that item. (In the case of MPI programs, a corresponding process would also be selected.)

We can also see where other CUDA threads are by glancing at the code viewer.

DDT Source Code Editor

Here the blue highlights show where other CUDA threads were paused.

From the error message and the "Stacks" view, we know that the error was triggered on line 121, while reading our input array:

uchar4 element = in[x+cx + (y+cy)*width];

The next step is determining the computed index into this array.

While we could compute this ourselves from the information in the "Locals" view, a more convenient method is to select the expression, right-click, and "Add to Evaluations".

Eureka!

Viewing the value in the "Evaluate" panel reveals the source of our issue: a negative array index!

DDT Evaluate View

Additional expressions allow us to further narrow this down to the y-component of the calculation.
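To make the arithmetic concrete, here is a minimal host-side C++ sketch of the index expression from line 121, split into the same x and y components one might add to the "Evaluate" view. The helper names and the sample values used afterwards are hypothetical and not taken from edge.cu; the sketch only assumes a row-major image and a filter offset (cx, cy) that can be negative.

```cpp
#include <cassert>

// x-component of the index: the column of the neighbouring pixel.
int x_component(int x, int cx) {
    return x + cx;
}

// y-component of the index: the row of the neighbouring pixel.
int y_component(int y, int cy) {
    return y + cy;
}

// Host-side mirror of the expression on line 121:
//   in[x+cx + (y+cy)*width]
// Hypothetical helper; it returns the index instead of dereferencing it.
long input_index(int x, int y, int cx, int cy, int width) {
    assert(width > 0);
    return static_cast<long>(x_component(x, cx)) +
           static_cast<long>(y_component(y, cy)) * width;
}
```

For a thread on the top row of the image (y = 0) with a filter offset of cy = -1, the y-component is -1. With x = 5, cx = 0, and width = 640, the full index is 5 + (-1) * 640 = -635, which lies before the start of the array.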

Glancing back at the code, we can see that while applying our convolution filter, we're allowing our x and y components to become negative (when adding cx and cy).

Adding a simple bounds check here is enough to fix our issue, and the memory error disappears!
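The post does not show the actual fix, so the following is only a sketch of what such a bounds check might look like, written as host-side C++ so it can be compiled without a GPU. The names `in_bounds` and `read_or_default`, the `height` parameter, the single-channel `unsigned char` pixel type, and the choice to return a fallback value for out-of-range neighbours are all assumptions; a real kernel might instead skip the tap or clamp to the nearest edge.

```cpp
#include <cassert>

// True when the neighbour (x+cx, y+cy) lies inside a width x height image.
bool in_bounds(int x, int y, int cx, int cy, int width, int height) {
    assert(width > 0 && height > 0);
    const int nx = x + cx;
    const int ny = y + cy;
    return nx >= 0 && nx < width && ny >= 0 && ny < height;
}

// Reads the neighbouring pixel only when it is inside the image;
// otherwise returns `fallback` without touching memory.
unsigned char read_or_default(const unsigned char* in, int x, int y,
                              int cx, int cy, int width, int height,
                              unsigned char fallback) {
    if (!in_bounds(x, y, cx, cy, width, height)) {
        return fallback;
    }
    return in[(x + cx) + (y + cy) * width];
}
```

With a check like this in place, a thread on the top row with cy = -1 never forms the negative index, so the invalid read is not attempted.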

Summary

The above shows a quick walkthrough of debugging an NVIDIA GPU with Arm DDT.

While this example runs on the relatively modest AArch64 NVIDIA Jetson Nano, users can expect a very similar experience across architectures, including when scaling up to the world's largest supercomputers.

GPU debugging is already available on x86-64 and OpenPOWER platforms, with support for AArch64 coming soon.

Useful links:

  • Arm Forge help and tutorials
  • Take a free trial of Arm Forge