Following the ISC 2019 announcement that NVIDIA would be bringing its full stack of AI and HPC software to the Arm ecosystem, work is already underway to expand tooling support for developing applications for NVIDIA GPUs on Arm hardware.
For many years, Arm Forge has enabled developers to debug and profile their NVIDIA GPU-enabled applications on x86-64 and ppc64le architectures. Now we're looking to enable the same functionality for AArch64.
Below is a sneak peek of the Arm DDT debugger in Forge, running on an NVIDIA Jetson Nano Developer Kit. The $99 kit features a quad-core Arm Cortex-A57 CPU, accelerated by a 128-core Maxwell GPU with CUDA 10.0, allowing a miniaturized preview of CUDA on Arm before HPC-grade hardware is made available.
We start by launching a simple GPU application that applies a 2D convolution filter to an image.
Note that we enable "Detect invalid accesses" to pick up potential memory errors on the GPU.
Our application starts up and is initially paused as we're presented with the DDT user interface. Here we can see the source code, the files used in the application, and various other views - some of which we'll touch on below.
Now would be a good time to add breakpoints to our application (by clicking in the margin to the left of the code viewer), although it isn't necessary in this case.
Clicking the "play" button in the top-left allows the application to continue, and reveals an issue.
While the code runs to completion outside of the debugger, with DDT's invalid read/write detection enabled we see the following error while executing a CUDA kernel:
Clicking the "Pause" button dismisses the dialog, showing us the GPU code responsible for the error.
Near the top of the screen, we can see the current CUDA thread (as well as the CUDA grid and block size):
This indicates the thread that triggered the error. (It also allows us to select another CUDA block/thread.)
The selected CUDA thread determines the values displayed in other views, such as "Locals", "Current Line(s)", and "Evaluate".
Viewing the "Locals" for this thread allows us to see the computed index for the input, output, and convolution arrays.
To see the location of this and other CUDA threads in our program, we can glance at the parallel stacks view.
Here, the selected CUDA thread corresponds to the selected item. This shows 992 CUDA threads (including the one triggering the error) in conv2d_global at edge.cu:121.
Selecting an item here will also select a CUDA thread associated with that item. (In the case of MPI programs, a corresponding process would also be selected).
We can also see where other CUDA threads are by glancing at the code viewer.
Here the blue highlights show where other CUDA threads were paused.
From the error message and the "Stacks" view, we know that the error was triggered on line 121, while reading our input array:
uchar4 element = in[x+cx + (y+cy)*width];
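For context, the sketch below shows the rough shape such a kernel (plus a minimal host launcher) might take. Only conv2d_global, uchar4, in, x, y, cx, cy and width come from the walkthrough above; the 3x3 filter, the image size, the launch configuration and the rest of the code are illustrative assumptions rather than the application's actual source.

#include <cuda_runtime.h>
#include <cstdio>

// Sketch of a naive global-memory 2D convolution. The filter radius of 1
// (a 3x3 filter) and the missing bounds check are assumptions chosen to
// reproduce the kind of invalid read reported at edge.cu:121.
__global__ void conv2d_global(const uchar4 *in, uchar4 *out,
                              const float *filter, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    float3 sum = make_float3(0.f, 0.f, 0.f);
    for (int cy = -1; cy <= 1; ++cy)
    {
        for (int cx = -1; cx <= 1; ++cx)
        {
            // Nothing stops x+cx or y+cy from leaving the image; in the top
            // row (y == 0, cy == -1) the index becomes negative and the read
            // falls outside the allocation.
            uchar4 element = in[x+cx + (y+cy)*width];
            float w = filter[(cy + 1) * 3 + (cx + 1)];
            sum.x += w * element.x;
            sum.y += w * element.y;
            sum.z += w * element.z;
        }
    }
    out[x + y*width] = make_uchar4((unsigned char)sum.x,
                                   (unsigned char)sum.y,
                                   (unsigned char)sum.z, 255);
}

int main()
{
    const int width = 512, height = 512;            // illustrative image size
    const size_t bytes = width * height * sizeof(uchar4);

    uchar4 *d_in = nullptr, *d_out = nullptr;
    float *d_filter = nullptr;
    float h_filter[9];
    for (int i = 0; i < 9; ++i)
        h_filter[i] = 1.f / 9.f;                    // simple blur as a stand-in filter

    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_filter, sizeof(h_filter));
    cudaMemset(d_in, 0, bytes);                     // a real application would copy an image here
    cudaMemcpy(d_filter, h_filter, sizeof(h_filter), cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    conv2d_global<<<grid, block>>>(d_in, d_out, d_filter, width, height);
    printf("kernel status: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_filter);
    return 0;
}

With "Detect invalid accesses" enabled, DDT pauses on exactly this kind of unguarded read.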
The next step is determining the computed index into this array.
While we could compute this ourselves from the information in the "Locals" view, a more convenient method is to select the expression, right-click, and "Add to Evaluations".
Viewing the value in the "Evaluate" panel reveals the source of our issue - a negative array index!
Additional expressions allow us to further narrow this down to the y-component of the calculation.
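To make that concrete, here is the index arithmetic with some hypothetical values for a thread in the top row of the image (all of the numbers, including the 512-pixel width, are assumptions for illustration only):

x = 3, y = 0, cx = -1, cy = -1, width = 512

x + cx + (y + cy) * width
  = 3 + (-1) + (0 + (-1)) * 512
  = 2 - 512
  = -510

The x-component only ever drifts by the filter radius, but as soon as y + cy dips below zero it is multiplied by the full image width, which is why the evaluations point at the y-component of the calculation.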
Glancing back at the code, we can see that while applying our convolution filter, we're allowing our x and y components to become negative (when adding cx and cy).
Adding a simple bounds check here is enough to fix our issue, and the memory error disappears!
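Applied to the sketch above, one plausible form of that check simply skips filter taps that fall outside the image (the application's actual fix may differ, for example by clamping coordinates to the image edge instead):

    for (int cy = -1; cy <= 1; ++cy)
    {
        for (int cx = -1; cx <= 1; ++cx)
        {
            int sx = x + cx;
            int sy = y + cy;
            // Skip any filter tap that would read outside the image.
            if (sx < 0 || sx >= width || sy < 0 || sy >= height)
                continue;
            uchar4 element = in[sx + sy*width];
            // ... accumulate the weighted element exactly as before ...
        }
    }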
The above is a quick walkthrough of debugging an NVIDIA GPU application with Arm DDT.
While this example runs on the relatively modest AArch64 NVIDIA Jetson Nano, users can expect a very similar experience across architectures, scaling up to the world's largest supercomputers.
GPU debugging is already available on x86-64 and OpenPOWER platforms, with support for AArch64 coming soon.
Useful links: