Today, using one of the early examples from the CUDA toolkit, I’m going to introduce a neat feature of CUDA 5 and CUDA 5.5 - dynamic parallelism - and how to use Arm DDT to debug it.
CUDA brings highly parallel computing to the graphics card (GPU) in your laptop - or in your purpose-built hybrid supercomputer. CUDA's programming language is an extension to C; there is also OpenACC, a higher-level directive-based extension to C and Fortran.
Once you start to develop your own code for the GPU, you can be sure your first bug won't be far away - and with an application running as thousands of concurrent threads, the days of debugging with print statements are gone!
CUDA and OpenACC debugging is a breeze with Arm DDT: step threads and blocks of threads, watch the program progress, run to a breakpoint, evaluate variables. It all sounds like debugging regular code! You can find out more on our CUDA page.
Dynamic parallelism in CUDA lets one kernel invoke another - it was introduced in CUDA 5. It lets you move the control logic for recursion, for example, from the CPU over to the GPU. At a stroke, performance is improved and the code simplified. A double win.
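A minimal sketch of what this looks like in code - the kernel names and launch sizes here are illustrative, not from the article's example:

```cuda
#include <cstdio>

// Child kernel - with dynamic parallelism it is launched from the
// GPU, not from the host.
__global__ void child(int depth)
{
    printf("child at depth %d, thread %d\n", depth, threadIdx.x);
}

// Parent kernel: on devices of compute capability 3.5+, compiled
// with relocatable device code, a kernel may launch another kernel.
__global__ void parent(int depth)
{
    if (threadIdx.x == 0 && depth < 2) {
        child<<<1, 4>>>(depth + 1);   // device-side launch
        cudaDeviceSynchronize();      // wait for the child grid
    }
}

int main()
{
    parent<<<1, 32>>>(0);
    cudaDeviceSynchronize();
    return 0;
}
```

Building this requires separate compilation and the device runtime, e.g. `nvcc -arch=sm_35 -rdc=true -lcudadevrt`.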
Here’s how things look after we’ve recursed inside a GPU kernel: there are 4 active GPU kernels, with a total of 128 threads.
The stacks of multiple CUDA kernels simultaneously scheduled
The threads are at varying stages through the kernel - for example 32 threads are at line 12, and 93 threads are at line 19. You can see the CPU threads (in main, and in a CUDA API call) and GPU threads all in the same screen.
Let’s dive into a real program. This example is called “permute” - but the purpose of the kernel is to just repeatedly double the values in an array using a recursive pattern.
The Kernel: Shared memory, and synchronization - can’t go wrong!
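The screenshot above shows the kernel itself. Since the exact source ships with the CUDA toolkit samples, here is only a hedged reconstruction of the pattern the article describes - names, parameters, and the recursion shape are assumptions, but it reproduces the described behaviour (shared memory, synchronization, and a recursive device-side launch on an offset pointer):

```cuda
// Hypothetical reconstruction - not the actual sample source.
__global__ void double_recurse(int *data, int size, int depth)
{
    __shared__ int tile[256];

    // Stage the value through shared memory, then double it.
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = 2 * tile[threadIdx.x];
    __syncthreads();

    // Recurse: one thread launches the next level on an offset
    // pointer - this is the arithmetic that causes trouble, because
    // the launch below still uses 256 threads even though only
    // half the elements remain valid past the offset.
    if (threadIdx.x == 0 && depth > 0) {
        int half = size / 2;
        double_recurse<<<1, 256>>>(data + half, half, depth - 1);
    }
}
```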
The wrapping C code
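The host-side wrapper was also shown as a screenshot. A sketch of what it does, assuming a kernel with the signature `double_recurse(int *, int, int)` (a hypothetical name - the sample's actual wrapper will differ in detail):

```cuda
#include <cstdio>

int main()
{
    const int n = 256;
    int h_data[n];
    for (int i = 0; i < n; ++i)
        h_data[i] = 21;                      // initial value, as in the article

    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    double_recurse<<<1, n>>>(d_data, n, 2);  // illustrative launch
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("element 42 = %d\n", h_data[42]); // how many doublings from 21?
    cudaFree(d_data);
    return 0;
}
```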
The code prints out element 42 of the array to see how many times it has been doubled - from the initial value of 21.
That’s not good - how did it get to zero?
The first step to debug any CUDA problem is to fire up Arm DDT - and see what happens!
Instantly we see a message: it’s a memory error - a read from an invalid location (an illegal lane address). Arm DDT has automatically selected the errant CUDA thread (thread 254).
That’s unusual - the data (data[threadIdx.x]) isn’t supposed to be zero anywhere.
So what’s wrong with the element at position 254? Examining where the data pointer comes from, it’s the arithmetic on lines 15 and 16 - a bit of pointer math.
The pointer (data) has been advanced in the second recursive call, so the valid length of the array passed in should be smaller - but the kernel launch size has not changed. It’s still 256 threads, which means we try to read 256 elements, some of them beyond the end of the allocated data.
We should never touch memory beyond the length of the array: any kernel thread whose index is beyond the size of the array should simply quit.
A quick fix will do this:
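The fix itself was shown as a screenshot. A sketch of the guard, assuming the kernel receives the valid length as a parameter (called `size` here - the sample's actual parameter name may differ):

```cuda
// At the top of the kernel: quit if this thread's index falls
// beyond the valid length of the (possibly offset) array it was
// handed, so out-of-range threads never touch memory at all.
if (threadIdx.x >= size)
    return;
```

With the guard in place, the recursive launches can keep using 256 threads - the surplus threads exit immediately instead of reading past the allocation.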
Yet another problem solved - and another lesson learned in developing CUDA!
Read more about debugging and profiling CUDA applications.