
Debugging CUDA Dynamic Parallelism

David Lecomber
September 13, 2013
3 minute read time.

Today, using one of the early examples from the CUDA toolkit, I’m going to introduce a neat feature of CUDA 5 and CUDA 5.5 - dynamic parallelism - and how to use Arm DDT to debug it.

What is CUDA?

CUDA brings highly parallel computing to the graphics card (GPU) of your laptop - or of your purpose-built hybrid supercomputer. CUDA’s programming language is an extension to C; there is also OpenACC, a higher-level, directive-based extension to C and Fortran.
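
To give a flavour, here is a minimal, illustrative CUDA kernel - a sketch of my own, not code from the example we debug below - in which each GPU thread doubles one element of an array:

    // Illustrative only: one GPU thread per array element.
    __global__ void double_elements(int n, int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)            // guard threads past the end of the array
            data[i] *= 2;
    }

    // Launched from the host with, for example:
    //   double_elements<<<(n + 255) / 256, 256>>>(n, data);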

Once you start to develop your own code for the GPU, you can be sure your first bug won’t be far away - and with an application running thousands of concurrent threads, the days of debugging with print statements are gone!

CUDA and OpenACC debugging is a breeze with Arm DDT: step threads and blocks of threads, watch the program progress, run to a breakpoint, evaluate variables - it feels just like debugging regular code! You can find out more on our CUDA page.

CUDA Dynamic Parallelism

Dynamic parallelism, new in CUDA 5, lets one kernel invoke another. That means control logic - for recursion, for example - can move from the CPU over to the GPU. At a stroke, performance is improved and the code is simplified. A double win.
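
As a minimal sketch of the idea (the kernel names here are illustrative; the code assumes a compute capability 3.5 device, built with nvcc -arch=sm_35 -rdc=true):

    __global__ void child(int *out)
    {
        out[threadIdx.x] += 1;
    }

    __global__ void parent(int *out)
    {
        // One parent thread launches a whole child grid from the GPU -
        // no round trip through the CPU needed.
        if (threadIdx.x == 0)
            child<<<1, 32>>>(out);

        // In CUDA 5, a device-side cudaDeviceSynchronize() waits for
        // child grids launched by this block to complete.
        cudaDeviceSynchronize();
    }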

Here’s how things look after we’ve recursed inside a GPU kernel: there are four active GPU kernels, with a total of 128 threads.

[Screenshot: the stacks of multiple CUDA kernels simultaneously scheduled]

The threads are at varying stages through the kernel - for example, 32 threads are at line 12 and 93 threads are at line 19. You can see the CPU threads (in main, and in a CUDA API call) and the GPU threads all on the same screen.

Let’s dive into a real program. This example is called “permute” - but despite the name, the kernel simply doubles the values in an array over and over, using a recursive pattern.

[Screenshot: the kernel - shared memory, and synchronization - can’t go wrong!]
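
The kernel itself appeared as a screenshot in the original post. As a hedged reconstruction from the description (the names, shared-memory size, and line numbers are my assumptions), it is along these lines:

    __global__ void permute(int n, int *data)
    {
        // Stage the data into shared memory, double it, and write it back.
        extern __shared__ int smem[];
        smem[threadIdx.x] = data[threadIdx.x];
        __syncthreads();

        smem[threadIdx.x] *= 2;
        __syncthreads();

        data[threadIdx.x] = smem[threadIdx.x];

        // Recurse: thread 0 launches a child kernel on each half of the array.
        if (threadIdx.x == 0 && n > 1) {
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data);
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data + n / 2);
        }
    }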

[Screenshot: the wrapping C code]

The code prints out element 42 of the array to see how many times it has been doubled - from the initial value of 21.
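
The wrapper, too, was shown as a screenshot; here is a sketch of what it plausibly looks like (the 256-element size matches the launch size discussed later, and the names are assumptions):

    #include <stdio.h>

    int main(void)
    {
        const int N = 256;
        int host_data[N];
        for (int i = 0; i < N; ++i)
            host_data[i] = 21;                        // initial value

        int *data;
        cudaMalloc(&data, N * sizeof(int));
        cudaMemcpy(data, host_data, N * sizeof(int), cudaMemcpyHostToDevice);

        // One top-level launch; all further recursion happens on the device.
        permute<<<1, N, N * sizeof(int)>>>(N, data);
        cudaDeviceSynchronize();

        cudaMemcpy(host_data, data, N * sizeof(int), cudaMemcpyDeviceToHost);
        printf("data[42] = %d\n", host_data[42]);     // how many doublings?

        cudaFree(data);
        return 0;
    }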

[Screenshot: session output]

That’s not good - how did it get to zero?

Debugging Dynamic Parallelism

The first step in debugging any CUDA problem is to fire up Arm DDT - and see what happens!

Instantly we see a message: it’s a memory error - a read from a location that is not valid (an illegal lane address). Arm DDT has selected the errant CUDA thread (thread number 254).

That’s unusual - the data (data[threadIdx.x]) isn’t supposed to be zero, anywhere.

So what’s wrong with the element at position 254? Examining where the data pointer comes from, it’s the math on lines 15 and 16 - a bit of pointer arithmetic.

The pointer (data) has been advanced in the second recursive call, so the valid length of the array passed in is smaller - but the size of the kernel launch has not changed: it’s still 256 threads. That means we try to read 256 elements, some of which lie beyond the end of the allocated data. (In terms of the sketch above: the child launched on data + 128 still runs 256 threads, so its thread 254 reaches element 382 of a 256-element allocation.)

We should not touch memory beyond the length of our data: any kernel thread whose index is beyond the size of the array should simply quit.

A quick fix will do this:
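
The fix itself was also a screenshot; in terms of the reconstruction above, it amounts to guarding every access with the thread’s index. (Guarding each access, rather than returning early, keeps every thread reaching the __syncthreads() barriers.)

    __global__ void permute(int n, int *data)
    {
        extern __shared__ int smem[];

        // Threads whose index is past the end of this call's slice of the
        // array do no reads or writes - they only keep the barriers company.
        if (threadIdx.x < n)
            smem[threadIdx.x] = data[threadIdx.x];
        __syncthreads();

        if (threadIdx.x < n)
            smem[threadIdx.x] *= 2;
        __syncthreads();

        if (threadIdx.x < n)
            data[threadIdx.x] = smem[threadIdx.x];

        if (threadIdx.x == 0 && n > 1) {
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data);
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data + n / 2);
        }
    }

An equally valid fix would be to launch each child kernel with n / 2 threads instead of a fixed 256.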

Yet another problem solved - and another lesson learned in developing CUDA!

Read more about debugging and profiling CUDA applications.
