
Debugging CUDA Dynamic Parallelism

David Lecomber
September 13, 2013
3 minute read time.

Today, using one of the early examples from the CUDA toolkit, I’m going to introduce a neat feature of CUDA 5 and CUDA 5.5 - dynamic parallelism - and how to use Arm DDT to debug it.

What is CUDA?

CUDA brings highly parallel computing to the graphics card (GPU) of your laptop - or of your purpose-built hybrid supercomputer. CUDA’s programming language is an extension to C; there is also OpenACC, a higher-level, directive-based extension to C and Fortran.
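
To give a flavour, here is a minimal, illustrative CUDA kernel - a sketch of my own, not code from the example we debug below - in which each GPU thread doubles one element of an array:

    // Illustrative only: one GPU thread per array element.
    __global__ void double_elements(int n, int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)            // guard threads past the end of the array
            data[i] *= 2;
    }

    // Launched from the host with, for example:
    //   double_elements<<<(n + 255) / 256, 256>>>(n, data);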

Once you start to develop your own code for the GPU, you can be sure your first bug won’t be far away - and with an application running thousands of concurrent threads, the days of debugging with print statements are gone!

CUDA and OpenACC debugging is a breeze with Arm DDT: step threads and blocks of threads, watch the program progress, run to a breakpoint, evaluate variables - it feels just like debugging regular code! You can find out more on our CUDA page.

CUDA Dynamic Parallelism

Dynamic parallelism, new in CUDA 5, lets one kernel invoke another. That means control logic - for recursion, for example - can move from the CPU over to the GPU. At a stroke, performance is improved and the code is simplified. A double win.
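
As a minimal sketch of the idea (the kernel names here are illustrative; the code assumes a compute capability 3.5 device, built with nvcc -arch=sm_35 -rdc=true):

    __global__ void child(int *out)
    {
        out[threadIdx.x] += 1;
    }

    __global__ void parent(int *out)
    {
        // One parent thread launches a whole child grid from the GPU -
        // no round trip through the CPU needed.
        if (threadIdx.x == 0)
            child<<<1, 32>>>(out);

        // In CUDA 5, a device-side cudaDeviceSynchronize() waits for
        // child grids launched by this block to complete.
        cudaDeviceSynchronize();
    }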

Here’s how things look after we’ve recursed inside a GPU kernel: there are four active GPU kernels, with a total of 128 threads.

[Screenshot: the stacks of multiple CUDA kernels simultaneously scheduled]

The threads are at varying stages through the kernel - for example, 32 threads are at line 12 and 93 threads are at line 19. You can see the CPU threads (in main, and in a CUDA API call) and the GPU threads all on the same screen.

Let’s dive into a real program. This example is called “permute” - but despite the name, the kernel simply doubles the values in an array over and over, using a recursive pattern.

[Screenshot: the kernel - shared memory, and synchronization - can’t go wrong!]
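
The kernel itself appeared as a screenshot in the original post. As a hedged reconstruction from the description (the names, shared-memory size, and line numbers are my assumptions), it is along these lines:

    __global__ void permute(int n, int *data)
    {
        // Stage the data into shared memory, double it, and write it back.
        extern __shared__ int smem[];
        smem[threadIdx.x] = data[threadIdx.x];
        __syncthreads();

        smem[threadIdx.x] *= 2;
        __syncthreads();

        data[threadIdx.x] = smem[threadIdx.x];

        // Recurse: thread 0 launches a child kernel on each half of the array.
        if (threadIdx.x == 0 && n > 1) {
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data);
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data + n / 2);
        }
    }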

[Screenshot: the wrapping C code]

The code prints out element 42 of the array to see how many times it has been doubled - from the initial value of 21.
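
The wrapper, too, was shown as a screenshot; here is a sketch of what it plausibly looks like (the 256-element size matches the launch size discussed later, and the names are assumptions):

    #include <stdio.h>

    int main(void)
    {
        const int N = 256;
        int host_data[N];
        for (int i = 0; i < N; ++i)
            host_data[i] = 21;                        // initial value

        int *data;
        cudaMalloc(&data, N * sizeof(int));
        cudaMemcpy(data, host_data, N * sizeof(int), cudaMemcpyHostToDevice);

        // One top-level launch; all further recursion happens on the device.
        permute<<<1, N, N * sizeof(int)>>>(N, data);
        cudaDeviceSynchronize();

        cudaMemcpy(host_data, data, N * sizeof(int), cudaMemcpyDeviceToHost);
        printf("data[42] = %d\n", host_data[42]);     // how many doublings?

        cudaFree(data);
        return 0;
    }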

[Screenshot: session output]

That’s not good - how did it get to zero?

Debugging Dynamic Parallelism

The first step in debugging any CUDA problem is to fire up Arm DDT - and see what happens!

Instantly we see a message: it’s a memory error - a read from a location that is not valid (an illegal lane address). Arm DDT has selected the errant CUDA thread (thread number 254).

That’s unusual - the data (data[threadIdx.x]) isn’t supposed to be zero, anywhere.

So what’s wrong with the element at position 254? Examining where the data pointer comes from, it’s the math on lines 15 and 16 - a bit of pointer arithmetic.

The pointer (data) has been advanced in the second recursive call, so the valid length of the array passed in is smaller - but the size of the kernel launch has not changed: it’s still 256 threads. That means we try to read 256 elements, some of which lie beyond the end of the allocated data. (In terms of the sketch above: the child launched on data + 128 still runs 256 threads, so its thread 254 reaches element 382 of a 256-element allocation.)

We should not touch memory beyond the length of our data: any kernel thread whose index is beyond the size of the array should simply quit.

A quick fix will do this:
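
The fix itself was also a screenshot; in terms of the reconstruction above, it amounts to guarding every access with the thread’s index. (Guarding each access, rather than returning early, keeps every thread reaching the __syncthreads() barriers.)

    __global__ void permute(int n, int *data)
    {
        extern __shared__ int smem[];

        // Threads whose index is past the end of this call's slice of the
        // array do no reads or writes - they only keep the barriers company.
        if (threadIdx.x < n)
            smem[threadIdx.x] = data[threadIdx.x];
        __syncthreads();

        if (threadIdx.x < n)
            smem[threadIdx.x] *= 2;
        __syncthreads();

        if (threadIdx.x < n)
            data[threadIdx.x] = smem[threadIdx.x];

        if (threadIdx.x == 0 && n > 1) {
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data);
            permute<<<1, 256, 256 * sizeof(int)>>>(n / 2, data + n / 2);
        }
    }

An equally valid fix would be to launch each child kernel with n / 2 threads instead of a fixed 256.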

Yet another problem solved - and another lesson learned in developing CUDA!

Read more about debugging and profiling CUDA applications.
