Debugging CUDA Dynamic Parallelism

David Lecomber
September 13, 2013

Today, using one of the early examples from the CUDA toolkit, I’m going to introduce a neat feature of CUDA 5 and CUDA 5.5 - dynamic parallelism - and how to use Arm DDT to debug it.

What is CUDA?

CUDA brings highly parallel computing into the graphics card (GPU) of your laptop - or your purpose-built hybrid supercomputer. CUDA’s programming language is an extension to C, and OpenACC, a higher-level directive-based extension to C and Fortran, can also be used.

Once you start developing your own code for the GPU, your first bug won’t be far away - and with an application running as thousands of concurrent threads, the days of debugging with print statements are gone!

CUDA and OpenACC debugging is a breeze with Arm DDT: step through threads and blocks of threads, watch how the program progresses, run to a breakpoint, evaluate variables. It all sounds like debugging regular code! You can find out more on our CUDA page.

CUDA Dynamic Parallelism

Dynamic parallelism, new in CUDA 5, lets one kernel invoke another. It lets you move the control logic for recursion, for example, from the CPU over to the GPU. At a stroke, performance is improved and code is simplified - a double win.
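As a minimal sketch of the idea (not the blog’s actual example - the kernel name, recursion depth, and sizes here are all assumptions), a kernel can launch a child grid of itself directly from the device. This requires compiling with relocatable device code (`-rdc=true`) for compute capability 3.5 or later:

```cuda
// Device-side recursion: each kernel invocation may launch a child grid.
// Compile with: nvcc -rdc=true -arch=sm_35 example.cu
__global__ void scale(float *data, int depth)
{
    data[threadIdx.x] *= 2.0f;            // each thread doubles its element
    __syncthreads();

    // One thread per block launches the child grid, up to an
    // (assumed) maximum recursion depth.
    if (threadIdx.x == 0 && depth < 4)
        scale<<<1, blockDim.x>>>(data, depth + 1);
}

int main()
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    scale<<<1, 256>>>(d, 0);              // parent launch from the host
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Without dynamic parallelism, the host would have to synchronize and relaunch the kernel for every level of recursion; here the control loop lives entirely on the GPU.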

Here’s how things look after we’ve recursed inside a GPU kernel: there are 4 active GPU kernels, with a total of 128 threads.

The stacks of multiple CUDA kernels simultaneously scheduled

The threads are at varying stages through the kernel - for example, 32 threads are at line 12 and 93 threads are at line 19. You can see the CPU threads (in main and in a CUDA API call) and the GPU threads all on the same screen.

Let’s dive into a real program. This example is called “permute”, but the purpose of the kernel is simply to repeatedly double the values in an array using a recursive pattern.

The Kernel: Shared memory, and synchronization - can’t go wrong!

The wrapping C code

The code prints out element 42 of the array to see how many times it has been doubled - from the initial value of 21.
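The kernel and wrapping C code appear only as screenshots in the original post. Reconstructed from the description (all names, sizes, and the recursion depth are assumptions), the kernel carries the bug discussed below: each recursive launch advances the data pointer but still launches a full 256 threads, so deeper calls read past the end of the allocation:

```cuda
// Hypothetical reconstruction of the "permute" kernel described in the post;
// the real code is in the missing screenshots.
__global__ void permute(int *data, int depth)
{
    __shared__ int tmp[256];

    tmp[threadIdx.x] = 2 * data[threadIdx.x];  // double into shared memory
    __syncthreads();
    data[threadIdx.x] = tmp[threadIdx.x];      // write back

    if (threadIdx.x == 0 && depth < 4)
        // Pointer arithmetic advances data, but the child grid is still
        // 256 threads wide - the seed of the bug found below.
        permute<<<1, 256>>>(data + 128, depth + 1);
}
```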

Session output

That’s not good - how did it get to zero?

Debugging Dynamic Parallelism

The first step to debug any CUDA problem is to fire up Arm DDT - and see what happens!

Instantly we see a message: that’s a memory error - reading a location that is not valid (an illegal lane address). Arm DDT has selected the errant CUDA thread (thread number 254).

That’s unusual - the data (data[threadIdx.x]) isn’t supposed to be zero, anywhere.

So what’s wrong with the element at position 254? Examining where the pointer for data comes from, it’s the arithmetic on lines 15 and 16 - a bit of pointer math.

The pointer (data) has increased in the second recursive call, so the valid length of the array passed in is smaller - but the kernel size has not changed. It is still 256 threads, which means we try to read 256 elements, some of which lie beyond the end of the allocated data.

We should not touch memory beyond the length of our array: any kernel thread whose index lies beyond the size of the array should simply quit.

A quick fix will do this:
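The actual patch is in a screenshot missing from this page. One way to write it (a sketch, with assumed names) is to pass the remaining valid length down through the recursion and bail out early:

```cuda
// Sketch of the fix: threads past the valid length return immediately,
// and the remaining size is passed down to each child launch.
__global__ void permute(int *data, int size, int depth)
{
    if (threadIdx.x >= size)   // quit if this thread's index is out of range
        return;

    __shared__ int tmp[256];
    tmp[threadIdx.x] = 2 * data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = tmp[threadIdx.x];

    if (threadIdx.x == 0 && depth < 4)
        permute<<<1, 256>>>(data + 128, size - 128, depth + 1);
}
```

The launch is still 256 threads wide, but the out-of-range threads now exit before reading or writing anything.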

Yet another problem solved - and another lesson learned in developing CUDA!

Read more about debugging and profiling CUDA applications.
