Today, using one of the early examples from the CUDA toolkit, I’m going to introduce a neat feature of CUDA 5 and CUDA 5.5 - dynamic parallelism - and how to use Arm DDT to debug it.
CUDA brings highly parallel computing into the graphics card (GPU) of your laptop - or your purpose-built hybrid supercomputer. CUDA’s programming language is an extension to C, and there is also OpenACC, a higher-level extension to C and Fortran that can be used too.
Once you start to develop your own code for the GPU - you can be sure that your first bug won’t be too far away - but with an application running as thousands of concurrent threads, the days of print statements for debugging are gone!
CUDA and OpenACC debugging is a breeze with Arm DDT: Step threads and blocks of threads, watch how the program progresses, run to a breakpoint, evaluate variables. It’s all sounding like debugging regular code! You can find out more on our CUDA page.
Dynamic parallelism in CUDA lets one kernel invoke another - it was new in CUDA 5. It lets you move the control logic for recursion, for example, from the CPU over to the GPU. At a stroke, performance is improved, and code simplified. A double win.
Here’s how things look, after we’ve recursed inside a GPU kernel - it’s got 4 active GPU kernels - with a total of 128 threads.
The stacks of multiple CUDA kernels simultaneously scheduled
The threads are at varying stages through the kernel - for example 32 threads are at line 12, and 93 threads are at line 19. You can see the CPU threads (in main, and in a CUDA API call) and GPU threads all in the same screen.
Let’s dive into a real program. This example is called “permute” - but the purpose of the kernel is to just repeatedly double the values in an array using a recursive pattern.
The Kernel: Shared memory, and synchronization - can’t go wrong!
The wrapping C code
The code prints out element 42 of the array to see how many times it has been doubled - from the initial value of 21.
Session output
That’s not good - how did it get to zero?
The first step to debug any CUDA problem is to fire up Arm DDT - and see what happens!
Instantly we see a message: that’s a memory error - reading a location that is not valid (an illegal lane address). Arm DDT has selected the errant CUDA thread (thread number 254).
That’s unusual - the data (data[threadIdx.x]) isn’t supposed to be zero, anywhere.
So what’s wrong with the element at position 254? Examining where the pointer for data comes from - it’s the math on line 15 and 16, a bit of pointer arithmetic.
The pointer (data) has increased in the second recursive call, meaning the valid length of the array passed in should be smaller - but the size of kernel has not changed, it’s still 256 - and that’ll mean we try to read 256 elements, some of which are beyond the end of the allocated data.
We should not be changing data beyond the length of our data, and if we are a kernel thread index beyond the size of the array, we should quit.
A quick fix will do this:
Yet another problem solved - and another lesson learned in developing CUDA!
Read more about debugging and profiling CUDA applications.