Arm Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes, from GeForce to Tesla and the Kepler K80. Forge includes Arm DDT, the parallel, multi-process CUDA debugger, and Arm MAP, the profiler.
The Arm DDT debugger enables:
The Arm MAP profiler enables:
Arm DDT and Arm MAP have support for the combinations that matter to you:
CUDA C, C++ and Fortran, as well as OpenACC, are fully supported by Arm DDT. The world's biggest users of CUDA and OpenACC debug their applications with Arm DDT, including on Oak Ridge National Laboratory's Titan and CSCS's Daint, the two largest GPU systems in the world.
Let’s start with some tips on how to use Arm DDT for CUDA - or read the list of CUDA features.
Just as with a CPU debugger, you can set a breakpoint at any line of CUDA source code. Any time a block of CUDA threads reaches that line, the debugger pauses the whole application.
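For example, a minimal kernel like the one below (the names and sizes are illustrative, not taken from Arm's documentation) can be built with device debug information so that a source-level breakpoint inside the kernel works:

```cpp
// saxpy.cu - a small kernel to try breakpoints on. Build with device debug
// info so the debugger can map GPU source lines:
//   nvcc -g -G -O0 -o saxpy saxpy.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // <-- set a breakpoint on this line
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```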
GPUs are massively parallel and execute threads in a SIMD-like fashion, which means you can have thousands of threads active at any point in time. Use the debugger to select a CUDA thread by its index, or to select a thread that is on a particular line of code.
Stepping a thread is a great way to watch how a kernel progresses. CUDA GPUs are slightly different from CPUs in that they execute threads in groups, so it's worth knowing that the other GPU threads in the same "warp" (usually 32 threads) will also progress at the same time. Did the thread move through the code as you expected?
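As a sketch of what those indices mean, the throwaway kernel below (hypothetical, purely for illustration) computes a thread's global index from the CUDA built-in variables and shows how threads group into warps of warpSize threads:

```cpp
// Shows how a thread's global index is built from the CUDA built-ins, and
// how threads group into warps of warpSize (32 on current NVIDIA GPUs).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void who_am_i()
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x; // the index you pick in the debugger
    int warp_id   = threadIdx.x / warpSize;                // which warp within the block
    int lane_id   = threadIdx.x % warpSize;                // position inside that warp
    // Stepping one thread here also advances the other threads in its warp.
    if (lane_id == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp_id, global_id);
}

int main()
{
    who_am_i<<<2, 64>>>();   // 2 blocks of 64 threads = 4 warps
    cudaDeviceSynchronize();
    return 0;
}
```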
Each CUDA thread has its own register variables, but shares other memory with the threads in the same block, with the whole device, or even with the host.
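A small sketch of those memory spaces, with illustrative names: per-thread registers, per-block __shared__ memory, device-wide global memory, and host memory reached through explicit copies:

```cpp
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, float factor)
{
    __shared__ float tile[256];          // shared: visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i] * factor;            // v lives in a per-thread register
    tile[threadIdx.x] = v;               // stage the value in shared memory
    __syncthreads();
    out[i] = tile[threadIdx.x];          // in/out are device global memory
}

int main()
{
    const int n = 1024;
    float host_in[n], host_out[n];       // host memory
    for (int i = 0; i < n; ++i) host_in[i] = float(i);

    float *dev_in, *dev_out;
    cudaMalloc(&dev_in,  n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<n / 256, 256>>>(dev_in, dev_out, 2.0f);
    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return 0;
}
```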
Whatever type of memory your data resides in, you need to check that it is what you expect it to be. Single values are easy to inspect, but the really neat trick is to visualize array data, or to filter it to look for unusual values.
Perhaps you want to bring up a second visualization to compare GPU data to the CPU copy as a sanity check?
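One simple way to build the same kind of sanity check into a test program (a generic sketch, not a DDT feature) is to keep a CPU reference copy of the computation and compare it against the data copied back from the GPU:

```cpp
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1000;
    float host_in[n], host_ref[n], host_gpu[n];
    for (int i = 0; i < n; ++i) {
        host_in[i]  = 0.001f * i;
        host_ref[i] = host_in[i] * host_in[i];   // CPU reference copy
    }

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(host_gpu, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Filter for unusual values: anything that disagrees with the CPU copy.
    for (int i = 0; i < n; ++i)
        if (std::fabs(host_gpu[i] - host_ref[i]) > 1e-5f)
            printf("mismatch at %d: gpu %f vs cpu %f\n", i, host_gpu[i], host_ref[i]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```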
Before you start your application inside Arm DDT, tick the option to debug CUDA memory usage. It's easy to write CUDA code that reads beyond the end of an array: not every array length is a multiple of the warp size, but a frequent error is to assume it is.
Such errors are not always fatal, but they cause non-deterministic behavior, which can lead to failures at unexpected times. With memory debugging enabled, the debugger spots them so you can fix them before trouble happens.
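The sketch below shows the classic form of this bug, with illustrative names and sizes: the grid is rounded up to whole blocks, so without a bounds check the last threads write past the end of an array whose length is not a multiple of the block size:

```cpp
#include <cuda_runtime.h>

__global__ void fill(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // BUG: with n = 1000 and 256-thread blocks, 1024 threads run and the
    // last 24 write beyond the allocation. Guarding with `if (i < n)` fixes it.
    data[i] = 1.0f;
}

int main()
{
    const int n = 1000;                      // not a multiple of 256
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```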
Whenever you have a compute-intensive code, you should profile it to get the most performance. Arm MAP is an application profiler: it shows the lines of host (CPU) code where the most time is being spent.
Use the Arm MAP profiler to discover which parts of your code consume the most CPU time and what they do. If scalar or floating-point operations dominate, you have a good candidate for GPU usage; but if I/O, branching or communication dominates, you need to fix those issues first.
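As a rough illustration (the hot spots below are hypothetical, not taken from a real profile), the first loop is dominated by floating-point work and is a plausible offload candidate, while the second is dominated by I/O and would gain little from a GPU:

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static double a[1 << 20];

    // Compute-bound hot spot: almost all time goes to floating-point work.
    for (int i = 0; i < n; ++i)
        a[i] = std::sin(i * 0.001) * std::cos(i * 0.002);

    // I/O-bound hot spot: time goes to formatting and disk, not arithmetic.
    FILE *f = std::fopen("out.txt", "w");
    for (int i = 0; i < n; ++i)
        std::fprintf(f, "%f\n", a[i]);
    std::fclose(f);
    return 0;
}
```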
Once you have a working CUDA or OpenACC code, check whether performance has actually improved and identify the next target for optimization. If you can reduce the number of times data is transferred between the CPU and GPU, for example by combining a sequence of operations into one larger piece of GPU work so the data stays on the device, performance should improve.
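A sketch of that idea, with illustrative kernel names: chain the steps on the device and copy the data in once and out once, instead of copying back to the host between every step:

```cpp
#include <cuda_runtime.h>

__global__ void step1(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
__global__ void step2(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

int main()
{
    const int n = 1 << 20;
    float *h = new float[n]();
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // one copy in
    step1<<<(n + 255) / 256, 256>>>(d, n);   // data stays resident on the GPU
    step2<<<(n + 255) / 256, 256>>>(d, n);   // no round trip to the host between steps
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // one copy out

    cudaFree(d);
    delete[] h;
    return 0;
}
```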
Note: MAP is not able to give profile information for source lines or functions for CUDA or OpenACC kernels executed on the GPU.
Although we can't see the individual source lines when profiling CUDA, MAP still profiles how the CPU and GPU work together.
One optimization is to overlap GPU and CPU computation. CUDA makes this easy with streams, but you still need to take a look at how much time is spent in data transfer. Too little time waiting at the synchronization? Your GPU may have finished quicker than the CPU: try giving it more work. Too much time at the synchronization? Your CPU is wasting cycles you could use!
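Here is a minimal sketch of that pattern, with illustrative names and workloads: the copy and kernel are queued asynchronously on a stream, the CPU does independent work in the meantime, and the only wait is the final synchronization:

```cpp
#include <cuda_runtime.h>

__global__ void gpu_work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

void cpu_work(float *h, int n)
{
    for (int i = 0; i < n; ++i) h[i] += 1.0f;   // independent host-side work
}

int main()
{
    const int n = 1 << 20;
    float *h_gpu, *h_cpu, *d;
    cudaMallocHost(&h_gpu, n * sizeof(float));  // pinned memory so the copy can be asynchronous
    h_cpu = new float[n]();
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h_gpu, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    gpu_work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h_gpu, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cpu_work(h_cpu, n);               // overlaps with the GPU stream above

    cudaStreamSynchronize(stream);    // time spent waiting here shows who finished first

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h_gpu);
    delete[] h_cpu;
    return 0;
}
```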