Arm Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes, from GeForce to Tesla and the Kepler K80. Forge includes Arm DDT, the parallel, multi-process CUDA debugger, and Arm MAP, the profiler.
The Arm DDT debugger enables:
The Arm MAP profiler enables:
Arm DDT and Arm MAP have support for the combinations that matter to you:
CUDA C, C++ and Fortran, as well as OpenACC, are fully supported by Arm DDT. The world's biggest users of CUDA and OpenACC debug their applications with Arm DDT, including on Oak Ridge National Laboratory's Titan and CSCS's Daint, the two largest GPU systems in the world.
Let’s start with some tips on how to use Arm DDT for CUDA - or read the list of CUDA features.
Just as with a CPU debugger, you can set a breakpoint at any line of CUDA source code. Any time a block of CUDA threads reaches that line, the debugger pauses the whole application.
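For example, a minimal kernel like the one below (the names and sizes are illustrative, not taken from Arm's documentation) can be built with device debug information so that a source-level breakpoint inside the kernel works:

```cpp
// saxpy.cu - a small kernel to try breakpoints on. Build with device debug
// info so the debugger can map GPU source lines:
//   nvcc -g -G -O0 -o saxpy saxpy.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // <-- set a breakpoint on this line
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```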
GPUs are massively parallel and execute threads in a SIMD-like fashion, which means you can have thousands of threads active at any point in time. Use the debugger to select a CUDA thread by its index, or to select a thread that is on a particular line of code.
Stepping a thread is a great way to watch how a kernel progresses. CUDA GPUs are slightly different from CPUs in that they execute threads in groups, so it's worth knowing that the other GPU threads in the same "warp" (usually 32 threads) will also progress at the same time. Did the thread move through the code as you expected?
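As a sketch of what those indices mean, the throwaway kernel below (hypothetical, purely for illustration) computes a thread's global index from the CUDA built-in variables and shows how threads group into warps of warpSize threads:

```cpp
// Shows how a thread's global index is built from the CUDA built-ins, and
// how threads group into warps of warpSize (32 on current NVIDIA GPUs).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void who_am_i()
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x; // the index you pick in the debugger
    int warp_id   = threadIdx.x / warpSize;                // which warp within the block
    int lane_id   = threadIdx.x % warpSize;                // position inside that warp
    // Stepping one thread here also advances the other threads in its warp.
    if (lane_id == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp_id, global_id);
}

int main()
{
    who_am_i<<<2, 64>>>();   // 2 blocks of 64 threads = 4 warps
    cudaDeviceSynchronize();
    return 0;
}
```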
Each CUDA thread has its own register variables, but shares other memory with the threads in the same block, with the whole device, or even with the host.
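A small sketch of those memory spaces, with illustrative names: per-thread registers, per-block __shared__ memory, device-wide global memory, and host memory reached through explicit copies:

```cpp
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, float factor)
{
    __shared__ float tile[256];          // shared: visible to every thread in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i] * factor;            // v lives in a per-thread register
    tile[threadIdx.x] = v;               // stage the value in shared memory
    __syncthreads();
    out[i] = tile[threadIdx.x];          // in/out are device global memory
}

int main()
{
    const int n = 1024;
    float host_in[n], host_out[n];       // host memory
    for (int i = 0; i < n; ++i) host_in[i] = float(i);

    float *dev_in, *dev_out;
    cudaMalloc(&dev_in,  n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));
    cudaMemcpy(dev_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<n / 256, 256>>>(dev_in, dev_out, 2.0f);
    cudaMemcpy(host_out, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return 0;
}
```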
Whatever type of memory your data resides in, you need to check that it is what you expect it to be. Single values are easy to inspect, but the really neat trick is to visualize array data, or to filter it to look for unusual values.
Perhaps you want to bring up a second visualization to compare GPU data to the CPU copy as a sanity check?
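One simple way to build the same kind of sanity check into a test program (a generic sketch, not a DDT feature) is to keep a CPU reference copy of the computation and compare it against the data copied back from the GPU:

```cpp
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1000;
    float host_in[n], host_ref[n], host_gpu[n];
    for (int i = 0; i < n; ++i) {
        host_in[i]  = 0.001f * i;
        host_ref[i] = host_in[i] * host_in[i];   // CPU reference copy
    }

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(host_gpu, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Filter for unusual values: anything that disagrees with the CPU copy.
    for (int i = 0; i < n; ++i)
        if (std::fabs(host_gpu[i] - host_ref[i]) > 1e-5f)
            printf("mismatch at %d: gpu %f vs cpu %f\n", i, host_gpu[i], host_ref[i]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```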
Before you start your application inside Arm DDT, tick the option to debug CUDA memory usage. It's easy to write CUDA code that reads beyond the end of an array: not every array length is a multiple of the warp size, but a frequent error is to assume it is.
Such errors are not always fatal, but they cause non-deterministic behavior, which can lead to failures at unexpected times. With memory debugging enabled, the debugger spots them so you can fix them before trouble happens.
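The sketch below shows the classic form of this bug, with illustrative names and sizes: the grid is rounded up to whole blocks, so without a bounds check the last threads write past the end of an array whose length is not a multiple of the block size:

```cpp
#include <cuda_runtime.h>

__global__ void fill(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // BUG: with n = 1000 and 256-thread blocks, 1024 threads run and the
    // last 24 write beyond the allocation. Guarding with `if (i < n)` fixes it.
    data[i] = 1.0f;
}

int main()
{
    const int n = 1000;                      // not a multiple of 256
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```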
Whenever you have a compute-intensive code, you should profile it to get the most performance. Arm MAP is an application profiler: it shows the lines of host (CPU) code where the most time is being spent.
Use the Arm MAP profiler to discover which parts of your code consume the most CPU time and what they do. If scalar or floating-point operations dominate, you have a good candidate for GPU usage; but if I/O, branching or communication dominates, you need to fix those issues first.
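As a rough illustration (the hot spots below are hypothetical, not taken from a real profile), the first loop is dominated by floating-point work and is a plausible offload candidate, while the second is dominated by I/O and would gain little from a GPU:

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    static double a[1 << 20];

    // Compute-bound hot spot: almost all time goes to floating-point work.
    for (int i = 0; i < n; ++i)
        a[i] = std::sin(i * 0.001) * std::cos(i * 0.002);

    // I/O-bound hot spot: time goes to formatting and disk, not arithmetic.
    FILE *f = std::fopen("out.txt", "w");
    for (int i = 0; i < n; ++i)
        std::fprintf(f, "%f\n", a[i]);
    std::fclose(f);
    return 0;
}
```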
Once you have a working CUDA or OpenACC code, check whether performance has actually improved and identify the next target for optimization. If you can reduce the number of times data is transferred between the CPU and GPU, for example by combining a sequence of operations into one larger piece of GPU work so the data stays on the device, performance should improve.
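A sketch of that idea, with illustrative kernel names: chain the steps on the device and copy the data in once and out once, instead of copying back to the host between every step:

```cpp
#include <cuda_runtime.h>

__global__ void step1(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
__global__ void step2(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }

int main()
{
    const int n = 1 << 20;
    float *h = new float[n]();
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // one copy in
    step1<<<(n + 255) / 256, 256>>>(d, n);   // data stays resident on the GPU
    step2<<<(n + 255) / 256, 256>>>(d, n);   // no round trip to the host between steps
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // one copy out

    cudaFree(d);
    delete[] h;
    return 0;
}
```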
Note: MAP is not able to give profile information for source lines or functions for CUDA or OpenACC kernels executed on the GPU.
Although we can't see the individual source lines when profiling CUDA, MAP still profiles how the CPU and GPU work together.
One optimization is to overlap GPU and CPU computation. CUDA makes this easy with streams, but you still need to take a look at how much time is spent in data transfer. Too little time waiting at the synchronization? Your GPU may have finished quicker than the CPU: try giving it more work. Too much time at the synchronization? Your CPU is wasting cycles you could use!
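Here is a minimal sketch of that pattern, with illustrative names and workloads: the copy and kernel are queued asynchronously on a stream, the CPU does independent work in the meantime, and the only wait is the final synchronization:

```cpp
#include <cuda_runtime.h>

__global__ void gpu_work(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

void cpu_work(float *h, int n)
{
    for (int i = 0; i < n; ++i) h[i] += 1.0f;   // independent host-side work
}

int main()
{
    const int n = 1 << 20;
    float *h_gpu, *h_cpu, *d;
    cudaMallocHost(&h_gpu, n * sizeof(float));  // pinned memory so the copy can be asynchronous
    h_cpu = new float[n]();
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h_gpu, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    gpu_work<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h_gpu, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cpu_work(h_cpu, n);               // overlaps with the GPU stream above

    cudaStreamSynchronize(stream);    // time spent waiting here shows who finished first

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h_gpu);
    delete[] h_cpu;
    return 0;
}
```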