CUDA Debugger and Profiler - Advanced Debugging and Performance Optimization Tools for CUDA and OpenACC

Mark O'Connor
November 25, 2014
5 minute read time.

Debugging and Optimizing CUDA and OpenACC

Arm Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes on GPUs from GeForce to Tesla, including the Kepler-based K80. Forge includes Arm DDT, the parallel and multi-process CUDA debugger, and Arm MAP, the profiler.

Key Features of Arm Forge CUDA support

The Arm DDT debugger enables:

  • Breakpoints in CUDA threads at specific lines of CUDA or OpenACC code.
  • Mixed CPU and GPU debugging - and even multi-process code - all in the same debugging session.
  • View CPU and GPU threads - with our unique thread-consolidating parallel stack view to simplify the information you see and highlight the differences.
  • Debugging of CUDA dynamic parallelism (often used for recursive CUDA).
  • Visibility of all memory types - including register, shared (block) and global - with support for unified virtual addressing and GPUDirect.
  • Step warps, blocks and entire kernels.
  • Debug CUDA core dumps with CUDA 7 and above.
  • Supports multiple GPUs simultaneously.
  • Memory debugging for access errors - and memory leak reporting for global memory.

The Arm MAP profiler enables:

  • View memory transfers and global memory used.
  • View GPU temperature as your job progresses.
  • Profile CPU code at line level (line-level GPU profiling is not supported).
  • View and analyze the time your CPU threads spend waiting for CUDA kernels to complete.

Arm DDT and Arm MAP have support for the combinations that matter to you:

  • Large range of supported compilers
    • CUDA C and C++ from the NVIDIA compilers.
    • CUDA Fortran, F90 and OpenACC from the Portland Group (PGI).
    • Cray OpenACC compiler.
    • Inline PTX.
  • The latest CUDA toolkits - CUDA 6.5, 7, 7.5 and 8.
  • HPC clusters with CUDA (e.g. with MPI).

CUDA debugging

CUDA C, C++ and Fortran and OpenACC are fully supported by Arm DDT. The world's biggest users of CUDA and OpenACC debug their applications with Arm DDT - including on Oak Ridge National Laboratory’s Titan and CSCS’s Daint, the two largest GPU systems in the world.

Let’s start with some tips on how to use Arm DDT for CUDA - or read the list of CUDA features.

Set a breakpoint

Just like a CPU debugger – you can set a breakpoint at any line of CUDA source code. Any time a block of CUDA threads gets to that line, the debugger will pause the whole application.

Explore behavior

GPUs are massively parallel – you can have thousands of threads active at any point in time. Use the debugger to select a CUDA thread by its index, or select a thread that is on a particular line of code.

Stepping a thread is a great way to watch how a kernel progresses. CUDA GPUs differ slightly from CPUs in that they execute threads in groups: other GPU threads in the same “warp” (usually 32 threads) will progress at the same time as the one you step. Did the thread move through the code as you expected?
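
To predict which threads will move together when you step, it can help to work out which warp a thread belongs to. The sketch below is a host-side C++ model, not GPU code: it assumes a warp size of 32 and uses CUDA's documented ordering in which the x index varies fastest within a block. The names `Dim3`, `WarpPosition`, `warp_position` and `same_warp` are made up for this illustration and are not part of Arm DDT or the CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model only: mirrors how CUDA numbers threads within a block.
constexpr std::uint32_t kWarpSize = 32;  // assumption: 32 threads per warp

struct Dim3 { std::uint32_t x, y, z; };  // illustrative stand-in for CUDA's dim3

struct WarpPosition {
    std::uint32_t linear;  // thread's linear index within its block
    std::uint32_t warp;    // warp number within the block
    std::uint32_t lane;    // position within the warp (0..31)
};

// Linearize (x, y, z) with x varying fastest, then split into warp and lane.
inline WarpPosition warp_position(Dim3 thread_idx, Dim3 block_dim) {
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y &&
           thread_idx.z < block_dim.z);
    const std::uint32_t linear = thread_idx.x
                               + thread_idx.y * block_dim.x
                               + thread_idx.z * block_dim.x * block_dim.y;
    return {linear, linear / kWarpSize, linear % kWarpSize};
}

// True when two threads of the same block share a warp, so stepping one
// should be expected to advance the other as well.
inline bool same_warp(Dim3 a, Dim3 b, Dim3 block_dim) {
    return warp_position(a, block_dim).warp == warp_position(b, block_dim).warp;
}
```

In a 128-thread, one-dimensional block, for example, `same_warp({0,0,0}, {31,0,0}, {128,1,1})` is true while `same_warp({31,0,0}, {32,0,0}, {128,1,1})` is false, so thread 32 should not be expected to advance when you step thread 31.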

Visualize data

Each CUDA thread has its own register variables but shares other memory with threads in the same block, the whole device or even the host.

Whatever type of memory your data resides in, you need to check that it holds what you expect. Single values are easily seen – but the really neat trick is to visualize array data, or filter it to look for unusual values.

Visual of array data

Perhaps you want to bring up a second visualization to compare GPU data to the CPU copy as a sanity check?

Verify memory usage

Before you start your application inside Arm DDT – tick the option to debug CUDA memory usage. It’s easy to make an error that reads beyond an array in CUDA. Not all arrays are multiples of the warp-size, but a frequent error is to assume they are.

Those errors are not always fatal, but they cause non-deterministic behavior - which can cause failure at unexpected times. With memory debugging, they’ll be spotted by the debugger – so you can fix them before trouble happens.
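
The usual defence against this class of error is to round the launch size up and guard every access with a bounds check. The following is a minimal host-side C++ sketch of that pattern, with nested loops standing in for blocks and threads; it is not GPU code, and `blocks_needed`, `surplus_threads` and `scale_all` are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
inline std::size_t blocks_needed(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Threads launched past the end of the array; each one relies on the guard.
inline std::size_t surplus_threads(std::size_t n, std::size_t threads_per_block) {
    return blocks_needed(n, threads_per_block) * threads_per_block - n;
}

// Host-side model of a guarded kernel: every launched "thread" runs the body,
// but only threads with a valid index touch the array.
inline std::size_t scale_all(std::vector<float>& data, float factor,
                             std::size_t threads_per_block) {
    const std::size_t n = data.size();
    const std::size_t blocks = blocks_needed(n, threads_per_block);
    std::size_t skipped = 0;
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
            const std::size_t i = block * threads_per_block + thread;
            if (i >= n) { ++skipped; continue; }  // the bounds guard
            data[i] *= factor;
        }
    }
    return skipped;  // how many threads the guard turned away
}
```

With 1000 elements and 128 threads per block, 8 blocks are launched, which is 1024 threads; the last 24 have no element to work on. Without the guard, those are the threads that would read or write beyond the array.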

Profiling and tuning CUDA and OpenACC applications

Whenever you have compute-intensive code, you should profile it to get the most performance. Arm MAP is a profiler that shows which lines of the CPU (host) code take the most execution time.

Step 1: Profile the initial code

Use the Arm MAP profiler to discover which parts of your code are consuming the most CPU time and what they do. If scalar or floating point operations are dominating, you have a good candidate for GPU usage – but if I/O, branching or communication is dominating, you need to fix those issues first.

Profiling CUDA in Arm MAP

Step 2: Profile the results

Once you have a working CUDA or OpenACC code – identify whether the performance has improved – and where the next target for optimization is. If you can reduce the number of times data is transferred between the CPU and GPU, for example by combining a sequence of operations into one larger block of GPU work, performance should improve.
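
A toy model makes the effect of regrouping work easy to see. The C++ sketch below is an assumption-laden illustration rather than anything MAP reports: it treats a program as a string of steps, `C` for a CPU step and `G` for a GPU step, assumes every step works on the same data, that the data starts and must end on the host, and that each change of side costs one transfer. The function name `count_transfers` is hypothetical.

```cpp
#include <cassert>
#include <string>

// Count host<->device copies implied by a schedule of 'C' (CPU) and 'G' (GPU)
// steps, under the simplifying assumptions described above.
inline int count_transfers(const std::string& schedule) {
    char location = 'C';  // data starts on the host
    int transfers = 0;
    for (char step : schedule) {
        assert(step == 'C' || step == 'G');
        if (step != location) {  // data is on the wrong side for this step
            ++transfers;
            location = step;
        }
    }
    if (location != 'C') ++transfers;  // results are copied back to the host
    return transfers;
}
```

In this model `count_transfers("GCGCG")` gives 6, while the regrouped schedule `count_transfers("GGGCC")` gives 2. Real codes also depend on how much data each transfer moves, which the model ignores.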

Note: MAP is not able to give profile information for source lines or functions for CUDA or OpenACC kernels executed on the GPU.

Advanced CUDA: Overlap data transfer

Although we can't see the individual source lines when profiling CUDA, MAP still profiles how the CPU and GPU work together.

Profiling CUDA in Arm MAP 2

One optimization is to overlap GPU and CPU computation. CUDA makes it easy to do this with streams – but you still need to take a look at how much time is spent in data transfer. Too little time waiting at the synchronization? Your GPU may have finished quicker than the CPU – try giving it more work. Too much time at synchronization? Your CPU has wasted cycles you could use!
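
The reasoning about time at the synchronization point can be written down as simple arithmetic. The sketch below is a host-side C++ model under strong assumptions: CPU and GPU work start together, each side has a known duration or a fixed processing rate, and data transfer time is ignored. `OverlapReport`, `overlap_report` and `balanced_gpu_share` are hypothetical names, not MAP output or CUDA API calls.

```cpp
#include <algorithm>
#include <cassert>

struct OverlapReport {
    double cpu_wait_ms;  // CPU time spent blocked at the synchronization
    double gpu_idle_ms;  // GPU time spent finished, waiting for the CPU
    double wall_ms;      // elapsed time for the overlapped phase
};

// Both sides start together; whichever finishes first waits for the other.
inline OverlapReport overlap_report(double cpu_work_ms, double gpu_work_ms) {
    assert(cpu_work_ms >= 0 && gpu_work_ms >= 0);
    return {std::max(0.0, gpu_work_ms - cpu_work_ms),
            std::max(0.0, cpu_work_ms - gpu_work_ms),
            std::max(cpu_work_ms, gpu_work_ms)};
}

// Fraction of the total work (0..1) to give the GPU so both sides finish
// together, assuming work splits freely and rates are in items per ms.
inline double balanced_gpu_share(double cpu_rate, double gpu_rate) {
    assert(cpu_rate > 0 && gpu_rate > 0);
    return gpu_rate / (cpu_rate + gpu_rate);
}
```

If the CPU part takes 10 ms and the GPU part 4 ms, the model reports no CPU wait and 6 ms of idle GPU, which is the "give the GPU more work" case. If the GPU processes items three times as fast as the CPU, `balanced_gpu_share(1, 3)` suggests giving it three quarters of the work.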

More on CUDA

  • Our blog on debugging CUDA dynamic parallelism
  • CUDA energy and power profiling and optimization
  • CUDA resources at NVIDIA
  • The OpenACC.org group
  • Watch a Video on Debugging and Profiling CUDA and OpenACC