CUDA Debugger and Profiler - Advanced Debugging and Performance Optimization Tools for CUDA and OpenACC

Mark O'Connor
November 25, 2014

Debugging and Optimizing CUDA and OpenACC

Arm Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes, from GeForce to Tesla and the Kepler K80. Forge includes Arm DDT, a parallel and multi-process CUDA debugger, and Arm MAP, a profiler.

Key Features of Arm Forge CUDA support

The Arm DDT debugger supports:

  • Breakpoints in CUDA threads at specific lines of CUDA or OpenACC code.
  • Mixed CPU and GPU debugging, including multi-process code, all in the same debugging session.
  • Viewing CPU and GPU threads together, with our unique thread-consolidating parallel stack view to simplify the information you see and highlight the differences.
  • Dynamic parallelism (often used for recursive CUDA).
  • Visibility of all memory types, including register, shared (block) and global memory, as well as unified virtual addressing and GPU Direct.
  • Stepping by warp, block or entire kernel.
  • Debugging CUDA core dumps with CUDA 7 and above.
  • Multiple GPUs simultaneously.
  • Memory debugging for access errors, and memory leak reporting for global memory.

The Arm MAP profiler lets you:

  • View memory transfers and global memory used.
  • View GPU temperature as your job progresses.
  • Profile CPU code at line level (line-level GPU profiling is not supported).
  • View and analyze the time your CPU threads spend waiting for CUDA kernels to complete.

Arm DDT and Arm MAP have support for the combinations that matter to you:

  • Large range of supported compilers
    • CUDA C and C++ from the NVIDIA compilers.
    • CUDA Fortran, F90 and OpenACC from The Portland Group (PGI).
    • Cray OpenACC compiler.
    • Inline PTX.
  • The latest CUDA toolkits - CUDA 6.5, 7, 7.5 and 8.
  • HPC clusters with CUDA (e.g. MPI).

CUDA debugging

CUDA C, C++ and Fortran and OpenACC are fully supported by Arm DDT. The world's biggest users of CUDA and OpenACC debug their applications with Arm DDT - including on Oak Ridge National Laboratory’s Titan and CSCS’s Daint, the two largest GPU systems in the world.

Let’s start with some tips on how to use Arm DDT for CUDA - or read the list of CUDA features.

Set a breakpoint

Just as in a CPU debugger, you can set a breakpoint at any line of CUDA source code. Whenever a block of CUDA threads reaches that line, the debugger pauses the whole application.

Explore behavior

GPUs are massively SIMD, which means thousands of threads can be active at any point in time. Use the debugger to select a CUDA thread by its index, or select a thread that is on a particular line of code.

[Figure: CUDA debugging]

Stepping a thread is a great way to watch how a kernel progresses. CUDA GPUs differ slightly from CPUs in that they execute threads in groups, so other GPU threads in the same “warp” (usually 32 threads) will progress at the same time. Did the thread move through the code as you expected?
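To make the grouping concrete, here is a minimal host-side C++ sketch of how a thread's linear index within its block maps to a warp. The names `kWarpSize`, `WarpPosition`, `warp_position` and `step_together` are hypothetical helpers for illustration, not part of Arm DDT or the CUDA API, and the sketch assumes a warp size of 32 and ignores divergence within a warp.

```cpp
#include <cassert>

// Assumed warp size; 32 matches the "usually 32 threads" figure in the text.
constexpr int kWarpSize = 32;

struct WarpPosition {
    int warp;  // index of the warp within the block
    int lane;  // position of the thread within its warp
};

// Hypothetical helper: map a thread's linear index within its block
// (threadIdx.x for a one-dimensional block) to its warp and lane.
WarpPosition warp_position(int thread_in_block) {
    assert(thread_in_block >= 0);
    return {thread_in_block / kWarpSize, thread_in_block % kWarpSize};
}

// Two threads of the same block progress together when one of them is
// stepped if they belong to the same warp (divergence is ignored here).
bool step_together(int thread_a, int thread_b) {
    return warp_position(thread_a).warp == warp_position(thread_b).warp;
}
```

Under these assumptions, stepping thread 5 of a block also advances thread 20, while thread 32 belongs to the next warp and is not moved by that step.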

Visualize data

Each CUDA thread has its own register variables but shares other memory with threads in the same block, the whole device or even the host.

Whatever type of memory the data resides in, you need to check that it is what you expect. Single values are easily seen, but the really neat trick is to visualize array data or filter it to look for unusual values.

[Figure: Visual of array data]

Perhaps you want to bring up a second visualization to compare GPU data to the CPU copy as a sanity check?
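The same sanity check can also be scripted once the GPU data has been copied back to the host. The sketch below is one possible way to do it in plain C++; `find_mismatches` and its tolerance parameter are hypothetical names chosen for illustration, and the copy back from the device (for example with `cudaMemcpy`) is assumed to have already happened.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the indices where the data copied back from
// the GPU is NaN or differs from the CPU reference by more than `tolerance`.
// Only the overlapping prefix is compared if the sizes differ.
std::vector<std::size_t> find_mismatches(const std::vector<float>& cpu,
                                         const std::vector<float>& gpu,
                                         float tolerance) {
    assert(tolerance >= 0.0f);
    std::vector<std::size_t> bad;
    const std::size_t n = cpu.size() < gpu.size() ? cpu.size() : gpu.size();
    for (std::size_t i = 0; i < n; ++i) {
        const bool gpu_is_nan = std::isnan(gpu[i]);
        if (gpu_is_nan || std::fabs(cpu[i] - gpu[i]) > tolerance) {
            bad.push_back(i);
        }
    }
    return bad;
}
```

An empty result means the two copies agree within the tolerance; any returned index is a good place to point the debugger's array view.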

Verify memory usage

Before you start your application inside Arm DDT, tick the option to debug CUDA memory usage. It’s easy to make an error that reads beyond the end of an array in CUDA. Not all arrays are multiples of the warp size, but a frequent error is to assume they are.

Those errors are not always fatal, but they cause non-deterministic behavior - which can cause failure at unexpected times. With memory debugging, they’ll be spotted by the debugger – so you can fix them before trouble happens.
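The arithmetic behind this class of bug can be sketched on the host. The C++ below is an illustration only: `blocks_needed`, `surplus_threads` and `in_bounds` are hypothetical helper names, and the sketch assumes a one-dimensional launch in which each thread handles the element at index `block * block_size + thread`.

```cpp
#include <cassert>

// Number of blocks needed to cover n elements (ceiling division).
int blocks_needed(int n, int block_size) {
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Threads launched past the end of the array. Each of these would access
// memory out of bounds unless the kernel checks its index against n first.
int surplus_threads(int n, int block_size) {
    return blocks_needed(n, block_size) * block_size - n;
}

// The guard a kernel would apply before touching element
// block * block_size + thread of an array of length n.
bool in_bounds(int block, int thread, int block_size, int n) {
    return block * block_size + thread < n;
}
```

For an array of 1000 elements and blocks of 128 threads, 8 blocks are launched and 24 threads fall beyond the end of the array. In a CUDA kernel the equivalent guard is typically an `if (i < n)` around the array access.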

Profiling and tuning CUDA and OpenACC applications

Whenever you have compute-intensive code, you should profile it to get the most performance. Arm MAP is an application profiler: it shows the lines of CPU (host) code that take the most time to execute.

Step 1: Profile the initial code

Use the Arm MAP profiler to discover which parts of your code consume the most CPU time and what they do. If scalar or floating-point operations dominate, you have a good candidate for GPU usage, but if I/O, branching or communication dominates, you need to fix those issues first.

[Figure: Profiling CUDA in Arm MAP]

Step 2: Profile the results

Once you have working CUDA or OpenACC code, check whether performance has improved and where the next target for optimization is. If you can reduce the number of data transfers between the CPU and GPU by combining sequences of CPU operations into one larger block of GPU work, performance should improve.

Note: MAP is not able to give profile information for source lines or functions for CUDA or OpenACC kernels executed on the GPU.

Advanced CUDA: Overlap data transfer

Although we can't see the individual source lines when profiling CUDA, MAP still profiles how the CPU and GPU work together.

[Figure: Profiling CUDA in Arm MAP 2]

One optimization is to overlap GPU and CPU computation. CUDA makes it easy to do this with streams, but you still need to take a look at how much time is spent in data transfer. Too little time waiting at the synchronization? Your GPU may have finished quicker than the CPU, so try giving it more work. Too much time at synchronization? Your CPU has wasted cycles you could use!
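The trade-off can be reasoned about with a deliberately simple timing model. The C++ sketch below is a toy estimate, not something MAP computes: `OverlapEstimate` and `estimate_overlap` are hypothetical names, and the model assumes a single CPU phase and a single GPU phase that start together and meet at one synchronization point, ignoring transfer time and launch overhead.

```cpp
#include <algorithm>
#include <cassert>

struct OverlapEstimate {
    double serial;      // CPU and GPU phases run one after the other
    double overlapped;  // CPU and GPU phases run concurrently
    double cpu_wait;    // time the CPU idles at the synchronization point
    double gpu_idle;    // time the GPU is finished before the CPU arrives
};

// Toy model: both phases start at the same time and the CPU synchronizes
// with the GPU once its own work is done.
OverlapEstimate estimate_overlap(double cpu_seconds, double gpu_seconds) {
    assert(cpu_seconds >= 0.0 && gpu_seconds >= 0.0);
    return {cpu_seconds + gpu_seconds,
            std::max(cpu_seconds, gpu_seconds),
            std::max(0.0, gpu_seconds - cpu_seconds),
            std::max(0.0, cpu_seconds - gpu_seconds)};
}
```

With 2 seconds of CPU work and 5 seconds of GPU work, the model gives 7 seconds serially, 5 seconds overlapped and 3 seconds of CPU waiting at the synchronization, which is the "wasted cycles" case. Reversing the balance leaves the GPU idle instead, which suggests moving more work onto it.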

More on CUDA

  • Our blog on debugging CUDA dynamic parallelism
  • CUDA energy and power profiling and optimization
  • CUDA resources at NVIDIA
  • The OpenACC.org group
  • Watch a Video on Debugging and Profiling CUDA and OpenACC