Profiling OpenMP with Arm MAP 5.0

Mark O'Connor
February 3, 2015
6 minute read time.

A whirlwind tour of Arm MAP's new OpenMP profiling capabilities

We're going to see what Arm MAP 5.0 can do by profiling three versions of a simple pi calculator program, with some I/O added for good fun:

  • A serial version
  • An OpenMP version
  • A mixed OpenMP and MPI version

Let's get started!

1. Profiling a serial code

The pi calculator is based on the classic MPI example code included in every MPICH distribution. I've modified it for you by adding some gratuitous I/O to make its profiles a little more exciting.
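For reference, the core of that code looks roughly like this. This is a sketch based on the standard MPICH cpi example rather than the author's exact source, so the interval count and variable names are assumptions, and the added do_init and do_io phases are omitted:

#include <stdio.h>

int main(void)
{
    const long n = 100000000;            /* number of integration intervals (assumed) */
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    /* Midpoint-rule integration of 4/(1+x^2) over [0,1], which equals pi */
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi is approximately %.16f\n", h * sum);
    return 0;
}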

We run the code like this:

$ ./cpi
calculating pi... 
pi is approximately 3.1415926535899708, Error is 0.0000000000001776 
writing to disk... 
finished

And to profile it we just add “map --profile” in front, like this:

$ map --profile ./cpi
Arm Forge 5.0 - Arm MAP
Profiling          : /home/mark/pi/cpi 
Arm sampler    : preload 
MPI implementation : Auto-Detect (MPICH 2) 

calculating pi... 
pi is approximately 3.1415926535899708, Error is 0.0000000000001776 
writing to disk... 
finished 

MAP analysing program... 
MAP gathering samples... 
MAP generated /home/mark/pi/cpi_1p_2015-01-27_14-20.map

You can also run the profile via the Arm MAP GUI by launching “map” without the “--profile” option, if you prefer. We can view the output like this:

$ map /home/mark/pi/cpi_1p_2015-01-27_14-20.map

[Figure 1: Arm MAP profile of the serial cpi run]

We can already see some interesting things in here:

The program ran for 31 seconds and spent 84% of that time computing – this is shown in the details line below the large graphs at the top. The “main thread activity” chart gives us an overview of how this program spends its time – dark green is the time the program thread spent in single-threaded compute and orange is the time it spent in I/O calls.

16% of the time was spent in an I/O routine at the end of the program. As in all of Arm MAP's charts, time runs along the horizontal axis, and both here and in the source code and stacks windows we can clearly see that around 80% of the time is spent computing first and the remaining 20% is spent at the end in the do_io function.

The initialization phase took 8% of the time (do_init, seen in the source code view and the stacks view at the bottom).

Because a significant amount of time is spent in I/O, we might want to look at the disk transfer rates to see whether the writes are performing well. Right-clicking on the metrics lets us toggle them on and off and choose new ones from a wide list, all of which are captured by default. The I/O metrics look like this:

[Figure 2: I/O metrics for the serial run, with the I/O region selected]

Here I have used click-and-drag to select just the I/O portion of the timelines. In the details on the left we can see that data was written at 196 MB/s in this region – a respectable figure for single-node performance.

We have a range of tutorials introducing you to Arm MAP's features in more detail, but today we're going to skip ahead and see what an OpenMP version of this code would look like!

Parallelizing with OpenMP

Arm MAP is telling us that 80% of the computation time is spent inside this loop:

[Figure 3: the main compute loop highlighted in Arm MAP's code view]

Let's parallelize that one by adding a pragma around it – just click in the code editor and start typing:

[Figure 4: adding the OpenMP pragma in the built-in code editor]
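If the screenshot is hard to read, the change amounts to something like the following. This is a sketch assuming the loop shape shown earlier; x is private by virtue of being declared inside the loop, and the reduction clause combines each thread's partial sum safely:

#pragma omp parallel for reduction(+:sum)
for (long i = 1; i <= n; i++) {
    double x = h * ((double)i - 0.5);   /* private to each iteration */
    sum += 4.0 / (1.0 + x * x);         /* combined across threads by the reduction */
}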

We save this as a new file (cpi-openmp.c) and edit the Makefile to add the new OpenMP version, then hit Ctrl-B to rebuild and generate a new profile:

[Figure 5: rebuilding and re-profiling from the GUI]
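If you prefer to build outside the GUI, the equivalent command line might look like this. The exact flags are an assumption on my part: -fopenmp enables OpenMP in GCC, and -g keeps the source-level information that MAP's source code view relies on:

$ gcc -O2 -g -fopenmp -o cpi-openmp cpi-openmp.c
$ map --profile ./cpi-openmp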

2. Profiling an OpenMP code

Opening the newly-created file shows us that things are getting interesting now:

[Figure 6: Arm MAP profile of the OpenMP run]

There's a wealth of information visible here:

  • The very top line informs us that this ran on 1 process using all 8 cores.
  • The first part of the run (inside do_init) is still colored dark green for single-threaded computation. That's expected; we didn't add OpenMP pragmas to that function.
  • The main loop is now shaded light green for OpenMP computation. This shows that the fraction of total cores being used for OpenMP starts at 100% at the beginning of the loop but tails off towards the end of it.

Any diagonal lines in these charts show imbalance. In this loop we can see that some cores finished their OpenMP workload sooner and had to wait in the implicit barrier at the end of the parallel region until the other cores had caught up.

Note that the synchronization time is colored in gray and marked as OpenMP overhead.

Despite this overhead, the loop completes much more quickly than before! If we drag and select the region we can see that it takes only 4.8 seconds instead of 23.7 seconds for the serial version – a 5x speedup.

The dominant part of this program is now not the computation but the I/O. Trying to reduce the relatively small amount of OpenMP overhead in the computation would be a waste of our time; instead we should either parallelize the initialization phase or try to overlap writing the output with the computation itself!

3. Profiling an MPI+OpenMP code

Many scientific codes today don't use just MPI or just OpenMP; they use both. Profiling a mixed-mode version of our pi calculator is just as easy as before:

$ OMP_NUM_THREADS=2 map --profile mpirun -n 4 ./cpi-mpi
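For context, the hybrid kernel in this kind of code typically stripes the intervals across MPI ranks, lets OpenMP split each rank's share across threads, and reduces the partial sums at the end. The following is a sketch based on the standard MPICH cpi example, not the author's exact code:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const long n = 100000000;          /* number of integration intervals (assumed) */
    const double h = 1.0 / (double)n;
    double sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank takes every size-th interval; OpenMP splits that share further. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = rank + 1; i <= n; i += size) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* Combine the per-rank partial sums on rank 0. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}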

The output we get looks like this:

[Figure 7: Arm MAP profile of the MPI+OpenMP run]

We see that 4 MPI processes were used, each with 2 OpenMP threads, and that the program has three distinct phases:

  • Initialization, in which one process computes on the main thread (dark green) while the rest sit in an MPI_Barrier (blue) waiting for it.
  • Computation, in which all processes spend their time in efficient OpenMP loops (light green), with a small amount of overhead at the end (the slope between the computation phase and the I/O phase) caused by workload imbalance or system noise during the computation.
  • I/O, in which one process writes to disk (orange) while the rest wait for it in MPI_Finalize (blue).

The source code and stacks windows break the same information down to the level of individual source lines, which can be grouped together into loops or functions.

From this we could draw some interesting performance conclusions:

  • The computation phase is well-parallelized, with all cores spending their time inside the OpenMP loop (light green) and very little time spent in OpenMP barriers (gray), MPI communications (blue) or single-threaded code (dark green).
  • The program as a whole is not well-parallelized – it spends just 29% of its time in the well-parallelized OpenMP compute region. Over 46% of the time is spent in the I/O phase and 24% of the time in the initialization phase.
  • During the I/O phase only one process writes to disk while the rest wait in MPI_Finalize.
    • Depending on the application it might be faster to use MPI I/O to write data in parallel (see the sketch after this list).
    • Better yet would be to write as much data out as possible asynchronously during the computation phase.
  • The initialization phase might be easier to speed up:
    • Only one process spends time in the do_init function; the rest wait in an MPI_Barrier.
    • This function is single-threaded (dark green) – we can speed it up several times over if we are able to find a good target for an OpenMP loop!
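To make the MPI I/O suggestion above concrete, here is a minimal sketch of a collective parallel write. The helper name, filename and buffer layout are hypothetical; each rank writes its own contiguous block of doubles at a non-overlapping offset:

#include <mpi.h>

/* Hypothetical helper: every rank writes `count` doubles from `local`
   into one shared file, each at its own non-overlapping offset. */
void write_parallel(const double *local, int count, MPI_Comm comm)
{
    int rank;
    MPI_File fh;

    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: all ranks participate instead of funneling
       everything through one process as in the profile above. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * count * sizeof(double),
                          local, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}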

There is a wide range of metrics built into Arm MAP, from the percentage of time spent in vectorized instructions to the overall system load. Try it yourself and see what you can find in your codes!

Other resources

Take a look at Arm's tools to see what is behind your OpenMP performance and to debug any OpenMP defects.

Read about Arm MAP
