We're going to see what Arm MAP 5.0 can do by profiling three versions of a simple pi calculator program with some added I/O for good fun:
The pi calculator is based on the classic MPI example code included in every MPICH distribution. I've modified it for you by adding some gratuitous I/O to make its profiles a little more exciting.
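For reference, here is a minimal sketch of what such a modified cpi.c might look like. The real code ships with MPICH; do_init(), do_io(), the scratch buffer and the problem sizes below are illustrative assumptions, chosen only to reproduce the initialization, compute and I/O phases discussed later.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI25DT 3.141592653589793238462643

double *buffer;                        /* scratch data written out by do_io() */

void do_init(long count)               /* assumed initialization phase */
{
    buffer = malloc(count * sizeof(double));
    for (long i = 0; i < count; i++)
        buffer[i] = (double)i;
}

void do_io(long count)                 /* assumed "gratuitous I/O" phase */
{
    FILE *f = fopen("cpi.out", "wb");
    fwrite(buffer, sizeof(double), count, f);
    fclose(f);
}

int main(int argc, char *argv[])
{
    long n = 1000000000, count = 100000000;    /* assumed problem sizes */
    int rank, size;
    double h, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    do_init(count);
    if (rank == 0) printf("calculating pi...\n");

    /* midpoint-rule quadrature of 4/(1+x^2) over [0,1] */
    h = 1.0 / (double)n;
    for (long i = rank + 1; i <= n; i += size) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
        printf("writing to disk...\n");
        do_io(count);
        printf("finished\n");
    }

    free(buffer);
    MPI_Finalize();
    return 0;
}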
We run the code like this:
$ ./cpi
calculating pi...
pi is approximately 3.1415926535899708, Error is 0.0000000000001776
writing to disk...
finished
And to profile it we just add “map --profile” in front, like this:
$ map --profile ./cpi
Arm Forge 5.0 - Arm MAP
Profiling          : /home/mark/pi/cpi
Arm sampler        : preload
MPI implementation : Auto-Detect (MPICH 2)
calculating pi...
pi is approximately 3.1415926535899708, Error is 0.0000000000001776
writing to disk...
finished
MAP analysing program...
MAP gathering samples...
MAP generated /home/mark/pi/cpi_1p_2015-01-27_14-20.map
If you prefer, you can also run the profile from the Arm MAP GUI by typing “map” and leaving out “--profile”. We can view the output like this:
$ map /home/mark/pi/cpi_1p_2015-01-27_14-20.map
We can already see some interesting things in here:
The program ran for 31 seconds and spent 84% of that time computing – this is shown in the details line below the large graphs at the top. The “main thread activity” chart gives us an overview of how this program spends its time – dark green is the time the program thread spent in single-threaded compute and orange is the time it spent in I/O calls.
16% of the time was spent in an I/O routine at the end of the program. As in all of Arm MAP's charts, time is on the horizontal axis, and we can clearly see, both here and in the source code and stacks windows, that roughly the first 80% of the run is spent computing and the remainder is spent at the end in the do_io function.
The initialization phase took 8% of the time (do_init, seen in the source code view and the stacks view at the bottom).
Because a significant amount of time is spent in I/O, we might want to look at the disk transfer rates to see whether the I/O is efficient. Right-clicking on the metrics lets us toggle them on and off and choose new ones from a wide range, all of which are captured by default. The I/O metrics look like this:
Here I have used click-and-drag to select just the I/O portion of the timelines. In the details on the left we can see that data was written at 196 MB/s in this region – a respectable figure for single-node performance.
We have a range of tutorials introducing you to Arm MAP's features in more detail, but today we're going to skip ahead and see what an OpenMP version of this code would look like!
Arm MAP is telling us that 80% of the computation time is spent inside this loop:
Let's parallelize that one by adding a pragma around it – just click in the code editor and start typing:
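Assuming the loop has the same shape as the one in the classic cpi.c, the change can be as small as a single pragma; the compute_pi() wrapper below is just a convenient way to show it in isolation, not the article's actual edit:

/* Assumed shape of the compute loop from cpi.c, with the added OpenMP pragma:
 * the iterations are split across threads and the per-thread partial sums are
 * combined with a reduction. */
double compute_pi(long n)
{
    double h = 1.0 / (double)n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

Built with the compiler's OpenMP flag (for example -fopenmp with GCC), the reduction gives each thread its own private partial sum and combines them at the end of the loop, with the thread count controlled by OMP_NUM_THREADS.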
We save this as a new file (cpi-openmp.c) and edit the Makefile to add the new OpenMP version to it, then hit Ctrl-B to rebuild and generate a new profile:
Opening the newly-created file shows us that things are getting interesting now:
There's a wealth of information visible here:
Any diagonal lines in these charts show imbalance. In this loop we can see that some cores finished their OpenMP workload sooner and had to wait in the implicit barrier at the end of the parallel region until the other cores had caught up.
Note that the synchronization time is colored in gray and marked as OpenMP overhead.
Despite this overhead, the loop completes much more quickly than before! If we drag and select the region we can see that it takes only 4.8 seconds instead of 23.7 seconds for the serial version – a 5x speedup.
The dominant part of this program is now the I/O, not the computation. Trying to reduce the relatively small amount of OpenMP overhead in the computation would be a waste of our time; instead we should either parallelize the initialization phase or try to overlap writing the output with the computation itself!
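As a sketch of that second option: because the data written by do_io() in this toy example does not depend on the computed result, one way (not taken in the article) to overlap the output with the computation is to push the write onto a helper thread. The names below refer to the assumed helpers from the earlier sketches:

#include <pthread.h>

void do_io(long count);               /* assumed I/O phase from the first sketch */
double compute_pi(long n);            /* assumed OpenMP compute loop from above */

/* Run do_io() on a helper thread so the write proceeds in the background. */
static void *io_worker(void *arg)
{
    do_io(*(long *)arg);
    return NULL;
}

double compute_and_write(long n, long count)
{
    pthread_t io_thread;

    /* start the write in the background... */
    pthread_create(&io_thread, NULL, io_worker, &count);

    /* ...while the main thread and its OpenMP team keep computing */
    double pi = compute_pi(n);

    /* make sure the file is complete before returning */
    pthread_join(io_thread, NULL);
    return pi;
}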
Many scientific codes today don't use just MPI or just OpenMP; they use both. Profiling a mixed-mode version of our pi calculator is just as easy as before:
$ OMP_NUM_THREADS=2 map --profile mpirun -n 4 ./cpi-mpi
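Before looking at the profile, it helps to picture roughly what the hybrid kernel in cpi-mpi might look like. The function below is an assumption modelled on the classic cpi.c, not the article's actual source: each MPI rank strides over its share of the iterations, the rank's OpenMP threads split that share between them, and MPI_Reduce combines the per-rank partial results on rank 0.

#include <mpi.h>

double compute_pi_hybrid(long n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double h = 1.0 / (double)n;
    double sum = 0.0;

    /* OpenMP threads share this rank's strided slice of the iterations */
    #pragma omp parallel for reduction(+:sum)
    for (long i = rank + 1; i <= n; i += size) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    double mypi = h * sum;
    double pi = 0.0;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return pi;   /* meaningful only on rank 0 */
}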
The output we get looks like this:
We see that 4 MPI processes were used, each with 2 OpenMP threads, and that the program has three distinct phases:
Initialization, in which one process computes on the main thread (dark green) while the rest sit in an MPI_Barrier (blue) waiting for it.
Computation, in which all processes spend their time in efficient OpenMP loops (light green) with a small amount of overhead at the end (the slope between the computation phase and the I/O phase) caused by workload imbalance or system noise during the computation.
I/O, in which one process writes to disk (orange) while the rest wait for it in MPI_Finalize (blue).
The source code and stacks windows break the same information down to the level of individual source lines, which can be grouped together into loops or functions.
From this we could draw some interesting performance conclusions:
There is a wide range of metrics built into Arm MAP, from the percentage of time spent in vectorized instructions to the overall system load. Try it yourself and see what you can find in your own codes!
Other resources
Take a look at Arm's tools to see what is behind your OpenMP performance and to debug any OpenMP defects.
Read about Arm MAP