We're going to see what Arm MAP 5.0 can do by profiling three versions of a simple pi calculator program with some added I/O for good fun:
The pi calculator is based on the classic MPI example code included in every MPICH distribution. I've modified it for you by adding some gratuitous I/O to make its profiles a little more exciting.
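For reference, here is a minimal sketch of what such a modified cpi.c might look like. The real code ships with MPICH; do_init(), do_io(), the scratch buffer and the problem sizes below are illustrative assumptions, chosen only to reproduce the initialization, compute and I/O phases discussed later.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI25DT 3.141592653589793238462643

double *buffer;                        /* scratch data written out by do_io() */

void do_init(long count)               /* assumed initialization phase */
{
    buffer = malloc(count * sizeof(double));
    for (long i = 0; i < count; i++)
        buffer[i] = (double)i;
}

void do_io(long count)                 /* assumed "gratuitous I/O" phase */
{
    FILE *f = fopen("cpi.out", "wb");
    fwrite(buffer, sizeof(double), count, f);
    fclose(f);
}

int main(int argc, char *argv[])
{
    long n = 1000000000, count = 100000000;    /* assumed problem sizes */
    int rank, size;
    double h, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    do_init(count);
    if (rank == 0) printf("calculating pi...\n");

    /* midpoint-rule quadrature of 4/(1+x^2) over [0,1] */
    h = 1.0 / (double)n;
    for (long i = rank + 1; i <= n; i += size) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
        printf("writing to disk...\n");
        do_io(count);
        printf("finished\n");
    }

    free(buffer);
    MPI_Finalize();
    return 0;
}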
We run the code like this:
$ ./cpi
calculating pi...
pi is approximately 3.1415926535899708, Error is 0.0000000000001776
writing to disk...
finished
And to profile it we just add “map --profile” in front, like this:
$ map --profile ./cpi
Arm Forge 5.0 - Arm MAP
Profiling          : /home/mark/pi/cpi
Arm sampler        : preload
MPI implementation : Auto-Detect (MPICH 2)
calculating pi...
pi is approximately 3.1415926535899708, Error is 0.0000000000001776
writing to disk...
finished
MAP analysing program...
MAP gathering samples...
MAP generated /home/mark/pi/cpi_1p_2015-01-27_14-20.map
If you prefer, you can also run the profile from the Arm MAP GUI by typing “map” and leaving out “--profile”. We can view the output like this:
$ map /home/mark/pi/cpi_1p_2015-01-27_14-20.map
We can already see some interesting things in here:
The program ran for 31 seconds and spent 84% of that time computing – this is shown in the details line below the large graphs at the top. The “main thread activity” chart gives us an overview of how this program spends its time – dark green is the time the program thread spent in single-threaded compute and orange is the time it spent in I/O calls.
16% of the time was spent in an I/O routine at the end of the program. As in all of Arm MAP's charts, time is on the horizontal axis, and we can clearly see, both here and in the source code and stacks windows, that roughly the first 80% of the run is spent computing and the remainder is spent at the end in the do_io function.
The initialization phase took 8% of the time (do_init, seen in the source code view and the stacks view at the bottom).
Because a significant amount of time is spent in I/O, we might want to look at the disk transfer rates to see whether the I/O is efficient. Right-clicking on the metrics lets us toggle them on and off and choose new ones from a wide range, all of which are captured by default. The I/O metrics look like this:
Here I have used click-and-drag to select just the I/O portion of the timelines. In the details on the left we can see that data was written at 196 MB/s in this region – a respectable figure for single-node performance.
We have a range of tutorials introducing you to Arm MAP's features in more detail, but today we're going to skip ahead and see what an OpenMP version of this code would look like!
Arm MAP is telling us that 80% of the computation time is spent inside this loop:
Let's parallelize that one by adding a pragma around it – just click in the code editor and start typing:
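Assuming the loop has the same shape as the one in the classic cpi.c, the change can be as small as a single pragma; the compute_pi() wrapper below is just a convenient way to show it in isolation, not the article's actual edit:

/* Assumed shape of the compute loop from cpi.c, with the added OpenMP pragma:
 * the iterations are split across threads and the per-thread partial sums are
 * combined with a reduction. */
double compute_pi(long n)
{
    double h = 1.0 / (double)n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

Built with the compiler's OpenMP flag (for example -fopenmp with GCC), the reduction gives each thread its own private partial sum and combines them at the end of the loop, with the thread count controlled by OMP_NUM_THREADS.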
We save this as a new file (cpi-openmp.c) and edit the Makefile to add the new OpenMP version to it, then hit Ctrl-B to rebuild and generate a new profile:
Opening the newly-created file shows us that things are getting interesting now:
There's a wealth of information visible here:
Any diagonal lines in these charts show imbalance. In this loop we can see that some cores finished their OpenMP workload sooner and had to wait in the implicit barrier at the end of the parallel region until the other cores had caught up.
Note that the synchronization time is colored in gray and marked as OpenMP overhead.
Despite this overhead, the loop completes much more quickly than before! If we drag and select the region we can see that it takes only 4.8 seconds instead of 23.7 seconds for the serial version – a 5x speedup.
The dominant part of this program is now the I/O, not the computation. Trying to reduce the relatively small amount of OpenMP overhead in the computation would be a waste of our time; instead we should either parallelize the initialization phase or try to overlap writing the output with the computation itself!
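As a sketch of that second option: because the data written by do_io() in this toy example does not depend on the computed result, one way (not taken in the article) to overlap the output with the computation is to push the write onto a helper thread. The names below refer to the assumed helpers from the earlier sketches:

#include <pthread.h>

void do_io(long count);               /* assumed I/O phase from the first sketch */
double compute_pi(long n);            /* assumed OpenMP compute loop from above */

/* Run do_io() on a helper thread so the write proceeds in the background. */
static void *io_worker(void *arg)
{
    do_io(*(long *)arg);
    return NULL;
}

double compute_and_write(long n, long count)
{
    pthread_t io_thread;

    /* start the write in the background... */
    pthread_create(&io_thread, NULL, io_worker, &count);

    /* ...while the main thread and its OpenMP team keep computing */
    double pi = compute_pi(n);

    /* make sure the file is complete before returning */
    pthread_join(io_thread, NULL);
    return pi;
}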
Many scientific codes today don't use just MPI or just OpenMP; they use both. Profiling a mixed-mode version of our pi calculator is just as easy as before:
$ OMP_NUM_THREADS=2 map --profile mpirun -n 4 ./cpi-mpi
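Before looking at the profile, it helps to picture roughly what the hybrid kernel in cpi-mpi might look like. The function below is an assumption modelled on the classic cpi.c, not the article's actual source: each MPI rank strides over its share of the iterations, the rank's OpenMP threads split that share between them, and MPI_Reduce combines the per-rank partial results on rank 0.

#include <mpi.h>

double compute_pi_hybrid(long n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double h = 1.0 / (double)n;
    double sum = 0.0;

    /* OpenMP threads share this rank's strided slice of the iterations */
    #pragma omp parallel for reduction(+:sum)
    for (long i = rank + 1; i <= n; i += size) {
        double x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    double mypi = h * sum;
    double pi = 0.0;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    return pi;   /* meaningful only on rank 0 */
}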
The output we get looks like this:
We see that 4 MPI processes were used, each with 2 OpenMP threads, and that the program has three distinct phases:
Initialization, in which one process computes on the main thread (dark green) while the rest sit in an MPI_Barrier (blue) waiting for it.
Computation, in which all processes spend their time in efficient OpenMP loops (light green) with a small amount of overhead at the end (the slope between the computation phase and the I/O phase) caused by workload imbalance or system noise during the computation.
I/O, in which one process writes to disk (orange) while the rest wait for it in MPI_Finalize (blue).
The source code and stacks windows break the same information down to the level of individual source lines, which can be grouped together into loops or functions.
From this we could draw some interesting performance conclusions:
There is a wide range of metrics built into Arm MAP, from the percentage of time spent in vectorized instructions to the overall system load. Try it yourself and see what you can find in your own codes!
Other resources
Take a look at Arm's tools to see what is behind your OpenMP performance and to debug any OpenMP defects.
Read about Arm MAP