Arm MAP isn't just a lightweight profiler to help you optimize your code. It also lets you add your own metrics with just a couple of lines of code. To show how this works, I'm going to add PAPI's instructions-per-cycle metric to MAP.
The PAPI instructions-per-cycle metric measures the mean number of instructions executed per CPU cycle. This is often used as a proxy for how efficiently the code is driving the CPU – values below 1.0 suggest the CPU is spending a lot of time stalled, typically waiting on memory. Modern superscalar architectures can issue several instructions per cycle, but for HPC applications a sustained value of around 1.0 is generally considered acceptable.
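As a quick illustration of what the number means, here's the arithmetic behind the metric (the counter values in the usage note below are hypothetical, purely for illustration):

```c
/* Instructions per cycle: retired instructions divided by elapsed CPU
   cycles. On real hardware both counts come from performance counters. */
static double instructions_per_cycle(long long instructions, long long cycles)
{
    return (double)instructions / (double)cycles;
}
```

For example, 1.5 billion instructions retired over 2.5 billion cycles gives an IPC of 0.6 – a heavily stalled CPU by the rule of thumb above.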
MAP will let us track this metric over time and correlate it with our existing CPU, MPI and I/O metrics as well as with our source code. We're going to use the PAPI library to add this as a custom metric, but you can use the same approach to add any kind of metric you like to MAP. Let's go!
There's a set of high-quality reference docs in your forge/map/metrics/doc/ directory, in both PDF and interactive HTML formats.
You don't need them to follow along with this guide but they'll be an invaluable help when you want to implement your own metrics!
Life's easier when you start with a working template. In the forge/map/metrics/examples/ directory there's a “sample” directory containing a sample metric that measures something boring like interrupts per second. We're going to use it as a framework for our new PAPI IPC metric.
$ cd forge/map/metrics/examples/
$ cp -r sample papi
$ cd papi
$ ls
Makefile  sample.c  sample.xml
That's all you need for a new metric. Let's rename ours:
$ mv sample.c my_papi.c
$ mv sample.xml my_papi.xml
We now need to update the Makefile with the new filenames:
$ vim Makefile
This Makefile builds a shared library that MAP will automatically load into your program while it is running. Functions from this library will be called to take measurements.
Here we need to rename sample to my_papi, and we also want to add a -lpapi flag so that the library links against PAPI and we can use the PAPI functions in our code:
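After those edits the Makefile ends up looking roughly like this. This is a sketch rather than the verbatim shipped Makefile – the compiler flags and targets are assumptions modelled on the sample, and your PAPI install may need extra -I/-L paths:

```makefile
CC     = gcc
CFLAGS = -Wall -fPIC -shared

# Build the shared library that MAP preloads; note the added -lpapi.
libmy_papi.so: my_papi.c
	$(CC) $(CFLAGS) -o $@ $< -lpapi

install: libmy_papi.so
	mkdir -p ~/.allinea/map/metrics
	cp libmy_papi.so my_papi.xml ~/.allinea/map/metrics/
```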
That was easy! Now let's write the code!
We copied and renamed the existing sample.c file but it's still full of references to its original goal, which was reading from /proc/interrupts:
We can strip this back down to the minimum pretty quickly:
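Stripped right back, the file is little more than empty hooks. Here's a sketch of what remains – note that the hook names and signatures are assumptions modelled on the sample code, so check them against the reference docs in forge/map/metrics/doc/ before relying on them:

```c
#include <stdint.h>
#include <time.h>

/* Called once when MAP loads the plugin (hook name assumed). */
int allinea_plugin_initialize(void *plugin_id, void *unused)
{
    (void)plugin_id; (void)unused;
    return 0; /* 0 indicates success */
}

/* Called once when the profiled program ends (hook name assumed). */
int allinea_plugin_cleanup(void *plugin_id, void *unused)
{
    (void)plugin_id; (void)unused;
    return 0;
}

/* The per-sample hook MAP calls from its sampler; for now it just
   reports zero. */
int sample_interrupts(struct timespec *in_out_sample_time, uint64_t *out_value)
{
    (void)in_out_sample_time;
    *out_value = 0;
    return 0;
}
```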
Now we are ready. We have three hooks available: an initialization function called when the library is loaded, a cleanup function called at the end of the run, and the sampling function itself, which is called once per sample.
To get the PAPI instructions-per-cycle you just need to use one function from the PAPI API:
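That function is PAPI_ipc from PAPI's high-level API. Its prototype, from papi.h, is:

```c
#include <papi.h>

/* Returns, via its out-parameters, the real and process time, the
   instruction count and the instructions-per-cycle rate (see the PAPI
   docs for exactly which are cumulative and which are incremental).
   Returns PAPI_OK on success. */
int PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc);
```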
This is almost as easy as it gets. We don't need to do any special initialization or cleanup, we can just call PAPI_ipc each sampling interval. The manpage doesn't say whether PAPI_ipc is signal safe. This is an issue because we'll be calling it from a signal handler (that's how statistical profilers like MAP work). We'll just try it and see!
Using this in our custom metric functions is straightforward. First we rename sample_interrupts to something more descriptive, like sample_ipc and implement it like this:
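A sketch of the resulting implementation – again, the hook signature is an assumption modelled on the sample, while PAPI_ipc itself is the documented PAPI high-level call:

```c
#include <papi.h>
#include <time.h>

/* Per-sample hook (signature assumed from the sample metric).
   Returns 0 on success, -1 on error. */
int sample_ipc(struct timespec *in_out_sample_time, double *out_value)
{
    float rtime, ptime, ipc;
    long long ins;

    (void)in_out_sample_time;

    /* The first call starts the counters; subsequent calls read them. */
    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK)
        return -1;

    *out_value = (double)ipc;
    return 0;
}
```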
There are 5 things I want to call out here:
This file tells MAP all the extra information about our metric – what to call it in the GUI, which group of metrics to put it in, which types to expect and so on. The version we copied from sample.xml looks like this:
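Approximately – with the exact element names and ids best checked against the real file – it's along these lines:

```xml
<metricdefinitions version="1">
    <metric id="com.allinea.metrics.sample.interrupts">
        <units>/s</units>
        <dataType>uint64_t</dataType>
        <source ref="com.allinea.metrics.sample_src"
                functionName="sample_interrupts"/>
        <display>
            <displayName>Interrupts</displayName>
            <description>Interrupts per second across the system</description>
        </display>
    </metric>
    <metricGroup id="sample">
        <displayName>Sample</displayName>
        <metric ref="com.allinea.metrics.sample.interrupts"/>
    </metricGroup>
    <source id="com.allinea.metrics.sample_src">
        <sharedLibrary>libsample.so</sharedLibrary>
    </source>
</metricdefinitions>
```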
Let's start making some changes. There are two main XML elements here – metric and metricGroup. For metric we need to update the id, the units, the data type and the display name to describe instructions per cycle rather than interrupts; for metricGroup we just need a new id and a display name for our PAPI group.
The whole file with some comments removed looks like this:
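The my_papi.xml I ended up with is along these lines (same caveat: a sketch, with ids and element names to be checked against the shipped docs):

```xml
<metricdefinitions version="1">
    <metric id="com.allinea.metrics.papi.ipc">
        <units>insn/cycle</units>
        <dataType>double</dataType>
        <source ref="com.allinea.metrics.papi_src"
                functionName="sample_ipc"/>
        <display>
            <displayName>Instructions per cycle</displayName>
            <description>Mean instructions executed per CPU cycle (PAPI)</description>
        </display>
    </metric>
    <metricGroup id="papi">
        <displayName>PAPI</displayName>
        <metric ref="com.allinea.metrics.papi.ipc"/>
    </metricGroup>
    <source id="com.allinea.metrics.papi_src">
        <sharedLibrary>libmy_papi.so</sharedLibrary>
    </source>
</metricdefinitions>
```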
That's all there is to it!
Installing the metric is easy – just run “make” to build the library and then “make install” to put it in your ~/.allinea/map/metrics/ directory. MAP will look here for custom metrics:
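From the papi directory that's just:

```shell
$ make
$ make install
```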
Now we just run MAP on an example program to see the new metric appear!
$ map mpiexec -n 8 ~/allinea/forge/examples/wave_c
There's no need to select the metric within the GUI. All metrics in ~/.allinea/map/metrics/ are enabled by default. If you don't want to use one any more, remove it from that directory before running.
After MAP has finished we can find our new metric group in the Metrics menu:
And when we select it, there it is in all its glory:
Ta-da! And that's automatically preloaded into the application (no linking required), aggregated scalably from every rank, downsampled appropriately and delivered for your pleasure right alongside all of MAP's built-in metrics.
But what's this? There's a clear pattern to the IPC. It's not smooth, but spiky. And it seems to spike up whenever the program is in MPI calls?
Let's add in some of the built-in metrics to see what's going on here:
Well would you look at that? The instructions-per-cycle as measured by PAPI peaks whenever the program is waiting in an MPI call!
This is because many MPI implementations, including MPICH2 on my laptop, will busy-wait during communications to reduce latency. During that busy loop the CPU happily executes multiple instructions per cycle! But during my actual computation phases the IPC rate is much lower:
Just 0.62 instructions per cycle if we ignore the MPI parts of the run. Clearly there's a lot of optimization work to be done here!
If we had just been focusing on CPU performance with some ad-hoc or CPU-centric PAPI measurements we would have seen that this program has a higher IPC at larger core counts. We might have surmised that we're getting better cache utilization or something! We'd have been wrong, wrong, wrong. In this case the higher PAPI IPC measurement would mostly be telling us we're spending more time in MPI busy-wait loops.
That's why it's important to use a profiler that combines CPU, MPI and I/O metrics into one overall picture of your code's performance. There's more to an elephant than its trunk!
I hope this guide has shown how straightforward adding your own metrics to Arm MAP can be and can act as a step-by-step tutorial when you want to write one of your own.
Reminder: you will need the Metrics Pack upgrade for MAP before any custom metrics will be loaded and visualized. If you have purchased the Energy Pack upgrade then it also includes the Metrics Pack and you are good to go! If not, please visit:
Arm HPC Tools: Arm MAP