Writing a MAP Custom Metric: PAPI IPC

Mark O'Connor
April 21, 2016
7 minute read time.

New metric

Arm MAP isn't just a lightweight profiler to help you optimize your code. It also lets you add your own metrics with just a couple of lines of code. To show how this works, I'm going to add PAPI's instructions-per-cycle metric to MAP.

The PAPI instructions-per-cycle metric measures the mean number of instructions executed per CPU cycle. This is often used as a proxy for computational intensity – values below 1.0 suggest the CPU is spending a lot of time stalled. Modern superscalar architectures can issue many instructions per cycle, but for HPC applications, 1.0 is generally considered acceptable.

MAP will let us track this metric over time and correlate it to our existing CPU, MPI and I/O metrics as well as against our source code. We're going to use the PAPI library to add this as a custom metric, but you can use the same approach to add any kind of metric you like to MAP. Let's go!

Just in case: reference documentation

There's a set of high-quality reference docs in your forge/map/metrics/doc/ directory, both in PDF and interactive HTML format:

[Screenshot: the metrics reference documentation in PDF and HTML]

You don't need them to follow along with this guide but they'll be an invaluable help when you want to implement your own metrics!

Getting started

Life's easier when you start with a working template. In the forge/map/metrics/examples/ directory there's a “sample” directory containing a sample metric that measures something boring like interrupts per second. We're going to use it as a framework for our new PAPI IPC metric.

$ cd forge/map/metrics/examples/
$ cp -r sample papi
$ cd papi
$ ls
Makefile  sample.c  sample.xml

That's all you need for a new metric. Let's rename ours:

$ mv sample.c my_papi.c
$ mv sample.xml my_papi.xml

Makefile

We now need to update the Makefile with the new filenames:

$ vim Makefile

[Screenshot: the original sample Makefile]

This Makefile builds a shared library that MAP will automatically load into your program while it is running. Functions from this library will be called to take measurements.

Here we need to rename sample to my_papi, and we also want to add a -lpapi flag so that we can call the PAPI functions from our code:

[Screenshot: the updated Makefile]
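
In sketch form, the edited Makefile looks something like this. The exact rules in the shipped sample Makefile may differ, and the library name libmy_papi.so is my assumption; the essential changes are the renamed source file and the extra -lpapi:

# Sketch only: the shipped sample Makefile may use different variables and rules.
CC      = gcc
CFLAGS  = -fPIC -Wall
TARGET  = libmy_papi.so          # assumed library name

all: $(TARGET)

$(TARGET): my_papi.c
	$(CC) $(CFLAGS) -shared -o $@ $< -lpapi

install: all
	mkdir -p $(HOME)/.allinea/map/metrics
	cp $(TARGET) my_papi.xml $(HOME)/.allinea/map/metrics/

clean:
	rm -f $(TARGET)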

That was easy! Now let's write the code!

Writing the my_papi.c file

We copied and renamed the existing sample.c file but it's still full of references to its original goal, which was reading from /proc/interrupts:

[Screenshot: the original sample.c, still reading /proc/interrupts]

We can strip this back down to the minimum pretty quickly:

[Screenshot: the stripped-down my_papi.c]

Now we are ready. We have three hooks available (a minimal skeleton sketch follows this list):

  • allinea_plugin_initialize – this is called once at the start of the program to initialize any counters or variables we might need.
  • allinea_plugin_cleanup – this is called at the end of the program in case we need to do any cleaning up of files or memory (both rare).
  • sample_interrupts – this will be called by MAP once per sampling interval (initially 50Hz but MAP will reduce this dynamically as the application runtime increases and will automatically resample your data for you).
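
Putting those hooks together, a minimal skeleton for my_papi.c looks roughly like this. The header name and the parameter types are my approximations of the plugin API, so check the reference docs in forge/map/metrics/doc/ for the exact signatures:

/* my_papi.c -- skeleton only; exact types come from the plugin API header */
#include "allinea_metric_plugin_api.h"   /* assumed header name */

/* Called once at program start. */
int allinea_plugin_initialize(plugin_id_t plugin_id, void *data)
{
    return 0;   /* nothing to set up for PAPI_ipc */
}

/* Called once at program end. */
int allinea_plugin_cleanup(plugin_id_t plugin_id, void *data)
{
    return 0;   /* nothing to tear down */
}

/* Called once per sampling interval; we'll rename and fill this in next. */
int sample_interrupts(metric_id_t metric_id, struct timespec *in_out_sample_time,
                      uint64_t *out_value)
{
    *out_value = 0;
    return 0;
}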

To get the PAPI instructions-per-cycle you just need to use one function from the PAPI API:

[Screenshot: the PAPI_ipc manpage]

This is almost as easy as it gets. We don't need to do any special initialization or cleanup; we can just call PAPI_ipc each sampling interval. The manpage doesn't say whether PAPI_ipc is signal-safe. This matters because we'll be calling it from a signal handler (that's how statistical profilers like MAP work). We'll just try it and see!
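
For reference, the classic PAPI high-level call has this prototype (check papi.h for your PAPI version, as later releases rework the high-level rate calls):

#include <papi.h>

/* Returns PAPI_OK on success. ipc receives the instructions-per-cycle
 * rate since the previous call; rtime, ptime and ins return timing and
 * instruction totals that we don't need here. */
int PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc);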

Using this in our custom metric functions is straightforward. First we rename sample_interrupts to something more descriptive, such as sample_ipc, and implement it like this:

[Screenshot: the finished sample_ipc implementation]
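
In sketch form, sample_ipc ends up looking roughly like this; the parameter types and the exact prototypes of allinea_get_current_time() and allinea_set_metric_error_messagef() are approximations, so check the reference docs for the real ones:

/* Continues my_papi.c; requires #include <papi.h>. */
/* Called by MAP once per sampling interval (from a signal handler). */
int sample_ipc(metric_id_t metric_id, struct timespec *in_out_sample_time,
               double *out_value)
{
    float rtime, ptime, ipc;
    long long ins;
    int ret;

    /* Timestamp this sample first. */
    *in_out_sample_time = allinea_get_current_time();

    /* IPC since the previous call; rtime, ptime and ins are unused here. */
    ret = PAPI_ipc(&rtime, &ptime, &ins, &ipc);
    if (ret != PAPI_OK) {
        /* Report the error back to the user and mark this sample as failed. */
        allinea_set_metric_error_messagef(metric_id, "PAPI_ipc failed: %s",
                                          PAPI_strerror(ret));
        return -1;
    }

    *out_value = ipc;
    return 0;
}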

There are five things I want to call out here:

  • I changed the type of the out_value parameter from uint64_t* to double*. MAP will let you use either of these. Instructions per cycle will be a float, so a double* is the type I want.
  • We set the in_out_sample_time to allinea_get_current_time(). It's usually fine to do this as the first thing this function does.
  • We call PAPI_ipc, which stores a bunch of information we don't need and one piece that we do, namely the IPC since the last call.
  • We check the return value of the function and use the allinea_set_metric_error_messagef function to report any errors back up to the user. We then return -1. Error handling across a distributed cluster is as simple as that!
  • Finally we set *out_value to the value we have recorded (ipc) and return 0 for success.
    It's important that these functions be fast, because the slower they are, the more overhead we add to the program.

Telling MAP about our metric in my_papi.xml

This file tells MAP all the extra information about our metric – what to call it in the GUI, which group of metrics to put it in, which types to expect and so on. The version we copied from sample.xml looks like this:

[Screenshot: the original sample.xml contents]

Let's start making some changes. There are two main XML elements here – metric and metricGroup. For metric:

  • The id can change from com.allinea.metrics.sample.interrupts to something else. I've chosen com.allinea.metrics.papi.ipc.
  • We used the double type, not uint64_t, so we need to change that.
  • The com.allinea.metrics.sample_src reference should be changed too. I used com.allinea.metrics.papi_src.
  • The functionName is not sample_interrupts – I renamed it to sample_ipc.
  • The divideBySampleTime property should be false – this is used if you are measuring e.g. bytes and want MAP to automatically display that as a rate over time (bytes/s). IPC/s would make no sense!
  • The <display> element controls the names and grouping shown in the GUI. The colour here is specified in HTML notation.
  • The metricGroup element describes the group this metric will belong to in the GUI. I decided to add it to a PAPI group of metrics – perhaps we'll add more later! This is straightforward, but make sure the metric ref and source id match the ones used in your metric section.

The whole file with some comments removed looks like this:

[Screenshot: the finished my_papi.xml]
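
In sketch form, my_papi.xml ends up looking something like this. Treat the element and attribute names as approximate and follow sample.xml and the reference docs for the exact schema; the colour value is just an example:

<metricdefinitions version="1">
    <metric id="com.allinea.metrics.papi.ipc">
        <units>IPC</units>
        <dataType>double</dataType>
        <domain>time</domain>
        <source ref="com.allinea.metrics.papi_src"
                functionName="sample_ipc"
                divideBySampleTime="false"/>
        <display>
            <displayName>Instructions per cycle</displayName>
            <description>Mean instructions executed per CPU cycle (PAPI)</description>
            <colour>#ff9900</colour>   <!-- HTML colour notation -->
        </display>
    </metric>

    <metricGroup id="papi">
        <displayName>PAPI</displayName>
        <metric ref="com.allinea.metrics.papi.ipc"/>
    </metricGroup>

    <source id="com.allinea.metrics.papi_src">
        <sharedLibrary>libmy_papi.so</sharedLibrary>
    </source>
</metricdefinitions>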

That's all there is to it!

Testing our new metric

Installing the metric is easy – just run “make” to build the library and then “make install” to put it in your ~/.allinea/map/metrics/ directory. MAP will look here for custom metrics:

[Screenshot: building and installing the metric]
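
The commands are simply (the installed library name depends on your Makefile):

$ make
$ make install
$ ls ~/.allinea/map/metrics/
libmy_papi.so  my_papi.xml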

Now we just run MAP on an example program to see the new metric appear!

$ map mpiexec -n 8 ~/allinea/forge/examples/wave_c

There's no need to select the metric within the GUI. All metrics in ~/.allinea/map/metrics/ are enabled by default. If you don't want to use one any more, remove it from that directory before running.

After MAP has finished we can find our new metric group in the Metrics menu:

[Screenshot: the Metrics menu showing the new PAPI group]

And when we select it, there it is in all its glory:

[Screenshot: the PAPI instructions-per-cycle metric displayed in MAP]

Ta-da! And that's automatically preloaded into the application (no linking required), aggregated scalably from every rank, downsampled appropriately and delivered for your pleasure right alongside all of MAP's built-in metrics.

But what's this? There's a clear pattern to the IPC. It's not smooth, but spiky. And it seems to spike up whenever the program is in MPI calls?

Let's add in some of the built-in metrics to see what's going on here:

[Screenshot: PAPI IPC alongside the built-in CPU and MPI metrics]

Well would you look at that? The instructions-per-cycle as measured by PAPI peaks whenever the program is waiting in an MPI call!

This is because many MPI implementations, including MPICH2 on my laptop, will busy-wait during communications to reduce latency. During that busy loop the CPU happily executes multiple instructions per cycle! But during my actual computation phases the IPC rate is much lower:

[Screenshot: IPC over the compute-only part of the run]

Just 0.62 instructions per cycle if we ignore the MPI parts of the run. Clearly there's a lot of optimization work to be done here!

If we had just been focusing on CPU performance with some ad-hoc or CPU-centric PAPI measurements, we would have seen that this program has a higher IPC at larger core counts. We might have surmised that we're getting better cache utilization or something! We'd have been wrong, wrong, wrong. In this case the higher PAPI IPC measurement would mostly be telling us that we're spending more time in MPI busy-wait loops.

That's why it's important to use a profiler that combines CPU, MPI and I/O metrics into one overall picture of your code's performance. There's more to an elephant than its trunk!

I hope this guide has shown how straightforward adding your own metrics to Arm MAP can be, and that it can serve as a step-by-step tutorial when you want to write one of your own.

Happy profiling!

Reminder: you will need the Metrics Pack upgrade for MAP before any custom metrics will be loaded and visualized. If you have purchased the Energy Pack upgrade, then it also includes the Metrics Pack and you are good to go! If not, please visit:

Arm HPC Tools: Arm MAP
