I/O contention is a frustrating problem to solve. An application run may be taking longer than expected, but how do you know if it’s due to I/O contention?
Arm Forge Pro includes I/O metrics for Lustre, but not GPFS. Fortunately, Forge Pro also includes the ability to write your own custom metrics, so we can add GPFS support ourselves.
In the rest of this post I show you how I wrote a custom metric to add GPFS filesystem counters to Forge Pro from scratch, but if you just want to use them, download these additional GPFS I/O metrics for Arm Forge Pro.
The GPFS filesystem exposes a number of counters that we can collect with Arm MAP and show in the GUI like any of the built-in metrics. One indicator of I/O contention is the time spent in I/O operations: when the average time spent per I/O operation increases significantly, that points to contention. So let's write a custom metric that measures the average cycles spent per GPFS I/O operation. When this increases we'll have a strong indication of contention or other inefficiency affecting our run!
We have covered the basics of writing a custom metric in a previous blog post. There are two files that we need: a shared library that collects the metric values, and a small XML file that describes the metric to MAP.
You can follow along by downloading the source files for the metric:
https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-38-04/gpfs1.tar.bz2
Let's start with a Makefile, because no-one likes compiling source files by hand. Like everyone else, I copied a sample Makefile (see forge/map/metrics/examples/sample/Makefile) and edited it until it did what I needed:
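The listing below is a minimal sketch rather than the full Makefile from the download; the Forge install path, the gpfs.xml filename and the install target are illustrative and should be adapted to your own setup.

# Sketch of a Makefile for the GPFS metric plugin (illustrative; adapt paths
# and flags to your own Forge and GPFS installations).
FORGE_DIR   ?= $(HOME)/allinea-forge-7.0.5-Redhat-7.0-x86_64
GPFS_INC    ?= /usr/lpp/mmfs/src/include/cxi
METRICS_DIR ?= $(HOME)/.allinea/map/metrics

# Detect the build architecture and set the matching GPFS_ARCH_ define.
ARCH := $(shell uname -m)
ifeq ($(ARCH),x86_64)
    ARCH_DEF := -DGPFS_ARCH_X86_64
else
    # Other architectures need the corresponding GPFS_ARCH_ define from the GPFS headers.
    ARCH_DEF := -DGPFS_ARCH_$(shell echo $(ARCH) | tr '[:lower:]' '[:upper:]')
endif

CFLAGS := -D_REENTRANT $(ARCH_DEF) -I$(GPFS_INC) -I$(FORGE_DIR)/map/metrics/include \
          -Wall -Werror -fno-omit-frame-pointer -g -O2

all: lib-gpfs.so

lib-gpfs.so: lib-gpfs.c
	gcc $(CFLAGS) $< -o $@ -fPIC -shared
	@echo "Use make install to install the metric in $(METRICS_DIR) for testing."

install: lib-gpfs.so gpfs.xml
	mkdir -p $(METRICS_DIR)
	cp lib-gpfs.so gpfs.xml $(METRICS_DIR)

clean:
	rm -f lib-gpfs.so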
The Makefile does some extra work to be a good citizen. First, it detects the architecture we are building on and sets a GPFS_ARCH_ define accordingly; second, it points the compiler at the GPFS include files (/usr/lpp/mmfs/src/include/cxi) so it can find the necessary data structures and constants.
Next up is the source code for the shared library. Let's take it page by page:
We're going to use cxiSharedSeg.h from the GPFS include files to get the definitions of the data structures and constants we need to communicate with the GPFS kernel module. The /dev/ss0 device is the GPFS shared segment device, and it is what we're going to use to access the GPFS filesystem counters.
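As a rough sketch, the top of lib-gpfs.c pulls in the MAP metric plugin header from map/metrics/include together with the GPFS header, and keeps the /dev/ss0 file descriptor and the most recently collected counters in file-scope variables. The variable names here are mine, not necessarily those in the downloadable source.

/* Sketch of the top of lib-gpfs.c; the header and device names come from the
 * text above, the file-scope variable names are illustrative. */
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

#include "allinea_metric_plugin_api.h"  /* MAP custom metric plugin interface */
#include "cxiSharedSeg.h"               /* GPFS shared segment definitions    */

#define GPFS_SS_DEVICE "/dev/ss0"       /* the GPFS shared segment device     */

static int gpfsFd = -1;                 /* handle to /dev/ss0                 */
static struct timespec lastSampleTime;  /* time of the last processed sample  */
static uint64_t totalIoCycles = 0;      /* cycles spent in GPFS I/O calls     */
static uint64_t totalIoCalls  = 0;      /* number of GPFS I/O calls made      */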
Next up is allinea_plugin_initialize – MAP calls this function to initialize the custom metric plugin.
This is where we open the /dev/ss0 device. If it's not there then there is probably no GPFS filesystem present!
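A minimal sketch of that function, assuming the entry points used by the sample metric (allinea_plugin_initialize taking a plugin id, and allinea_set_plugin_error_messagef for reporting failures) – check allinea_metric_plugin_api.h in your Forge installation for the exact signatures:

/* Called once by MAP when the plugin is loaded (sketch). */
int allinea_plugin_initialize(plugin_id_t plugin_id, void *unused)
{
    gpfsFd = open(GPFS_SS_DEVICE, O_RDONLY);
    if (gpfsFd == -1) {
        /* No /dev/ss0: most likely there is no GPFS filesystem on this node. */
        allinea_set_plugin_error_messagef(plugin_id, errno,
            "Could not open %s: is GPFS present on this node?", GPFS_SS_DEVICE);
        return -1;
    }
    return 0;
}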
The allinea_plugin_cleanup function is the mirror of allinea_plugin_initialize – MAP calls this function to clean up the custom metric plugin when profiling has finished.
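The cleanup sketch just closes the device again:

/* Called once by MAP when profiling has finished (sketch). */
int allinea_plugin_cleanup(plugin_id_t plugin_id, void *unused)
{
    if (gpfsFd != -1) {
        close(gpfsFd);
        gpfsFd = -1;
    }
    return 0;
}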
The update function below is where all the heavy lifting occurs. Because this is called when the profiler interrupts the program to take a sample we want to keep that heavy lifting as light (and signal-safe) as possible!
Linux filesystems provide their services (open, read, write, close and so on) to the kernel through the Virtual Filesystem Switch (VFS) interface, and GPFS counts the number of calls made through this interface along with the number of CPU cycles spent in each call. The update function sends a control request to the /dev/ss0 device to retrieve these VFS statistics from the GPFS kernel driver, then sums the cycles spent in each type of I/O call to get a grand total across all I/O calls.
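I won't reproduce the exact GPFS control request here – in the downloadable source the request code and statistics structure come from cxiSharedSeg.h – so the definitions below are hypothetical placeholders that show the shape of the function: ask the kernel driver for the per-call VFS counters, then total up the cycles and call counts.

/* Hypothetical stand-ins for the real definitions in cxiSharedSeg.h: a request
 * code and a structure holding, per VFS call type, the number of calls made
 * and the cycles spent in them. */
#define GPFS_GET_VFS_STATS_REQUEST 0    /* placeholder request code             */
#define NUM_VFS_OPS 32                  /* placeholder number of VFS call types */
struct gpfsVfsStats {
    uint64_t calls[NUM_VFS_OPS];        /* calls made per VFS operation   */
    uint64_t cycles[NUM_VFS_OPS];       /* cycles spent per VFS operation */
};

/* Called from the sampler: fetch the VFS counters and total them up.
 * Keep this light and signal-safe - no malloc, no stdio, no locks. */
static void update(void)
{
    struct gpfsVfsStats stats;

    /* Placeholder for the real control request to the GPFS kernel driver. */
    if (ioctl(gpfsFd, GPFS_GET_VFS_STATS_REQUEST, &stats) != 0)
        return;

    /* Grand totals across all VFS call types. */
    uint64_t cycles = 0, calls = 0;
    for (int i = 0; i < NUM_VFS_OPS; ++i) {
        cycles += stats.cycles[i];
        calls  += stats.calls[i];
    }
    totalIoCycles = cycles;
    totalIoCalls  = calls;
}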
The getMetricValue helper function below ensures that update is only called once per sample. It checks the time of the current sample against the last recorded sample time and, if they differ, calls update before returning the value of the requested metric.
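A sketch of that helper, with the caller passing a pointer to whichever counter it wants back:

/* Run update() at most once per sample, then hand back the requested counter
 * (sketch). */
static uint64_t getMetricValue(const struct timespec *sampleTime,
                               const uint64_t *counter)
{
    if (sampleTime->tv_sec  != lastSampleTime.tv_sec ||
        sampleTime->tv_nsec != lastSampleTime.tv_nsec) {
        update();
        lastSampleTime = *sampleTime;
    }
    return *counter;
}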
Last, but by no means least, we have the metric function that MAP will call to retrieve the value of our metric.
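A sketch of the metric function, assuming the getter signature used by the sample metric (a metric id, the sample time and an output value). The function name gpfs_cycles_per_iop is illustrative and must match whatever the XML file declares; reporting the average over the interval since the previous sample is one plausible way to define the metric.

/* The metric getter named in the XML file (sketch). Reports the average
 * number of cycles per GPFS I/O operation since the previous sample. */
int gpfs_cycles_per_iop(metric_id_t metric_id,
                        struct timespec *in_out_sample_time,
                        uint64_t *out_value)
{
    static uint64_t prevCycles = 0, prevCalls = 0;

    const uint64_t cycles = getMetricValue(in_out_sample_time, &totalIoCycles);
    const uint64_t calls  = getMetricValue(in_out_sample_time, &totalIoCalls);

    /* Cycles and calls accumulated since the previous sample. */
    const uint64_t dCycles = cycles - prevCycles;
    const uint64_t dCalls  = calls  - prevCalls;
    prevCycles = cycles;
    prevCalls  = calls;

    *out_value = dCalls ? dCycles / dCalls : 0;
    return 0;
}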
To make this metric appear in the GUI we need to include a small XML file that describes it to MAP. In this case it describes a single metric – GPFS cycles per I/O operation – the average number of cycles spent in the GPFS kernel module per I/O operation.
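I won't vouch for the exact schema from memory – check the XML that ships with the sample metric (forge/map/metrics/examples/sample) for the authoritative element names – but the file is short and looks roughly like this, with functionName matching the getter exported by lib-gpfs.so (gpfs_cycles_per_iop in the sketches above):

<!-- Rough sketch of the metric definition file; verify the element names
     against the sample metric's XML shipped with Forge. -->
<metricdefinitions version="1">
    <metric id="com.example.gpfs.cycles_per_iop">
        <units>cycles</units>
        <dataType>uint64_t</dataType>
        <domain>time</domain>
        <source ref="com.example.gpfs.source"/>
        <display>
            <displayName>GPFS cycles per I/O operation</displayName>
            <description>Average number of cycles spent in the GPFS kernel module per I/O operation</description>
        </display>
    </metric>
    <source id="com.example.gpfs.source">
        <sharedLibrary>lib-gpfs.so</sharedLibrary>
        <functionName>gpfs_cycles_per_iop</functionName>
    </source>
</metricdefinitions>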
We can build the code by simply running make:
user@hostname:~/code/gpfs1$ make
gcc -D_REENTRANT -DGPFS_ARCH_X86_64 -I/usr/lpp/mmfs/src/include/cxi -I/home/user/allinea-forge-7.0.5-Redhat-7.0-x86_64/map/metrics/include -Wall -Werror -Wno-attributes -fno-omit-frame-pointer -g -O2 -Wno-unused-but-set-variable lib-gpfs.c -o lib-gpfs.so -fPIC -shared
Use make install to install the metric in ~/.allinea/map/metrics for testing.
user@hostname:~/code/gpfs1$
Now when we profile an application using Arm MAP the profile will include our new GPFS metric! Let's take it for a test drive with a sample program that is showing unusual slowdown:
Here I have profiled a program which has ten iterations. Each iteration consists of three phases corresponding to the bands in the Main thread activity view: read (orange), compute (green), write (orange). The first five iterations perform as expected, but the last five iterations take much longer than the first five, which is why the orange bands get longer towards the right of the screen.
Can I determine whether the last five iterations take longer because they are performing more I/O, or whether they take longer due to I/O contention? Let's use our new metric to find out!
To use the new metric select the Metrics menu item and select the new GPFS cycles per I/O operation metric:
Now we can see the GPFS cycles per I/O operation metric at the top of the MAP window:
What's more, we can see that the CPU cycles per I/O operation increase from a baseline average of 1.72 million cycles per I/O operation during the first five iterations to an average of 1.15 billion cycles per I/O operation during the last five. The average number of cycles per I/O operation has increased dramatically, and now we can answer our earlier question: the last five iterations are taking longer due to I/O contention.
If we were looking for a representative run without I/O contention, we now know this run has been compromised and will want to do another profiling run. On the other hand, if we wanted to know the impact of I/O contention on our program, we now know exactly where to look, and MAP's source code views allow us to drill right down into the affected code.
Now you know all you need to solve I/O problems by extending Arm MAP's I/O profiling capabilities. If you found this interesting, do check out an extended version of this GPFS custom metric plugin containing additional metrics for your edification and amusement:
https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-38-04/gpfs2.tar.bz2
Happy profiling!
[CTAToken URL = "https://hpc-buy.arm.com/free-trial" target="_blank" text="Trial Arm Forge" class ="green"]