Identifying Bottlenecks in Large-Scale HPC Applications using MAP

Hi all!  The story below comes courtesy of Keeran Brabazon.  Enjoy!

Introduction

I’ve been placed in front of an HPC code base for which I have not been the primary developer (in fact I am new to the domain as well as the code). The task at hand is to find the main bottleneck in the weak scaling of the application, and I want to do so in a simple manner that can be clearly shared with others. This is the situation that I was in for a Horizon2020 research project called SAGE at the start of 2017. Project partners were wanting to verify the prediction of the performance of the algorithms that perform I/O in their applications. I was interested in observing the scaling characteristics for the MPI- and CPU-bound parts of the HPC applications.

SAGE is a european funded project which addresses the problems in supercomputing storage.

The HPC space is lucky enough to have a good number of tools for the monitoring and analysis of performance of a system or an application (e.g. Score-P, Scalasca, Tau, Darshan, etc.). Being a developer for Arm MAP and being part of the H2020 project on behalf of the Arm Forge development tools team, it was obvious that I was going to use Arm MAP in order to characterise the performance of different applications in the project. Fortunately, the question I was wanting to answer is perfectly suited to the use of MAP, which enables profiling of threaded and parallel codes, including I/O, communication and compute.

Performance Investigation

In a production scale HPC application run, what does the performance of the file I/O, network communication and intra-CPU computation look like?

To answer this question, I need (approximate) timing information for time spent performing different operations within an application. This information is available from an instrumentation or statistical profiling tool. Instrumentation tools such as Score-P and Tau produce very fine-grained detail which help to diagnose low-level performance bottlenecks very well. However, we don’t need this resolution, and want to avoid collecting the volume of data that is generated when using larger numbers of processes (we want to go to 4096 processes in this study). As we are looking for summary data only, it is possible to use a profiler such as gprof that is relatively lightweight. However, a single profile is generated per process by gprof. This data needs to be combined and presented in a sensible format, which is not a trivial task to code up. It would be much easier to have a summary of an application run automatically merged, as is the case in Arm Performance Reports. Performance Reports gives the information that I am after (i.e. a summary over the lifetime of an application). However, the runtime overhead of using Arm MAP, which gives a timeline of information, is the same as Arm Performance Reports, but gives more detailed information. I may want to inspect more details than the summary, so Arm MAP is the best choice of tool for this job.

Now, I’ve got the right tool and I want to collect performance data for an application. As part of the research in SAGE, a number of applications were profiled, and the results are written up in a public deliverable. In this blog we concentrate on a single application. This is iPIC3D, which is a space-weather particle-in-cell code. The scientific details are not important here. What is important is where we run the application and at what scale. The study is performed on the HPC system at KTH, called Beskow which is a Cray XC40. The nodes used are Intel® Xeon® E5 (previously codenamed Haswell) dual socket 16 CPU nodes. Lustre is the underlying filesystem. Experiments are run saturating a node with MPI processes (so no OpenMP), from a single node to 128 nodes (32 to 4096 processes). Arm MAP was run repeatedly on the applications to make sure that the results were reproducible, and examples of a MAP profile of the application run at 1 node, and another at 128 nodes are shown below.

Figure 1

Figure 2

The profiles in Figures 1 and 2 show the difference between what a profile may look like at small scale and what it looks like at large scale. When optimising a code, it is always important to consider what a production run will look like. If I am going to be running on only a few nodes, then the profile in Figure 1 is useful to identify where optimisation activity should occur. However, iPIC3D is intended to be run on tens of thousands of processes, so that a profile at scale, such as that in Figure 2 shows where bottlenecks will be observed in the run. The scaling is not just in the number of processes, but also in the amount of data. We observe weak scaling here where the amount of computational work per process is kept constant as the number of processes is increased. If the amount of file I/O to perform is directly correlated to the computational work (it is in this case), then there is 128x as much data to write (not many reads are performed in this simulation) on 128 nodes as there is on 1 node. From these profiles I’ve already been able to identify that the application spends more time in MPI and I/O routines as the number of processes is increased in a weak scaling experiment. Now I’ve got my interest peaked, and would like to know if it is possible, using the Arm Performance Analysis tools, just how much information I can obtain on why the I/O is scaling so poorly.

Performance Analysis

From the MAP profiles presented, it is clear that the I/O gets ‘worse’ when we move to larger scales, but is it possible quantify how much ‘worse’? With a set of performance data from runs on differing numbers of processes, the data is available to show the scaling of different parts of the application. However, the ability to combine this data and display it in a front-end is not available through the Arm HPC Performance Analysis tools. To enable the community to gain access the data that is available, though, a number of simple Python scripts have been open sourced on GitHub. There are scripts which work on the JSON export of Arm MAP and Performance Reports files. Contributions and updates are always welcome. Let’s have a quick look and see what information can be provided using these scripts.

Figure 3

The graph presented in Figure 3 shows much more clearly the scaling properties of the time spent in file I/O operations. With the information from the profiles distilled as such it is evident that we can expect the run time to continue to be dominated by file I/O as we go to larger scales.

Using experimental functionality (sorry, no access to this yet), it is also possible to dig a little deeper into the I/O properties and see where we need to focus optimisation efforts. From inspection of the application it was observed that there are two different types of file I/O performed. The first is I/O for the purposes of visualisation. Visualisation data is written at regular intervals, and collective MPI I/O is used by all processes to do this. The other form of I/O is for the purposes of checkpoint-restart, where the state of the simulation is written to disk. For this, each process writes application data to an individual file using HDF5. The volume of data written for checkpoint-restart is almost two orders of magnitude more than for visualisation. Splitting the time recorded in these functions out, it is possible to observe the scaling of these individual approaches. I know which of these I expect to scale the worst. Figure 4 shows the outcome of plotting the scaling.

Figure 4

Summary and Conclusions

The graph in Figure 4 shows some surprising results. The scaling of MPI I/O is expected to be small (through perceived wisdom that collective I/O operations are better than individual I/O operations). Both the writing of HDF5 files and the MPI I/O approach ultimately begin to scale linearly with the number of writers and number of bytes written, but the performance of the collective MPI routines seems quite poor. The implementation of the MPI I/O was done with care, and it was made sure to be done ‘by the book’, so what has gone on here? Well, there are a few issues that can be discussed here, but let’s just say that the book doesn’t consider the data sizes that need to be written, identification of optimal striping factors across parallel filesystems, sharing of network for I/O and communication, as well as many others. Whilst MPI I/O is able to be faster than a simple write to individual files, configuring MPI I/O to be quick on your particular system can require an in-depth knowledge not only of the configuration of your file system and the application being run, but also of the internals of the implementation of MPI that you are using, and possibly differing approaches to I/O when different numbers of writers and different volumes of data are involved. All in all, optimising your I/O can be a difficult and highly non-portable process which may not even port to the next incarnation of your institution’s HPC system. This is where the information provided in Arm MAP and Performance Reports reaches its limit. We are able to observe slow running and diagnose the causes, but the solution for how to overcome these is up to the programmer.

This blog intends to show that the data in a MAP profile no longer needs to stay within a single MAP profile, and that combination of data from multiple runs can give rise to interesting and intuitive results that are easily sharable and describable to others in the research community. But I do know that a lot of you would like to take the story to its resolution, so – what is the solution to the optimisation of the problem at hand? It can be clearly seen through simple graphics – collected from complex data sources – that the I/O in our application iPIC3D should be optimised. But how? Well, in this case, it was decided that the synchronous nature of the I/O could be avoided, and the addition of asynchronous I/O servers should increase the performance of the I/O (see “MPI Streams for HPC Applications”). Allocating specific resources to I/O gives an upper bound to the time spent in I/O routines, as long as those resources are not overloaded.

Anonymous