This year we're proud to be sponsoring the Student Cluster Competition at SC15. One of the key codes teams will have to optimize for their systems is the classic Linpack benchmark. I decided to have a go on one of our test systems to see what the students are up against.
I expected a very straightforward tuning process – after all, this code has been a staple of the HPC community for decades – but as it turns out I was in for a surprise!
This blog post walks through using Performance Reports and Arm Forge to understand and improve the performance of the HPL 2.1 Linpack benchmark with lots of pictures and real performance figures.
After compiling and installing HPL-2.1 on our cluster I had to customize the HPL.dat file - the defaults are far too small for a modern machine. After reading the Linpack FAQ I settled on an N of 15000 as a compromise between good memory usage and runs short enough for iterative tuning on the two 8-core nodes I reserved for this test.
Another key parameter, NB, is the block size HPL uses to partition and distribute the matrix. Apparently anything from 32 to 256 was a good range when the FAQ was last updated. Let's try 32 and see how we get on.
Finally, we need to make sure P*Q equals the number of MPI processes we are using - HPL will happily leave any extra ranks idle if you underspecify! I'm using two 8-core nodes, i.e. 16 processes, so I set P and Q both to 4, as the HPL tuning guide recommends layouts as close to square as possible.
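For reference, the handful of HPL.dat lines this boils down to look roughly like the following (the stock file contains many more options; only the values discussed here are shown):

1            # of problems sizes (N)
15000        Ns
1            # of NBs
32           NBs
1            # of process grids (P x Q)
4            Ps
4            Qs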
How does it do? Well, I got this:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       15000    32     4     4              67.63              3.327e+01
So, 33 Gflops. Are we done now, or can we do better? How do we know?
For a high-level overview let's look at a Performance Report of this run. It's easy to generate one:
$ perf-report srun -n 16 ./xhpl
…
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       15000    32     4     4              68.22              3.299e+01
HPL runs as before and produces an extra HTML report file in the current directory. As we can see, the overhead of running under Performance Reports is <1%, which is below the level of noise on the system. The report looks like this:
The good news is that our job is compute-bound and not MPI- or I/O-bound! There are three pieces of advice given to us in the breakdown sections:
CPU. Lots of time is spent on memory accesses and the vectorization is only partial. We're not going to rewrite Linpack, but this shows us that the BLAS implementation used is not optimal on our system. We're running on an Intel cluster and could expect to get a lot more performance using an optimized MKL kernel.
Performance insight 1: 76% of the time is spent inside a BLAS implementation that is not well-optimized for this system.
MPI. 24% of our time is spent in point-to-point MPI communications at <50MB/s. It's suggested this is due to using lots of small messages or a workload imbalance. Certainly many small inter-node messages would be a problem with our Ethernet test cluster. This is something we can improve.
Performance insight 2: use a different PxQ layout to minimize inter-node communication as the latency of our test cluster's networking is hurting performance.
Memory. Only 11% of the available RAM on the nodes is being used. Linpack lets us freely scale to use larger workloads and we could expect this would involve more time computing and less time communicating, further improving performance.
Performance insight 3: increase N to maximize the computation:communication ratio.
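(A hedged rule of thumb of my own for later, not something the report states: HPL's dense matrix needs roughly N x N x 8 bytes, so to target a given fraction of the job's total memory you would pick N ≈ sqrt(fraction × total_bytes / 8). Purely as an illustration, filling 80% of a hypothetical 32 GB of total RAM would suggest N ≈ sqrt(0.8 × 32e9 / 8) ≈ 56,600. HPL's workspace and the OS eat into this, which is why you aim well below 100%.)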
Let's take these one at a time and see how performance is affected! I expect replacing the BLAS implementation to affect the CPU performance independently of the communication changes, so the order in which we try them shouldn't matter much - I'll start with the quickest thing to change: the PxQ layout.
Happily, we can easily use a flatter 2x8 = 16 layout for our 16 cores. This matches the hardware layout of 2 nodes with 8 cores per node. Let's see what happens:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       15000    32     2     8              63.35              3.552e+01
Not bad! Our MPI point-to-point communications are now almost twice as fast and we spend 30% less time in MPI overall giving us a 7.7% speedup! Notice that the figures in the CPU section are almost exactly the same – the computational fingerprint of the job hasn't changed at all.
Building on the results of experiment 1, let's increase N from 15000 up to 52000. This should give us a useful datapoint for increased RAM usage (up from ~4GB per node) without taking as long as a full-sized run would. The results are surprising:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       52000    32     2     8            5224.52              1.794e+01
The total time spent in MPI is lower – as expected – but the average transfer rate has dropped again. That's strange! And why was performance so much worse? Did we miscalculate and run into swap space? No, the memory section shows our peak memory usage is just 71% – well below 100%. So what is different to the previous reports?
Unexpectedly the CPU breakdown has changed. We are spending more time in memory accesses – this is up to 74.6%! Why does increasing N do this? And is there anything we can do about it?
It's tempting to wave our hands, invoke the Mysterious Ways of The Cache and say there must be a sweet spot we can find for N that minimizes this effect and maximizes the compute:communicate ratio.
That would be a mistake.
Any unexpected, inexplicable observation is a chance to learn. And learning is amazing fun! It's also the only way to ensure you really do get good performance from a system, as we shall see. The easiest way to dive into program performance and understand why it behaves as it does is Arm MAP.
Generating a profile is just as easy as it was with Performance Reports:
$ map --profile srun -n 16 ./xhpl
The “--profile” argument tells MAP to run on the command-line only and produce a .map file instead of showing the GUI. Once the run has finished, we open the .map file and examine it:
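If you prefer to stay on the command line, passing the .map file straight to MAP opens it in the GUI too; the exact filename depends on your run, but the command is something along the lines of:

$ map xhpl_16p_2015-10-01_12-00.map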
The horizontal axis on all MAP's graphs is time, from the start of the run at the left to the end at the right. Green is time spent computing, blue is time in MPI. Generally speaking, Green is Good, Blue is Bad. Immediately we see two interesting features:
There's a large amount of MPI communication at the start of the run – moving the mouse over it shows this extends for the first 450 seconds. This accounts for the reduced MPI performance seen in the Performance Report.
In the rest of the run computation dominates, with the time spent in MPI gradually increasing as we get closer to the end of the run.
The Main Thread Stacks view at the bottom tells us that all the compute time is spent in two cblas_dgemm calls - this seems reasonable. Most of the communication time for the latter part of the run is spent in HPL_pdlaswp01T, but during the first crazy period HPL_bcast_1ring is dominant.
The Stacks and source views make it clear that HPL is not doing some kind of special initialization – HPL_bcast_1ring is done in every iteration, it just took a lot longer at the start for some reason.
Clicking on Metrics and then “Preset: MPI” shows all the MPI metrics available to MAP:
Here the time around the start has been selected. Clearly the MPI call duration metric is the most interesting. I go back to the default preset and add just this one. By clicking on the highlighted area again I can zoom in and hover the mouse over individual samples to see more information:
Here MAP is showing me that during this period some MPI calls took over 8 seconds! That's clearly ridiculous. In each case, rank 9 was the process that spent least time in the call, making it a very likely late sender. Is this an algorithmic problem or could it be some kind of system noise? Adding the CPU Time and System Load metrics from the menu will show us:
There doesn't seem to be any particular correlation between either other processes taking CPU time or increases in system load during this early phase - it really does look algorithmic in nature. At this point applying scientific thinking to the process really helps. All sorts of hypotheses spring to mind:
It's the HPL_bcast_1ring algorithm that's causing problems - I should rerun with one or all of the five other options available in HPL.dat (the relevant input lines are shown just after this list).
The size of the arrays is causing some kind of load imbalance; I should just restrict myself to a smaller problem size. The trouble seems to start around the 1GB-per-node mark, which would limit me to using 50% of my total RAM.
???
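For reference, the broadcast algorithm is selected by a pair of lines in HPL.dat; in the stock input file they look roughly like this, with 0 picking the 1-ring broadcast we are currently using:

1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)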
Whilst these are valid hypotheses, there are quicker ways to confirm them than rerunning lots of jobs at different sizes. The first thing I want to do is collect more detailed information from the start of the run. I can do this really quickly with a single command:
$ map --stop-after=60 --profile srun -n 16 ./xhpl
This stops the run after 60 seconds, ensuring we have a nice high-resolution dataset of the first few iterations in which the problem is happening. You can also use --start-after and --stop-after in combination to record one particular subsection of a run, as in the example below.
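For instance, to capture a window partway through a long job you might run something like this (the numbers here are just for illustration):

$ map --start-after=120 --stop-after=180 --profile srun -n 16 ./xhpl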
The data I get out of this short run is fascinating and immediately refutes the first hypothesis: here I've selected the end of the second iteration, in which we can clearly see that 2 of the 16 cores have finished their cblas_dgemm call in roughly half the time of the rest and are sitting in the HPL_bcast_1ring function waiting for their fellows.
In the first iteration we can see half the cores finished early and waited for the rest before they could synchronize. This is nothing to do with the communication algorithm and so rerunning with many different variations of it won't help at all.
Performance insight: with production-scale data a serious load imbalance in the dgemm calculation is occurring.
What would cause this? Is HPL and/or our configuration somehow decomposing the matrix incorrectly and sending less work to some cores than others? Should we experiment with the NB parameter? The tuning guide says it affects load balance, right? Let's stay scientific and state our hypothesis as clearly and precisely as we can.
Hypothesis: load imbalance in the decomposition of the problem is sending smaller arrays to cblas_dgemm on some ranks than on others.
We can test this much more directly than just by blindly trying lots of different NB values.
This experiment is easy to run with Arm Forge's debugger, DDT: I leave my MAP trace open for reference and launch a second copy of Forge on my laptop. The “remote launch” feature lets me connect Forge to the cluster via SSH. Once it's ready, I simply go back to the cluster terminal and run the job with “ddt --connect” in place of “map --profile”:
$ ddt --connect srun -n 16 ./xhpl
When the job starts this pops up a dialog on the GUI on my laptop asking if I want to accept this connection:
I sure do! Now I'm debugging on the remote cluster using my local GUI, all over SSH. This is the easiest way to debug since the print statement was invented! In my open MAP window I can see the two places the dgemm kernels are being called from: HPL_pdupdateTT.c on lines 382 and 405. I put a breakpoint at each and play the processes through the code:
Now we've found where the meat of the computation is done, we can add a tracepoint on the mb, nb and jb variables (the matrices involved being mb×jb, jb×nb and mb×nb) to track how much work is being passed to the dgemm function at each call.
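For context, here's a minimal sketch (in plain CBLAS, not HPL's exact call) of how those dimension arguments map onto the matrices a rank hands to the BLAS:

/* Conceptual sketch only - not HPL's exact call. A dgemm with dimension
 * arguments (m, n, k) multiplies an m x k block by a k x n block and
 * accumulates into an m x n block, so logging mb, jb and nb at each call
 * tells us exactly how much work every rank passes to the BLAS kernel. */
#include <cblas.h>

void trailing_update(int mb, int nb, int jb,
                     const double *A, int lda,   /* mb x jb panel          */
                     const double *B, int ldb,   /* jb x nb panel          */
                     double *C, int ldc)         /* mb x nb trailing block */
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                mb, nb, jb,
                -1.0, A, lda, B, ldb,
                 1.0, C, ldc);
}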
I now hit play and watch how these sizes evolve iteration by iteration in real time as the program runs:
The results are clear - our hypothesis has been rejected! Every rank sends roughly the same sized arrays to dgemm, for the first few iterations at least, varying by less than 1%. At this stage new ideas spring to mind:
Performance insight: the dgemm implementation itself has load-balancing problems with large arrays.
Let's have a closer look at the default CBLAS implementation on this test cluster:
[mark@mars CBLAS]$ less README
_____________________________________________________________________________

This package contains C interface to Legacy BLAS.
If you want to know how to use makefile, type 'make help.'

Written by Keita Teranishi (5/20/98)
_____________________________________________________________________________
Er, this is an interface to something that was already a legacy implementation 17 years ago? Let's have a look at the implementation itself:
*  Further Details
*  ===============
*
*  Level 3 Blas routine.
*
*  -- Written on 8-February-1989.
Oh wow, this is a piece of computing history right here! In 1989 it wasn't even all that common to do HPC on the x86 architecture. It seems unreasonable to expect this routine to be well-optimized for running gigabyte-large arrays on modern Intel CPUs.
Hypothesis: swapping out the dgemm routine for one optimized for our architecture will improve the load balancing as well as the performance.
If you recall, we wanted to do this anyway because our first piece of advice from the initial Performance Report was that the BLAS routines were not well-optimized for the current architecture.
This isn't going to turn into a secret advertorial for Intel, but seriously: if you want to get the best performance from your hardware you should always try your hardware vendor's math libraries. All the major vendors provide optimized versions of common libraries, including BLAS, and Intel is no exception. As MKL comes with the Intel compilers these days, swapping it in was as simple as changing the LAlib line in HPL's Make file:
LAlib = -Wl,--no-as-needed -L/opt/intel/compiler16/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm
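A handy sanity check once the new binary is built - does it actually pick up MKL rather than the old CBLAS?

$ ldd ./xhpl | grep mkl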
A quick rebuild of HPL and we're off for another test run to see how the first 60 seconds look now:
The results? Splendid!
Look how much faster it is! The iterations now last just 1.5 seconds instead of 25 with the default BLAS implementation and importantly all signs of imbalance have vanished. The MPI spikes are now short and almost vertical, implying all ranks hit the MPI calls at very similar times.
Let's do a full run now and look at the results and the Performance Report:
That's more like it! 85% of the time in arithmetic-bound compute with good vectorization and 94% of RAM used. Our point-to-point MPI communication time is up and is quite reasonable for this ethernet-based system considering this is a per-rank figure. And the results of these optimizations speak for themselves:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       60000    32     2     8             887.49              1.623e+02
162 Gflops from 2 nodes of our test cluster seems pretty good to me. The only thing I'd still try to improve here is the dgemm performance. Load balance looks pretty good according to the MPI time in the Performance Report – this suggests we can push the block size (NB) higher to get even better performance!
Hypothesis: increasing the block size from 32 will improve overall performance at the expense of slightly increased time in MPI.
To test this I pushed NB up to 256 and measured the first 60 seconds in MAP again. This is what we get back:
Uh-oh, check out that node memory usage chart! Using a larger block size has increased the amount of RAM required. MAP measures Resident Set Size - the amount of physical RAM used - so a full run is at risk of swapping to disk.
Performance insight: with NB=256 the RAM requirements increased, potentially pushing us into swap. We'll decrease N a little to compensate and get the best results.
The good news is that the floating-point performance has shot through the roof:
There's not much point in trying to optimize the compute side any further, so exploring even higher NBs is probably a waste of time. MPI time in the first 6 iterations, however, has increased from 11.4% to 14.4%.
We have already seen that MPI time gradually increases during the run as the dgemm sizes decrease linearly in each iteration (thanks, DDT!) so the question arises: can we get the best of both worlds by finding a lower NB that reduces the time spent in MPI whilst still maximizing floating-point performance?
Performance insight: with NB=256 the dgemm performance is stellar but we spend more time in MPI calls. Can we get better MPI performance and the same CPU performance with an NB between 32 and 256?
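Conveniently, HPL.dat accepts a list of NB values and runs each one in turn, so a single input file can cover the whole sweep - the relevant lines would look something like:

3            # of NBs
64 128 256   NBs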
At NB=128 we still see 14% time in MPI in the first 6 iterations, so there don't look to be any great savings to be made here. At NB=64 MPI time is still around 14% but compute intensity already starts to drop off. It looks like 256 is good enough for a production run. Let's do it:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       59000   256     2     8             626.68              2.185e+02
That's 218 Gflops – a 6.6x improvement after just a handful of experiments and only two full-length runs. The Performance Report for this run shows where we might still find further improvement:
Performance Reports now shows that every aspect of our system is being used close to optimally; only the MPI communication still leaves something to be desired. The 101 MB/s per-process transfer rate may be as much as we can expect from our bog-standard ethernet networking, but if there is any more performance to be squeezed out of this system, this is where it will be found.
We started with a completely naïve run of Linpack achieving just 33 Gflops. The Performance Report told us that most of the runtime was going into a BLAS implementation that was not well-optimized for this system, and MAP and DDT then showed that the same legacy dgemm suffered from a serious load imbalance at production problem sizes.
Swapping it out for the optimized MKL library fixed both of these problems and let us scale to 162 Gflops. The Performance Report showed us load imbalance was no longer a limiting factor, so we tried increasing the block size to get better CPU performance at the expense of increased MPI time.
The test paid off and we reached 218 Gflops in our final run!
I hope this has been a helpful example of how to take a scientific approach to performance optimization - forming hypotheses and using the tools available to really understand unexpected problems rather than speculating about their causes and giving up. The same techniques can be applied to any code or benchmark - download Forge and Performance Reports and have fun!