In Part 1 I ran Discovar, a life sciences genome assembly code, on one of our internal systems and optimized it to run the benchmark code 7% faster. Of course, physical hardware often performs very differently to cloud-hosted machines as I discovered to my surprise when I ran the same code on an EC2 instance.
I used cfncluster to create a single c4.8xlarge EC2 instance with 36 hyperthreaded cores and 60 GB of RAM. I figured this had a fighting chance of putting our local hardware to shame! Installing and running was just as easy as in Part 1 and soon I had my first benchmark results:
$ ./runme.sh
discovar version: Discovar r52488
loadaverage: 0.05 0.98 1.36 1/790 16317
2015-07-27 07:57
PERF: REAL 835.857 USER 36.188 SYSTEM 5.441 PERC 4.71
Wait, what? It took 14 minutes? But on our smaller local machine it completed in 8 minutes even before we made improvements! I modified runme.sh to run MAP on Discovar and take a peek under the covers:
map --profile --stop-after=300 src/Discovar …
The “--stop-after” option tells MAP to stop profiling after 300 seconds (5 minutes), so we don't need to wait for the incredibly slow benchmark run to finish before looking at the results.
This is what I found:
What on earth is going on here? Again there's lots of grey time, but we can see from the details bar that this is not OpenMP synchronization overhead. It's time spent sleeping. Unlike most multithreaded profilers, MAP doesn't just profile time spent running on the CPU; it can also show which lines of code spend time sleeping in kernel calls.
The Functions tab shows which of these calls our 36 threads are waiting for:
All that time is spent waiting for madvise to finish. Why madvise? Let's look at the Stacks view again:
By following the large grey areas and expanding down we see that it is being called by the _int_free function in glibc. A quick look at the glibc source on GitHub confirms this:
This is very, very bad for our performance. madvise is a system call and as such is very slow, triggering at least two context switches each time. Can we improve performance simply by swapping in a different malloc implementation?
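To make that concrete, here is a minimal sketch (my own illustration, not Discovar source) of the allocation pattern that pushes glibc down this path: many threads repeatedly building up and tearing down buffers that live in their per-thread arenas, so that frees at the top of an arena can trigger a trim and, with it, an madvise(MADV_DONTNEED) system call. Whether the trim actually fires depends on the glibc version and its tunables, but in MAP it shows up exactly as we saw here: time asleep inside free().

    // Illustrative only, not Discovar source. Many threads grow and shrink
    // their per-thread arenas by allocating and freeing blocks below the mmap
    // threshold; on glibc, returning the top of an arena can trigger a trim
    // that hands pages back with madvise(MADV_DONTNEED), i.e. a system call
    // hiding inside an apparently cheap free().
    #include <cstdlib>
    #include <thread>
    #include <vector>

    int main() {
        const int threads = 36;               // one per hardware thread on c4.8xlarge
        const std::size_t block = 64 * 1024;  // small enough to come from an arena, not mmap

        std::vector<std::thread> workers;
        for (int t = 0; t < threads; ++t) {
            workers.emplace_back([=] {
                for (int iter = 0; iter < 10000; ++iter) {
                    std::vector<void*> blocks(32);
                    for (auto& p : blocks) p = std::malloc(block);  // grow the arena
                    for (auto& p : blocks) std::free(p);            // shrink it again: possible madvise
                }
            });
        }
        for (auto& w : workers) w.join();
        return 0;
    }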
Enter Google's TCMalloc, an optimized malloc for multi-threaded applications. That sounds like just what we need! I downloaded it as part of gperftools-2.4, compiled it and installed it.
Then I repeated the run by adding this in front of the run command:
$ LD_PRELOAD=/usr/local/lib/libtcmalloc_minimal.so ./runme.sh
I left the MAP code in to gather another 5-minute snapshot of the run so I could compare the profiles and see whether this had really improved things before trying a full benchmark run of unknown length. This is what I saw:
That's not really better, as such. Now all the time is spent in sys_futex; evidently TCMalloc was designed for fewer simultaneous cores or for different patterns of allocation.
Speedup so far: 1.0x (first run is still the best)
The Broad Institute uses jemalloc for their de-novo version of Discovar. Could that help here? I downloaded jemalloc 3.6.0 and installed it. Running with the provided jemalloc.sh script is as easy as:
$ jemalloc.sh ./runme.sh
Again I told MAP to only run for the first 5 minutes, which was just as well:
This is actually slower. We can see it suffers from the same repeated madvise system calls. A look at the stacks shows why:
There's not a lot we can do about this without getting into the jemalloc internals.
When I want the absolute best performance on Intel hardware, I usually turn to Intel's MKL and TBB libraries. As it happens, Intel provide a malloc wrapper in their Threading Building Blocks library, tbbmalloc_proxy. I downloaded the source from https://www.threadingbuildingblocks.org/download#stable-releases and built it.
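Preloading is the quickest way to try it out, but for completeness: the TBB documentation also describes a link-time way to swap the allocator in, if you are able to rebuild the application. A minimal sketch of that alternative (my own example, nothing Discovar-specific):

    // Alternative to LD_PRELOAD when rebuilding is an option. Per the TBB
    // documentation, including this header in any one source file and linking
    // with -ltbbmalloc_proxy routes malloc/free and new/delete through the
    // TBB scalable allocator.
    #include "tbb/tbbmalloc_proxy.h"

    #include <vector>

    int main() {
        std::vector<char> buffer(1 << 20);   // this allocation is now served by tbbmalloc
        return buffer.empty() ? 1 : 0;
    }

For this experiment, though, preloading is all we need.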
Then I changed the preload to use this library instead and ran the first 5 minutes again:
That's a world of difference! The main work of the benchmark has entirely finished within the 5-minute period and shows full CPU utilization in that nice full-height light-green bar.
Speedup so far: 1.4x (14 minutes => 10.1 minutes)
We're getting there, but let's look more closely at the MAP file: what's that long grey area of synchronization at the end?
Selecting it zooms MAP's source and stacks view into that time range. Here MAP takes us straight to an OpenMP region inside an inner loop. All the time is spent synchronizing threads at the start and end of the region. This is similar to the problem we found in Part 1, in which a critical section was dominating the time in a small OpenMP loop.
I commented the OpenMP pragma out: I can't see even 1% of the time being spent on something useful, so this probably doesn't need to be run in parallel at all.
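I can't reproduce the Discovar source here, so the snippet below is only an illustration of the shape of the problem: an OpenMP parallel region nested inside a hot loop, where each pass does so little work that waking 36 threads and waiting at the implicit barrier costs far more than the loop body itself. Commenting the pragma out (or hoisting the parallelism to an outer level) removes that overhead.

    // Illustrative only, not the actual Discovar code: a tiny parallel region
    // executed millions of times. The fork/join and implicit barrier dominate,
    // so running it serially is faster.
    #include <cstddef>
    #include <vector>

    int main() {
        // one million "reads", each with a handful of values to process
        std::vector<std::vector<int>> kmers(1000000, std::vector<int>(8, 1));
        long long sum = 0;

        for (std::size_t read = 0; read < kmers.size(); ++read) {  // hot outer loop
            // #pragma omp parallel for reduction(+:sum)           // commented out: thread
            //                                                     // start-up and barrier cost >> work
            for (std::size_t k = 0; k < kmers[read].size(); ++k) {
                sum += kmers[read][k];                             // nanoseconds of real work
            }
        }
        return sum == 0;   // use the result so the compiler keeps the loops
    }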
After this I ran the entire benchmark right through to get the full picture:
It completed in just 358 seconds, almost 100s faster than our own local hardware managed in Part 1! Yet looking at the trace there are still huge areas of grey we can optimize. Zooming in on the large grey area at the end shows yet another OpenMP region in an inner loop:
Once again I simply commented it out – there's no sign of real computation being done in the profile, only the overwhelming overhead of locking and synchronizing threads at the start and end of the region.
The result, once again, is dramatic:
Now the whole benchmark completes in just 304 seconds, or 5.07 minutes. This is getting pretty good, but we're not done yet!
Speedup so far: 2.7x (14 minutes => 5.07 minutes)
What about that last large grey area with the light-blue pthread synchronization time at the start? Examining it with MAP shows that it's a recursive, pthreaded InPlaceParallelSort routine that takes all the time:
It's clearly not scaling well, and yet there's no super quick fix to be had here other than careful, diligent engineering, or replacing it with a library routine if a suitable one can be found. However, we can still tune it for this workload.
It only scales up to a small number of cores and the rest of the time is wasted on locks; what if we limit the pthread-based WorkList to only starting a maximum of 8 worker threads? Would that actually be faster? It's easy to do a quick check before engineering a proper solution:
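I won't quote the Discovar WorkList code here; the quick experiment simply amounts to clamping the number of worker threads at the point where they are created, along these lines (illustrative names and structure, not Discovar's actual API):

    // Illustrative sketch, not Discovar's actual WorkList: cap the worker
    // count for the parallel sort instead of giving it every hardware thread.
    #include <algorithm>
    #include <thread>

    static unsigned sort_worker_count() {
        const unsigned hw = std::thread::hardware_concurrency();  // 36 on the c4.8xlarge
        const unsigned max_sort_workers = 8;                       // empirically faster for this workload
        return std::min(hw, max_sort_workers);
    }

    int main() {
        return sort_worker_count() >= 1 ? 0 : 1;   // trivial main so the sketch builds standalone
    }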
Does it help? Yes! Look at this:
This section now only takes 14 seconds, down from 24, just by using fewer threads! The total runtime is now 280 seconds, or 4.7 minutes.
Speedup so far: 3.0x (14 minutes => 4.7 minutes)
Let's recap our gains:
The improvements so far place our EC2 instance at the number 1 position on the Broad Institute's benchmark list:
That's top of the list for just $2 per hour thanks to an enlightening afternoon with Arm MAP. Now that's time well spent!
If you enjoyed reading this, why not apply Arm's tools to your own software and see if you can achieve the same?
Free trial: https://hpc-buy.arm.com/free-trial