In episode one we optimized Torch A3C performance on the new Intel Xeon Phi (Knights Landing) CPU. Arm MAP and Performance Reports identified bottlenecks in our framework, and fixing them sped up model training by 7x.
To get further gains we identified areas of the Torch framework itself that could benefit from optimization, but that reaches a level of time investment most of us don't have.
Fortunately there's another way to train models faster – add more hardware! Vishnu, Siegel and Daily at the Pacific Northwest National Laboratory published a paper in which they scaled TensorFlow over multiple nodes using high-performance MPI communication rather than TensorFlow's built-in RPC work-sharing mechanism.
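The core idea is straightforward data parallelism: each MPI rank trains on its own shard of the data, and the gradients are averaged across ranks with an allreduce before every weight update. Here's a minimal sketch of that communication pattern using mpi4py and NumPy – an illustration of the approach, not the MaTEx implementation itself:

```python
# Sketch of MPI data-parallel training: every rank computes gradients on its
# own shard of the data, the gradients are summed across ranks with
# MPI_Allreduce, and each rank applies the averaged update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def local_gradients(weights, x_shard, y_shard):
    """Stand-in for the per-rank backward pass (e.g. a TensorFlow sess.run)."""
    return np.zeros_like(weights)           # illustrative only

weights = np.zeros(784 * 10)                # e.g. a flattened softmax layer for MNIST
x_shard, y_shard = None, None               # each rank would load its own slice of the data

for step in range(100):
    grads = local_gradients(weights, x_shard, y_shard)
    averaged = np.empty_like(grads)
    comm.Allreduce(grads, averaged, op=MPI.SUM)   # sum across all ranks...
    weights -= 0.01 * (averaged / size)           # ...then divide to get the average
```

Because the allreduce is a collective operation every rank finishes each step with identical weights – and, as we'll see later, it also means every rank has to wait for the slowest one.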
Not everybody has access to an InfiniBand-backed supercomputer, but anyone can launch a fleet of EC2 GPU instances for a few dollars per hour. Can we get similar speedups in the cloud? I decided to find out.
The PNNL team provided their MPI + TensorFlow implementation online as part of their Machine Learning Toolkit for Extreme Scale (MaTEx), so I fired up a small fleet of EC2 g2.2xlarge instances and got to work.
The paper looked at scaling over multiple CPU cores; I wanted to see whether the results held up when using GPUs for deep learning, and how running on EC2 instances affected scalability. I used the same MNIST classification task as the original paper, which was a small model trained rapidly for a fixed number of iterations:
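For reference, the kind of model involved is tiny – something along these lines in the TensorFlow 1.x style of the day (a sketch for illustration; the exact layer sizes, batch size and iteration count in the paper may differ):

```python
# A small MNIST classifier trained for a fixed number of iterations, in the
# spirit of the benchmark. The hyperparameters here are illustrative assumptions.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x, W) + b

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)), tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):                        # fixed number of iterations
        batch_x, batch_y = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
```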
Let's look at our GPU cluster's results side-by-side with the speedups reported by the paper for CPU cores:
At first glance this appears to be a bad result for our system in the cloud. However, the paper only reports scaling per CPU core, not per node. It would be more honest to compare one multi-core CPU node to one GPU node. Unfortunately the paper's authors didn't specify the number of workers per node they used. If we make a very conservative assumption of 4 workers per node – that is, comparing our N-node GPU runs against the paper's results at 4N CPU workers – we see the following:
Despite running in the cloud and without the benefit of low-latency InfiniBand networking, we see a better speedup using GPUs! This is probably because one GPU performs rather more work than several CPU cores, effectively making our EC2 nodes “fatter”. An interesting result – making use of the right hardware for the problem may be more important for scalability than using whatever local hardware is available.
So we've shown that we can get a better multi-node parallel speedup if we use GPU nodes on EC2. Would using an on-site cluster be better? How much better? Let's look at a Performance Report to see:
This is a report of the 32-node run. We can see a few interesting things right away:
The time spent in MPI communication (30.2%) isn't very high considering we seem to have reached the point of diminishing speedup returns. And the networking speed does not appear to be the bottleneck – 10 Gigabit Ethernet can in principle carry over a gigabyte per second, around a hundred times more than the 11.8 MB/s reported here!
The amount of time spent in memory accesses increases dramatically as the number of nodes increases. It looks as if MPI communication will eventually pass it, but even at 32 nodes we're spending much more time accessing memory than transferring data across the network. Why does this increase so quickly? What extra work are we doing?
The advice from Performance Reports was to use a profiler to investigate further – let's do exactly that. I used Arm MAP to record a 32-node run and uncovered a surprising story:
Here I've zoomed in to look at just two iterations. Arm MAP can't assign the time to individual lines of code in Python (yet), but despite that it's fairly straightforward to deduce the flow of execution using the stack traces and function names.
The reconstructed story looks something like this:
We can break this up into three rough phases and assign time to each based on the stack traces MAP reports:
So most of the time we spent “in MPI” is actually spent while some ranks wait for others to finish copying training and test data to the GPU – a task we don't even need to perform every epoch!
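A toy example makes the mechanism clear: a collective operation like Allreduce can't complete until every rank reaches it, so if one rank is still busy copying data to its GPU, all the other ranks simply sit inside MPI. This little sketch (not the MaTEx code) shows the effect:

```python
# Demonstrates how one slow rank turns into "MPI time" on every other rank:
# the collective below can't complete until the delayed rank arrives.
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    time.sleep(2.0)            # pretend rank 0 is still copying data to its GPU

grads = np.ones(10)
summed = np.empty_like(grads)
start = time.time()
comm.Allreduce(grads, summed, op=MPI.SUM)   # everyone else waits here ~2 seconds
print("rank %d spent %.1fs in Allreduce" % (rank, time.time() - start))
```

Run this over a handful of ranks and every rank except rank 0 reports around two seconds "in MPI" without a single byte being slow on the wire – the same signature we saw above: significant MPI time, but hardly any data on the network.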
Before doing anything else I decided to try eliminating that waste – only calculating and printing the error and accuracy every 30 epochs instead of every single epoch. The results were astonishing:
That's a 2.18x speedup just by changing the frequency of our progress outputs!
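The change itself is almost embarrassingly small. Building on the MNIST sketch from earlier (the variable names are mine, not MaTEx's), it amounts to guarding the per-epoch evaluation:

```python
# Evaluate the metrics (and therefore copy the full training and test sets to
# the GPU) only every 30 epochs instead of every epoch. Names follow the
# earlier sketch; the epoch and batch counts are illustrative.
num_epochs = 90
batches_per_epoch = 550        # 55,000 training images / batch size of 100

for epoch in range(num_epochs):
    for _ in range(batches_per_epoch):
        batch_x, batch_y = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})

    if epoch % 30 == 0:        # previously this block ran every single epoch
        train_loss = sess.run(loss, feed_dict={x: mnist.train.images,
                                               y_: mnist.train.labels})
        test_acc = sess.run(accuracy, feed_dict={x: mnist.test.images,
                                                 y_: mnist.test.labels})
        print("epoch %d: loss %.4f, test accuracy %.4f" % (epoch, train_loss, test_acc))
```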
This saving came from identifying unnecessary data transfer – a hot topic in HPC as computation becomes cheaper relative to data movement. The uptake of high-level languages like Python to "glue together" low-level C and Fortran numeric libraries carries with it the risk of accidentally making multiple implicit copies.
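A concrete (and very common) example of an implicit copy, purely for illustration: NumPy quietly allocates and fills a brand-new array whenever a dtype conversion is needed, for instance when default float64 data meets a framework that wants float32.

```python
import numpy as np

data = np.random.rand(10000, 784)     # NumPy defaults to float64...

# ...so converting for a float32-only consumer silently allocates and fills a
# whole new array. Done once at load time that's harmless; done inside a
# training loop every iteration it adds up fast.
batch = data.astype(np.float32)

print(batch.base is data)             # False: this is a copy, not a view
```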
Performance Reports identifies frequent data movement as high time spent in memory accesses, and a tool like Arm MAP helps to pinpoint exactly where it happens – in this case, more than doubling our training performance.
Interestingly the gains made by this software optimization far outweigh the potential benefits of using top-of-the-line dedicated hardware! In this day and age can any of us afford not to profile our code?