Tags
  • High Performance Computing (HPC)
  • HPC Tools
  • Deep Learning
  • Development Tools
  • Arm MAP
  • Arm Performance Reports
  • google
  • infrastructure

Deep Learning Episode 2: Scaling TensorFlow over multiple EC2 GPU nodes

Mark O'Connor
August 3, 2016
6 minute read time.

In episode one we optimized Torch A3C performance on the new Intel Xeon Phi (Knights Landing) CPU. Arm MAP and Performance Reports identified bottlenecks in our framework and helped us speed up model training by 7x.

To get further gains we found areas of the Torch framework that could benefit from optimization, but that requires a level of time investment most of us don't have.

Fortunately, there's another way to train models faster – add more hardware! Vishnu, Siegel and Daily at the Pacific Northwest National Laboratory published a paper in which they scaled TensorFlow over multiple nodes using the high-performance MPI library rather than TensorFlow's built-in RPC work-sharing mechanism.

Not everybody has access to an InfiniBand-backed supercomputer, but anyone can launch a fleet of EC2 GPU instances for a few dollars per hour. Can we get similar speedups in the cloud? I decided to find out.
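For context, the core of that approach is straightforward: each MPI rank trains on its own shard of the data, and the model weights (or gradients) are averaged across ranks with an allreduce instead of being routed through TensorFlow's RPC layer. Below is a minimal sketch of the idea, assuming mpi4py and NumPy – it is illustrative only, not the MaTEx implementation:

```python
# Minimal sketch of MPI-style data-parallel weight averaging (illustrative,
# not the MaTEx code). Assumes mpi4py and NumPy; each rank has already
# computed its own copy of the weights on its shard of the data.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD


def average_weights(local_weights):
    """Average a list of NumPy weight arrays across all MPI ranks."""
    averaged = []
    for w in local_weights:
        summed = np.zeros_like(w)
        comm.Allreduce(w, summed, op=MPI.SUM)      # sum this array over every rank
        averaged.append(summed / comm.Get_size())  # divide by rank count -> mean
    return averaged
```

Launched under mpirun, every rank executes the same script and the allreduce keeps their copies of the model in sync.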

The Stack

The PNNL team provided their MPI + TensorFlow implementation online as part of their Machine Learning Toolkit for Extreme Scale, so I fired up a small fleet of EC2 g2.2xlarge instances and got to work.

The paper looked at scaling over multiple CPU cores; I wanted to see whether their results held up when using GPUs for deep learning, and how using EC2 instances affected the scalability. I used the same MNIST classification task as the original paper – a small model trained rapidly over a fixed number of iterations:

MNIST classifications
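To give a sense of the scale of the workload, a model of this size looks something like the classic single-layer softmax MNIST classifier below. This is an illustrative TF 1.x-style sketch only – the exact architecture used in the paper and in MaTEx may differ:

```python
# Illustrative small MNIST classifier (TF 1.x API) -- roughly the scale of
# model being benchmarked here, not the actual MaTEx implementation.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 input images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot class labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```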

Let's look at our GPU cluster's results side-by-side with the speedups reported by the paper for CPU cores:

Speedup factors compared to published results

At first glance this appears to be a bad result for our system in the cloud. However, the paper only reports scaling per CPU core, not per node. It would be more honest to compare one multi-core CPU node to one GPU node. Unfortunately, the paper's authors didn't specify the number of workers per node they used. If we make a very conservative assumption of 4 workers per node, we see the following:

Speedup comparison at 4 procs per node, 1 GPU per node

Despite running in the cloud, and without the benefit of low-latency InfiniBand networking, we see a better speedup using GPUs! This is probably because one GPU performs rather more work than a handful of CPU cores, effectively making our EC2 nodes “fatter”. An interesting result – choosing the right hardware for the problem may be more important for scalability than using whatever local hardware is available.

Performance Report: Is this running well?

So we've shown that we can get a better multi-node parallel speedup if we use GPU nodes on EC2. Would using an on-site cluster be better? How much better? Let's look at a Performance Report to see:

Using a site cluster performance report

This is a report of the 32-node run. We can see a few interesting things right away:

  • The time spent in MPI communication (30.2%) isn't very high, considering we seem to have reached the point of diminishing speedup returns. The networking speed does not appear to be the bottleneck either – 10 Gigabit Ethernet can do rather better than 11.8 MB/s!
  • The GPU utilization is very low – just 8.5% on average.
  • The vast majority of wallclock time isn't spent waiting for the GPU or communicating over the network, but shuffling memory on the CPU.

Performance Reports can also export to CSV (and now JSON) format, so let's see how our resources are spent overall as the number of nodes increases:

Breakdown of total cores
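A breakdown like the one above can be pulled together from the per-run CSV exports with a few lines of pandas. The sketch below is an assumption-heavy illustration – the file naming and column names ("CPU memory accesses", "MPI", "Accelerator") are placeholders, so check the headers of your own exports:

```python
# Sketch: aggregate Performance Reports CSV exports from runs at different
# node counts into one table. File naming and column names are assumptions
# made for illustration -- adjust them to match your actual exports.
import glob
import re

import pandas as pd

rows = []
for path in glob.glob("run_*_nodes.csv"):            # hypothetical file naming
    nodes = int(re.search(r"run_(\d+)_nodes", path).group(1))
    report = pd.read_csv(path)
    rows.append({
        "nodes": nodes,
        "memory_access_pct": report["CPU memory accesses"].iloc[0],  # assumed column
        "mpi_pct": report["MPI"].iloc[0],                            # assumed column
        "gpu_pct": report["Accelerator"].iloc[0],                    # assumed column
    })

breakdown = pd.DataFrame(rows).sort_values("nodes")
print(breakdown)
```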

The amount of time spent in memory accesses increases dramatically as the number of nodes increases. It looks as if MPI communication will eventually overtake it, but even at 32 nodes we're spending much more time accessing memory than transferring data across the network. Why does this increase so quickly? What extra work are we doing?

Diving Deeper with MAP

The advice from Performance Reports was to use a profiler to investigate further – let's do exactly that. I used Arm MAP to record a 32-node run and uncovered a surprising story:

32-node run in Arm MAP

Here I've zoomed in to look at just two iterations. Arm MAP can't assign the time to individual lines of code in Python (yet), but despite that it's fairly straightforward to deduce the flow of execution from the stack traces and function names.

The reconstructed story looks something like this:

Analysis

We can break this up into a few rough phases and assign time to each based on the stack traces MAP reports:

  • At the start of each iteration, 7.8% of the time is spent in GPU matrix operations training the neural network. This is the workload we want to optimize for.
  • Once training for the epoch is complete, 8.6% of the time is spent in MPI_Allreduce calculating the mean of the weights. This is pretty quick!
  • Next, 15.2% of the time is spent copying Python objects, and another 22.3% of CPU time and 6.1% of GPU time is spent copying them again inside TensorFlow and sending them to the GPU. It's not immediately clear whether this comes from assigning the new weights or from re-running the model to calculate the cross-entropy / accuracy – a future version of MAP that assigns time to individual lines will be invaluable here. Either way, we can see that we are copying the entirety of the training data and the test data to the GPU every single epoch, which seems a likely candidate!
  • Finally, 16% of the time is spent in MPI_Allreduce for the sum of the error. This is a trivial amount of data, and MAP confirms it by showing an extremely low MPI transfer rate of just 30.5 kB/s during such a call:

Arm MAP MPI transfer rates

So most of the time we spent “in MPI” is actually spent while some ranks wait for others to finish copying training and test data to the GPU – a task we don't even need to perform every epoch!

Before doing anything else I decided to try eliminating that waste – only calculating and printing the error and accuracy every 30 epochs instead of every single epoch. The results were astonishing:

Speedup from reducing the frequency of progress output

That's a 2.18x speedup just by changing the frequency of our progress outputs!
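For reference, the change itself is tiny. The sketch below shows the pattern, assuming a TF 1.x-style training loop – the names (sess, train_step, accuracy, sync_weights_across_ranks, and so on) are illustrative placeholders, not the actual MaTEx code:

```python
EVAL_EVERY = 30  # evaluate and print progress every 30 epochs, not every epoch


def train(sess, train_step, accuracy, x, y_, batches, test_images, test_labels,
          sync_weights_across_ranks, rank, num_epochs):
    """Illustrative loop: the only change from the original pattern is the
    'epoch % EVAL_EVERY' guard around the expensive evaluation step."""
    for epoch in range(num_epochs):
        # Training step on this rank's shard of the data (cheap, stays on the GPU).
        batch_images, batch_labels = next(batches)
        sess.run(train_step, feed_dict={x: batch_images, y_: batch_labels})

        # Average the updated weights across ranks (MPI_Allreduce, also cheap).
        sync_weights_across_ranks()

        if epoch % EVAL_EVERY == 0:
            # The expensive part: feeding the whole test set copies it to the
            # GPU, so only do it occasionally.
            acc = sess.run(accuracy, feed_dict={x: test_images, y_: test_labels})
            if rank == 0:
                print("epoch %d: test accuracy %.4f" % (epoch, acc))
```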

Conclusion

This saving came from identifying unnecessary data transfer – a hot topic in HPC as computation becomes cheaper relative to data movement. The uptake of high-level languages like Python to “glue together” low-level C and Fortran numeric libraries carries with it the risk of accidentally making multiple implicit copies of data.

Performance Reports identifies frequent data movement as high time spent in memory accesses, and a tool like Arm MAP helps to pinpoint exactly where it happens – in this case doubling production performance.

Interestingly, the gains from this software optimization far outweigh the potential benefits of moving to top-of-the-line dedicated hardware! In this day and age, can any of us afford not to profile our code?
