Tags
  • High Performance Computing (HPC)
  • HPC Tools
  • Deep Learning
  • Development Tools
  • Arm MAP
  • Arm Performance Reports
  • google
  • infrastructure

Deep Learning Episode 2: Scaling TensorFlow over multiple EC2 GPU nodes

Mark O'Connor
August 3, 2016
6 minute read time.

In episode one we optimized Torch A3C performance on the new Intel Xeon Phi (Knights Landing) CPU. Arm MAP and Performance Reports identified bottlenecks in our framework and helped us speed up model training by 7x.

To get further gains we found areas of the Torch framework that could benefit from optimization, but that requires a level of time investment most of us don't have.

Fortunately, there's another way to train models faster – add more hardware! Vishnu, Siegel and Daily at the Pacific Northwest National Laboratory published a paper in which they scaled TensorFlow over multiple nodes using the high-performance MPI library rather than TensorFlow's built-in RPC work-sharing mechanism.

Not everybody has access to an InfiniBand-backed supercomputer, but anyone can launch a fleet of EC2 GPU instances for a few dollars per hour. Can we get similar speedups in the cloud? I decided to find out.
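For context, the core of that approach is straightforward: each MPI rank trains on its own shard of the data, and the model weights (or gradients) are averaged across ranks with an allreduce instead of being routed through TensorFlow's RPC layer. Below is a minimal sketch of the idea, assuming mpi4py and NumPy – it is illustrative only, not the MaTEx implementation:

```python
# Minimal sketch of MPI-style data-parallel weight averaging (illustrative,
# not the MaTEx code). Assumes mpi4py and NumPy; each rank has already
# computed its own copy of the weights on its shard of the data.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD


def average_weights(local_weights):
    """Average a list of NumPy weight arrays across all MPI ranks."""
    averaged = []
    for w in local_weights:
        summed = np.zeros_like(w)
        comm.Allreduce(w, summed, op=MPI.SUM)      # sum this array over every rank
        averaged.append(summed / comm.Get_size())  # divide by rank count -> mean
    return averaged
```

Launched under mpirun, every rank executes the same script and the allreduce keeps their copies of the model in sync.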

The Stack

The PNNL team provided their MPI + TensorFlow implementation online as part of their Machine Learning Toolkit for Extreme Scale, so I fired up a small fleet of EC2 g2.2xlarge instances and got to work.

The paper looked at scaling over multiple CPU cores; I wanted to see whether their results held up when using GPUs for deep learning, and how using EC2 instances affected the scalability. I used the same MNIST classification task as the original paper – a small model trained rapidly over a fixed number of iterations:

MNIST classifications
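To give a sense of the scale of the workload, a model of this size looks something like the classic single-layer softmax MNIST classifier below. This is an illustrative TF 1.x-style sketch only – the exact architecture used in the paper and in MaTEx may differ:

```python
# Illustrative small MNIST classifier (TF 1.x API) -- roughly the scale of
# model being benchmarked here, not the actual MaTEx implementation.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 input images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot class labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```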

Let's look at our GPU cluster's results side-by-side with the speedups reported by the paper for CPU cores:

Speedup factors compared to published results

At first glance this appears to be a bad result for our system in the cloud. However, the paper only reports scaling per CPU core, not per node. It would be more honest to compare one multi-core CPU node to one GPU node. Unfortunately, the paper's authors didn't specify the number of workers per node they used. If we make a very conservative assumption of 4 workers per node, we see the following:

Speedup comparison at 4 procs per node, 1 GPU per node

Despite running in the cloud, and without the benefit of low-latency InfiniBand networking, we see a better speedup using GPUs! This is probably because one GPU performs rather more work than a handful of CPU cores, effectively making our EC2 nodes “fatter”. An interesting result – choosing the right hardware for the problem may be more important for scalability than using whatever local hardware is available.

Performance Report: Is this running well?

So we've shown that we can get a better multi-node parallel speedup if we use GPU nodes on EC2. Would using an on-site cluster be better? How much better? Let's look at a Performance Report to see:

Using a site cluster performance report

This is a report of the 32-node run. We can see a few interesting things right away:

  • The time spent in MPI communication (30.2%) isn't very high, considering we seem to have reached the point of diminishing speedup returns. The networking speed does not appear to be the bottleneck either – 10 Gigabit Ethernet can do rather better than 11.8 MB/s!
  • The GPU utilization is very low – just 8.5% on average.
  • The vast majority of wallclock time isn't spent waiting for the GPU or communicating over the network, but shuffling memory on the CPU.

Performance Reports can also export to CSV (and now JSON) format, so let's see how our resources are spent overall as the number of nodes increases:

Breakdown of total cores
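A breakdown like the one above can be pulled together from the per-run CSV exports with a few lines of pandas. The sketch below is an assumption-heavy illustration – the file naming and column names ("CPU memory accesses", "MPI", "Accelerator") are placeholders, so check the headers of your own exports:

```python
# Sketch: aggregate Performance Reports CSV exports from runs at different
# node counts into one table. File naming and column names are assumptions
# made for illustration -- adjust them to match your actual exports.
import glob
import re

import pandas as pd

rows = []
for path in glob.glob("run_*_nodes.csv"):            # hypothetical file naming
    nodes = int(re.search(r"run_(\d+)_nodes", path).group(1))
    report = pd.read_csv(path)
    rows.append({
        "nodes": nodes,
        "memory_access_pct": report["CPU memory accesses"].iloc[0],  # assumed column
        "mpi_pct": report["MPI"].iloc[0],                            # assumed column
        "gpu_pct": report["Accelerator"].iloc[0],                    # assumed column
    })

breakdown = pd.DataFrame(rows).sort_values("nodes")
print(breakdown)
```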

The amount of time spent in memory accesses increases dramatically as the number of nodes increases. It looks as if MPI communication will eventually overtake it, but even at 32 nodes we're spending much more time accessing memory than transferring data across the network. Why does this increase so quickly? What extra work are we doing?

Diving Deeper with MAP

The advice from Performance Reports was to use a profiler to investigate further – let's do exactly that. I used Arm MAP to record a 32-node run and uncovered a surprising story:

32-node run in Arm MAP

Here I've zoomed in to look at just two iterations. Arm MAP can't assign the time to individual lines of code in Python (yet), but despite that it's fairly straightforward to deduce the flow of execution from the stack traces and function names.

The reconstructed story looks something like this:

Analysis

We can break this up into a few rough phases and assign time to each based on the stack traces MAP reports:

  • At the start of each iteration, 7.8% of the time is spent in GPU matrix operations training the neural network. This is the workload we want to optimize for.
  • Once training for the epoch is complete, 8.6% of the time is spent in MPI_Allreduce calculating the mean of the weights. This is pretty quick!
  • Next, 15.2% of the time is spent copying Python objects, and another 22.3% of CPU time and 6.1% of GPU time is spent copying them again inside TensorFlow and sending them to the GPU. It's not immediately clear whether this comes from assigning the new weights or from re-running the model to calculate the cross-entropy / accuracy – a future version of MAP that assigns time to individual lines will be invaluable here. Either way, we can see that we are copying the entirety of the training data and the test data to the GPU every single epoch, which seems a likely candidate!
  • Finally, 16% of the time is spent in MPI_Allreduce for the sum of the error. This is a trivial amount of data, and MAP confirms it by showing an extremely low MPI transfer rate of just 30.5 kB/s during such a call:

Arm MAP MPI transfer rates

So most of the time we spent “in MPI” is actually spent while some ranks wait for others to finish copying training and test data to the GPU – a task we don't even need to perform every epoch!

Before doing anything else I decided to try eliminating that waste – only calculating and printing the error and accuracy every 30 epochs instead of every single epoch. The results were astonishing:

Speedup from reducing the frequency of progress output

That's a 2.18x speedup just by changing the frequency of our progress outputs!
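For reference, the change itself is tiny. The sketch below shows the pattern, assuming a TF 1.x-style training loop – the names (sess, train_step, accuracy, sync_weights_across_ranks, and so on) are illustrative placeholders, not the actual MaTEx code:

```python
EVAL_EVERY = 30  # evaluate and print progress every 30 epochs, not every epoch


def train(sess, train_step, accuracy, x, y_, batches, test_images, test_labels,
          sync_weights_across_ranks, rank, num_epochs):
    """Illustrative loop: the only change from the original pattern is the
    'epoch % EVAL_EVERY' guard around the expensive evaluation step."""
    for epoch in range(num_epochs):
        # Training step on this rank's shard of the data (cheap, stays on the GPU).
        batch_images, batch_labels = next(batches)
        sess.run(train_step, feed_dict={x: batch_images, y_: batch_labels})

        # Average the updated weights across ranks (MPI_Allreduce, also cheap).
        sync_weights_across_ranks()

        if epoch % EVAL_EVERY == 0:
            # The expensive part: feeding the whole test set copies it to the
            # GPU, so only do it occasionally.
            acc = sess.run(accuracy, feed_dict={x: test_images, y_: test_labels})
            if rank == 0:
                print("epoch %d: test accuracy %.4f" % (epoch, acc))
```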

Conclusion

This saving came from identifying unnecessary data transfer – a hot topic in HPC as computation becomes cheaper relative to data movement. The uptake of high-level languages like Python to “glue together” low-level C and Fortran numeric libraries carries with it the risk of accidentally making multiple implicit copies of data.

Performance Reports identifies frequent data movement as high time spent in memory accesses, and a tool like Arm MAP helps to pinpoint exactly where it happens – in this case doubling production performance.

Interestingly, the gains from this software optimization far outweigh the potential benefits of moving to top-of-the-line dedicated hardware! In this day and age, can any of us afford not to profile our code?
