I’ve always enjoyed playing games, but the buzz from writing programs that play games has repeatedly claimed months of my conscious thought at a time. I’m not sure that writing programs that write programs that play games is the perfect solution, but I do know that I can never resist it.
Back in 2015 DeepMind published their landmark Nature article on training deep neural networks to learn to play Atari games using raw pixels and a score function as inputs. This was pretty impressive at the time, and they spent several days of training on multiple GPUs to achieve it.
This spring my attention was caught by another interesting result: using a new approach they were able to beat Atari Pong in just 2 hours of training time on a single node by parallelizing the work across multiple concurrent agents. I knew then that I wanted to see if I could create something that learned even faster. The idea of taking a very simple implementation and running it at extremely large scale appeals to me through its combination of elegant simplicity and sheer brute force. Perhaps we can even train a system that learns faster than the ultimate in game-playing expertise: a seven-year-old child.

Fortunately, I work in High Performance Computing and have access to a wide array of supercomputers with hundreds of thousands of cores and the latest in hardware from all major manufacturers – as well as the software tools I need to write and optimize parallel applications. The game, as they say, is on!
Some of the best advice I’ve ever had in deep learning is “start with a simple model that achieves some degree of learning, then expand from there”. A3C – the approach behind that two-hour Pong result – is kinda fancy-pants from an architectural point of view, with multiple networks being trained concurrently and a lot of implementation details to get right. Just reproducing the results of the papers can be an interesting challenge, let alone improving upon them.
At the other end of the spectrum is Andrej Karpathy’s wonderful policy gradient example. It’s 130 lines of Python with only numpy and the OpenAI Gym as dependencies, written as a tutorial. One of the nice properties of policy gradients is that they’re very easy to explain:
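Karpathy’s post explains the idea far better than I can, but the gist fits in a handful of numpy lines. Here’s a stripped-down sketch of a single policy-gradient update (illustrative only – the sizes mirror his Pong setup, but this isn’t his actual code):

```python
import numpy as np

# Illustrative policy-gradient update for a Pong-like input (not Karpathy's
# actual code – names and sizes are just for the sake of the example).
rng = np.random.default_rng(0)
n_inputs, n_hidden = 80 * 80, 200                # preprocessed frame, hidden units
W1 = rng.standard_normal((n_hidden, n_inputs)) / np.sqrt(n_inputs)
W2 = rng.standard_normal(n_hidden) / np.sqrt(n_hidden)

def policy_forward(x):
    """Return P(move paddle UP) and the hidden activations for one frame."""
    h = np.maximum(0.0, W1 @ x)                  # ReLU hidden layer
    p_up = 1.0 / (1.0 + np.exp(-(W2 @ h)))       # sigmoid output
    return p_up, h

# "Play, then learn": sample an action, see how the game turned out, and nudge
# the weights so that actions which preceded a win become more likely.
x = rng.standard_normal(n_inputs)                # stand-in for a real, preprocessed frame
p_up, h = policy_forward(x)
action = 1 if rng.random() < p_up else 0         # 1 = UP, 0 = DOWN
advantage = 1.0                                  # +1 if this led to a win, -1 if a loss

dlogp = (action - p_up) * advantage              # gradient of log-probability, scaled by outcome
dW2 = dlogp * h
dW1 = np.outer(dlogp * W2 * (h > 0), x)          # backpropagate through the ReLU

learning_rate = 1e-3
W2 += learning_rate * dW2                        # gradient *ascent* on expected reward
W1 += learning_rate * dW1
```

That one `dlogp` line is doing all the work: over thousands of games, actions that tend to precede wins get reinforced and actions that tend to precede losses get discouraged.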
This technique has been well known for some time, and in fact the DeepMind team’s latest result is based on a refinement of this approach. Karpathy’s implementation is a lot simpler, and it does learn to beat Atari Pong – eventually:
So we need a 100x improvement in order to be competitive even with the state of the art, let alone with a human child. Let’s get to work!
People like to parallelize the training of a neural network in one of two ways. Model-level parallelism runs different parts of the network on different devices:
Data-level parallelism runs multiple instances of the same network on different data streams and combines their learning (usually once per training episode, or per game in this case):
This approach lends itself very well to reinforcement learning in general, and to playing Atari games in particular, because each worker can also run its own instance of the Atari emulator on its own core, increasing the speed at which we can play and learn.
Modifying Karpathy’s simple policy gradient code to run in parallel was absolutely trivial. The only parallel operation really required is to take the mean of the gradient updates computed by each process after every game and use that to update our policy network. MPI’s Allreduce operation does this concisely and efficiently:
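In the serial version, each game ends with a set of gradient buffers (one per weight matrix) that is applied to the network. In the parallel version every rank computes its own buffers from the games it played, and a single Allreduce averages them so that every rank applies the same update. A minimal sketch with mpi4py (the names are illustrative rather than the exact diff):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def average_gradients(grad_buffer):
    """Average each gradient buffer across all MPI ranks."""
    for name, local_grad in grad_buffer.items():
        summed = np.empty_like(local_grad)
        comm.Allreduce(local_grad, summed, op=MPI.SUM)  # sum over all ranks...
        grad_buffer[name] = summed / comm.Get_size()    # ...then take the mean
    return grad_buffer

# After every game, each rank has grad_buffer = {'W1': ..., 'W2': ...} computed
# from its own play. One collective call makes the buffers identical everywhere:
#     grad_buffer = average_gradients(grad_buffer)
# so every rank applies the same update and the policy networks never drift apart.
```

Each rank also creates its own copy of the Gym Pong environment, so launching with something like `mpirun -n 16 python pg_pong.py` (the script name here is mine, not Karpathy’s) plays and learns from 16 games concurrently.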
With these few extra lines of code we can now use multiple cores on a single node to see how much closer we are to our 100x speedup goal:
We need a 100x improvement, which means going from 440 steps/s to 44,000 steps/s. If we can maintain this scaling trajectory across multiple nodes we will need at least (44,000 – 220) / 257.4 = 170 cores. This seems plausible. In fact, we’ve got an Intel Xeon Phi Knights Landing in-house with 72 cores and 4 hardware threads per core – 288 threads in total. Would this workload be a good fit for that hardware?
A quick run on it was barely faster than on a regular Xeon. A look with our parallel profiler, Arm MAP, shows us why:
This workload isn’t actually a bad fit for Xeon Phi – around 50% of the time is spent in well-vectorized math (dgemv_kernel in particular) that the 512-bit vector units can sink their teeth into – but the rest of the time is spent in the Atari emulator, and that’s going to be a huge bottleneck as we run one instance of it per process.
Instead of trying to optimize the Atari emulator for the Xeon Phi’s architecture, I decided to choose a machine to match the problem. In this case, ARCHER – the UK National Supercomputing Service:
Working on a modern supercomputer is amazing – the hardware is fast, of course, but experts have also prepared highly-optimized software stacks specifically for that system! ARCHER already had optimized Python and numpy installations tuned to the CPU architecture of its nodes. I just submitted my job to the queue and sat back to wait for the result!
That’s me at the bottom running train.sh. Casually, I wonder how many jobs are ahead of me in the queue…
Oh, wow. Ok. So now I have a choice. I could fire up a cluster in the cloud and get instant interactive access to as many servers as I need – my own private supercomputer! This sounds better than waiting in the queue, but there’s also… another option (courtesy of the sublime xkcd.com):
As an aside, I’m convinced one of the reasons a lot of smart people are working in deep learning now is that compilers have become too fast, but training deep learning models can take hours or even days. What a time to be alive!
A few swordfights later and I decided to try the cloud after all. Alces Flight is HPC-in-a-box on EC2 with automatic scaling of nodes and many packages already preconfigured. After a few minutes I was ready to go:
So how does our Parallel Policy Gradient implementation fare in the cloud? I tried it on a single 18-core c4.8xlarge node first:
That’s… not quite what I was expecting. It looks like we’re hitting a scaling wall at 8 cores. After that it gets slower, not faster!
Does this put an end to our dreams for a 100x speedup? I decided to investigate with Arm MAP:
Oddly, as soon as computation starts, the CPU time drops way down. This is most unusual. What could be causing it? I clicked through to our integrated parallel debugger, Arm DDT, for a closer look:
The stack trace shows that ATLAS, the BLAS numerical library that numpy is using, is a parallel build that happily creates hundreds of short-lived pthreads on each process. These are thrashing around on the system and slowing everything down – each process is behaving as if it has all 18 cores to itself!
You can often infer behaviour like this from profiler traces too, but when things are going crazy there’s nothing as reassuring as the smoking gun of a debugger showing exactly what a program is doing wrong.
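A common fix for this kind of oversubscription is to cap the math library at one thread per MPI rank before numpy is first imported – a minimal sketch, assuming a BLAS that honours the standard threading environment variables (ATLAS generally fixes its thread count at build time, which is why the choice here was recompile-or-replace):

```python
import os

# One math thread per MPI rank: MPI already gives us one process per core, so a
# threaded BLAS spawning its own workers just thrashes the machine. These must
# be set before numpy (and the BLAS it links against) is first imported.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

import numpy as np  # imported only after the thread caps are in place
```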
Rather than recompile ATLAS in single-threaded mode, I installed Anaconda, which uses the well-behaved MKL library for matrix operations, and got a much happier scaling curve:
To reach 100x performance and be competitive with DeepMind’s A3C we need 44,000 steps/s – that would be (44,000 – 746) / 336 = 128 cores, or 8 nodes. Looking good so far, but can we maintain this performance as we move to multiple nodes?
Pretty close! The scaling prediction from our single-node experiment was 31,000 steps/s, and this isn’t too far away. It looks like DeepMind-level performance is within our grasp. Perhaps we can even exceed it – ARCHER has over 100,000 cores, after all!
We'll continue this challenge in our next blog piece, in which we take our parallel policy gradient implementation onto ARCHER and answer the greatest question of our age: how many CPUs does it take to learn how to play Atari games faster than a 7-year-old child?