In the previous post we parallelized Andrej Karpathy's policy gradient code to see whether a very simple implementation coupled with supercomputer speeds could learn to play Atari Pong faster than the state-of-the-art (DeepMind's A3C at time of writing) and even train faster in real time than the ultimate in gaming expertise: a 7-year-old child.
DeepMind's paper reports 2 hours of training time to defeat Atari Pong using 16 worker threads on a single server. We calculated that to be competitive we'd need to accelerate our simple policy gradient code 100x, to 44,000 steps/s. After some profiling and tuning we achieved 29,000 steps/s on an EC2 cluster:
It looks like DeepMind-level performance is within our grasp! Our production runs will be on ARCHER - the UK National Supercomputing Service managed by the University of Edinburgh's EPCC - a Cray XC30 with 118,080 cores.
Will this performance hold up at that sort of scale? I took a quick look to see how our 90-core cloud run looks under a profiler:
Hmm! Here the parallel profiler MAP shows that we’re spending 25% of our time just waiting in the MPI Allreduce operation. These charts in MAP have time on the horizontal axis and cores on the vertical axis. Green is compute (good) and blue is MPI (bad). These sloped blue triangles are a dead giveaway for MPI imbalance – a few processes enter MPI code early then wait for a long time until the last process arrives and finally all of them can synchronize and transfer the data.
Where does the imbalance come from here?
If you remember, we chose to synchronize the weights after each process has finished playing a game of Pong and has got its final score. Some games may be over very quickly and others can go on for a very long time! The more processes you add, the more likely it is that one game will last unnaturally long. During this period the rest of your supercomputer is just waiting.
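In code terms, the synchronization point looked something like this. This is a minimal sketch using mpi4py rather than the actual code; `play_episode`, `policy_gradient`, `update_weights`, `env` and `model` are hypothetical stand-ins for pieces of the parallelized Karpathy implementation:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

while True:
    # Each rank plays one full game of Pong (first to 21 points) before it can share anything.
    frames, actions, rewards = play_episode(env, model)        # variable length!
    grad = policy_gradient(model, frames, actions, rewards)    # flat numpy array of gradients

    # Every rank blocks here until the rank with the slowest game arrives.
    mean_grad = np.empty_like(grad)
    comm.Allreduce(grad, mean_grad, op=MPI.SUM)
    update_weights(model, mean_grad / comm.Get_size())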
Games of varying length are a problem if we want to scale efficiently. After thinking for a while about how our algorithm interacts with Pong, I realized something:
In terms of choosing whether to go up or down, the score so far in any given position doesn’t really matter. Each point is played individually – there’s no reason you need to wait until one player has 21 points before you can learn from your rewards.
If one 30-second snippet of Pong looks pretty much like any other, and variable-length episodes are causing a performance inefficiency when scaling, we can simply perform weight updates after a fixed number of steps (or indeed seconds) instead. Rather than waiting for a game to end before sharing updates, we can share whatever learning we have every K seconds, restarting games once they reach their natural conclusion. Sometimes the reward discounts will be clipped, but that just adds a small amount of unbiased variance, which the benefits of increased parallelism should more than compensate for.
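The change itself is small. Here is a sketch of the fixed-length variant under the same assumptions as before (`step_env` and the other helpers are stand-ins, and `STEPS_PER_UPDATE` is just an illustrative value):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
STEPS_PER_UPDATE = 2000   # illustrative value: fixed number of env steps per rank per update

while True:
    frames, actions, rewards = [], [], []
    for _ in range(STEPS_PER_UPDATE):
        frame, action, reward, done = step_env(env, model)   # one frame of Pong
        frames.append(frame)
        actions.append(action)
        rewards.append(reward)
        if done:
            env.reset()   # a game finished: start another and keep collecting steps

    grad = policy_gradient(model, frames, actions, rewards)

    # Every rank arrives here after exactly the same number of steps, so nobody waits long.
    mean_grad = np.empty_like(grad)
    comm.Allreduce(grad, mean_grad, op=MPI.SUM)
    update_weights(model, mean_grad / comm.Get_size())
```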
Let’s step back and think about this for a moment. I needed high performance computing (HPC) expertise to see that the performance was limited by workload imbalance between the nodes, triggered by late senders. I combined that with domain expertise in playing Pong to reach the insight that the source of the imbalance could be eliminated without harming the model’s performance (it's better to have a few learners lose a reward signal than a lot of learners sitting idle).
This dynamic between HPC expertise and domain expertise is central to both the established world of scientific computing and the newer world of deep learning. The reason Arm focuses on making tools that researchers and HPC experts alike can use is precisely to enable this kind of insight in domains far more complex than Atari Pong. In fields from climate modelling through natural language processing to real honest-to-God rocket science, teams of people work together, and tools that help them communicate effectively about correctness and performance are, well, pretty neat.
Back to our challenge, the result of this insight is an immediate speedup:
The time in communication at 90 cores drops from 25.5% down to 3.5% and our speed increases from 28k steps/s to 36k steps/s:
There are really 3 benefits here:
That latter one is interesting. Our neural network learns from a batch of examples, where each example is a screenshot, the action taken, and the discounted reward eventually received. Each episode consists of thousands of these, all of which are learned from in one single update. As we add more processes, we increase the number of examples in each update.
Beyond a certain point you hit diminishing returns by increasing the number of examples – when you’ve determined a good enough gradient, adding more samples just makes that estimate ever so slightly more accurate.
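For context, the discounted reward attached to each example is computed along the lines of the `discount_rewards` routine in Karpathy's original code (sketched here; `gamma` is the discount factor):

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Convert raw per-step rewards into discounted returns for each step."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounted = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # a point was just scored: reset (each Pong point stands alone)
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted
```

Note the reset at every non-zero reward: that is the code-level counterpart of the earlier observation that each Pong point can be learned from independently.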
The insight into performance gained here is also going to improve the scalability properties of our whole approach by allowing us to choose the number of samples per update whilst increasing the number of parallel learners to accelerate training.
With the code tested and working in the cloud, I took it to ARCHER and submitted some jobs at more interesting scales – hundreds and even thousands of cores. Running on ARCHER was a dream (thanks, EPCC and Cray!) – not only were multiple Python and numpy versions installed as standard, they were also pre-optimized for the CPU architecture of the nodes themselves. A single ARCHER node was significantly faster than the Amazon c4.8xlarge instance type, and the nodes are connected by extremely high-bandwidth, low-latency networking.
Until this point I’ve been using DeepMind’s A3C implementation as a benchmark, but the real goal was far loftier: to see if a machine could learn to play Pong faster than a 7-year-old child.
With a sample size of one, I determined that 7-year-old children need 1.3 hours of practice to defeat Atari Pong – a good bit faster than the 2 hours in DeepMind’s paper. Does applying HPC techniques and tools to our simple little algorithm outperform that?
Wow! This chart has a logarithmic scale because the pace of progress over the last few years has been so rapid, but I never even dreamt that our little parallel algorithm would do this well. With a mere 1,536 cores, our network learns to defeat Atari Pong from scratch in just 3.9 minutes. That’s less time than it takes a human to play a single game!
Here we took one of the simplest reinforcement learning models – policy gradients – and by scaling it across a large number of nodes outperformed DeepMind’s more theoretically-intricate model by an order of magnitude. We’ve only solved one Atari game and we used some domain-specific knowledge to do so, but the level of learning speed unlocked by parallel processing is truly dramatic – at this scale our model needs fewer than 100 weight updates to go from random behaviour to defeating Pong.
The unreasonable effectiveness of data has been discussed in machine learning already, in which simple algorithms like nearest-neighbour outperform more complex models when applied to sufficiently large datasets. Are we entering the age of the unreasonable effectiveness of parallelism?
Most of the refinements that Google and others applied to the naive policy gradient method to get their speedups aim to reduce the variance of the score function (one game can be seen as a sample of a stochastic process in which a bad policy can sometimes win through chance). A3C is an actor-critic model: it reduces the noise inherent here by simultaneously training a critic network to predict what the mean score would be. Our approach was equally beautiful in its simplicity: we just sample thousands of games at a time to get a mean – a very direct and scalable way to reduce the variance, and one with good convergence properties.
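That claim is just the statistics of sample means: averaging N independent, noisy per-game gradient estimates shrinks the standard error roughly as 1/sqrt(N). A toy numpy illustration (the numbers are made up; only the scaling matters):

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0     # pretend this is the true gradient of one weight
noise_std = 5.0     # a single game gives a very noisy estimate of it

for n_games in (1, 100, 10000):
    # 1000 repeated experiments, each averaging n_games noisy estimates
    estimates = true_grad + noise_std * rng.standard_normal((1000, n_games))
    batch_means = estimates.mean(axis=1)      # what the parallel learners compute
    print(n_games, batch_means.std())         # spread falls roughly as 1/sqrt(n_games)
```

In other words, the thousands of parallel games do the variance-reduction job that the critic network does in A3C.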
These remain open and interesting research questions of importance to AI in general, because it’s clear that deep learning is going multi-node at very large scale. Google has been training single models on 500+ nodes on their proprietary infrastructure for some time, to reduce wallclock training time and give their researchers an edge in innovating and finding better models. OpenAI have shared their infrastructure for scaling to hundreds of nodes on EC2. In the open science world, anyone running an HPC cluster can expect to see a surge in the number of people wanting to run deep learning workloads over the coming months.
Access to large-scale computational power - and the tools to use it successfully - is becoming increasingly vital. A new era is dawning. Who will be a part of it?
Is this a chance exception, or are there other cases in which relatively simple models run at sufficiently large scale outperform more complex but less scalable models? Will scalability become an important feature in its own right?
[CTAToken URL = "https://hpc-buy.arm.com/free-trial" target="_blank" text="Take a trial of Arm's scalable debugging and performance tools." class ="green"]