In February, a new paper from Google's DeepMind team appeared on arXiv. This one was interesting – they showed dramatically improved performance and training time for their Atari-playing Deep Q-Learning network. The training speedup was so great that 16 CPU cores outperformed their previous GPU results by an order of magnitude.
Having access to a range of machines with large numbers of CPUs, I was, of course, intrigued by the possibilities. An implementation of the agents described in the new paper was already available for the Torch machine learning framework, and I wanted to see how well it performed and what kind of hardware would suit it best.
Downloading and installing Torch and Kaixhin's implementation of the paper's Asynchronous Advantage Actor-Critic algorithm was straightforward, and my 8-core laptop learned to play the Catch demo at 1,590 simulation steps/second. Is 1,590 steps/second good? I don't know! The CPU was pegged out, so maybe yes? For comparison I ran the same test on one of our 24-core Intel Xeon test machines. The results, as expected, were better:
But wait, a 25% speedup for going from an 8-core laptop to a 24-core server doesn't sound right! What's going on here?
Time for a Performance Report. Arm's tuning and analysis tool shows the breakdown of how an application uses its time without requiring instrumentation.
On each system I put "perf-report" in front of Torch's "luajit" command (Torch is written in Lua with a lot of supporting C libraries) and compared the results:
The above is the result from my laptop. Sure enough it spends all its time computing and uses all 8 hyperthreaded cores to the max.
Look at the CPU section more closely: only 4.8% of the time is spent in efficient AVX vector instructions and the rest is all scalar arithmetic! This leaves a lot of performance on the table. As most neural network operations are essentially implemented as a series of matrix multiplications, I expected this figure to be much higher.
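To make the scalar-versus-vector distinction concrete, here is a minimal sketch (plain C, not Torch code) of the kind of loop a compiler can auto-vectorize. Built with optimization and an AVX target (e.g. -O3 -mavx on GCC or Clang), the body becomes 256-bit packed instructions that handle eight floats at a time; a scalar build processes one element per instruction.

```c
/* Minimal sketch (not Torch code): a loop the compiler can auto-vectorize.
 * With `gcc -O3 -mavx` (or similar) the body becomes 256-bit packed AVX
 * instructions, processing 8 floats per iteration instead of 1. */
#include <stddef.h>

void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   /* one multiply-add per element */
}
```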
The server's results enlighten us further:
Here the default Torch options have failed to vectorize any loops whatsoever! And in both cases over half the time is spent waiting for memory accesses. The higher clock speed and increased core count of the server won't help if it's waiting for slow DRAM accesses.
To see if I could improve matters, I took a look at a MAP profile of one of the runs (replacing 'perf-report' with 'map --profile' in front of the 'luajit' command):
The heavy lifting (AddMM) is being performed by a custom sgemm implementation instead of a high-performance hardware-tuned version. So why did the server perform worse on a core-for-core basis? Why couldn't it make any use of the AVX vector instructions required to approach peak CPU performance?
Here the Torch install process has used a different BLAS implementation – and an inferior one at that!
There are many different high-performance matrix multiplication libraries available. Each has its own quirks and benefits. I've had good results with Intel's MKL library on Xeon CPUs before and decided to try it again here.
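To make the difference concrete, here is a rough sketch (generic CBLAS code, not Torch internals) of the gap between a hand-rolled matrix multiply and a call through the standard CBLAS interface. MKL, OpenBLAS and most other BLAS libraries provide cblas_sgemm, so the same call can run dramatically faster depending purely on which library the build links against.

```c
/* Sketch: the same single-precision matrix multiply (C = A * B) written
 * as a naive triple loop and as a call into whichever tuned BLAS library
 * is linked in via the standard CBLAS interface. */
#include <cblas.h>

/* A is m x k, B is k x n, C is m x n; all row-major. */
void matmul_naive(int m, int n, int k,
                  const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];  /* scalar, strided reads of B */
            C[i * n + j] = acc;
        }
}

void matmul_blas(int m, int n, int k,
                 const float *A, const float *B, float *C)
{
    /* alpha = 1, beta = 0: the tuned, vectorized kernel does all the work. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
}
```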
Telling Torch to use the Intel MKL library for matrix multiplication was straightforward and the results on one of our 24-core test nodes were much improved:
Now we're talking! That's why it's always worth remembering all of Donald Knuth's famous quote:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
So are we done now? Let's look at the Performance Report of the new optimized version:
The MKL multiplication kernels are accomplishing the same amount of work in much less time – the power of using the CPU's wide vector units. Now the time spent in memory accesses really dominates the computation! This has become a memory-bound problem. Does MAP give us any hints about how to improve this?
Indeed, the vectorized matrix operations are so fast that a significant amount of time is now spent inside memory-heavy Lua glue code, such as the luaT_toudata, luaT_checkudata and lua_call functions in the screenshot above.
We could try to optimize the Lua code that glues the matrix multiplications together, but at that point you're usually better off rewriting it in C.
The whole point of a scripting language is that it's easy to write and maintain and the overhead is ideally negligible because there's so much work to do in the deep math routines.
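For a sense of what that glue layer looks like, here is a heavily simplified sketch of a C function exposed to Lua. It uses the standard Lua 5.1 C API rather than Torch's own luaT_* helpers (which play a similar role), and the Tensor struct and metatable name are made up for illustration. The point is that every call pays for argument checking, unwrapping and boxing before any arithmetic happens; that cost is negligible when the tensors are large, but noticeable when they are tiny.

```c
/* Simplified sketch of a Lua/C binding (not actual Torch code).
 * Uses the standard Lua 5.1 C API; the Tensor struct and the
 * "demo.Tensor" metatable name are hypothetical. */
#include <lua.h>
#include <lauxlib.h>

typedef struct { float *data; long n; } Tensor;

static int tensor_sum(lua_State *L)
{
    /* Type-check and unwrap the userdata argument: this bookkeeping is
     * pure per-call overhead compared with the numeric loop below. */
    Tensor *t = (Tensor *)luaL_checkudata(L, 1, "demo.Tensor");

    double acc = 0.0;
    for (long i = 0; i < t->n; ++i)
        acc += t->data[i];

    lua_pushnumber(L, acc);   /* box the result back into a Lua value */
    return 1;                 /* number of values returned to Lua */
}
```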
In this case the real question is: does the ratio of time spent inside the math routines increase when we start running a more interestingly-sized model? Let's move away from the toy Catch example and use the same code to train a larger network to play Atari Pong from raw pixel data.
This was clearly going to take a rather long time on my laptop, which managed just 128 simulation steps per second. Here's how the two machines stacked up:
We can see that the Xeon maintains a strong performance advantage, as expected – note that in this and later results the MKL library is always used where available. Previously the Lua script was becoming a bottleneck. Has this changed with the larger run?
Yes – it's not the Lua code that's the bottleneck now. It's a range of hand-coded matrix max and sqrt routines (THFloatTensor_max in the screenshot above and THFloatTensor_sqrt below):
The common factor in these macros is inefficient memory access patterns coupled with zero vectorization, seen in the time breakdown for the highlighted line in the right-hand panel.
Additionally, some spend significant time deallocating memory inside these macros, seen here as the THFree time in the Stacks display below:
Let's have a look at one of these macros in more detail:
No wonder the compiler can't vectorize this loop with all that branching going on. Whatever the intent, this is a very inefficient way to apply a max or sqrt operation to a matrix. We can also see that the 5% of time spent in the free call is entirely unnecessary:
The highlighted line above allocates a small amount of memory from the heap and frees it again at the end of the macro. Allocating from the heap requires thread locking! There's no need to allocate this small array of counters on the heap unless the number of dimensions is unusually large.
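To see why both problems bite, here is a heavily simplified sketch of the pattern described above (illustrative only, not the actual Torch macro). The generic path walks the tensor through a heap-allocated array of per-dimension counters and branches on every element, which defeats vectorization and adds a malloc/free pair per call; a contiguous tensor could instead be handled by a flat loop that the compiler vectorizes and that never touches the heap.

```c
/* Heavily simplified sketch of the pattern described above
 * (illustrative only; not the actual Torch macro code). */
#include <math.h>
#include <stdlib.h>

/* Generic path: per-element branching over heap-allocated counters
 * prevents vectorization, and the calloc/free pair means heap (and
 * therefore lock) traffic on every call. */
static void apply_sqrt_generic(float *data, const long *sizes,
                               const long *strides, int ndim)
{
    long *counter = calloc(ndim, sizeof(long));  /* heap allocation per call */
    long offset = 0;
    for (;;) {
        data[offset] = sqrtf(data[offset]);
        int dim = ndim - 1;
        while (dim >= 0) {                       /* branchy index bookkeeping */
            counter[dim]++;
            offset += strides[dim];
            if (counter[dim] < sizes[dim])
                break;                           /* keep going in this dimension */
            offset -= counter[dim] * strides[dim];
            counter[dim] = 0;
            dim--;                               /* carry into the next dimension */
        }
        if (dim < 0)
            break;                               /* walked the whole tensor */
    }
    free(counter);                               /* analogous to the THFree time seen in MAP */
}

/* Contiguous fast path: a flat loop with no branches and no heap
 * traffic, which the compiler can vectorize. */
static void apply_sqrt_contiguous(float *data, long n)
{
    for (long i = 0; i < n; ++i)
        data[i] = sqrtf(data[i]);
}
```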
There are clearly some excellent optimization targets in the Torch library. But why hasn't Facebook – or the wider ML community – found them before?
The answer is that machine learning frameworks are becoming little operating systems in their own right, with scheduling, cross-platform optimization and many other complexities.
Here we have exactly the same Lua code driving two differently-sized neural networks and we're seeing a completely different set of performance constraints between the models.
If you want to have good performance, you need to be ready to look under the covers and see where the bottlenecks for your model are.
Ok, so we've taken the low-hanging fruit from the computational optimization already. We have a pretty well-parallelized algorithm which, unfortunately, is memory bound. Is there different hardware that would run this model better, without rewriting parts of Torch?
Well, the Intel Xeon Phi Knights Landing has 16GB of high-bandwidth memory close to the cores. The memory sections in our Performance Reports tell us that our dataset would fit entirely within it. As thread synchronization does not appear as a bottleneck, we can also expect performance to scale across its larger number of cores.
This is a valuable prediction, so let's run the experiment. Can we get better performance on a different hardware architecture?
Yes we can! Each core in the new-generation Intel Xeon Phi is less powerful than a normal Xeon core, but that's made up for in three ways: many more cores per socket, wider vector units in each core, and the 16GB of high-bandwidth on-package memory that keeps them fed.
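As an aside, that on-package memory (MCDRAM) can be used in two ways: transparently, as a large cache in front of DRAM, which is what matters here because it needs no source changes; or explicitly, by allocating hot buffers into it through the memkind library's hbwmalloc API. The sketch below shows the explicit route purely for reference; it assumes the memkind library is installed and was not needed for this experiment.

```c
/* Aside (not used in this experiment): explicitly placing a buffer in
 * MCDRAM via the memkind library's hbwmalloc API.  Build with -lmemkind. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* hbw_check_available() returns 0 when high-bandwidth memory is
     * present and usable on this node. */
    if (hbw_check_available() != 0) {
        fprintf(stderr, "no high-bandwidth memory available\n");
        return EXIT_FAILURE;
    }

    /* Allocate a 1 GiB working buffer directly in MCDRAM. */
    size_t bytes = (size_t)1 << 30;
    float *buf = hbw_malloc(bytes);
    if (!buf) {
        fprintf(stderr, "hbw_malloc failed\n");
        return EXIT_FAILURE;
    }

    /* ... fill and use the buffer as usual ... */

    hbw_free(buf);
    return EXIT_SUCCESS;
}
```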
We analyzed the performance of a top machine learning library, Torch, on a new CPU-friendly training method published by Google's DeepMind team.
Performance Reports showed us that the default Torch install was badly tuned for numerical computation and MAP pinpointed the SGEMM routines. Replacing those with hardware-optimized ones boosted performance by 3x.
Arm MAP now showed that more time was spent running Lua script code than performing numerical computation, a sign that the problem size was too small to run efficiently on Torch. We moved up to a larger scale problem (training a network to play Atari Pong) and saw that changing the layout of the neural networks changed the bottlenecks hit in the implementation. Now several inefficient hand-coded matrix routines inside macros were memory-bound.
We now had a choice – rewrite core Torch code or run on a system with faster memory performance. The Performance Reports results suggested the data would fit inside the high-bandwidth memory on an Intel Xeon Phi Knights Landing. When we ran on the new architecture we achieved a further 2.33x speedup.
Our conclusion is that Performance Reports and Arm MAP will help data scientists and developers determine whether their deep learning algorithms are running efficiently or are hitting limitations in their current framework.
These results will help you choose better libraries, frameworks, hardware or instances to match the needs of your current network topology, giving order-of-magnitude speedups on cutting-edge algorithms.
A key problem facing researchers and companies is that data scientists are domain experts, not low-level performance engineers. Arm Performance Reports is the tool they need to get the most out of their machine learning frameworks.