Why GPUs and Machine Learning are a Perfect Match

Way back at the start of the year I asked you what you thought the hot topics of tech were going to be in 2017 and which would take the greatest leaps forward. Whilst there was a lot of support for Virtual Reality (and not quite as much for Augmented Reality as I might have thought), the standout was Machine Learning (ML), with 57% of you betting it was going to have a massive impact on the technology industry as a whole.

Which tech will see the greatest advancements in 2017?

Since then, you’ve undoubtedly seen that you were right: it’s cropping up just about everywhere. Huawei’s Mate 9, launched late last year with impressive ML capabilities to personalise and enhance the user experience, really kicked things off, and from there the trend has exploded. This year has already seen the launch of Arm’s DynamIQ technology and the Computex announcements of the brand new Cortex-A75 and Cortex-A55 CPUs and the Mali-G72 GPU, all designed and optimised for ML. We’ve talked about how the new Mali-G72 GPU provides better than ever performance for on-device ML, but we haven’t talked much about why it’s so good, so that’s what we’re going to look at now.

What does graphics have to do with Machine Learning?

A big part of this ability is down to how GPUs fundamentally work. Whilst they excel at graphics workloads, it doesn’t stop there. They are highly programmable by nature and, in contrast to CPUs, which typically work best with serial workloads, they are built for parallel workloads. This of course comes from the original graphics workloads, which require each pixel on screen to be handled in parallel. So how does this apply to ML? Well, ML is mainly (though not exclusively) based on an absolutely vast quantity of vector-matrix and matrix-matrix multiplications. Handling this enormous amount of data doesn’t sit too well in a serial workload, but it is perfect for the data-heavy, parallel load of the GPU. It’s not just about parallel vs serial, however: the GPU is also programmable, so it adapts easily to running new types of neural networks through languages like OpenCL. With emerging technologies like ML, it can be really difficult to predict the future direction of the industry and what may be required in just a few short years. This means flexibility and programmability are key when investing in IP that won’t get to consumer devices for a year or more.
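To make the parallel point concrete, here is a minimal Python sketch (not Arm code, and not how a GPU is actually programmed) of the matrix-vector product at the heart of a neural network layer. The function names, the weights and the inputs are all made up for illustration. The thing to notice is that every output value depends only on its own row of weights and the shared input, so the outputs can be computed in any order, or all at once. Python threads won't actually make this faster; the thread pool is only there to show that the work splits cleanly into independent tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, vec):
    """One output element: depends only on its own row and the shared input."""
    return sum(w * x for w, x in zip(row, vec))


def matvec_serial(weights, vec):
    # Serial style: compute one output after another.
    return [dot(row, vec) for row in weights]


def matvec_parallel(weights, vec, workers=4):
    # Parallel style (in spirit): each output is an independent task that
    # can be handed to a separate worker. map() returns results in row order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: dot(row, vec), weights))


# Hypothetical 4x3 weight matrix and 3-element input vector.
weights = [[1.0, 2.0, 3.0],
           [0.5, 0.0, -1.0],
           [2.0, 2.0, 2.0],
           [-1.0, 1.0, 0.0]]
inputs = [1.0, 2.0, 3.0]

serial_out = matvec_serial(weights, inputs)
parallel_out = matvec_parallel(weights, inputs)
print(serial_out == parallel_out)
```

Both versions give the same answer because no output needs to wait for any other, which is exactly the property a GPU exploits across its many execution lanes.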

We’ve known about the GPU’s capability for ML for some time. Last year we showed a Luxoft object recognition demo which uses the Caffe framework to identify a huge variety of objects and indicate how confident it is in each result. The bit you might not have noticed is that the demo can be switched between running on the Arm Mali GPU and the Arm Cortex CPU. On average, switching to the GPU effectively doubled performance, with significant scope for further optimisation.

We didn’t stop there though. Recognising this ability is one thing; exploiting it is another. That’s why our recently released Mali-G72 high performance GPU has been specifically designed to target substantial improvements in compute efficiency. With Mali-G72, the majority of the focus was put on the arithmetic units of the GPU. The Execution Engines in the Bifrost GPU architecture were refined to consume less power and area whilst at the same time increasing performance. Using general matrix multiplication (GEMM) as a proxy for ML, we managed to improve efficiency by 17% for FP16 precision GEMM.
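For anyone who hasn't met GEMM before, the sketch below shows the operation in its usual BLAS-style form, C = alpha * A * B + beta * C, written as plain Python loops. This is a reference illustration only: it runs in Python's double precision rather than FP16, it says nothing about how Mali-G72 executes the operation, and the function name and sample matrices are invented for the example. It also counts the multiplies and adds in the inner loop, which is why GEMM makes a handy proxy: the work is almost entirely multiply-add arithmetic.

```python
def gemm(alpha, a, b, beta, c):
    """Return (alpha * A*B + beta * C, multiply/add count) for lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    flops = 0
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply and one add
                flops += 2
            out[i][j] = alpha * acc + beta * c[i][j]
    return out, flops


# Small hypothetical inputs.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]

result, flops = gemm(1.0, a, b, 1.0, c)
print(result, flops)
```

The inner-loop count grows as 2 x M x N x K, so even modest layer sizes turn into millions of multiply-adds.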

Mali-G72 machine learning efficiency

Where the Machine Learning magic happens

How did we achieve this? In the Bifrost arithmetic unit, we have one Fused Multiply-Add (FMA) unit and one ADD unit per execution engine. Both units are capable of handling different types of instructions, but the idea is that the FMA handles the heavy floating-point operations, whilst the ADD handles simpler housekeeping operations and functions as a special function unit. In Mali-G72 we rebalanced the units slightly and moved several instructions from the FMA to the ADD, making the FMA smaller and less power-hungry, while at the same time making sure it could handle performance-critical instructions, such as FP32 fused multiply-add, at full throughput.
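If you're wondering what makes a multiply-add "fused", it computes a * b + c with a single rounding at the end, rather than rounding the product first and the sum second. The sketch below emulates that in Python using exact fractions. It is an illustration of the arithmetic only: it uses Python's double-precision floats rather than the FP16 or FP32 formats discussed here, it is not a model of the Bifrost hardware, and the function names and input values are chosen purely to make the difference visible.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    # Two roundings: the product is rounded to a float, then the sum is.
    return a * b + c


def fused_multiply_add(a, b, c):
    # One rounding: form a*b + c exactly, then round once at the end.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs picked so that the exact product (1 - 2**-54) cannot be stored
# in a double and gets rounded to 1.0 in the unfused version.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

The unfused version loses the small residual entirely, while the fused version keeps it. A dot product is a long chain of exactly this operation, which is why the FMA unit matters so much for GEMM.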

Additionally, we have increased the L1 cache in the execution engine, which is beneficial to compute workloads such as GEMM in that it reduces the number of external memory reads that are needed. This is important as ML is about processing data, lots of it, so any data you can avoid moving in and out of external memory is a gain in both efficiency and performance.
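Here is a rough sketch of why more local storage means fewer external reads for a workload like GEMM. It multiplies two square matrices tile by tile: each tile of the inputs is fetched once into a local buffer (standing in for a cache) and then reused for a whole tile of outputs. This is a deliberately simplified model with made-up names and sizes. It ignores writes to the result, cache lines and everything else about real hardware, and it is not a description of how the Mali-G72 cache behaves.

```python
def matmul_counting_reads(a, b, tile):
    """Multiply square matrices tile by tile, counting simulated external reads.

    tile=1 models no reuse at all; larger tiles model a larger local buffer.
    """
    n = len(a)
    assert n % tile == 0, "sketch assumes the tile size divides n"
    c = [[0.0] * n for _ in range(n)]
    external_reads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Fetch one tile of each input from "external memory".
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                external_reads += 2 * tile * tile
                # Every further access uses the local copies.
                for i in range(tile):
                    for j in range(tile):
                        acc = c[i0 + i][j0 + j]
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] = acc
    return c, external_reads


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix

c_small, reads_small = matmul_counting_reads(a, b, tile=1)
c_large, reads_large = matmul_counting_reads(a, b, tile=4)
print(reads_small, reads_large)
```

In this model the read count is 2 x n^3 / tile, so a buffer big enough for 4x4 tiles needs a quarter of the external reads of the no-reuse case while producing the same result.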

There are other incremental improvements throughout the GPU that help with ML, but the two above are the biggest contributors to the ML efficiency gain in Mali-G72. In designing our next generation of graphics processors we always try to look ahead at the upcoming trends and requirements of the technology industry. This foresight and planning allow us to be ahead of the game in supporting the very latest technologies before they’re even fully realised.