I know, I know… it’s been a while since I wrote my last Arm Compute Library blog, but I promise that your patience will be rewarded. There’s a whole host of freshly integrated functions, features and performance optimizations that I want to share with you, and I also invite you to wish the Compute Library a very happy birthday, since – unbelievably – it’s already two years old!
There’s also been a shift in guardianship: the Compute Library has now moved to mlplatform.org, under the auspices of Linaro’s Machine Intelligence Initiative. Visit its new home to follow and contribute to its day-to-day development. Read more about the launch here:
Linaro announces launch of Machine Intelligence Initiative
And so, formalities aside, allow me to roll out the red carpet for the coming 19.05 public release...
Over the last quarter, the team has continued integrating new functions for the latest deep neural networks and optimizing existing ones to deliver even better performance. Here are just some of the work items for Arm Mali GPU and Cortex-A CPU.
We’ve designed and integrated a new OpenCL tuner in the coming 19.05 release.
In fact, in the last few months we’ve seen a huge amount of interest in this simple but effective method, which finds the optimal local work-group size (LWS) for each OpenCL kernel configuration to deliver high performance on Mali GPUs.
Don’t worry if you don’t know what the LWS is or how the OpenCL tuner can be used in the Compute Library. I’ve got a couple of resources that will tell you all you need to know.
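In the meantime, here’s a minimal sketch of how the tuner is typically hooked into the CL runtime. The names below (CLTuner, CLScheduler::default_init, save_to_file) reflect my reading of the Compute Library API, so do check the release documentation for the authoritative signatures:

```cpp
#include "arm_compute/runtime/CL/CLScheduler.h"
#include "arm_compute/runtime/CL/CLTuner.h"

using namespace arm_compute;

int main()
{
    // Create the tuner and attach it to the CL runtime *before*
    // configuring any functions, so each kernel gets tuned on first run.
    CLTuner tuner;
    CLScheduler::get().default_init(&tuner);

    // ... configure and run your network here ...

    // Persist the tuned LWS values so later runs can skip tuning.
    tuner.save_to_file("acl_tuner.csv");
    return 0;
}
```

On subsequent runs you can call load_from_file() instead, paying the tuning cost only once per device.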
Until recently, the OpenCL tuner’s strategy was simple: brute force. That is, we simply tested all possible values of the LWS and picked the one that returned the minimum execution time.
As you can imagine, tuning could take several minutes, particularly for deeper networks, which made the OpenCL tuner difficult to deploy in real-world applications.
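To make the brute-force idea concrete, here’s a stripped-down sketch in plain OpenCL host code (not the library’s actual implementation): run the kernel once per candidate LWS and keep the fastest. It assumes you already have a kernel, a profiling-enabled command queue, and a list of candidate work-group sizes.

```cpp
#include <CL/cl.h>
#include <array>
#include <vector>

// Brute-force LWS search: time the kernel once per candidate local
// work-group size and keep the one with the shortest GPU execution time.
// Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE.
std::array<size_t, 2> find_best_lws(cl_command_queue queue, cl_kernel kernel,
                                    const std::array<size_t, 2> &gws,
                                    const std::vector<std::array<size_t, 2>> &candidates)
{
    std::array<size_t, 2> best{};
    cl_ulong best_time = ~cl_ulong(0);
    for(const auto &lws : candidates)
    {
        cl_event ev;
        if(clEnqueueNDRangeKernel(queue, kernel, 2, nullptr, gws.data(),
                                  lws.data(), 0, nullptr, &ev) != CL_SUCCESS)
        {
            continue; // This LWS is not valid for this kernel/device
        }
        clWaitForEvents(1, &ev);
        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, nullptr);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, nullptr);
        clReleaseEvent(ev);
        if(end - start < best_time)
        {
            best_time = end - start;
            best      = lws;
        }
    }
    return best;
}
```

Multiply that loop by every kernel configuration in a deep network and you can see where the minutes go.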
To overcome this issue, we’ve introduced the option to select the level of tuning. You can now choose from three “flavours” of tuning – EXHAUSTIVE, NORMAL and RAPID – which provide trade-offs between performance uplift and tuning time.
As the names suggest, EXHAUSTIVE offers peak performance with a high tuning time, whilst RAPID offers the shortest tuning time with reduced performance uplift. (I’ll leave you to work out what NORMAL does.)
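Building on the earlier snippet, selecting a flavour is, as I understand the new 19.05 API, a one-liner on the tuner object; the CLTunerMode enum below is my reading of the headers, so double-check before relying on it:

```cpp
CLTuner tuner;
// Pick the tuning-time/performance trade-off; NORMAL is the default.
tuner.set_tuner_mode(CLTunerMode::RAPID); // or NORMAL / EXHAUSTIVE
CLScheduler::get().default_init(&tuner);
```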
Below is a graph reporting the performance uplift for the three levels of tuning on a Mali-G76 MP10 @ 720MHz.
The performance improvement in the chart above compares the performance of the networks with each of the three tuning modes against the same networks running without the OpenCL tuner. As you can see, for most applications the NORMAL (the default) and RAPID tuning modes – the orange and grey bars respectively – will be enough to achieve a significant performance boost.
I know you’re wondering how the performance uplift for NORMAL and RAPID affects tuning time. Well, on average, NORMAL is eight times faster than EXHAUSTIVE, and RAPID fully 45 times faster. This means that in a few seconds (or maybe even less) you should have your network tuned, up and running.
Compute Library 19.05 also integrates performance optimizations for both Mali GPUs and Cortex-A CPUs, primarily targeting F32 and quantized INT8 networks.
On Mali GPUs, the main benefits for F32 networks come from the integration of Winograd 7x1/1x7 (e.g. Inception V3/V4 improved by up to 20%) and the FFT convolution layer (e.g. ResNet12 improved by up to ~40%). For quantized INT8 networks, the new GEMMLowp implementation boosts performance by up to ~20%.
On Cortex-A CPUs, the main benefit for F32 networks comes from the FFT convolution layer, which improves the performance of ResNet12 by up to 40%. For quantized INT8 networks, a new optimization in the depthwise convolution – which now fuses the offset contribution and output stage – improves the performance of MobileNet networks by up to 25%.
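For a sense of where that depthwise optimization lands, here’s a sketch of configuring a quantized 3x3 depthwise convolution with the NEON runtime. The shapes and quantization parameters are made up for illustration, and the NEDepthwiseConvolutionLayer::configure signature is my reading of the API, so treat this as a sketch rather than reference code:

```cpp
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

int main()
{
    // 3x3 depthwise convolution on quantized (QASYMM8) tensors, as used
    // throughout MobileNet. Shapes and quantization parameters below are
    // invented for illustration; layout is the default NCHW (W, H, C).
    Tensor src, weights, bias, dst;
    src.allocator()->init(TensorInfo(TensorShape(56U, 56U, 32U), 1, DataType::QASYMM8,
                                     QuantizationInfo(0.02f, 128)));
    weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 32U), 1, DataType::QASYMM8,
                                         QuantizationInfo(0.005f, 120)));
    bias.allocator()->init(TensorInfo(TensorShape(32U), 1, DataType::S32));
    dst.allocator()->init(TensorInfo(TensorShape(56U, 56U, 32U), 1, DataType::QASYMM8,
                                     QuantizationInfo(0.03f, 128)));

    NEDepthwiseConvolutionLayer depthwise;
    depthwise.configure(&src, &weights, &bias, &dst,
                        PadStrideInfo(1, 1, 1, 1)); // stride 1, pad 1

    // Allocate backing memory and fill the tensors before running.
    src.allocator()->allocate();
    weights.allocator()->allocate();
    bias.allocator()->allocate();
    dst.allocator()->allocate();
    depthwise.run();
    return 0;
}
```

The fused offset contribution and output stage happen inside run(); nothing changes at the API level, you simply get the 25% uplift for free.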
Below you’ll find several graphs summarizing the performance improvements on Arm Mali-G71/G76 GPU and on Arm Cortex-A73/A76 CPU for F32 and INT8 networks (TunerMode=EXHAUSTIVE).
Finally, a shameless plug for the Embedded Vision Summit – and specifically my presentation! If you’re not familiar with EVS, it’s an event focused on deployable smart vision applications for embedded systems which, this year, takes place next week in Santa Clara.
My session will look at a structured approach for performance analysis of deep learning software implementations, exploring a top-down approach to identifying and fixing performance bottlenecks, just like the one we fixed in ResNet12 with FFT convolution.
If you’ll be at EVS, please come and say hello! I’d be more than happy to have a chat and hear what you’re doing with the Compute Library.
Embedded Vision Summit: Performance Analysis for Optimizing Embedded Deep Learning Inference Software
If you won’t be at EVS next week, I’ll also be hosting a live Arm webinar in June on a similar topic, focused specifically on Arm devices.
Technical Webinar: Performance Analysis for Embedded Deep Learning Software Optimization
The webinar will also be an opportunity to talk more about the performance improvements explored in this blog post for Compute Library 19.05. ;)
Ciao for now!