We are pleased to announce the latest quarterly public release of the Compute Library, version 17.9. In this blog, I will highlight some of the new features and functions that we have added.
The key additions you will find in this release are:
We have added many new functions addressing the needs of developers who are targeting Arm-based platforms. These new routines are written in OpenCL C and in C (making use of NEON intrinsics).
OpenCL C (targeting Mali GPUs):
Direct convolution is an alternative approach to computing the convolution layer, based on the classic sliding-window method. On Mali GPUs implementing the Bifrost architecture it can significantly improve the performance of CNNs (we have observed up to 1.5x better performance on AlexNet using direct convolution).
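To make the sliding-window idea concrete, here is a minimal, unoptimised C++ sketch of a direct convolution for a single channel, stride 1 and no padding. It is illustrative only (the function name and layout are ours); the library's actual kernels are written in OpenCL C and are tiled and vectorised for Bifrost.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: naive single-channel, stride-1, no-padding direct
// convolution using the classic sliding-window approach.
std::vector<float> direct_conv2d(const std::vector<float> &input, size_t in_h, size_t in_w,
                                 const std::vector<float> &kernel, size_t k) // e.g. k = 3 or 5
{
    const size_t out_h = in_h - k + 1;
    const size_t out_w = in_w - k + 1;
    std::vector<float> output(out_h * out_w, 0.0f);

    for (size_t y = 0; y < out_h; ++y)         // slide the window over rows...
        for (size_t x = 0; x < out_w; ++x)     // ...and columns
        {
            float acc = 0.0f;
            for (size_t ky = 0; ky < k; ++ky)  // accumulate over the k x k window
                for (size_t kx = 0; kx < k; ++kx)
                    acc += input[(y + ky) * in_w + (x + kx)] * kernel[ky * k + kx];
            output[y * out_w + x] = acc;
        }
    return output;
}
```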
There are many machine learning use-cases where it is possible to reduce computational precision in order to improve efficiency and performance. This has been an important area of focus for our engineers in the last quarter. We have implemented new versions of existing functions using lower precision, such as 8-bit and 16-bit fixed point, for both CPU and GPU (a short sketch of what fixed-point representation involves follows the list below).
GPU (OpenCL) - 8-bit fixed point
GPU (OpenCL) - 16-bit fixed point
NEON - 16-bit fixed point
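As a rough illustration of fixed-point arithmetic, the C++ sketch below shows a generic 8-bit Q-format: real values are scaled by 2^frac_bits and stored in a signed byte, and products are shifted back down before being narrowed again. This is a conceptual example with our own helper names, not the library's conversion routines.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: 8-bit fixed point with a configurable number of
// fractional bits (a Q-format). Lower-precision kernels operate on values
// stored this way instead of 32-bit floats.
int8_t float_to_q8(float value, int frac_bits)
{
    const float scaled  = value * static_cast<float>(1 << frac_bits);
    const float clamped = std::min(std::max(std::round(scaled), -128.0f), 127.0f);
    return static_cast<int8_t>(clamped);
}

float q8_to_float(int8_t value, int frac_bits)
{
    return static_cast<float>(value) / static_cast<float>(1 << frac_bits);
}

// Multiplying two Q-format values doubles the fractional bits, so the wide
// product is shifted back down and saturated before narrowing to 8 bits.
int8_t q8_mul(int8_t a, int8_t b, int frac_bits)
{
    const int32_t wide    = static_cast<int32_t>(a) * static_cast<int32_t>(b);
    const int32_t shifted = wide >> frac_bits;
    return static_cast<int8_t>(std::min<int32_t>(std::max<int32_t>(shifted, -128), 127));
}
```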
When we started the Compute Library project, our primary purpose was to share a comprehensive set of low-level functions for computer vision and machine learning that provide good performance - but, most importantly, that are reliable and portable. The library is there to reduce the cost and effort for developers and partners targeting Arm processors, whilst also behaving well across the many system configurations that our partners implement. This is why we chose NEON intrinsics and OpenCL C as the target languages. However, there are cases where it is critical to extract every ounce of performance from the hardware. We therefore looked at adding low-level primitives to the library optimised using hand-coded assembly tailored to the micro-architecture of the target CPU.
To decide what functions we should focus on, our Research team has been investigating machine learning workloads, using the Caffe framework.
The three workloads used were:
The following diagram shows the instruction profile of these workloads:
Our team found that around 50-80% of the computation for these networks was inside the SGEMM function, which multiplies two single-precision floating-point matrices together. A few other functions showed up as well, such as power functions and a function that converts the dimensions of a matrix. The rest of the computation was spread in a long-tail distribution.
One trend you can see is that the larger networks tend to have a higher proportion of SGEMM, although this likely has more to do with the configuration of the layers than with the size as such. What we can draw from this is that matrix multiplication is really important for neural networks: if there is one target function to optimise, this is it.
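For reference, SGEMM computes C = alpha * A * B + beta * C on single-precision matrices; frameworks such as Caffe lower convolution layers to exactly this kind of matrix multiplication, which is why it dominates the profile. A naive C++ version (ours, purely for illustration) looks like the sketch below; optimised kernels compute the same result but block the loops to keep sub-tiles of A and B in registers and cache.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: textbook SGEMM, C = alpha * A * B + beta * C, with
// A (M x K), B (K x N) and C (M x N) stored in row-major order.
void sgemm(size_t M, size_t N, size_t K, float alpha,
           const std::vector<float> &A, const std::vector<float> &B,
           float beta, std::vector<float> &C)
{
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
        {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)      // dot product of row i of A and column j of B
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
}
```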
In this release of the library we have added a CPU assembly-optimised version of SGEMM (FP32) for Cortex-A53 and Cortex-A72 processors. The performance of these routines will vary depending on the platform, but in our tests we are seeing consistently good improvements. For example, we ran the AlexNet benchmark on a Firefly board (64-bit, multi-threaded) and measured a performance uplift on Cortex-A72 of approximately 1.6x.
The following table shows a sample of our benchmarks using the new optimised routines on the same platform.
In our 17.6 release blog, I indicated our plans to implement support for new and upcoming architectural features designed for machine learning in our CPUs, starting with FP16 in ARMv8.2 CPUs. You can find more details in this blog.
We are pleased to announce the addition of new functions to the library targeting ARMv8.2 FP16:
Whilst these functions have not been aggressively optimised for performance (they are written using NEON intrinsics rather than hand-optimised assembly), they provide a significant performance uplift compared to computing in FP32 and having to convert between formats. The following table shows some of the workloads side by side and illustrates how we are able to compute using fewer cycles by merit of the new v8.2 CPU instructions.
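To give a flavour of what the v8.2 FP16 instructions enable at the intrinsics level, here is a minimal sketch (the function name and loop structure are ours, not the library's). With the FP16 extension enabled, eight half-precision lanes can be multiplied and accumulated per instruction, with no round trip through FP32.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Illustrative only: requires Armv8.2-A FP16 vector arithmetic, e.g. compile
// with -march=armv8.2-a+fp16. Computes acc[i] += a[i] * b[i] directly in
// half precision. Assumes len is a multiple of 8 to keep the sketch short.
void fp16_multiply_accumulate(const __fp16 *a, const __fp16 *b, __fp16 *acc, size_t len)
{
    for (size_t i = 0; i < len; i += 8)
    {
        float16x8_t va = vld1q_f16(a + i);   // load 8 half-precision values
        float16x8_t vb = vld1q_f16(b + i);
        float16x8_t vc = vld1q_f16(acc + i);
        vc = vfmaq_f16(vc, va, vb);          // vc += va * vb, all in FP16
        vst1q_f16(acc + i, vc);
    }
}
```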
Many of our mobile partners today are taking advantage of the Mali GPU to accelerate machine learning workloads. Based on feedback from these partners we have targeted optimisations in this area.
New 3x3 and 5x5 direct convolution functions have been optimised for the Bifrost architecture, achieving a significant performance uplift compared to the routines in our previous release (17.6). We have observed a typical uplift of around 2.5x using these new routines on selected test platforms. Furthermore, the new optimisations introduced in GEMM help us achieve 3.5x better performance in the case of AlexNet with multiple batches. Performance will vary depending on the platform and implementation, but we expect these optimisations to provide a significant uplift on Bifrost GPUs in general.
The following diagram shows some results measured on the Huawei Mate 9 smartphone, disabling DVFS and taking the minimum execution time over 10 runs. It illustrates how the new routines improve performance compared with the previous release.
Complex workloads (large networks) can require a lot of memory, which is a sticking point in particular for embedded and mobile platforms. Following feedback from our partners we have added "memory manager" functionality to the runtime component of our library. The memory manager reduces the memory requirements of a generic algorithm/model by recycling temporary buffers.
The memory manager consists of a Lifetime Manager (which keeps track of the lifetime of the registered objects) and a Pool Manager (which manages the memory pools). When developers configure their functions, the runtime tracks the memory requirements: some tensors are only temporary, for example, so we allocate only the memory that is actually needed. Configuration of the memory manager should be performed sequentially from a single thread to ensure better memory utilisation.
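The sketch below is a conceptual illustration of the idea, not the library's API (the type and function names are ours). At configure time each temporary tensor registers the interval over which it is live; instead of allocating every tensor separately, a pool sized to the peak concurrent requirement is enough, because tensors with non-overlapping lifetimes can reuse the same bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Conceptual sketch only: lifetime tracking and pooling of temporary buffers.
struct LifetimeRecord { size_t first_layer, last_layer, bytes; };

// One allocation per tensor: total footprint without a memory manager.
size_t naive_total(const std::vector<LifetimeRecord> &records)
{
    size_t total = 0;
    for (const auto &r : records) total += r.bytes;
    return total;
}

// Pool footprint: the largest amount of memory that is live at any one time.
size_t pooled_peak(const std::vector<LifetimeRecord> &records)
{
    size_t peak = 0;
    for (const auto &probe : records)              // at each tensor's first use...
    {
        size_t live = 0;
        for (const auto &r : records)              // ...sum every tensor still alive
            if (r.first_layer <= probe.first_layer && probe.first_layer <= r.last_layer)
                live += r.bytes;
        peak = std::max(peak, live);
    }
    return peak;
}
```

For a network whose intermediate tensors are each live for only a layer or two, the pooled peak is typically a small fraction of the naive total, which is, in essence, where the savings discussed below come from.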
The following table shows some of the memory savings we have been able to measure on our test platforms when using the memory manager. These will vary depending on platform, workload and configuration. In general, we expect the memory manager to help developers save memory.
Going forward, our plans are to continue delivering specific optimisations based on partner and developer needs. Furthermore, we will focus on integration with machine learning frameworks as well as aligning with upcoming APIs such as Google's Android NN.
Our aim is not to provide complete coverage of all data types and functions; we are very selective about the functions we choose to implement, and we base our choices on feedback from developers and partners – we look forward to hearing from you!