An extensive study performed by Facebook in 2019 showed that only a fraction of inference currently runs on mobile GPUs. According to the study, “less than 20% of mobile SoCs have a GPU 3x more powerful than CPUs and, on a median mobile device, GPUs are only as powerful as CPUs”. If this is the reality, then why consider GPU inference?
Well, first of all, if your GPU does fall into that 20%, you can achieve 3x faster inference than on the CPU, and because this figure is an average, in practice the gap between GPU and CPU performance can be even bigger. GPU performance is also improving constantly. Arm Mali GPUs, for example, have delivered better performance and efficiency for Machine Learning (ML) workloads with each new generation, so we can expect the fraction of mobile GPUs that outperform the CPU at inference to keep growing.
In 2019, the Arm Mali-G77 GPU represented a big leap in performance and energy efficiency, including a 60% improvement for ML. In 2020, the Arm Mali-G78 GPU took another step forward, delivering 25% more performance and a further ML boost compared with previous generations, all while keeping the balance between performance and energy-efficiency improvements.
Although there is no 'one-size-fits-all' ML solution, the GPU's parallel data-processing capability makes it well suited to running ML workloads. The GPU was originally designed to optimize the vector and matrix operations used in graphics, and to achieve this a single GPU core can execute hundreds of hardware threads in parallel.
Arm has compute IP across a whole range of processors, and whether you want to run ML inference on GPUs, CPUs, or NPUs, they can all be used under a single common framework: Arm NN.
Figure 1: Arm NN and Arm Compute Library.
Arm NN is an open-source inference engine for CPUs, GPUs, and NPUs. It bridges the gap between existing NN frameworks and the underlying IP. Arm NN is built on top of the Arm Compute Library (ACL), which contains a collection of highly optimized low-level functions that accelerate inference on the Arm Cortex-A family of CPUs and the Arm Mali family of GPUs. For GPUs, ACL uses OpenCL as its compute API (see Figure 1).
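To make this concrete, here is a minimal sketch of what selecting the GPU backend looks like with the Arm NN C++ API. It assumes the model comes from TensorFlow Lite and is loaded through Arm NN's TfLite parser; "model.tflite" is a placeholder name, and input/output binding and error handling are omitted, so treat it as an outline rather than a complete application.

#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>
#include <vector>

int main()
{
    using namespace armnn;

    // Parse a TensorFlow Lite model into an Arm NN network ("model.tflite" is a placeholder).
    auto parser = armnnTfLiteParser::ITfLiteParser::Create();
    INetworkPtr network = parser->CreateNetworkFromBinaryFile("model.tflite");

    // Create the runtime and optimize the network for the GPU backend (GpuAcc),
    // with the CPU backend (CpuAcc) as a fallback for any unsupported layers.
    IRuntime::CreationOptions options;
    IRuntimePtr runtime = IRuntime::Create(options);
    std::vector<BackendId> backends = { Compute::GpuAcc, Compute::CpuAcc };
    IOptimizedNetworkPtr optNet = Optimize(*network, backends, runtime->GetDeviceSpec());

    // Load the optimized network; inference is then driven per frame with
    // runtime->EnqueueWorkload(networkId, inputTensors, outputTensors).
    NetworkId networkId;
    runtime->LoadNetwork(networkId, std::move(optNet));
    return 0;
}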
The OpenCL memory model maps closely to the GPU architecture. Thanks to this, it is possible to implement optimizations that significantly reduce accesses to global memory, as we will see in the next section, which means faster convolution calculations with lower power consumption. This is especially relevant for CNN inference, where convolutions represent ~90% of the total operations.
ACL is an open-source project that Arm has worked hard to optimize so that it delivers superior performance compared with the alternatives. To get the most from Arm NN, it is important to know the options it provides to improve inference performance. As a developer, you look for every millisecond you can squeeze out, especially when you need to achieve real-time inference. Let us have a look at one of the optimization options available in Arm NN and evaluate the impact it can produce with some practical examples.
ACL implements a Local Work-group Size (LWS) tuner. The idea is to improve cache utilization at the L1 and L2 levels and reduce accesses to global memory as much as possible.
Figure 2 shows a basic representation of the OpenCL architecture. The compute device can be a GPU, a CPU, or an accelerator. Inside the compute device we have several compute units (GPU cores, CPU cores, and so on). Each of them has its own L1 memory cache and can execute N threads in parallel, known as work-items. Each thread executes the same piece of code corresponding to an OpenCL kernel, where the thread ID is used to access different memory locations.
Figure 2: OpenCL architecture and memory caches.
To improve L1 cache utilization, we want the threads of the same work-group to access consecutive memory addresses (memory coalescing). To optimize L2 cache utilization, we want the compute units to reuse the same memory block. To achieve both, ACL implements the Local Work-group Size (LWS) tuner, which finds the optimal configuration for each type of OpenCL kernel. For a more detailed explanation, you can read this blog and watch this presentation. The impact of the LWS tuner on inference performance can be huge: an uplift of between 1.12x and 1.8x for different networks, as you can see in the picture below for the three different CL Tuner modes.
Figure 3: OpenCL tuner performance uplift for different networks and tuning modes.
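To picture what is being tuned, the hypothetical OpenCL kernel below (held in a C++ raw string, with made-up names, not one of ACL's optimized kernels) shows the kind of code each work-item runs: neighbouring work-items use their ID to touch neighbouring addresses, so the local work-group size chosen by the host when the kernel is enqueued determines how well those accesses coalesce and how well the caches are reused.

// Illustrative only: a simple element-wise kernel source embedded in C++.
static const char* kExampleKernelSource = R"(
__kernel void vector_add(__global const float* a,
                         __global const float* b,
                         __global       float* out)
{
    // Each work-item processes one element. Consecutive work-items in a work-group
    // read consecutive addresses, so their global-memory accesses can coalesce and
    // stay resident in the compute unit's L1 cache.
    const size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
)";
// On the host side, the work-group size is the local_work_size argument passed to
// clEnqueueNDRangeKernel(); the LWS tuner benchmarks candidate values for it and
// keeps the fastest one per kernel.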
When the tuner was first introduced, it used a brute-force approach: it simply tested all the possible values for the LWS and picked the one that delivered the minimum execution time. For deep networks, this process could take several minutes. In the ACL 19.05 release, the tuner was optimized, and now we can choose from three levels of tuning: "Exhaustive", "Rapid", and "Normal". These levels provide different trade-offs between the performance uplift achieved and the tuning time. The tuning process takes place only once, and the optimal configuration is saved in a local file. To access this file, an Android application needs permission to read and write external storage. The file can be placed in the app-specific folder.
A code snippet that enables the tuner is shown in the following extract. To enable the tuner, a tuning level between 1 and 3 must be set, and we must provide the path of the file where the optimal configuration will be saved. Once the tuning options for the GPU backend are set, we add them to the creation options and use those to create the runtime. The first run after the tuner is activated takes longer, because that is when the tuning itself happens.
IRuntime::CreationOptions options;
BackendOptions backendOptions{"GpuAcc",
{
    { "TuningFile", tuningFile },   // Where to save the optimal parameters
    { "TuningLevel", tuningLevel }  // 0 - None, 1 - Rapid, 2 - Normal, 3 - Exhaustive
}};
options.m_BackendOptions.emplace_back(backendOptions); // Attach the GPU tuning options to the creation options
runtime = IRuntime::Create(options);
The final step, after the first run, is to remove the line that sets the tuning level from the backend options. Once the tuning file has been created, this line is no longer needed: Arm NN only needs to know where the file is. Keeping the line means the tuning process launches every time to find the optimal configuration for the kernels, slowing down every execution. A sketch of the configuration for subsequent runs is shown below.
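For every subsequent run, the runtime creation would then only pass the tuning file (reusing the same tuningFile variable as above), so Arm NN loads the saved parameters instead of tuning again:

IRuntime::CreationOptions options;
BackendOptions backendOptions{"GpuAcc",
{
    { "TuningFile", tuningFile } // Reuse the optimal parameters saved during the first run
}};
options.m_BackendOptions.emplace_back(backendOptions);
runtime = IRuntime::Create(options);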
Rapid mode is designed to offer the shortest tuning time, but it does not achieve as large a performance uplift. Exhaustive mode, by contrast, offers peak performance at the cost of the longest tuning time, while Normal mode gives a balanced trade-off between performance improvement and tuning time. A study performed over a variety of networks shows that the Normal and Rapid modes are enough to achieve a significant performance boost.
Figure 4: Streamline capture before and after enabling OpenCL Tuner.
The previous picture shows a Streamline capture before (top) and after (bottom) enabling the OpenCL Tuner. Focusing on the non-fragment queue activity (orange curve) in the GPU usage section, the highlighted interval marks the beginning and end of the ML inference process on the GPU. Notice that after enabling the tuner, the inference interval is shorter (18 ms) than before enabling it (24 ms), which means the inference time has been reduced by 25%. The improvement varies depending on the hardware and the network type. The capture shown in the picture corresponds to the inference of a segmentation network running on a Mali-G72 MP12 GPU in a Unity app processing the smartphone's video stream.
GPUs were originally designed to accelerate graphics by running hundreds of threads in parallel in each graphics core for vertex and fragment processing. This parallel power is now used extensively for general-purpose computing (GPGPU) and, in particular, as a back end for inference engines, where operators are implemented as compute shaders. GPU inference also has advantages beyond raw performance over CPU inference: executing deep neural network inference on mobile CPUs comes with the unwanted costs of increased power consumption, which hurts battery life, and thermal throttling, which slows down computation. So which compute unit is best? The answer is: it depends on the workload. A blog published in the Arm Community helps to answer this question.
Nevertheless, if your application needs real-time performance, for example when running CNNs on images coming from the smartphone's video stream, then the GPU is the first option to consider. The next step is to enable the OpenCL Tuner to get a performance boost that is effectively free and always welcome in a real-time app.
In a second blog, I write about another great optimization for GPU inference that will be available soon in Arm NN. Stay tuned for the latest Arm NN developments in the Arm Developer Community.
[CTAToken URL = "https://developer.arm.com/ip-products/processors/machine-learning/arm-nn" target="_blank" text="Learn more about Arm NN" class ="green"]