Most operations in deep learning involve massive amounts of data but simple control logic. As parallel processors, GPUs are well suited to this type of task. Current high-end mobile GPUs can provide substantial throughput thanks to their hundreds of Arithmetic Logic Units (ALUs). In fact, GPUs are built with a single purpose: parallel data processing, initially for 3D graphics and later for more general parallel computing.
Additionally, GPUs are energy-efficient processors. Nowadays, the number of tera-operations per second per watt (TOPS/W) is used to evaluate the energy efficiency of mobile processors and embedded devices. GPUs achieve a high TOPS/W figure thanks to their relatively simple control units and lower working frequencies.
One of the biggest challenges that faces mobile inference (and training) on deep neural networks (DNNs) is memory. Memory in neural networks (NN) is required to store input data, weight parameters and activations, as an input propagates through the network. As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass. Adding on-chip memory is one way of solving the memory bottleneck problem by allowing higher memory bandwidth. However, on-chip memory is an expensive feature.
In a previous blog, I introduced Arm NN GPU inference and one of its performance boosters: OpenCL Tuner. In this blog post, I demonstrate how to significantly reduce memory usage and achieve a substantial inference speed-up by simply enabling some new Arm NN features.
Single-precision floating-point format, also known as FP32, uses 32 bits of computer memory to represent a number. This format can handle numbers in a wide dynamic range, from 10^{-38} to 10^{38}, with an accuracy of 0.000006%. This is the format used on desktops and servers to train NNs. Using the same format for inference on mobile devices allows us to run the model in the format originally used for training, without any special conversion, so we can expect the same accuracy as well. The downside is the amount of memory required to store weights and activations in FP32. Can we use a lower-precision format to store these weights and activations and reduce the storage requirements?
Let us have a closer look at the FP32 and FP16 numerical formats. Floating-point format does not have a fixed number of bits assigned to integer and fractional parts. Instead, this type of numeric representation reserves some bits for the mantissa, and some bits for the exponent. This means that any number can be represented as (IEEE 754 Floating-Point Standard):
(-1)^{S} × (1.0 + 0.M) × 2^{(E - bias)}
Where S is the sign, M is the mantissa and E is the exponent. The exponent is adjusted by a bias to store a signed value in an unsigned one.
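As a rough illustration of this representation, the following C++ sketch splits a 32-bit float into its sign, exponent, and mantissa fields and rebuilds the value from them. The names `Fp32Fields`, `decodeFp32`, and `rebuildNormal` are hypothetical helpers written for this post and are not part of Arm NN. The sketch assumes the standard FP32 layout (1 sign bit, 8 exponent bits with a bias of 127, 23 mantissa bits) and only handles normal numbers, not zeros, subnormals, infinities, or NaNs.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Fields of an IEEE 754 single-precision value (illustrative names).
struct Fp32Fields {
    std::uint32_t sign;      // S: 1 bit
    std::uint32_t exponent;  // E: 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // M: 23 bits, the fraction after the implicit leading 1
};

// Extract the three bit fields from a float.
Fp32Fields decodeFp32(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float is expected to be 32 bits wide");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));  // copy the raw bits safely
    return Fp32Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Rebuild a normal value: (-1)^S x (1.0 + 0.M) x 2^(E - bias).
double rebuildNormal(const Fp32Fields& f) {
    const double fraction = 1.0 + static_cast<double>(f.mantissa) / 8388608.0;  // 2^23
    const double magnitude = std::ldexp(fraction, static_cast<int>(f.exponent) - 127);
    return f.sign ? -magnitude : magnitude;
}
```

For example, `decodeFp32(-6.5f)` gives S = 1, E = 129, and M = 0x500000, because -6.5 = -1.625 × 2^{2}, and `rebuildNormal` returns -6.5 from those fields.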
The following figure shows the number of bits used by FP32 and FP16 to represent the sign, exponent, and mantissa. We can see that FP16 uses half as many bits as FP32, so storing weights and activations in FP16 takes half the memory required by FP32.
Figure 1: FP32 and FP16 numeric representation
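To put numbers on the saving, here is a small back-of-the-envelope sketch in C++ using the approximate ResNet-50 figures quoted earlier (~26 million weights and ~16 million activations). The constant and function names are made up for this example. It assumes a batch size of one and that all activations are held in memory at once, which a real runtime would usually avoid by reusing buffers, so treat the totals as an upper-bound illustration rather than a measurement.

```cpp
#include <cstdint>

// Bytes needed to store `count` values of `bytesPerValue` bytes each.
constexpr std::uint64_t storageBytes(std::uint64_t count, std::uint64_t bytesPerValue) {
    return count * bytesPerValue;
}

// Approximate ResNet-50 figures taken from the text (assumed, rounded).
constexpr std::uint64_t kResNet50Weights = 26000000ULL;
constexpr std::uint64_t kResNet50Activations = 16000000ULL;

constexpr std::uint64_t kFp32Bytes = 4;  // 32 bits per value
constexpr std::uint64_t kFp16Bytes = 2;  // 16 bits per value

// Weights plus activations, stored entirely in one format.
constexpr std::uint64_t kFp32Total =
    storageBytes(kResNet50Weights + kResNet50Activations, kFp32Bytes);
constexpr std::uint64_t kFp16Total =
    storageBytes(kResNet50Weights + kResNet50Activations, kFp16Bytes);
```

Under these assumptions, `kFp32Total` is 168,000,000 bytes (about 168 MB) and `kFp16Total` is 84,000,000 bytes (about 84 MB).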
Nevertheless, nothing comes for free. As we can see, the dynamic range of the FP16 format is very limited compared with FP32, and its accuracy is about one ten-thousandth of that of FP32. The dynamic range refers to the range of representable numbers. The accuracy expresses how many values can be represented within the dynamic range, that is, it determines the precision of the format. When switching to FP16, we give up both some dynamic range and some accuracy. To limit the impact of these drawbacks, neural networks must be adapted accordingly. Recent models include normalization layers to avoid going out of range and to make the most of the available precision.
In addition to the memory saving, Arm Mali GPUs natively support the FP16 data type, which means that the GPU needs roughly half of the resources that FP32 requires. This also delivers twice the performance of FP32. For example, the GPU can pack two FP16 operations into a single FP32 instruction. For many workloads, and especially graphics, FP16 has sufficient precision, so the 2x improvement from using it is a no-brainer.
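The trade-off can be made concrete by deriving the limits of each format from its bit layout. The sketch below is illustrative only: `FloatFormatLimits` and `limitsFor` are hypothetical names, and the bit counts passed in (5 exponent bits and 10 mantissa bits for FP16, 8 and 23 for FP32) are the standard IEEE 754 layouts. It computes the largest finite value, the smallest positive normal value, and the spacing between 1.0 and the next representable value.

```cpp
#include <cmath>

// Limits of a binary floating-point format (illustrative names).
struct FloatFormatLimits {
    double maxNormal;  // largest finite value
    double minNormal;  // smallest positive normal value
    double epsilon;    // gap between 1.0 and the next representable value
};

// Derive the limits from the number of exponent and mantissa bits.
FloatFormatLimits limitsFor(int exponentBits, int mantissaBits) {
    const int bias = (1 << (exponentBits - 1)) - 1;
    const int maxExponent = bias;      // the all-ones exponent is reserved for inf/NaN
    const int minExponent = 1 - bias;  // the all-zeros exponent is reserved for zero/subnormals
    const double epsilon = std::ldexp(1.0, -mantissaBits);
    return {(2.0 - epsilon) * std::ldexp(1.0, maxExponent),
            std::ldexp(1.0, minExponent),
            epsilon};
}
```

Calling `limitsFor(5, 10)` gives a largest FP16 value of 65504 and a step of 2^{-10} near 1.0, while `limitsFor(8, 23)` gives a step of 2^{-23} for FP32. The ratio between the two steps is 2^{13} = 8192, which is in line with the "one ten-thousandth" accuracy figure.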
The following table, extracted from the Mali GPU datasheet, shows the number of operations per clock cycle for FP16 and FP32 data formats for some Mali GPUs of Bifrost and Valhall architectures.
Table 1: Number of operations per clock cycle for different Arm Mali GPUs.
As we can see, using FP16 format for inference halves the amount of memory and bandwidth while doubling the performance, that is, more performance with less power consumption. We can enjoy all this while the impact on the model accuracy is kept within permissible levels.
Using FP16 in Arm NN is straightforward. We need to specify it in the optimizer options when loading the model into the runtime.
OptimizerOptions optimizerOptions;
optimizerOptions.m_ReduceFp32ToFp16 = useFp16;
IOptimizedNetworkPtr optNet = Optimize(*network, {Compute::GpuAcc}, runtime->GetDeviceSpec(), optimizerOptions);
Today, AI, deep learning, and NNs are widely used to solve many scientific and practical problems. When following an AI/ML approach, convolutional neural networks (CNNs) are the most popular and effective, especially in image processing. However, it is challenging to implement real-time CNN algorithms in the mobile space, where runtime resources and power are limited. CNNs are formed by several types of layers, but convolutions are the dominant ones, and the most computationally hungry as well. On average, convolutions consume more than 90% of the execution time in CNNs, as they require many arithmetic operations. This means that optimizing convolutions, and especially two-dimensional convolutions, can have a big impact on CNN training and inference.
Several algorithms have been implemented to optimize convolutions. The Fast Fourier Transform (FFT) is well known and has traditionally been used for long filters. The FFT can reduce convolution complexity from O(n^{2}) to O(n log n). Nevertheless, for short filters, and particularly for two-dimensional convolutions, Winograd minimal filtering has been both the most effective and the most widely implemented algorithm in recent years. Here is an excellent description of this algorithm and how to use it to speed up matrix multiplications in convolutions. To give an idea of the optimization that Winograd introduces, let us look at a simple 1D convolution expressed in the form of a matrix-vector multiplication:
Computing the f · g product directly requires six multiplications and four additions.
This can be optimized using standard linear algebra libraries such as BLAS, but the Winograd algorithm goes further. The same multiplication can be used to compute the output of two consecutive 3-tap FIR filters, where only four input values are required:
Figure 2: Winograd optimization when computing two consecutive 3-tap FIR filters.
A 3-tap filter has a 1D kernel with three weights, w_{0}, w_{1}, w_{2}. To compute one output element r_{0}, we perform a dot product between these weights and the input elements k_{0}, k_{1}, k_{2} covered by the kernel (see Fig. 2a). To compute the next output element r_{1}, we shift the sliding window over the input by one element, which overlaps with the previous window (see Fig. 2b). As a result, the matrix has six elements but only four distinct input values, k_{0} to k_{3}. Winograd found an algorithm capable of performing this matrix-vector multiplication with only four multiplications. The output is still a vector, but its elements are formed using only additions and subtractions:
r_{0} = m_{0} + m_{1} + m_{2}
r_{1} = m_{1} - m_{2} - m_{3}
where:
m_{0} = (k_{0} - k_{2}) w_{0}
m_{1} = (k_{1} + k_{2}) (w_{0} + w_{1} + w_{2})/2
m_{2} = (k_{2} - k_{1}) (w_{0} - w_{1} + w_{2})/2
m_{3} = (k_{1} - k_{3}) w_{2}
For the inference problem the weights are constants and the factors (w_{0 }+ w_{1 }+ w_{2})/2 and (w_{0 }- w_{1 }+ w_{2})/2 can be pre-calculated, leaving the total number of multiplications as four.
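The following C++ sketch compares the direct computation of two consecutive 3-tap outputs with the Winograd form. It is a minimal illustration under stated assumptions, not how Arm NN or the Arm Compute Library implement it: all names are hypothetical, `k` holds the four input values, `w` holds the three filter weights, and the outputs are taken to be r_{0} = m_{0} + m_{1} + m_{2} and r_{1} = m_{1} - m_{2} - m_{3}, which follow from expanding the m terms. The weight-side factors are computed once in `transformWeights`, so the per-output work in `winogradFir` uses only four multiplications.

```cpp
#include <array>

using Input = std::array<float, 4>;    // k0..k3: four consecutive input values
using Weights = std::array<float, 3>;  // w0..w2: 3-tap filter weights
using Output = std::array<float, 2>;   // r0, r1: two consecutive outputs

// Direct evaluation: 6 multiplications and 4 additions.
Output directFir(const Input& k, const Weights& w) {
    return {k[0] * w[0] + k[1] * w[1] + k[2] * w[2],
            k[1] * w[0] + k[2] * w[1] + k[3] * w[2]};
}

// Weight-side factors; constant during inference, so computed only once.
struct TransformedWeights {
    float u0, u1, u2, u3;
};

TransformedWeights transformWeights(const Weights& w) {
    return {w[0],
            (w[0] + w[1] + w[2]) / 2.0f,
            (w[0] - w[1] + w[2]) / 2.0f,
            w[2]};
}

// Winograd evaluation: 4 multiplications per pair of outputs.
Output winogradFir(const Input& k, const TransformedWeights& u) {
    const float m0 = (k[0] - k[2]) * u.u0;
    const float m1 = (k[1] + k[2]) * u.u1;
    const float m2 = (k[2] - k[1]) * u.u2;
    const float m3 = (k[1] - k[3]) * u.u3;
    return {m0 + m1 + m2, m1 - m2 - m3};
}
```

With inputs {1, 2, 3, 4} and weights {1, 2, 4}, both functions return {17, 24}. In general the two results can differ slightly in the last bits because the floating-point operations are performed in a different order.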
Similar optimizations can be extended to 2D kernels by performing some smart grouping of the elements in the input tensor into submatrices. The GPU can achieve even more optimizations by optimally scheduling jobs in OpenCL.
The following picture shows an overall speed-up of between 1.1x and 1.8x when applying Winograd with 2x2 output tiles on a range of networks containing convolutions with different kernel sizes (3x3, 5x5, 1x1, and so on). For bigger output tiles, for example 4x4, Winograd must be applied carefully, because handling bigger matrices requires more transformations on the input and output tensors. This penalizes the speed-up, keeping it below the theoretical 4x, as shown in the following picture.
Figure 3: CNN computational speed-up in GPU inference for 2x2 and 4x4 output values in different well-known nets.
We can specify optimizer options when loading the model into the runtime. When the Arm NN FastMath feature is enabled in GPU inference, Winograd optimizations are used in matrix operations. To enable the FP16 data format, we set the optimizer option m_ReduceFp32ToFp16 to true. To enable FastMath, we add “FastMathEnabled” to the optimizer's backend options for the “GpuAcc” backend. The following code snippet shows how to enable FP16 and FastMath for GPU inference:
OptimizerOptions optimizerOptions;
optimizerOptions.m_ReduceFp32ToFp16 = useFp16;
if (useFastMath)
{
    BackendOptions fastMathOption{ "GpuAcc", { { "FastMathEnabled", true } } };
    optimizerOptions.m_ModelOptions.push_back(fastMathOption);
}
IOptimizedNetworkPtr optNet = Optimize(*network, {Compute::GpuAcc}, runtime->GetDeviceSpec(), optimizerOptions);
When FastMath is not enabled, the FP32 data format still applies Winograd optimizations to some kernel sizes. Enabling FastMath in FP32 extends Winograd optimization to a few other kernels. For FP16, the behavior is more straightforward: Winograd optimizations are applied only when the FastMath option is enabled. The following table summarizes Winograd support in FP32 and FP16 formats.
Table 2: Kernels supporting Winograd optimization for FP32 and FP16 data formats.
We have tested GPU inference with InceptionV3 using different data formats and Winograd optimizations. The following picture shows a capture from the Streamline tool for FP32 with the FastMath option disabled (top) and enabled (bottom). We can see that enabling FastMath reduces the inference time from 122 ms to 115 ms. The performance boost is only 6% because, as you can see from the table, enabling FastMath activates Winograd for only a couple of extra kernels: 5x5 and 7x7.
Figure 4: Streamline captures showing performance improvement of FP32 inference on InceptionV3 net when enabling FastMath feature.
For FP16, there is a noticeable difference when the FastMath option is enabled. As you can see in the following picture, enabling FastMath reduces the inference time from 139 ms to 89 ms, that is, a 36% reduction in inference time.
Figure 5: Streamline captures showing performance improvement of FP16 inference on InceptionV3 net when enabling FastMath feature.
In a previous blog, we discussed how the Arm Compute Library (ACL) implements the OpenCL Tuner feature to find the optimal configuration for each OpenCL kernel type. In this blog, we have covered two other important performance boosters in Arm NN: the FP16 data format and Winograd optimization for matrix operations. FP16 is particularly important because it can double the performance compared with FP32 while halving the memory usage. The reduced memory utilization in turn reduces the bandwidth utilization, which helps reduce power consumption and makes the battery last longer. Winograd optimization improves inference performance even further through more efficient matrix computation. All these performance booster options in Arm NN are especially relevant for intensive real-time inference, for example, image processing on a camera stream.
The new FastMath option is available only in the latest release, v20.11. You can download it from the Arm NN repository. I encourage you to try this and the other performance booster options described in both blogs to improve the inference of your models on Mali GPUs. If you are new to Arm NN, you can start by visiting the Arm Developer site, where you can find many “how to” guides.
Learn more about Arm NN: https://developer.arm.com/ip-products/processors/machine-learning/arm-nn