Most operations in deep learning involve massive amounts of data but simple control logic. As parallel processors, GPUs are very well suited to this type of task. Current high-end mobile GPUs can provide substantial throughput thanks to having hundreds of Arithmetic Logic Units (ALUs). In fact, GPUs are built with a single purpose in mind: parallel data processing, initially for 3D graphics and later for more general parallel computing.
Additionally, GPUs are energy-efficient processors. Nowadays, the number of tera operations per second per watt (TOPS/W) is commonly used to evaluate the energy efficiency of mobile processors and embedded devices. GPUs achieve a high TOPS/W figure thanks to their relatively simple control logic and lower clock frequency.
One of the biggest challenges facing mobile inference (and training) of deep neural networks (DNNs) is memory. Memory in a neural network (NN) is required to store input data, weight parameters, and activations as an input propagates through the network. As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass. Adding on-chip memory is one way to relieve the memory bottleneck, as it allows higher memory bandwidth. However, on-chip memory is an expensive feature.
In a previous blog, I introduced Arm NN GPU inference and one of its performance boosters: OpenCL Tuner. In this blog post, I demonstrate how to significantly reduce memory usage and achieve a substantial inference speed-up by simply enabling some new Arm NN features.
Single-precision floating-point format, also known as FP32, is a format that uses 32 bits in computer memory to represent numbers. This format handles numbers across a wide dynamic range, roughly 10^-38 to 10^38, with an accuracy of 0.000006%. This is the format used on desktops and servers to train NNs. Using the same format for inference on mobile devices allows us to run the model in the format it was originally trained in, without any special conversion, so we can expect the same accuracy as well. The downside is the amount of memory required to store weights and activations when the FP32 format is used. Can we use a lower-precision format to store these weights and activations and reduce the storage requirements?
Let us have a closer look at the FP32 and FP16 numerical formats. Floating-point format does not have a fixed number of bits assigned to integer and fractional parts. Instead, this type of numeric representation reserves some bits for the mantissa, and some bits for the exponent. This means that any number can be represented as (IEEE 754 Floating-Point Standard):
value = (-1)^S x (1.0 + 0.M) x 2^(E - bias)
Where S is the sign, M is the mantissa and E is the exponent. The exponent is adjusted by a bias to store a signed value in an unsigned one.
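To make the formula concrete, here is a minimal sketch that decodes the three fields of an FP32 value and reconstructs the number with the expression above (the FP32 bias is 127). The chosen value 3.5f is purely illustrative.

#include <cmath>
#include <cstdint>
#include <cstring>
#include <cstdio>

int main()
{
    const float value = 3.5f;

    // Reinterpret the 32 bits of the float safely
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    const std::uint32_t s = bits >> 31;          // 1 sign bit
    const std::uint32_t e = (bits >> 23) & 0xFF; // 8 exponent bits, biased by 127
    const std::uint32_t m = bits & 0x7FFFFF;     // 23 mantissa (fraction) bits

    // Reconstruct the value using the formula above (normal numbers only)
    const double reconstructed =
        std::pow(-1.0, s) * (1.0 + m / std::pow(2.0, 23)) * std::pow(2.0, static_cast<int>(e) - 127);

    std::printf("S=%u E=%u M=0x%06X -> %.6f\n",
                static_cast<unsigned>(s), static_cast<unsigned>(e),
                static_cast<unsigned>(m), reconstructed);
    return 0;
}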
The following figure shows the number of bits used by FP32 and FP16 to represent the sign, exponent, and mantissa. We can see that FP16 uses half the bits of the FP32 representation, so when storing weights and activations, FP16 uses half the memory compared with FP32.
Figure 1: FP32 and FP16 numeric representation
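As a rough back-of-the-envelope check using the ResNet-50 numbers quoted earlier (and ignoring buffer reuse, padding, and other runtime details), halving the element size halves the storage:

#include <cstdio>

int main()
{
    // Numbers from the ResNet-50 example above; 4 bytes per FP32 value, 2 bytes per FP16 value
    const double values = 26e6 + 16e6; // weights + activations

    std::printf("FP32: ~%.0f MB\n", values * 4 / 1e6); // ~168 MB
    std::printf("FP16: ~%.0f MB\n", values * 2 / 1e6); // ~84 MB
    return 0;
}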
Nevertheless, nothing comes for free. As we can see, the dynamic range of the FP16 format is very limited compared with FP32, and its accuracy is roughly one ten-thousandth that of FP32. The dynamic range refers to the range of representable numbers. The accuracy expresses how many values can be represented within the dynamic range, that is, it determines the precision of the format. When switching to the FP16 format, we forgo both some dynamic range and some accuracy. To limit the impact of these drawbacks, neural networks must be adapted accordingly. Recent models include normalization layers to avoid going out of range and to make the most of the available precision.
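The following sketch illustrates both limitations. It assumes a compiler that exposes the __fp16 storage type (for example, GCC or Clang targeting AArch64), so treat it as illustrative rather than portable.

#include <cstdio>

int main()
{
    const __fp16 tooBig   = static_cast<__fp16>(70000.0f); // above FP16 max (65504) -> +inf
    const __fp16 tooSmall = static_cast<__fp16>(1.0e-8f);  // below FP16 subnormal range -> 0
    const __fp16 rounded  = static_cast<__fp16>(0.1f);     // rounds to nearest FP16 value, ~0.02% off

    std::printf("70000.0f -> %f\n",    static_cast<float>(tooBig));
    std::printf("1.0e-8f  -> %g\n",    static_cast<float>(tooSmall));
    std::printf("0.1f     -> %.10f\n", static_cast<float>(rounded));
    return 0;
}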
In addition to the memory saving, Arm Mali GPUs natively support the FP16 data type, which roughly halves the resources required compared with FP32 and delivers twice the throughput. For example, the GPU can pack two FP16 operations into the space of a single FP32 operation. For many workloads, and especially graphics, FP16 has sufficient precision, so the 2x improvement from using it is a no-brainer.
The following table, extracted from the Mali GPU datasheet, shows the number of operations per clock cycle for the FP16 and FP32 data formats for some Mali GPUs of the Bifrost and Valhall architectures.
Table 1: Number of operations per clock cycle for different Arm Mali GPUs.
As we can see, using FP16 format for inference halves the amount of memory and bandwidth while doubling the performance, that is, more performance with less power consumption. We can enjoy all this while the impact on the model accuracy is kept within permissible levels.
Using FP16 in Arm NN is straightforward. We need to specify it in the optimizer options when loading the model into the runtime.
OptimizerOptions optimizerOptions;
optimizerOptions.m_ReduceFp32ToFp16 = useFp16;

IOptimizedNetworkPtr optNet = Optimize(*network,
                                       {Compute::GpuAcc},
                                       runtime->GetDeviceSpec(),
                                       optimizerOptions);
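For context, here is a sketch of where that call sits in a typical Arm NN workflow. It assumes the TensorFlow Lite parser is used, and the model path "model.tflite" is a placeholder.

#include <armnn/ArmNN.hpp>
#include <armnnTfLiteParser/ITfLiteParser.hpp>

using namespace armnn;

// Create the runtime that will execute the optimized network
IRuntime::CreationOptions runtimeOptions;
IRuntimePtr runtime = IRuntime::Create(runtimeOptions);

// Parse a TensorFlow Lite model into an Arm NN network
auto parser = armnnTfLiteParser::ITfLiteParser::Create();
INetworkPtr network = parser->CreateNetworkFromBinaryFile("model.tflite");

// ... then build the OptimizerOptions and call Optimize() as shown above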
Today, AI, deep learning, and NNs are widely used to solve many scientific and practical problems. When following an AI/ML approach, convolutional neural networks (CNNs) are the most popular and effective, especially in image processing. However, it is challenging to implement real-time CNN algorithms in the mobile space with limited runtime resources and power. CNNs are formed by several types of layers, but convolutions are the dominant ones, and the most computationally hungry as well. On average, convolutions consume more than 90% of the execution time in CNNs, as they require many arithmetic operations. This means that optimizing convolutions, and especially two-dimensional convolutions, can have a big impact on CNN training and inference.
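To see why, consider a direct (naive) 2D convolution: every output value costs kernel-height x kernel-width multiply-accumulate operations, and a network produces millions of output values. The following single-channel sketch (no padding, stride 1) is only for illustration:

#include <cstddef>
#include <vector>

std::vector<float> Conv2dDirect(const std::vector<float>& input, std::size_t inH, std::size_t inW,
                                const std::vector<float>& kernel, std::size_t k)
{
    const std::size_t outH = inH - k + 1;
    const std::size_t outW = inW - k + 1;
    std::vector<float> output(outH * outW, 0.0f);

    // outH * outW * k * k multiply-accumulates in total
    for (std::size_t y = 0; y < outH; ++y)
        for (std::size_t x = 0; x < outW; ++x)
            for (std::size_t ky = 0; ky < k; ++ky)
                for (std::size_t kx = 0; kx < k; ++kx)
                    output[y * outW + x] += input[(y + ky) * inW + (x + kx)] * kernel[ky * k + kx];

    return output;
}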
Several algorithms have been implemented to optimize convolutions. The Fast Fourier Transform (FFT) is well known and has traditionally been used for long filters. FFT can reduce convolution complexity from O(n^2) to O(n log n). Nevertheless, for short filters, and particularly for two-dimensional convolutions, Winograd minimal filtering has been the most effective and most widely implemented algorithm in recent years. Here is an excellent description of this algorithm and how to use it to speed up the matrix multiplications in convolutions. Just to give an idea of the optimization introduced by Winograd, let us look at a simple 1D convolution expressed in the form of a matrix-vector multiplication:
Computing this matrix-vector product f.g directly requires 6 multiplications and 4 additions.
This can be optimized using standard linear algebra libraries such as BLAS, but the Winograd algorithm goes further. The same multiplication can be used to compute two consecutive outputs of a 3-tap FIR filter, where only four input values are required:
Figure 2: Winograd optimization when computing two consecutive 3-tap FIR filters.
A 3-tap filter has a 1D kernel with three elements, w0, w1, w2. To compute one output element r0, we perform a dot product between the kernel elements and the input samples covered by the kernel, k0, k1, k2 (see Fig. 2a). To compute the next output element r1, we shift the sliding window over the input by one element, so it overlaps the previous window (see Fig. 2b). As a result, the matrix in the multiplication has six elements but only four distinct input values, k0 to k3. Winograd found an algorithm capable of performing this matrix-vector multiplication with only four multiplications. The output is still a two-element vector, but it is obtained from those four products using only additions and subtractions:

r0 = m0 + m1 + m2
r1 = m1 - m2 - m3
Where:
m0 = (k0 - k2) w0
m1 = (k1 + k2) (w0 + w1 + w2) / 2
m2 = (k2 - k1) (w0 - w1 + w2) / 2
m3 = (k1 - k3) w2
For inference, the weights are constant, so the factors (w0 + w1 + w2)/2 and (w0 - w1 + w2)/2 can be pre-calculated, leaving a total of only four multiplications per pair of outputs.
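Putting the formulas above into code makes the saving explicit. The following sketch compares the direct computation (6 multiplications) with the Winograd form (4 multiplications, plus two weight sums that can be pre-computed once per filter). The function names are illustrative.

#include <array>

// Direct form: r0 = k0*w0 + k1*w1 + k2*w2, r1 = k1*w0 + k2*w1 + k3*w2
// (6 multiplications, 4 additions)
std::array<float, 2> FirDirect(const std::array<float, 4>& k, const std::array<float, 3>& w)
{
    return { k[0]*w[0] + k[1]*w[1] + k[2]*w[2],
             k[1]*w[0] + k[2]*w[1] + k[3]*w[2] };
}

// Winograd F(2,3): only 4 multiplications; the two weight sums are constant
// per filter and can be pre-computed at inference time
std::array<float, 2> FirWinograd(const std::array<float, 4>& k, const std::array<float, 3>& w)
{
    const float wSumPlus  = (w[0] + w[1] + w[2]) * 0.5f; // pre-computable
    const float wSumMinus = (w[0] - w[1] + w[2]) * 0.5f; // pre-computable

    const float m0 = (k[0] - k[2]) * w[0];
    const float m1 = (k[1] + k[2]) * wSumPlus;
    const float m2 = (k[2] - k[1]) * wSumMinus;
    const float m3 = (k[1] - k[3]) * w[2];

    return { m0 + m1 + m2, m1 - m2 - m3 };
}

Both functions produce the same two outputs for any input window; the Winograd version simply trades two multiplications for a few extra additions.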
Similar optimizations can be extended to 2D kernels by performing some smart grouping of the elements in the input tensor into submatrices. The GPU can achieve even more optimizations by optimally scheduling jobs in OpenCL.
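For readers who want to see the 2D case, here is a sketch built from the standard F(2x2, 3x3) transform matrices (this is not the Arm Compute Library implementation). It computes a 2x2 output tile from a 4x4 input tile with 16 element-wise multiplications, instead of the 36 needed by the direct method (4 outputs x 9 multiplications each).

#include <array>
#include <cstddef>

using Mat4 = std::array<std::array<float, 4>, 4>;
using Mat3 = std::array<std::array<float, 3>, 3>;

// Standard F(2x2,3x3) transforms: Y = A^T * ((G g G^T) .* (B^T d B)) * A
constexpr std::array<std::array<float, 4>, 4> BT = {{
    { 1,  0, -1,  0 },
    { 0,  1,  1,  0 },
    { 0, -1,  1,  0 },
    { 0,  1,  0, -1 }
}};
constexpr std::array<std::array<float, 3>, 4> G = {{
    { 1.0f,  0.0f, 0.0f },
    { 0.5f,  0.5f, 0.5f },
    { 0.5f, -0.5f, 0.5f },
    { 0.0f,  0.0f, 1.0f }
}};
constexpr std::array<std::array<float, 4>, 2> AT = {{
    { 1, 1,  1,  0 },
    { 0, 1, -1, -1 }
}};

// Tiny fixed-size matrix multiply: (R x K) * (K x C) -> (R x C)
template <std::size_t R, std::size_t K, std::size_t C>
std::array<std::array<float, C>, R> MatMul(const std::array<std::array<float, K>, R>& a,
                                           const std::array<std::array<float, C>, K>& b)
{
    std::array<std::array<float, C>, R> out{};
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            for (std::size_t k = 0; k < K; ++k)
                out[r][c] += a[r][k] * b[k][c];
    return out;
}

template <std::size_t R, std::size_t C>
std::array<std::array<float, R>, C> Transpose(const std::array<std::array<float, C>, R>& a)
{
    std::array<std::array<float, R>, C> out{};
    for (std::size_t r = 0; r < R; ++r)
        for (std::size_t c = 0; c < C; ++c)
            out[c][r] = a[r][c];
    return out;
}

// Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 kernel g
std::array<std::array<float, 2>, 2> WinogradTile2x2(const Mat4& d, const Mat3& g)
{
    auto U = MatMul(MatMul(G, g), Transpose(G));   // 4x4 transformed kernel (pre-computable)
    auto V = MatMul(MatMul(BT, d), Transpose(BT)); // 4x4 transformed input tile

    Mat4 M{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            M[i][j] = U[i][j] * V[i][j];           // 16 element-wise multiplications

    return MatMul(MatMul(AT, M), Transpose(AT));   // 2x2 output tile
}

Note that the kernel transform U only depends on the weights, so, just as in the 1D case, it can be computed once and reused for every input tile.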
The following picture shows an overall speed-up of between 1.1x and 1.8x from applying Winograd with 2x2 output tiles, across a range of networks containing convolutions with different kernel sizes (3x3, 5x5, 1x1, and so on). For bigger output tiles, for example 4x4, Winograd must be applied carefully: larger tiles require more transformations of the input and output tensors, which eats into the gain and keeps the speed-up below the theoretical 4x, as shown in the same picture.
Figure 3: CNN computational speed-up in GPU inference for 2x2 and 4x4 output values in different well-known nets.
We can specify optimizer options when loading the model into the runtime. When the Arm NN FastMath feature is enabled for GPU inference, Winograd optimizations are used in the matrix operations. To enable the FP16 data format, we set the optimizer option m_ReduceFp32ToFp16. To enable FastMath, we add a "FastMathEnabled" option to the optimizer model options for the "GpuAcc" backend. The following code snippet shows how to enable both FP16 and FastMath for GPU inference:
OptimizerOptions optimizerOptions;
optimizerOptions.m_ReduceFp32ToFp16 = useFp16;

if (useFastMath)
{
    BackendOptions fastMathOption
    {
        "GpuAcc",
        {
            { "FastMathEnabled", true }
        }
    };
    optimizerOptions.m_ModelOptions.push_back(fastMathOption);
}

IOptimizedNetworkPtr optNet = Optimize(*network,
                                       {Compute::GpuAcc},
                                       runtime->GetDeviceSpec(),
                                       optimizerOptions);
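Once the optimized network is available, it is loaded into the runtime and executed. The following sketch continues from the snippet above; the binding ids (0) and buffer sizes are illustrative and depend on your model.

#include <vector>

// Continues from the snippet above: runtime and optNet already exist
NetworkId networkId;
runtime->LoadNetwork(networkId, std::move(optNet));

// Input/output buffers stay in FP32; Arm NN inserts the FP32<->FP16
// conversion layers internally when m_ReduceFp32ToFp16 is enabled
std::vector<float> inputData(inputSize);   // inputSize is illustrative
std::vector<float> outputData(outputSize); // outputSize is illustrative

InputTensors inputTensors
{
    { 0, ConstTensor(runtime->GetInputTensorInfo(networkId, 0), inputData.data()) }
};
OutputTensors outputTensors
{
    { 0, Tensor(runtime->GetOutputTensorInfo(networkId, 0), outputData.data()) }
};

// Run one inference
runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);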
Even when FastMath is not enabled, FP32 inference still applies Winograd optimizations for some kernel sizes. Enabling FastMath for FP32 extends Winograd optimization to a few additional kernel sizes. For FP16, the behavior is simpler: Winograd optimizations are applied only when the FastMath option is enabled. The following table summarizes Winograd support for the FP32 and FP16 formats.
Table 2: Kernels supporting Winograd optimization for FP32 and FP16 data formats.
We have tested GPU inference with InceptionV3 using different data formats, with and without Winograd optimizations. The following picture shows a capture from the Streamline tool for FP32 with the FastMath option not enabled (top) and enabled (bottom). We can see that enabling FastMath reduces the inference time from 122 ms to 115 ms. The performance boost is only around 6% because, as you can see from the table, enabling FastMath activates Winograd for only a couple of extra kernel sizes: 5x5 and 7x7.
Figure 4: Streamline captures showing performance improvement of FP32 inference on InceptionV3 net when enabling the FastMath feature.
For FP16, there is a noticeable difference when the FastMath option is enabled. As you can see in the following picture, enabling FastMath reduces the inference time from 139 ms to 89 ms, that is, a 36% reduction in inference time.
Figure 5: Streamline captures showing performance improvement of FP16 inference on InceptionV3 net when enabling FastMath feature.
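If you want to measure the effect on your own device without Streamline, a simple wall-clock measurement around EnqueueWorkload gives a first approximation. This sketch continues from the earlier inference snippet and skips the first run, which includes one-off costs such as OpenCL kernel compilation.

#include <chrono>
#include <iostream>

// Warm up: the first run is not representative of steady-state inference time
runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);

const int iterations = 10;
const auto start = std::chrono::steady_clock::now();
for (int i = 0; i < iterations; ++i)
{
    runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
}
const auto stop = std::chrono::steady_clock::now();

const double avgMs = std::chrono::duration<double, std::milli>(stop - start).count() / iterations;
std::cout << "Average inference time: " << avgMs << " ms" << std::endl;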
In a previous blog, we discussed how the Arm Compute Library (ACL) implements the OpenCL Tuner feature to find the optimal configuration for each OpenCL kernel type. In this blog, we have covered two other important performance-booster features of Arm NN: the FP16 data format and Winograd optimization for matrix operations. FP16 is particularly important as it can double the performance compared with FP32 and halve the memory usage. The reduced memory footprint in turn reduces bandwidth usage, which lowers power consumption and makes the battery last longer. Winograd optimization improves inference performance even further through efficient matrix computation. All these performance-booster options available in Arm NN are especially relevant when performing intensive real-time inference, for example, on image processing from a camera stream.
The new FastMath option is available only in the latest release, v20.11, which you can download from the Arm NN repository. I encourage you to try this and the other performance-booster options described in both blogs to improve the inference of your models on Mali GPUs. If you are new to Arm NN, you can start by visiting the Arm Developer site, where you will find many "how to" guides.
Learn more about Arm NN: https://developer.arm.com/ip-products/processors/machine-learning/arm-nn