The size and computational complexity of neural network models continue to grow exponentially. The reason for this growth is easy to understand: larger neural networks generally deliver higher accuracy on many of the image and language tasks that users care about. For example, the recent GPT-3 transformer-based neural network from OpenAI has 175 *billion* parameters and generates human-like text. However, the increase in computational requirements when executing (inferencing) these massive networks presents a major challenge to their adoption. Addressing this challenge is one of the primary avenues of research pursued by Arm’s Machine Learning Research Lab. Our lab focuses on finding novel ways to efficiently execute advanced machine learning models on Arm-based embedded and mobile platforms. To this end, we have published research ranging from AutoML for deeply embedded devices, to novel factorization schemes, to hardware designs for executing compressed models.

## Combining low-precision and complexity-reducing techniques

Our recent paper, which will be presented at ECCV in August, attacks the computational problem from a different angle. It is well established that using low-precision numbers, such as INT8 parameters and computation, significantly reduces the power, memory, and execution-time requirements of advanced neural networks. It is also well known that transform techniques, in particular the Winograd transform, can significantly reduce the number of arithmetic operations required to execute these networks.

However, combining these two techniques (low-precision representation and the complexity-reducing Winograd transform) has, until now, resulted in an unacceptably high loss in prediction accuracy. The loss arises from numerical problems that occur when performing the transform operations required by the Winograd algorithm. As Figure 1 shows, several transform coefficients are either very large or very small, and thus cannot be accurately represented with INT8 precision.

Figure 1. The 10 x 10 convolution **y** (in brown, far right) of 12 x 12 input **d** (in blue, far left) and 3 x 3 kernel **g** (in green, center)

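$$y = A^{T}\left[\left(B^{T} d B\right) \odot \left(G g G^{T}\right)\right] A$$

where $B$, $G$, and $A$ are the activation, weight, and output transform matrices, and $\odot$ denotes element-wise multiplication.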
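To make the formula concrete, here is a minimal Python sketch of the same expression for a much smaller tile, F(2x2, 3x3), checked against a direct sliding-window computation. The matrices below are the commonly used textbook transforms for that small tile, not the F(10x10, 3x3) transforms from Figure 1, and the function names are purely illustrative. Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

# Commonly used F(2x2, 3x3) transform matrices (an assumption for this sketch;
# the post's Figure 1 uses larger F(10x10, 3x3) transforms).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
H = Fraction(1, 2)
G = [[1, 0, 0],
     [H, H, H],
     [H, -H, H],
     [0, 0, 1]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_2x2_3x3(d, g):
    """y = A^T ((B^T d B) * (G g G^T)) A, with * element-wise.

    d is a 4x4 input tile, g is a 3x3 kernel, and the result is 2x2.
    """
    v = matmul(matmul(BT, d), transpose(BT))   # B^T d B  (4x4)
    u = matmul(matmul(G, g), transpose(G))     # G g G^T  (4x4)
    m = [[v[i][j] * u[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, m), transpose(AT))


def direct_conv(d, g):
    """Reference sliding-window computation (no kernel flip)."""
    out = len(d) - len(g) + 1
    return [[sum(d[i + r][j + c] * g[r][c]
                 for r in range(3) for c in range(3))
             for j in range(out)] for i in range(out)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(winograd_2x2_3x3(d, g) == direct_conv(d, g))
```

For this tile size, the element-wise product uses 16 multiplications where the direct method uses 36. Note the fractional entries in `G`: coefficients like these are what become hard to represent once everything is quantized to INT8.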

## Maintaining prediction accuracy using a Residue Number System (RNS)

We have developed a technique that allows the complexity-reducing Winograd transform to be applied to convolutional neural networks with INT8 parameters. The foundation of our technique is the use of a residue number system (RNS). An RNS is used to represent integers by their values modulo pairwise co-prime integers, as shown in Figure 2. The RNS representation enables us to perform the transformations and operations required to execute the network in the Winograd domain, without suffering the numerical problems (underflow and overflow) that typically result in a loss of prediction accuracy. This means that the resulting lower-complexity network incurs no degradation of prediction accuracy compared to the original INT8 network.

###### RNS$(m_0, m_1, \ldots, m_{n-1})$

###### An integer $x$ can be represented by the remainder set

$$\{x \bmod m_0,\; x \bmod m_1,\; \ldots,\; x \bmod m_{n-1}\}$$

###### where the moduli $\{m_i\}$ are pairwise co-prime

###### Arithmetic operations in RNS: addition (+), subtraction (-), and multiplication (*)

For $x = \{x_0, x_1, \ldots, x_{n-1}\}$ and $y = \{y_0, y_1, \ldots, y_{n-1}\} \in \mathrm{RNS}(m_0, m_1, \ldots, m_{n-1})$:

$$x \pm y = \{x_0 \pm y_0,\; x_1 \pm y_1,\; \ldots,\; x_{n-1} \pm y_{n-1}\}$$

$$x * y = \{x_0 * y_0,\; x_1 * y_1,\; \ldots,\; x_{n-1} * y_{n-1}\}$$

###### Division $x/y$ in RNS $\{m_i\}$ is well-defined if $y$ is co-prime to the moduli $\{m_i\}$

$$x / y = x * y^{-1} \bmod \{m_i\}$$

where $y^{-1} * y = 1 \bmod \{m_i\}$, that is, $y^{-1}$ is the multiplicative inverse of $y$.

*Figure 2: RNS representation of integers by their values modulo pairwise co-prime integers*
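The operations in Figure 2 can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the moduli are the RNS(247, 251, 253) example used later in this post, and the reconstruction step uses the Chinese Remainder Theorem with results mapped into a signed range (an assumption made here so that negative values round-trip). It relies on `pow(y, -1, m)` and `math.prod`, which require Python 3.8 or later.

```python
from math import gcd, prod

MODULI = (247, 251, 253)  # pairwise co-prime: 13*19, prime, 11*23


def pairwise_coprime(moduli):
    return all(gcd(a, b) == 1
               for i, a in enumerate(moduli) for b in moduli[i + 1:])


def to_rns(x, moduli=MODULI):
    """Represent integer x by its remainders; Python's % is non-negative."""
    return tuple(x % m for m in moduli)


def rns_add(x, y, moduli=MODULI):
    return tuple((a + b) % m for a, b, m in zip(x, y, moduli))


def rns_sub(x, y, moduli=MODULI):
    return tuple((a - b) % m for a, b, m in zip(x, y, moduli))


def rns_mul(x, y, moduli=MODULI):
    return tuple((a * b) % m for a, b, m in zip(x, y, moduli))


def rns_div(x, y, moduli=MODULI):
    """x * y^-1 per modulus; pow raises ValueError if y is not invertible."""
    return tuple((a * pow(b, -1, m)) % m for a, b, m in zip(x, y, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction into a signed range."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)
    total %= big_m
    # Values in the upper half of [0, M) are interpreted as negative.
    return total - big_m if total > big_m // 2 else total


x, y = to_rns(1234), to_rns(-567)
print(from_rns(rns_add(x, y)), from_rns(rns_mul(x, y)))
print(from_rns(rns_div(to_rns(1230), to_rns(5))))
```

Each operation works independently on every residue, and every residue stays below its modulus, so with these moduli it fits in 8 bits. The result is only recoverable if its true value lies within the range covered by the product of the moduli (here 247 × 251 × 253 = 15,685,241).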

Figure 3 shows the same computation of the M x M output **y** as in Figure 1, except that the calculation is performed using RNS(247, 251, 253). The weight, activation, and output transform matrices for the modulus 253 are shown. As shown, the transform coefficients (the G, B, and A matrices) can all be represented exactly in INT8, and **y** (the result of the convolution) can be reconstructed using either the Chinese Remainder Theorem or Mixed Radix Conversion.

Figure 3. The Winograd convolution **F** (10x10,3x3) over RNS (247,251,253)
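The idea can be illustrated end to end on a deliberately tiny case. The sketch below runs the one-dimensional Winograd algorithm F(2, 3) modulo each of 247, 251, and 253, replacing the fraction 1/2 in the kernel transform with the modular inverse of 2, and then reconstructs the outputs with the Chinese Remainder Theorem. It is a simplified stand-in for the F(10x10, 3x3) computation in Figure 3, not the implementation from the paper: the F(2, 3) transforms are the usual textbook ones, and all function names are hypothetical.

```python
from math import prod

MODULI = (247, 251, 253)  # all odd, so 2 has an inverse modulo each


def crt_signed(residues, moduli=MODULI):
    """Chinese Remainder Theorem reconstruction into a signed range."""
    big_m = prod(moduli)
    total = sum(r * (big_m // m) * pow(big_m // m, -1, m)
                for r, m in zip(residues, moduli)) % big_m
    return total - big_m if total > big_m // 2 else total


def winograd_f23_mod(d, g, m):
    """1D Winograd F(2, 3) with every operation carried out modulo m."""
    half = pow(2, -1, m)  # integer that plays the role of 1/2 modulo m
    # Kernel transform G g (rows: g0, (g0+g1+g2)/2, (g0-g1+g2)/2, g2)
    u = [g[0] % m,
         (g[0] + g[1] + g[2]) * half % m,
         (g[0] - g[1] + g[2]) * half % m,
         g[2] % m]
    # Input transform B^T d
    v = [(d[0] - d[2]) % m,
         (d[1] + d[2]) % m,
         (d[2] - d[1]) % m,
         (d[1] - d[3]) % m]
    # Element-wise product: 4 multiplications for 2 outputs
    p = [a * b % m for a, b in zip(u, v)]
    # Output transform A^T
    return [(p[0] + p[1] + p[2]) % m, (p[1] - p[2] - p[3]) % m]


def winograd_f23_rns(d, g, moduli=MODULI):
    """Run F(2, 3) once per modulus, then reconstruct both outputs."""
    per_mod = [winograd_f23_mod(d, g, m) for m in moduli]
    return [crt_signed([res[i] for res in per_mod], moduli) for i in range(2)]


def direct_f23(d, g):
    """Reference: two outputs of a 3-tap filter over four inputs."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


d = [127, -128, 55, -7]   # INT8-range activations
g = [-128, 127, 4]        # INT8-range weights (odd sum, so 1/2 matters)
print(winograd_f23_rns(d, g) == direct_f23(d, g))
```

Because every intermediate value is a residue smaller than its modulus, nothing overflows or is rounded along the way, and the reconstructed outputs match the direct integer result as long as the true values fit within the range covered by the product of the moduli.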

In Table 1, we present the speedup achieved on different layers of the VGG16 convolutional neural network on the ImageNet dataset using our RNS-based Winograd convolution, compared to the baseline INT8 and INT16 approaches. As shown, our residue number system-based Winograd approach achieves around a 2x speedup over the standard im2col+GEMM implementation on an Arm Cortex-A73 platform. We anticipate that speedups of this magnitude will enable the next generation of advanced convolutional neural networks for image, video, and speech applications to execute efficiently on embedded and mobile platforms.

Table 1: Inference performance of the CNN layers of VGG16, quantized to 8-bit activations and 8-bit weights, with the Winograd algorithm F(14x14, 3x3) over RNS(251, 241, 239) and RNS(4001, 4331) on Arm Cortex-A73, with 71.4% top-1 prediction accuracy on the ImageNet dataset. The corresponding transforms are in the supplementary materials. The speedups for RNS(251, 241, 239) and RNS(4001, 4331) are runtime improvements relative to the standard INT8 and INT16 im2col+GEMM convolution baselines, respectively.

## Find out more

Zhi-Gang Liu from Arm’s ML Research Lab presented the details of this research at ECCV. Take a look at the full paper to learn more.

[Discover more about ML Research at Arm](https://www.arm.com/resources/research/ml)

[Read the full paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123640052.pdf)

## If you enjoyed this post...

Take a look at some of the other blogs published recently by our Machine Learning researchers:

- *Adapting Models to the Real World: On-Device Training for Edge Model Adaptation* by Mark O'Connor
- *Even Faster Convolutions: Winograd Convolutions meet Integer Quantization and Architecture Search* by Javier Fernandez-Marques
- *SCALE-Sim: A cycle-accurate NPU simulator for your research experiments* by Paul Whatmough