Reducing the Cost of Neural Network Inference with Residue Number Systems

August 21, 2020


The size and computational complexity of neural network models continue to grow exponentially. The reason for this growth is easy to understand: larger neural networks generally deliver higher accuracy on many of the image and language tasks that users care about. For example, the recent GPT-3 transformer-based neural network from OpenAI has over 175 billion parameters and can generate human-like text. However, the increase in computational requirements when executing these massive networks (that is, running inference) presents a major challenge to their adoption. This challenge is one of the primary avenues of research being pursued by Arm’s Machine Learning Research Lab. Our lab is focused on finding novel ways to efficiently execute advanced machine learning models on Arm-based embedded and mobile platforms. To this end, we have published research ranging from AutoML for deeply embedded devices and novel factorization schemes to hardware designs for executing compressed models.

Combining low-precision and complexity-reducing techniques

Our recent paper, which will be presented at ECCV in August, attacks the computational problem from a different angle. It is well established that low-precision arithmetic, such as INT8 parameters and computation, significantly reduces the power, memory, and execution-time requirements of advanced neural networks. It is also well known that transform techniques, in particular the Winograd transform, can significantly reduce the number of arithmetic operations required to execute these networks.
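To make the claimed reduction in arithmetic concrete, here is a rough sketch that counts only the multiplications in the element-wise product stage of a Winograd tile and compares them with a direct convolution. It assumes the standard count of (m + r - 1)^2 multiplications for an m x m output tile with an r x r kernel, and it deliberately ignores the additions and the cost of the transforms themselves. The function names are hypothetical.

```python
def direct_multiplies(m: int, r: int) -> int:
    """Multiplications to compute an m x m output tile directly with an r x r kernel."""
    return (m * m) * (r * r)


def winograd_multiplies(m: int, r: int) -> int:
    """Element-wise multiplications in Winograd F(m x m, r x r).

    One multiplication per element of the (m + r - 1) x (m + r - 1) input tile.
    Transform costs are not counted here.
    """
    tile = m + r - 1
    return tile * tile


if __name__ == "__main__":
    for m in (2, 4, 10):
        d = direct_multiplies(m, 3)
        w = winograd_multiplies(m, 3)
        print(f"F({m}x{m}, 3x3): direct={d}, winograd={w}, ratio={d / w:.2f}")
```

For the 10 x 10 output tile with a 3 x 3 kernel discussed in this post, that is 900 multiplications for the direct method against 144 in the Winograd domain. Larger tiles save more multiplications, but they are also the ones whose transform coefficients are hardest to represent at low precision.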

However, the combination of these two techniques (low-precision representation and the complexity-reducing Winograd transform) has, until now, resulted in an unacceptably high loss in prediction accuracy. The loss in accuracy arises from numerical problems that occur when performing the transform operations required by the Winograd algorithm. As the following figure shows, several transform coefficients are either very large or very small, and thus cannot be represented accurately with INT8 precision.


Figure 1. The 10 x 10 convolution y (in brown, far right) of 12 x 12 input d (in blue, far left) and 3 x 3 kernel g (in green, center)

y = A^T ((B^T dB) ⊙ (GgG^T)) A

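where A, B, and G are the output, activation, and weight transform matrices, respectively (given in the original figure, "Convolution with Winograd").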
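The formula y = A^T ((B^T dB) ⊙ (GgG^T)) A can be tried out on a small case. The sketch below does not use the 10 x 10 transforms from the figure, which are not reproduced in this text. Instead it assumes the widely used F(2x2, 3x3) matrices (4 x 4 input tile, 2 x 2 output tile) and exact rational arithmetic via Python's `fractions` module. The helper names are hypothetical.

```python
from fractions import Fraction as Fr


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


# Assumed F(2x2, 3x3) transforms (not the F(10x10, 3x3) matrices from the figure).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[Fr(1), Fr(0), Fr(0)],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [Fr(0), Fr(0), Fr(1)]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def winograd_2x2_3x3(d, g):
    """y = A^T ((B^T d B) ⊙ (G g G^T)) A for a 4x4 input d and a 3x3 kernel g."""
    u = matmul(matmul(G, g), transpose(G))    # G g G^T, 4x4
    v = matmul(matmul(BT, d), transpose(BT))  # B^T d B, 4x4
    h = [[x * y for x, y in zip(rv, ru)] for rv, ru in zip(v, u)]  # element-wise product
    return matmul(matmul(AT, h), transpose(AT))  # 2x2 output


def direct_conv(d, g):
    """Reference: sliding-window (CNN-style) convolution without padding."""
    r = len(g)
    n = len(d) - r + 1
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(r) for q in range(r))
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_2x2_3x3(d, g) == direct_conv(d, g))
```

Even in this small case G contains the fraction 1/2, which is why exact rationals are used here. With larger tiles the coefficients spread over a much wider range, which is the numerical problem described above.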

Maintaining prediction accuracy using a Residue Number System (RNS)

We have developed a technique that allows the complexity-reducing Winograd transform to be applied to convolutional neural networks with INT8 parameters. The foundation of our technique is the use of a residue number system (RNS). An RNS is used to represent integers by their values modulo pairwise co-prime integers, as shown in Figure 2. The RNS representation enables us to perform the transformations and operations required to execute the network in the Winograd domain, without suffering the numerical problems (underflow and overflow) that typically result in a loss of prediction accuracy. This means that the resulting lower-complexity network incurs no degradation of prediction accuracy compared to the original INT8 network.

RNS(m₀, m₁, ..., mₙ₋₁)

An integer x can be represented by the remainder set

{x mod m₀, x mod m₁, ..., x mod mₙ₋₁}

where the moduli {mᵢ} are pairwise co-prime.

Arithmetic operations in RNS: addition (+), subtraction (-), and multiplication (*)

x = {x₀, x₁, ..., xₙ₋₁} and y = {y₀, y₁, ..., yₙ₋₁} ∈ RNS(m₀, m₁, ..., mₙ₋₁)

x ± y = {x₀ ± y₀, x₁ ± y₁, ..., xₙ₋₁ ± yₙ₋₁}

x * y = {x₀ * y₀, x₁ * y₁, ..., xₙ₋₁ * yₙ₋₁}

Division x/y in RNS {mᵢ} is well-defined if y is co-prime to the moduli {mᵢ}:

x/y = x * y⁻¹ mod {mᵢ}

where y⁻¹ * y = 1 mod {mᵢ}

y⁻¹ is the multiplicative inverse of y

Figure 2: RNS representation of integers by their values modulo pairwise co-prime integers
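The operations in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the moduli are borrowed from the RNS(247, 251, 253) example later in this post, the function names are hypothetical, and reconstruction uses the Chinese Remainder Theorem. It needs Python 3.8 or later for `pow(b, -1, m)` and `math.prod`.

```python
from math import prod

MODULI = (247, 251, 253)  # pairwise co-prime: 13*19, prime, 11*23


def to_rns(x, moduli=MODULI):
    """Represent integer x by its remainders modulo each m_i."""
    return tuple(x % m for m in moduli)


def rns_add(x, y, moduli=MODULI):
    return tuple((a + b) % m for a, b, m in zip(x, y, moduli))


def rns_sub(x, y, moduli=MODULI):
    return tuple((a - b) % m for a, b, m in zip(x, y, moduli))


def rns_mul(x, y, moduli=MODULI):
    return tuple((a * b) % m for a, b, m in zip(x, y, moduli))


def rns_div(x, y, moduli=MODULI):
    """x * y^-1 per modulus; requires y to be co-prime to every modulus."""
    return tuple((a * pow(b, -1, m)) % m for a, b, m in zip(x, y, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese Remainder Theorem: recover the unique integer in [0, M)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


if __name__ == "__main__":
    x, y = to_rns(1232), to_rns(56)
    print(from_rns(rns_add(x, y)))  # 1232 + 56
    print(from_rns(rns_mul(x, y)))  # 1232 * 56
    print(from_rns(rns_div(x, y)))  # 1232 / 56, exact because 56 divides 1232
```

Each residue fits in 8 bits, and every operation works on the residues independently, with no carries between them. The result is only recoverable if it lies within the dynamic range M = 247 * 251 * 253, and division gives the ordinary quotient only when y divides x exactly.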

Figure 3 shows the same computation of the M x M output y as in Figure 1, except that the calculation is performed over RNS(247, 251, 253). The weight, activation, and output transform matrices are shown for the modulus 253. The transform coefficients (the G, B, and A matrices) can all be represented exactly in INT8, and y (the result of the convolution) can be reconstructed using either the Chinese Remainder Theorem or Mixed Radix Conversion.

Figure 3. The Winograd convolution F (10x10,3x3) over RNS (247,251,253)
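The idea can be illustrated end to end on a small tile. The sketch below is an assumption-laden toy and not the paper's implementation: it uses the common F(2x2, 3x3) transforms rather than the F(10x10, 3x3) matrices of Figure 3, replaces the fraction 1/2 in G by the modular inverse of 2 under each modulus, runs the whole Winograd computation separately modulo 247, 251, and 253, and then recombines the three results with the Chinese Remainder Theorem into a signed integer. All function names are hypothetical.

```python
from math import prod

MODULI = (247, 251, 253)

# Assumed F(2x2, 3x3) transforms. G is stored scaled by 2 so that it is integral;
# the factor 1/2 is applied later as a modular inverse.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]
G_TIMES_2 = [[2, 0, 0], [1, 1, 1], [1, -1, 1], [0, 0, 2]]


def matmul_mod(a, b, m):
    """Matrix product with every entry reduced modulo m."""
    return [[sum(x * y for x, y in zip(row, col)) % m for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def winograd_mod(d, g, m):
    """y = A^T ((B^T d B) ⊙ (G g G^T)) A, computed entirely modulo m."""
    half = pow(2, -1, m)  # exists because every modulus is odd
    g_mat = [[(v * half) % m for v in row] for row in G_TIMES_2]
    u = matmul_mod(matmul_mod(g_mat, g, m), transpose(g_mat), m)
    v = matmul_mod(matmul_mod(BT, d, m), transpose(BT), m)
    h = [[(x * y) % m for x, y in zip(rv, ru)] for rv, ru in zip(v, u)]
    return matmul_mod(matmul_mod(AT, h, m), transpose(AT), m)


def crt_signed(residues, moduli):
    """Chinese Remainder Theorem, mapped to the signed range around zero."""
    big_m = prod(moduli)
    x = sum(r * (big_m // m) * pow(big_m // m, -1, m)
            for r, m in zip(residues, moduli)) % big_m
    return x - big_m if x > big_m // 2 else x


def winograd_rns(d, g, moduli=MODULI):
    per_modulus = [winograd_mod(d, g, m) for m in moduli]
    return [[crt_signed([y[i][j] for y in per_modulus], moduli)
             for j in range(2)] for i in range(2)]


def direct_conv(d, g):
    """Reference: sliding-window (CNN-style) convolution without padding."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    d = [[-128, 127, 5, -7], [33, -90, 64, 12], [0, 1, -2, 3], [100, -100, 50, -50]]
    g = [[1, -2, 3], [-4, 5, -6], [127, -128, 9]]
    print(winograd_rns(d, g) == direct_conv(d, g))
```

Because every intermediate value is a residue smaller than 253, nothing overflows or loses precision. The final result is exact as long as the true output fits within the signed dynamic range of the chosen moduli, which is comfortably the case for a 3 x 3 kernel with INT8 inputs.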

In Table 1, we present the speedup achieved on different layers of the VGG16 convolutional neural network on the ImageNet dataset using our RNS-based Winograd convolution, compared to the baseline INT8 and INT16 approaches. With our residue number system-based Winograd approach, we achieve around a 2x speedup over the standard im2col+GEMM implementation on an Arm Cortex-A73 platform. We anticipate that speedups of this magnitude will enable the next generation of advanced convolutional neural networks for image, video, and speech applications to execute efficiently on embedded and mobile platforms.


Table 1: Inference performance of the CNN layers of VGG16, quantized to 8-bit activations and 8-bit weights, using the Winograd algorithm F(14x14, 3x3) over RNS(251,241,239) and RNS(4001,4331) on Arm Cortex-A73, with 71.4% top-1 prediction accuracy on the ImageNet dataset. The corresponding transforms are in the supplementary materials. The speedups of RNS(251,241,239) and RNS(4001,4331) are the runtime improvements relative to the standard INT8 and INT16 im2col+GEMM convolution baselines, respectively.

Find out more

Zhi-Gang Liu from Arm’s ML Research Lab presented the details of this research at ECCV. Take a look at the full paper to learn more.

Discover more about ML Research at Arm

Read the full paper

If you enjoyed this post...

Take a look at some of the other blogs published recently by our Machine Learning researchers:

Adapting Models to the Real World: On-Device Training for Edge Model Adaptation by Mark O'Connor
Even Faster Convolutions: Winograd Convolutions meet Integer Quantization and Architecture Search by Javier Fernandez-Marques
SCALE-Sim: A cycle-accurate NPU simulator for your research experiments by Paul Whatmough

