*Javier Fernandez-Marques recently completed his internship with us at Arm ML Research Lab, working with the Machine Learning team in Cambridge UK, and in Boston, US. During his four months, Javier’s research focused on Winograd convolutions, the very work that he presented at MLSys 2020. Now back at the University of Oxford completing his PhD, Javier gives a high-level summary of his work.*

## Introduction

The design of deep learning (DL) neural network (NN) models targeting mobile devices has advanced rapidly over the last couple of years. Important computer vision tasks, such as image classification and object detection, have led a community-wide transition from model designs fixated on metrics, such as accuracy, towards designs that not only perform well, but also do so efficiently. Beyond these examples, mobile devices make use of DL for tasks as diverse as human pose estimation, frame interpolation, speech-to-text, and image super-resolution. In most settings, the design process starts with two questions:

- Which NN architecture would be best for my application?
- How do I make my NN model fit and run efficiently on the target device?

The first of these two questions has been the predominant focus of research within the machine learning (ML) community. As a result, the architectural designs of today’s best performing models are poles apart from their mid-2010s equivalents. Long gone are the days of using 11x11 filters, “just” 16 layers, reaching <70% on ImageNet, or manually designing the network architecture. Today, Neural Architecture Search (NAS) frameworks are used to automate the discovery of high-performing architectures. The search space for NAS algorithms is primarily defined by a set of *candidate operations* (for example, different types of layers) and other heuristics determining how the space of architectures is explored as learning progresses. In some cases, the search can be further constrained by informing the NAS about hardware-related (for example, maximum available memory, latency of each *candidate operation*) or application-related (for example, latency and accuracy thresholds) metrics.

*Neural Architecture Search (NAS) frameworks automate the discovery of model architectures by combining layers from a user-defined set of candidate operations. Some NAS frameworks constrain the search space further by leveraging real hardware metrics and constraints imposed by the application and scenario where the model will be used.*

The second question is about reducing model size and the amount of compute involved during inference. Commonly, both are jointly addressed by making use of *quantization*. Quantizing a model means reducing the bit-width of its parameters, which directly translates into a smaller model. Operating with lower-precision operands (for example, a layer's input and weight data) enables the use of low-precision arithmetic, which often translates into faster inference and lower energy consumption. These are precisely the properties we would like our model to have when aiming for deployment on battery-powered devices such as tablets, smartphones, or wearables. While quantization is still an active research area attracting interest from diverse domains, such as statistics and systems, there is a common consensus that 8-bit quantized models hit the sweet spot between accuracy and compute cost on current mobile systems.
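To make the idea concrete, the following is a minimal sketch of symmetric, per-tensor INT8 quantization: every value shares one scale factor and is rounded to an integer in [-127, 127]. This is an illustration of the general recipe, not the scheme of any particular framework, and the helper names are our own:

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]

weights = [0.52, -1.3, 0.07, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)  # each entry is within scale/2 of the original
```

Storing `q` instead of `weights` cuts storage by 4x relative to FP32, and keeping both operands of a layer in INT8 is what enables the low-precision arithmetic mentioned above.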

Up until now, the design of high-performing lightweight NN models has primarily revolved around two key ideas: lightweight architecture designs and the use of quantization frameworks.

But there is a third question that has not yet attracted as much attention from the ML community:

- Which algorithms are going to perform each of the operations in the compute graph describing the model?
- And, more concretely for CNNs, which algorithm is used to perform the convolution in each layer?

This is an important question to consider when deciding which model architecture to implement and which quantization strategy to adopt. Without considering which algorithms will be used for the convolutions, which account for the majority of the operations during inference, we might overlook overheads (for example, some algorithms require more memory than others) or miss opportunities for further optimization (for example, some algorithms are faster than others).

## Algorithms for Convolutions

There are a number of previous studies that have evaluated the suitability of several convolution algorithms and their tradeoffs in terms of memory, latency, and numerical degradation in the context of DL and image classification. The current landscape of convolution algorithms can be split into two groups: algorithms that operate on the *input-weight space*, such as the *direct loop* or *im2row* algorithms, and those that operate on a *transformation space*, with FFT or Winograd being the most well-known examples. Here we will limit our analysis to the algorithms studied in Anderson & Gregg (2018). These are compared side by side in terms of their strengths and weaknesses in the following table:

*Trade-offs in terms of memory, latency, and striding options for several popular convolution algorithms. If algorithms operating on a transformation space (e.g. FFT or Winograd) do not expose that space during training, replacing the convolutional layers of quantized models with one of these algorithms for deployment results in a severe drop in accuracy. Despite being the fastest algorithms, their usage in quantized contexts is therefore limited. Adapted from Anderson & Gregg (2018)*

Normally, a model is first trained using one of the existing DL frameworks. Then, for deployment, each convolution is implemented with one of the algorithms shown in the table. If deploying to memory-constrained devices, we might opt for a direct-loop implementation, since it comes with minimal memory overhead. If, on the other hand, we aim for faster inference, we would normally make use of *im2row/im2col* or one of the transformation-space algorithms, such as Winograd. These steps, starting with an *algorithm-agnostic* model-training stage, are reasonable and work well for full-precision (FP32) models, regardless of the convolution algorithm chosen for deployment.
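The *im2row/im2col* trade-off is easy to see in code: every input patch is flattened into a row, lowering the convolution to a single matrix multiplication at the cost of duplicating each input pixel up to k × k times in memory. A minimal pure-Python sketch (illustrative, not a tuned kernel; the function names are our own):

```python
def im2row(image, k=3):
    """Flatten every k x k patch of a 2-D input into one row."""
    h, w = len(image), len(image[0])
    return [[image[i + u][j + v] for u in range(k) for v in range(k)]
            for i in range(h - k + 1) for j in range(w - k + 1)]

def conv_via_im2row(image, filt):
    """Convolution as a (patches x k*k) matrix times a flattened filter."""
    k = len(filt)
    flat = [filt[u][v] for u in range(k) for v in range(k)]
    return [sum(a * b for a, b in zip(row, flat)) for row in im2row(image, k)]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
centre = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # filter that picks the patch centre
out = conv_via_im2row(image, centre)        # the four central pixels
```

The matrix product maps onto highly optimized GEMM routines, which is why im2row/im2col is fast despite the memory blow-up noted in the table.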

However, when deploying quantized models, we observe significant performance degradation in algorithms that operate in *transformation spaces* (for example, Winograd and FFT). Why do we observe such degradation? Intuitively, because during training we are not modelling the inaccuracies that quantization introduces in __each__ of the stages involved in these convolution algorithms. At deployment, the input distributions to each layer differ from those the model expects, rendering the learned parameters no longer useful. In other words, although with sufficient precision all algorithms generate indistinguishable outputs given the same inputs and weights, this is not the case once quantization is enforced, as *transformation space* algorithms become lossy. Due to this drop in performance in algorithms such as Winograd, researchers have instead opted for slower but more reliable algorithms, such as im2col, when deploying quantized models.

## Winograd Convolutions

The Winograd algorithm for convolutions using linear polynomials is guaranteed to use the minimum number of element-wise multiplications to compute an m × m output with an r × r filter. Lavin & Gray (2016) refer to this minimal algorithm as F(m × m, r × r) and present its matrix form, where G, B, and A are transformation matrices applied to the filter g, the input d, and the output respectively, and ⊙ is the Hadamard (element-wise) multiplication:

*Eq.1: Winograd convolution for linear polynomials: Y = Aᵀ [(G g Gᵀ) ⊙ (Bᵀ d B)] A*

These transformation matrices are commonly constructed as described in the Cook-Toom algorithm, which requires choosing a set of so-called *polynomial points* from R^{2}. This choice is not trivial, but for small Winograd kernels, e.g., F(2 × 2, 3 × 3) or F(4 × 4, 3 × 3), there is a common consensus. For the remainder of this blog post, unless stated otherwise, we consider 3 × 3 filters and therefore refer to F(2 × 2, 3 × 3) as F2 and to F(4 × 4, 3 × 3) as F4.
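To make Eq.1 concrete, here is a pure-Python sketch of a single F2 tile using the standard Cook-Toom matrices for F(2 × 2, 3 × 3) from Lavin & Gray (2016), checked against a direct convolution of the same tile. It is written for clarity, not performance:

```python
# Standard transformation matrices for F(2x2, 3x3).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G   = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def T(a):
    return [list(col) for col in zip(*a)]

def winograd_f2(d, g):
    """2x2 output tile from a 4x4 input tile d and a 3x3 filter g (Eq.1)."""
    U = matmul(matmul(G, g), T(G))        # G g G^T  (filter transform)
    V = matmul(matmul(B_T, d), T(B_T))    # B^T d B  (input transform)
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # Hadamard
    return matmul(matmul(A_T, M), T(A_T)) # A^T M A  (output transform)

def direct(d, g):
    """Reference: direct convolution producing the same 2x2 output."""
    return [[sum(d[i + u][j + v] * g[u][v]
                 for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]

d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 1, 2, 3], [4, 5, 6, 7]]
g = [[1, 0, -1], [0.5, 1, 0], [0, -0.5, 1]]
# In full precision, both agree up to floating-point rounding.
```

Note the 2 × 2 output costs only 16 element-wise multiplications here, versus 36 for the direct loop; a full convolution simply tiles the input and repeats this per tile.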

The main challenge associated with the use of Winograd convolutions is numerical error. Small tiles, such as F2 and F4, perform well in single and double precision (FP32/FP64). Because these introduce only marginal numerical error, a network can first be trained using conventional convolutions before replacing appropriate layers with Winograd, without impacting accuracy. However, attempting this with larger Winograd tiles, or with quantization, results in significant accuracy loss. The root of the problem is the increasing numerical range of the values in G, B, and A as the input tile size d = m + r − 1 increases. As a consequence, the matrix multiplications in Eq.1 contribute considerable error, ultimately reducing accuracy. This problem is exacerbated in networks using quantized weights and activations, where the range and precision of representable values are reduced.
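The growth in numerical range is visible in the transformation matrices themselves. As a small illustration, below we compare the input-transform matrix Bᵀ of F2 with the commonly used one for F4 (constructed from the polynomial points 0, ±1, ±2); the F4 entries span a noticeably wider range, so transformed tiles consume more of the limited dynamic range available under quantization:

```python
# Input-transform matrices B^T for F(2x2, 3x3) and, with the common
# choice of polynomial points {0, 1, -1, 2, -2}, for F(4x4, 3x3).
B_T_F2 = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
B_T_F4 = [[4, 0, -5, 0, 1, 0],
          [0, -4, -4, 1, 1, 0],
          [0, 4, -4, -1, 1, 0],
          [0, -2, -1, 2, 1, 0],
          [0, 2, -1, -2, 1, 0],
          [0, 4, 0, -5, 0, 1]]

def max_abs(matrix):
    """Largest magnitude among the matrix entries."""
    return max(abs(x) for row in matrix for x in row)

print(max_abs(B_T_F2), max_abs(B_T_F4))  # entries grow from 1 to 5
```

With INT8 operands this amplification matters: multiplying a tile by entries of magnitude 5 stretches intermediate values well beyond the input range, and either clipping or coarser scales must absorb the difference.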

*Replacing the convolutional layers in pre-trained ResNet-18 models on CIFAR-10 with F2, F4 and F6 Winograd convolutions. This works well in full precision, but accuracy drops drastically with quantization for configurations beyond F2.*

Winograd convolutions are the fastest known algorithm for spatially small convolutions (for example, 3x3 filters), but exploiting their full potential comes with the burden of numerical error, rendering them unusable in quantized contexts. To combine the speedups of Winograd with those that quantization and reduced-precision arithmetic are known to offer, we present Winograd-aware networks.

## Winograd-aware networks

Neural networks have proven to be resilient to all kinds of approximations, for example, pruning and quantization. When applying these techniques, better models are consistently generated if these approximations are present during training. In other words, when the training is *aware* of quantization, or when training is *aware* of pruning.

*A Winograd-aware formulation for convolutional layers. This formulation exposes the numerical errors introduced by Winograd convolutions during the training stage. It also enables learning of the transformation matrices after they are initialized via Cook-Toom.*

Following this intuition, we propose an end-to-end Winograd-aware pipeline as shown in the following diagram. In the forward pass we apply Eq.1 to each patch of the activations from the previous layer. We can apply standard backpropagation, since Eq.1 is only a collection of matrix-matrix multiplications.

This implementation allows us to:

- **Learn better filters.** Building an explicit implementation of each of the stages involved in the Winograd transform exposes the numerical errors introduced in Eq.1 to the learning of the filters. This prevents the accuracy drops presented previously.
- **Learn the transforms.** Traditionally, matrices G, B⊤ and A⊤ are fixed. Instead, we can treat them as another set of *learnable* parameters in the layer. This relaxation leads to much-improved performance in quantized networks, while still maintaining the overall structure of the Winograd convolution algorithm and its speedups.

Experimental results (see the following plots) demonstrate the effectiveness of our Winograd-aware formulation when introducing quantization. Unlike models that make use of Winograd convolutions only for deployment, Winograd-aware models can retain ~90% of their accuracy. When the transformation matrices are allowed to evolve during training (experiments shown with *-flex*), the accuracy gap between standard convolutions and Winograd convolutions is further reduced.

*Performance of a Winograd-aware ResNet-18 at different bit-widths, trained with different Winograd configurations. Winograd-aware layers scale with the network’s width. In quantized networks, models that learn the Winograd transforms (-flex configurations) strictly outperform those that keep them fixed at the values obtained via Cook-Toom.*

Simultaneously maximizing accuracy and minimizing latency with Winograd convolution isn’t trivial. The reason for this is that large tiles result in lower latency but come at the cost of higher numerical error. This presents a good opportunity to jointly optimize network accuracy and latency.
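A back-of-the-envelope count makes the latency side of this trade-off concrete: F(m × m, 3 × 3) performs (m + 2)² element-wise multiplications per tile of m² outputs, versus 9 per output for direct convolution. The sketch below counts only multiplications; real latency also depends on the cost of the transforms and on memory traffic, which is why we measure on actual CPUs:

```python
def mults_per_output(m, r=3):
    """Element-wise multiplications per output pixel for F(m x m, r x r)."""
    tile = m + r - 1               # input tile size d = m + r - 1
    return (tile * tile) / (m * m)

# Direct convolution needs r*r = 9 multiplications per output pixel;
# larger Winograd tiles amortize the tile over more outputs.
for m in (2, 4, 6):
    print(f"F{m}: {mults_per_output(m):.2f} mults/output vs 9 direct")
```

So F4 needs roughly half the multiplications of F2 and a quarter of direct convolution, which is exactly why larger tiles are attractive despite their higher numerical error.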

To this end, we propose wiNAS, a neural architecture search (NAS) framework that jointly optimizes a given macro-architecture for accuracy and latency, leveraging Winograd-aware (WA) layers. Our framework makes use of a real latency model obtained by evaluating thousands of convolutional layer configurations (that is, different input sizes, convolution algorithms, and tile sizes) on real mobile CPUs. In this work, we focus on Arm Cortex-A73 and Cortex-A53 mobile CPUs. The search space for wiNAS is defined by the set of *candidate operations* shown in the following left diagram.

*(left) Candidate operations for wiNAS. (right) We show that Winograd-aware (WAF2/4) layers combine the speedups of Winograd convolutions with those of INT8 arithmetic, with little to no accuracy loss in some cases. This is not possible with existing Winograd (WF2/4) formulations. For the last two rows, wiNAS found different optimizations for each dataset. We show latencies for CIFAR-10 on the left and CIFAR-100 on the right. Speedups are shown against im2row in FP32.*

In this work, we studied Winograd-aware layers with different tile sizes, three quantization levels and on three popular datasets. We found that allowing the transformation matrices to evolve during training resulted in significantly better models. With wiNAS we leveraged Winograd-aware layers and latency metrics from off-the-shelf mobile CPUs and found architectures that helped minimize the numerical instability of Winograd. A Winograd-aware ResNet-18 quantized to INT8 offers up to 1.32× faster inference for only a marginal accuracy drop compared to existing Winograd implementations, which are limited to FP32. This network is also 1.54× faster than an optimized *im2row *implementation using INT8 arithmetic.

To learn more about this research, read the full paper, and do get in touch to ask a question.


Javier’s internship work really leverages the latest architecture search techniques to get the most out of the Winograd algorithm. In practical terms, it means we can make neural network tasks go faster, without changing the hardware. Therefore, this work could potentially provide an immediate benefit for anyone deploying NNs. This work also blurs the lines between machine learning, computer architecture and software, which is the hallmark of many other efforts in the Arm ML Research Lab. Bravo Javier, and best of luck with your PhD at Oxford University.

Paul Whatmough - Senior Principal Research Engineer

## Arm ML Research Lab | The Experience as an Intern

During my four-month internship at Arm Research, I worked alongside many researchers – based in both Cambridge, UK, and Boston, US. From evaluating experiment results to collaborating on innovative projects, I found myself fully immersed into the team and gained insight into ongoing projects. The most valuable element of my internship was the opportunity to present to and get feedback from my research colleagues, networking beyond the Machine Learning team. I would definitely recommend undertaking an internship with Arm.