This blog post has been co-authored with Paul Whatmough, Senior Principal Research Engineer at Arm Research.
Convolutional neural networks (CNNs) have been successful in many applications, including image classification, object detection, and segmentation [1-4]. Practical applications such as unmanned aerial vehicles, medical imaging, and augmented reality continue to drive rapid improvements in CNN architectures and algorithms. CNNs enable fast, highly accurate decision making, but often involve a large amount of computation and storage, so a highly efficient computing platform is required in typical applications.
Aided by their high levels of parallelism, GPUs are the popular hardware platform for DNN training workloads. However, due to their high price and lack of reconfigurability, GPUs are not usually the ideal solution for DNN inference acceleration, especially for models with high sparsity or customized architectures. In contrast, ASIC hardware platforms such as the Google TPU typically achieve the highest levels of energy efficiency, but their limited configurability introduces a significant risk of premature obsolescence as model architectures evolve over time. With DNN algorithms advancing at a fast pace, ASIC designs will always lag behind the cutting edge due to their long design cycles. Here, FPGAs have a unique advantage: potentially higher throughput and efficiency than GPUs, together with a faster time-to-market and a potentially longer life cycle than ASIC solutions.
Most conventional FPGA-based accelerators use off-chip memory (for example, DRAM) for data transfer, then perform the computation for a single layer (or a subset of a single layer) in a time-multiplexed manner. However, the throughput of such designs is often limited by the memory bandwidth and by the amount of available resources (for example, DSPs and ALMs) on the FPGA. Furthermore, frequently accessing the off-chip memory leads to high energy consumption. In contrast to time-multiplexed inference, fully parallel hardware computation unrolls the convolution across all the input and output channels of each convolutional layer, as sketched below. This can significantly increase hardware throughput, albeit at the cost of large resource consumption.
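To make the contrast concrete, here is the convolution loop nest written out in plain Python (NumPy only); the function and variable names are ours for illustration. A time-multiplexed accelerator walks this loop nest with a small pool of shared MAC units, shuttling data through off-chip memory between layers, whereas a fully parallel design unrolls the channel loops so that the weights become fixed hardware rather than values fetched from memory.

```python
import numpy as np

def conv_layer(x, w):
    """Direct convolution as an explicit loop nest (stride 1, no padding).

    x: input feature map, shape (H, W, C_in)
    w: weights, shape (K, K, C_in, C_out)
    A time-multiplexed accelerator walks this loop nest (and the layers)
    with a shared set of MAC units, streaming data via off-chip DRAM.
    A fully parallel design like FixyFPGA unrolls the channel loops, so
    every nonzero weight in this nest becomes its own fixed multiplier
    in a pipelined datapath instead of a value read from memory.
    """
    H, W, C_in = x.shape
    K, _, _, C_out = w.shape
    out = np.zeros((H - K + 1, W - K + 1, C_out))
    for row in range(H - K + 1):
        for col in range(W - K + 1):
            for co in range(C_out):            # unrolled in hardware
                acc = 0.0
                for ci in range(C_in):         # unrolled in hardware
                    for i in range(K):
                        for j in range(K):
                            acc += x[row + i, col + j, ci] * w[i, j, ci, co]
                out[row, col, co] = acc
    return out
```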
In previous ASIC work by Arm ML Research and ASU, called “FixyNN”, we presented a fixed-weight feature extractor design, in which the weights of a few early CNN layers were hard-coded into the datapath logic without ever being stored in memory. In the newly proposed “FixyFPGA” work, we employ fixed-weight scalers for all of the CNN layers to build a fully on-chip, fully parallel, and fully pipelined FPGA accelerator. FixyFPGA not only eliminates off-chip memory access but also efficiently supports elementwise pruning of CNN weights. The major challenge that FixyFPGA needs to resolve is that naïvely deploying a CNN model to an FPGA in a fully parallel manner requires a prohibitively large amount of DSP and ALM resources, even for compact CNN models (for example, MobileNet).
FixyFPGA implements CNN models in a layer-parallel fashion, where every nonzero weight is encoded in the hardware design as a fixed scalar multiplier. This layer-parallel approach leads to significant improvements in latency and energy, achieved by 1) removing the energy and bandwidth limitations of off-chip memory, and 2) increasing the number of MACs that can be implemented on an FPGA by using fixed-weight scalers. Each fixed weight scalar can be implemented in hardware with a series of hardwired shifts and adds.
Compared to conventional programmable multipliers, implementing MAC operations with fixed-weight multipliers is beneficial, as they are significantly smaller, faster, and consume less energy.
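To see why a fixed weight is so cheap, consider the toy sketch below, which decomposes an example weight into power-of-two terms so that the multiplication reduces to hardwired shifts and adds. The helper names and the example value 13 are purely illustrative; this is not code from FixyFPGA or Deep Freeze.

```python
def shift_add_terms(weight):
    """Decompose a fixed integer weight into signed power-of-two terms.

    A multiplication by a constant that is known at hardware-generation
    time can be hardwired as shifts and adds instead of a programmable
    multiplier. For example, 13 = 8 + 4 + 1, so x * 13 becomes
    (x << 3) + (x << 2) + x.
    """
    sign = -1 if weight < 0 else 1
    w, terms, bit = abs(weight), [], 0
    while w:
        if w & 1:
            terms.append((sign, bit))   # (sign, shift amount)
        w >>= 1
        bit += 1
    return terms

def fixed_scale(x, terms):
    """Apply the hardwired shift-add network for one fixed weight."""
    return sum(s * (x << b) for s, b in terms)

terms = shift_add_terms(13)            # [(1, 0), (1, 2), (1, 3)]
assert fixed_scale(5, terms) == 5 * 13
```

With 4-bit weights, each fixed scaler needs at most a few such terms, which is why it is so much smaller than a general-purpose multiplier.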
To meet the stringent hardware constraints of fully parallel on-chip computation, model compression, including pruning and quantization, is required for successful deployment. With respect to pruning, elementwise pruning achieves higher sparsity than structured pruning, but the irregular memory accesses and the index-storage overheads, especially for low-precision DNNs, have hindered efficient hardware implementation to date. In FixyFPGA, however, elementwise sparsity is implemented efficiently without any index storage. Since the weights are encoded using fixed scalers, pruning a weight is equivalent to removing the corresponding operand, which simply generates no hardware at all. As a result, the energy reduction obtained from pruning can be fully exploited.
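The toy generator below conveys the idea: it emits a Verilog-flavored sum-of-products for a single output, and a pruned (zero) weight simply never produces a term, so no multiplier, no adder input, and no sparse index is generated for it. The function name and output format are ours for illustration, not the actual Deep Freeze output.

```python
def emit_dot_product(weights, in_name="x", out_name="acc"):
    """Emit a Verilog-flavored sum-of-products for one output value.

    Each nonzero fixed weight contributes one hardwired term; in real
    RTL the constant multiplication would itself reduce to shifts and
    adds. A zero (pruned) weight contributes nothing at all.
    """
    terms = [
        f"{w} * {in_name}[{i}]"
        for i, w in enumerate(weights)
        if w != 0                      # pruned weights simply disappear
    ]
    return f"assign {out_name} = {' + '.join(terms) if terms else '0'};"

print(emit_dot_product([3, 0, -2, 0, 0, 1]))
# assign acc = 3 * x[0] + -2 * x[2] + 1 * x[5];
```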
In addition to its highly efficient hardware design, FixyFPGA is also empowered by the open-source Deep Freeze tool, which provides an automated FPGA deployment workflow. The hardware code is generated automatically by reading the low-precision model trained with a high-level framework (for example, TensorFlow or PyTorch). Such automated code generation enables a direct transition from software training to hardware inference.
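As a rough sketch of what such a flow looks like in spirit (this is not the actual Deep Freeze API, and the scale factor and clipping range below are illustrative assumptions), the snippet walks a trained PyTorch model and freezes its convolution and fully connected weights into small integers that a hardware generator could then bake into fixed scalers.

```python
import torch
import torch.nn as nn

def export_fixed_weights(model, scale=2 ** 3):
    """Freeze each conv/linear layer's weights as small integers.

    In a real flow the model would be trained with quantization-aware
    training; here we simply round and clip to the 4-bit range [-8, 7]
    to show the shape of the data handed to the hardware generator.
    """
    frozen = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.detach()
            q = torch.clamp(torch.round(w * scale), -8, 7).to(torch.int8)
            frozen[name] = q
    return frozen

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 16, 3))
for name, q in export_fixed_weights(model).items():
    print(name, tuple(q.shape), "nonzero weights:", int((q != 0).sum()))
```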
By applying hardware-algorithm co-optimization, including pruning with high sparsity, 4-bit quantization, and fixed-point scalers for the entire MobileNet-V1 model, the compressed MobileNet-V1 model can be successfully mapped onto the targeted FPGA chip. The high sparsity and the elimination of DRAM communication lead to significant improvements in energy and latency. With the fully parallel, fully pipelined design, the proposed FixyFPGA achieves over 3 TOPS on ImageNet classification, which is 2.3X higher than TuRF (Zhao, FPL, 2019). For object detection on the VOC dataset, FixyFPGA achieves an over 100X higher frame rate than previous MobileNet-based object detection work (Li, ICCV, 2019). While FixyFPGA efficiently supports elementwise sparsity, aggressive pruning can degrade CNN accuracy. Algorithmic improvements in sparsity and quantization, coupled with larger FPGAs that offer more ALMs and DSPs, could enable FixyFPGA to map larger CNN models without external DRAM.
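For a flavor of the compression side, below is a minimal magnitude-based elementwise pruning pass of the kind applied before mapping the model; the paper's actual training recipe, sparsity schedule, and quantizer may differ.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights until roughly the
    requested fraction of elements is pruned (a common elementwise
    pruning heuristic, shown here as a one-shot pass)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0, weights)

w = np.random.randn(64, 64).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.9)
print("achieved sparsity:", float((w_sparse == 0).mean()))
```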
Our paper has been published in the main track of FPL 2021. This has been a collaborative project between Arizona State University (Jian Meng, Shreyas Kolala Venkataramanaiah – a former Arm Research intern, and Jae-sun Seo) and Arm ML Research in Boston (Chuteng Zhou, Patrick Hansen, and Paul Whatmough). We have also open-sourced the Deep Freeze tool for automatically generating hardware for fixed, low-precision neural networks.
[1] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
[2] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
[3] EfficientDet: Scalable and Efficient Object Detection
[4] Deep Learning for Computer Architects