FixyFPGA: Fully-parallel and fully-pipelined FPGA accelerator for sparse CNNs

Jae-sun Seo
September 28, 2021

This blog post has been co-authored with Paul Whatmough, Senior Principal Research Engineer at Arm Research.

Convolutional neural networks (CNNs) have been successful in many applications, including image classification, object detection, and segmentation [1-4]. Practical applications such as unmanned aerial vehicles, medical imaging, and augmented reality continue to drive rapid improvements in CNN architectures and algorithms. CNNs enable fast, accurate decision making, but typically demand large amounts of computation and storage, so a highly efficient computing platform is required in practice.

Thanks to their high degree of parallelism, GPUs are the dominant hardware platform for DNN training workloads. However, their high cost and lack of reconfigurability make them a poor fit for DNN inference acceleration, especially for models with high sparsity or customized architectures. In contrast, ASIC hardware platforms such as the Google TPU typically achieve the highest energy efficiency, but their limited configurability introduces a significant risk of premature obsolescence as model architectures evolve. With DNN algorithms advancing at a fast pace, ASIC designs will always lag behind the cutting edge due to their long design cycles. FPGAs offer a unique middle ground: potentially higher throughput and efficiency than GPUs, together with a faster time-to-market and potentially a longer life cycle than ASIC solutions.

Most conventional FPGA-based accelerators keep data in off-chip memory (for example, DRAM) and compute a single layer (or a subset of a single layer) at a time, in a time-multiplexed manner. The throughput of such designs is often limited by memory bandwidth and by the available FPGA resources (for example, DSPs and ALMs), and the frequent off-chip memory accesses also lead to high energy consumption. In contrast to time-multiplexed inference, fully parallel hardware computation unrolls the convolution across all the input and output channels of each convolutional layer. This can significantly increase hardware throughput, albeit at the cost of large resource consumption.
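
To make the contrast concrete, here is a minimal NumPy sketch, written for this post rather than taken from the paper, of the same convolutional layer computed in the two styles: a time-multiplexed loop that reuses one set of MAC resources per output channel, and a fully unrolled form in which, on the FPGA, every channel product would be its own fixed-weight hardware unit.

```python
import numpy as np

def conv_time_multiplexed(x, w):
    """Conventional accelerator view: a fixed pool of MAC units is reused,
    so output channels are computed one after another while weights are
    streamed in from off-chip DRAM (emulated here by the outer loop)."""
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for co in range(c_out):                      # time-multiplexed dimension
        for i in range(h_out):
            for j in range(w_out):
                y[co, i, j] = np.sum(w[co] * x[:, i:i + k, j:j + k])
    return y

def conv_fully_unrolled(x, w):
    """FixyFPGA-style view: every multiply across input and output channels
    is its own fixed-weight hardware unit, so the whole layer finishes in one
    pipelined pass. Software can only emulate that parallelism; a single
    einsum stands in for the flat array of hardwired scalars and adder trees."""
    k = w.shape[2]
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    patches = np.stack([x[:, i:i + k, j:j + k]
                        for i in range(h_out) for j in range(w_out)])
    y = np.einsum('pcij,ocij->op', patches, w)
    return y.reshape(w.shape[0], h_out, w_out)

# Both mappings compute the same layer; only the hardware scheduling differs.
x = np.random.randn(3, 8, 8)
w = np.random.randn(16, 3, 3, 3)
assert np.allclose(conv_time_multiplexed(x, w), conv_fully_unrolled(x, w))
```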

In previous ASIC work by Arm ML Research and ASU, called “FixyNN”, we presented a fixed-weight feature extractor design in which the weights of a few early CNN layers were hard-coded in the datapath logic rather than stored in memory. In the newly proposed “FixyFPGA” work, we employ fixed-weight scalars for all of the CNN layers, yielding a fully on-chip, fully parallel, and fully pipelined FPGA accelerator design. FixyFPGA not only eliminates off-chip memory access but also efficiently supports elementwise pruning of CNN weights. The major challenge that FixyFPGA must resolve is that naïvely deploying a CNN model to an FPGA in a fully parallel manner requires a prohibitively large amount of DSP and ALM resources, even for compact CNN models (for example, MobileNet).

FixyFPGA framework

FixyFPGA implements CNN models in a layer-parallel fashion, where every nonzero weight is encoded in the hardware design as a fixed scalar multiplier. This layer-parallel approach leads to significant improvements in latency and energy, achieved by 1) removing the energy and bandwidth limitations of off-chip memory, and 2) increasing the number of MACs that can be implemented on an FPGA by using fixed-weight scalars. Each weight scalar can be realized in hardware with a series of hardwired shifts and adds.

Compared to conventional programmable multipliers, implementing MAC operations with fixed-weight multipliers is beneficial: they are significantly smaller, faster, and consume less energy.
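
As a rough illustration (our own sketch, not the FixyFPGA RTL), the following shows how a small integer weight decomposes into the handful of shift-and-add terms that a synthesis tool can hardwire, and why a pruned (zero) weight costs nothing at all.

```python
def constant_mult_as_shift_adds(weight: int):
    """Decompose an integer weight into the shift-and-add terms a synthesizer
    would hardwire for a fixed multiplier (weights and activations assumed to
    be small fixed-point integers, as in the 4-bit FixyFPGA models)."""
    terms = []
    w, sign = abs(weight), (-1 if weight < 0 else 1)
    shift = 0
    while w:
        if w & 1:
            terms.append((sign, shift))   # contributes sign * (x << shift)
        w >>= 1
        shift += 1
    return terms

def apply_fixed_weight(x: int, weight: int) -> int:
    """Multiply x by a compile-time constant using only shifts and adds."""
    return sum(s * (x << sh) for s, sh in constant_mult_as_shift_adds(weight))

# Example: weight 6 becomes (x << 1) + (x << 2); a pruned weight of 0 produces no terms.
assert apply_fixed_weight(5, 6) == 30
assert constant_mult_as_shift_adds(0) == []
```

In practice, a synthesis tool would typically also apply recoding tricks to minimize the adder count, but the principle is the same: the "multiplier" is reduced to wires, shifts, and a few adders.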

To meet the stringent hardware constraints of fully parallel on-chip computation, model compression, including pruning and quantization, is required for successful deployment. Elementwise pruning achieves higher sparsity than structured pruning, but the irregular memory accesses and index-storage overheads, especially for low-precision DNNs, have hindered efficient hardware implementations to date. In FixyFPGA, however, elementwise sparsity is supported without any index storage: because the weights are encoded as fixed scalars, pruning a weight is equivalent to removing its operand from the datapath, which generates no hardware at all. As a result, the energy reduction obtained from pruning can be fully exploited.
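
A small Python analogy (ours, with invented names) for why elementwise sparsity is free in this scheme: when the "hardware" for an output is built from its constant weights, zero weights are dropped at build time, so no multiplier, index list, or gather logic is ever instantiated for them.

```python
import numpy as np

def fixed_dot_product(weights):
    """Build the 'hardware' for one output as a closure over its constant
    weights. Pruned (zero) weights are dropped at build time, so they cost
    neither a multiplier nor an index entry; the adder tree simply has fewer
    inputs."""
    live = [(i, int(w)) for i, w in enumerate(weights) if w != 0]
    def run(x):
        return sum(w * x[i] for i, w in live)
    run.num_multipliers = len(live)          # resources actually instantiated
    return run

weights = np.array([0, 3, 0, 0, -2, 0, 0, 0])   # 75% elementwise sparsity
neuron = fixed_dot_product(weights)
x = np.arange(8)
print(neuron(x), neuron.num_multipliers)        # -> -5, 2 multipliers instead of 8
```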

In addition to its highly efficient hardware design, FixyFPGA is supported by the open-source Deep Freeze tool, which automates the FPGA deployment workflow. The hardware code is generated automatically by reading the low-precision model trained in a high-level framework (for example, TensorFlow or PyTorch). This automated code generation enables a direct transition from software training to hardware inference.
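
The flow can be pictured with the toy generator below, written for this post with invented module and signal names; it is not the actual Deep Freeze output format. It walks a quantized weight tensor exported from the training framework and emits one fixed scaler per nonzero weight, plus the adder that sums the products.

```python
import numpy as np

def freeze_layer_to_rtl(name, weights):
    """Toy generator in the spirit of the Deep Freeze flow: every nonzero
    weight of a fully connected layer becomes one constant multiplier, and
    pruned weights emit nothing. Signal naming is made up for illustration."""
    c_out, c_in = weights.shape
    lines = [f"// auto-generated fixed-weight layer: {name}"]
    for co in range(c_out):
        terms = []
        for ci in range(c_in):
            w = int(weights[co, ci])
            if w == 0:
                continue                       # pruned weight: nothing emitted
            lines.append(f"assign p_{name}_{co}_{ci} = x_{ci} * {w};")
            terms.append(f"p_{name}_{co}_{ci}")
        rhs = " + ".join(terms) if terms else "0"
        lines.append(f"assign y_{name}_{co} = {rhs};")
    return "\n".join(lines)

# 4-bit, heavily pruned toy weights standing in for a trained layer.
w = np.array([[0, 3, 0, -2],
              [7, 0, 0, 0]])
print(freeze_layer_to_rtl("fc0", w))
```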

Results


By applying hardware-algorithm co-optimization, including high-sparsity pruning, 4-bit quantization, and fixed-point scalars, to the entire MobileNet-V1 model, the compressed model can be successfully mapped onto the targeted FPGA chip. The high sparsity and the elimination of DRAM communication lead to significant improvements in energy and latency. With the fully parallel, fully pipelined design, FixyFPGA achieves over 3 TOPS for ImageNet classification, 2.3X higher than TuRF (Zhao, FPL, 2019). For object detection on the VOC dataset, FixyFPGA achieves over 100X higher frame rate than previous MobileNet-based object detection (Li, ICCV, 2019). While FixyFPGA efficiently supports elementwise sparsity, aggressive pruning can degrade CNN accuracy. Algorithmic improvements in sparsity and quantization, coupled with larger FPGAs offering more ALMs and DSPs, could enable FixyFPGA to map larger CNN models without external DRAM.

Acknowledgments

Our paper was published in the main track of FPL 2021. This has been a collaborative project between Arizona State University (Jian Meng, Shreyas Kolala Venkataramanaiah, a former Arm Research intern, and Jae-sun Seo) and Arm ML Research in Boston (Chuteng Zhou, Patrick Hansen, and Paul Whatmough). We have also open sourced the Deep Freeze tool for automatically generating hardware for fixed low-precision neural networks.

References

[1] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

[2] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

[3] EfficientDet: Scalable and Efficient Object Detection

[4] Deep Learning for Computer Architects
