## Energy-Efficient Acceleration of RNNs using CGRA

Aviral Shrivastava, Arizona State University





ARM Summit 2018

10/10/18

# **CGRA: Coarse-Grain Reconfigurable Arrays**

- An array of Processing Elements (PEs); each PE has an ALU-like functional unit that executes one operation every cycle.
- Array configurations vary in terms of
  - Array size
  - Register file architecture
  - Functional units
  - Interconnect network



- CGRAs can achieve a power efficiency of several tens of GOps/sec per Watt.
  - ADRES CGRA chip: up to 60 GOps/sec per Watt [IMEC, HiPEAC 2008]
  - HyCUBE chip: about 63 MIPS/mW [M. Karunaratne et al., DAC 2017]
- Popular in Embedded Systems and Multimedia [Samsung SRP processor]
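
The two efficiency figures above are quoted in different units. As a rough conversion, assuming that one instruction counts as one operation (an assumption the slide does not state), the HyCUBE figure can be put in the same units as the ADRES figure:

$$
63\ \frac{\text{MIPS}}{\text{mW}}
= \frac{63 \times 10^{6}\ \text{instr/s}}{10^{-3}\ \text{W}}
= 63 \times 10^{9}\ \frac{\text{instr/s}}{\text{W}}
\approx 63\ \text{GOps/sec per Watt}
$$

Under that assumption, both chips fall in the same range of several tens of GOps/sec per Watt.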

# **Mapping Applications on CGRAs**




[Figure: a data dependency graph (operations 1 and 2) is modulo-scheduled onto a 1x2 CGRA.]

Performance (loop execution time) critically depends on the mapping obtained by the compiler.

- Iterative Modulo Scheduling: every operation is executed once every II cycles.
- The Initiation Interval (II) is the performance metric; a smaller II means a new loop iteration starts more often.
- Software Pipelining: operations from different iterations can execute simultaneously, which lets CGRAs accelerate even non-parallel loops.
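
To make the role of II concrete, here is a minimal sketch of the standard lower bound on II used in modulo scheduling (the minimum II, taken as the larger of a resource bound and a recurrence bound). The slides do not give this formula; the function names and the example numbers are illustrative assumptions, and a real mapper may need a larger II than this bound.

```python
from math import ceil


def res_mii(num_ops, num_pes):
    """Resource bound: each PE executes at most one operation per cycle."""
    return ceil(num_ops / num_pes)


def rec_mii(recurrences):
    """Recurrence bound over loop-carried dependence cycles.

    Each entry is (latency_cycles, iteration_distance) for one dependence
    cycle in the data dependency graph. With no recurrences the bound is 1.
    """
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(num_ops, num_pes, recurrences):
    """Lower bound on the initiation interval (II)."""
    return max(res_mii(num_ops, num_pes), rec_mii(recurrences))


# Hypothetical loop: 3 operations on a 1x2 CGRA, with one loop-carried
# dependence cycle of latency 2 spanning 1 iteration.
print(min_ii(num_ops=3, num_pes=2, recurrences=[(2, 1)]))
```

The recurrence bound is why loops with loop-carried dependences can still be pipelined: iterations overlap as far as the dependence cycle allows, rather than not at all.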



# **CGRAs Becoming a Hotbed of Research**

Recently, several techniques and evaluations for CGRAs or CGRA-like spatial architectures have been presented, including:

- Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures. Michael Pellauer, Angshuman Parashar, Michael Adler, Bushra Ahsan and others. In ACM TC 2015.
- DRAMA: An Architecture for Accelerated Processing Near Memory. Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow and Nam Sung Kim. In IEEE CAL 2015. [UWisc]
- Control Flow Coalescing on a Hybrid Dataflow/von Neumann GPGPU. Dani Voitsechov and Yoav Etsion. In MICRO 2015. [Technion, Israel]
- Evaluating Programmable Architectures for Imaging and Vision Applications. Artem Vasilyev, Nikhil Bhagdikar, Ardavan Pedram, Stephen Richardson, Shahar Kvatinsky, Mark Horowitz. In MICRO 2016.
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. Yu-Hsin Chen, Joel Emer and Vivienne Sze. In ISCA 2016.
- A space- and energy-efficient code compression/decompression technique for coarse-grained reconfigurable architectures. Bernhard Egger et al. In CGO 2017. [SNU and Samsung]
- HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect. Manupa Karunaratne, Aditi Kulkarni Mohite, Tulika Mitra and Li-Shiuan Peh. In DAC 2017. [NUS]



# **Key Features of CGRA Accelerators**

- With software-pipelined execution, CGRA PEs can efficiently accelerate loops with low parallelism.
  - E.g., loops with loop-carried dependences, intertwined loops, and loops with high branch divergence.
- CGRAs avoid one of the fundamental bottlenecks of the von Neumann architecture: they do not fetch and decode instructions dynamically.
  - CGRA instructions are stored pre-decoded in memory, and PEs transfer data directly to one another without necessarily going through a centralized register file or memory.
- The compiler maps loop operations efficiently; no programmer intervention is needed.
  - Performance-critical kernels of several irregular applications can benefit from acceleration.




Fig. 21 from a recent Intel patent on the Configurable Spatial Accelerator (CSA)

#### Article:

https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-vonneuman/



#### Web page: aviral.lab.asu.edu

### **CCF: CGRA Compiler+simulation Framework**



# **Our Recent Work on Compiling for CGRAs**

- Efficient software pipeline
  - EPIMap: Using Epimorphism to Map Applications on CGRAs. Mahdi Hamzeh, Aviral Shrivastava and Sarma Vrudhula. In DAC 2012.
- Efficiently using distributed register files
  - REGIMap: Register-aware Application Mapping on CGRAs. Mahdi Hamzeh, Aviral Shrivastava and Sarma Vrudhula. In DAC 2013.
- Register file organization
  - URECA: A Compiler Technique to Manage Unified Register File for CGRAs. Shail Dave, Mahesh Balasubramanian and Aviral Shrivastava. In DATE 2018.
- Efficient mapping of if-then-else's
  - LASER: A Hardware/Software Approach to Accelerate Complicated Loops on CGRAs. Mahesh Balasubramanian, Shail Dave, Aviral Shrivastava and Reiley Jeyapaul. In DATE 2018.



# RAMP [DAC 2018]: Selecting a Routing Alternative

- Various routing strategies




#### Failure Analysis

- Dependent operations are scheduled far apart in time, so the value's lifetime is too long to manage in registers.
  - Alternatives: route by PEs, or spill to memory/distributed RFs
- The source operand is a live-in value that cannot be managed in the registers.
  - Alternative: load the live-in value from memory
- Dependent operations are scheduled in consecutive cycles, but routing is not possible because of limited interconnect or a lack of free PEs.
  - Alternatives: re-compute, route by a PE, or re-schedule
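
The failure analysis above pairs each cause of a mapping failure with candidate routing alternatives. A minimal sketch of that pairing follows. It is not RAMP's actual algorithm: the names `MappingFailure`, `ROUTING_ALTERNATIVES`, and `classify_failure`, the `max_register_lifetime` parameter, and the classification rule itself are hypothetical and only illustrate the idea of choosing alternatives from the failure cause.

```python
from enum import Enum


class MappingFailure(Enum):
    LONG_LIFETIME = "dependent ops far apart in time; value cannot stay in registers"
    LIVE_IN = "source operand is a live-in value that registers cannot hold"
    NO_ROUTE = "dependent ops in consecutive cycles; no interconnect path or free PE"


# Candidate alternatives per failure cause, taken from the list above.
ROUTING_ALTERNATIVES = {
    MappingFailure.LONG_LIFETIME: ["route by PEs", "spill to memory/distributed RFs"],
    MappingFailure.LIVE_IN: ["load live-in value from memory"],
    MappingFailure.NO_ROUTE: ["re-compute", "route by a PE", "re-schedule"],
}


def classify_failure(producer_cycle, consumer_cycle, is_live_in, max_register_lifetime):
    """Classify a dependence that failed to map (hypothetical rule)."""
    if is_live_in:
        return MappingFailure.LIVE_IN
    if consumer_cycle - producer_cycle > max_register_lifetime:
        return MappingFailure.LONG_LIFETIME
    # Remaining case in this sketch: treated as a routing failure.
    return MappingFailure.NO_ROUTE


failure = classify_failure(producer_cycle=0, consumer_cycle=9,
                           is_live_in=False, max_register_lifetime=4)
print(failure.name, ROUTING_ALTERNATIVES[failure])
```

Keeping the alternatives in a table keyed by failure cause makes it easy for a mapper to try only the strategies that can address the observed failure.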



# RAMP vs. RegiMap and MemMap



#### Table 1: Specifications of CGRA architecture configurations

| Config. # | Size | RF          | Registers in RF | Memory Units (PEs)                 | Sharing of Memory Bus     |
|-----------|------|-------------|-----------------|------------------------------------|---------------------------|
| 1         | 2x2  | Centralized | 16              | 3, 4                               | dedicated                 |
| 2         | 2x2  | Centralized | 16              | Homogeneous PEs (All)              | shared among PEs of a row |
| 3         | 2x2  | Local       | 2               | Homogeneous PEs (All)              | shared among PEs of a row |
| 4         | 2x2  | Local       | 4               | Homogeneous PEs (All)              | shared among PEs of a row |
| 5         | 4x4  | Centralized | 64              | Homogeneous PEs (All)              | shared among PEs of a row |
| 6         | 4x4  | Local       | 2               | Homogeneous PEs (All)              | shared among PEs of a row |
| 7         | 4x4  | Local       | 4               | Homogeneous PEs (All)              | shared among PEs of a row |
| 8         | 4x4  | Local       | 4               | 2, 4, 6, 8                         | dedicated                 |
| 9         | 8x8  | Centralized | 128             | Homogeneous PEs                    | shared among PEs of a row |
| 10        | 8x8  | Local       | 4               | Homogeneous PEs                    | shared among PEs of a row |
| 11        | 8x8  | Local       | 8               | Homogeneous PEs                    | shared among PEs of a row |
| 12        | 8x8  | Local       | 8               | 1, 3, 5, 7, 9, 11, 13, 15, 19, 21  | dedicated                 |

- For the top performance-critical loops from 8 MiBench benchmarks, previous techniques failed to obtain mappings for almost all loops when resources were highly constrained.
- RAMP accelerated the top performance-critical loops of 8 embedded applications from MiBench by 23× over sequential execution, by 2.13× over REGIMap, and by 3.39× over MEMMap.



### **Residual Blocks from ResNet-18**



### **Execution Mechanism for Inference**



## **Executing Residual Block on CGRA**



#### **Output Stationary Dataflow for Convolutions**



## **Streaming Feature Maps for MACs**






# **Output Stationary Dataflow for Convolutions**

- The input feature map is streamed through the PEs.
  - Each PE performs MACs on the input data and/or passes it to neighboring PEs.
  - Partial sums are stored in the RF.
- Batch normalization, scaling and ReLU are performed before the output feature map is stored.
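
The bullets above can be sketched as a plain loop nest. This is only a functional illustration, not the CGRA mapping: it assumes stride 1, no padding, square filters, and batch normalization plus scaling folded into one per-filter `scale` and `shift` (all assumptions, as are the function and parameter names). The point it shows is that each output's partial sum stays in one place, standing in for a PE's register file, until it is complete and post-processed.

```python
def conv_output_stationary(ifmap, weights, scale, shift):
    """ifmap[c][y][x], weights[m][c][ky][kx] -> ofmap[m][oy][ox]."""
    num_channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    num_filters, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    ofmap = [[[0.0] * out_w for _ in range(out_h)] for _ in range(num_filters)]

    for m in range(num_filters):
        for oy in range(out_h):
            for ox in range(out_w):
                psum = 0.0  # partial sum stays local (the PE's RF)
                for c in range(num_channels):
                    for ky in range(k):
                        for kx in range(k):
                            psum += ifmap[c][oy + ky][ox + kx] * weights[m][c][ky][kx]
                # Folded batch norm/scaling, then ReLU, before the single store.
                ofmap[m][oy][ox] = max(scale[m] * psum + shift[m], 0.0)
    return ofmap


ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # 1 channel, 3x3
weights = [[[[1, 1], [1, 1]]]]                # 1 filter, 1 channel, 2x2
print(conv_output_stationary(ifmap, weights, scale=[1.0], shift=[-20.0]))
```

Each output element is written exactly once, after post-processing, which is the property that keeps partial-sum traffic out of the scratch-pad memory.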





# **Experimental Setup**

- Neural networks evaluated: ResNet-18 and ResNet-34 models
  - Computations on N-dimension feature maps defined in C/C++
- Dataset: ImageNet
  - Input feature map: 224x224x3
- Cycle-count performance comparison of the 2 approaches
  - Baseline: Intel Core i7-870 CPU (2.93GHz, Quad-Core)
    - 256 kB L1\$, 1 MB L2\$, 8 MB L3\$
    - 8 GB system memory (4 GB DIMMs, DDR3 1.33GHz)
    - Performance measured in terms of execution cycles on a core (linear scaling to 4 cores)
    - Profiling: GNU Perf (stat collection from hardware counters)
    - Compilation with g++ -O3 (aggressive loop optimizations, and auto-vectorization enabled)
    - Algorithmic representation for 4D convolutions
      - Conventional representation shown in "Minimizing Computation in Convolutional Neural Networks" by J. Cong et al., in ICANN 2014.
  - Performance model for CGRA (efforts ongoing):
    - In-house C++ simulator (integration with Gem5 ongoing)
    - 4 clusters of PEs (196 PEs in total); each cluster has a 68 kB scratch-pad memory
    - Dataflow execution: output stationary (streaming data, MAC, and compare each take 1 cycle, pipelined)
    - DMA model for scratch-pad memory management: latency (cycles) = 291 + 0.24 × bytes
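
The DMA latency model above is a fixed cost plus a per-byte cost. A small sketch of what that implies for transfer sizes follows; the function names are illustrative, and only the two constants come from the slide.

```python
def dma_latency_cycles(num_bytes):
    """DMA latency model from the setup: 291 + 0.24 * bytes (in cycles)."""
    return 291 + 0.24 * num_bytes


def effective_bytes_per_cycle(num_bytes):
    """Achieved bandwidth once the fixed 291-cycle cost is amortized."""
    return num_bytes / dma_latency_cycles(num_bytes)


for size in (64, 1024, 16 * 1024, 64 * 1024):
    print(size, round(dma_latency_cycles(size)), round(effective_bytes_per_cycle(size), 2))
```

Under this model the achieved bandwidth approaches 1/0.24 ≈ 4.17 bytes per cycle only for large transfers, since small transfers are dominated by the fixed 291-cycle cost. That favors moving data into the scratch-pad in large blocks.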



# **Early Results**



- N: Batch size
- Baseline: Intel i7-870 quad-core CPU
- OS: Output Stationary dataflow for convolutions
- OPT1: Batch normalization, ReLU, and pooling on the CGRA
- OPT2: Software prefetching into the SPM, enabled through quad-buffering

#### **Further Optimizations Possible:**

- Interleave partial-sum computations across filters instead of channels, i.e., operate on several filters for each input channel, to better reuse input feature maps (~4× speedup over OS+OPT1+OPT2).
- Design the dataflow execution to account for variation in data-reuse opportunities (filter weights vs. ifmap for early residual blocks in the model).
- Explore the architecture design space for all-PE communication through input/output FIFOs instead of streaming the data through a single PE.
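
The first optimization above is a loop interchange. The toy model below counts how often an input channel has to be brought into the array under each loop order, assuming the array holds only one input channel at a time (a simplifying assumption; the function name and parameters are hypothetical). It illustrates the reuse argument only and does not reproduce the ~4× figure.

```python
def ifmap_channel_loads(num_filters, num_channels, filters_inner):
    """Count input-channel loads for a given loop order.

    filters_inner=False: for each filter, iterate over channels.
    filters_inner=True:  for each channel, iterate over filters.
    """
    if filters_inner:
        order = [(c, m) for c in range(num_channels) for m in range(num_filters)]
    else:
        order = [(c, m) for m in range(num_filters) for c in range(num_channels)]

    loads, resident = 0, None
    for c, _m in order:
        if c != resident:  # the needed channel is not in the array: load it
            loads += 1
            resident = c
    return loads


print(ifmap_channel_loads(64, 64, filters_inner=False))
print(ifmap_channel_loads(64, 64, filters_inner=True))
```

The trade-off is that with filters in the inner loop, one partial sum per filter must be kept live while a channel is resident, which puts more pressure on the register files.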

# Computing System Stack [Ongoing]


# **Summary and Next Steps**

#### Highlights

- DATE 2018 and DAC 2018 papers on CGRA
  - LASER: A Hardware/Software Approach to Accelerate Complicated Loops on CGRAs [DATE 2018]
  - RAMP: Resource-Aware Application Mapping on CGRAs [DAC 2018]
- Released the first version of the first open-source compiler-simulator toolchain for CGRAs
  - CCF: https://github.com/cmlasu/ccf

#### Next steps

- Complete simulation model and performance model.
  - Embed energy model (e.g. through McPAT)
- Design space exploration to settle upon the spatially programmable solution and the corresponding dataflow execution.
  - Set the interconnections, RF size, CGRA-SPM to DRAM bandwidth
- Development of a lightweight compiler from TensorFlow to the CGRA accelerator.
  - Integrate routines for accelerator execution and software prefetching into TensorFlow library routines.

