



# Run-Time Reconfigurable CPU Interlays for Building Flexible ARM SoCs

## **Raul Garcia and Dirk Koch**

School of Computer Science, The University of Manchester {raul.garcia, dirk.koch}@manchester.ac.uk







- Introduction
- CPU Interlays
- Custom Instructions for Interlays
- Interlays in SoC Designs
- Conclusions





# Introduction

Existing general-purpose (GP) CPU architectures tend to provide feature-rich instruction sets through Instruction Set Extensions (ISEs)

- ISEs introduce substantial area/energy overhead
- CPU clock is limited by power







#### Alternatively, hardened GP CPUs can be coupled to reconfigurable fabrics to implement custom accelerators/ALUs

Advantages:

- Customization
- **Resource Sharing**
- In-field updates



(b) Tightly-Coupled Custom ALU

Hardened CPU





# **CPU Interlays**

This work explores the potential advantages of embedding a tiny FPGA fabric into an otherwise hardened CPU. We call this tiny reconfigurable fabric an *Interlay*, as it sits between the software layer and the physical substrate.



#### **Characteristics:**

- Tightly-coupled to a hardened CPU as a custom ALU
- Targeted to accelerate relatively small kernels
- Leverages the advantages of reconfigurability for an existing design
- Provides efficiency and performance through customization
- Called in user mode





Advanced Processor Technologies Group

### **Interlay Ecosystem and Research Scope**







# **Research Methodology and Case Study**

Top-Down Approach:

- Adding constraints to an off-the-shelf FPGA to emulate the characteristics of the Interlay
- Leverage existing design tools

Case Study: Replacing the hardened NEON with an Interlay [1]

Soft NEON allows for:

- Reuse of existing ARM code
- NEON ISA customization
- In-field updates
- Leverage SIMD interface
- Minimal architectural disruptions
- Avoid area/energy overhead



1. Garcia Ordaz, Jose Raul, and Dirk Koch. "Making a case for an ARM Cortex-A9 CPU interlay replacing the NEON SIMD unit." *International Conference on Field Programmable Logic and Applications (FPL),* IEEE, 2017.





#### About the NEON Engine



#### NEON SIMD Engine [2]:

- Exploits data-level parallelism
- Targeted mostly to media applications
- Independent vector register file and datapath

2. ARM, Cortex-A9 Technical Reference Manual, Online: www.arm.com





#### Hardened NEON Area Estimation



Xilinx Zynq Chip

FPGA primitive equivalent of each functional unit:

| Functional Unit                      | LUT   | DSP | BRAM |
|--------------------------------------|-------|-----|------|
| Dual-Core ARM Cortex-A9 Processor    | 10400 | 80  | 40   |
| Single-Core ARM Cortex-A9 Processor  | 5200  | 40  | 20   |
| Two NEON Units                       | 2080  | 16  | 8    |
| Single NEON Unit                     | 1040  | 8   | 4    |





#### Soft NEON / Hardened NEON Gap Measurement

FPGA primitives used by each functional unit:

| Functional Unit       | LUT   | DSP | BRAM |
|-----------------------|-------|-----|------|
| – NEON ALU            | 10968 | 275 | 0    |
| – arithmetic-ops      | 640   | 64  | 0    |
| – boolean-ops         | 388   | 4   | 0    |
| – comparison-ops      | 926   | 0   | 0    |
| – shift-ops           | 718   | 0   | 0    |
| – multiply-ops        | 819   | 88  | 0    |
| – miscellaneous-ops   | 7699  | 119 | 0    |
| – NEON Register File  | 0     | 0   | 4    |
| – NEON Unit           | 11360 | 275 | 4    |

| Metric                    | Hardened NEON | RTL NEON      |
|---------------------------|---------------|---------------|
| FPGA Resources: LUT       | 1040          | 11360 (10.9×) |
| FPGA Resources: DSP       | 8             | 275 (34.4×)   |
| Operating Frequency (MHz) | 650           | 164 (3.9×)    |
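The gap factors in parentheses follow directly from the table entries. A minimal sketch of that arithmetic, where the dictionary layout and function name are illustrative and not taken from the slides:

```python
# Figures copied from the table above; the data layout is an assumption.
hardened = {"lut": 1040, "dsp": 8, "mhz": 650}
rtl = {"lut": 11360, "dsp": 275, "mhz": 164}


def gap(hard, soft):
    """Area overheads are soft/hardened; the clock gap is hardened/soft."""
    return {
        "lut": soft["lut"] / hard["lut"],
        "dsp": soft["dsp"] / hard["dsp"],
        "clock": hard["mhz"] / soft["mhz"],
    }


ratios = gap(hardened, rtl)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```

The clock ratio evaluates to roughly 3.96, which the slide lists as 3.9×.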





#### Media/Security Application Profiling







Soft NEON: ISA Subsetting, Optimization, Operation Folding





#### Closing the Soft NEON / Hardened NEON Gap (Area)







#### Closing the Soft NEON / Hardened NEON Gap (Latency)







#### Closing the Soft NEON / Hardened NEON Gap (Performance)

Execution time ($\mu s$):

| Application | Hardened CPU + Hardened NEON | Hardened CPU + CPU Interlay |
|-------------|------------------------------|-----------------------------|
| adpcm       | 259                          | 285 (1.10×)                 |
| gsm         | 65                           | 69 (1.06×)                  |
| jpeg        | 5411                         | 6096 (1.13×)                |
| motion      | 72                           | 74 (1.03×)                  |
| aes         | 202                          | 213 (1.05×)                 |
| sha         | 597                          | 793 (1.33×)                 |
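The slowdown factors in the table are the Interlay execution time divided by the hardened-NEON execution time. The sketch below recomputes them. The geometric mean it also computes is an added summary for illustration and is not a figure reported in the slides.

```python
from math import prod

# Execution times in microseconds, copied from the table above:
# application -> (hardened NEON, CPU Interlay)
times_us = {
    "adpcm": (259, 285),
    "gsm": (65, 69),
    "jpeg": (5411, 6096),
    "motion": (72, 74),
    "aes": (202, 213),
    "sha": (597, 793),
}


def slowdowns(times):
    """Per-application slowdown of the Interlay relative to the hardened NEON."""
    return {app: interlay / hard for app, (hard, interlay) in times.items()}


def geometric_mean(values):
    values = list(values)
    return prod(values) ** (1.0 / len(values))


ratios = slowdowns(times_us)
geo = geometric_mean(ratios.values())
for app, r in ratios.items():
    print(f"{app}: {r:.2f}x")
```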





# **Custom Interlay Examples – Bit Manipulation**







# **Custom Interlay Examples - Crypto**

 We implemented a library with all major crypto primitives (AES, DES, SHA, Montgomery multiplication, TRNG)

| Module       | Implementation | LUTs | BRAMs |
|--------------|----------------|------|-------|
| AES encrypt  | 2-stage-L      | 809  | 0     |
| AES encrypt  | 2-stage-B      | 172  | 4     |
| AES encrypt  | 3-stage-L      | 899  | 0     |
| AES encrypt  | 3-stage-B      | 397  | 4     |
| AES decrypt  | 2-stage-L      | 1079 | 0     |
| AES decrypt  | 2-stage-B      | 630  | 4     |
| DES          | structural     | 96   | 0     |
| SHA1         | 2-stage        | 235  | 0     |
| SHA1         | 3-stage        | 295  | 0     |
| SHA2         | 3-stage        | 365  | 0     |
| Montg. mult. | FSM datapath   | 1014 | 0     |
| TRNG         | structural     | 544  | 0     |







# **Custom Interlay Examples - Crypto**

- Throughput evaluation for Rijndael implementation of MiBench (ARM at 650 MHz coupled with a Zynq-fabric interlay)
- Based on cycle-accurate simulation in gem5







# Automatically Generated Interlay CIs








ADPCM

| CI     | OA LUT | OA DSP | OA Fit | OP LUT | OP DSP | OP Fit | Speedup OA | Speedup OP | Exec % |
|--------|--------|--------|--------|--------|--------|--------|------------|------------|--------|
| upzero | 337    | 4      | ✓      | 1087   | 24     | ×      | 1.4        | 3.3        | 19.1   |
| filtez | 281    | 8      | ✓      | 143    | 24     | ×      | 1.1        | 3.8        | 10.4   |
| uppol2 | 217    | 8      | ✓      | 300    | 8      | ✓      | 2.8        | 3.1        | 4.3    |
| quantl | 108    | 2      | ✓      | 1097   | 54     | ×      | 1.5        | 14.6       | 4.2    |
| uppol1 | 234    | 4      | ✓      | 296    | 4      | ✓      | 2.2        | 2.9        | 3.4    |

AES

| CI                          | OA LUT | OA DSP | OA Fit | OP LUT | OP DSP | OP Fit | Speedup OA | Speedup OP | Exec % |
|-----------------------------|--------|--------|--------|--------|--------|--------|------------|------------|--------|
| AddRoundKey_InversMixColumn | 2276   | 4      | ✓      | 4971   | 4      | ×      | 8.9        | 18.1       | 49.2   |
| MixColumn_AddRoundKey       | 2156   | 4      | ✓      | 2156   | 4      | ✓      | 10.3       | 10.3       | 18.8   |
| KeySchedule                 | 2310   | 1      | ✓      | 4284   | 1      | ×      | 3.9        | 4.6        | 18.7   |
| ByteSub_ShiftRow            | 1291   | 0      | ✓      | 1291   | 0      | ✓      | 20.4       | 20.4       | 3.1    |
| InversShiftRow_ByteSub      | 1197   | 0      | ✓      | 1197   | 0      | ✓      | 20.4       | 20.4       | 3.1    |
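To relate the per-CI speedups above to a whole-application figure, one option is an Amdahl-style estimate. The sketch below is illustrative only: it assumes that "Exec %" is each function's share of total application run time, that the CIs do not overlap, and that reconfiguration and call overheads are negligible. None of this is stated in the slides, and the resulting number is not a reported result.

```python
def overall_speedup(cis):
    """Amdahl-style estimate from a list of (exec_fraction, ci_speedup) pairs."""
    accelerated = sum(fraction for fraction, _ in cis)
    if accelerated > 1.0:
        raise ValueError("execution fractions exceed 100%")
    remaining = 1.0 - accelerated  # time spent in code that is not accelerated
    return 1.0 / (remaining + sum(fraction / s for fraction, s in cis))


# (Exec % as a fraction, OA speedup) for the five ADPCM CIs listed above
adpcm_oa = [(0.191, 1.4), (0.104, 1.1), (0.043, 2.8), (0.042, 1.5), (0.034, 2.2)]
print(f"ADPCM with OA variants: {overall_speedup(adpcm_oa):.2f}x")
```

Because the accelerated functions cover only part of the run time, the overall estimate stays well below the individual CI speedups.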








# **Interlays in Dual-Processor Systems**







# Prototype: Partially Run-Time Reconfigurable Shared RISC-V Processing System

SIMD engine:

- 128-Bit SIMD ALU engine
- Placed logically inline with the scalar ALUs
- Shared amongst both CPUs

Additionally:

- Includes its own 16-entry Vector RF
- Shares the decoder unit with scalar CPUs
- Vector LD/ST operations are executed serially as four 32-bit operations
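The serial vector load/store behaviour can be pictured as splitting each 128-bit vector into four 32-bit memory words. The sketch below is a software model only. The word-addressed toy memory and the low-word-first ordering are assumptions, since the slides do not specify the ordering.

```python
WORD_BITS = 32
WORDS_PER_VECTOR = 4  # 128-bit vector = 4 x 32-bit words
WORD_MASK = (1 << WORD_BITS) - 1


def vector_store(memory, base, vector):
    """Store a 128-bit value as four sequential 32-bit writes (low word first, assumed)."""
    for i in range(WORDS_PER_VECTOR):
        memory[base + i] = (vector >> (i * WORD_BITS)) & WORD_MASK


def vector_load(memory, base):
    """Reassemble a 128-bit value from four sequential 32-bit reads."""
    value = 0
    for i in range(WORDS_PER_VECTOR):
        value |= (memory[base + i] & WORD_MASK) << (i * WORD_BITS)
    return value


memory = [0] * 16  # word-addressed toy memory
v = 0x0123456789ABCDEF_FEDCBA9876543210
vector_store(memory, 4, v)
```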



Dual-Processor System with a Shared 128-bit SIMD Engine






# **Prototype System Implementation**



3. Beckhoff, Christian, Dirk Koch, and Jim Torresen. "Go ahead: A partial reconfiguration framework." *International Symposium on Field-Programmable Custom Computing Machines (FCCM),* IEEE, 2012.





#### Static/Dynamic Interface Wiring Arrangement







#### Partially Run-Time Reconfigurable Subsystem

- PR region set to 2082 LUTs → 694 LUTs per slot
- Each slot is 2 CLB columns wide
- PR region is reconfigured through the Zynq Internal Configuration Access Port (ICAP)
  - Considering ICAP maximum throughput (400 MB/s):
    - → Estimated slot reconfiguration time: 295 µs
- Module configuration prefetching can be used to hide reconfiguration latency
  - Prefetching instructions inserted in the code
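The reconfiguration-time estimate is bitstream size divided by ICAP throughput. The slides do not give the partial bitstream size, so the sketch below works backwards from the 295 µs figure, assuming 1 MB = 10^6 bytes and no protocol overhead. The prefetch-distance helper is also an illustration, using the 96 MHz clock of the prototype's static subsystem.

```python
ICAP_BYTES_PER_SECOND = 400e6  # 400 MB/s maximum ICAP throughput (from the slide)


def reconfiguration_time_us(bitstream_bytes, bytes_per_second=ICAP_BYTES_PER_SECOND):
    """Time to stream a partial bitstream through the ICAP, in microseconds."""
    return bitstream_bytes / bytes_per_second * 1e6


def implied_bitstream_bytes(time_us, bytes_per_second=ICAP_BYTES_PER_SECOND):
    """Bitstream size implied by a given reconfiguration time (inverse of the above)."""
    return time_us * 1e-6 * bytes_per_second


def prefetch_lead_cycles(time_us, cpu_mhz):
    """CPU cycles that elapse while one slot is being reconfigured."""
    return time_us * cpu_mhz


slot_bytes = implied_bitstream_bytes(295)  # roughly 118 kB per slot (derived, not reported)
lead = prefetch_lead_cycles(295, 96)       # cycles a prefetch would need to be issued ahead
print(f"implied slot bitstream: {slot_bytes:.0f} bytes, prefetch lead: {lead:.0f} cycles")
```

Under these assumptions, a prefetch instruction would have to be issued tens of thousands of cycles before the first use of the module to fully hide the latency.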





### System Implementation



#### System Floorplan



Zynq Z-7020 (Artix-7 FPGA Fabric)







#### Implemented Custom SIMD instructions

- Resource utilization and latency obtained from Vivado synthesis report
- Kernel speedup is calculated from assembly code analysis and execution traces generated from simulation







#### Systems Floorplan Comparison

#### Soft Dual Core RISC-V Prototype



- Realistic approach
- Fits existing SoC architecture
- Enables further research

#### ARM Cortex-A9 SoC





# Conclusions

- Interlays can be a solution to provide enhanced flexibility and performance in SoCs
- We described implementation details (Interlay integration/management) and examples of CIs
- With this work we aim to stimulate research in the field of hybrid systems that can meet the challenges faced by existing SoC architectures





# Future Work

- Customization of the Interlay fabric
  - Coarse/Fine grain mix, enhancing DSP blocks
- Enhancement of the emulation system
  - Scaling vector interface, number of Interlay slots
- Exploration of further application domains
  - Machine learning (customized precision)
- Building additional ecosystem components
  - Custom compilers







# **Thank You !**

# **Questions?**







# Backup





### Adding architectural support for the SIMD engine

**Slot-Based Resource Sharing Infrastructure**: Propagates operands from the CPUs to the slots and results from individual slots back to the CPUs







# SIMD Engine Configuration Controller: Controls (re)configuration of the slots

- Used to time-share the SIMD engine amongst both CPUs
- Controls the integration of vector instructions at run-time
- Leverages vector compute and I/O bursts in each CPU

#### **SIMD Instructions**

- Identified offline through profiling
- Implemented only if they provide substantial kernel speedups:

$$S_k = \frac{t_{kISA}}{t_{kSIMD}}$$

- And only if they are invoked a significant number of consecutive times, so that the reconfiguration overhead is amortized:

$$t_{SIMD} \cdot n > t_{RCFG}$$
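The two selection criteria can be written as small helper functions. This is a sketch: the function names and the example timings are hypothetical, and the slides do not give a numeric threshold for "substantial" speedup, so none is encoded here.

```python
import math


def kernel_speedup(t_k_isa, t_k_simd):
    """S_k = t_kISA / t_kSIMD: base-ISA kernel time over SIMD kernel time."""
    return t_k_isa / t_k_simd


def amortizes_reconfiguration(t_simd, n, t_rcfg):
    """Criterion from the slide: t_SIMD * n > t_RCFG (all times in the same unit)."""
    return t_simd * n > t_rcfg


def min_invocations(t_simd, t_rcfg):
    """Smallest integer n that satisfies t_simd * n > t_rcfg."""
    return math.floor(t_rcfg / t_simd) + 1


# Hypothetical example: 0.5 us per SIMD invocation, 295 us slot reconfiguration
n_min = min_invocations(0.5, 295.0)
print(f"at least {n_min} consecutive invocations needed")
```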





# Programmable Instruction Decoder: Enables changing instruction settings at run-time

- An extension of the Decode Unit
- Handles SIMD instruction information:
  - 1) Target slot
  - 2) SIMD instruction cycles
  - 3) Operation folding factor
- These settings can be overwritten at run-time
- Settings are configured as an OS task

| SIMD Instr Encoding | CPU 0 Instruction Settings         | CPU 1 Instruction Settings         |
|---------------------|------------------------------------|------------------------------------|
| 0                   | Trap                               | Slot 0, 2 Cycles, $\lambda = 1\times$ |
| 1                   | Slot 1, 1 Cycle, $\lambda = 4\times$  | Trap                               |
| 2                   | Slot 2, 2 Cycles, $\lambda = 2\times$ | Slot 2, 2 Cycles, $\lambda = 2\times$ |
| 3                   | Trap                               | Trap                               |
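A software model of the per-CPU decoder table makes the lookup and the run-time overwrite explicit. The class and field names below are hypothetical. Only the table contents come from the slide, and raising an exception stands in for whatever trap mechanism the hardware uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstrSettings:
    slot: int      # target slot
    cycles: int    # SIMD instruction cycles
    folding: int   # operation folding factor (lambda)


class SimdTrap(Exception):
    """Raised when an encoding has no slot assigned (the 'Trap' table entries)."""


class ProgrammableDecoder:
    def __init__(self, num_encodings: int = 4) -> None:
        # None models a 'Trap' entry
        self.table: List[Optional[InstrSettings]] = [None] * num_encodings

    def program(self, encoding: int, settings: Optional[InstrSettings]) -> None:
        """Overwrite one entry at run-time (for example from an OS task)."""
        self.table[encoding] = settings

    def decode(self, encoding: int) -> InstrSettings:
        settings = self.table[encoding]
        if settings is None:
            raise SimdTrap(f"SIMD encoding {encoding} is not configured")
        return settings


# Table contents from the slide
cpu0 = ProgrammableDecoder()
cpu0.program(1, InstrSettings(slot=1, cycles=1, folding=4))
cpu0.program(2, InstrSettings(slot=2, cycles=2, folding=2))

cpu1 = ProgrammableDecoder()
cpu1.program(0, InstrSettings(slot=0, cycles=2, folding=1))
cpu1.program(2, InstrSettings(slot=2, cycles=2, folding=2))
```

Keeping a separate table per CPU lets the same encoding map to different slots, or to a trap, depending on which CPU issues the instruction.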





#### Case Study

- Divided into 2 parts:
  - 1) Each CPU executes its own program using its corresponding memory space
    - Kernel for mixing 16-bit PCM audio signals
    - Kernel for computing Sum of Absolute Differences values (motion estimation)
    - Kernel for computing 8-bit Add-Compare-Select values (Viterbi decoder)
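For reference, the three kernels can be written in scalar form as below. These are generic textbook formulations and not the code used in the case study. Saturating addition for the PCM mix, unsigned 8-bit saturation in the Add-Compare-Select step, and minimum-metric selection are all assumptions.

```python
def mix_pcm16(a, b):
    """Mix two 16-bit PCM streams sample by sample with saturating addition (assumed)."""
    lo, hi = -32768, 32767
    return [max(lo, min(hi, x + y)) for x, y in zip(a, b)]


def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(x - y) for x, y in zip(block_a, block_b))


def acs8(metric0, branch0, metric1, branch1):
    """8-bit Add-Compare-Select: returns (survivor metric, decision bit).

    Assumes unsigned 8-bit saturation and that the smaller metric survives.
    """
    c0 = min(255, metric0 + branch0)
    c1 = min(255, metric1 + branch1)
    return (c0, 0) if c0 <= c1 else (c1, 1)
```

Each kernel applies the same operation independently to many narrow elements, which is the pattern a 128-bit SIMD datapath can process several lanes at a time.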







#### Static Subsystem

- Includes scalar CPUs, memory subsystem
- Operating frequency 96 MHz
- Component resource breakdown:

FPGA primitives used by each system component:

| System Component                  | LUT  | DSP | BRAM |
|-----------------------------------|------|-----|------|
| Soft Dual Processor RISC-V System | 6917 | 8   | 36   |
| RISC-V CPU 0                      | 2728 | 4   | 10   |
| RISC-V CPU 1                      | 2728 | 4   | 10   |
| Common Infrastructure             | 1461 | 0   | 16   |





#### Classification of Reconfigurable Processor (RP) Architectures








#### Leverage built-in custom data type libraries for developing HLS-friendly C code







#### Vector Accelerators NEON








# **Custom Instructions for CPU Interlays**

Design Aspects:

- Interface
- Area
- Energy
- Granularity

Instruction Set Extension Problem

#### **Explored Approaches**

| Manual CI Generation                       | Automated CI Generation                                         |
|--------------------------------------------|-----------------------------------------------------------------|
| Relatively slow development time           | Enhanced design productivity                                    |
| Full control over the generated RTL code   | The generated RTL code can be influenced through C coding style |
| Most suitable for relatively small kernels | Suitable for relatively large functions                         |
| Hardware design expertise is required      | A hardware design background is not necessarily required        |