



# Cost Modelling for Vectorization on ARM

Angela Pohl, Biagio Cosenza and Ben Juurlink ARM Research Summit 2018





# **Challenges of Auto-Vectorization in Compilers**

- 1. Is it **possible** to vectorize the code?
  - Passes: Loop Level Vectorization (LLV), Superword-Level Parallelism (SLP) ...
  - Algorithms: loop transformations, partial vectorization, SLP padding ...
  - Challenges: dependences, missing instructions ...
- 2. Is it **beneficial** to vectorize the code?
  - Transformations might add overhead
  - Code can run slower afterwards









# Short Introduction to Cost Modelling







# State of the Art Analysis

- LLV pass of LLVM 6.0 on ARMv8
- TSVC Benchmark: 151 basic loop patterns
- Vectorization with overwritten cost model
- No unrolling, no interleaving

| Metric           | Baseline |
|------------------|----------|
| Vectorized loops | 61       |
| $f\ominus$       | 14       |
| $f \oplus$       | 0        |
| $t_{norm}$       | 117.2    |







# Linear Modelling Approach

• Formulate basic block as linear equation for each test pattern

$$c_{vec} = \sum n_i c_i$$

• Determine basic block target cost based on measurement

$$c_{target} = \frac{c_{scalar}}{S_{meas}} = \sum n_i c_{i'}$$

• Apply mathematical fitting to set of linear equations





# Linear Modelling: Example

for (int i = 0; i < LEN; i++){  
 a[i] = b[k] + 1.0;  
}  
for (int i = 0; i < LEN; i++){  
 prod \*= a[i];  
}
$$c_{scalar} = 8$$
  $S_{meas} = 2.76 \Rightarrow c_{target} = 2.89$ 

$$\mathbf{2.89} = \mathbf{1} \times c_{load} + \mathbf{1} \times c_{fadd} + \mathbf{0} \times c_{fmult} + \cdots$$

$$\mathbf{2.61} = \mathbf{1} \times c_{load} + \mathbf{0} \times c_{fadd} + \mathbf{1} \times c_{fmult} + \cdots$$





# Modelling Speedup Instead of Cost

#### Problem:

- Target costs vary significantly, i.e. data has to be fitted across large interval
- But: fitting benefits from smaller target intervals

#### Idea:

- Model speedup instead of cost
- Limits interval to (0, vectorization factor)
- Linear equations become  $S_{meas} = \sum n_i w_i$

with  $n_i$  = numer of instructions of same type,  $w_i$  = corresponding weight





## **Results: Fitted for Speedup**



- L2: Least Squares (minimizes Euclidian L2 Norm)
- NNLS: Non-Negative Least Squares (all coefficients > 0)





# Adding Block Composition as Feature

#### Problem:

- Current cost model only cares about individual instructions
- Arithmetic intensity can have major impact on speedup, e.g. if code is memory bound

#### Idea:

• Replace count of instruction type with overall percentage, e.g. 20% load, 10% cmp, ...

$$S_{meas} = \sum \frac{n_i}{n_{total}} w_i$$







# **Results: Fitted with Rated Instruction Count**







# Leave One Out Cross Validation: NNLS

- Fitting is done with whole training data set except for one kernel
- Kernel is then predicted with that model







# Conclusion

- Compilers need more accurate cost models to avoid mispredictions
- Aligned cost models enable comparison of different transformation options
- With our refined cost model we
  - Increase the correlation between estimated and measured speedup
  - Decrease the number of false predictions
  - Lower execution times
- Next steps: add more code features and tests to cover all instruction types







# THANK YOU





# BACKUP





# Why a More Accurate Cost Model?

- 1. Improve classification results to decide whether code should be vectorized or not.
- 2. Enable comparison of different transformations with each other, not just to baseline.

|                      | LLV  | SLP  |
|----------------------|------|------|
| Predicted<br>Speedup | 0.96 | 1.00 |
| Measured<br>Speedup  | 1.13 | 1.20 |

Example code run on Intel i5





# L2- LOOCV Validation Results







# State of the Art Analysis – x86



- TSVC benchmark: 151 basic loop patterns
- SLP vectorization applied after loop unrolling
- Results shown for Intel Xeon E5 with AVX2



### Results: Fitted for Cost – x86



- Correlation between estimated and measured speedup in LLVM's LLV pass after fitting
- L2: Least Squares (minimizes Euclidian L2 Norm)
- NNLS: Non-Negative Least Squares (all coefficients > 0)
- SVR: Support Vector Regression Cost Modelling for Vectorization on ARM, A. Pohl et al.





# Results: Fitted for Speedup – x86



- All three approaches improve correlation further
- False negatives reduced (L2) or eliminated (NNLS, SVR)
- But: small increase in false positives

