I am optimizing a simple L2-distance calculation program targeting the Cortex-A7. Initially, I chose to unroll the calculation loop as shown below:
```c
#include <stddef.h>
#include <stdint.h>

void l2_naive_f32(float *mat, uint32_t m, uint32_t n, float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        /* Two independent accumulators; the loop assumes n is even. */
        float res0 = 0;
        float res1 = 0;
        for (size_t j = 0; j < n; j += 2) {
            float t0 = mat[i * n + j] - vec[j];
            float t1 = mat[i * n + j + 1] - vec[j + 1];
            t0 *= t0;
            t1 *= t1;
            res0 += t0;
            res1 += t1;
        }
        dst[i] = res0 + res1;
    }
}
```

I've observed that on the target Cortex-A7 CPU, unrolling 8 times reaches peak performance, while unrolling 16 times causes massive register spilling: godbolt.org/.../sdzovT73P.
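For reference, an 8-way unrolled variant might look roughly like the sketch below. The name `l2_unroll8_f32`, the accumulator array, and the requirement that `n` be a multiple of 8 are assumptions made for illustration; this is not the exact code behind the godbolt link.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-way unrolled variant; assumes n is a multiple of 8. */
void l2_unroll8_f32(const float *mat, uint32_t m, uint32_t n,
                    const float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        const float *row = mat + i * n;
        float acc[8] = {0};  /* eight independent accumulators */
        for (size_t j = 0; j < n; j += 8) {
            /* Fixed trip count, so an optimizing compiler can unroll this. */
            for (size_t k = 0; k < 8; k++) {
                float t = row[j + k] - vec[j + k];
                acc[k] += t * t;
            }
        }
        /* Combine the partial sums pairwise. */
        dst[i] = ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
                 ((acc[4] + acc[5]) + (acc[6] + acc[7]));
    }
}
```

Each accumulator and each temporary wants its own S register, which is why the number of addressable single-precision registers limits how far the unrolling can go before spills appear.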
In this [armv7-reference-manual](developer.arm.com/.../Advanced-SIMD-and-Floating-point-Extension-registers), I learned that even though the VFP register bank contains thirty-two 64-bit doubleword registers (D0-D31), the single-precision view exposes only thirty-two 32-bit single-word registers, S0-S31. Thus only half of the bank is accessible in this view.
[![enter image description here][1]][1]
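To make the mapping concrete, here is a toy C model of how I understand the aliasing: S<2n> is the low word of D<n> and S<2n+1> is the high word, so S0-S31 overlay D0-D15 only. The type `reg_bank_model` and the helpers `s_to_d`, `s_half`, and `write_s` are hypothetical names for this sketch; it models the layout, not the hardware.

```c
#include <stdint.h>

/* Toy model of the extension register bank: 32 doublewords, D0-D31. */
typedef struct { uint64_t d[32]; } reg_bank_model;

/* Index of the D register that holds single-precision register S<s>. */
static unsigned s_to_d(unsigned s) { return s / 2; }

/* 0 if S<s> is the low word of its D register, 1 if it is the high word. */
static unsigned s_half(unsigned s) { return s % 2; }

/* Write a 32-bit value into S<s>, leaving the other half of D<s/2> intact. */
static void write_s(reg_bank_model *rb, unsigned s, uint32_t v) {
    unsigned shift = 32u * s_half(s);
    uint64_t mask = (uint64_t)0xFFFFFFFFu << shift;
    rb->d[s_to_d(s)] = (rb->d[s_to_d(s)] & ~mask) | ((uint64_t)v << shift);
}
```

In this model the highest single-precision register, S31, lands in the high word of D15, and nothing in the S view can reach D16-D31.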
The manual simply states this as a fact without explaining it. My question is: what is the reason for this design? Why is only half of the register bank accessible in the VFP S0-S31 view?
[1]: i.stack.imgur.com/NGInP.png
If you use NEON (Advanced SIMD), you can use the full FPU register bank, viewed as D0-D31 or Q0-Q15.
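As a rough sketch of what that could look like for the loop in the question, the version below uses NEON intrinsics from `arm_neon.h`. The function name `l2_neon_f32` and the requirement that `n` be a multiple of 4 are assumptions for illustration. The scalar branch exists only so the file still builds where NEON is not enabled; on a Cortex-A7 you would typically need something like `-mfpu=neon-vfpv4` with GCC. Note that ARMv7 NEON flushes denormals to zero, so results can differ slightly from the VFP version.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical NEON variant; assumes n is a multiple of 4. */
void l2_neon_f32(const float *mat, uint32_t m, uint32_t n,
                 const float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        const float *row = mat + i * n;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        float32x4_t acc = vdupq_n_f32(0.0f);   /* four partial sums in one Q register */
        for (size_t j = 0; j < n; j += 4) {
            float32x4_t d = vsubq_f32(vld1q_f32(row + j), vld1q_f32(vec + j));
            acc = vmlaq_f32(acc, d, d);        /* acc += d * d, lane by lane */
        }
        /* Horizontal add of the four lanes. */
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        s = vpadd_f32(s, s);
        dst[i] = vget_lane_f32(s, 0);
#else
        /* Scalar fallback so the sketch compiles without NEON. */
        float sum = 0.0f;
        for (size_t j = 0; j < n; j++) {
            float t = row[j] - vec[j];
            sum += t * t;
        }
        dst[i] = sum;
#endif
    }
}
```

Because each Q register holds four floats, unrolling this loop with several Q accumulators gives the compiler the whole bank to work with rather than only S0-S31.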