Why is only half of the register bank accessible in the VFP S0-S31 view?

I am optimizing a simple L2-distance calculation program targeting the Cortex-A7. Initially, I chose to unroll the calculation loop as shown below:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of squared differences per row, unrolled by 2.
 * Note: assumes n is even; an odd n reads one element past the end. */
void l2_naive_f32(float *mat, uint32_t m, uint32_t n, float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        float res0 = 0;
        float res1 = 0;
        for (size_t j = 0; j < n; j += 2) {
            float t0 = mat[i * n + j] - vec[j];
            float t1 = mat[i * n + j + 1] - vec[j + 1];

            t0 *= t0;
            t1 *= t1;

            res0 += t0;
            res1 += t1;
        }
        dst[i] = res0 + res1;
    }
}
```
I've observed that on the target Cortex-A7 CPU, unrolling 8 times reaches peak performance, while unrolling 16 times causes massive register spilling: godbolt.org/.../sdzovT73P.
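
For illustration, here is a minimal sketch of what an 8-way unroll of this loop could look like. The function name `l2_unroll8_f32`, the accumulator array, and the assumption that `n` is a multiple of 8 are hypothetical and are not taken from the linked Godbolt code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: sum of squared differences per row with 8 independent
 * partial sums. Assumption (not from the original code): n is a
 * multiple of 8, so no tail handling is needed. */
void l2_unroll8_f32(const float *mat, uint32_t m, uint32_t n,
                    const float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        const float *row = mat + (size_t)i * n;
        float acc[8] = {0};              /* 8 independent accumulators */

        for (size_t j = 0; j < n; j += 8) {
            /* Constant trip count: compilers usually unroll this fully at
             * -O2 or higher, but the generated assembly should be checked. */
            for (size_t k = 0; k < 8; k++) {
                float t = row[j + k] - vec[j + k];
                acc[k] += t * t;
            }
        }

        /* Reduce the partial sums into the final result for this row. */
        float sum = 0.0f;
        for (size_t k = 0; k < 8; k++) {
            sum += acc[k];
        }
        dst[i] = sum;
    }
}
```

In scalar single-precision code, eight accumulators plus their temporaries fit within 32 registers, whereas a 16-way version needs roughly twice as many live values, which may explain the spilling.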

In the [ARMv7 reference manual](developer.arm.com/.../Advanced-SIMD-and-Floating-point-Extension-registers), I learned that even though the VFP register set contains thirty-two 64-bit doubleword registers (D0-D31), the single-precision view exposes only thirty-two 32-bit single-word registers, S0-S31, which alias D0-D15. Thus only half of the set is accessible in this view.

[![enter image description here][1]][1]
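
To make the aliasing concrete, here is a toy software model of the mapping, not real register access. The type `vfp_bank_model` and the function `write_s` are hypothetical names made up for this sketch; the only architectural fact it relies on is that S<2n> is the least significant word of D<n> and S<2n+1> is the most significant word.

```c
#include <stdint.h>

/* Toy model of the extension register bank: 32 doubleword registers,
 * D0-D31. This is plain memory, not the real hardware registers. */
typedef struct {
    uint64_t d[32];
} vfp_bank_model;

/* Model a write to S<idx>. S<2n> is the low word of D<n> and S<2n+1>
 * is the high word. Returns 0 on success, -1 if S<idx> does not exist. */
int write_s(vfp_bank_model *bank, unsigned idx, uint32_t value) {
    if (idx > 31) {
        /* Only S0-S31 are named, so D16-D31 cannot be reached this way. */
        return -1;
    }
    unsigned shift = (idx & 1u) * 32u;               /* 0 = low, 32 = high */
    uint64_t mask = (uint64_t)0xFFFFFFFFu << shift;  /* word being replaced */
    bank->d[idx / 2] = (bank->d[idx / 2] & ~mask) | ((uint64_t)value << shift);
    return 0;
}
```

In this model, the last two single-precision registers, S30 and S31, land in D15, and nothing in the S view touches D16 and above.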

The manual simply states this as a fact. My question is: what is the reason for this design? Why is only half of the register bank accessible in the VFP S0-S31 view?

[1]: i.stack.imgur.com/NGInP.png
