
Why is only half of the register bank accessible in the VFP S0-S31 view?

I am optimizing a simple L2-distance calculation targeting the Cortex-A7. Initially, I chose to unroll the calculation loop as shown below:

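```c
#include <stddef.h>
#include <stdint.h>

// Squared L2 distance between each row of mat (m x n) and vec.
// Assumes n is even, since the inner loop consumes two elements per iteration.
void l2_naive_f32(float *mat, uint32_t m, uint32_t n, float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        // Two independent accumulators, one per unrolled lane.
        float res0 = 0;
        float res1 = 0;
        for (size_t j = 0; j < n; j += 2) {
            float t0 = mat[i * n + j] - vec[j];
            float t1 = mat[i * n + j + 1] - vec[j + 1];

            t0 *= t0;
            t1 *= t1;

            res0 += t0;
            res1 += t1;
        }
        dst[i] = res0 + res1;
    }
}
```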
I've observed that on the target Cortex-A7 CPU, unrolling 8 times reaches peak performance, while unrolling 16 times causes massive register spilling (godbolt.org/.../sdzovT73P).
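For illustration, an 8-way unrolled variant could look roughly like the sketch below. It is not the exact code behind the Godbolt link: the name `l2_unroll8_f32`, the `const` qualifiers, and the requirement that `n` be a multiple of 8 are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-way unrolled variant (illustrative, not the original code).
 * Assumes n is a multiple of 8. */
void l2_unroll8_f32(const float *mat, uint32_t m, uint32_t n,
                    const float *vec, float *dst) {
    assert(n % 8 == 0);
    for (size_t i = 0; i < m; i++) {
        const float *row = mat + i * n;
        float acc[8] = {0}; /* eight independent accumulators */
        for (size_t j = 0; j < n; j += 8) {
            /* Fixed trip count of 8, so an optimizing compiler can
             * unroll this inner loop completely. */
            for (size_t k = 0; k < 8; k++) {
                float t = row[j + k] - vec[j + k];
                acc[k] += t * t;
            }
        }
        /* Pairwise reduction of the eight partial sums. */
        dst[i] = ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
                 ((acc[4] + acc[5]) + (acc[6] + acc[7]));
    }
}
```

Each accumulator, plus the temporaries feeding it, needs its own single-precision register, which is why the unroll factor is tied so directly to how many S registers are available.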

In the [armv7-reference-manual](developer.arm.com/.../Advanced-SIMD-and-Floating-point-Extension-registers), I learned that even though the VFP register set contains thirty-two 64-bit doubleword registers, the single-precision view exposes only thirty-two 32-bit single-word registers, S0-S31. Thus only half of the register bank is accessible in this view.

[![enter image description here][1]][1]
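The aliasing described above can be modeled in C as a union of the two views. This is only a mental model of the layout, not a real API: the type `vfp_bank_model` and the helpers `d_index_for_s` and `d_has_s_alias` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-doubleword extension register bank. */
typedef union {
    uint64_t d[32]; /* D0-D31: doubleword view, covers all 256 bytes */
    uint32_t s[32]; /* S0-S31: single-word view, covers the first 128 bytes */
} vfp_bank_model;

static_assert(sizeof(vfp_bank_model) == 256, "bank is 32 x 64 bits");
static_assert(sizeof(((vfp_bank_model *)0)->s) == 128,
              "S view spans half the bank");

/* S<2n> and S<2n+1> are the two halves of D<n>, so S<k> lives in D<k/2>. */
static unsigned d_index_for_s(unsigned k) { return k / 2; }

/* Returns 1 if D<n> overlaps some S register, i.e. n is in 0..15. */
static int d_has_s_alias(unsigned n) { return n < 16; }
```

In this model S31 maps to D15, and D16-D31 have no single-precision alias, which is the "half of the bank" that the question is about.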

The manual states this as a fact without explaining it. My question is: what is the reason for this design? Why is only half of the register bank accessible in the VFP S0-S31 view?

[1]: i.stack.imgur.com/NGInP.png
