
Why is only half of the register bank accessible in VFP's S0-S31 view?

I am optimizing a simple L2-distance calculation targeting the Cortex-A7. Initially, I chose to unroll the calculation loop as shown below:

```c
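#include <stddef.h>
#include <stdint.h>

/* Squared L2 distance between each of the m rows of mat (m x n) and vec.
 * Unrolled by 2, so n is assumed to be even. */
void l2_naive_f32(float *mat, uint32_t m, uint32_t n, float *vec, float *dst) {
    for (size_t i = 0; i < m; i++) {
        /* Two independent accumulators to shorten the dependency chain. */
        float res0 = 0;
        float res1 = 0;
        for (size_t j = 0; j < n; j += 2) {
            float t0 = mat[i * n + j] - vec[j];
            float t1 = mat[i * n + j + 1] - vec[j + 1];

            t0 *= t0;
            t1 *= t1;

            res0 += t0;
            res1 += t1;
        }
        dst[i] = res0 + res1;
    }
}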
```
I've observed that on the target Cortex-A7 CPU, unrolling 8 times reaches peak performance, while unrolling 16 times causes massive register spilling: godbolt.org/.../sdzovT73P.

In this [armv7-reference-manual](developer.arm.com/.../Advanced-SIMD-and-Floating-point-Extension-registers), I learned that even though the VFP register set contains thirty-two 64-bit doubleword registers (D0-D31), the single-precision view exposes only thirty-two 32-bit single-word registers, S0-S31, which overlay D0-D15. Thus only half of the set is accessible in this view.

[![enter image description here][1]][1]
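The aliasing described above can be sketched in C. This is only an illustration of the register mapping, not anything taken from the manual: the type `s_alias_t` and the helpers `s_to_d` and `d_has_s_view` are hypothetical names introduced here.

```c
#include <assert.h>

/* Hypothetical helper type: which D register an S register overlays,
 * and whether it is the upper or lower 32-bit word of that D register. */
typedef struct {
    unsigned d_index;   /* index of the aliased D register */
    unsigned high_half; /* 0 = low word, 1 = high word */
} s_alias_t;

/* S<2n> is the low word of D<n>, S<2n+1> is the high word of D<n>. */
static s_alias_t s_to_d(unsigned s_index) {
    assert(s_index < 32); /* only S0-S31 exist */
    s_alias_t a = { s_index / 2, s_index % 2 };
    return a;
}

/* Only D0-D15 are reachable through the S0-S31 view. */
static int d_has_s_view(unsigned d_index) {
    return d_index < 16;
}
```

For example, `s_to_d(31)` lands in the high word of D15, and `d_has_s_view(16)` is false, which is the "half of the bank" observation in concrete form.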

The manual simply states this as a fact. My question is: what is the reason for this design? Why is only half of the register bank accessible in VFP's S0-S31 view?

[1]: i.stack.imgur.com/NGInP.png

  • Depending on which VFP version is implemented, the number of D registers is either 16 or 32:

    VFPv3-D16, VFPv4-D16:   16 D registers

    VFPv3-D32, VFPv4-D32:   32 D registers

    Armv7-A instructions are encoded in 16 or 32 bits, so the number of bits available to encode a register number is limited. Each VFP register operand gets a 5-bit register number, which is enough to address 32 S registers or 32 D registers. 64 S registers would need a sixth bit per operand, and that would probably require an instruction encoding different from the one used by the double-precision FPU instructions.
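To make the encoding argument concrete, here is a minimal sketch in C of how a 5-bit register number is assembled from the 4-bit `Vd` field and the extra `D` bit of a 32-bit (A32) VFP data-processing instruction. The function names are hypothetical, and the bit positions used (`D` at bit 22, `Vd` at bits 15:12) are those of the destination operand only; treat it as an illustration rather than a full decoder.

```c
#include <assert.h>
#include <stdint.h>

/* Single precision: the 4-bit field holds the upper bits, the extra bit is bit 0. */
static unsigned vfp_s_reg(unsigned vd, unsigned d_bit) {
    return ((vd & 0xFu) << 1) | (d_bit & 1u);
}

/* Double precision: the extra bit is the top bit, the 4-bit field the lower bits. */
static unsigned vfp_d_reg(unsigned vd, unsigned d_bit) {
    return ((d_bit & 1u) << 4) | (vd & 0xFu);
}

/* Extract the destination fields from an A32 VFP data-processing word. */
static unsigned vfp_dest_vd(uint32_t insn) {
    return (insn >> 12) & 0xFu; /* Vd, bits 15:12 */
}

static unsigned vfp_dest_d_bit(uint32_t insn) {
    return (insn >> 22) & 1u;   /* D, bit 22 */
}
```

Either way, 4 + 1 bits give at most 32 register numbers per operand, so the same five bits select S0-S31 in single precision and D0-D31 in double precision. Numbering 64 S registers would need one more bit for each of the three operands.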
