I need to optimize code that will run on an Arm Cortex-M55 with Helium/MVE and floating-point support. The algorithm is quite recursive in nature, so the compiler struggles to infer any vector operations. The code parallelizes across 4 channels, and the math operations are mostly multiply, add/sub, divide, and sqrt. The bottleneck is that I can run half of the operations on MVE, but the others (e.g. divide and sqrt) need to be performed on the scalar FPU. The question is: if I have my values stored in, for example, Q0, can I subsequently call FPU operations on the corresponding S registers (S0, S1, S2, S3)?
Take the following function (similar to what I am optimizing):
```c
float32_t update(float32_t old, float32_t a, float32_t b)
{
    float32_t tmpa = 1.0f - a * a;
    float32_t tmpb = 1.0f - b * b;
    return old * sqrtf(tmpa * tmpb) + a * b;
}
```
I rewrote it in straight asm as follows:
```c
void update_mve(float32_t *restrict old, float32_t *restrict a, float32_t *restrict b)
{
    __asm volatile(
        "vldrw.u32  q1, [%[a]]    \n"   /* q1 = a[0..3] */
        "vldrw.u32  q2, [%[b]]    \n"   /* q2 = b[0..3] */
        "vldrw.u32  q0, [%[old]]  \n"   /* q0 = old[0..3] */
        "vmov.f32   q7, #1.0      \n"
        "vmul.f32   q3, q1, q1    \n"   /* q3 = a*a */
        "vsub.f32   q3, q7, q3    \n"   /* q3 = 1 - a*a */
        "vmul.f32   q4, q2, q2    \n"   /* q4 = b*b */
        "vsub.f32   q4, q7, q4    \n"   /* q4 = 1 - b*b */
        "vmul.f32   q3, q3, q4    \n"   /* q3 = tmpa * tmpb */
        /* Scalar FPU sqrt on the S registers that (I assume) overlap q3 */
        "vsqrt.f32  s12, s12      \n"
        "vsqrt.f32  s13, s13      \n"
        "vsqrt.f32  s14, s14      \n"
        "vsqrt.f32  s15, s15      \n"
        "vmul.f32   q0, q0, q3    \n"   /* q0 = old * sqrt(tmpa*tmpb) */
        "vfma.f32   q0, q1, q2    \n"   /* q0 += a*b */
        "vstrw.32   q0, [%[old]]  \n"
        : [old] "+r" (old)
        : [a] "r" (a), [b] "r" (b)
        : "q0", "q1", "q2", "q3", "q4", "q7", "memory"
    );
}
```
Notice I am *assuming* that after the MVE operations on the Q registers, the overlapping S registers hold the same contents. The idea is to reduce the number of load/store/mov instructions between the FPU and MVE views of the register file, but I am not sure this assumption is correct.
Thanks