Dear All,
this is my first post and I hope I do not make any serious mistakes.
My question concerns the use of the Cortex-M7 VFMA/VMLA instructions.
I am evaluating a polynomial for which the C compiler emits VFMA.F32 instructions. Out of curiosity I implemented a VADD.F32+VMUL.F32 version of the same algorithm, which turned out to be faster in terms of CPU cycles (measured with the DWT cycle counter). To find out why the VADD+VMUL implementation is faster I did some assembly benchmarking, and it seems that any floating-point instruction paired with VFMA.F32 causes serious stalls in the pipeline.
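For readers who want to reproduce the two code shapes, here is a minimal sketch of such a polynomial evaluation, assuming Horner's scheme. The function name `poly_horner` and the coefficient layout are illustrative only and are not taken from the attached benchmark. As far as I understand GCC's defaults, in GNU mode (`-ffp-contract=fast`) the multiply-add in the loop may be emitted as VFMA.F32 on a Cortex-M7 with FPU, while building the same source with `-ffp-contract=off` should give separate VMUL.F32 and VADD.F32 instructions; please check the disassembly rather than rely on this.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) with Horner's scheme.
   Assumption: n >= 1. Name and signature are illustrative only. */
float poly_horner(const float *c, size_t n, float x)
{
    float acc = c[n - 1];

    /* Walk the remaining coefficients from c[n-2] down to c[0]. */
    for (size_t i = n - 1; i-- > 0; ) {
        /* Candidate for fusion: the compiler may emit this as one
           VFMA.F32, or as VMUL.F32 followed by VADD.F32, depending
           on the floating-point contraction setting. */
        acc = acc * x + c[i];
    }
    return acc;
}
```

Each Horner step depends on the previous one, so the latency of the multiply-add chain matters here, not only the throughput.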
The test cases I checked were as follows ('independent' in this context means that the instructions access independent registers, so that there are no pipeline stalls):
The code for case 8 is:
    .rept 50
    vfma.f32 s2, s1, s0
    vmov.32 s4, s3
    vfma.f32 s7, s6, s5
    vmov.32 s9, s8
    vfma.f32 s12, s11, s10
    vmov.32 s14, s13
    vfma.f32 s17, s16, s15
    vmov.32 s19, s18
    vfma.f32 s22, s21, s20
    vmov.32 s24, s23
    vfma.f32 s27, s26, s25
    vmov.32 s29, s28
    .endr
For case 9 it is (using [sp] might not have been the best idea):
    .rept 50
    vfma.f32 s2, s1, s0
    vldr.f32 s3, [sp]
    vfma.f32 s7, s6, s5
    vldr.f32 s8, [sp]
    vfma.f32 s12, s11, s10
    vldr.f32 s13, [sp]
    vfma.f32 s17, s16, s15
    vldr.f32 s18, [sp]
    vfma.f32 s22, s21, s20
    vldr.f32 s23, [sp]
    vfma.f32 s27, s26, s25
    vldr.f32 s28, [sp]
    .endr
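To turn two raw DWT cycle-counter readings into a per-instruction figure for blocks like the ones above, something along these lines could be used. This is only a sketch: the helper names and the `overhead` parameter (the cost of the measurement itself, for example obtained by timing an empty body) are assumptions and not part of the attached benchmark. The instruction count follows from the listings: 50 repetitions of 12 instructions.

```c
#include <stdint.h>

/* Instructions in one benchmark body: .rept 50 times 12 instructions
   (6 VFMA.F32 plus 6 VMOV or VLDR). */
#define BENCH_INSTRUCTIONS (50u * 12u)

/* Elapsed cycles between two reads of the 32-bit cycle counter.
   Unsigned subtraction stays correct across a single counter wrap. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cycles per instruction after removing the measurement
   overhead (hypothetical value supplied by the caller). */
static float cycles_per_instruction(uint32_t start, uint32_t end,
                                    uint32_t overhead)
{
    uint32_t net = elapsed_cycles(start, end) - overhead;
    return (float)net / (float)BENCH_INSTRUCTIONS;
}
```

On a dual-issue core, an average well below 1 cycle per instruction suggests the pairs are issuing together, while a value at or above 1 suggests they are not.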
Some ARM documentation for the Cortex-M7 suggests interleaving load/store instructions with other (math) instructions, but for VFMA this does not seem to help.
My question is: is this the expected behaviour of the VFMA instruction, or am I doing something wrong? Are there other floating-point instructions with the same behaviour (apart from VMLA.F32, which seems to behave the same way)?
(For the polynomial evaluation it seems that a group of VFMA.F32 instructions preceded by the loads and followed by the stores is somewhat faster than the VMUL.F32/VADD.F32 alternative. However, the processor appears to be able to execute load/store operations in parallel with VMUL.F32/VADD.F32, while this is not true for VFMA.F32, and the same seems to hold for VMLA.F32. GCC does not seem to be aware of this, so it interleaves VFMA with other instructions.)
I have also attached the complete benchmark file for reference. (The MCU is an STM32H7, with the code running from ITCM and the data stored in DTCM. I also checked the function locations in the map file.)
Thank you for your help!
Best regards,
GDzsudzsak