Cortex-M7 VFMA usage

Dear All,

this is my first post and I hope I do not make any serious mistakes.

My question concerns the use of the Cortex-M7 VFMA/VMLA instructions.

I am evaluating a polynomial for which the C compiler emits VFMA.F32 instructions. Out of curiosity I implemented a VADD.F32 + VMUL.F32 version of the same algorithm, which turned out to be faster in terms of CPU cycles. I used the DWT cycle counter to count the clock cycles. To find out why the VADD + VMUL implementation is faster, I did some assembly benchmarking, and it seems that any floating-point instruction interleaved with VFMA.F32 causes serious stalls in the pipeline.
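For illustration, here is a minimal sketch of the kind of polynomial evaluation described above, written in Horner form. The function name `horner_eval` and the coefficient layout are assumptions for this example, not the actual benchmark code. As far as I know, GCC in its default GNU mode may contract the `acc * x + c[i]` step into a single VFMA.F32 when the target FPU supports it, while building with `-ffp-contract=off` should keep a separate VMUL.F32 and VADD.F32, so the same source can be used to compare both variants.

```c
#include <stddef.h>

/* Hypothetical example: evaluates c[0] + c[1]*x + ... + c[n-1]*x^(n-1)
 * using Horner's scheme. Each loop step is a multiply followed by an add,
 * which the compiler may or may not fuse into one VFMA.F32 depending on
 * the floating-point contraction setting used for the build. */
static float horner_eval(const float *c, size_t n, float x)
{
    float acc;
    size_t i;

    if (n == 0) {
        return 0.0f;                /* empty polynomial */
    }
    acc = c[n - 1];                 /* start with the highest coefficient */
    for (i = n - 1; i-- > 0; ) {    /* i runs from n-2 down to 0 */
        acc = acc * x + c[i];
    }
    return acc;
}
```

Called with the coefficients {1, 2, 3} and x = 2, this evaluates 1 + 2*2 + 3*4. Checking the generated assembly (for example with `-S`) is the reliable way to see which instructions were actually emitted.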

The test cases I checked were the following ('independent' in the context below means that the instructions access independent registers, so there are no dependency stalls in the pipeline):

1. Independent VMOV instructions seem to execute in parallel, so 512 VMOV execute in ~256 cycles (I did this as a sanity check).
2. Independent VADD instructions execute at 1 instruction per cycle (this is what I expected).
3. Independent VMUL instructions execute at 1 instruction per cycle (again as expected).
4. Independent VMUL.F32 + VADD.F32 instructions interleaved execute at 1 instruction/cycle (so they do not execute in parallel).
5. Independent VFMA.F32 instructions execute at 1 instruction/cycle (no other instructions interleaved).
6. Independent VADD.F32 + VMOV instructions interleaved execute at 2 instructions/cycle (so the move happens in parallel with the VADD).
7. Independent VMUL.F32 + VMOV instructions interleaved execute at 2 instructions/cycle.
8. Independent VFMA.F32 + VMOV instructions interleaved execute at 0.5 instructions/cycle (= 2 cycles/instruction).
9. Independent VFMA.F32 + VLDR.F32 (load from DTCM) instructions interleaved execute at 0.5 instructions/cycle.
10. Pairwise dependent VLDR.F32 + VMUL.F32 (so the VMUL uses the result of the VLDR instruction) execute at 1 instruction/cycle, so the loaded data can be used in the next cycle.
11. Independent VFMA.F32 + VADD.F32 instructions interleaved seem to execute at about 2.5 cycles/instruction (I have no explanation for this; it could be some kind of measurement error, but 250 VFMA.F32 + 250 VADD.F32 interleaved executed in 1253 cycles in my testing...).
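For reference, cycle counts like the ones above can be taken with the DWT cycle counter roughly as sketched below. This is a minimal sketch and not the attached benchmark file: the helper names (`cycles_elapsed`, `cyccnt_enable`, `cyccnt_read`) are made up for this example, and the lock-access write is an assumption that may or may not be needed on a given part. The register addresses are the architecturally defined ARMv7-M ones. The hardware part is only compiled for ARMv7E-M targets, so the subtraction helper can also be checked on a host.

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned subtraction stays
 * correct across a single wraparound of the 32-bit counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#if defined(__ARM_ARCH_7EM__)
/* Architecturally defined ARMv7-M debug registers. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)
#define DWT_LAR    (*(volatile uint32_t *)0xE0001FB0u)

static inline void cyccnt_enable(void)
{
    DEMCR |= (1u << 24);      /* TRCENA: enable the DWT unit */
    DWT_LAR = 0xC5ACCE55u;    /* assumption: unlock, needed on some parts */
    DWT_CYCCNT = 0u;          /* reset the counter */
    DWT_CTRL |= 1u;           /* CYCCNTENA: start counting */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif
```

A measurement then reads the counter before and after the `.rept` block and passes both samples to `cycles_elapsed`; the overhead of the two reads themselves should be measured with an empty block and subtracted.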

The code for case 8 is:

.rept 50
    vfma.f32 s2, s1, s0
    vmov.32 s4, s3
    vfma.f32 s7, s6, s5
    vmov.32 s9, s8
    vfma.f32 s12, s11, s10
    vmov.32 s14, s13
    vfma.f32 s17, s16, s15
    vmov.32 s19, s18
    vfma.f32 s22, s21, s20
    vmov.32 s24, s23
    vfma.f32 s27, s26, s25
    vmov.32 s29, s28
.endr

For case 9 it is (using [sp] might not have been the best idea):

.rept 50
    vfma.f32 s2, s1, s0
    vldr.f32 s3, [sp]
    vfma.f32 s7, s6, s5
    vldr.f32 s8, [sp]
    vfma.f32 s12, s11, s10
    vldr.f32 s13, [sp]
    vfma.f32 s17, s16, s15
    vldr.f32 s18, [sp]
    vfma.f32 s22, s21, s20
    vldr.f32 s23, [sp]
    vfma.f32 s27, s26, s25
    vldr.f32 s28, [sp]
.endr

Some Arm documentation for the Cortex-M7 suggests interleaving load/store instructions with other (math) instructions, but for VFMA this does not seem to help.

My question is: is this the expected behaviour of the VFMA instruction, or am I doing something wrong? Are there other floating-point instructions with the same behaviour (apart from VMLA.F32, which seems to behave the same way)?

(For the polynomial evaluation, it seems that a group of VFMA.F32 instructions preceded by the loads and followed by the stores is somewhat faster than the VMUL.F32/VADD.F32 alternative. However, the processor appears to execute load/store operations in parallel with VMUL.F32/VADD.F32, while this is not true for VFMA.F32 or for VMLA.F32. GCC does not seem to be aware of this, so it interleaves VFMA with other instructions.)

I also attached the complete benchmark file just for information. (The MCU is an STM32H7, code running from ITCM, data stored in DTCM. I also checked the function locations in the map file.)

Thank you for your help!

Best regards,

GDzsudzsak