Cortex-M7 VFMA usage

Dear All,

this is my first post and I hope I do not make any serious mistakes.

My question concerns the use of the Cortex-M7 VFMA/VMLA instructions.

I am evaluating a polynomial for which the C compiler emits VFMA.F32 instructions. Out of curiosity I implemented a VADD.F32 + VMUL.F32 version of the same algorithm, which seemed to be faster in terms of CPU cycles (counted with the DWT cycle counter). To find out why the VADD + VMUL implementation is faster, I did some assembly benchmarking, and it seems that pairing any other floating-point instruction with VFMA.F32 causes serious pipeline stalls.
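For reference, a minimal sketch of the kind of C source that produces this pattern, assuming Horner's scheme. The actual source is not shown in this post, so `poly_horner` and its parameters are hypothetical names. With GCC, the multiply-add in the loop may be contracted into VFMA.F32 under `-ffp-contract=fast`, while `-ffp-contract=off` keeps separate VMUL.F32 and VADD.F32 instructions, which is one way to compare both variants from the same source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not the code from the post.
 * Evaluates coeffs[0] + coeffs[1]*x + ... + coeffs[count-1]*x^(count-1)
 * using Horner's scheme. Each step is a multiply followed by an add,
 * which the compiler is allowed to fuse into one VFMA.F32 when floating
 * point contraction is enabled. */
float poly_horner(const float *coeffs, size_t count, float x)
{
    assert(coeffs != NULL && count > 0);

    float acc = coeffs[count - 1];
    /* Walk the remaining coefficients from the highest power down to 0. */
    for (size_t i = count - 1; i-- > 0; ) {
        acc = acc * x + coeffs[i];
    }
    return acc;
}
```

Each step depends on the previous `acc`, so a single evaluation forms one dependent chain of multiply-adds.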

The test cases I checked were the following ('independent' below means that the instructions use independent registers, so there are no pipeline stalls caused by data dependencies):

1. Independent VMOV instructions seem to execute in parallel: 512 VMOVs execute in ~256 cycles (I did this as a sanity check).
2. Independent VADD.F32 instructions execute at 1 instruction/cycle (this is what I expected).
3. Independent VMUL.F32 instructions execute at 1 instruction/cycle (again as expected).
4. Independent VMUL.F32 + VADD.F32 instructions interleaved execute at 1 instruction/cycle (so they do not execute in parallel).
5. Independent VFMA.F32 instructions execute at 1 instruction/cycle (no other instructions interleaved).
6. Independent VADD.F32 + VMOV instructions interleaved execute at 2 instructions/cycle (so the move happens in parallel with the VADD).
7. Independent VMUL.F32 + VMOV instructions interleaved execute at 2 instructions/cycle.
8. Independent VFMA.F32 + VMOV instructions interleaved execute at 0.5 instructions/cycle (= 2 cycles/instruction).
9. Independent VFMA.F32 + VLDR.F32 (load from DTCM) instructions interleaved execute at 0.5 instructions/cycle.
10. Pairwise dependent VLDR.F32 + VMUL.F32 (the VMUL uses the result of the VLDR) execute at 1 instruction/cycle, so the loaded data can be used in the next cycle.
11. Independent VFMA.F32 + VADD.F32 instructions interleaved seem to execute at 2.5 cycles/instruction (I have no explanation for this; it could be some kind of measurement error, but 250 VFMA.F32 + 250 VADD.F32 interleaved executed in 1253 cycles in my testing).
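The throughput figures in this list come from dividing the measured cycle count by the number of instructions in the block. A small sketch of that arithmetic, with `cycles_per_instruction` as a hypothetical helper name that does not appear in the benchmark file:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average cycles spent per instruction in a
 * measured block. Values below 1.0 indicate dual issue. */
double cycles_per_instruction(uint32_t cycles, uint32_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}
```

With the numbers quoted in the post, 1253 cycles for 500 instructions gives roughly 2.5 cycles/instruction, and 257 cycles for 512 VMOVs gives roughly 0.5.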

The code for case 8 is:

.rept 50
	vfma.f32 s2, s1, s0
	vmov.32 s4, s3
	vfma.f32 s7, s6, s5
	vmov.32 s9, s8
	vfma.f32 s12, s11, s10
	vmov.32 s14, s13
	vfma.f32 s17, s16, s15
	vmov.32 s19, s18
	vfma.f32 s22, s21, s20
	vmov.32 s24, s23
	vfma.f32 s27, s26, s25
	vmov.32 s29, s28
.endr

For case 9 it is (using [sp] might not have been the best idea):

.rept 50
	vfma.f32 s2, s1, s0
	vldr.f32 s3, [sp]
	vfma.f32 s7, s6, s5
	vldr.f32 s8, [sp]
	vfma.f32 s12, s11, s10
	vldr.f32 s13, [sp]
	vfma.f32 s17, s16, s15
	vldr.f32 s18, [sp]
	vfma.f32 s22, s21, s20
	vldr.f32 s23, [sp]
	vfma.f32 s27, s26, s25
	vldr.f32 s28, [sp]
.endr

Some Arm documentation for the Cortex-M7 suggests interleaving load/store instructions with other (arithmetic) instructions, but for VFMA this does not seem to help.

My question is: is this the expected behaviour of the VFMA instruction, or am I doing something wrong? Are there other floating-point instructions with the same behaviour (apart from VMLA.F32, which seems to behave the same way)?

(For the polynomial evaluation, it seems that a group of VFMA.F32 instructions preceded by the loads and followed by the stores is somewhat faster than the VMUL.F32/VADD.F32 alternative. However, the processor appears able to execute load/store operations in parallel with VMUL.F32/VADD.F32, which is not true for VFMA.F32 or, it seems, for VMLA.F32. GCC does not seem to be aware of this, so it interleaves VFMA with other instructions.)
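The load / VFMA group / store arrangement described here could be expressed in C roughly as follows. This is only a sketch under assumptions: a cubic polynomial, four inputs per call, and the hypothetical name `cubic_eval4`. The compiler remains free to reschedule the loads and stores, so the generated assembly has to be checked to confirm the grouping survives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: evaluate c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3
 * for four inputs at once, with loads, arithmetic and stores kept in
 * separate phases. */
void cubic_eval4(const float c[4], const float in[4], float out[4])
{
    assert(c != NULL && in != NULL && out != NULL);

    /* Load phase: bring all operands into locals (registers) first. */
    const float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    const float x0 = in[0], x1 = in[1], x2 = in[2], x3 = in[3];

    /* Arithmetic phase: four independent Horner chains, advanced one
     * step at a time so consecutive multiply-adds do not depend on
     * each other. */
    float a0 = c3 * x0 + c2;
    float a1 = c3 * x1 + c2;
    float a2 = c3 * x2 + c2;
    float a3 = c3 * x3 + c2;

    a0 = a0 * x0 + c1;
    a1 = a1 * x1 + c1;
    a2 = a2 * x2 + c1;
    a3 = a3 * x3 + c1;

    a0 = a0 * x0 + c0;
    a1 = a1 * x1 + c0;
    a2 = a2 * x2 + c0;
    a3 = a3 * x3 + c0;

    /* Store phase: write all results after the arithmetic is done. */
    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
}
```

Processing four inputs per call also gives each multiply-add three independent neighbours, which should cover the 3-cycle dependent VFMA.F32 timing measured further down in this thread.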

I also attached the complete benchmark file just for information. (The MCU is an STM32H7, code running from ITCM, data stored in DTCM. I also checked the function locations in the map file.)

Thank you for your help!

Best regards,

GDzsudzsak

GDzsudzsak, over 6 years ago

Since the file upload did not work, I am posting some of the functions inline:

Benchmark overhead calculation:

// uint32_t benchmarkOverheadASM(void)
// measure the benchmark harness overhead and return it as uint32_t in CPU cycles
.align 3
.global benchmarkOverheadASM
.section  .itcmram.benchmarkOverheadASM
.type  benchmarkOverheadASM, %function
benchmarkOverheadASM:
	//vpush {s0-s15}
	//vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b benchmarkOverheadASM1
.align 3
.type  benchmarkOverheadASM1, %function
benchmarkOverheadASM1:
	isb
	ldr r0, [r3]
	sub r0, r0, r1
	//vpop {s16-s31}
	//vpop {s0-s15}
	bx lr
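The benchmark routines in this post read the DWT cycle counter (DWT_CYCCNT at 0xE0001004) before and after the measured block, subtract the two readings, and then subtract the overhead passed in r0. A sketch of the same arithmetic in C, with `cycles_elapsed` and `cycles_net` as hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cycles between two DWT_CYCCNT readings.
 * Unsigned subtraction is modulo 2^32, so the result is still correct
 * if the counter wrapped once between the two reads. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: elapsed cycles minus the harness overhead. */
uint32_t cycles_net(uint32_t start, uint32_t end, uint32_t overhead)
{
    const uint32_t raw = cycles_elapsed(start, end);
    /* A block shorter than the harness overhead would point to a
     * measurement setup problem. */
    assert(raw >= overhead);
    return raw - overhead;
}
```

`cycles_net(start, end, overhead)` mirrors the `sub r2, r2, r1` / `sub r0, r2, r0` pair at the end of each benchmark routine.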

Independent VMOV instructions benchmark:

// uint32_t floatMoveOnlyASM(uint32_t overhead)
// vmov instructions, totally independent
// 32 * 16 instructions = 512 instructions
// benchmark 257 cycles --> two instructions in parallel
.align 3
.global floatMoveOnlyASM
.section  .itcmram.floatMoveOnlyASM
.type  floatMoveOnlyASM, %function
floatMoveOnlyASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatMoveOnlyASM1
.align 3
.type  floatMoveOnlyASM1, %function
floatMoveOnlyASM1:
.rept 32
	vmov s1, s0
	vmov s17, s16
	vmov s3, s2
	vmov s19, s18
	vmov s5, s4
	vmov s21, s20
	vmov s7, s6
	vmov s23, s22
	vmov s9, s8
	vmov s25, s24
	vmov s11, s10
	vmov s27, s26
	vmov s13, s12
	vmov s29, s28
	vmov s15, s14
	vmov s31, s30
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

Independent VADD instructions:

// uint32_t floatAddOnlyASM(uint32_t overhead)
// vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 clock/instruction)
.align 3
.global floatAddOnlyASM
.section  .itcmram.floatAddOnlyASM
.type  floatAddOnlyASM, %function
floatAddOnlyASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatAddOnlyASM1
.align 3
.type  floatAddOnlyASM1, %function
floatAddOnlyASM1:
.rept 50
	vadd.f32 s2, s1, s0
	vadd.f32 s5, s4, s3
	vadd.f32 s8, s7, s6
	vadd.f32 s11, s10, s9
	vadd.f32 s14, s13, s12
	vadd.f32 s17, s16, s15
	vadd.f32 s20, s19, s18
	vadd.f32 s23, s22, s21
	vadd.f32 s26, s25, s24
	vadd.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

Independent VMUL instructions:

// uint32_t floatMulOnlyASM(uint32_t overhead)
// vmul.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 501 cycles --> no parallelism (1 clock/instruction)
.align 3
.global floatMulOnlyASM
.section  .itcmram.floatMulOnlyASM
.type  floatMulOnlyASM, %function
floatMulOnlyASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatMulOnlyASM1
.align 3
.type  floatMulOnlyASM1, %function
floatMulOnlyASM1:
.rept 50
	vmul.f32 s2, s1, s0
	vmul.f32 s5, s4, s3
	vmul.f32 s8, s7, s6
	vmul.f32 s11, s10, s9
	vmul.f32 s14, s13, s12
	vmul.f32 s17, s16, s15
	vmul.f32 s20, s19, s18
	vmul.f32 s23, s22, s21
	vmul.f32 s26, s25, s24
	vmul.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

VADD and VMUL interleaved:

// uint32_t floatMulAddASM(uint32_t overhead)
// vmul.f32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 instruction/clock)
.align 3
.global floatMulAddASM
.section  .itcmram.floatMulAddASM
.type  floatMulAddASM, %function
floatMulAddASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatMulAddASM1
.align 3
.type  floatMulAddASM1, %function
floatMulAddASM1:
.rept 50
	vmul.f32 s2, s1, s0
	vadd.f32 s5, s4, s3
	vmul.f32 s8, s7, s6
	vadd.f32 s11, s10, s9
	vmul.f32 s14, s13, s12
	vadd.f32 s17, s16, s15
	vmul.f32 s20, s19, s18
	vadd.f32 s23, s22, s21
	vmul.f32 s26, s25, s24
	vadd.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

Independent VFMA instructions:

// uint32_t floatFMAOnlyASM(uint32_t overhead)
// vfma.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 clock/instruction)
.align 3
.global floatFMAOnlyASM
.section  .itcmram.floatFMAOnlyASM
.type  floatFMAOnlyASM, %function
floatFMAOnlyASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatFMAOnlyASM1
.align 3
.type  floatFMAOnlyASM1, %function
floatFMAOnlyASM1:
.rept 50
	vfma.f32 s2, s1, s0
	vfma.f32 s5, s4, s3
	vfma.f32 s8, s7, s6
	vfma.f32 s11, s10, s9
	vfma.f32 s14, s13, s12
	vfma.f32 s17, s16, s15
	vfma.f32 s20, s19, s18
	vfma.f32 s23, s22, s21
	vfma.f32 s26, s25, s24
	vfma.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

VFMA accumulating into the same register:

// uint32_t floatFMAOnlyDepASM(uint32_t overhead)
// vfma.f32 instructions, accumulating into the same register
// 50 * 10 instructions
// benchmark 1501 cycles --> dependency stalls, 2 stall cycles per instruction (3 clocks/instruction)
.align 3
.global floatFMAOnlyDepASM
.section  .itcmram.floatFMAOnlyDepASM
.type  floatFMAOnlyDepASM, %function
floatFMAOnlyDepASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatFMAOnlyDepASM1
.align 3
.type  floatFMAOnlyDepASM1, %function
floatFMAOnlyDepASM1:
.rept 50
	vfma.f32 s31, s1, s0
	vfma.f32 s31, s4, s3
	vfma.f32 s31, s7, s6
	vfma.f32 s31, s10, s9
	vfma.f32 s31, s13, s12
	vfma.f32 s31, s16, s15
	vfma.f32 s31, s19, s18
	vfma.f32 s31, s22, s21
	vfma.f32 s31, s25, s24
	vfma.f32 s31, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr
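The 3 clocks/instruction measured here come from every VFMA.F32 waiting for the previous result in s31. A common way to avoid this kind of stall in C is to split the sum over several independent accumulators. The sketch below assumes a dot product as the workload and uses the hypothetical name `dot4`. It changes the summation order, so rounding can differ slightly from a single-accumulator loop, and whether the loads inside the loop trigger the VFMA + VLDR slowdown described earlier in this thread would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: dot product with four independent accumulators,
 * so consecutive multiply-adds do not depend on each other. */
float dot4(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: up to three leftover elements. */
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }

    /* Combine the partial sums at the end. */
    return (s0 + s1) + (s2 + s3);
}
```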

Interleaved VFMA and VADD instructions:

// uint32_t floatFMAAddASM(uint32_t overhead)
// vfma.f32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 1253 cycles --> pipeline stall, 2.5 clocks/instruction
.align 3
.global floatFMAAddASM
.section  .itcmram.floatFMAAddASM
.type  floatFMAAddASM, %function
floatFMAAddASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatFMAAddASM1
.align 3
.type  floatFMAAddASM1, %function
floatFMAAddASM1:
.rept 50
	vfma.f32 s2, s1, s0
	vadd.f32 s5, s4, s3
	vfma.f32 s8, s7, s6
	vadd.f32 s11, s10, s9
	vfma.f32 s14, s13, s12
	vadd.f32 s17, s16, s15
	vfma.f32 s20, s19, s18
	vadd.f32 s23, s22, s21
	vfma.f32 s26, s25, s24
	vadd.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

Interleaved VMOV and VADD instructions:

// uint32_t floatMovAddASM(uint32_t overhead)
// vmov.32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 251 cycles --> can run in parallel (2 instructions/clock)
.align 3
.global floatMovAddASM
.section  .itcmram.floatMovAddASM
.type  floatMovAddASM, %function
floatMovAddASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatMovAddASM1
.align 3
.type  floatMovAddASM1, %function
floatMovAddASM1:
.rept 50
	vmov.32 s1, s0
	vadd.f32 s5, s4, s3
	vmov.32 s7, s6
	vadd.f32 s11, s10, s9
	vmov.32 s13, s12
	vadd.f32 s17, s16, s15
	vmov.32 s19, s18
	vadd.f32 s23, s22, s21
	vmov.32 s25, s24
	vadd.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr

Interleaved VMOV and VMUL instructions:

// uint32_t floatMovMulASM(uint32_t overhead)
// vmov.32 and vmul.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 253 cycles --> can run in parallel, 2 instructions/clock
.align 3
.global floatMovMulASM
.section  .itcmram.floatMovMulASM
.type  floatMovMulASM, %function
floatMovMulASM:
	vpush {s0-s15}
	vpush {s16-s31}
	dsb
	ldr r3, =0xE0001004
	dsb
	isb
	ldr r1, [r3]
	isb
	b floatMovMulASM1
.align 3
.type  floatMovMulASM1, %function
floatMovMulASM1:
.rept 50
	vmov.32 s1, s0
	vmul.f32 s5, s4, s3
	vmov.32 s7, s6
	vmul.f32 s11, s10, s9
	vmov.32 s13, s12
	vmul.f32 s17, s16, s15
	vmov.32 s19, s18
	vmul.f32 s23, s22, s21
	vmov.32 s25, s24
	vmul.f32 s29, s28, s27
.endr
	isb
	ldr r2, [r3]
	sub r2, r2, r1
	sub r0, r2, r0
	vpop {s16-s31}
	vpop {s0-s15}
	bx lr
