Dear All,
this is my first post and I hope I do not make any serious mistakes.
My question concerns the use of the Cortex-M7 VFMA/VMLA instructions.
I am evaluating a polynomial for which the C compiler emits VFMA.F32 instructions. Out of curiosity I implemented a VADD.F32+VMUL.F32 version of the same algorithm, which seemed to be faster in terms of CPU cycles (measured with the DWT cycle counter). To find out why the VADD+VMUL implementation is faster, I did some assembly benchmarking, and it seems that any floating-point instruction combined with VFMA.F32 causes serious stalls in the pipeline.
The test cases I checked were ('independent' in the context below means that the instructions access independent registers, so that there are no pipeline stalls due to data dependencies):
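To make the comparison concrete, here is a minimal C sketch of a Horner-style polynomial evaluation. It is an illustration only, not the original code: the name `poly_horner` and the coefficient layout are assumptions. With GCC's default `-ffp-contract=fast` (GNU dialects) and an FPU that supports fused multiply-add, the multiply-add in each step may be contracted into VFMA.F32, while `-ffp-contract=off` should keep a separate VMUL.F32 and VADD.F32.

```c
#include <stddef.h>

/* Hypothetical helper: evaluates c[0] + c[1]*x + ... + c[n-1]*x^(n-1)
   with Horner's rule. Requires n >= 1. */
float poly_horner(const float *c, size_t n, float x)
{
    float acc = c[n - 1];

    /* Each step needs the previous value of acc, so the steps form a
       single dependency chain. */
    for (size_t i = n - 1; i-- > 0; ) {
        /* Candidate for one VFMA.F32, or for VMUL.F32 followed by
           VADD.F32, depending on the compiler's contraction setting. */
        acc = acc * x + c[i];
    }
    return acc;
}
```

For example, with the coefficients {1, 2, 3} and x = 2 the function returns 17.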
The code for case 8 is:
.rept 50
    vfma.f32 s2, s1, s0
    vmov.32 s4, s3
    vfma.f32 s7, s6, s5
    vmov.32 s9, s8
    vfma.f32 s12, s11, s10
    vmov.32 s14, s13
    vfma.f32 s17, s16, s15
    vmov.32 s19, s18
    vfma.f32 s22, s21, s20
    vmov.32 s24, s23
    vfma.f32 s27, s26, s25
    vmov.32 s29, s28
.endr
For case 9 it is (using [sp] might not have been the best idea):
.rept 50
    vfma.f32 s2, s1, s0
    vldr.f32 s3, [sp]
    vfma.f32 s7, s6, s5
    vldr.f32 s8, [sp]
    vfma.f32 s12, s11, s10
    vldr.f32 s13, [sp]
    vfma.f32 s17, s16, s15
    vldr.f32 s18, [sp]
    vfma.f32 s22, s21, s20
    vldr.f32 s23, [sp]
    vfma.f32 s27, s26, s25
    vldr.f32 s28, [sp]
.endr
Some ARM documentation for the Cortex-M7 suggests interleaving load/store instructions with other (math) instructions, but for VFMA this does not seem to help.
My question is: is this the expected behaviour of the VFMA instruction, or am I doing something wrong? Are there other floating-point instructions with the same behaviour (apart from VMLA.F32, which seems to behave the same way)?
(For the polynomial evaluation, it seems that a group of VFMA.F32 instructions preceded by the loads and followed by the stores is somewhat faster than the VMUL.F32/VADD.F32 alternative. However, the processor appears to be able to execute load/store operations in parallel with VMUL.F32/VADD.F32, while this is not true for VFMA.F32, and VMLA.F32 seems to behave the same way. GCC does not seem to be aware of this, so it interleaves VFMA with other instructions.)
I also attached the complete benchmark file just for information. (The MCU is an STM32H7, code running from ITCM, data stored in DTCM. I also checked the function locations in the map file.)
Thank you for your help!
Best regards,
GDzsudzsak
Since the file upload did not work, I attach some of the functions as code:
Benchmark overhead calculation:
// uint32_t benchmarkOverhead(void)
// calculate benchmark overhead and return as uint32_t in CPU cycles
    .align 3
    .global benchmarkOverheadASM
    .section .itcmram.benchmarkOverheadASM
    .type benchmarkOverheadASM, %function
benchmarkOverheadASM:
    //vpush {s0-s15}
    //vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b benchmarkOverheadASM1
    .align 3
    .type benchmarkOverheadASM1, %function
benchmarkOverheadASM1:
    isb
    ldr r0, [r3]
    sub r0, r0, r1
    //vpop {s16-s31}
    //vpop {s0-s15}
    bx lr
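The arithmetic that every benchmark function applies to the counter values can be sketched in C as follows. This is a host-testable illustration under assumptions, not part of the benchmark file: `read_cycles` and `net_cycles` are hypothetical names, and on the target the pointer passed to `read_cycles` would be `(volatile uint32_t *)0xE0001004`, the cycle counter address loaded into r3 in the assembly.

```c
#include <stdint.h>

/* Hypothetical helper: read a free-running 32-bit cycle counter through
   a pointer, so the same code can be exercised with a plain variable. */
static inline uint32_t read_cycles(const volatile uint32_t *cyccnt)
{
    return *cyccnt;
}

/* Hypothetical helper: net cycle count of a measured region.
   Mirrors "sub r2, r2, r1" followed by "sub r0, r2, r0" in the assembly.
   Unsigned subtraction is defined modulo 2^32, so a single wraparound of
   the counter between the two reads does not corrupt the result. */
uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    return (end - start) - overhead;
}
```

The overhead argument corresponds to the value returned by the overhead function above, which is passed to each benchmark in r0.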
Independent VMOV instructions benchmark:
// uint32_t floatMoveOnly(uint32_t overhead)
// vmov instructions, totally independent
// 32 * 16 instructions = 512 instructions
// benchmark 257 cycles --> two instructions in parallel
    .align 3
    .global floatMoveOnlyASM
    .section .itcmram.floatMoveOnlyASM
    .type floatMoveOnlyASM, %function
floatMoveOnlyASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatMoveOnlyASM1
    .align 3
    .type floatMoveOnlyASM1, %function
floatMoveOnlyASM1:
    .rept 32
    vmov s1, s0
    vmov s17, s16
    vmov s3, s2
    vmov s19, s18
    vmov s5, s4
    vmov s21, s20
    vmov s7, s6
    vmov s23, s22
    vmov s9, s8
    vmov s25, s24
    vmov s11, s10
    vmov s27, s26
    vmov s13, s12
    vmov s29, s28
    vmov s15, s14
    vmov s31, s30
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Independent VADD instructions:
// uint32_t floatAddOnlyASM(uint32_t overhead)
// vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 clock/instruction)
    .align 3
    .global floatAddOnlyASM
    .section .itcmram.floatAddOnlyASM
    .type floatAddOnlyASM, %function
floatAddOnlyASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatAddOnlyASM1
    .align 3
    .type floatAddOnlyASM1, %function
floatAddOnlyASM1:
    .rept 50
    vadd.f32 s2, s1, s0
    vadd.f32 s5, s4, s3
    vadd.f32 s8, s7, s6
    vadd.f32 s11, s10, s9
    vadd.f32 s14, s13, s12
    vadd.f32 s17, s16, s15
    vadd.f32 s20, s19, s18
    vadd.f32 s23, s22, s21
    vadd.f32 s26, s25, s24
    vadd.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Independent VMUL instructions:
// uint32_t floatMulOnlyASM(uint32_t overhead)
// vmul.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 501 cycles --> no parallelism (1 clock/instruction)
    .align 3
    .global floatMulOnlyASM
    .section .itcmram.floatMulOnlyASM
    .type floatMulOnlyASM, %function
floatMulOnlyASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatMulOnlyASM1
    .align 3
    .type floatMulOnlyASM1, %function
floatMulOnlyASM1:
    .rept 50
    vmul.f32 s2, s1, s0
    vmul.f32 s5, s4, s3
    vmul.f32 s8, s7, s6
    vmul.f32 s11, s10, s9
    vmul.f32 s14, s13, s12
    vmul.f32 s17, s16, s15
    vmul.f32 s20, s19, s18
    vmul.f32 s23, s22, s21
    vmul.f32 s26, s25, s24
    vmul.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
VADD and VMUL interleaved:
// uint32_t floatMulAddASM(uint32_t overhead)
// vmul.f32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 clock/instruction)
    .align 3
    .global floatMulAddASM
    .section .itcmram.floatMulAddASM
    .type floatMulAddASM, %function
floatMulAddASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatMulAddASM1
    .align 3
    .type floatMulAddASM1, %function
floatMulAddASM1:
    .rept 50
    vmul.f32 s2, s1, s0
    vadd.f32 s5, s4, s3
    vmul.f32 s8, s7, s6
    vadd.f32 s11, s10, s9
    vmul.f32 s14, s13, s12
    vadd.f32 s17, s16, s15
    vmul.f32 s20, s19, s18
    vadd.f32 s23, s22, s21
    vmul.f32 s26, s25, s24
    vadd.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Independent VFMA instructions:
// uint32_t floatFMAOnlyASM(uint32_t overhead)
// vfma.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 503 cycles --> no parallelism (1 clock/instruction)
    .align 3
    .global floatFMAOnlyASM
    .section .itcmram.floatFMAOnlyASM
    .type floatFMAOnlyASM, %function
floatFMAOnlyASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatFMAOnlyASM1
    .align 3
    .type floatFMAOnlyASM1, %function
floatFMAOnlyASM1:
    .rept 50
    vfma.f32 s2, s1, s0
    vfma.f32 s5, s4, s3
    vfma.f32 s8, s7, s6
    vfma.f32 s11, s10, s9
    vfma.f32 s14, s13, s12
    vfma.f32 s17, s16, s15
    vfma.f32 s20, s19, s18
    vfma.f32 s23, s22, s21
    vfma.f32 s26, s25, s24
    vfma.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
VFMA accumulating into the same register:
// uint32_t floatFMAOnlyDepASM(uint32_t overhead)
// vfma.f32 instructions, accumulating into the same register
// 50 * 10 instructions
// benchmark 1501 cycles --> 2 stall cycles per instruction (3 clocks/instruction)
    .align 3
    .global floatFMAOnlyDepASM
    .section .itcmram.floatFMAOnlyDepASM
    .type floatFMAOnlyDepASM, %function
floatFMAOnlyDepASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatFMAOnlyDepASM1
    .align 3
    .type floatFMAOnlyDepASM1, %function
floatFMAOnlyDepASM1:
    .rept 50
    vfma.f32 s31, s1, s0
    vfma.f32 s31, s4, s3
    vfma.f32 s31, s7, s6
    vfma.f32 s31, s10, s9
    vfma.f32 s31, s13, s12
    vfma.f32 s31, s16, s15
    vfma.f32 s31, s19, s18
    vfma.f32 s31, s22, s21
    vfma.f32 s31, s25, s24
    vfma.f32 s31, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Interleaved VFMA and VADD instructions:
// uint32_t floatFMAAddASM(uint32_t overhead)
// vfma.f32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 1253 cycles --> pipeline stall, 2.5 clocks/instruction
    .align 3
    .global floatFMAAddASM
    .section .itcmram.floatFMAAddASM
    .type floatFMAAddASM, %function
floatFMAAddASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatFMAAddASM1
    .align 3
    .type floatFMAAddASM1, %function
floatFMAAddASM1:
    .rept 50
    vfma.f32 s2, s1, s0
    vadd.f32 s5, s4, s3
    vfma.f32 s8, s7, s6
    vadd.f32 s11, s10, s9
    vfma.f32 s14, s13, s12
    vadd.f32 s17, s16, s15
    vfma.f32 s20, s19, s18
    vadd.f32 s23, s22, s21
    vfma.f32 s26, s25, s24
    vadd.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Interleaved VMOV and VADD instructions:
// uint32_t floatMovAddASM(uint32_t overhead)
// vmov.32 and vadd.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 251 cycles --> can run in parallel (2 instructions/clock)
    .align 3
    .global floatMovAddASM
    .section .itcmram.floatMovAddASM
    .type floatMovAddASM, %function
floatMovAddASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatMovAddASM1
    .align 3
    .type floatMovAddASM1, %function
floatMovAddASM1:
    .rept 50
    vmov.32 s1, s0
    vadd.f32 s5, s4, s3
    vmov.32 s7, s6
    vadd.f32 s11, s10, s9
    vmov.32 s13, s12
    vadd.f32 s17, s16, s15
    vmov.32 s19, s18
    vadd.f32 s23, s22, s21
    vmov.32 s25, s24
    vadd.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
Interleaved VMOV and VMUL instructions:
// uint32_t floatMovMulASM(uint32_t overhead)
// vmov.32 and vmul.f32 instructions, totally independent
// 50 * 10 instructions
// benchmark 253 cycles --> can run in parallel (2 instructions/clock)
    .align 3
    .global floatMovMulASM
    .section .itcmram.floatMovMulASM
    .type floatMovMulASM, %function
floatMovMulASM:
    vpush {s0-s15}
    vpush {s16-s31}
    dsb
    ldr r3, =0xE0001004
    dsb
    isb
    ldr r1, [r3]
    isb
    b floatMovMulASM1
    .align 3
    .type floatMovMulASM1, %function
floatMovMulASM1:
    .rept 50
    vmov.32 s1, s0
    vmul.f32 s5, s4, s3
    vmov.32 s7, s6
    vmul.f32 s11, s10, s9
    vmov.32 s13, s12
    vmul.f32 s17, s16, s15
    vmov.32 s19, s18
    vmul.f32 s23, s22, s21
    vmov.32 s25, s24
    vmul.f32 s29, s28, s27
    .endr
    isb
    ldr r2, [r3]
    sub r2, r2, r1
    sub r0, r2, r0
    vpop {s16-s31}
    vpop {s0-s15}
    bx lr
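To compare the listings at a glance, the cycle counts quoted in the benchmark comments can be converted into clocks per instruction. The following C sketch does only that arithmetic. The names `bench_result`, `results` and `cycles_per_instruction` are hypothetical, and the numbers are copied from the comments above rather than measured again.

```c
#include <stdint.h>

/* Hypothetical record for one benchmark listing. */
struct bench_result {
    const char *name;
    uint32_t instructions;  /* instructions inside the .rept block */
    uint32_t cycles;        /* net cycle count from the comment */
};

/* Values taken from the comments of the listings above. */
static const struct bench_result results[] = {
    { "vmov only",              512u,  257u },
    { "vadd only",              500u,  503u },
    { "vmul only",              500u,  501u },
    { "vmul + vadd",            500u,  503u },
    { "vfma only, independent", 500u,  503u },
    { "vfma, same accumulator", 500u, 1501u },
    { "vfma + vadd",            500u, 1253u },
    { "vmov + vadd",            500u,  251u },
    { "vmov + vmul",            500u,  253u },
};

/* Average clocks per instruction for one listing. */
double cycles_per_instruction(const struct bench_result *r)
{
    return (double)r->cycles / (double)r->instructions;
}
```

With these numbers the dependent VFMA chain comes out at about 3 clocks per instruction and the interleaved VFMA/VADD case at about 2.5, while the VMOV pairings stay close to 0.5.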