VFMA instruction timings on ARM Cortex-M4

Good morning all,

I am profiling some application code and am currently dealing with a timing-related issue.

1. I would like to understand how the pipeline handles the vfma instruction. The official documentation says vfma takes 3 cycles. For the following piece of assembly code, my performance counter reports 9 cycles of execution time.

8000730: vfma.f32 s11, s7, s15
8000734: vfma.f32 s12, s15, s8
8000738: vfma.f32 s13, s15, s9
800073c: vfma.f32 s14, s15, s10
8000740: bne.n 800071a <dualClass_svmPredict_unroll4+0x6a>

My guess is that, due to pipelining, the per-instruction timings are as follows:

vfma.f32 s11, s7, s15      3 cycles
vfma.f32 s12, s15, s8      2 cycles
vfma.f32 s13, s15, s9      2 cycles
vfma.f32 s14, s15, s10     2 cycles

Am I right?
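
To make the guess concrete, here is a minimal sketch in C of the arithmetic behind it. It is only a model of my hypothesis, not a description of how the Cortex-M4 pipeline actually behaves: it assumes the first vfma in a back-to-back run costs 3 cycles and each following one costs 2, and it assumes the 9 measured cycles cover only the four vfma instructions and not the branch. The function name and parameters are made up for illustration.

```c
#include <assert.h>  /* only needed for the usage check shown below */

/* Hypothetical model: total cycles for n back-to-back vfma instructions,
 * assuming the first one costs `first` cycles and every following one
 * costs `next` cycles. These costs are a guess, not documented figures. */
unsigned vfma_sequence_cycles(unsigned n, unsigned first, unsigned next)
{
    if (n == 0) {
        return 0;  /* no instructions, no cycles */
    }
    return first + (n - 1) * next;
}
```

With these assumptions, `assert(vfma_sequence_cycles(4, 3, 2) == 9);` holds, which matches the counter reading. A fully pipelined unit issuing one vfma per cycle would instead give `vfma_sequence_cycles(4, 3, 1) == 6`, which does not match what I measured.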

2. If the code is the following:

800073c: vfma.f32 s14, s15, s10
8000740: bne.n 800071a <dualClass_svmPredict_unroll4+0x6a>

Does the vfma then take 3 cycles to execute?

3. Why does a pipelined vfma take 2 cycles and not 1?

Thanks,