Good morning everyone,
I am profiling an application and am currently dealing with a timing-related issue.
1. I would like to understand how the pipeline handles the vfma instruction. The official documentation says vfma takes 3 cycles. For the following piece of assembly, my performance counter reports 9 cycles of execution time.
8000730: vfma.f32 s11, s7, s15
8000734: vfma.f32 s12, s15, s8
8000738: vfma.f32 s13, s15, s9
800073c: vfma.f32 s14, s15, s10
8000740: bne.n 800071a <dualClass_svmPredict_unroll4+0x6a>
My guess is that, due to pipelining, the timings are as follows:
vfma.f32 s11, s7, s15    3 cycles
vfma.f32 s12, s15, s8    2 cycles
vfma.f32 s13, s15, s9    2 cycles
vfma.f32 s14, s15, s10   2 cycles
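To make the arithmetic behind this guess explicit, here is a small C sketch. The function name model_vfma_run and its parameters are made up for illustration. The 3-cycle figure is the documented one quoted above, the 2-cycle figure is only my assumption, and the bne.n is given no cycles, as in the breakdown above. It is a back-of-the-envelope model, not a description of how the core really schedules these instructions.

```c
#include <stddef.h>

/* Hypothetical model (an assumption, not taken from any documentation):
 * the first vfma in a back-to-back run costs first_cost cycles and each
 * following vfma costs next_cost cycles. The branch is not counted. */
static unsigned model_vfma_run(size_t count, unsigned first_cost, unsigned next_cost)
{
    if (count == 0) {
        return 0; /* no instructions, no cycles */
    }
    return first_cost + (unsigned)(count - 1) * next_cost;
}
```

With four vfma instructions, model_vfma_run(4, 3, 2) gives 3 + 2 + 2 + 2 = 9, which matches the measured count.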
Am I right?
2. If the code is the following:
800073c: vfma.f32 s14, s15, s10
8000740: bne.n 800071a <dualClass_svmPredict_unroll4+0x6a>
Does the vfma then take 3 cycles to execute?
3. Why does a pipelined vfma take 2 cycles and not 1?
Thanks,