NEON pipeline stages in instruction timing

Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

I'm trying to understand instruction timing on the Cortex-A8/A9 in more detail.

In the A8 TRM, the timing is described with stage names such as E1 or N2, which I take to mean pipeline stage "Execution 1" in the ARM pipeline and "Execution 2" in the NEON pipeline. Is that right?
I think there must be cycles for fetching and decoding before execution. How many cycles do fetching and decoding take? Are they the same for ARM and NEON?

I found a figure of the pipeline after googling.

Is that a correct description of the A8 pipeline?

Assuming it's right, NEON instructions are decoded after the ARM pipeline. Does that mean NEON instructions have to pass through the entire ARM pipeline first and only then get decoded? And when does dual issue happen: after decoding, before the execution pipeline? Why do NEON instructions need to be decoded twice? Isn't that a waste of time and die area?

To sum up: how do I calculate the total number of cycles a NEON instruction takes, from fetch to write-back, taking dual issue into account?

Thank you so much.
  • Note: This was originally posted on 4th April 2012 at http://forums.arm.com

    Thank you so much for such a careful answer.

    I have some more questions:
    A dual-cycle NEON instruction like VMUL.F32 Qd, Qn, Dm[x] computes Qd in two cycles, low half first and then the high half.
    What does such an instruction do after finishing the first cycle: return to the instruction queue, or return to the beginning of the pipeline again? I'm confused here. It seems it should go back to the instruction queue, because the last cycle of a multi-cycle data-processing instruction can be dual-issued with a load/store instruction. But it also seems it should stay in the pipeline, because the result must be generated in the N9 stage. Which one is right?

    The other similar one:
    In the TRM, instructions like VMLA and VMLS are said to start execution on the fp multiply pipeline, and the result is then forwarded to the fp add pipeline.
    Does "forward" here mean a shortcut that skips the instruction queue?

    The final one:
    What happens with the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?

    Thanks
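
For readers following the last question, here is a minimal sketch in plain C of what VMLA.F32 Qd, Qn, Dm[x] computes arithmetically: each of the four lanes of Qd gets Qn[i] * Dm[x] added to it, as a separate multiply followed by an add. The types `q_reg` and `d_reg` and the function `vmla_f32_by_scalar` are hypothetical names made up for illustration. The split into two passes over 64-bit halves only mirrors the two-cycle description in the post above; it is an assumption, not a timing model, and it says nothing about the instruction queue or forwarding paths.

```c
#include <stddef.h>

/* Hypothetical register models: a Q register holds four floats,
 * a D register holds two. These are illustration-only types. */
typedef struct { float lane[4]; } q_reg;
typedef struct { float lane[2]; } d_reg;

/* Arithmetic effect of VMLA.F32 Qd, Qn, Dm[x]:
 *   Qd[i] = Qd[i] + Qn[i] * Dm[x]   for i = 0..3
 * The outer loop walks the low half (lanes 0-1) and then the high
 * half (lanes 2-3). That ordering is an assumption taken from the
 * discussion, not something this code can verify. */
static void vmla_f32_by_scalar(q_reg *qd, const q_reg *qn,
                               const d_reg *dm, size_t x)
{
    const float scalar = dm->lane[x];

    for (size_t half = 0; half < 2; ++half) {
        for (size_t i = 0; i < 2; ++i) {
            size_t idx = half * 2 + i;
            float product = qn->lane[idx] * scalar; /* multiply step */
            qd->lane[idx] += product;               /* accumulate step */
        }
    }
}
```

Calling it with x = 1 uses the upper lane of Dm as the scalar for all four multiplies.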