NEON pipeline stages in instruction timing

Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

I'm trying to understand the instruction timing of the Cortex-A8/A9 in more detail.

In the A8 TRM, the timing is described with stages such as E1 or N2, which I take to mean pipeline stage "Execution 1" in the ARM pipeline and "Execution 2" in the NEON pipeline. Is that right?
I assume there must be cycles for fetching and decoding before execution. How many cycles do fetching and decoding take? Are they the same for ARM and NEON?

I found this figure after googling.

Is that a correct description of the A8 pipeline?

Assuming it's right, NEON instruction decoding comes after the ARM pipeline. Does that mean NEON instructions have to pass through the entire ARM pipeline first and only then get decoded? And when does dual issue happen: after decoding, before the pipeline? Why do NEON instructions need to be decoded twice? Isn't that a waste of time and die size?

To sum up: how do I calculate the total number of cycles a NEON instruction takes, from fetch to write-back, taking dual issue into consideration?

Thank you so much.
  • Note: This was originally posted on 4th April 2012 at http://forums.arm.com

    Thank you so much for such a careful answer.

    I have some more questions:
    A dual-cycle NEON instruction like VMUL.F32 Qd, Qn, Dm[x] computes Qd in two cycles, the low half first and then the high half.
    What does the VMLA instruction do after finishing the first cycle: return to the instruction queue, or return to the beginning of the pipeline? I'm confused here. It seems it should go back to the instruction queue, because the last cycle of a multi-cycle data-processing instruction can be dual-issued with a load/store instruction. But it seems it should also stay in the pipeline, because the result must be generated in stage N9. Which one is right?

    Another, similar question:
    In the TRM, instructions like VMLA and VMLS are said to start execution on the fp multiply pipeline first, and then the result is forwarded to the fp add pipeline.
    Does this "forward" mean a shortcut that skips the instruction queue?

    The final one:
    What happens with the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?

    Thanks
  • Note: This was originally posted on 4th April 2012 at http://forums.arm.com

    Thank you for your reply. I have a better understanding of NEON instruction decoding now.

    My question to you is: why do you care?

    I want to draw a pipeline timing table that shows the status of NEON instructions in the pipeline, i.e. which instruction is in which stage. I want to know how the pipeline is being used.

    The important question isn't how many cycles happen from the "start" to the "end" of the pipeline, it's how many cycles need to pass in order to traverse a critical path from one pipeline stage to another. The fetch stage F1 has as its input dependencies its PC. So the worst case latency depends on when PC is resolved. This will be resolved by stage E5. So you won't see a critical path from N6 to F1.

    Sorry, I don't understand your point here, especially:
    So you won't see a critical path from N6 to F1.

    Could you please explain a little more?

    Thanks
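A pipeline timing table of the kind described above could be sketched roughly as follows. This is only a toy model: the stage names N1 to N6 come from the thread, but the instruction names, the issue cycles, and the assumption that every instruction advances exactly one stage per cycle with no stalls are all hypothetical and not taken from the Cortex-A8 TRM.

```python
# Toy in-order pipeline occupancy table.
# ASSUMPTIONS (not Cortex-A8 facts): one stage per cycle, no stalls,
# made-up instruction mix and issue cycles.

def occupancy_table(instructions, stages, issue_cycles):
    """Return rows of (instruction, cells), one cell per cycle.

    A cell holds the stage name the instruction occupies in that cycle,
    or "." if the instruction is not in the pipeline.
    """
    total_cycles = max(issue_cycles) + len(stages)
    rows = []
    for name, start in zip(instructions, issue_cycles):
        cells = ["."] * total_cycles
        for offset, stage in enumerate(stages):
            cells[start + offset] = stage
        rows.append((name, cells))
    return rows


def format_table(rows):
    """Render the rows as aligned plain text, one instruction per line."""
    width = max(len(name) for name, _ in rows)
    lines = []
    for name, cells in rows:
        lines.append(name.ljust(width) + " | " + " ".join(c.ljust(2) for c in cells))
    return "\n".join(lines)


STAGES = ["N1", "N2", "N3", "N4", "N5", "N6"]
INSTRUCTIONS = ["vmul q0", "vadd q1", "vld1 d4"]  # hypothetical sequence
ISSUE_CYCLES = [0, 1, 1]  # hypothetical: the last two are dual-issued in cycle 1

rows = occupancy_table(INSTRUCTIONS, STAGES, ISSUE_CYCLES)
print(format_table(rows))
```

Stalls could be modelled by choosing later issue cycles for dependent instructions; the table itself only visualizes whatever issue schedule it is given.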
  • Note: This was originally posted on 6th April 2012 at http://forums.arm.com




    Yes, it is a shared fetch and decode unit for the initial NEON decode before issue to the ARM pipeline. The NEON backend does some more work to figure out exactly what to do with it.


    NEON reuses some ARM registers for things like address generation, which the NEON unit does not have direct access to, so things like address calculation happen "on the way through" the ARM pipeline.


    NEON has its own issue queue after the ARM pipeline. The two are decoupled to some extent.


    See above - ARM functional units are reused for part of NEON operation.


    Could you please help me with the multi-cycle instructions and the multiply-accumulate instructions?

    I have some more questions:
    A dual-cycle NEON instruction like VMUL.F32 Qd, Qn, Dm[x] computes Qd in two cycles, the low half first and then the high half.
    What does the VMLA instruction do after finishing the first cycle: return to the instruction queue, or return to the beginning of the pipeline? I'm confused here. It seems it should go back to the instruction queue, because the last cycle of a multi-cycle data-processing instruction can be dual-issued with a load/store instruction. But it seems it should also stay in the pipeline, because the result must be generated in stage N9. Which one is right?

    Another, similar question:
    In the TRM, instructions like VMLA and VMLS are said to start execution on the fp multiply pipeline first, and then the result is forwarded to the fp add pipeline.
    Does this "forward" mean a shortcut that skips the instruction queue?

    The final one: What happens with the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?
  • Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

    Assuming it's right, the decoding of NEON instruction is after the ARM pipeline. Does it mean that NEON instructions have to pass through the entire ARM pipeline first then get decoded? And when does dual issue happen, after decoding before pipeline? Why NEON instructions need to be decoded twice? Isn't it a waste of time and die size?


    The decoding in stages D0-D4 only partially decodes NEON instructions. Basically, it'll perform some issue work for NEON loads and stores. Otherwise it just determines that it's a NEON instruction, which, from the point of view of the ARM pipeline, looks like a NOP. So it's never really decoded twice; different stages just address different parts of the decoding. With that in mind, your question is kind of like asking why there are 5 stages for integer decoding instead of just one.

    The summing up question: how to calculate the number of cycles that a NEON instruction takes in total, from fetch to write back and taking dual issue into consideration?


    Like Isogen says, it's a difficult question because NEON execution is decoupled from normal integer execution by a queue. That means that the instruction that finishes in stage E5 is not necessarily the one that passes to stage M0.

    My question to you is: why do you care? The important question isn't how many cycles happen from the "start" to the "end" of the pipeline, it's how many cycles need to pass in order to traverse a critical path from one pipeline stage to another. The fetch stage F1 has as its input dependencies its PC. So the worst case latency depends on when PC is resolved. This will be resolved by stage E5. So you won't see a critical path from N6 to F1.

    Probably the worst path that can be taken is moving from NEON to ARM registers, where you have to go from N6 to somewhere in D0-E0 or so. So this is important to try to avoid on Cortex-A8. But other than that, in normal usage, you usually only care about how NEON instructions interact with each other - and since their input dependencies happen in the N stages, it's the N stages that the Cortex-A8 TRM bothers to document.
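The point that only the N stages matter for NEON-to-NEON dependencies can be illustrated with a small calculation. The sketch below assumes a simplified model that is not quoted from the TRM: a producer's result becomes usable in the cycle after the stage in which it is produced, a consumer reads its operand at the start of its source stage, and each instruction issued in between takes exactly one issue cycle. The stage numbers in the examples are hypothetical, not values for any particular instruction.

```python
# Simplified stall estimate between two dependent NEON instructions.
# ASSUMPTIONS (a model, not TRM text):
#   - result is usable in the cycle after stage N<result_stage>
#   - operand is read at the start of stage N<source_stage>
#   - single issue, one cycle per instruction between producer and consumer

def stall_cycles(result_stage, source_stage, independent_between=0):
    """Cycles the consumer has to wait beyond its earliest issue slot.

    result_stage: N stage in which the producer generates its result
    source_stage: N stage in which the consumer needs that operand
    independent_between: unrelated instructions issued between the two
    """
    raw = result_stage - source_stage - independent_between
    return max(0, raw)


# Hypothetical stage numbers, for illustration only.
examples = [
    ("back to back", stall_cycles(5, 2)),
    ("two independent instructions in between", stall_cycles(5, 2, 2)),
    ("operand needed later than result is ready", stall_cycles(3, 5)),
]

for label, stalls in examples:
    print(f"{label}: {stalls} stall cycle(s)")
```

Under this model the fetch and decode stages drop out entirely, because both instructions pass through them with the same delay; only the distance between the producing and consuming N stages matters.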
  • Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

    How many cycles do fetching and decoding take? Are they the same for ARM and NEON?


    Yes, it is a shared fetch and decode unit for the initial NEON decode before issue to the ARM pipeline. The NEON backend does some more work to figure out exactly what to do with it.

    Assuming it's right, the decoding of NEON instruction is after the ARM pipeline. Does it mean that NEON instructions have to pass through the entire ARM pipeline first then get decoded?


    NEON reuses some ARM registers for things like address generation, which the NEON unit does not have direct access to, so things like address calculation happen "on the way through" the ARM pipeline.

    And when does dual issue happen, after decoding before pipeline?

    NEON has its own issue queue after the ARM pipeline. The two are decoupled to some extent.

    Why NEON instructions need to be decoded twice? Isn't it a waste of time and die size?

    See above - ARM functional units are reused for part of NEON operation.
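The decoupling through an issue queue mentioned in this reply can be pictured with a toy simulation. Everything in it is an assumption made for illustration: the instruction names, the busy times, the unbounded queue, and the simplification that an instruction can be pushed and popped in the same cycle. It is not a model of the real Cortex-A8 queue depth or timing.

```python
from collections import deque

# Toy model of two pipelines decoupled by a queue.
# ASSUMPTIONS: unbounded queue, same-cycle push and pop allowed,
# made-up instruction names and busy times.

def simulate(arm_retire_order, neon_busy_cycles, cycles):
    """Return a log of (cycle, pushed_by_arm_side, popped_by_neon_side).

    The ARM side hands over one NEON instruction per cycle while it has any.
    The NEON side takes the oldest queued instruction only when it is idle,
    then stays busy for neon_busy_cycles[instruction] cycles.
    """
    queue = deque()
    pending = deque(arm_retire_order)
    busy = 0
    log = []
    for cycle in range(cycles):
        pushed = pending.popleft() if pending else None
        if pushed is not None:
            queue.append(pushed)
        popped = None
        if busy == 0 and queue:
            popped = queue.popleft()
            busy = neon_busy_cycles[popped]
        if busy > 0:
            busy -= 1
        log.append((cycle, pushed, popped))
    return log


log = simulate(["i0", "i1", "i2"], {"i0": 3, "i1": 1, "i2": 1}, cycles=6)
for cycle, pushed, popped in log:
    print(f"cycle {cycle}: ARM side hands over {pushed}, NEON side starts {popped}")
```

In this run the ARM side hands over `i2` in cycle 2 while the NEON side is still occupied with `i0`, and in cycle 3 the NEON side starts `i1`, which left the ARM side two cycles earlier. That is the sense in which the instruction leaving one pipeline is not necessarily the one entering the other.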
  • Can you share the reply you got to the original question about how to find instruction latencies?

    Thank you in advance.

  • It looks like some of the content of this thread was lost when the community was ported over from its old incarnation to this new platform. Perhaps bradnemire might be able to help track down the old content here?

  • As Joe mentioned, this discussion thread was migrated from our former forums and it looks like the organization of the replies is a bit mixed up. I have now marked Peter Harris' reply as correct (as this was the original reply) - now if you go to the top of this thread you will see his reply embedded. Hope this helps solve your question and apologies for the confusion.

  • There's an interesting (but experimental) Cortex-A8 NEON cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us