NEON pipeline stages in instruction timing

Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

I'm trying to understand the instruction timing of the Cortex-A8/A9 in more detail.

In the A8 TRM, the timing is described with stages such as E1 or N2, which I take to mean pipeline stage "Execution 1" in the ARM pipeline and "Execution 2" in the NEON pipeline. Is that right?
I assume there must be cycles for fetching and decoding before execution. How many cycles do fetching and decoding take? Are they the same for ARM and NEON?

I found this figure after googling.


Is that a correct description of the A8 pipeline?

Assuming it's right, NEON instructions are decoded after the ARM pipeline. Does that mean NEON instructions have to pass through the entire ARM pipeline before they get decoded? And when does dual issue happen: after decoding, before the pipeline? Why do NEON instructions need to be decoded twice? Isn't that a waste of time and die size?

To sum up: how do I calculate the total number of cycles a NEON instruction takes, from fetch to write-back, taking dual issue into consideration?

Thank you so much.
  • Note: This was originally posted on 3rd April 2012 at http://forums.arm.com

    Assuming it's right, the decoding of NEON instruction is after the ARM pipeline. Does it mean that NEON instructions have to pass through the entire ARM pipeline first then get decoded? And when does dual issue happen, after decoding before pipeline? Why do NEON instructions need to be decoded twice? Isn't it a waste of time and die size?


    The decoding in stages D0-D4 only partially decodes NEON instructions. Basically, it performs some issue work for NEON loads and stores. Otherwise it just determines that the instruction is a NEON instruction, which, from the point of view of the ARM pipeline, looks like a NOP. So it's never really decoded twice; different stages just handle different parts of the decoding. With that in mind, your question is a bit like asking why there are five stages for integer decoding instead of just one.

    The summing up question: how to calculate the number of cycles that a NEON instruction takes in total, from fetch to write back and taking dual issue into consideration?


    Like Isogen says, it's a difficult question because NEON execution is decoupled from normal integer execution by a queue. That means the instruction that finishes in stage E5 is not necessarily the one that passes to stage M0 next; it may wait in the queue behind earlier NEON instructions first.
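
To make the effect of that queue concrete, here is a toy Python model. It is only a sketch and not a description of the real hardware: the queue depth, the per-instruction NEON cost and the one-instruction-per-cycle hand-over are all hypothetical, and the names `simulate`, `neon_costs` and `queue_depth` are invented for illustration.

```python
from collections import deque


def simulate(neon_costs, queue_depth):
    """Toy model of a FIFO between the integer pipeline and the NEON unit.

    neon_costs[i] is a hypothetical number of cycles that instruction i
    occupies the NEON unit. Returns two lists indexed by instruction:
    the cycle it left the integer side (entered the queue) and the cycle
    it started on the NEON side.
    """
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    queue = deque()
    arm_done, neon_start = [], []
    next_to_issue = 0
    neon_busy_until = 0
    cycle = 0
    while len(neon_start) < len(neon_costs):
        # NEON side: start the oldest queued instruction once the unit is free.
        if queue and cycle >= neon_busy_until:
            idx = queue.popleft()
            neon_start.append(cycle)
            neon_busy_until = cycle + neon_costs[idx]
        # Integer side: hand over one instruction per cycle if the queue has room.
        if next_to_issue < len(neon_costs) and len(queue) < queue_depth:
            queue.append(next_to_issue)
            arm_done.append(cycle)
            next_to_issue += 1
        cycle += 1
    return arm_done, neon_start


if __name__ == "__main__":
    print(simulate([3, 3, 3, 3], queue_depth=2))
    print(simulate([3, 3, 3, 3], queue_depth=16))
```

With four instructions that each cost 3 cycles, the first call returns `([0, 1, 2, 4], [1, 4, 7, 10])` and the second returns `([0, 1, 2, 3], [1, 4, 7, 10])`. The wait between leaving the integer side and starting on the NEON side grows from 1 cycle to several as the queue backs up, which is why a single "fetch to write-back" cycle count is not well defined.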

    My question to you is: why do you care? The important question isn't how many cycles pass from the "start" to the "end" of the pipeline, it's how many cycles need to pass in order to traverse a critical path from one pipeline stage to another. The fetch stage F1 has the PC as its input dependency, so the worst-case latency depends on when the PC is resolved. That happens by stage E5, so you won't see a critical path from N6 to F1.

    Probably the worst path you can take is moving data from NEON to ARM registers, where you have to go from N6 to somewhere around D0-E0. So this is important to try to avoid on Cortex-A8. Other than that, in normal usage you usually only care about how NEON instructions interact with each other, and since their input dependencies occur in the N stages, it's the N stages that the Cortex-A8 TRM bothers to document.
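
As a rough illustration of how stage numbers turn into stalls between two dependent NEON instructions, here is a small Python sketch. It rests on assumptions that should be checked against the TRM: a result listed at stage N<r> is usable from the cycle after that stage, a source listed at stage N<s> is read at the start of that stage, and the consumer would otherwise issue one cycle after the producer. The function names and the stage numbers in the examples are hypothetical, not values taken from the TRM.

```python
def issue_gap(result_stage, source_stage):
    """Minimum number of cycles between issuing producer and consumer.

    Producer sits in stage N<k> during cycle k after issue, so its result
    is usable from cycle result_stage + 1. A consumer issued `gap` cycles
    later reads its source in cycle gap + source_stage.
    """
    return max(1, result_stage + 1 - source_stage)


def stall_cycles(result_stage, source_stage):
    """Cycles lost compared with issuing the consumer back to back."""
    return issue_gap(result_stage, source_stage) - 1


if __name__ == "__main__":
    # Hypothetical producer/consumer pairs (result stage, source stage).
    for r, s in [(6, 2), (3, 3), (2, 4)]:
        print(f"result N{r}, source N{s}: {stall_cycles(r, s)} stall cycle(s)")
```

Under these assumptions, a result at N6 feeding a source at N2 costs 4 stall cycles, while the other two pairs cost none. Scheduling independent instructions between producer and consumer is the usual way to hide such a gap.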