My question to you is: why do you care?
The important question isn't how many cycles pass from the "start" to the "end" of the pipeline; it's how many cycles it takes to traverse a critical path from one pipeline stage to another. The fetch stage F1 has the PC as its input dependency, so the worst-case latency depends on when the PC is resolved. It will be resolved by stage E5, so you won't see a critical path from N6 to F1.
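To put the critical-path point in concrete terms, here is a minimal sketch in Python. It assumes a simplified in-order model in which a result becomes available at the end of one stage and a dependent instruction needs it at the start of another. The function names and the stage numbers in the usage note are hypothetical and are not taken from the TRM.

```python
# Toy model of a producer/consumer dependency in an in-order pipeline.
# Assumption: the producer's result is available at the END of result_stage,
# and the consumer needs its source at the START of source_stage.
# Stage numbers are plain integers counted from the issue point.

def dependency_distance(result_stage: int, source_stage: int) -> int:
    """Minimum number of cycles between issuing the producer and
    issuing a consumer that depends on its result."""
    # A dependent instruction can never issue in the same cycle,
    # so the distance is at least 1.
    return max(1, result_stage - source_stage + 1)


def stall_cycles(result_stage: int, source_stage: int) -> int:
    """Cycles lost compared with back-to-back issue."""
    return dependency_distance(result_stage, source_stage) - 1
```

With these hypothetical numbers, stall_cycles(5, 2) gives 3 and stall_cycles(3, 3) gives 0. The total depth of the pipeline never enters the calculation; only the distance between the producing and consuming stages does.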
Yes, it is a shared fetch and decode unit for the initial NEON decode before issue to the ARM pipeline. The NEON backend does some more decoding to work out exactly what to do with the instruction. NEON reuses some ARM registers for things like address generation, which the NEON unit does not have direct access to, so things like address calculation happen "on the way through" the ARM pipeline. NEON has its own issue queue after the ARM pipeline, so the two are decoupled to some extent. See above: ARM functional units are reused for part of the NEON operation.
I have some more questions.
A dual-cycle NEON instruction like VMUL.F32 Qd, Qn, Dm[x] computes Qd in two cycles, the low half first and then the high half. What does the instruction do after finishing the first cycle: does it return to the instruction queue, or does it go back to the beginning of the pipeline? I'm confused here. It should go to the instruction queue, because the last cycle of a multi-cycle data-processing instruction can be dual-issued with a load/store instruction. But it should also stay in the pipeline, because the result must be generated in the N9 stage. Which one is right?
A similar question: in the TRM, instructions like VMLA and VMLS are said to start execution on the fp multiply pipeline, and the result is then forwarded to the fp add pipeline. Does this "forward" mean a shortcut that skips the instruction queue?
The final one: what happens with the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?
To sum up: how do I calculate the total number of cycles a NEON instruction takes, from fetch to write-back, taking dual issue into account?
How many cycles do fetch and decode take? Are they the same for ARM and NEON instructions?
Assuming that's right, NEON instructions are decoded after the ARM pipeline. Does that mean NEON instructions have to pass through the entire ARM pipeline before they are decoded?
And when does dual issue happen: after decoding but before the pipeline?
Why do NEON instructions need to be decoded twice? Isn't that a waste of time and die size?
Can you share the reply you got to the original question about how to find instruction latencies?
Thank you in advance.
It looks like some of the content of this thread was lost when the community was ported over from its old incarnation to this new platform. Perhaps bradnemire might be able to help track down the old content here?
As Joe mentioned, this discussion thread was migrated from our former forums, and it looks like the organization of the replies is a bit mixed up. I have now marked Peter Harris' reply as correct (as this was the original reply), so if you go to the top of this thread you will see his reply embedded. I hope this helps answer your question, and apologies for the confusion.
There's an interesting (but experimental) Cortex-A8 NEON cycle counter at http://pulsar.webshaker.net/ccc/index.php?lng=us