Most of the new cores use this "consume in N{X}" and "produce in N{Y}" syntax - the pipelines are now too complex for the simpler early and late register model used for timing in the ARM9 and ARM11 cores. I agree it is a bit of a pain, but it's not too bad once you get used to it.

As you suggest, the {X} and {Y} numbers are the pipeline stages at which registers are consumed and results are produced. You don't actually need the pipeline details to use the tables; you can infer everything you need from the stage numbers. The important facts:

- All pipeline stages take one cycle.
- You can dual issue some instructions, so to optimally fill an N cycle interlock gap you need 2N instructions of suitable pairings.
- Moving data from NEON to ARM registers on Cortex-A8 is expensive, so NEON on Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s05s02.html]. (Cortex-A9 is much better at this.)

Using the tables:

- If you have an instruction which consumes a register in N1 and produces a result in N3, then the result value is not available to the next instruction until N4 - effectively 3 cycles (4 - 1) after the initial instruction was issued. You would need to fill the gap between the two dependent instructions with 6 other ARM or NEON instructions (because we can dual issue).
- If you have an instruction which consumes in N2 and produces in N5 (result ready in N6), and a dependent instruction which consumes in N1, then you have a 5 cycle latency: four cycles for the first instruction's latency (6 - 2), plus one cycle because the second instruction consumes a cycle earlier in the pipeline than the first (2 - 1).

Worth noting: only certain pairs of instructions can be dual issued, and it is alignment sensitive because the core is in-order. For example, you may have a sequence ABCD where AB and CD can be dual issued as pairs, but you happen to actually execute xA, then B, then CD. In this case x (the previous instruction) and A hit the pipeline together and were a valid dual issue pair, so they were issued together. B and C are not a valid dual issue pair, so only B is issued, and finally C and D are issued together. Your sequence ABCD looks like it might take 2 cycles but actually took 3.

Also worth noting: one of the biggest killers on modern cores, where the clock frequency is much higher than the memory clock, is not really pipeline cycles for arithmetic instructions but memory latency when you miss on a data load. If you miss the L2 cache on a data load it can take _hundreds_ of cycles to fetch that data from external memory; if you miss in the TLB (the MMU's cache) and in the L2 data cache it can take a couple of multiples of that. If you know what your data set is going to be, then issue PLD instructions, or even just manually touch the data with an LDR (even if you don't use it that time around and reload it later), as early as possible to maximize the chance it is in cache when you actually need it.
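If you are working with a GCC/Clang toolchain, you don't need to hand-write PLD instructions to get prefetching: `__builtin_prefetch` lowers to PLD on ARM targets that support it. This is just my own illustrative sketch (the `scale` function and the `PREFETCH_AHEAD` distance are made up for the example, not tuned values):

```c
#include <stddef.h>

/* How many elements ahead of the current index to prefetch.
   Illustrative only - the right distance depends on your core,
   memory system, and loop body. */
#define PREFETCH_AHEAD 8

/* Multiply an array by a constant, prefetching source data ahead of
   use so it is (hopefully) already in cache when we load it. */
void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&src[i + PREFETCH_AHEAD]);
        dst[i] = src[i] * k;
    }
}
```

The prefetch is only a hint, so the code stays correct (just slower) on targets where it compiles to nothing.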
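The consume/produce arithmetic above reduces to a one-line rule: the result becomes readable one stage after it is produced, so the latency is (producer's produce stage + 1) minus the dependent instruction's consume stage. A tiny sketch of that rule (my own helper, not anything from the TRM), checked against the two worked examples:

```c
#include <stdio.h>

/* Latency in cycles between two dependent instructions, given the
   N{Y} stage where the first produces its result and the N{X} stage
   where the second consumes it. The result is readable one stage
   after it is produced. */
static int dependency_latency(int produce_stage, int consume_stage)
{
    return (produce_stage + 1) - consume_stage;
}

int main(void)
{
    /* Produce in N3, dependent consumes in N1: ready in N4, 3 cycles. */
    printf("%d\n", dependency_latency(3, 1)); /* prints 3 */

    /* Produce in N5, dependent consumes in N1: ready in N6, 5 cycles
       - the same answer as (6 - 2) + (2 - 1) in the second example. */
    printf("%d\n", dependency_latency(5, 1)); /* prints 5 */
    return 0;
}
```

Remember that with dual issue you need up to twice that many independent instructions to fill the gap.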