
NEON instruction timing/latency

Note: This was originally posted on 7th July 2010 at http://forums.arm.com

Hello!

I am having trouble deciphering the tables in the Cortex-A8 Technical Reference Manual that contain the NEON Advanced SIMD instruction timings. There is no explanation anywhere of what the different N values mean. I suspect they are different stages in the pipeline, but since I have not yet been able to find any information on the NEON pipeline, they don't tell me anything.

What I would really like to see is the information that was available in the ARM1136 reference manual, specifically which registers are needed as early/late registers, the result latency and so on. It is probably possible to use the supplied N values to get something similar, but I haven't managed it yet.

There is clearly some latency in the NEON instructions, since I can gain quite a bit of performance by rearranging them. However, I would like to do this in a more scientific manner, where I can determine beforehand whether rearranging would gain anything, rather than, as now, simply trying to place instructions that depend on each other as far apart as possible.

Best regards,
//Leo
  • Note: This was originally posted on 7th July 2010 at http://forums.arm.com

    Most of the new cores use this "consume in N{X}" and "produce in N{Y}" syntax - the pipelines are now too complex for the simpler early/late register timing model used in the ARM9 and ARM11 cores. I agree it is a bit of a pain, but it's not too bad once you get used to it.

    As you suggest, the {X} and {Y} numbers are the pipeline stages at which registers are consumed and results are produced. You don't actually need the pipeline details to use the tables; you can infer everything you need from the stage numbers. The important facts:


    • All pipeline stages take one cycle
    • You can dual issue some instructions, so to fill an N-cycle interlock gap optimally you need 2N instructions in suitable pairings
    • Moving data from NEON to ARM registers on Cortex-A8 is expensive, so NEON on Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction - see the sketch just after this list. [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch16s05s02.html] (Cortex-A9 is much better at this.)
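
    A minimal sketch of the third point (the registers, values and surrounding code are entirely hypothetical; only the shape matters):

        @ Pattern that hurts on Cortex-A8: the ARM side waits on a NEON result.
        vadd.i32  q0, q1, q2        @ NEON arithmetic
        vmov.32   r0, d0[0]         @ NEON -> ARM transfer: long stall on Cortex-A8
        add       r0, r0, #1        @ ARM instruction depends on that transfer

        @ Preferred shape: keep the data in NEON and write it out with a store.
        vadd.i32  q0, q1, q2
        vst1.32   {d0, d1}, [r1]!   @ result goes to memory, no ARM register involved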

    Using the tables:

    If you have an instruction which consumes a register in N1 and produces a result in N3, then the result is not available to the next instruction until N4 - effectively 3 cycles (4 - 1) after the first instruction was issued. You would need to fill the gap between the two dependent instructions with 6 other ARM or NEON instructions (because the core can dual issue).
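
    As a purely illustrative scheduling sketch (the register choices are made up, and I am not claiming these exact stage numbers for VADD/VMUL - check the TRM tables for the instructions you actually use):

        @ Naive order: the second VADD interlocks waiting for q0.
        vadd.i32  q0, q1, q2          @ produces q0
        vadd.i32  q3, q0, q4          @ consumes q0 immediately -> stall

        @ Rescheduled: independent work fills the latency of the first VADD.
        vadd.i32  q0, q1, q2          @ produces q0
        vadd.i32  q5, q6, q7          @ independent
        vmul.i32  q8, q9, q10         @ independent
        vadd.i32  q11, q12, q13       @ independent
        vadd.i32  q3, q0, q4          @ q0 should now be ready

    Three fillers cover a 3-cycle gap if nothing dual issues; as noted above, you would need up to six if the surrounding code pairs up.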

    If you have an instruction which consumes in N2 and produces in N5 (result ready in N6), followed by a dependent instruction which consumes in N1, then you have a 5-cycle latency: four cycles for the first instruction's latency (6 - 2), plus one cycle because the second instruction consumes a cycle earlier in the pipeline than the first (2 - 1).
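
    Generalising the two examples, the distance you need between the two instructions works out to:

        stall-free gap (cycles) = (producer's result stage + 1) - dependent's consume stage

    That is just the arithmetic implied by the examples above (4 - 1 = 3 and 6 - 1 = 5); sanity-check it against the TRM for the particular instructions you care about.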

    Worth noting: only certain pairs of instructions can be dual issued, and pairing is alignment sensitive because the core is in-order. For example, you may have a sequence ABCD where AB and CD could be dual issued as pairs, but you happen to actually execute xA, then B, then CD. Here x (the previous instruction) and A hit the pipeline together and formed a valid dual issue pair, so they were issued together; B and C are not a valid pair, so only B is issued; finally C and D are issued together. Your sequence ABCD looks like it might take 2 cycles but actually took 3.
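
    Laid out as an issue timeline (the instruction letters are placeholders, exactly as in the paragraph above):

        cycle 0:  x  A     @ x (the previous instruction) and A pair up
        cycle 1:  B        @ B and C are not a legal pair, so B issues alone
        cycle 2:  C  D     @ C and D pair up
                           @ => ABCD takes 3 cycles here, not the 2 you hoped for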

    Worth noting: one of the biggest killers on modern cores, where the clock frequency is much higher than the memory clock, is not really pipeline cycles for arithmetic instructions but memory latency when a data load misses. If you miss in the L2 cache on a data load it can take _hundreds_ of cycles to fetch that data from external memory; if you miss in the TLB (MMU cache) as well as the L2 data cache it can take a couple of multiples of that. If you know what your data set is going to be, then issue PLD instructions, or even just manually touch the data with an LDR (even if you then don't use it that time around and reload it later), as early as possible to maximize the chance it is in the cache when you actually need it.
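
    A minimal prefetching sketch (the registers, the 192-byte prefetch distance and the loop structure are all made-up example values; tune the distance to your cache line size and the cost of the loop body):

        loop:
            pld       [r1, #192]         @ hint: start pulling a future block into the cache
            vld1.32   {d0, d1}, [r1]!    @ load the data needed this iteration
            vadd.i32  q0, q0, q8         @ ... real work on the data ...
            vst1.32   {d0, d1}, [r2]!
            subs      r3, r3, #1
            bne       loop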

    Iso

    Thanks, that cleared up a lot. I will have to look into dual issuing a bit more, since I more or less missed that completely. Have I understood correctly that you cannot dual issue two data-processing instructions (in other words, a vadd and a vmul cannot be dual issued, while a vadd and a vmov can, since vmov is a permute instruction)? And am I correct in interpreting the shift operations as data-processing rather than permute instructions?

    EDIT:
    At which N{X} is the instruction loaded, 1 or 0? Or, more concretely:
    VADD loads its sources in N2 and produces the result in N3. Assume that VADD is issued at t = 0 (measured in cycles). Will the source registers be loaded at t = 2 or t = 1? And when saying that the result is produced in N3, is that t = 3 or t = 2?

    From your example I gather that even though the result is produced in N3, it isn't actually available for use by another instruction until N4. Is this correct?


    //Leo