Neon instruction timing/latency

Note: This was originally posted on 7th July 2010 at http://forums.arm.com

Hello!

I am having trouble deciphering the tables in the Cortex-A8 Technical Reference Manual that contain the NEON Advanced SIMD instruction timings. There is no explanation anywhere of what the different N values mean. I suspect that they are different steps in the pipeline, but since I have not yet been able to find any info on the NEON pipeline, they don't tell me anything.

What I would really like to see is the information that was available in the ARM1136 reference manual, specifically which registers are needed as early/late registers, the result latency and so on. It is probably possible to use the supplied N values to get something similar, but I haven't managed to yet.

There is clearly some latency in the NEON instructions, since I can gain quite a bit of performance by rearranging them. I would like to do this in a more scientific manner, where I can determine beforehand whether a rearrangement would gain anything. Right now I simply try to place instructions that depend on each other as far apart as possible.

Best regards,
//Leo
  • Note: This was originally posted on 8th July 2010 at http://forums.arm.com

    And am I correct in interpreting the shift operations as data-processing and not permute instructions?


    Yes. See [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Babfjcjb.html]

    At which N{X} is the instruction loaded, 1 or 0?

    VADD loads the sources in N2 and produces the result in N3. Assume that VADD is issued at t = 0 (measured in cycles). Will the source registers be loaded at t = 2 or t = 1? And when saying that result is produced in N3, is that t = 3 or t = 2?


    The registers used will be loaded in the N{x} cycle in which they are consumed (or fetched from forwarding paths then). What this means in real pipeline terms isn't important - all of the N{x} numbers in the tables are consistent with each other.

    From your example I gather that even though the result is produced in N3, it isn't actually available for use by another instruction until N4. Is this correct?


    Yes, that's right.
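
To make the rule in the answers above concrete, here is a minimal sketch of how the N values could be turned into stall estimates. It is not taken from the manual. It assumes a strictly in-order pipeline that issues one instruction per cycle, and it ignores dual issue, multi-cycle instructions and every other hazard. The names `Instr` and `schedule` and the register names are hypothetical. The only figures from this thread are the VADD values (sources read in N2, result produced in N3) and the rule that a result is usable one stage after it is produced. Only differences between N values matter in this model, in line with the remark that the numbers are consistent with each other.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    src_stage: int     # N stage in which the sources are read (2 for VADD)
    result_stage: int  # N stage in which the result is produced (3 for VADD)


def schedule(instrs):
    """Return the issue cycle of each instruction.

    Simplifying assumption: in-order, one instruction issued per cycle.
    """
    ready = {}        # register -> first cycle in which a consumer may read it
    issue_cycles = []
    next_issue = 0
    for ins in instrs:
        issue = next_issue
        for reg in ins.srcs:
            if reg in ready:
                # The consumer reads at issue + src_stage, which must not
                # come before the cycle in which the value becomes usable.
                issue = max(issue, ready[reg] - ins.src_stage)
        issue_cycles.append(issue)
        # Result produced in N{result_stage}, usable one stage later.
        ready[ins.dest] = issue + ins.result_stage + 1
        next_issue = issue + 1
    return issue_cycles


if __name__ == "__main__":
    a = Instr("vadd", "q0", ("q1", "q2"), 2, 3)
    b = Instr("vadd", "q3", ("q0", "q4"), 2, 3)  # depends on a
    c = Instr("vadd", "q5", ("q6", "q7"), 2, 3)  # independent
    print(schedule([a, b]))     # back-to-back dependent pair
    print(schedule([a, c, b]))  # independent instruction placed in between
```

Under these assumptions, the dependent VADD pair costs one stall cycle when issued back to back, and the stall disappears once an independent instruction is placed between the two. Real scheduling on the Cortex-A8 involves more constraints than this, so treat the output as a first estimate only.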