Neon instruction timing/latency

Note: This was originally posted on 7th July 2010 at http://forums.arm.com

Hello!

I am having trouble deciphering the tables in the Cortex-A8 Technical Reference Manual that contain the NEON Advanced SIMD instruction timings. There is no explanation anywhere of what the different N values mean. I suspect that they are different stages in the pipeline, but since I have not yet been able to find any information on the NEON pipeline, they don't tell me anything.

What I would really like to see is the information that was available in the ARM1136 reference manual, specifically which registers are needed as early/late registers, the result latency and so on. It is probably possible to use the supplied N values to get something similar, but I haven't managed to yet.

There is clearly some latency in the NEON instructions, since I can gain quite a bit of performance by rearranging them. I would like to do this in a more scientific manner, where I can determine beforehand whether rearranging would gain anything. At the moment I simply try to place instructions that depend on each other as far apart as possible.

Best regards,
//Leo

Peter Harris over 12 years ago

Note: This was originally posted on 8th July 2010 at http://forums.arm.com

And am I correct in interpreting the shift operations as data-processing and not permute instructions?

Yes. See [http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/Babfjcjb.html]

At which N{X} is the instruction loaded, 1 or 0?

VADD loads the sources in N2 and produces the result in N3. Assume that VADD is issued at t = 0 (measured in cycles). Will the source registers be loaded at t = 2 or t = 1? And when saying that result is produced in N3, is that t = 3 or t = 2?

The registers used will be read in the N{x} cycle in which the tables list them as consumed (or fetched from the forwarding paths at that point). What this means in real pipeline terms isn't important; all of the N{x} numbers in the tables are consistent with each other.

From your example I gather that even though the result is produced in N3, it isn't actually available for use by another instruction until N4. Is this correct?

Yes, that's right.
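As an illustration of the rule confirmed above, here is a minimal sketch of how the N values could be turned into a stall estimate. It is not taken from the TRM: the function names are hypothetical, and it assumes one instruction issued per cycle and that stage N{x} of an instruction issued at cycle t falls at cycle t + x. Only the differences between N values matter, so the absolute mapping is arbitrary.

```python
def earliest_issue(producer_issue, result_stage, source_stage):
    """Earliest cycle at which a dependent instruction can issue without stalling.

    result_stage: N stage in which the producer writes its result (3 for VADD).
    source_stage: N stage in which the consumer reads that register (2 for VADD).
    Assumption: stage N{x} of an instruction issued at cycle t occurs at t + x.
    """
    # Per the thread, a result produced in N3 is usable from N4 onwards.
    available = producer_issue + result_stage + 1
    # The consumer's source stage must not come before the result is available,
    # and it cannot issue earlier than the cycle after the producer.
    return max(producer_issue + 1, available - source_stage)


def stall_cycles(result_stage, source_stage):
    """Stall cycles if the consumer directly follows the producer."""
    return earliest_issue(0, result_stage, source_stage) - 1


# VADD feeding a dependent VADD: sources read in N2, result produced in N3.
print(stall_cycles(3, 2))
```

Under these assumptions, placing one independent instruction between the two VADDs would hide the stall. Instructions with other N values in the tables can be checked the same way by changing the two stage arguments.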