Exec latency for ASIMD instructions taking the V pipelines

sjub over 5 years ago

Hi,

From the software optimization guides for various Cortex-A cores (for example the Cortex-A76), it seems that all ASIMD instructions (integer or FP) issued to the V pipelines (V0 or V1) have a latency of at least 2 cycles, even really simple ones (AND, NOT, NEG, SHL, etc.) that could easily be implemented in 1 cycle (as on x86 SSE/AVX).

Is this related to a specific implementation of the ASIMD instruction set? Maybe the FPU is only 64 bits wide and operates on the first half of the 128-bit register, then on the second half, as in the old Cortex-A8/A9, at least for some instructions?

I have never checked at this level before, but I am a little surprised by the latency values!

The ARM Cortex-A76 Software Optimization Guide is available here: https://developer.arm.com/docs/swog307215/a

See pp. 24-25 for the ASIMD integer instructions for AArch64.

42Bastian Schick over 5 years ago

You should look not at the latency but at the throughput, which is 1, often 2. The latency only matters when two instructions depend on each other.
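To make the latency/throughput distinction concrete, here is a minimal sketch of a simplified timing model. It assumes fully pipelined execution units and ignores everything else (decode width, register renaming limits, memory). The function names are hypothetical, and the figures used (latency 2, 2 operations issued per cycle) are the ones quoted in this thread, not measurements.

```python
def chain_cycles(n_ops, latency):
    """Cycles for n_ops where each op consumes the previous op's result.

    Every op must wait for the full latency of its predecessor, so the
    throughput of the pipelines does not help at all.
    """
    return n_ops * latency


def independent_cycles(n_ops, latency, ops_per_cycle):
    """Cycles for n_ops mutually independent ops on pipelined units.

    Ops issue back to back at ops_per_cycle; the last result is ready
    `latency` cycles after the last issue slot.
    """
    issue_cycles = -(-n_ops // ops_per_cycle)  # ceiling division
    return issue_cycles - 1 + latency


# Eight simple ASIMD ops, latency 2, two V pipelines (assumed values).
dependent = chain_cycles(8, latency=2)                           # 16 cycles
independent = independent_cycles(8, latency=2, ops_per_cycle=2)  # 5 cycles
```

In this model the dependent chain is bound by latency and the independent stream is bound by throughput, which is why both numbers matter depending on the shape of the code.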
sjub over 5 years ago in reply to 42Bastian Schick

I must look at both the latency and the throughput to determine the dependencies between instructions and the critical paths for optimization purposes. Furthermore, if you look at the SVE instructions, you can see that the minimum latency for the same instructions (AND, NOT, etc.) is 4 cycles (9 for FMA, instead of 4-5 on Cortex-A), which becomes non-negligible when you want to optimize your code. So my next question is: how do you justify that an AND instruction requires 4 cycles on the A64FX, for both the NEON and the SVE instruction sets, instead of 2 on Cortex-A or only 1 on x86 SSE/AVX/AVX-512?

See the A64FX microarchitecture manual v1.1 for the AND instruction: p. 103 for ASIMD and p. 116 for SVE:
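A rough way to see why a higher latency matters for optimization: in steady state, the number of independent dependency chains needed to keep the pipelines busy is about latency times issue rate. The sketch below is a simplified model with hypothetical function names. The latencies (1, 2, 4, 9) are the ones quoted in this post, while the issue rate of 2 operations per cycle is an assumption made for illustration only.

```python
def chains_to_saturate(latency, ops_per_cycle):
    """Independent dependency chains needed to fill the pipelines.

    Each chain can only have one op in flight at a time, so hiding
    `latency` cycles at `ops_per_cycle` needs latency * ops_per_cycle chains.
    """
    return latency * ops_per_cycle


def cycles_per_op(latency, ops_per_cycle, chains):
    """Steady-state cost per op with `chains` interleaved dependency chains.

    The cost is limited either by latency (too few chains) or by the
    issue rate (enough chains).
    """
    return max(latency / chains, 1 / ops_per_cycle)


ISSUE_RATE = 2  # assumed ops per cycle, for illustration only

for name, latency in [("1-cycle AND", 1), ("2-cycle AND", 2),
                      ("4-cycle AND", 4), ("9-cycle FMA", 9)]:
    needed = chains_to_saturate(latency, ISSUE_RATE)
    print(f"{name}: {needed} independent chains to reach peak throughput")
```

Under these assumptions a 4-cycle AND needs twice as many independent chains in flight as a 2-cycle one, which translates into more unrolling and more live registers.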

https://github.com/fujitsu/A64FX/tree/master/doc

If you look further at the FDIV or FSQRT instructions, you may also be surprised by how the latency depends on the vector width. It clearly indicates that these instructions are decomposed into four 128-bit operations, since the latency for a 512-bit vector is 4 times that for a 128-bit vector.

Or maybe I missed something...

I know that this last question might be better asked on a Fujitsu forum, but I could not find a place to ask it on their website.