Execution latency for ASIMD instructions using the V pipelines


From the software optimization guides for various Cortex-A cores, for example the Cortex-A76, it seems that all ASIMD instructions (integer or FP) issued to the V pipelines (V0 or V1) have a latency of at least 2 cycles, even really simple ones (AND, NOT, NEG, SHL, etc.) that could presumably be implemented in 1 cycle (as on x86 SSE/AVX).
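To put the difference in perspective, here is a rough back-of-the-envelope model of what a 2-cycle latency costs compared with a 1-cycle one. It is only a sketch: the function names are made up for illustration, and it assumes both V pipelines are fully pipelined and can each accept one new instruction per cycle, which I have not verified against the guide.

```python
import math

def chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops where each op consumes the previous result.

    A dependent chain cannot overlap, so it pays the full latency per op.
    """
    return n_ops * latency

def independent_cycles(n_ops: int, latency: int, pipes: int) -> int:
    """Cycles for n_ops with no dependencies between them.

    Assumption (not from the guide): each of the `pipes` pipelines
    accepts one new op per cycle, so only the last op's latency is exposed.
    """
    if n_ops == 0:
        return 0
    issue_cycles = math.ceil(n_ops / pipes)
    return issue_cycles - 1 + latency

if __name__ == "__main__":
    n = 100
    for lat in (1, 2):
        print(f"latency={lat}: "
              f"dependent chain={chain_cycles(n, lat)} cycles, "
              f"independent on 2 pipes={independent_cycles(n, lat, 2)} cycles")
```

Under these assumptions the extra cycle doubles the cost of a long dependent chain but is almost invisible when the operations are independent, so the impact depends heavily on how the code is scheduled.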

Is this related to a specific implementation of the ASIMD instruction set? Maybe the FPU is only 64 bits wide and operates on the first half of the 128-bit register, then on the second half, as in the old Cortex-A8/A9, at least for some instructions?

I have never checked at this level before, but I am a little surprised by the latency values!

The ARM Cortex-A76 Software Optimization Guide is available here: https://developer.arm.com/docs/swog307215/a

See p. 24-25 for the ASIMD integer instructions for AArch64.
