Hi,
From the software optimization guides for various Cortex-A cores, for example the Cortex-A76, it seems that all ASIMD instructions (integer or FP) issued to the V pipelines (V0 or V1) have a latency of at least 2 cycles, even really simple ones (AND, NOT, NEG, SHL, etc.) which could easily be implemented in 1 cycle (as on x86 SSE/AVX).
Is this related to a specific implementation of the ASIMD instruction set? Maybe the FPU is only 64 bits wide and operates on the first half of the 128-bit register, then on the second half, as in the old Cortex-A8/A9, at least for some instructions?
I had never checked at this level before, but I am a little surprised by the latency values!
The ARM Cortex-A76 Software Optimization Guide is available here: https://developer.arm.com/docs/swog307215/a
See p. 24-25 for the ASIMD integer instructions for AArch64.
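
To make concrete why the difference between 1 and 2 cycles matters to me, here is a rough back-of-the-envelope model in Python. It is only a sketch under my own assumptions: the pipelines are fully pipelined, one instruction can issue per V pipeline per cycle, and nothing else stalls. The function names and the numbers used (8 operations, 2 pipelines) are made up for illustration and are not taken from the guide.

```python
import math


def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for a chain where each op consumes the previous result.

    Every op must wait for its predecessor, so the cost is n_ops * latency.
    """
    return n_ops * latency


def independent_ops_cycles(n_ops: int, latency: int, pipes: int) -> int:
    """Idealized cycles for ops with no dependencies between them.

    Assumption: `pipes` ops issue per cycle and each pipe is fully
    pipelined, so only the last issued op pays the full latency.
    """
    if n_ops == 0:
        return 0
    issue_cycles = math.ceil(n_ops / pipes)
    return issue_cycles - 1 + latency


if __name__ == "__main__":
    n = 8  # hypothetical number of simple ASIMD ops (AND, NOT, ...)
    for lat in (1, 2):
        print(
            f"latency={lat}: "
            f"dependent={dependent_chain_cycles(n, lat)} cycles, "
            f"independent={independent_ops_cycles(n, lat, pipes=2)} cycles"
        )
```

In this model a dependent chain doubles in cost when the latency goes from 1 to 2, while independent operations are barely affected, so the latency values mostly hurt code with long dependency chains.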