Hi,
From the software optimization guides for various Cortex-A cores, for example the Cortex-A76, it seems that all ASIMD instructions (integer or FP) that use the V pipelines (V0 or V1) have a latency of at least 2 cycles. That includes really simple ones (AND, NOT, NEG, SHL, etc.) which could easily be implemented in 1 cycle, as they are on x86 SSE/AVX.
Is this related to a specific implementation of the ASIMD instruction set? Maybe the FPU is only 64 bits wide and operates on the first half of the 128-bit register, then on the second half, as in the old Cortex-A8/A9, at least for some instructions?
I have never checked at this level before, but I am a little surprised by the latency values!
The ARM Cortex-A76 Software Optimization Guide is available here: https://developer.arm.com/docs/swog307215/a
See p. 24-25 for the ASIMD integer instructions for AArch64.
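
To make the question concrete, here is a rough sketch of what a latency of 2 instead of 1 would cost in practice. It is an idealized model and not a measurement: it assumes fully pipelined units, one instruction issued per pipeline per cycle, two V pipelines (V0 and V1, as mentioned above), and no other stalls. The function names are hypothetical and only serve the illustration.

```python
def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops instructions where each one consumes the previous result.

    Every instruction has to wait for the full latency of its predecessor,
    so the latency is paid n_ops times.
    """
    return n_ops * latency


def independent_ops_cycles(n_ops: int, latency: int, pipes: int) -> int:
    """Cycles for n_ops independent instructions on `pipes` pipelined units.

    Assumption: each pipeline accepts one new instruction per cycle, so the
    latency is paid only once, by the last instruction issued.
    """
    if n_ops == 0:
        return 0
    issue_cycles = (n_ops + pipes - 1) // pipes  # ceiling division
    return issue_cycles + latency - 1


if __name__ == "__main__":
    n = 1000
    for latency in (1, 2):
        chain = dependent_chain_cycles(n, latency)
        indep = independent_ops_cycles(n, latency, pipes=2)
        print(f"latency={latency}: dependent chain={chain}, independent={indep}")
```

Under these assumptions, the extra cycle doubles the cost of a long chain of dependent AND/NOT/SHL operations, but it is almost invisible for independent operations, where throughput dominates.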