I notice that my intrinsic code:
LD4 {v0.16B, v1.16B, v2.16B, v3.16B}, [x1], #64
is resolved to:
ldp q0, q1, [x1], #32
ldp q2, q3, [x1], #32
This is quite confusing:
1. Why is LD4 resolved to two ldp instructions? Is this some compiler optimization? I would have expected one ld4 to be faster than two ldp.
2. Why are the v registers resolved to q registers? I thought q registers were only used in AArch32, and this is AArch64.
I also tried inline assembly:
ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [" src_r "], #64
This is resolved as expected:
ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [x3], #64
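To make the comparison concrete, here is a minimal portable C sketch of what I understand the two sequences to do with the same 64 bytes. It is only a model, not the real instructions: the function names (`model_ld4_16b`, `model_ldp_x2`, `compare_models`) are hypothetical, and it assumes LD4 with four `.16B` registers de-interleaves with a stride of 4, while two post-indexed `ldp` of q registers load four contiguous 16-byte chunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of LD4 {v0.16B-v3.16B}: lane i of register r
 * is taken from src[4*i + r], i.e. the data is de-interleaved. */
static void model_ld4_16b(const uint8_t *src, uint8_t regs[4][16]) {
    for (int i = 0; i < 16; i++)
        for (int r = 0; r < 4; r++)
            regs[r][i] = src[4 * i + r];
}

/* Hypothetical model of two "ldp qA, qB, [x1], #32":
 * register r simply receives bytes 16*r .. 16*r+15. */
static void model_ldp_x2(const uint8_t *src, uint8_t regs[4][16]) {
    for (int r = 0; r < 4; r++)
        memcpy(regs[r], src + 16 * r, 16);
}

/* Returns 1 if the two models put different bytes in the registers. */
static int compare_models(void) {
    uint8_t src[64], a[4][16], b[4][16];
    for (int i = 0; i < 64; i++)
        src[i] = (uint8_t)i;            /* src = 0, 1, 2, ..., 63 */

    model_ld4_16b(src, a);
    model_ldp_x2(src, b);

    /* Sanity checks on the first lanes of register 0. */
    assert(a[0][1] == 4);               /* ld4 model: stride-4 pick */
    assert(b[0][1] == 1);               /* ldp model: contiguous    */

    return memcmp(a, b, sizeof a) != 0;
}
```

If this model is right, the two forms are not interchangeable in general, so a compiler could only substitute one for the other when the surrounding code makes the lane order irrelevant or undoes the de-interleave.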