I notice that my intrinsic code:
LD4 {v0.16B, v1.16B, v2.16B, v3.16B}, [x1], #64
is resolved to:
ldp q0, q1, [x1], #32
ldp q2, q3, [x1], #32
It's quite confusing:
1. Why is LD4 resolved to two ldp instructions? Is this some compiler optimization? I would expect one ld4 to be faster than two ldp.
2. Why are the v registers resolved to q registers? I thought q registers were only used in AArch32, and this is AArch64.
I also tried inline assembly:
ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [" src_r "], #64
this is resolved as expected:
ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [x3], #64
Hi Shanshan
This does not seem to be correct. The LD4 instruction de-interleaves the data as it loads it (each register receives every fourth element), which the LDP instruction would not do.
https://developer.arm.com/documentation/102159/0400/Load-and-store---data-structures
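To illustrate why the two forms are not interchangeable, here is a rough scalar sketch in plain C of where each source byte ends up. This is only a model, not NEON code: the function names model_ld4_16b and model_ldp_x2 are made up for this example, and the 4x16 array simply stands in for four 128-bit registers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of LD4 {v0.16B, v1.16B, v2.16B, v3.16B}, [x1]:
 * lane i of register r is taken from src[4 * i + r], so each register
 * receives every fourth byte of the 64-byte block. */
static void model_ld4_16b(const uint8_t src[64], uint8_t regs[4][16])
{
    for (int i = 0; i < 16; i++) {
        for (int r = 0; r < 4; r++) {
            regs[r][i] = src[4 * i + r];
        }
    }
}

/* Hypothetical scalar model of two "ldp qA, qB" loads:
 * four contiguous 16-byte copies, with no reordering of bytes. */
static void model_ldp_x2(const uint8_t src[64], uint8_t regs[4][16])
{
    for (int r = 0; r < 4; r++) {
        memcpy(regs[r], src + 16 * r, 16);
    }
}
```

With a source buffer holding the bytes 0, 1, 2, ..., 63, the LD4 model puts 0, 4, 8, 12, ... in the first register, while the LDP model puts 0, 1, 2, 3, ... there. A compiler could therefore only substitute one for the other if the surrounding code undid the difference.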
To understand this properly, can you provide a full code example, as well as the build options and compiler version used?
It may be best to raise an official support case with Arm from the support menu above, so that this can be analysed properly.