Why is a Neon LD4 instruction resolved to two ldp instructions?

I noticed that my intrinsic code, which I expected to generate:

LD4 {v0.16B, v1.16B, v2.16B, v3.16B}, [x1], #64

is resolved to:

ldp q0, q1, [x1], #32
ldp q2, q3, [x1], #32

This is quite confusing:

1. Why is LD4 resolved to two ldp instructions? Is this a compiler optimization? I would have expected one ld4 to be faster than two ldp.

2. Why are the v registers resolved to q registers? I thought q registers were only used in AArch32, and this is AArch64.

I also tried inline assembly:

ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [" src_r "], #64

This is resolved as expected:

ld4 {v8.2d, v9.2d, v10.2d, v11.2d}, [x3], #64
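
To make question 1 concrete, here is a minimal plain-C sketch of what, as far as I understand the A64 instruction descriptions, the two sequences leave in the registers for the 16B case. It only models lane layout, not timing, and it does not use any real Neon API: the type `vreg16b` and the functions `model_ld4_16b`, `model_ldp_x2`, and `models_agree` are hypothetical names made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit vector register as 16 byte lanes. */
typedef struct { uint8_t lane[16]; } vreg16b;

/* Model of LD4 {v0.16B, v1.16B, v2.16B, v3.16B}, [src]:
   reads 64 bytes and de-interleaves them, so lane i of
   register r receives src[4*i + r]. */
static void model_ld4_16b(vreg16b out[4], const uint8_t *src) {
    for (int i = 0; i < 16; i++)
        for (int r = 0; r < 4; r++)
            out[r].lane[i] = src[4 * i + r];
}

/* Model of two "ldp q, q" loads: four contiguous 16-byte copies,
   so register r receives src[16*r .. 16*r + 15] unchanged. */
static void model_ldp_x2(vreg16b out[4], const uint8_t *src) {
    for (int r = 0; r < 4; r++)
        memcpy(out[r].lane, src + 16 * r, 16);
}

/* Returns 1 if both models produce identical register contents
   for the given 64-byte source, 0 otherwise. */
static int models_agree(const uint8_t *src) {
    vreg16b a[4], b[4];
    model_ld4_16b(a, src);
    model_ldp_x2(b, src);
    return memcmp(a, b, sizeof a) == 0;
}
```

With a buffer holding the bytes 0..63, the LD4 model puts 0, 4, 8, 12, ... into v0, while the LDP model puts 0, 1, 2, 3, ... into q0. So, if this model is right, the two forms are only interchangeable when the surrounding code never depends on the de-interleaved layout.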
