Sigh, I had a nice big post with a bunch of details written out but much of it got deleted when I posted. Oh well, here's the short version.
Is there any documentation describing how NEON performs the actual memory accesses across the AXI bus for its ld4 instruction? I'm trying to read from a hardware FIFO on the assumption that the data is read in address order and then de-interleaved, but the results I'm seeing on hardware imply that the accesses are either performed out of order, more numerous than I would expect, or otherwise not what I'd imagine. I expect an ld4 {vN.4s-vM.4s}, [x] instruction to read 16 bytes each from addresses x, x+16, x+32, and x+48, in that order, but that does not seem to be the case.
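To make the expectation concrete, here is a minimal C model of the register contents a four-register ld4 with .4s lanes should end up with, assuming the 64 bytes are consumed in address order. The names ld4_result_t and ld4_4s_model are made up for illustration. The sketch only models the final de-interleaved result, and says nothing about how or in what order the core issues the AXI reads, which is the open question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for four 128-bit registers with four 32-bit lanes each. */
typedef struct {
    uint32_t v[4][4]; /* v[reg][lane] */
} ld4_result_t;

/*
 * Model of the de-interleaved result of a four-register ld4 on .4s lanes:
 * 32-bit word k of the 64-byte block lands in register k % 4, lane k / 4.
 * This describes register contents only, not bus transaction order.
 */
static ld4_result_t ld4_4s_model(const uint32_t *src)
{
    ld4_result_t r;
    assert(src != NULL);
    for (size_t k = 0; k < 16; ++k) {
        r.v[k % 4][k / 4] = src[k];
    }
    return r;
}
```

With this model, if the FIFO hands back words 0 through 15 in order, the first register holds words 0, 4, 8, 12, the second holds 1, 5, 9, 13, and so on. Comparing real register contents against this mapping is one way to see which 16-byte chunks arrived out of the expected order.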
The memory in question is uncached device memory. Interleaved writes to this memory appear to happen in order, as I'd expect. From the AXI transaction information I'm logging, I can see that it's performing single-beat 16-byte reads, but I'm not tracking the actual addresses, so I can't confirm the order. Any details on the exact behavior would be greatly appreciated!