A53 NEON memory access behavior

Sigh, I had a nice big post with a bunch of details written out but much of it got deleted when I posted. Oh well, here's the short version.

Is there any documentation describing how NEON performs the actual memory accesses across the AXI bus for its ld4 instruction? I'm trying to read from a hardware FIFO on the assumption that the data would be read in order and then de-interleaved. The results I'm seeing on the hardware imply that it is either performing the accesses out of order, performing more accesses than I would expect, or otherwise not behaving the way I'd imagine. I expect an ld4 {vN.4s-vM.4s}, [x] instruction to read 16 bytes from each of the addresses x, x+16, x+32, and x+48, in that order, but that does not seem to be the case.
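To make the expectation concrete, here is a minimal sketch in portable C of the register contents I expect after the load. The names ld4_result and model_ld4_4s are made up for illustration. It only models the de-interleaved result of reading 64 consecutive bytes; it says nothing about how the core actually splits or orders the accesses on the bus, which is exactly the part I'm unsure about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of four 128-bit vector registers,
 * each holding four 32-bit lanes: v[reg][lane]. */
typedef struct {
    uint32_t v[4][4];
} ld4_result;

/* Models only the expected result of ld4 {vN.4s-vM.4s}, [x]:
 * 16 consecutive 32-bit words (64 bytes) are consumed, and word
 * 4*lane + reg ends up in lane `lane` of register `reg`.
 * Bus-level access order and size are NOT modeled here. */
static ld4_result model_ld4_4s(const uint32_t *mem)
{
    ld4_result r;
    for (size_t lane = 0; lane < 4; ++lane) {
        for (size_t reg = 0; reg < 4; ++reg) {
            r.v[reg][lane] = mem[4 * lane + reg];
        }
    }
    return r;
}
```

If the FIFO produced the words 0, 1, 2, ... 15 in order, I would expect the first register to hold 0, 4, 8, 12 and the last to hold 3, 7, 11, 15. That mapping only holds if the 64 bytes really are fetched from the FIFO in ascending address order.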

The memory in question is mapped as uncached device memory. Interleaved writes to this memory appear to happen in order, as I'd expect. From the AXI transaction information I'm logging, I can see that the core is performing single-beat 16-byte reads, but I am not tracking the actual addresses, so I can't confirm the order. Any details on the exact behavior would be greatly appreciated!
