In the
ARM Cortex-A Series Programmer’s Guide for ARMv8-A, section 13.2.4, "Non-temporal load and store pair",
there is a discussion of a relaxation of the memory ordering requirements, followed by this example:
LDR X0, [X3]
DMB NSHLD
LDNP X2, X1, [X0]
The guide says the memory barrier is needed because otherwise the LDNP might read from an unpredictable address. I don't follow this at all; it just seems wrong to me.