Please explain non-temporal example in programmer's guide

daith over 10 years ago

ARM Cortex-A Series Programmer’s Guide for ARMv8-A: 13.2.4. Non-temporal load and store pair

It talks about a relaxation of the memory ordering requirements and then gives the example

LDR X0, [X3]
DMB NSHLD
LDNP X2, X1, [X0]

saying the memory barrier is needed, otherwise it might read from an unpredictable address. I don't follow this at all; it just seems wrong to me.

Top replies

Peter Harris over 10 years ago +1 verified

What's wrong about it? It's a relaxation in the memory model which allows faster performance for some use cases (uncached streaming reads from an external media device, for example) if the microarchitecture...

0 daith over 10 years ago in reply to Peter Harris

Is it actually saying that without the memory barrier the processor would use a wrong address, that it wouldn't wait until X0 was loaded before using it as an address?
I could believe there might be a problem with
STR X3, [X0]
LDNP X2, X1, [X0]
but that has nothing to do with the address being wrong.
Um - and why is it putting the correct answer marker on this reply, like the one I'm replying to?

0 Juha Aaltonen over 10 years ago in reply to daith

I wonder if it's about the pipeline. Maybe the execution of LDNP can start before the data coloured for the LDR arrives from the memory, and the execution starts with evaluating X0. In heavily pipelined processors there are often mechanisms that hold the execution phase of the instruction until the coloured data is present.
Faster processors don't really execute instructions sequentially one by one.
This made me think so:

"the LDNP instruction might be observed before the preceding LDR instruction"

Maybe the answer is somewhere here:
ARM Information Center
I didn't find any description about the pipeline when I took a quick look into the manuals.

0 Chris Shore over 10 years ago in reply to Juha Aaltonen

In the ARMv8-A Architecture Reference Manual, there is this statement:
"Where an address dependency exists between two reads, and the second read was generated by a Load Non-temporal Pair instruction, then in the absence of any other barrier mechanism to achieve order, those memory accesses can be observed in any order by other observers within the shareability domain of the memory addresses being accessed." (Section B2.7.2)
The LDNP/STNP instructions are explicitly provided to allow the programmer to specify that, for this particular operation, the order of observation does not matter. As Pete has explained above, this might be used in the context of reading streaming data from some media device. If you use this instruction in a sequence where the order of observation _does_ matter, then you are misusing the instruction and should expect possible unintended behaviour.
Hope this helps.
Chris

0 daith over 10 years ago in reply to Chris Shore

I think I see now what it is trying to say, but the language is a bit obscure, and the trick of using an address to ensure data is okay to read isn't described in that publication. I think I'd say something like:
Non-temporal loads and stores relax the memory ordering requirements. Address dependencies are not guaranteed to be honoured. For instance, in

     Process 1 writes to a buffer, then puts the buffer's address into a shared buffer-address location
     Process 2 reads the buffer address, then uses LDNP to access the buffer

the read may get data from before the buffer was written to unless a memory barrier is used, as in for example
     LDR     X0, [X3]
     DMB     NSHLD
     LDNP   X2, X1, [X0]
Does that sound right?
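
One way to picture the two-process pattern above is a small portable C11 model. This is only a sketch under assumptions: the names `struct pair`, `buffer`, `buffer_address`, `publish` and `consume` are hypothetical, and C has no way to request LDNP, so the relaxed pointer load followed by an acquire fence merely stands in for the load / barrier / load shape of the assembly example. Without that fence (or an acquire load), the C memory model would not guarantee that the reader sees the buffer contents written before the address was published.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of them come from the ARM guide. */
struct pair { uint64_t lo, hi; };

static struct pair buffer;                    /* the data buffer                    */
static _Atomic(struct pair *) buffer_address; /* published pointer, NULL at startup */

/* "Process 1": write the buffer, then publish its address. */
static void publish(uint64_t lo, uint64_t hi)
{
    buffer.lo = lo;
    buffer.hi = hi;
    /* Release store: the buffer writes are ordered before the pointer write. */
    atomic_store_explicit(&buffer_address, &buffer, memory_order_release);
}

/* "Process 2": read the address, then read the buffer through it.
 * Returns 1 and fills *out if a buffer has been published, else 0. */
static int consume(struct pair *out)
{
    /* Plain (relaxed) load of the pointer, like the LDR in the example. */
    struct pair *p = atomic_load_explicit(&buffer_address, memory_order_relaxed);
    if (p == NULL)
        return 0;
    /* Barrier between the pointer load and the dependent data load,
     * playing the role of the DMB in the example. */
    atomic_thread_fence(memory_order_acquire);
    *out = *p;
    return 1;
}
```

In use, one thread would call `publish(1, 2)` while another polls `consume(&out)` until it returns 1; with the fence in place the reader then sees both values written by the publisher.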

0 Chris Shore over 10 years ago in reply to daith

I think you have complicated it unnecessarily by introducing the need for two processes. It is more fundamental than that.
The sequence right at the start of this thread shows that this is fundamental behaviour at the level of instruction execution on a single processor. In this sequence...
LDR X0, [X3]
LDNP X2, X1, [X0]
...what the architecture tells you is that the second instruction may complete (i.e. access the buffer at [x0]) before the first instruction completes (setting the value of x0 by reading from [x3]). If that matters to you, then you either need to insert a barrier (as in the example) or use a standard LDP instruction, rather than LDNP.
I hope that makes it clear!
Chris

0 daith over 10 years ago in reply to Chris Shore

The wording in the document makes it seem like the wrong address may be used for LDNP - not the one loaded from [X3]. I can't believe that is so. I really do think it is talking about address dependencies not being observed. With that, as you say, it could use data from [X0] from before the load of X0 is done.
If the problem is at the level of a single process then an example like
store 1 in buffer
store 2 in buffer
use LDNP to load from the buffer - it may get 1
would do the trick. Sounds ghastly but if true that would get the message across.
In fact just looking again at that document in 13.1.1 it talks about 'address dependencies' but uses the term to refer to a store-load dependency.
Looking at the ARM site I see that the Barrier Litmus Tests and Cookbook is superseded, but I don't know by what, so I don't know what the status of address dependency as a method to implement barriers is. I wouldn't mourn its loss, but it looked like it was there for some good reason. The ARMv8 ARM talks about address dependency in the same way as this document, as a dependency between two reads or a read and a write to the same location, but the structure of the example here is as in the Litmus test document.

0 Chris Shore over 10 years ago in reply to daith

"The wording in the document makes it seem like the wrong address may be used for LDNP - not the one loaded from [X3].I can't believe that is so."

It is indeed so! That is exactly what it is saying. And, in some circumstances, that is the behaviour which the programmer wants. Clearly, when using these instructions, you must be careful not to use them in ways which give undesirable behaviour.

"I really do think it is talking about address dependencies not being observed. With that as you say it could use data from [X0] from before when the load of X0 is done."

Yes, that's exactly what it is saying. Strange though it may seem!

Chris

PS - the Barrier Litmus Test document has been superseded since all of its content is now included in the ARMv8 ARM. It may not be expressed in exactly the same wording, but all the content has been included.

0 daith over 10 years ago in reply to Chris Shore

Looking at the ARMv8 ARM I see that it does describe address dependency in the way I mean in section B2.7.2, and it is consistent with the way the term is used in the Litmus test, so yes, the Litmus test has been incorporated, thanks.
So you are basically saying that an LDNP instruction does not even follow the basic register data dependency as described in 'Address dependencies and order' in that section? I'm afraid I think you have got the wrong end of the stick somewhere, as this type of behavior would be completely broken. The two things you said yes to above are different: address dependencies are not the same as register data dependencies. The first was a register data dependency and the second was an address dependency.