Hello arm experts,
I am trying to understand when a load access to a memory location might produce side effects that other observers in the system may care about. So far, all the examples I can find around dmb memory barriers in the ARMv8 reference material focus on the observability of *writes*, whose importance and shareability domains are fairly self-explanatory. What I have not been able to find is an example of when one might prefer dmb ishld over dmb nshld. Whether or not the memory address is in shareable memory, or visible to coherent caches, surely a read access cannot produce observable effects that would affect the correctness of the PE executing the dmb instruction?
If this is correct, then why does ARMv8 offer various domains instead of simply a single dmb ld with the least restrictive domain possible? And if it is not correct, what would be a practical example where the difference between dmb nshld, dmb ishld, and dmb oshld would matter?
Thanks!
The share-ability argument is saying who has to see the guarantees of the barrier (NSH=just this observer, ISH=observers in the Inner domain...). So the answer to your question is that you'd use ISHLD over NSHLD when it mattered that observers in the Inner share-ability domain saw the loads in order.
Let's take an example:
Thread 0      | Thread 1      ;
MOV W0,#1     | LDR W0,[FLAG] ;
STR W0,[MSG]  | DMB xLD       ;
DMB xST       | LDR W2,[MSG]  ;
MOV W2,#1     |               ;
STR W2,[FLAG] |               ;
Thread 0 is going to write a message (STR to MSG), then it is going to write a flag to say the message is valid (STR to FLAG). Thread 1 does the reverse: it reads the flag first and then the message.
What we care about is that if Thread 1 sees the flag set, then it MUST also see the message written. To ensure that, we put a DMB xST in Thread 0 and a DMB xLD in Thread 1.
Now, share-ability. If Thread 0 and Thread 1 both run on the same PE (i.e. same non-shareable domain), then we could replace x with NSH. However, if the two threads might run on different PEs within the same Inner domain, then we need to replace x with ISH to get the guarantee we need. Similarly, if the two threads ran on different PEs in the same Outer domain, then we'd need OSH.
There's a great tool for experimenting with ordering-type questions: the Memory Model Tool (arm.com). We can actually ask it that type of question. Here's the above test converted into the format used by the tool:
{
0:X1=x; 0:X3=y;
1:X1=x; 1:X3=y;
}
 P0          | P1          ;
 MOV W0,#1   | LDR W0,[X3] ;
 STR W0,[X1] | DMB NSHLD   ;
 DMB NSHST   | LDR W2,[X1] ;
 MOV W2,#1   |             ;
 STR W2,[X3] |             ;
exists
(1:X0=1 /\ 1:X2=0)
The "exists (1:X0=1 /\ 1:X2=0)" line is a question to the tool. It's saying "is it possible for P1 to end the test with X0=1 and X2=0", or "is it possible for P1 to end the test having seen the Flag but not the Message".
If we run the test in the tool it says:
Test MP Allowed
States 4
1:X0=0; 1:X2=0;
1:X0=0; 1:X2=1;
1:X0=1; 1:X2=0;
1:X0=1; 1:X2=1;
Ok
Witnesses
Positive: 1 Negative: 3
So... yes, it is possible! There are four legal outcomes, one of which matches the pattern we asked the tool to look for.
Meaning, if we specify NSH as the share-ability for the barriers, the ordering guarantee only applies to the Non-shareable domain. As these are two different PEs, and therefore in different Non-shareable domains, the barriers are not enough to get the desired effect.
Now, let's change it from NSH to ISH:
{
0:X1=x; 0:X3=y;
1:X1=x; 1:X3=y;
}
 P0          | P1          ;
 MOV W0,#1   | LDR W0,[X3] ;
 STR W0,[X1] | DMB ISHLD   ;
 DMB ISHST   | LDR W2,[X1] ;
 MOV W2,#1   |             ;
 STR W2,[X3] |             ;
exists
(1:X0=1 /\ 1:X2=0)
Now the model says:
Test MP Allowed
States 3
1:X0=0; 1:X2=0;
1:X0=0; 1:X2=1;
1:X0=1; 1:X2=1;
No
Witnesses
Positive: 0 Negative: 3
Now the tool is saying there is no possible/legal result where P1 sees the flag but not the message.
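To make the four-versus-three outcome counts concrete, here is a toy enumeration in Python. It is only a sketch and not the Arm memory model or the tool itself: it assumes a single shared memory, lets each PE swap its two accesses unless an effective barrier sits between them, and then tries every interleaving. The function name `mp_outcomes` and the flags `st_barrier` / `ld_barrier` are made up for this illustration; a flag set to True stands for "a barrier whose shareability domain covers both PEs is present", which is an assumption of the sketch.

```python
from itertools import combinations

def mp_outcomes(st_barrier, ld_barrier):
    """Return the set of (flag_seen, msg_seen) pairs the reader can end with.

    flag_seen corresponds to 1:X0 and msg_seen to 1:X2 in the litmus test.
    """
    # Program order is always allowed; the swapped order only without a barrier.
    writer_orders = [["Wmsg", "Wflag"]]
    if not st_barrier:
        writer_orders.append(["Wflag", "Wmsg"])
    reader_orders = [["Rflag", "Rmsg"]]
    if not ld_barrier:
        reader_orders.append(["Rmsg", "Rflag"])

    outcomes = set()
    for writer in writer_orders:
        for reader in reader_orders:
            # Pick which two of the four global time slots hold the writer's stores.
            for slots in combinations(range(4), 2):
                w, r = iter(writer), iter(reader)
                trace = [next(w) if i in slots else next(r) for i in range(4)]
                mem = {"msg": 0, "flag": 0}  # both locations start at 0
                seen = {}
                for op in trace:
                    kind, var = op[0], op[1:]
                    if kind == "W":
                        mem[var] = 1
                    else:
                        seen[var] = mem[var]
                outcomes.add((seen["flag"], seen["msg"]))
    return outcomes

if __name__ == "__main__":
    print(sorted(mp_outcomes(st_barrier=False, ld_barrier=False)))
    print(sorted(mp_outcomes(st_barrier=True, ld_barrier=True)))
```

Under these assumptions, the sketch yields four outcomes including (1, 0) when neither barrier is effective, and three outcomes without (1, 0) when both are. Setting only one of the two flags still leaves (1, 0) reachable, which is why the store-side and load-side barriers are used as a pair.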
Thanks for the examples Martin! I tried simulating the following, thinking that the observation of loads shouldn't affect the result:
AArch64 MP
"PodWW Rfe PodRR Fre"
Cycle=Rfe PodRR Fre PodWW
Generator=diycross7 (version 7.54+01(dev))
Prefetch=0:x=F,0:y=W,1:y=F,1:x=T
Com=Rf Fr
Orig=PodWW Rfe PodRR Fre
{
0:X1=x; 0:X3=y;
1:X1=y; 1:X3=x;
}
 P0          | P1          ;
 MOV W0,#1   | LDR W0,[X1] ;
 STR W0,[X1] | DMB NSHLD   ;
 DMB ISHST   | LDR W2,[X3] ;
 MOV W2,#1   |             ;
 STR W2,[X3] |             ;
exists
(1:X0=1 /\ 1:X2=0)
And, surprisingly got:
Test MP Allowed
States 4
1:X0=0; 1:X2=0;
1:X0=0; 1:X2=1;
1:X0=1; 1:X2=0;
1:X0=1; 1:X2=1;
Ok
Witnesses
Positive: 1 Negative: 3
Flag Assuming-common-inner-shareable-domain
Condition exists (1:X0=1 /\ 1:X2=0)
Observation MP Sometimes 1 3
Looking at the execution diagrams for this, as well as substituting dmb nsh, dmb nshld, and dmb nshst, I noticed that the execution flow was labelled as po (program order) for the NSHx cases. Looking at armfences.cat and aarch64fences.cat, it doesn't look like the NSHx barriers are implemented in the simulator, so they don't order memory accesses even on a single PE. Is that correct?
What I am trying to determine, is if there is a practical situation where correctness between threads running on different PEs could depend on other PEs having observed that a particular PE performed loads in a certain order.
Thanks for the examples Martin! I wanted to test if DMB NSHx would at least barrier accesses on the self-same PE, so I tried:
And the output was:
Looking through armfences.cat and aarch64fences.cat, it looks like DMB NSHx are not actually implemented in the simulator. Is that correct?
I am trying to determine if there's a practical case where PE-X could depend on having observed a certain order of loads by PE-Y, or if DMB NSHLD would be generally safe to use.
I thought NSH was covered by the model, but you could ask the team who work on it. There's a contact email address on the page that describes the model.
Vijay G said: What I am trying to determine, is if there is a practical situation where correctness between threads running on different PEs could depend on other PEs having observed that a particular PE performed loads in a certain order.
I'm not sure I understand. Isn't the mailbox example just that? For the message to be passed correctly, the reads must appear to happen in order.
In the mailbox example, P0 is writing the message and the flag, and P1 is reading the message and the flag. Does P0 need to observe P1's loads, in order to ensure program correctness?
Hmm. Suppose you extend the mailbox example so that P1 clears the flag to acknowledge receipt of the message. When P0 sees the flag cleared, it is permitted to write the message field again. The property we'd need to guarantee is that a write by P0 to message after seeing the cleared flag can't change the value of message read by P1 before it cleared the flag.
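This would give you a chain of dependencies:
- P0's write to the flag must not be re-ordered relative to its first write of the message.
- P1's read of the message must not be re-ordered relative to the read of the flag.
- P1's write to the flag must not be re-ordered relative to its read of message.
- P0's second write to message must not be re-ordered relative to its reads of the cleared flag.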
I don't know if that's what you meant by one PE observing another's reads. But it's a real (if simplified) example of where the writes by one PE must be ordered with respect to "earlier" reads by a different PE.
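The acknowledgement scheme can be explored with a small Python enumeration. This is a toy model under stated assumptions, not the Arm memory model: there is one shared memory, every barrier ("F",) is treated as a full barrier, accesses to the same variable keep program order, and anything else on a PE may swap. The flag encodings VALID and ACK, the register names, and the function names are all hypothetical. P0's second write is modelled as unconditional, and the check only considers runs in which P0 actually read the acknowledgement.

```python
from itertools import permutations

VALID, ACK = 1, 2  # hypothetical flag encodings; 0 means "empty"

def thread_orders(program):
    """Yield each order in which one PE may perform its accesses in this toy model."""
    accesses, epoch = [], 0
    for op in program:
        if op[0] == "F":
            epoch += 1  # accesses after a barrier belong to a later epoch
        else:
            accesses.append((epoch, len(accesses), op))
    for perm in permutations(accesses):
        # Illegal if a later-executed access is from an earlier epoch, or is an
        # earlier program-order access to the same variable.
        legal = all(
            not (later[0] < earlier[0]
                 or (later[2][1] == earlier[2][1] and later[1] < earlier[1]))
            for i, earlier in enumerate(perm)
            for later in perm[i + 1:]
        )
        if legal:
            yield [entry[2] for entry in perm]

def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def stale_overwrite_possible(p1_barrier_before_ack):
    """True if P1 can see VALID, P0 can see ACK, and P1 still reads message 2."""
    p0 = [("W", "msg", 1), ("F",), ("W", "flag", VALID),
          ("R", "flag", "p0_flag"), ("F",), ("W", "msg", 2)]
    p1 = [("R", "flag", "p1_flag"), ("F",), ("R", "msg", "p1_msg")]
    if p1_barrier_before_ack:
        p1.append(("F",))
    p1.append(("W", "flag", ACK))
    for o0 in thread_orders(p0):
        for o1 in thread_orders(p1):
            for trace in interleavings(o0, o1):
                mem, regs = {}, {}
                for kind, var, arg in trace:
                    if kind == "W":
                        mem[var] = arg
                    else:
                        regs[arg] = mem.get(var, 0)
                if (regs["p1_flag"] == VALID and regs["p0_flag"] == ACK
                        and regs["p1_msg"] == 2):
                    return True
    return False
```

In this sketch, leaving out the barrier between P1's read of the message and its write of the acknowledgement lets P1's read return the second message, while adding it rules that out. That matches the third link in the chain: P1's write to the flag must stay ordered after its read of the message as seen by P0.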
If I understand correctly, in the extended example above, we would expect the following:
Martin Weidmann said: P0's write to the flag must not be re-ordered relative to its first write of the message.
This would be satisfied by a DMB ISH or DMB ISHST instruction on P0 between writing the message and writing the flag.
Martin Weidmann said: P1's read of the message must not be re-ordered relative to the read of the flag.
This could be satisfied by a DMB NSHLD instruction on P1 between loading the flag and loading the message (i.e. we would not need to use DMB ISH or DMB ISHLD to ensure P0 observed P1 loading the flag.)
Martin Weidmann said: P1's write to the flag must not be re-ordered relative to its read of message.
This could be satisfied by a DMB NSHLD instruction on P1 between loading the message and writing the flag. Aside: If P1 were to produce some other state where observers expected to see the flag cleared before seeing state from P1, then P1 should use DMB ISHST after writing the flag, or, perhaps write the flag using STLR.
Martin Weidmann said: P0's second write to message must not be re-ordered relative to its reads of the cleared flag.
And this could be satisfied by a DMB NSHx instruction on P0 between loading the flag and writing the message (i.e. we would not need to use DMB ISHx to ensure P1 observed P0 loading the flag.) But, you would probably use DMB ISHx here just to prevent the second message write from being observed before the first message write (depending exactly how you wrote your flag polling loop on P0.)
Martin Weidmann said: I don't know if that's what you meant by one PE observing another's reads.
Not quite I don't think. Some more context here might help to clarify. My team would like to ensure that a given PE does not re-order its own loads relative to each other -- you could say it is quite like P1 in the original mailbox example. We would like to use the least restrictive barrier possible for this, and based on the documentation and our own tests, DMB NSHLD appears to be sufficient for this. But, a question has been raised as to what exactly are the effects of P1 loading a value that other observers can observe, and, if there are any practical cases where observers could need to see those effects (thus necessitating the use of DMB ISHLD or DMB OSHLD instead.)
Vijay G said: And, surprisingly got:
It might be that P1 expects a LD barrier that corresponds to the ST barrier, in order for it to respond to the invalidation of [X3] upon seeing the LD barrier. NSHLD isn't expected to pair with ISHST. In the absence of the appropriate LD barrier, P1 can delay the invalidation, and thus read the stale value.
NSHLD guarantees ordering of only LD-LD and LD-ST instructions occurring on the PE executing the barrier, and the scope of the ordering is limited to that PE. From the test you showed, it cannot be concluded that NSHLD failed; it seems that NSHLD did not involve itself with the invalidation of [X3], which originated outside of this PE. Tests on actual hardware can perhaps show the expected behaviour, if it occurs frequently enough.
Vijay G said: My team would like to ensure that a given PE does not re-order its own loads relative to each other
Why?
Also, barriers are not needed if the PE itself is the only observer of the effects of its LD/ST instructions (unless there actually are multiple observers being considered for this single PE, such as in the cases of self-modifying code, cache/TLB/page-table management, etc., or there actually are multiple PEs involved, though that doesn't seem likely since you are looking at NSH scope).
Have you looked at introducing artificial dependencies between the loads (e.g. ANDing the value returned by the first load with 0, and adding the result to the address of the second load)?
Vijay G said: What I have not been able to find, is an example of when one might prefer dmb ishld over dmb nshld, for example
The LD barriers are usually paired with ST barriers. The ST barrier decides the scope, and the observers in that scope, that wish to read from the affected stores in an orderly fashion, must employ a suitable LD barrier.