I have some questions about DMB and DSB in ARMv8.
The ARMv8 Reference Manual says: "The DMB instruction does not ensure the completion of any of the memory accesses for which it ensures relative order".
But the ARM Cortex-A Series Programmer’s Guide for ARMv8-A explains some DMB/DSB parameters, for example:
<option> | Ordered Accesses (before – after) | Shareability Domain
LD       | Load – Load, Load – Store        | Full system
Load - Load/Store: This means that the barrier requires all loads to complete before the barrier but does not require stores to complete. Both loads and stores that appear after the barrier in program order must wait for the barrier to complete.
Since Load - Load/Store means the barrier requires all loads to complete before the barrier, I think it does ensure the completion of memory accesses, so I am confused.
In ARM Cortex-A Series Programmer’s Guide for ARMv8-A doc, it also says DSB "enforces the same ordering as the Data Memory Barrier, but has the additional effect of blocking execution of any further instructions, not just loads or stores, or both, until synchronization is complete".
Since DSB can block any instructions, what's "ST" in "DSB ST" for?
I already know that DSB can safely replace DMB, but in what situation must we use DSB rather than DMB? What's the difference between DSB and DMB? An example would be great.
(1) Completion != order. For example, suppose you write to register A, then B, then C. The memory system can change this order for many reasons (cache, bus, etc.). If you place a DMB after each store, you can be sure that C will not be written before B and A. But you cannot be sure about the "when".
(2)/(3) If the code after the store depends on the effect of the store, you need DSB. For example, if you write to some peripheral and want to be sure it will not generate interrupts before you enable them in the interrupt controller.
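To make (1) and (2)/(3) concrete, here is a minimal C sketch. The macro names DMB_ST and DSB_SY, the function names and the register pointers are all hypothetical. On AArch64 the macros expand to `dmb st` and `dsb sy`; the other branch is only a stand-in so the sketch compiles elsewhere, and it does not have the same semantics.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
/* GCC/Clang inline assembly; "memory" also stops compiler reordering. */
#define DMB_ST() __asm__ volatile("dmb st" ::: "memory")
#define DSB_SY() __asm__ volatile("dsb sy" ::: "memory")
#else
/* Stand-in for non-Arm builds only; NOT equivalent to DMB/DSB. */
#include <stdatomic.h>
#define DMB_ST() atomic_thread_fence(memory_order_seq_cst)
#define DSB_SY() atomic_thread_fence(memory_order_seq_cst)
#endif

/* (1) Ordering only: C is not written before B, and B not before A.
 * Nothing here says *when* any of the three stores completes. */
static void write_in_order(volatile uint32_t *a,
                           volatile uint32_t *b,
                           volatile uint32_t *c)
{
    assert(a && b && c);
    *a = 1;
    DMB_ST();
    *b = 2;
    DMB_ST();
    *c = 3;
}

/* (2)/(3) The code after the store depends on the store's effect:
 * silence the (hypothetical) peripheral, wait for that store to
 * complete, and only then enable the interrupt in the (hypothetical)
 * interrupt controller. */
static void quiesce_then_enable(volatile uint32_t *dev_irq_ctrl,
                                volatile uint32_t *intc_enable)
{
    assert(dev_irq_ctrl && intc_enable);
    *dev_irq_ctrl = 0;  /* peripheral: stop raising interrupts */
    DSB_SY();           /* no later instruction runs until this completes */
    *intc_enable = 1;   /* interrupt controller: enable the line */
}
```

The DSB matters most when the following step is not itself a load or store (for instance an enable done through a system register), because DMB only orders memory accesses against each other.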
Hi 42Bastian Schick :
Many thanks for the examples!
There are some other questions that confuse me.
(1) According to the ARMv8 Reference Manual, DMB seems to ensure that all affected memory accesses are Observed-by each PE.
But I don't understand the difference between visibility and completion. Can you give me another example that shows a situation in which a memory access is visible but has not yet completed?
(2) In "DSB ST", does that mean the DSB instruction only blocks loads and stores, not all instructions?
I think the concern with (1) can be removed by replacing "complete" with "order" when reading the statements of the Cortex-A TRM wrt dmb.
digital_kevin said: but I don't understand the difference between visibility and completion
The Observed-by relation can only be established between memory operations from different PEs. AFAIU, when a PE reads from its own earlier store waiting in its store buffer, the Observed-by relation is not applied, even though the store has become visible before it became visible to other PEs.
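A toy C model of that store-buffer case may help. Everything in it (the types, the function names, the single shared location) is hypothetical and only illustrates the idea; it is not how any Arm core is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPE 2

/* One-entry store buffer per PE, in front of a single shared location. */
typedef struct {
    bool     pending;  /* a store is still waiting in the buffer */
    uint32_t value;
} store_buffer;

typedef struct {
    uint32_t     shared_mem;
    store_buffer sb[NPE];
} toy_system;

/* A store first lands in the issuing PE's own store buffer. */
static void pe_store(toy_system *s, int pe, uint32_t v)
{
    assert(pe >= 0 && pe < NPE);
    s->sb[pe].pending = true;
    s->sb[pe].value = v;
}

/* A load forwards from the PE's own buffer if it holds a pending store;
 * otherwise it reads the shared location. */
static uint32_t pe_load(const toy_system *s, int pe)
{
    assert(pe >= 0 && pe < NPE);
    return s->sb[pe].pending ? s->sb[pe].value : s->shared_mem;
}

/* Draining the buffer is what makes the store visible to other PEs. */
static void pe_drain(toy_system *s, int pe)
{
    assert(pe >= 0 && pe < NPE);
    if (s->sb[pe].pending) {
        s->shared_mem = s->sb[pe].value;
        s->sb[pe].pending = false;
    }
}
```

In this model, after pe_store(&s, 0, 100), PE0 loads 100 while PE1 still loads the old value until pe_drain(&s, 0) runs.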
In a single inner shareability domain, (where the hardware is responsible for maintaining coherency so that the caches are transparent), a store which writes into the cache becomes visible to other PEs in that same domain instantaneously, and is also complete (for that shareability domain).
But if another inner shareability (or just an outer) domain exists, which is peer to the first one above, and the store, for some reason, is needed by this second shareability domain, then the store may not be complete.
Such a situation can arise in a system with multiple cores where each core (PE) implements two threads, and the pair of threads share the store buffer. Here, the store by PE0.T0, which is present in the store buffer, is visible to PE0.T1, but is not visible to any other PE. If such a system were to be made Arm compliant, it becomes a necessity (at the least) that each core/PE be its own inner shareable domain. If two cores are placed in the same inner shareability domain, then it breaks the other-multi-copy atomic behaviour that the Arm architecture depends upon.
The difference between visibility and completion can be understood in a hypothetical system, where each CPU has its own, duplicate copy of the entire memory, and all these copies are connected by an interconnection to allow propagation of updates.
Below is such a system:
  CPU0   ......  CPU1   ......  CPUN
    |              |              |
    v              v              v
   CB0            CB1            CBN
+------+       +------+       +------+
| MEM0 |       | MEM1 | ..... | MEMN |
+---+--+       +---+--+       +---+--+
   IB0            IB1            IBN
    ^              ^              ^
    |              |              |
    |              v              |
    |       +------------+        |
    +------>| IxConnect  |<-------+
Each MEMx starts out with exactly the same contents, all zeroes, for instance. Each MEMx has one buffer CBx to receive a single request at a time from its CPUx, and one buffer IBx to receive a single request at a time from the IxConnect (i.e. from other MEMs).
Suppose CPU0 decides to store the value A0=100 into address A. The write request W1 = (A, A0) reaches CB0, and MEM0 performs it, emptying CB0. Let this time be T1.
But the write is not complete until the other MEMx copies are updated too.
A store W1 on address A by CPU0 is considered complete with respect to another CPUx, if a load to address A by CPUx reads-from W1 or from another store (to the same location A) which is subsequent to W1 in the total order of all memory operations. A store is complete when it completes with respect to all CPUs.
If CPU1 were to load A, it would read 0, which is from a store before W1. By definition, W1 is not complete wrt CPU1 and hence is not complete.
MEM0 sees that a store needs to be propagated to the other MEMs. It sends W1 across the IxConnect to other MEMs. Assume that the path to other MEMs is unpredictable - W1 propagates to the other MEMs with different delays. It may become visible to CPU1 at time T2, and may not become visible to CPU2 until time T100.
Between T2 and T100, W1 is not complete, but it is already visible to some CPUs other than CPU0 (here, CPU1).
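The timeline above can be sketched as a small C simulation. The type and function names are hypothetical, the times T1 = 1, T2 = 2 and T100 = 100 are taken as plain integers, and the model tracks only the single write W1, so it ignores the "another store subsequent to W1" clause of the definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU 3

/* One write to address A, plus the time at which it reaches each MEMx. */
typedef struct {
    int value_before;   /* contents of A before W1 (0 in the example) */
    int value_after;    /* value stored by W1 (100 in the example)    */
    int arrival[NCPU];  /* arrival[x] = time W1 is applied to MEMx    */
} write_sim;

/* What CPUx would read from A if it loaded at time t. */
static int load_at(const write_sim *w, size_t cpu, int t)
{
    assert(cpu < NCPU);
    return t >= w->arrival[cpu] ? w->value_after : w->value_before;
}

/* W1 is visible to CPUx at time t if a load by CPUx would read from W1. */
static bool visible_to(const write_sim *w, size_t cpu, int t)
{
    return load_at(w, cpu, t) == w->value_after;
}

/* W1 is complete at time t once it is complete with respect to every CPU. */
static bool complete_at(const write_sim *w, int t)
{
    for (size_t cpu = 0; cpu < NCPU; cpu++) {
        if (!visible_to(w, cpu, t)) {
            return false;
        }
    }
    return true;
}
```

With `write_sim w1 = { 0, 100, { 1, 2, 100 } };`, a load by CPU1 at time 50 would read 100 while a load by CPU2 would still read 0, so visible_to(&w1, 1, 50) holds but complete_at(&w1, 50) does not. No CPU actually has to perform a load; the functions only answer what a load would return.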
According to your example, only if W1 is visible to all the other PEs, can W1 be complete. Is that right?
Do you mean that every PE /needs/ to load A, before W1 can be considered complete? Then, no. The definition of completion is not as strict (it says '/if/ a load to address...')
If the potential for reading the original (common) value (that existed before W1 clobbered A) does not exist, then W1 is complete. One of the PEs may not load A during the entire execution of the program. But if it /were/ to load A, whether it received the value from W1, or the original, would determine if W1 ever completed.
Edit: Definition 3.3
Late Edit2: The notion of visibility, that was implicitly assumed (and I failed to expose) in my descriptions, is what Arm describes as "A write W1 from an Observer is Observed-by a read R2 from a different Observer if and only if R2 Reads-from W1".
Accordingly, their definition of Completion of a Write includes a statement: "Any read to the same Location by an Observer within the shareability domain will either Reads-from W1, or Reads-from a write that is Coherence-after W1".
Here, "PE0.W1 Observed-by PE1.R2" == "PE1.R2 reads-from PE0.W1".
Any condition which requires a Coherence-after relation between the operations cannot be described by the example system, because its memory can contradict itself about the order of operations at a single location.
- the clause about a load reading from a write subsequent to W1,
- the additional condition on the write completion, that "Any write to the same Location by an Observer within the shareability domain will be Coherence-after W1.",
- the notion of visibility of a RW1 from a PE by a W2 from another PE,
these cannot be described by the example system.
The utility of the example system does not extend beyond showing that writes need propagation before they can complete, and that they might become visible to loads of other PEs at different intervals.
I have got the point. Thanks! It really helps a lot.