dsb and dmb

Hi all:

I have some questions about DMB and DSB in Armv8.

(1)

The Armv8 Architecture Reference Manual says: "The DMB instruction does not ensure the completion of any of the memory accesses for which it ensures relative order".

But the ARM Cortex-A Series Programmer’s Guide for ARMv8-A explains the DMB/DSB option parameters, for example:

<option> | Ordered Accesses (before – after) | Shareability Domain
LD       | Load – Load, Load – Store         | Full system

Load - Load/Store:
This means that the barrier requires all loads to complete before the barrier but
does not require stores to complete. Both loads and stores that appear after the
barrier in program order must wait for the barrier to complete.

Since Load - Load/Store means the barrier requires all loads to complete before the barrier, it seems to ensure the completion of those memory accesses, which appears to contradict the Reference Manual statement above. So I am confused.

(2)

The ARM Cortex-A Series Programmer’s Guide for ARMv8-A also says that DSB "enforces the same ordering as the Data Memory Barrier, but has the additional effect of blocking execution of any further instructions, not just loads or stores, or both, until synchronization is complete".

Since DSB blocks all further instructions, what is the "ST" in "DSB ST" for?

(3)

I already know that DSB can safely replace DMB, but in what situations must we use DSB rather than DMB? What is the difference between DSB and DMB? An example would be great.

Thanks!

  • I think the concern in (1) goes away if you read "complete" as "order" in the Cortex-A Programmer's Guide statements about DMB.


    but I don't understand the difference between visibility and completion

    The Observed-by relation can only be established between memory operations from different PEs. AFAIU, when a PE reads from its own earlier store that is still waiting in its store buffer, the Observed-by relation does not apply, even though the store has become visible to that PE before it became visible to other PEs.

    In a single inner shareability domain (where the hardware is responsible for maintaining coherency so that the caches are transparent), a store that writes into the cache becomes visible to the other PEs in that same domain instantaneously, and is also complete (for that shareability domain).

    But if another inner shareability domain (or just an outer one) exists as a peer of the first, and the store is for some reason needed by this second domain, then the store may not yet be complete.

    Such a situation can arise in a system with multiple cores where each core (PE) implements two threads, and the pair of threads share the store buffer. Here, the store by PE0.T0, which is present in the store buffer, is visible to PE0.T1, but is not visible to any other PE. If such a system were to be made Arm-compliant, each core/PE would have to be (at the least) its own inner shareable domain. Placing two cores in the same inner shareability domain would break the other-multi-copy atomic behaviour that the Arm architecture depends upon.


    The difference between visibility and completion can be understood in a hypothetical system where each CPU has its own duplicate copy of the entire memory, and all these copies are connected by an interconnect that propagates updates.

    Below is such a system:

                         CPU0    ......   CPU1 ......   CPUN
                          |
                          v
                        |   |-->-+
                        +---+    |
                        |   |-->-+
                        +---+    |
                        |   |-->-+
                        +---+    |
                                 |
                                 |        |              |
                                 v        v              v
                                CB0      CB1            CBN
                             +------+  +------+       +------+
                             | MEM0 |  | MEM1 | ..... | MEMN |
                             +---+--+  +---+--+       +---+--+
                                IB0       IB1            IBN
                                 ^         ^              ^
                                 |         |              |
                                 |         v              |
                                 |      +------------+    |
                                 +----->|  IxConnect |<---+
                                        +------------+
    

    Each MEMx starts out with exactly the same contents, all zeroes, for instance. Each MEMx has one buffer CBx to receive a single request at a time from its CPUx, and one buffer IBx to receive a single request at a time from the IxConnect (i.e. from other MEMs).

    Suppose CPU0 decides to store the value A0=100 into address A. The write request W1 = (A, A0) reaches CB0, and MEM0 performs it, emptying CB0. Let this time be T1.

    But the write is not complete until the copies held in the other MEMs are updated too.

    A store W1 on address A by CPU0 is considered complete with respect to another CPUx, if a load to address A by CPUx reads-from W1 or from another store (to the same location A) which is subsequent to W1 in the total order of all memory operations. A store is complete when it completes with respect to all CPUs.

    If CPU1 were to load A, it would read 0, which is from a store before W1. By definition, W1 is not complete wrt CPU1 and hence is not complete.
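The definition above can be written as a small predicate. This is only a sketch under assumptions that are not in the post: each store to A is identified by its position in the total order (0 for the initial contents, 1 for W1), and `reads_from` is a hypothetical table recording which store a load of A by each CPU would currently read from.

```python
# Hypothetical model: a store to address A is identified by its position in
# the total order of stores to A. These names are illustrative assumptions.
INITIAL = 0  # the store that produced the initial contents (all zeroes)
W1 = 1       # CPU0's store of 100 to A


def complete_wrt(store_pos, read_from_pos):
    """True if a load that reads from `read_from_pos` shows the store at
    `store_pos` to be complete w.r.t. the loading CPU: the load reads from
    that store itself or from a later store to the same address."""
    return read_from_pos >= store_pos


def complete(store_pos, reads_from):
    """A store is complete when it is complete w.r.t. every CPU."""
    return all(complete_wrt(store_pos, pos) for pos in reads_from.values())


# Which store a load of A by each CPU would read from right now.
reads_from = {"CPU0": W1, "CPU1": INITIAL, "CPU2": INITIAL}

print(complete_wrt(W1, reads_from["CPU1"]))  # False: CPU1 still reads 0
print(complete(W1, reads_from))              # False: not complete overall
```

Once every entry in `reads_from` is at position 1 or later, `complete(W1, reads_from)` becomes true.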

    MEM0 sees that a store needs to be propagated to the other MEMs, so it sends W1 across the IxConnect to them. Assume that the path to the other MEMs is unpredictable, so W1 reaches them with different delays. It may become visible to CPU1 at time T2, and may not become visible to CPU2 until time T100.

    Between T2 and T100, W1 is not complete, but it is already visible to some CPUs other than CPU0.
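The timeline above can be replayed with a toy simulation. It is a rough sketch, not a model of any real Arm implementation: the class `ReplicatedMemory`, its methods, and the per-target `delays` table are all made up for illustration, the single-entry CBx/IBx buffers are not modelled, and because there is only one store, "reads from W1 or a later store" is simplified to "the load returns W1's value".

```python
class ReplicatedMemory:
    """Toy model of the system above: one private copy of memory per CPU,
    with updates propagated over an interconnect after some delay."""

    def __init__(self, n_cpus):
        self.mem = [{} for _ in range(n_cpus)]  # a missing address reads as 0
        self.in_flight = []  # (arrival_time, target_cpu, addr, value)

    def store(self, cpu, addr, value, now, delays):
        """Perform the store in the local copy, then queue it for the others.
        `delays` maps target CPU -> propagation delay (an assumption)."""
        self.mem[cpu][addr] = value
        for target, delay in delays.items():
            self.in_flight.append((now + delay, target, addr, value))

    def advance(self, now):
        """Apply every queued update that has arrived by time `now`."""
        still_pending = []
        for update in self.in_flight:
            arrival, target, addr, value = update
            if arrival <= now:
                self.mem[target][addr] = value
            else:
                still_pending.append(update)
        self.in_flight = still_pending

    def load(self, cpu, addr):
        return self.mem[cpu].get(addr, 0)

    def visible_to(self, addr, value):
        """CPUs whose load of `addr` would return `value`."""
        return [c for c in range(len(self.mem)) if self.load(c, addr) == value]

    def is_complete(self, addr, value):
        """Complete means visible to every CPU in the system."""
        return len(self.visible_to(addr, value)) == len(self.mem)


system = ReplicatedMemory(n_cpus=3)
# W1 = (A, 100) is performed by MEM0 at T1, reaches MEM1 at T2, MEM2 at T100.
system.store(cpu=0, addr="A", value=100, now=1, delays={1: 1, 2: 99})

system.advance(now=2)
print(system.visible_to("A", 100), system.is_complete("A", 100))
# [0, 1] False

system.advance(now=100)
print(system.visible_to("A", 100), system.is_complete("A", 100))
# [0, 1, 2] True
```

At T2 the store is visible to CPU0 and CPU1 but not complete. Only at T100, when the last copy is updated, do visibility to all CPUs and completion coincide.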
