This blog post is co-authored by Wathsala Vithanage and Ola Liljedahl.
Barriers, or fences, are sometimes seen as a cure-all for event ordering problems in concurrent programs. The thinking goes: if operations might be reordered, place a barrier between them, and everything will be forced into the “right” sequence. On strong memory models, this intuition often seems to hold, which makes the assumption all the more appealing.
Under an abstract, relaxed memory model like C11’s, things are more subtle. A barrier establishes specific ordering relationships; however, it cannot create a total order where only a partial order exists.
To see why this matters, we need to step back to memory models themselves. At the heart of concurrent programming lies the memory model, which defines the rules for how threads can observe and reorder reads and writes to shared variables. Strong memory models, such as that of x86-64, restrict reordering, making correctness easier to reason about.
Relaxed memory models, in contrast, grant hardware more freedom to reorder instructions. This boosts performance but also makes correctness more elusive. The C11 standard embraces this relaxed approach. It relies on atomics, fences, and ordering constraints to give programmers the tools to define synchronization.
Two key ordering concepts are acquire/release semantics and sequential consistency. Acquire and release operations establish partial orderings: when an acquire load in one thread reads from a release store in another, the writes that preceded the release become visible to the acquiring thread. Sequential consistency goes further. It enforces a single global total order, but at the cost of performance.
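As a generic illustration (not taken from the ring code), the classic C11 message-passing idiom shows how a release/acquire pair creates that ordering; the names data and flag are ours:

#include <stdatomic.h>

atomic_int data = 0;
atomic_int flag = 0;

/* Thread 1: write the payload, then publish it with a release store. */
void producer(void)
{
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Thread 2: if the acquire load reads 1, it synchronizes with the release
 * store above, so the earlier write to data is guaranteed to be visible. */
int consumer(void)
{
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        return atomic_load_explicit(&data, memory_order_relaxed); /* reads 42 */
    return -1; /* flag not yet published */
}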
A pitfall arises when developers lean on acquire/release fences, assuming that these fences provide stronger guarantees than they do. For example, in Figure 0, one might expect that if Thread 2 observes an update to variable A, followed by an acquire fence and then a load-acquire of B, it must also observe the same coherence order of B that Thread 1 observed, thanks to the strong ordering guarantees in Thread 1.
This is not the case: Thread 2 can observe a value of 0 for B and 1 for A (refer to Tables 3, 4, and 5 in Appendix A for details on the notation used in Figure 0).
Figure 0: Can X be 1 and Y be 0 in Thread 2?
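Since the scenario is easier to discuss with code in hand, here is a C11 sketch of the Figure 0 pattern as we read it; variable and function names are ours, and A and B stand for the two shared variables:

#include <stdatomic.h>

atomic_int A = 0;
atomic_int B = 0;

/* Thread 1: release-store B, then (after an acquire fence) reload B and
 * finally update A. Intuitively this sequence looks strongly ordered. */
void thread1(void)
{
    atomic_store_explicit(&B, 1, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    (void)atomic_load_explicit(&B, memory_order_acquire); /* may be served locally */
    atomic_store_explicit(&A, 1, memory_order_relaxed);
}

/* Thread 2: X = A (relaxed), acquire fence, Y = B (acquire).
 * The surprising outcome X == 1 && Y == 0 is allowed. */
void thread2(int *X, int *Y)
{
    *X = atomic_load_explicit(&A, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    *Y = atomic_load_explicit(&B, memory_order_acquire);
}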
This subtle but critical observation became an issue in the DPDK ring buffer. The synchronization logic assumed that barriers would enforce a total order between two consumer operations. In practice, the program established a partial order, leaving room for unsafe orderings.
On strong memory models like x86-64, these issues remain hidden. Even on some relaxed architectures, like AArch64 with RCsc semantics, the ordering is strong enough to mask them. But once we weaken the memory model further, for example, on AArch64 with RCpc semantics, the illusion collapses.
Using Herd7 litmus tests, we could reproduce unsafe orderings. These showed that barriers do not “fail” in themselves. Instead they fail to provide the stronger guarantees that were mistakenly assumed.
We were alerted to a ring data-corruption issue that appeared only with RCpc instructions. To diagnose it, we built targeted litmus tests and microbenchmarks. These probed the ring’s memory-ordering assumptions and observed when partial-ordering effects surfaced. The following sections document the investigation end to end. They cover root cause analysis and several proposed solutions with differing performance characteristics.
Building the ring library with -march=armv8.2-a+lse+rcpc (allowing the compiler to emit LDAPR, the RCpc acquire, instead of LDAR, the RCsc acquire) caused the test program to report free-slot and available-element counts larger than the ring’s capacity. After instrumenting the library to log producer/consumer heads and tails, we captured the following trace:
T0: writerThread1 — enqueue
    observed={producer_head=392469, consumer_tail=392469}
    updated ={producer_head=392470, producer_tail=392470}
T1: writerThread1 — dequeue
    observed={consumer_head=392469, producer_tail=392470}
    updated ={consumer_head=392470, consumer_tail=392470}
T2: writerThread2 — dequeue <— UNDERFLOW
    observed={consumer_head=392470, producer_tail=392469}
    updated ={consumer_head=392471, producer_tail=392471}
Figure 1: Ring head and tail observation trace.
These inflated counts arise from arithmetic underflow. The ring computes available as producer_tail - consumer_head and free as capacity + consumer_tail - producer_head. If producer_tail < consumer_head or producer_head < consumer_tail, the unsigned subtraction underflows. At T2, producer_tail=392469 and consumer_head=392470, so producer_tail < consumer_head and the available calculation underflows.
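A minimal sketch of that arithmetic (our own helper names, not the DPDK functions) shows how the unsigned subtraction wraps around:

#include <stdint.h>

/* Indices are unsigned 32-bit counters, so a "negative" difference wraps
 * around to a huge positive value instead. */
static inline uint32_t ring_available(uint32_t producer_tail, uint32_t consumer_head)
{
    return producer_tail - consumer_head;
}

static inline uint32_t ring_free(uint32_t capacity, uint32_t consumer_tail,
        uint32_t producer_head)
{
    return capacity + consumer_tail - producer_head;
}

/* With the T2 snapshot from Figure 1:
 * ring_available(392469, 392470) == 4294967295 (0xFFFFFFFF),
 * far larger than any real ring capacity. */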
The preceding steps show how we got there. At T0 the ring was empty and an enqueue advanced both producer indices. At T1 the same thread dequeued the sole element and advanced the consumer indices. By T2, a second thread on another CPU observed the updated consumer_head but a stale producer_tail, producing an inconsistent snapshot.
This pointed to the weaker ordering of the RCpc acquire (LDAPR) creating only a partial order. The order was strong enough to pass on some architectures but insufficient here. We turned to Herd7 to investigate the behavior further.
Herd7 is a memory model simulator. It supports a wide range of models, including x86, AArch64, C11, and many others. It allows researchers and developers to describe small concurrent programs. They can then explore all possible executions permitted by a given memory model. However, Herd7 does not support loops or unbounded program structure. This means that realistic algorithms must be reduced to small, representative litmus tests that capture the essential ordering behavior.
This reduction keeps the exploration tractable. It also means that Herd7 cannot directly validate full implementations; it can only validate the critical synchronization patterns at their core. With that in mind, we reduced the ring dequeue and enqueue actions into the following AArch64 assembly sequence, aiming to reproduce the issue described above.
The key is that one processor first acts as producer and then as consumer. The other processor observes consumer_head and producer_tail as if it is about to dequeue.
Table 1 shows the ring library's resulting litmus code for the corresponding AArch64 assembly using the RCpc load-acquire instruction (LDAPR).
Read/Write to underlying array to Remove/Add elements (not required in the litmus test)
Table 1: Reduction of C implementation to Herd7 AArch64 litmus code.
In Table 1, the reduction deliberately omits conditions and loops. In Herd7, these are captured in the proposition we evaluate (the exists clause), not in the program text itself. The access pattern from Figure 1 is encoded for Herd7 in Figure 2(a). In this encoding, P0 executes a producer-then-consumer sequence, while P1 is a newly arrived consumer that observes consumer_head and producer_tail.
The code begins by declaring the initial index values (lines 2–4). We use ph/ch for producer_head/consumer_head, and pt/ct for producer_tail/consumer_tail. The P0 column lists the assembly for the producer and consumer paths, according to the reduction in Table 1.
The exists section (lines 27–30) checks whether an execution allowed by the model satisfies our conditions. It also encodes the input parameters (ph, pt, ch, ct) and asserts that the CAS succeeds when intended. As loops are not supported, rte_wait_until_equal_32 is reduced to a relaxed load. The loop’s terminal condition (tail == old_head) is expressed in the proposition.
Finally, the proposition verifies whether the underflow scenario occurs in P1 by requiring pt = 0 and ch = 1, which is the same as the condition producer_tail < consumer_head.
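A small sketch of that reduction (our own illustration, not the litmus source; ht, old_val, old_head, and tail are stand-ins for the ring's fields and locals):

/* Ring library: spin until the opposing tail catches up to old_val. */
rte_wait_until_equal_32((uint32_t *)(uintptr_t)&ht->tail, old_val,
        rte_memory_order_relaxed);

/* Herd7 reduction: the loop collapses to one relaxed load, and its exit
 * condition (tail == old_head) is asserted in the exists proposition. */
tail = rte_atomic_load_explicit(&ht->tail, rte_memory_order_relaxed);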
Figure 2(b) shows the corresponding output. The coherence graph it generated is shown in Figure 3. According to the output, four of the sixteen possible orderings under the AArch64 RCpc memory model satisfy the proposition.
In Figure 3, Thread 1 reads the consumer_head updated by Thread 0, as indicated by the rf (read-from) edge. However, the value it reads for producer_tail is 0. That value is stale: Thread 0 had already updated producer_tail to 1, and had itself observed that update, before it updated consumer_head, yet Thread 1 sees only the consumer_head update.
This is the same partial order seen in the actual ring test described in Figure 1. With these results, we were confident that our reduction matched the actual implementation of the DPDK ring dequeue and enqueue operations. This led us to repeat the litmus test but with RCsc (LDAR) instead of RCpc (LDAPR) in order to verify that the original instruction sequence works.
Figure 2: Herd7 litmus test for AArch64 RCpc (LDAPR).
Figure 3: Herd7 RCpc coherence graph.
Figure 4(a) repeats the AArch64 litmus test but replaces LDAPR with LDAR. As shown in Figure 4(b), the unsafe outcome disappears. The proposition never turns positive. As a result, the issue seen with LDAPR does not arise here. This matches our original observation that the DPDK ring library does not miscompute free slots or available entries when compiled with RCsc acquire.
(b) Herd7 output summary: the condition was never satisfied.
Figure 4: Herd7 litmus test for AArch64 RCsc (LDAR).
On AArch64, both LDAR and LDAPR form a synchronizes-with relation when the acquire load reads from a matching STLR to the same location. The crucial difference is in local execution: LDAR (RCsc) cannot be reordered above a preceding STLR to any address, so if it reads from such a preceding store-release, that store is already globally visible.
In contrast, LDAPR (RCpc) may execute before a preceding STLR completes. It can satisfy the acquire by forwarding the value from the core’s own store buffer. Consequently, the load can observe a value that no other core has seen. The earlier STLR may not become globally visible until after later non-release stores from the same core. This behavior explains the partial order we saw.
At this point it is natural to ask why we care about both LDAR and LDAPR, and why the original DPDK report contrasted their behavior. The nuance comes from history. Arm first exposed acquire/release with the LDAR/STLR pair, which implements RCsc. For well-synchronized programs, RCsc behaves like sequential consistency, per Gharachorloo. As a result, early compilers used LDAR for both memory_order_acquire and memory_order_seq_cst.
With Armv8.3-A, LDAPR was introduced as an RCpc acquire. It relaxed cross-address ordering constraints while preserving acquire semantics. In practice, code generation depends on the target and toolchain settings. Targets that lack RCpc, as well as older or default builds (without +rcpc), map acquire loads to LDAR. In contrast, newer GCC/LLVM targets emit LDAPR when RCpc is enabled via the appropriate -march/-mcpu flags.
This is why C11 atomics with memory_order_acquire may compile to either LDAR or LDAPR. In current AArch64 toolchains, when FEAT_LRCPC is available, LDAR backs memory_order_seq_cst while LDAPR backs memory_order_acquire.
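As a small illustration (our own snippet, not DPDK code), the same C11 acquire load can lower to either instruction depending on the build:

#include <stdatomic.h>
#include <stdint.h>

/* On AArch64 this compiles to LDAR by default, or to LDAPR when FEAT_LRCPC
 * is enabled (for example, the -march=armv8.2-a+lse+rcpc build used above). */
uint32_t load_tail_acquire(_Atomic uint32_t *tail)
{
    return atomic_load_explicit(tail, memory_order_acquire);
}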
With the LDAR–LDAPR distinction clarified, the next question is: What does C11 require? Which behavior, LDAPR (RCpc acquire) or LDAR (RCsc acquire), matches the guarantees of memory_order_acquire under the C11 memory model? Does AArch64’s mapping honor that?
To answer this, we reduced the relevant portion of the DPDK ring to a C11 Herd7 program, as shown in Table 2. We then crafted a litmus test to exercise the same access pattern used in the AArch64 litmus tests. This allowed us to compare the C11 outcomes directly with the LDAPR and LDAR cases.
Table 2: Reduction of C implementation to Herd7 C11 litmus code.
Figure 5(a) shows the C11 litmus test using the reduction shown in Table 2. As Herd7 disallows loops, and explicit conditionals tend to inflate the search space, we encode the relevant checks and termination conditions in the exists proposition. This keeps the litmus minimal while preserving the ordering constraints that matter, yielding a tractable exploration of the executions.
Figure 5: Herd7 - C11 litmus test.
Figure 6: Herd7 C11 coherence graph.
Figure 5(b) shows that, of the four executions allowed by the C11 model, the proposition is satisfied in exactly one. This matches the RCpc-based Herd7 litmus test. It shows that AArch64’s use of LDAPR for memory_order_acquire is consistent with C11. Moreover, the C11 coherence graph in Figure 6 includes a reads-from (rf) edge from the release store to the acquire load of tail. This mirrors the LDAPR case and further reinforces this conclusion.
With the groundwork in place, we can now ask what went wrong in the DPDK ring’s C11 implementation. Using Figures 3 and 6 as a guide, we isolate the critical memory-access pattern and its effects. Since LDAPR (unlike LDAR) can satisfy an acquire by forwarding from the issuing core’s store buffer, the first place to look is Thread 0’s sequence around the release/acquire pair. The next place to check is Thread 1’s observations.
In the C11 coherence graph of Figure 6, the region of interest for Thread 0 reduces to the ordered sequence of loads and stores shown in Figure 7(a). Figure 7(b) then captures Thread 1's two ordered accesses: first a relaxed load observing the updated consumer_head, then, after an acquire fence, an acquire load of producer_tail.
In Figure 7(a), the acquire fence ties Thread 0’s early actions to its later acquire of producer_tail. It orders the relaxed-load of consumer_head and the release-store to producer_tail before the acquire load of producer_tail. That acquire, in turn, orders the subsequent relaxed CAS on consumer_head. The CAS cannot be observed before the acquire completes. Aside from a possible local reordering between the relaxed load of consumer_head and the release to producer_tail, Thread 0 otherwise preserves program order. These two operations are mutually relaxed.
In short, the write to consumer_head is the last store performed by Thread 0. In Figure 7(b), the acquire fence orders Thread 1’s relaxed load of consumer_head before its acquire-load of producer_tail.
W(Rel)[prod_tail] ↓ R(Rlx)[cons_head] ↓ F(Acq) ↓ R(Acq)[prod_tail] ↓ RMW(Rlx)[cons_head]
R(Rlx)[cons_head] ↓ F(Acq) ↓ R(Acq)[prod_tail]
Figure 7: Ordered loads and stores in each thread.
You may notice that Figure 7 mirrors Figure 0. The only difference is an additional edge linking Thread 0’s CAS to Thread 1’s relaxed load of consumer_head. Figure 0 asks, “Can X be 1 and Y be 0 in Thread 2?”. The answer is yes. Both our AArch64 (RCpc/LDAPR) and C11 litmus tests yield only a partial order. In particular, the acquire fence does not prevent a core from satisfying reads out of its own store buffer. As a result, it cannot rule out the unsafe interleaving we saw. For completeness, we also simulated the Figure 0 scenario as a C11 Herd7 litmus, as outlined in the introduction. The program, outcomes, and coherence graph in Figure 8 confirm the thesis.
Load-relaxed(cons_head); fence-acquire has the same semantics as load-acquire(cons_head). However, a load-acquire must read from an earlier store-release to the same variable; only then does it create a synchronizes-with edge that establishes the happens-before relationship ordering the surrounding memory accesses. But there is no store-release to cons_head, nor the equivalent fence-release plus store-relaxed. This is a classic mistake.
Figure 8: No total ordering of events in the case presented in the Introduction
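In code, the pattern looks roughly like the following sketch (our own simplified rendering of the ring's move-head path, not the actual DPDK source):

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint32_t cons_head; /* only ever updated with a relaxed CAS */
_Atomic uint32_t prod_tail; /* updated with a store-release */

void observe(uint32_t *head, uint32_t *tail)
{
    /* Relaxed load plus acquire fence: locally equivalent to
     * load-acquire(cons_head)... */
    *head = atomic_load_explicit(&cons_head, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);

    /* ...but because cons_head is never written with a store-release (or a
     * release fence plus relaxed store), no synchronizes-with edge exists,
     * and this acquire load may still return a stale prod_tail. */
    *tail = atomic_load_explicit(&prod_tail, memory_order_acquire);
}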
Returning to the practical question in the DPDK ring: how do we fix it? As the bug arises from an unsafe partial order, the remedies fall into three broad strategies:
- Make all synchronizing operations sequentially consistent, forcing a single total order.
- Establish a synchronizes-with edge between the threads, by turning the CAS on the head into a release operation and the opposing read of the head into an acquire.
- Handle unsynchronized reads of ring buffer metadata, by detecting and discarding snapshots that are semantically invalid.
The next three subsections explore these options in detail. The right fix for the DPDK ring depends on the performance and latency trade-offs of each approach. Appendix B lists DPDK code changes required for each of the potential solutions suggested here.
The simplest fix is to make all synchronizing operations sequentially consistent, including the CAS. Under SC, synchronizing operations participate in a single global total order that preserves program order of each thread, eliminating the unsafe interleaving. As a result, the explicit acquire fence can be removed.
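Condensed from the corresponding patch in Appendix B, the head and tail accesses and the CAS in __rte_ring_headtail_move_head all become seq_cst (the tail publish in __rte_ring_update_tail likewise becomes seq_cst); variable names are as in the DPDK source:

*old_head = rte_atomic_load_explicit(&d->head, rte_memory_order_seq_cst);
/* ... */
stail = rte_atomic_load_explicit(&s->tail, rte_memory_order_seq_cst);
/* ... no standalone acquire fence is needed ... */
success = rte_atomic_compare_exchange_strong_explicit(
        &d->head, old_head, *new_head,
        rte_memory_order_seq_cst, rte_memory_order_seq_cst);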
Figure 9(a) shows the C11 Herd7 litmus after this change. Figure 9(b) shows the corresponding results.
Figure 9: Total order from sequential consistency.
To eliminate the unsafe partial order, all effects before Thread 1’s CAS on A must be visible to operations after Thread 2’s read of A. Make the CAS on A a release RMW on success. Make Thread 2’s read of A an acquire.
If Thread 2’s load-acquire reads the value written by Thread 1’s release CAS, it establishes a synchronizes-with edge. Everything before the CAS in Thread 1 happens-before everything after the load in Thread 2. This yields an order that is consistent with the semantics of the ring buffer.
The acquire fence that once separated the load of the head and load-acquire of the opposing tail is now redundant and therefore removed.
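Condensed from the corresponding patch in Appendix B, the change looks as follows (variable names as in the DPDK source):

/* The first read of the head becomes an acquire... */
*old_head = rte_atomic_load_explicit(&d->head, rte_memory_order_acquire);
/* ... */
/* ...and the CAS on the head releases on success (acquire on failure, so
 * the retry re-observes with the same guarantee). */
success = rte_atomic_compare_exchange_strong_explicit(
        &d->head, old_head, *new_head,
        rte_memory_order_release, rte_memory_order_acquire);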
The Herd7 code is shown in Figure 10(a). Figure 10(b) shows the resulting output.
Figure 10: Synchronizing release RMW and acquire of A.
In the ring, the unsafe order manifests when the producer index lags the consumer index, causing the available arithmetic to underflow. We can harden against this by guarding the calculation: if producer_head < consumer_tail (equivalently, if capacity + consumer_tail - producer_head would be negative), treat the result as invalid and clamp it to 0 rather than proceeding.
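Condensed from the corresponding patch in Appendix B, the guard is a small clamp right after the entries computation:

*entries = (capacity + stail - *old_head);

/* If the snapshot was stale and the unsigned subtraction underflowed,
 * the value is meaningless; treat the ring as having no usable entries. */
if ((int)*entries < 0)
    *entries = 0;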
We embed this check in the proposition of a slightly modified C11 Herd7 litmus based on Figure 8(a). Figure 11(a) shows the code. Figure 11(b) shows that the outcome never turns positive across all allowed executions. With this guard in place, the explicit acquire fence is redundant and removed.
Figure 11: Eliminating the unsafe partial order by committing only semantically valid executions.
This issue was an eye-opener. It underscores how easy it is to overlook unsafe partial orders and introduce memory corruption through misuse of shared-memory synchronization. While the analysis is instructive, a fix is still required. As outlined in the earlier section, we now examine the data and recommend one of the three possible solutions. The recommendations are based on performance reported by a modified ring_perf_autotest test application in DPDK on Neoverse N1 (Ampere Altra) and Intel Sapphire Rapids.
For Figure 12, ring_perf_autotest was lightly modified to report enqueue/dequeue throughput for burst size 1 (one element per operation) for all cores. This reveals how throughput scales with core count. It is highest on a single core and declines as cores increase, due to growing compare-and-swap (CAS) contention.
On Arm Neoverse N1 (Figure 12(a)), across all core counts:
On Intel Sapphire Rapids (Figure 12(b)), differences across core counts are small. This is consistent with x86’s strong memory model.
Figure 12: Scalability of Three Proposed Solutions
Based on the data, “Handle Unsynchronized Reads of Ring Buffer Metadata” is the most robust option across conditions and was selected as the fix for this issue. The patch implementing this approach appears in Figure 15 (Appendix B).
The notation we used in this article to annotate memory order is borrowed from Michael L. Scott's classic Shared-Memory Synchronization. Table 3 provides a summary of the annotations for C11 memory orders we have used in this article.
WR||WR : memory_order_seq_cst
WR|| : memory_order_release
||WR : memory_order_acquire
(no annotation) : memory_order_relaxed
Some sample usage of the notation is shown in Table 4.
Release Fence: atomic_thread_fence(memory_order_release) is written as fence(WR||)
Atomic Store Release: atomic_store_explicit(a, 1, memory_order_release) is written as a.store(1, WR||)
Atomic Store Sequentially Consistent: atomic_store_explicit(a, 1, memory_order_seq_cst) is written as a.store(1, WR||WR)
Atomic Load Relaxed: x = atomic_load_explicit(a, memory_order_relaxed) is written as x = a.load()
Atomic Load Acquire: x = atomic_load_explicit(a, memory_order_acquire) is written as x = a.load(||WR)
Table 5 explains any other notations used in this article.
po: Program Order
ca: Coherence After
rf: Read From
fr: From Read
scp: SC-Precedence
dd: Data Dependency
mo: Modification Order
Code listings for the Fixing the DPDK Ring section.
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..8befb29dca 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -36,7 +36,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 		rte_wait_until_equal_32((uint32_t *)(uintptr_t)&ht->tail, old_val,
 			rte_memory_order_relaxed);
 
-	rte_atomic_store_explicit(&ht->tail, new_val, rte_memory_order_release);
+	rte_atomic_store_explicit(&ht->tail, new_val, rte_memory_order_seq_cst);
 }
 
 /**
@@ -78,19 +78,16 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 	unsigned int max = n;
 
 	*old_head = rte_atomic_load_explicit(&d->head,
-			rte_memory_order_relaxed);
+			rte_memory_order_seq_cst);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
 		stail = rte_atomic_load_explicit(&s->tail,
-					rte_memory_order_acquire);
+					rte_memory_order_seq_cst);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -115,8 +112,8 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* on failure, *old_head is updated */
 		success = rte_atomic_compare_exchange_strong_explicit(
 				&d->head, old_head, *new_head,
-				rte_memory_order_relaxed,
-				rte_memory_order_relaxed);
+				rte_memory_order_seq_cst,
+				rte_memory_order_seq_cst);
 	} while (unlikely(success == 0));
 	return n;
 }
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..d2a76ce422 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -78,14 +78,11 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 	unsigned int max = n;
 
 	*old_head = rte_atomic_load_explicit(&d->head,
-			rte_memory_order_relaxed);
+			rte_memory_order_acquire);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -115,8 +112,8 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* on failure, *old_head is updated */
 		success = rte_atomic_compare_exchange_strong_explicit(
 				&d->head, old_head, *new_head,
-				rte_memory_order_relaxed,
-				rte_memory_order_relaxed);
+				rte_memory_order_release,
+				rte_memory_order_acquire);
 	} while (unlikely(success == 0));
 	return n;
 }
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..e5ac1f6b9e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		 */
 		*entries = (capacity + stail - *old_head);
 
+		/*
+		 * Ensure the entries calculation was not based on a stale
+		 * and unsafe stail observation that causes underflow.
+		 */
+		if ((int)*entries < 0)
+			*entries = 0;
+
 		/* check that we have enough room in ring */
 		if (unlikely(n > *entries))
 			n = (behavior == RTE_RING_QUEUE_FIXED) ?