This blog post is co-authored by Wathsala Vithanage and Ola Liljedahl.
Barriers, or fences, are sometimes seen as a cure-all for event ordering problems in concurrent programs. The thinking goes: if operations might be reordered, place a barrier between them, and everything will be forced into the “right” sequence. On strong memory models, this intuition often seems to hold, which makes the assumption all the more appealing.
Under an abstract, relaxed memory model like C11’s, things are more subtle. A barrier establishes specific ordering relationships; however, it cannot create a total order where only a partial order exists.
To see why this matters, we need to step back to memory models themselves. At the heart of concurrent programming lies the memory model, which defines the rules for how threads can observe and reorder reads and writes to shared variables. Strong memory models, such as that of x86-64, restrict reordering, making correctness easier to reason about.
Relaxed memory models, in contrast, grant hardware more freedom to reorder instructions. This boosts performance but also makes correctness more elusive. The C11 standard embraces this relaxed approach. It relies on atomics, fences, and ordering constraints to give programmers the tools to define synchronization.
Two key ordering concepts are acquire/release semantics and sequential consistency. Acquire and release operations establish partial orderings: when an acquire load in one thread reads from a release store in another, the writes that preceded the release become visible to the acquiring thread. Sequential consistency goes further. It enforces a single global total order, but at the cost of performance.
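As a generic illustration (not taken from the ring code), the classic C11 message-passing idiom shows how a release/acquire pair creates that ordering; the names data and flag are ours:

#include <stdatomic.h>

atomic_int data = 0;
atomic_int flag = 0;

/* Thread 1: write the payload, then publish it with a release store. */
void producer(void)
{
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Thread 2: if the acquire load reads 1, it synchronizes with the release
 * store above, so the earlier write to data is guaranteed to be visible. */
int consumer(void)
{
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        return atomic_load_explicit(&data, memory_order_relaxed); /* reads 42 */
    return -1; /* flag not yet published */
}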
A pitfall arises when developers lean on acquire/release fences, assuming that these fences provide stronger guarantees than they do. For example, in Figure 0, one might expect that if Thread 2 observes an update to variable A, followed by an acquire fence and then a load-acquire of B, it must also observe the same coherence order of B that Thread 1 observed, thanks to the strong ordering guarantees in Thread 1.
This is not the case: Thread 2 can observe a value of 0 for B and 1 for A (refer to Tables 3, 4, and 5 in Appendix A for details on the notation used in Figure 0).
Figure 0: Can X be 1 and Y be 0 in Thread 2?
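Since the scenario is easier to discuss with code in hand, here is a C11 sketch of the Figure 0 pattern as we read it; variable and function names are ours, and A and B stand for the two shared variables:

#include <stdatomic.h>

atomic_int A = 0;
atomic_int B = 0;

/* Thread 1: release-store B, then (after an acquire fence) reload B and
 * finally update A. Intuitively this sequence looks strongly ordered. */
void thread1(void)
{
    atomic_store_explicit(&B, 1, memory_order_release);
    atomic_thread_fence(memory_order_acquire);
    (void)atomic_load_explicit(&B, memory_order_acquire); /* may be served locally */
    atomic_store_explicit(&A, 1, memory_order_relaxed);
}

/* Thread 2: X = A (relaxed), acquire fence, Y = B (acquire).
 * The surprising outcome X == 1 && Y == 0 is allowed. */
void thread2(int *X, int *Y)
{
    *X = atomic_load_explicit(&A, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    *Y = atomic_load_explicit(&B, memory_order_acquire);
}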
This subtle but critical observation became an issue in the DPDK ring buffer. The synchronization logic assumed that barriers would enforce a total order between two consumer operations. In practice, the program established a partial order, leaving room for unsafe orderings.
On strong memory models like x86-64, these issues remain hidden. Even on some relaxed architectures, like AArch64 with RCsc semantics, the ordering is strong enough to mask them. But once we weaken the memory model further, for example, on AArch64 with RCpc semantics, the illusion collapses.
Using Herd7 litmus tests, we could reproduce unsafe orderings. These showed that barriers do not “fail” in themselves. Instead they fail to provide the stronger guarantees that were mistakenly assumed.
We were alerted to a ring data-corruption issue that appeared only with RCpc instructions. To diagnose it, we built targeted litmus tests and microbenchmarks. These probed the ring’s memory-ordering assumptions and observed when partial-ordering effects surfaced. The following sections document the investigation end to end. They cover root cause analysis and several proposed solutions with differing performance characteristics.
Building the ring library with -march=armv8.2-a+lse+rcpc (allowing the compiler to emit LDAPR, the RCpc acquire, instead of LDAR, the RCsc acquire) caused the test program to report free-slot and available-element counts larger than the ring’s capacity. After instrumenting the library to log producer/consumer heads and tails, we captured the following trace:
T0: writerThread1 — enqueue
    observed={producer_head=392469, consumer_tail=392469}
    updated ={producer_head=392470, producer_tail=392470}
T1: writerThread1 — dequeue
    observed={consumer_head=392469, producer_tail=392470}
    updated ={consumer_head=392470, consumer_tail=392470}
T2: writerThread2 — dequeue <— UNDERFLOW
    observed={consumer_head=392470, producer_tail=392469}
    updated ={consumer_head=392471, producer_tail=392471}
Figure 1: Ring head and tail observation trace.
These inflated counts arise from arithmetic underflow. The ring computes available as producer_tail - consumer_head and free as capacity + consumer_tail - producer_head. If producer_tail < consumer_head or producer_head < consumer_tail, the unsigned subtraction underflows. At T2, producer_tail=392469 and consumer_head=392470, so producer_tail < consumer_head and the available calculation underflows.
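A minimal sketch of that arithmetic (our own helper names, not the DPDK functions) shows how the unsigned subtraction wraps around:

#include <stdint.h>

/* Indices are unsigned 32-bit counters, so a "negative" difference wraps
 * around to a huge positive value instead. */
static inline uint32_t ring_available(uint32_t producer_tail, uint32_t consumer_head)
{
    return producer_tail - consumer_head;
}

static inline uint32_t ring_free(uint32_t capacity, uint32_t consumer_tail,
        uint32_t producer_head)
{
    return capacity + consumer_tail - producer_head;
}

/* With the T2 snapshot from Figure 1:
 * ring_available(392469, 392470) == 4294967295 (0xFFFFFFFF),
 * far larger than any real ring capacity. */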
The preceding steps show how we got there. At T0 the ring was empty and an enqueue advanced both producer indices. At T1 the same thread dequeued the sole element and advanced the consumer indices. By T2, a second thread on another CPU observed the updated consumer_head but a stale producer_tail, producing an inconsistent snapshot.
This pointed to the weaker ordering of the RCpc acquire (LDAPR) creating only a partial order. The order was strong enough to pass on some architectures but insufficient here. We turned to Herd7 to investigate the behavior further.
Herd7 is a memory model simulator. It supports a wide range of models, including x86, AArch64, C11, and many others. It allows researchers and developers to describe small concurrent programs. They can then explore all possible executions permitted by a given memory model. However, Herd7 does not support loops or unbounded program structure. This means that realistic algorithms must be reduced to small, representative litmus tests that capture the essential ordering behavior.
This reduction keeps the exploration tractable. It also means that Herd7 cannot directly validate full implementations; it can only validate the critical synchronization patterns at their core. With that in mind, we reduced the ring dequeue and enqueue actions into the following AArch64 assembly sequence, aiming to reproduce the issue described above.
The key is that one processor first acts as producer and then as consumer. The other processor observes consumer_head and producer_tail as if it is about to dequeue.
Table 1 shows the ring library's resulting litmus code for the corresponding AArch64 assembly using the RCpc load-acquire instruction (LDAPR).
Read/Write to underlying array to Remove/Add elements (not required in the litmus test)
Table 1: Reduction of C implementation to Herd7 AArch64 litmus code.
In Table 1, the reduction deliberately omits conditions and loops. In Herd7, these are captured in the proposition we evaluate (the exists clause), not in the program text itself. The access pattern from Figure 1 is encoded for Herd7 in Figure 2(a). In this encoding, P0 executes a producer-then-consumer sequence, while P1 is a newly arrived consumer that observes consumer_head and producer_tail.
The code begins by declaring the initial index values (lines 2–4). We use ph/ch for producer_head/consumer_head, and pt/ct for producer_tail/consumer_tail. The P0 column lists the assembly for the producer and consumer paths, according to the reduction in Table 1.
The exists section (lines 27–30) checks whether an execution allowed by the model satisfies our conditions. It also encodes the input parameters (ph, pt, ch, ct) and asserts that the CAS succeeds when intended. As loops are not supported, rte_wait_until_equal_32 is reduced to a relaxed load. The loop’s terminal condition (tail == old_head) is expressed in the proposition.
Finally, the proposition verifies whether the underflow scenario occurs in P1 by requiring pt = 0 and ch = 1, which is the same as the condition producer_tail < consumer_head.
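A small sketch of that reduction (our own illustration, not the litmus source; ht, old_val, old_head, and tail are stand-ins for the ring's fields and locals):

/* Ring library: spin until the opposing tail catches up to old_val. */
rte_wait_until_equal_32((uint32_t *)(uintptr_t)&ht->tail, old_val,
        rte_memory_order_relaxed);

/* Herd7 reduction: the loop collapses to one relaxed load, and its exit
 * condition (tail == old_head) is asserted in the exists proposition. */
tail = rte_atomic_load_explicit(&ht->tail, rte_memory_order_relaxed);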
Figure 2(b) shows the corresponding output. The coherence graph it generated is shown in Figure 3. According to the output, four of the sixteen possible orderings under the AArch64 RCpc memory model satisfy the proposition.
In Figure 3, Thread 1 reads the consumer_head updated by Thread 0, as indicated by the rf (read-from) edge. However, the value it reads for producer_tail is 0. That value is stale: Thread 0 had already updated producer_tail to 1, and had itself observed that update, before it updated consumer_head, yet Thread 1 sees only the consumer_head update.
This is the same partial order seen in the actual ring test described in Figure 1. With these results, we were confident that our reduction matched the actual implementation of the DPDK ring dequeue and enqueue operations. This led us to repeat the litmus test but with RCsc (LDAR) instead of RCpc (LDAPR) in order to verify that the original instruction sequence works.
Figure 2: Herd7 litmus test for AArch64 RCpc (LDAPR).
Figure 3: Herd7 RCpc coherence graph.
Figure 4(a) repeats the AArch64 litmus test but replaces LDAPR with LDAR. As shown in Figure 4(b), the unsafe outcome disappears. The proposition never turns positive. As a result, the issue seen with LDAPR does not arise here. This matches our original observation that the DPDK ring library does not miscompute free slots or available entries when compiled with RCsc acquire.
(b) Herd7 output summary: the condition was never satisfied.
Figure 4: Herd7 litmus test for AArch64 RCsc (LDAR).
On AArch64, both LDAR and LDAPR form a synchronizes-with relation when the acquire load reads from a matching STLR to the same location. The crucial difference is in local execution: LDAR (RCsc) cannot be reordered above a preceding STLR to any address, so if it reads from such a preceding store-release, that store is already globally visible.
In contrast, LDAPR (RCpc) may execute before a preceding STLR completes. It can satisfy the acquire by forwarding the value from the core’s own store buffer. Consequently, the load can observe a value that no other core has seen. The earlier STLR may not become globally visible until after later non-release stores from the same core. This behavior explains the partial order we saw.
At this point it is natural to ask why we care about both LDAR and LDAPR, and why the original DPDK report contrasted their behavior. The nuance comes from history. Arm first exposed acquire/release with the LDAR/STLR pair, which implements RCsc. For well-synchronized programs, RCsc behaves like sequential consistency, per Gharachorloo. As a result, early compilers used LDAR for both memory_order_acquire and memory_order_seq_cst.
With Armv8.3-A, LDAPR was introduced as an RCpc acquire. It relaxed cross-address ordering constraints while preserving acquire semantics. In practice, code generation depends on the target and toolchain settings. Targets that lack RCpc, as well as older or default builds (without +rcpc), map acquire loads to LDAR. In contrast, newer GCC/LLVM targets emit LDAPR when RCpc is enabled via the appropriate -march/-mcpu flags.
This is why C11 atomics with memory_order_acquire may compile to either LDAR or LDAPR. In current AArch64 toolchains, when FEAT_LRCPC is available, LDAR backs memory_order_seq_cst while LDAPR backs memory_order_acquire.
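As a small illustration (our own snippet, not DPDK code), the same C11 acquire load can lower to either instruction depending on the build:

#include <stdatomic.h>
#include <stdint.h>

/* On AArch64 this compiles to LDAR by default, or to LDAPR when FEAT_LRCPC
 * is enabled (for example, the -march=armv8.2-a+lse+rcpc build used above). */
uint32_t load_tail_acquire(_Atomic uint32_t *tail)
{
    return atomic_load_explicit(tail, memory_order_acquire);
}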
With the LDAR–LDAPR distinction clarified, the next question is: What does C11 require? Which behavior, LDAPR (RCpc acquire) or LDAR (RCsc acquire), matches the guarantees of memory_order_acquire under the C11 memory model? Does AArch64’s mapping honor that?
To answer this, we reduced the relevant portion of the DPDK ring to a C11 Herd7 program, as shown in Table 2. We then crafted a litmus test to exercise the same access pattern used in the AArch64 litmus tests. This allowed us to compare the C11 outcomes directly with the LDAPR and LDAR cases.
Table 2: Reduction of C implementation to Herd7 C11 litmus code.
Figure 5(a) shows the C11 litmus test using the reduction shown in Table 2. As Herd7 disallows loops, and explicit conditionals tend to inflate the search space, we encode the relevant checks and termination conditions in the exists proposition. This keeps the litmus minimal while preserving the ordering constraints that matter, yielding a tractable exploration of the executions.
Figure 5: Herd7 - C11 litmus test.
Figure 6: Herd7 C11 coherence graph.
Figure 5(b) shows that, of the four executions allowed by the C11 model, the proposition is satisfied in exactly one. This matches the RCpc-based Herd7 litmus test. It shows that AArch64’s use of LDAPR for memory_order_acquire is consistent with C11. Moreover, the C11 coherence graph in Figure 6 includes a reads-from (rf) edge from the release store to the acquire load of tail. This mirrors the LDAPR case and further reinforces this conclusion.
With the groundwork in place, we can now ask what went wrong in the DPDK ring’s C11 implementation. Using Figures 3 and 6 as a guide, we isolate the critical memory-access pattern and its effects. Since LDAPR (unlike LDAR) can satisfy an acquire by forwarding from the issuing core’s store buffer, the first place to look is Thread 0’s sequence around the release/acquire pair. The next place to check is Thread 1’s observations.
In the C11 coherence graph of Figure 6, the region of interest for Thread 0 reduces to the ordered sequence of loads and stores shown in Figure 7(a). Figure 7(b) then captures Thread 1's two ordered accesses: first a relaxed load observing the updated consumer_head, then, after an acquire fence, an acquire load of producer_tail.
In Figure 7(a), the acquire fence ties Thread 0’s early actions to its later acquire of producer_tail. It orders the relaxed-load of consumer_head and the release-store to producer_tail before the acquire load of producer_tail. That acquire, in turn, orders the subsequent relaxed CAS on consumer_head. The CAS cannot be observed before the acquire completes. Aside from a possible local reordering between the relaxed load of consumer_head and the release to producer_tail, Thread 0 otherwise preserves program order. These two operations are mutually relaxed.
In short, the write to consumer_head is the last store performed by Thread 0. In Figure 7(b), the acquire fence orders Thread 1’s relaxed load of consumer_head before its acquire-load of producer_tail.
W(Rel)[prod_tail] ↓ R(Rlx)[cons_head] ↓ F(Acq) ↓ R(Acq)[prod_tail] ↓ RMW(Rlx)[cons_head]
R(Rlx)[cons_head] ↓ F(Acq) ↓ R(Acq)[prod_tail]
Figure 7: Ordered loads and stores in each thread.
You may notice that Figure 7 mirrors Figure 0. The only difference is an additional edge linking Thread 0’s CAS to Thread 1’s relaxed load of consumer_head. Figure 0 asks, “Can X be 1 and Y be 0 in Thread 2?”. The answer is yes. Both our AArch64 (RCpc/LDAPR) and C11 litmus tests yield only a partial order. In particular, the acquire fence does not prevent a core from satisfying reads out of its own store buffer. As a result, it cannot rule out the unsafe interleaving we saw. For completeness, we also simulated the Figure 0 scenario as a C11 Herd7 litmus, as outlined in the introduction. The program, outcomes, and coherence graph in Figure 8 confirm the thesis.
Load-relaxed(cons_head); fence-acquire has the same semantics as load-acquire(cons_head). However, a load-acquire must read from an earlier store-release to the same variable; only then does it create a synchronizes-with edge that establishes the happens-before relationship ordering the surrounding memory accesses. But there is no store-release to cons_head, nor the equivalent fence-release plus store-relaxed. This is a classic mistake.
Figure 8: No total ordering of events in the case presented in the Introduction
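In code, the pattern looks roughly like the following sketch (our own simplified rendering of the ring's move-head path, not the actual DPDK source):

#include <stdatomic.h>
#include <stdint.h>

_Atomic uint32_t cons_head; /* only ever updated with a relaxed CAS */
_Atomic uint32_t prod_tail; /* updated with a store-release */

void observe(uint32_t *head, uint32_t *tail)
{
    /* Relaxed load plus acquire fence: locally equivalent to
     * load-acquire(cons_head)... */
    *head = atomic_load_explicit(&cons_head, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);

    /* ...but because cons_head is never written with a store-release (or a
     * release fence plus relaxed store), no synchronizes-with edge exists,
     * and this acquire load may still return a stale prod_tail. */
    *tail = atomic_load_explicit(&prod_tail, memory_order_acquire);
}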
Returning to the practical question in the DPDK ring: how do we fix it? As the bug arises from an unsafe partial order, the remedies fall into three broad strategies:
- Make all synchronizing operations sequentially consistent, forcing a single total order.
- Establish a synchronizes-with edge between the threads, by turning the CAS on the head into a release operation and the opposing read of the head into an acquire.
- Handle unsynchronized reads of ring buffer metadata, by detecting and discarding snapshots that are semantically invalid.
The next three subsections explore these options in detail. The right fix for the DPDK ring depends on the performance and latency trade-offs of each approach. Appendix B lists DPDK code changes required for each of the potential solutions suggested here.
The simplest fix is to make all synchronizing operations sequentially consistent, including the CAS. Under SC, synchronizing operations participate in a single global total order that preserves program order of each thread, eliminating the unsafe interleaving. As a result, the explicit acquire fence can be removed.
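Condensed from the corresponding patch in Appendix B, the head and tail accesses and the CAS in __rte_ring_headtail_move_head all become seq_cst (the tail publish in __rte_ring_update_tail likewise becomes seq_cst); variable names are as in the DPDK source:

*old_head = rte_atomic_load_explicit(&d->head, rte_memory_order_seq_cst);
/* ... */
stail = rte_atomic_load_explicit(&s->tail, rte_memory_order_seq_cst);
/* ... no standalone acquire fence is needed ... */
success = rte_atomic_compare_exchange_strong_explicit(
        &d->head, old_head, *new_head,
        rte_memory_order_seq_cst, rte_memory_order_seq_cst);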
Figure 9(a) shows the C11 Herd7 litmus after this change. Figure 9(b) shows the corresponding results.
Figure 9: Total order from sequential consistency.
To eliminate the unsafe partial order, all effects before Thread 1’s CAS on A must be visible to operations after Thread 2’s read of A. Make the CAS on A a release RMW on success. Make Thread 2’s read of A an acquire.
If Thread 2’s load-acquire reads the value written by Thread 1’s release CAS, it establishes a synchronizes-with edge. Everything before the CAS in Thread 1 happens-before everything after the load in Thread 2. This yields an order that is consistent with the semantics of the ring buffer.
The acquire fence that once separated the load of the head and load-acquire of the opposing tail is now redundant and therefore removed.
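Condensed from the corresponding patch in Appendix B, the change looks as follows (variable names as in the DPDK source):

/* The first read of the head becomes an acquire... */
*old_head = rte_atomic_load_explicit(&d->head, rte_memory_order_acquire);
/* ... */
/* ...and the CAS on the head releases on success (acquire on failure, so
 * the retry re-observes with the same guarantee). */
success = rte_atomic_compare_exchange_strong_explicit(
        &d->head, old_head, *new_head,
        rte_memory_order_release, rte_memory_order_acquire);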
The Herd7 code is shown in Figure 10(a). Figure 10(b) shows the resulting output.
Figure 10: Synchronizing release RMW and acquire of A.
In the ring, the unsafe order manifests when the producer index lags the consumer index, causing the available arithmetic to underflow. We can harden against this by guarding the calculation: if producer_head < consumer_tail (equivalently, if capacity + consumer_tail - producer_head would be negative), treat the result as invalid and clamp it to 0 rather than proceeding.
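Condensed from the corresponding patch in Appendix B, the guard is a small clamp right after the entries computation:

*entries = (capacity + stail - *old_head);

/* If the snapshot was stale and the unsigned subtraction underflowed,
 * the value is meaningless; treat the ring as having no usable entries. */
if ((int)*entries < 0)
    *entries = 0;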
We embed this check in the proposition of a slightly modified C11 Herd7 litmus based on Figure 8(a). Figure 11(a) shows the code. Figure 11(b) shows that the outcome never turns positive across all allowed executions. With this guard in place, the explicit acquire fence is redundant and removed.
Figure 11: Eliminating the unsafe partial order by committing only semantically valid executions.
This issue was an eye-opener. It underscores how easy it is to overlook unsafe partial orders and introduce memory corruption through misuse of shared-memory synchronization. While the analysis is instructive, a fix is still required. As outlined in the earlier section, we now examine the data and recommend one of the three possible solutions. The recommendations are based on performance reported by a modified ring_perf_autotest test application in DPDK on Neoverse N1 (Ampere Altra) and Intel Sapphire Rapids.
For Figure 12, ring_perf_autotest was lightly modified to report enqueue/dequeue throughput for burst size 1 (one element per operation) for all cores. This reveals how throughput scales with core count. It is highest on a single core and declines as cores increase, due to growing compare-and-swap (CAS) contention.
On Arm Neoverse N1 (Figure 12(a)), across all core counts:
On Intel Sapphire Rapids (Figure 12(b)), differences across core counts are small. This is consistent with x86’s strong memory model.
Figure 12: Scalability of Three Proposed Solutions
Based on the data, “Handle Unsynchronized Reads of Ring Buffer Metadata” is the most robust option across conditions and was selected as the fix for this issue. The patch implementing this approach appears in Figure 15 (Appendix B).
The notation we used in this article to annotate memory order is borrowed from Michael L. Scott's classic Shared-Memory Synchronization. Table 3 provides a summary of the annotations for C11 memory orders we have used in this article.
WR||WR : memory_order_seq_cst
WR|| : memory_order_release
||WR : memory_order_acquire
(no annotation) : memory_order_relaxed
Some sample usage of the notation is shown in Table 4.
Release Fence: atomic_thread_fence(memory_order_release) is written as fence(WR||)
Atomic Store Release: atomic_store_explicit(a, 1, memory_order_release) is written as a.store(1, WR||)
Atomic Store Sequentially Consistent: atomic_store_explicit(a, 1, memory_order_seq_cst) is written as a.store(1, WR||WR)
Atomic Load Relaxed: x = atomic_load_explicit(a, memory_order_relaxed) is written as x = a.load()
Atomic Load Acquire: x = atomic_load_explicit(a, memory_order_acquire) is written as x = a.load(||WR)
Table 5 explains any other notations used in this article.
po: Program Order
ca: Coherence After
rf: Read From
fr: From Read
scp: SC-Precedence
dd: Data Dependency
mo: Modification Order
Code listings for the Fixing the DPDK Ring section.
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..8befb29dca 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -36,7 +36,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
 		rte_wait_until_equal_32((uint32_t *)(uintptr_t)&ht->tail, old_val,
 			rte_memory_order_relaxed);
 
-	rte_atomic_store_explicit(&ht->tail, new_val, rte_memory_order_release);
+	rte_atomic_store_explicit(&ht->tail, new_val, rte_memory_order_seq_cst);
 }
 
 /**
@@ -78,19 +78,16 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 	unsigned int max = n;
 
 	*old_head = rte_atomic_load_explicit(&d->head,
-			rte_memory_order_relaxed);
+			rte_memory_order_seq_cst);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
 		stail = rte_atomic_load_explicit(&s->tail,
-					rte_memory_order_acquire);
+					rte_memory_order_seq_cst);
 
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -115,8 +112,8 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* on failure, *old_head is updated */
 		success = rte_atomic_compare_exchange_strong_explicit(
 				&d->head, old_head, *new_head,
-				rte_memory_order_relaxed,
-				rte_memory_order_relaxed);
+				rte_memory_order_seq_cst,
+				rte_memory_order_seq_cst);
 	} while (unlikely(success == 0));
 	return n;
 }
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..d2a76ce422 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -78,14 +78,11 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 	unsigned int max = n;
 
 	*old_head = rte_atomic_load_explicit(&d->head,
-			rte_memory_order_relaxed);
+			rte_memory_order_acquire);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -115,8 +112,8 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* on failure, *old_head is updated */
 		success = rte_atomic_compare_exchange_strong_explicit(
 				&d->head, old_head, *new_head,
-				rte_memory_order_relaxed,
-				rte_memory_order_relaxed);
+				rte_memory_order_release,
+				rte_memory_order_acquire);
 	} while (unlikely(success == 0));
 	return n;
 }
diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index b9388af0da..e5ac1f6b9e 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -83,9 +83,6 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		/* Reset n to the initial burst count */
 		n = max;
 
-		/* Ensure the head is read before tail */
-		rte_atomic_thread_fence(rte_memory_order_acquire);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -99,6 +96,13 @@ __rte_ring_headtail_move_head(struct rte_ring_headtail *d,
 		 */
 		*entries = (capacity + stail - *old_head);
 
+		/*
+		 * Ensure the entries calculation was not based on a stale
+		 * and unsafe stail observation that causes underflow.
+		 */
+		if ((int)*entries < 0)
+			*entries = 0;
+
 		/* check that we have enough room in ring */
 		if (unlikely(n > *entries))
 			n = (behavior == RTE_RING_QUEUE_FIXED) ?