Simplifying persistent programming with microarchitectural support

April 15, 2021


Hardware-based coherence has significantly simplified concurrent programming and improved application performance. Just as it triumphed over software-based coherence, our work shows that microarchitectural support in systems with persistent memory can significantly improve both the performance and the programmability of persistent applications.

The need for strict persistency

In my last blog post, about our International Symposium on Computer Architecture (ISCA) 2020 paper, we introduced an implementation of a relaxed memory persistency model, Strand Persistency. Strand Persistency confines the instruction ordering constraints enforced by a persist barrier to a fine-grained region or ‘strand’, while strands can be reordered with respect to each other. The baseline memory persistency model that Strand Persistency improves on is Epoch Persistency, as introduced in Armv8.2-A. Epoch Persistency lets persistent applications use explicit instructions (DC CVAP and DSB) to flush persistent stores to the point of persistence. The more relaxed Strand Persistency removes unnecessary instruction ordering constraints, and improves performance by allowing more instruction reordering and parallel execution.
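As a rough illustration of the flush-and-barrier discipline described above, the following Python sketch models a volatile cache in front of persistent memory. It is a toy model, not Arm's implementation: the class and method names are hypothetical, `flush` merely stands in for DC CVAP and `barrier` for DSB, and the model is deliberately pessimistic in treating a flushed line as durable only once a barrier has followed it.

```python
class EpochPersistencyModel:
    """Toy model: only lines that were flushed and then fenced survive a crash."""

    def __init__(self):
        self.cache = {}      # volatile cache contents, lost on power failure
        self.in_flight = {}  # flushes issued but not yet known to be complete
        self.pm = {}         # persistent memory, the point of persistence

    def store(self, addr, value):
        # An ordinary store only updates the volatile cache.
        self.cache[addr] = value

    def flush(self, addr):
        # Stand-in for DC CVAP: start cleaning the line towards persistence.
        if addr in self.cache:
            self.in_flight[addr] = self.cache[addr]

    def barrier(self):
        # Stand-in for DSB: earlier flushes are complete once this returns.
        self.pm.update(self.in_flight)
        self.in_flight.clear()

    def crash(self):
        # Power failure: volatile state disappears; return what recovery sees.
        self.cache.clear()
        self.in_flight.clear()
        return dict(self.pm)


model = EpochPersistencyModel()
model.store("a", 1)
model.flush("a")
model.barrier()        # "a" is now guaranteed durable
model.store("b", 2)    # never flushed, so not durable
recovered = model.crash()
print(recovered)
```

In this model, the store to `a` survives the crash and the store to `b` does not, which is the bookkeeping that programmers have to do by hand under Epoch Persistency.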


Figure 1. Evolution of memory models: consistency and persistency. Notes: SPAA’19 [1] introduces the ISA support for (lazy) release persistency. ISCA’20 [2] introduces the ISA support and microarchitectural implementation of Strand Persistency. HPCA’21 [3] introduces the microarchitectural implementation of strict and sequential persistency.

In our HPCA’21 paper, we introduce an efficient microarchitectural implementation of the Strict Persistency model, in which persistency coincides with consistency. It also enables Sequential Persistency even when the underlying memory consistency model is more relaxed.

The main idea of the Strict Persistency implementation is to battery-protect the on-chip volatile cache hierarchy. Any data that the coherent caches make globally visible is then also persistent, so persistency coincides with consistency. Strict Persistency simplifies persistent programming: programmers writing concurrent code for persistent memory no longer need to reason about memory persistency on top of an already complex memory consistency model. Because the battery-protected caches act as the point of persistence, no explicit cacheline flush instructions are needed, nor are the barriers that ensure such flushes complete. As a result, the performance of persistent applications improves, and programming is simplified by the absence of such instructions.

The performance and programmability benefits come at the cost of battery-backing the on-chip volatile caches. The battery needed to back up the entire cache hierarchy can be significant for high-end processors with large caches. In the paper, we propose a battery-backed persist buffers solution. This aims to optimize the battery size by only protecting a subset of cachelines that contain updates to persistent regions, and tracking such cachelines in persist buffers alongside level one data caches. The battery-optimized solution with persist buffers:

Can deliver performance as close as possible to that of the battery-backed caches solution,
Requires much less backup energy than the battery-backed caches solution, and,
Can elide explicit cacheline flush instructions even when battery-backing is not available.
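The idea of protecting only the lines that hold persistent updates can be sketched with a small Python model. This is a simplified illustration under stated assumptions, not the design from the paper: the class, its method names, and the address range are hypothetical, and the model tracks individual addresses rather than cachelines.

```python
class PersistBufferModel:
    """Toy model: a battery-backed buffer tracks only stores to persistent addresses."""

    def __init__(self, persistent_range):
        self.persistent_range = persistent_range  # addresses backed by PM (assumed layout)
        self.cache = {}           # ordinary volatile cache, not battery-backed
        self.persist_buffer = {}  # battery-backed, sits alongside the L1 data cache
        self.pm = {}              # persistent memory

    def store(self, addr, value):
        self.cache[addr] = value
        if addr in self.persistent_range:
            # Only updates to persistent regions need battery protection.
            self.persist_buffer[addr] = value

    def power_failure(self):
        # The battery only has to drain the persist buffer, not the whole cache.
        drained = len(self.persist_buffer)
        self.pm.update(self.persist_buffer)
        self.persist_buffer.clear()
        self.cache.clear()
        return drained


model = PersistBufferModel(persistent_range=range(0x1000, 0x2000))
model.store(0x1000, 7)   # persistent
model.store(0x0010, 9)   # volatile DRAM address, never tracked
model.store(0x1008, 8)   # persistent
drained = model.power_failure()
print(drained, model.pm)
```

No flush or barrier call appears in the program: the two persistent stores reach PM because the buffer that holds them is in the battery-protected domain, while the volatile store is simply lost.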


Figure 2. Microarchitectural support to synchronize concurrent and sequential visibility with persistency. Architectural support provides means for software to drain data from volatile memory to persistent memory, such as DC CVA[D]P and DSB. Microarchitectural support eliminates the gap between visibility and persistency, and therefore removes the need to apply such instructions.

The need for sequential persistency

With all the caches battery-backed, or with battery-backed persist buffers implemented, Strict Persistency is achieved. However, Strict Persistency alone is not sufficient for sequential programs, because it forces them to reason about memory consistency. This is because:

A sequential program running on a system with persistent memory (PM) will have a crash observer at recovery, which is not dissimilar to having a concurrent observer.
Because caches are battery-backed, Strict Persistency dictates that persistency (as seen by a crash observer) is always synchronized with consistency (as seen by a concurrent observer).

Without PM, sequential programs only have the executing core itself as an observer, and the writes are always observed sequentially. This is due to in-order commit and other-MCA (other-multiple-copy-atomicity, that is, a core can see its writes in its own store buffer earlier than other cores or observers can). With PM, writes can drain out of order from the CPU store buffers to the persistent caches (the point of persistence, PoP) because of Arm's weak memory consistency model. This means that sequential programs need to reason about memory consistency.

A simple example is a sequential linked list on PM, where inserting a node first initializes the node and then publishes it. For crash consistency, a CPU barrier would be needed to prevent the two steps from being reordered. Otherwise, if the steps are reordered and a power failure happens between them, recovery can find a published node that was never initialized.
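The linked list hazard can be made concrete with a small Python model that enumerates every persistent image a crash observer could find. It is a sketch under simplifying assumptions: the store names (`node.value`, `node.next`, `head`), the `BARRIER` marker, and the helper functions are all hypothetical, and stores within one barrier-delimited region are assumed to drain in any order.

```python
from itertools import permutations

BARRIER = "barrier"  # hypothetical marker standing in for a CPU barrier


def allowed_orders(program):
    """Yield every drain order that never moves a store across a barrier."""
    epochs, current = [], []
    for op in program:
        if op == BARRIER:
            epochs.append(current)
            current = []
        else:
            current.append(op)
    epochs.append(current)

    def expand(i):
        if i == len(epochs):
            yield []
            return
        for perm in permutations(epochs[i]):
            for rest in expand(i + 1):
                yield list(perm) + rest

    yield from expand(0)


def crash_states(program, initial):
    """Every PM image that a crash at any point could leave behind."""
    states = []
    for order in allowed_orders(program):
        for cut in range(len(order) + 1):
            pm = dict(initial)
            pm.update(order[:cut])  # stores that drained before the crash
            states.append(pm)
    return states


def is_consistent(pm):
    # If head already points at the node, the node must be fully initialized.
    return pm["head"] is None or ("node.value" in pm and "node.next" in pm)


initialize = [("node.value", 42), ("node.next", None)]
publish = [("head", "node")]
initial = {"head": None}

without_barrier_ok = all(
    is_consistent(s) for s in crash_states(initialize + publish, initial))
with_barrier_ok = all(
    is_consistent(s) for s in crash_states(initialize + [BARRIER] + publish, initial))
print(without_barrier_ok, with_barrier_ok)
```

Without the barrier, the enumeration includes an image in which `head` points at a node whose fields never reached PM; with the barrier between the two steps, no such image exists.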

As we all know, concurrency is hard to reason about. The question we must ask is: can we avoid adding this complexity to sequential programs when PM is introduced? After all, the sequential execution mental model that developers of sequential programs rely on needs to be preserved, even with PM. Otherwise, simple sequential programs like the linked list example will essentially need to be reasoned about as lock-free programs.

Sequential programs observe their stores pre-StoreBuffer because of other-MCA, that is, the read-own-write-early MCA model. Concurrent and crash observers, however, see stores post-StoreBuffer, in the caches, as enabled by hardware coherence and battery-backing. Stores are in order pre-StoreBuffer, but may be observed out of order post-StoreBuffer, as weak memory consistency models allow. This subtle difference has profound implications for porting sequential programs to PM, even for a simple sequential linked list.

The problem could be addressed in either software or hardware. We propose a hardware solution that closes the gap between consistency and persistency by addressing the ordering aspect, rather than the atomicity aspect, of memory consistency models. Battery-protecting the store buffers in addition to the caches or persist buffers makes sense, because stores are always committed in order: out-of-order cores enforce this to provide precise exceptions. Persistency then sits at pre-StoreBuffer while consistency remains at post-StoreBuffer. With store buffers in the persistence domain, all stores are persisted in order and Sequential Persistency is achieved. As a result, sequential programs no longer need to reason about CPU reordering or use CPU barriers to prevent it.
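The effect of moving the store buffer into the battery-protected domain can be checked with a small Python model. This is an illustrative sketch, not the hardware design: the function names are hypothetical, stores are assumed to target distinct addresses, and any subset of committed stores is assumed to be able to leave the store buffer before the crash.

```python
from itertools import combinations


def recover(program, committed, drained, battery_backed_store_buffer):
    """PM image after a crash.

    committed: number of stores committed, in program order, before the crash.
    drained:   indices (< committed) of stores that had already left the
               store buffer for the battery-backed caches or persist buffers.
    """
    persistent = {}
    for i in sorted(drained):
        addr, value = program[i]
        persistent[addr] = value
    if battery_backed_store_buffer:
        # The battery also drains whatever is still in the store buffer.
        for i in range(committed):
            if i not in drained:
                addr, value = program[i]
                persistent[addr] = value
    return persistent


def all_crash_images(program, battery_backed_store_buffer):
    """Enumerate every crash point and every set of already-drained stores."""
    images = []
    for committed in range(len(program) + 1):
        for r in range(committed + 1):
            for drained in combinations(range(committed), r):
                images.append(recover(program, committed, set(drained),
                                      battery_backed_store_buffer))
    return images


def is_program_order_prefix(program, image):
    # Sequential Persistency: recovery sees exactly the first k stores.
    return any(image == dict(program[:k]) for k in range(len(program) + 1))


program = [("node.value", 42), ("node.next", None), ("head", "node")]

with_battery = all(is_program_order_prefix(program, image)
                   for image in all_crash_images(program, True))
without_battery = all(is_program_order_prefix(program, image)
                      for image in all_crash_images(program, False))
print(with_battery, without_battery)
```

With the store buffer protected, every recovered image is a prefix of program order, so the program needs no CPU barrier. Without that protection, the enumeration contains images such as a published `head` with no initialized node.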

Even with reordering by CPU store buffers addressed, compilers can still reorder stores. This must be prevented with compiler barriers, even for a simple sequential linked list on PM, although compiler barriers do not affect run-time performance in the way CPU barriers do. Some form of language or compiler support will be needed to prevent compiler reordering and remove the need for compiler barriers. One option might be to declare data structures on persistent memory as 'persistent', so that compilers do not reorder updates to them.

Results, impact and future work

We have shown that Strict Persistency can be achieved without incurring performance degradation, compared to relaxed persistency models, by battery-backing volatile caches or buffers. Furthermore, Sequential Persistency can be achieved with store buffers additionally in the battery protection domain. The battery-backed buffers solution further reduces the backup energy capacity needed by two to three orders of magnitude as compared to battery-backing all the volatile caches, with negligible performance impact as compared to battery-backed caches.

The work shows that some microarchitectural support in systems with persistent memory can significantly improve the performance and programmability of persistent applications. This is similar to how hardware-based coherence has triumphed over software-based coherence over the past two decades, significantly simplifying concurrent programming and improving application performance.

This is our penultimate installment on memory persistency models. The final installment, to appear at ISCA’21, explores explicit instruction-level dependency tracking. In addition, we continue to explore architectural support ideas for far memory.

Talk recording and paper

The talk video and paper are available to view. For more information, check out our other blogs on memory persistency and language-level memory persistency.

Questions? Contact William Wang.

References

[1] Wang, William, and Stephan Diestelhorst. "Persistent Atomics for Implementing Durable Lock-Free Data Structures for Non-Volatile Memory (Brief Announcement)." In The 31st ACM Symposium on Parallelism in Algorithms and Architectures, pp. 309-311. 2019.

[2] Gogte, Vaibhav, William Wang, Stephan Diestelhorst, Peter M. Chen, Satish Narayanasamy, and Thomas F. Wenisch. "Relaxed persist ordering using strand persistency." In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 652-665. IEEE, 2020.

[3] Alshboul, Mohammad, Prakash Ramrakhyani, William Wang, James Tuck, and Yan Solihin. "BBB: Simplifying Persistent Programming Using Battery-Backed Buffers." In IEEE International Symposium on High-Performance Computer Architecture (HPCA). 2021.
