Recently I was assigned the task of fixing and improving the atomics support in Cranelift, the code generator for the WebAssembly runtime 'wasmtime'. I had never worked on atomics support in any compiler, for any architecture, and I had only recently started to work on compiling WebAssembly (Wasm). I had also managed to always keep my head firmly in the sand about memory models. So, there was a lot to learn.
As my head has now been ripped out of the sand, I thought it would be useful to share what I have learned. I do not consider myself a computer scientist, so this is an engineer's guide to multithreaded programming with the Armv8 memory model, in the context of C/C++ and WebAssembly.
Specifically, the main focus is how we can efficiently support sequential consistency for data-race-free programs on Armv8. So, what does that mean?
Let us start with the Data-Race Free (DRF) part. A race condition, or race hazard, arises when the behavior of a system potentially depends on a sequence of events that is not well defined. These often manifest as spurious bugs in multithreaded programs, where the behavior of one thread can be affected, seemingly unpredictably, by another thread through their communication in shared memory. A DRF program is written to avoid this class of bugs.
The Sequential Consistency (SC) model allows any number of processors and threads to execute a program, with their instructions interleaved, but with the results being as if they were executed by a notional single thread. This means that the program appears to run sequentially. The program behavior is unaffected by the order in which the threads execute, as the instructions from each thread can be freely interleaved; however, the instructions of each thread are executed in program order. This program order is really a partial order over machine instructions, which means there is some flexibility in the order in which the instructions are executed. That flexibility is allowed just as long as the program appears to behave as the programmer intended, with the behavior primarily observed through its interactions with memory and control-flow changes.
The SC model would, essentially, be provided if each processor issued its instructions one at a time, in program order, with every memory access completing directly against a single shared memory before the next instruction begins.
This description sounds nothing like a modern system!
Instruction reordering, instruction and data prefetching, and various levels of memory are ubiquitous hardware features which are implemented to improve execution speed. Compilers are also capable of transforming source code into a form that is almost unrecognizable. So how can we write and execute high-performance, bug-free, multithreaded programs?
A programmer needs to be able to express which parts of the code depend on the results of another thread, so that data races are avoided, while still giving the compiler and microarchitecture the freedom to optimize without affecting correctness. A contract therefore needs to be made between the programmer and the machine so that a program can be specified and executed correctly, while keeping overhead low and performance high. At the source level, this contract is commonly provided by a memory model and language features such as data types, memory primitives, and synchronization points. At the machine level, a computer architecture provides a memory model and specific machine instructions that often map well to the common language features that enable multithreading. Sequential Consistency for Data-Race Free Programs (SC-DRF) is therefore a kind of agreement between the programmer, the language, the compiler, the architecture, and the microarchitecture on how a program should behave. If everyone holds up their end of the bargain, the program can still be optimized and will execute in a multithreaded environment as if it were a notional single thread. A key optimization that we look at later is instruction reordering.
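To make that contract concrete, here is a minimal C++ sketch of my own (it is an illustration, not code from Cranelift or wasmtime): the shared payload is plain data, and the only inter-thread communication happens through a std::atomic flag, so the program is data-race free and, with the default seq_cst ordering, behaves sequentially consistently.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain data: only touched before/after synchronization
std::atomic<bool> ready{false};  // synchronizing atomic, seq_cst by default

void producer() {
    payload = 42;                // ordinary store
    ready.store(true);           // seq_cst store publishes the payload
}

void consumer() {
    while (!ready.load()) { }    // seq_cst load; spins until the flag is set
    assert(payload == 42);       // no data race: the payload is guaranteed visible here
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}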
A prose memory model for C and C++ was published with the release of C11 and C++11. It provides well-defined behavior for DRF programs and a number of primitives to access shared memory and, somewhat reluctantly, supports several memory orderings (ascending in a partial order of ordering strength): relaxed, consume, acquire, release, acquire-release (acq_rel), and sequentially consistent (seq_cst).
The default behavior is SC and I will cover some of these other orderings later on. The low-level primitives that C/C++ provide are atomic loads, stores, and read-modify-write operations.
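As a quick illustration of those primitives, here is a hedged C++ sketch of my own using the standard <atomic> API; the surrounding function is just for show.

#include <atomic>

std::atomic<int> counter{0};

void primitives() {
    int v = counter.load(std::memory_order_acquire);     // atomic load
    counter.store(v + 1, std::memory_order_release);     // atomic store
    counter.fetch_add(1);                                 // read-modify-write, seq_cst by default
    int expected = 2;
    counter.compare_exchange_strong(expected, 3,
                                    std::memory_order_seq_cst);  // read-modify-write compare-and-swap
}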
The official WebAssembly specification does not yet support threads, but the design is almost finalized. It will define a weak memory model, offering unordered (relaxed) and SC memory operations. As a compilation target for C/C++, Wasm follows C/C++ in that it only supports DRF programs, but it also promotes all atomic orderings to SC. So, while it is still weakly ordered, it is stronger than C/C++.
Armv8 is the first version of the Arm architecture to have a formal memory model; you can read the prose in the Arm Architecture Reference Manual, section B2.1, or refer to the formal model here. I am not going to go through it all here, but I will quote parts as necessary as we go. For now, it is just important to note two things: the model is weakly ordered, and it is Other-multi-copy atomic.
To quote the reference manual:
The Arm architecture is a weakly ordered memory architecture that permits the observation and completion of memory accesses in a different order from the program order.
In an Other-multi-copy atomic system, it is required that a write from an Observer, if observed by a different Observer, is then observed by all other Observers that access the Location coherently. It is, however, permitted for an Observer to observe its own writes prior to making them visible to other observers in the system.
In this context, an ‘Observer’ is a CPU core, hardware thread, or anything else that can access shared memory. So, from a microarchitecture perspective, this looks like a description of a shared memory system in which each Observer potentially has a write buffer that it can read from before writing back to main memory. We would, of course, expect one or more levels of cache in a modern system too, so these would implicitly need to be fully coherent as well. This is also how I understand the x86-TSO model to work - but that does not mean that the models are the same! The Arm ARM states about multi-copy atomicity:
All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes.
Which suggests that the Armv8 model describes a coherent shared memory system: all Observers agree on the order of writes to any single location, an Observer may see its own writes early via a write buffer, and an Observer is not required to observe every write to a location.
My first humbling lesson was a very basic one: most memory operations are already ‘atomic’, in that they are performed as a single unit. For this to be true, the access must be aligned and must only be reading into, or writing from, a single register. I assume this is the case for all modern architectures. These accesses are classed as single-copy atomic.
But, did I not just state it is important to remember that Armv8 is multi-copy atomic?
Well, when we talk about ‘atomics’, we almost certainly mean synchronizing atomics. Just as C/C++ provide synchronizing atomic operations, AArch64 and AArch32 also provide these in the ISA. Normal loads and stores provide no inter-thread synchronization; however, this does not mean that these memory operations cannot be used to access shared memory.
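The closest portable C++ analogue of an access that is atomic but not synchronizing is a relaxed atomic operation. As a hedged sketch of my own: the counter below never tears or loses an increment, yet it imposes no ordering on surrounding memory operations, so it cannot by itself be used to publish data to another thread.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<unsigned> hits{0};

void worker() {
    for (int i = 0; i < 100000; ++i) {
        hits.fetch_add(1, std::memory_order_relaxed);  // atomic, but imposes no ordering
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("%u\n", hits.load());  // always 200000: no increments are torn or lost
}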
Two of the memory orders provided by C/C++ are ‘acquire’ and ‘release’ (and acq_rel for read-modify-write).
From Preshing:
Acquire semantics is a property that can only apply to operations that read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation that follows it in program order.
Release semantics is a property that can only apply to operations that write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.
These descriptions show that the two operations can be used to create critical sections of code, with an acquire taking a lock and a release unlocking it. Using this pair of operations still allows some instructions to be reordered into the critical section, but not out of it. Some pseudocode is shown below.
A) load/store ; could sink after B
B) load_acq
C) load/store ; has to execute after B and before D
D) store_rel
E) load/store ; could hoist before D
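A minimal spinlock sketch (my own illustration, not the article's code) shows the same pairing in C++: taking the lock is a read-acquire, performed here via a read-modify-write, and releasing it is a write-release, so everything inside the critical section is trapped between the two.

#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Read-modify-write with acquire semantics: nothing in the critical
        // section can be hoisted above the point where the lock is taken.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until the current holder releases
        }
    }
    void unlock() {
        // Write-release: nothing in the critical section can sink below it.
        locked.store(false, std::memory_order_release);
    }
};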
With Armv7, we could achieve these semantics using DMB, but with Armv8 we have specific instructions with these semantics:
Operation        Armv7           AArch32   AArch64
Load Acquire     LDR; DMB ISH    LDA       LDAR
Store Release    DMB ISH; STR    STL       STLR
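To connect the table to source code, here is a hedged C++ sketch of my own: a release store publishing a value and an acquire load consuming it. On AArch64 a compiler typically lowers these to STLR and LDAR, although the exact code generation depends on the compiler and target.

#include <atomic>

std::atomic<int> flag{0};
int data = 0;

void publish() {
    data = 123;                                // plain store
    flag.store(1, std::memory_order_release);  // typically an STLR on AArch64
}

int consume() {
    while (flag.load(std::memory_order_acquire) == 0) {  // typically an LDAR on AArch64
    }
    return data;                               // guaranteed to read 123
}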
After learning a bit about memory models, the next struggle was understanding how load-acquire and store-release instructions can be used to implement C/C++ (and also Swift, Rust, and WebAssembly) SC-DRF. I did not understand how we could be using the slightly more relaxed acq_rel semantics to provide seq_cst. I read a whole bunch of papers, but the information was in the language and hardware specifications all along...
Let us take a look at the C++ spec:
seq_cst provides acq_rel ordering as well as establishing a single total modification order of all atomic operations that are so tagged.
So, the key difference between acq_rel and seq_cst is that there is a single legal order of all the atomic operations in an SC program.
We already know that both the load-acquire and store-release perform a synchronization across all cores. The load must be observed before anything after it in program order, while all memory operations prior to the store in program order must be observed before it. But these semantics say nothing about a prior store-release being reordered after a later load-acquire. Reusing our previous example:
A) store_rel ; could sink after B!!
B) load_acq
C) load/store ; has to execute after B and before D
D) store_rel
E) load/store ; could hoist before D
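To make the hazard concrete, here is the classic store-buffering litmus test, sketched in C++ as my own illustration. If the operations only had acquire/release semantics, each thread's store could be reordered after its later load of the other variable and both threads could read 0; the single total order required by seq_cst forbids that outcome.

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void t2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // With seq_cst the single total order forbids r1 == 0 && r2 == 0;
    // with only release/acquire semantics that outcome would be allowed.
}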
But the Armv8 memory model, fortunately, provides explicit ordering in this case:
Where a Load-Acquire appears in program order after a Store-Release, the memory access generated by the Store-Release instruction is Observed-by each PE to the extent that PE is required to observe the access coherently, before the memory access generated by the Load-Acquire instruction is Observed-by that PE, to the extent that the PE is required to observe the access coherently.
It is this little snippet of the Armv8 memory model that provides the single legal order that we require. A load-acquire cannot be issued before any prior store-release has been observed by all the cores, so it forces all threads to agree on the state of shared memory after a store-release and before the next load-acquire. If we refer back to the Armv7 sequence for acquire-release, we can convert this to seq_cst by placing a DMB after the store as well. This is what the Armv8 memory model provides automatically with the store-release.
Operation    Armv7
Load SC      LDR; DMB ISH
Store SC     DMB ISH; STR; DMB ISH
C/C++ will default to seq_cst, and I feel it is best to always use that unless you are extremely confident and/or enjoy debugging multi-threaded programs. The specification states that if you inter-mingle seq_cst with any other ordering, seq_cst ordering cannot be provided. This model has been implemented in Clang/LLVM, so many other languages will have picked it up, either willingly or through a lack of desire to do a lot of compiler work. Wasm will, for now, only support seq_cst atomics.
The main takeaway for me is that, in the transition from Armv7 to Armv8, the architecture has remained usefully weak but has gained some surprisingly powerful and flexible synchronization primitives. The Large System Extensions (FEAT_LSE and FEAT_LSE2) improved multi-threaded capabilities further by introducing compare-and-swap instructions, atomic read-modify-write operations, single-copy atomic 128-bit loads and stores (LDP, STP), and greatly relaxed alignment rules. Both FEAT_LSE and FEAT_LSE2 are now mandatory in Armv9.
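As a final hedged sketch of my own, here are the kind of read-modify-write operations that FEAT_LSE can implement with single instructions. When compiling C++ for a target with LSE (for example with -march=armv8.1-a, or via the outline-atomics runtime dispatch in GCC/Clang), a fetch_add is typically lowered to LDADDAL and a compare_exchange to CASAL instead of an LDAXR/STLXR retry loop; the exact code generation depends on the compiler and flags.

#include <atomic>

std::atomic<long> total{0};

void add(long n) {
    // With FEAT_LSE this is typically a single LDADDAL instruction.
    total.fetch_add(n, std::memory_order_seq_cst);
}

bool claim(std::atomic<long>& slot, long expected, long desired) {
    // With FEAT_LSE this is typically a single CASAL instruction.
    return slot.compare_exchange_strong(expected, desired,
                                        std::memory_order_seq_cst);
}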
"The specification states that if you inter-mingle seq_cst with any other ordering, seq_cst ordering cannot be provided. This model has been implemented in the Clang/LLVM"
This is basically the salt of the article, right?
seq_cst only works when used with other seq_cst, which makes using seq_cst somewhat pointless, and LLVM downgrades it to acq_rel, which is still compliant with the spec:
if only seq_cst is used, then it's downgraded to acq_rel and the order is preserved
if seq_cst is mixed with other memory models, then the spec says sequential consistency cannot be provided, so acq_rel is fine too
Hi,
I'm not sure what you mean by LLVM downgrading to acq_rel, I'm not aware of compilers changing the memory ordering of operations. Do you have an example?
I also disagree that seq_cst is pointless, as having a single legal ordering of these operations can be an extremely useful property in a program. Of course, there are instances where seq_cst is too restrictive, i.e. slower, so more relaxed constructs can be beneficial. Indeed, LDAPR and friends were introduced in Armv8.3 for such cases.
As for intermingling, as well as adding complexity to an already hard problem, I guess if one is doing that then they're probably not looking for seq_cst anyway, so maybe we can't ask the C/C++ authors for well defined behavior there? Anyway, as long as the programmer doesn't write any bugs and the compiler obeys the given semantics then the program should just work... right?
Hi Samuel! Apologies for the long delay :D - got no notifications or emails, and this completely disappeared off my radar.
What I meant is - SEQ_CST is more restrictive than ACQ_REL. Then, if you take into account that sequential consistency can't be guaranteed when SEQ_CST is mixed with ACQ_REL, you can basically "downgrade" all SEQ_CST calls into being ACQ_REL. Between them, the sequential consistency will be guaranteed. All the rest is not supposed to be guaranteed, so clang uses this "UB" to simplify things.