Locks, SWPs and two Smoking Barriers (Part 2)

September 11, 2013

4 minute read time.

In the last article, I explained how to modify SWP code to make use of compiler intrinsics. Using intrinsics hides the underlying detail needed to use the load and store exclusive instructions (LDREX and STREX) and the use of memory barriers. In this article I look at implementing atomic memory accesses in assembler.

In order to describe memory barriers, what they are and how they should be used, I need to describe two types of memory model, strongly ordered and weakly ordered. The strongly-ordered model is very natural for programmers. In this model, the order that a program writes data to memory is the order in which the data is observed being written into memory, that is, other programs sharing the data will "see" the same ordering regardless of the CPU that they are executing on.

For example, if a CPU writes a new X then writes a new Y, all other CPUs that subsequently read Y then a read X, will access either the new Y and new X, the old Y and the new X, or old Y and the old X. However, because the order of the write is strongly ordered as write X first then write Y, n o CPU will access the new Y and the old X .

Modern CPUs, such as ARM, optimize memory accesses via hardware mechanisms avoiding memory accesses when it can, for example by using a cache. When it has to access memory, it makes those memory accesses as efficient as possible. An example of efficient memory accesses is the speculative reading of memory into caches before the CPU asks for it. This could happen when a program is accessing an array in memory. Whilst the program is working on element N, the memory subsystem could be reading element N+1 (and above) into cache ready for use. These mechanisms can cause the memory accesses to be reordered .

Systems that are allowed to reorder memory accesses in this way are called weakly ordered memory systems. The upshot of this is that, whilst a (program running on a) CPU has a coherent view of its own memory accesses, those memory accesses will be observed differently by (programs running on) other CPUs. To take my example above, another CPU might easily access the new Y and the old X.

Unfortunately, being able to observe the same order of memory accesses is key when accessing memory atomically. If, in the above example, X is a spinlock and Y is the data structure being modified within the critical region protected by X, it is important that the accesses to X are observed by all other CPUs in the system as being before accesses to Y. What you need is some mechanism to make sure that all CPUs observe accesses to X before the accesses to Y and this mechanism is a memory barrier.

Taking load and store exclusives and memory barriers into account, an assembly version of acquiring a spinlock looks like this:

CODE

ENTRY (_spin_lock)
    mov r1,#1
1: ldrex r2,[r0]
    cmp r2,#0
    strexeq r2, r1, [r0]
    cmpeq r2, #0
    bne 1b
    dmb
    mov r0,#0
END (_spin_lock)

As I described in the previous article, a pair of load and store exclusives are used to acquire the lock. The "cmpeq r2, #0' instruction checks that the exclusive store to the lock location failed, which it will do if another CPU has accessed the lock between the LDREX and STREX operations. A failure will cause the code to retry the load exclusive.

You will note the Data Memory Barrier (DMB) instruction that is issued once the lock has been acquired. The DMB guarantees that all memory accesses before the memory barrier will be observed by all of the other CPUs in the system before all memory accesses made after the memory barrier. This makes more sense if you consider that once a lock has been acquired, a program will then access the data structure(s) locked by the lock. The DMB in the lock function above ensures that accesses to the locked data structure are observed after accesses to the lock.

The unlock code looks like:

CODE

ENTRY (_spin_unlock)
    mov r1, #0
    dmb
    str r1, [r0]
END (_spin_unlock)

The write to unlock the spinlock uses a simple STR instruction (not a store exclusive). This is fine if there is only one owner of the lock, the most common case. If a critical region allows more than one CPU to access it, load and store exclusives would need to be used.

The DMB instruction ensures that accesses to the locked data structure(s) are observed before the lock is unlocked .

The next article looks at power considerations.

David Rusling, ARM Fellow, David was born a few weeks before Sputnik was launched. He's always liked mathematics, but America's space program together with 'Star Trek' made him think that computers were really interesting and so he graduated in 1982 with a degree in Computer Science. The future turns out to have less flashing lights than he expected. After hacking networking boxes for Digital Equipment Corporation, he got involved in the port of Linux to the Alpha processor. This gave him an abiding respect for the power of open source in general and Linux in particular. He worked on StrongARM before moving to ARM where he added tools experience. He's an ARM Fellow; which he says, "really means that I'm a techno-dweeb with a wide freedom to meddle." His official role is to set the technical direction for ARM's tools and software story.

Architectures and Processors blog

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Chris Walsh

Arm is pleased to announce that the Arm Intrinsics Search Engine has been updated to include the Scalable Matrix Extension (SME) intrinsics, including both SME and SME2 intrinsics.
- October 3, 2025
Arm A-Profile Architecture developments 2025

Martin Weidmann

Each year, Arm publishes updates to the A-Profile architecture alongside full Instruction Set and System Register documentation. In 2025, the update is Armv9.7-A.
- October 2, 2025
When a barrier does not block: The pitfalls of partial order

Wathsala Vithanage

Acquire fences aren’t always enough. See how LDAPR exposed unsafe interleavings and what we did to patch the problem.
- September 15, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Locks, SWPs and two Smoking Barriers (Part 2)

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Arm A-Profile Architecture developments 2025

When a barrier does not block: The pitfalls of partial order