Locks, SWPs and two Smoking Barriers

September 11, 2013


Before ARMv6, the main synchronisation mechanism was the SWP instruction. SWP has two aspects. In a uniprocessor system it guarantees that nothing interrupts the sequence between the read and the write. In a multiprocessor system it ensures that no other master can access the location between the read and the write. For multiprocessor systems with complex memory hierarchies and long memory latencies, SWP creates performance bottlenecks.

SWP was replaced in the ARMv6 architecture by exclusive loads and stores (LDREX and STREX). These work on the principle of a monitor that exists for the location in memory and effectively tags the memory with the identity of the agent(s) trying to access it. In a spinlock implementation, an exclusive load reads data from memory and tags the location with the agent's identifier. A few instructions later, the agent uses an exclusive store to write data to memory, but the store only succeeds if the tag is still valid, and the tag is only valid if no other agent has modified that location since the exclusive load.
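To make the tagging idea concrete, here is a minimal sketch of it as a toy software model in portable C. It is not how the hardware monitor is implemented, and it is not real LDREX/STREX code. The names monitored_word_t, load_exclusive, store_exclusive and model_try_acquire are hypothetical, invented for this illustration. The only detail borrowed from the architecture is that the exclusive store reports 0 on success and 1 on failure, as STREX does.

```c
#include <stdbool.h>

#define NUM_AGENTS 2   /* assumption: two agents are enough for the illustration */

/* Toy model: one memory word plus one exclusive tag per agent. */
typedef struct {
    int  value;
    bool tagged[NUM_AGENTS];   /* true while that agent's tag is still valid */
} monitored_word_t;

/* Models an exclusive load: read the word and tag it for this agent. */
static int load_exclusive(monitored_word_t *w, int agent)
{
    w->tagged[agent] = true;
    return w->value;
}

/* Models an exclusive store: returns 0 on success, 1 on failure.
   A successful store invalidates every agent's tag for the word. */
static int store_exclusive(monitored_word_t *w, int agent, int new_value)
{
    if (!w->tagged[agent])
        return 1;              /* someone else wrote since our load */
    w->value = new_value;
    for (int i = 0; i < NUM_AGENTS; i++)
        w->tagged[i] = false;
    return 0;
}

/* One attempt to take a lock held in the word: true if this agent got it. */
static bool model_try_acquire(monitored_word_t *w, int agent)
{
    if (load_exclusive(w, agent) != 0)
        return false;          /* lock already held */
    return store_exclusive(w, agent, 1) == 0;
}
```

If two agents both perform the exclusive load and see the lock free, only the first exclusive store succeeds. The second agent's store fails because its tag was invalidated, so it has to go back and load again, which is exactly the retry loop a real spinlock spins in.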

At the same time that the load and store exclusives were added to the ARM architecture, the SWP instruction was deprecated, and the architecture notes that use of SWP is not guaranteed to work for SMP systems. The load and store exclusives and the deprecation of the SWP instruction are described in detail in the ARM ARM [1].

As an aid to removing legacy SWP instructions, ARMv7 allows you to disable the SWP instruction. In ARM SMP Linux, to help find legacy uses of the SWP instruction, we disable SWP but emulate the instruction (via the undefined instruction trap) and log those emulations. While this generates extra instruction overhead, it ensures that the software operates safely. You should also be aware that the SWP instruction does not exist in the Thumb-2 instruction set, so you will see errors if you try to assemble code containing the SWP instruction as Thumb-2.

For SMP performance we want to replace uses of the SWP instruction with appropriate load and store exclusive instructions in libraries and applications. The easiest way to achieve this is to use the GCC atomic built-ins (the __sync family documented in the GCC manual). ARM GCC will either directly generate the correct inline code or insert a call to a kernel user helper function containing the right code.
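To give a feel for these built-ins before turning to locks, here is a small sketch using two of them, __sync_fetch_and_add() and __sync_bool_compare_and_swap(). The helper names counter_fetch_add and raise_to_max are hypothetical and exist only for this illustration. The compare-and-swap loop has the same read, compute, conditionally write, retry shape as an exclusive load and store sequence.

```c
#include <stdbool.h>

/* Hypothetical helper: atomically add delta to *counter and
   return the value the counter held before the addition. */
static int counter_fetch_add(volatile int *counter, int delta)
{
    return __sync_fetch_and_add(counter, delta);
}

/* Hypothetical helper: atomically raise *slot to candidate if candidate
   is larger. Returns true if this call performed the update. */
static bool raise_to_max(volatile int *slot, int candidate)
{
    for (;;) {
        int seen = *slot;                 /* read the current value */
        if (seen >= candidate)
            return false;                 /* nothing to do */
        /* Write only if *slot still holds the value we read;
           otherwise another agent got in first, so retry. */
        if (__sync_bool_compare_and_swap(slot, seen, candidate))
            return true;
    }
}
```

Neither helper contains any architecture-specific code, so the same source builds for any target on which GCC supports these built-ins.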

As an example, consider the following assembly code function implementing a spin lock:

CODE

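ENTRY (__spin_lock)
    mov r1, #1          @ value that marks the lock as taken
1:  swp r2, r1, [r0]    @ atomically read the old value and write 1
    teq r2, #0          @ was the lock free?
    bne 1b              @ no: spin and try again
    mov r0, #0          @ yes: we own the lock, return 0
    bx  lr              @ return to the caller
END (__spin_lock)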

This would be replaced with

CODE

typedef struct {int flag; /* ... */} spinlock_t;

int __spin_lock(spinlock_t *lock)
{
    /* spin until the value previously held in the flag was 0 (lock free) */
    while (__sync_lock_test_and_set(&lock->flag, 1))
        ;
    return 0;
}

To release the lock, you need to call another of the GCC builtins, in this case __sync_lock_release():

CODE

void __spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(&lock->flag);
}

Code wishing to lock a data structure would look something like this:

CODE

// grab the lock
__spin_lock(&lock);
// modify the locked data structure
<modification code>
// release the lock
__spin_unlock(&lock);
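As a slightly fuller illustration of that pattern, the sketch below protects a shared counter while several POSIX threads increment it. Everything here is an assumption made for the example: the names demo_spinlock_t, shared_counter_t, demo_spin_lock, demo_spin_unlock, worker and run_workers, the use of pthreads, and the thread and iteration counts. The lock and unlock bodies are the same two built-ins used above, under different names to avoid clashing with identifiers reserved for the C library.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS           4        /* assumption: arbitrary demo size */
#define INCREMENTS_PER_THREAD 100000

typedef struct { int flag; } demo_spinlock_t;

typedef struct {
    demo_spinlock_t lock;
    long            total;             /* only touched while lock is held */
} shared_counter_t;

static void demo_spin_lock(demo_spinlock_t *lock)
{
    while (__sync_lock_test_and_set(&lock->flag, 1))
        ;                              /* spin until the old value was 0 */
}

static void demo_spin_unlock(demo_spinlock_t *lock)
{
    __sync_lock_release(&lock->flag);
}

static void *worker(void *arg)
{
    shared_counter_t *counter = arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        demo_spin_lock(&counter->lock);    /* grab the lock */
        counter->total++;                  /* modify the locked data */
        demo_spin_unlock(&counter->lock);  /* release the lock */
    }
    return NULL;
}

/* Runs the workers and returns the final total,
   or -1 if a thread could not be created. */
static long run_workers(void)
{
    shared_counter_t counter = { { 0 }, 0 };
    pthread_t threads[NUM_THREADS];
    int created = 0;

    while (created < NUM_THREADS &&
           pthread_create(&threads[created], NULL, worker, &counter) == 0)
        created++;

    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);

    return created == NUM_THREADS ? counter.total : -1;
}
```

With the lock in place, run_workers() should return NUM_THREADS * INCREMENTS_PER_THREAD. Without the lock and unlock calls, the increments from different threads can interleave and updates can be lost.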

The built-in functions take care of all of the details for you, including dealing with weakly ordered memory systems via memory barriers. This is only a brief introduction; for more details I suggest that you read the "Barrier Litmus Tests and Cookbook" document in ARM's Infocenter [2].

In the next article, I will explain how to implement spin locks in assembler and describe how memory barriers should be used.

References:

[1] ARM DDI 0406B_errata_2009_Q3 (ID100209): ARM® Architecture Reference Manual, ARM®v7-A and ARM®v7-R edition
[2] PRD03-GENC-007826 1.0: Barrier Litmus Tests and Cookbook

David Rusling, ARM Fellow. David was born a few weeks before Sputnik was launched. He's always liked mathematics, but America's space program together with 'Star Trek' made him think that computers were really interesting, and so he graduated in 1982 with a degree in Computer Science. The future turns out to have fewer flashing lights than he expected. After hacking networking boxes for Digital Equipment Corporation, he got involved in the port of Linux to the Alpha processor. This gave him an abiding respect for the power of open source in general and Linux in particular. He worked on StrongARM before moving to ARM, where he added tools experience. He's an ARM Fellow, which he says "really means that I'm a techno-dweeb with a wide freedom to meddle." His official role is to set the technical direction for ARM's tools and software story.
