Before ARMv6, the main synchronisation mechanism was the SWP instruction. SWP serves two purposes. In a uniprocessor system, it guarantees that the read and the write cannot be separated by an interrupt. In a multiprocessor system, it prevents other bus masters from accessing the location between the read and the write. For multiprocessor systems with complex memory hierarchies and long memory latencies, SWP creates performance bottlenecks.
SWP was replaced in the ARMv6 architecture by exclusive loads and stores (LDREX and STREX). These work on the principle of a monitor existing for the location in memory. The monitor effectively tags the memory with the identity of the agent(s) trying to access it. In a spinlock implementation, an exclusive load reads data from memory, tagging the location with the agent's identifier. A few instructions later, the agent uses an exclusive store to write data back to memory, but the store only succeeds if the tag is still valid, and the tag remains valid only if no other agent has modified that location since the exclusive load.
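The read, modify, attempt-to-write, retry pattern described above can be sketched in C. This is only an analogy: it uses GCC's __sync_bool_compare_and_swap built-in, which detects interference by comparing values rather than by tracking accesses as an exclusive monitor does, and which GCC typically expands to an LDREX/STREX loop on ARMv6 and later. The function name atomic_add_retry is a hypothetical name chosen for this sketch.

```c
/* Atomically add `delta` to *addr using a retry loop.
 * atomic_add_retry is a hypothetical name used only for illustration. */
int atomic_add_retry(int *addr, int delta)
{
    int observed, desired;

    do {
        /* Analogous to the exclusive load: read the current value. */
        observed = *(volatile int *)addr;
        desired = observed + delta;
        /* Analogous to the exclusive store: the write only happens if
         * *addr still holds `observed`. If another agent changed it in
         * the meantime, the built-in returns false and we retry. */
    } while (!__sync_bool_compare_and_swap(addr, observed, desired));

    return desired;
}
```

The loop mirrors the structure of a load-exclusive/store-exclusive sequence: the store can fail, and failure simply means starting again from the load.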
At the same time that the load and store exclusives were added to the ARM architecture, the SWP instruction was deprecated, and the architecture notes that use of SWP is not guaranteed to work for SMP systems. The load and store exclusives and the deprecation of the SWP instruction are described in detail in the ARM ARM [1].
As an aid to removing legacy SWP instructions, ARMv7 allows you to disable the SWP instruction. In ARM SMP Linux, to help find legacy uses of the SWP instruction, we disable SWP but emulate the instructions (via the undefined instruction trap) and log those emulations. While this adds instruction overhead, it ensures that the software operates safely. You should also be aware that the SWP instruction does not exist in the Thumb-2 instruction set, so you will see errors if you try to assemble code containing SWP as Thumb-2.
For SMP performance, we want to replace uses of the SWP instruction with appropriate load and store exclusive instructions in libraries and applications. The easiest way to achieve this is to make use of the GCC compiler's atomic built-ins. ARM GCC will either directly generate the correct inline code or insert a call to a kernel user helper function containing the right code.
As an example, consider the following assembly code function implementing a spin lock:
This would be replaced with
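A minimal sketch of what such a replacement might look like, using the GCC built-in __sync_lock_test_and_set(). The function name spin_lock and the convention that a lock word of 0 means free and 1 means held are assumptions made for this illustration.

```c
/* Hypothetical spin lock: *lock is 0 when free and 1 when held. */
void spin_lock(volatile int *lock)
{
    /* __sync_lock_test_and_set atomically writes 1 to *lock and returns
     * the previous contents. A previous value of 1 means another agent
     * already holds the lock, so keep trying until we see 0. */
    while (__sync_lock_test_and_set(lock, 1) == 1) {
        /* busy-wait */
    }
}
```

GCC documents this built-in as an acquire barrier, so memory accesses that follow the call are not moved ahead of the point where the lock is taken.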
To release the lock, you need to call another of the GCC builtins, in this case __sync_lock_release():
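A minimal sketch of a release function built on that call, assuming a lock word where 0 means free. The name spin_unlock is a hypothetical choice for this illustration.

```c
/* Hypothetical unlock: marks the lock word as free again. */
void spin_unlock(volatile int *lock)
{
    /* __sync_lock_release writes 0 to *lock with release semantics, so
     * stores made while holding the lock become visible before the
     * lock is seen as free. */
    __sync_lock_release(lock);
}
```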
Code wishing to lock a data structure would look something like this:
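As a sketch under stated assumptions, the pattern could look like the following. The structure shared_counter, the function counter_add, and the helper names spin_lock and spin_unlock are all hypothetical, and the lock word uses 0 for free and 1 for held.

```c
/* Hypothetical data structure protected by its own lock word. */
struct shared_counter {
    volatile int lock;   /* 0 = free, 1 = held */
    long value;          /* data guarded by the lock */
};

/* Acquire: spin until the previous value of the lock word was 0. */
static void spin_lock(volatile int *lock)
{
    while (__sync_lock_test_and_set(lock, 1) == 1) {
        /* busy-wait */
    }
}

/* Release: write 0 back to the lock word. */
static void spin_unlock(volatile int *lock)
{
    __sync_lock_release(lock);
}

/* Update the structure only while holding its lock, and return the
 * new value as seen inside the critical section. */
long counter_add(struct shared_counter *c, long delta)
{
    long result;

    spin_lock(&c->lock);
    c->value += delta;
    result = c->value;
    spin_unlock(&c->lock);

    return result;
}
```

Keeping the critical section this short matters for a spin lock, since any other agent that wants the lock burns cycles until it is released.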
The built-in functions take care of all of the details for you, including dealing with weakly ordered memory systems via memory barriers. This is only a brief introduction; for more details, I suggest that you read the "Barrier Litmus Tests and Cookbook" document in ARM's Infocenter.
In the next article, I explain how to implement spin locks in assembler and describe how memory barriers should be used.
References:
David Rusling, ARM Fellow: David was born a few weeks before Sputnik was launched. He's always liked mathematics, but America's space program together with 'Star Trek' made him think that computers were really interesting, and so he graduated in 1982 with a degree in Computer Science. The future turns out to have fewer flashing lights than he expected. After hacking networking boxes for Digital Equipment Corporation, he got involved in the port of Linux to the Alpha processor. This gave him an abiding respect for the power of open source in general and Linux in particular. He worked on StrongARM before moving to ARM, where he added tools experience. He's an ARM Fellow, which he says "really means that I'm a techno-dweeb with a wide freedom to meddle." His official role is to set the technical direction for ARM's tools and software story.