Here is a minimal C implementation of a spinlock "lock" operation using GCC's built-in atomics:
#include <stdbool.h>

void spin_lock(bool *l)
{
    while (__atomic_test_and_set(l, __ATOMIC_ACQUIRE))
        ;
}
I am concerned by GCC's output when compiling for AArch64:
spin_lock:
        mov     w2, 1
        .p2align 2
.L4:
        ldaxrb  w1, [x0]
        stxrb   w3, w2, [x0]
        cbnz    w3, .L4
        uxtb    w1, w1
        cbnz    w1, .L4
        ret
The ldaxrb surely prevents subsequent memory accesses from being reordered before it, but, to my understanding, nothing prevents those accesses from being reordered between the ldaxrb and stxrb. If I understand correctly, the acquire barrier should be placed after stxrb, not before.
When compiling for ARM, however, GCC correctly inserts a dmb after strexb:
spin_lock:
        mov     r2, #1
.L4:
        ldrexb  r3, [r0]
        strexb  r1, r2, [r0]
        cmp     r1, #0
        bne     .L4
        tst     r3, #255
        dmb     sy
        bne     .L4
        bx      lr
Am I missing something? If GCC's output for AArch64 is correct, could anyone explain what enforces the acquire memory ordering I specified? If it is not, what would be a correct solution (besides GCC's solution for ARM)?
I am using Linaro's gcc-linaro-5.3-2016.02-x86_64_aarch64-elf and gcc-linaro-4.9-2015.02-3-x86_64_arm-eabi toolchains.
While having a barrier after the STXR would certainly work, I don't think it is necessary.
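For comparison, the barrier-after-the-store variant can be written with the same GCC built-ins: a relaxed test-and-set followed by an acquire fence once the lock has been taken. This is only a sketch; spin_lock_fenced is a hypothetical name that does not appear in the question, and which instructions GCC emits for it depends on the compiler version and target.

```c
#include <stdbool.h>

/* Hypothetical variant: take the lock with a relaxed RMW, then fence. */
void spin_lock_fenced(bool *l)
{
    /* No ordering is requested on the test-and-set itself. */
    while (__atomic_test_and_set(l, __ATOMIC_RELAXED))
        ;
    /* The acquire fence keeps the accesses that follow from being
       reordered before the read done by the test-and-set above. */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
}
```

The fence only runs on the path where the lock was actually obtained, so a contended loop does not pay for it on every failed attempt.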
The LDAXR guarantees that explicit accesses after it in program order are not reordered before it. The processor might start speculatively fetching between the LDAXR and the STXR. If the STXR fails, those speculative accesses are simply discarded. If the STXR succeeds, they still happened _after_ you saw the spinlock available, and you know that no other thread or processor jumped in between the LDAXR and the STXR, as otherwise the STXR would not have succeeded.
So while it is technically true that they could occur before you owned the spinlock, I don't see that this actually breaks anything.
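To make the argument concrete, here is a minimal sketch of the lock used around a critical section. spin_lock is the function from the question; spin_trylock, spin_unlock and locked_increment are hypothetical helpers added for illustration, assuming the same GCC built-ins. The release store in spin_unlock is what the next owner's acquire load pairs with.

```c
#include <stdbool.h>

/* From the question: spin until the flag was previously clear. */
void spin_lock(bool *l)
{
    while (__atomic_test_and_set(l, __ATOMIC_ACQUIRE))
        ;
}

/* Hypothetical helper: a single attempt; returns true if the lock was taken. */
bool spin_trylock(bool *l)
{
    return !__atomic_test_and_set(l, __ATOMIC_ACQUIRE);
}

/* Hypothetical helper: clear the flag with release ordering so that the
   critical section is visible to whoever acquires the lock next. */
void spin_unlock(bool *l)
{
    __atomic_clear(l, __ATOMIC_RELEASE);
}

/* Example critical section: increment a shared counter under the lock. */
long locked_increment(bool *l, long *counter)
{
    spin_lock(l);
    long v = ++*counter;
    spin_unlock(l);
    return v;
}
```

Any access inside locked_increment that the core starts early only takes effect if the test-and-set succeeded, which is the case described above.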
I'm not a GCC expert, but is it possible that you were building for ARMv7-A, which doesn't include the Load-Acquire instructions?
Once you switch to using a DMB, you could arguably place it where it is (line 09, the dmb sy) or at pretty much any point after the LDREX. I suspect the rationale for its placement is that if the STREX fails you can avoid the cost of the barrier.
Thank you for your answer.
> The LDAXR guarantees that explicit accesses after it in program order are not reordered before it. The processor might start speculatively fetching between the LDAXR and the STXR. If the STXR fails, those speculative accesses are simply discarded. If the STXR succeeds, they still happened _after_ you saw the spinlock available, and you know that no other thread or processor jumped in between the LDAXR and the STXR, as otherwise the STXR would not have succeeded. So while it is technically true that they could occur before you owned the spinlock, I don't see that this actually breaks anything.
Right. Speculative writes aren't a problem either, as they can only be observed after the cbnz has been architecturally resolved.
Indeed, I was compiling for ARMv7-A precisely to investigate where GCC would insert a dmb to implement the acquire barrier in the absence of ARMv8's load-acquire.