The Arm architecture has supported the Release Consistent processor consistent (RCpc) model for load-acquires for a number of years. This feature provides the LDAPR family of instructions, which can be used to implement load-acquire semantics from C and C++. Recent compilers such as GCC 13.1 and LLVM 16 can now use these instructions in code generation automatically, resulting in significant performance gains on some multi-threaded workloads.
Since the introduction of the Armv8-A architecture the defined mapping of atomic loads with an acquire memory model from C++11 has been to use the LDAR instruction. For example:
    #include <atomic>

    std::atomic<unsigned long> data;

    unsigned long foo() {
        return data.load(std::memory_order_acquire);
    }
generates an LDAR instruction to perform the load from data. Such acquire loads are often paired with corresponding store-release operations that use the STLR instruction. This mapping looks simple at the language level, but there are many interesting ordering guarantees to consider at the assembly level. In particular, the hardware defines which reorderings are allowed around these instructions.
It is important for software to give enough freedom to the hardware to reorder memory accesses where possible according to the weak memory model, while still following the constraints of the source language memory model.
We've been exploring the case for using LDAPR instructions in place of the default LDAR instructions when they are available on the target CPU. The rationale is that the Release Consistent processor consistent (RCpc) model that these instructions implement should give the hardware freedom to reorder memory accesses past STLR instructions more aggressively, so long as there is no address dependency between them.
The freedom to move LDAPR instructions past preceding STLR instructions to unrelated addresses has the potential to improve performance in these sequences.
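To make the pattern concrete, here is a minimal sketch of a store-release followed by a load-acquire from an unrelated address. The names `outbox`, `inbox`, and `publish_then_poll` are hypothetical and chosen only for illustration. On AArch64 the store would typically become an STLR, and the load an LDAR or, when RCpc is enabled, an LDAPR.

```cpp
#include <atomic>

// Hypothetical variables for illustration only.
std::atomic<unsigned long> outbox{0};  // published by this thread
std::atomic<unsigned long> inbox{0};   // published by some other thread

unsigned long publish_then_poll(unsigned long value) {
    // Store-release: typically an STLR on AArch64.
    outbox.store(value, std::memory_order_release);

    // Load-acquire from a different address: LDAR by default,
    // or LDAPR when the compiler targets a CPU with RCpc.
    return inbox.load(std::memory_order_acquire);
}
```

Neither operation is `memory_order_seq_cst`, so the C++ memory model does not require the load to stay behind the earlier store to a different location. That is the freedom an LDAPR can pass on to the hardware.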
Once we confirmed that the above relaxation from LDAR to LDAPR is legal under the C and C++ language rules, implementing the new mapping was comparatively simple.
As of the LLVM 16 and GCC 13.1 releases of the popular open-source compilers for AArch64, the code generated for the snippet above uses LDAPR when compiled with an -march or -mcpu option that supports it. It can also be enabled explicitly through the +rcpc extension. For example, compiling with -O2 -mcpu=neoverse-v1 generates:
    foo():
            adrp    x8, data
            add     x8, x8, :lo12:data
            ldapr   x0, [x8]
            ret
    data:
            .zero   8
To evaluate the performance impact of this change we used the popular Data Plane Development Kit (DPDK) and the performance benchmarks it includes.
We are happy to report that on the ring sub-tests of DPDK, which perform enqueue and dequeue operations on a shared queue across multiple cores, we have seen impressive improvements. In some cases the gains reach up to 70% on modern CPUs like the Neoverse V1 that implement the LDAPR instructions! A short analysis of the code indicates that the hot parts of that workload rely on the CPU's memory system reordering the acquire memory accesses as aggressively as it can, and the extra freedom granted by the LDAPR instruction makes that possible.
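As a rough illustration of why ring workloads are sensitive to this, here is a minimal single-producer, single-consumer ring sketch. It is not DPDK's ring implementation; the class `SpscRing` and its members are hypothetical. Each operation ends with a store-release to one index and the next operation begins with a load-acquire of the other index, which is the store-release then load-acquire pattern discussed earlier.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring for illustration; not DPDK's rte_ring.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Called by the single producer thread.
    bool enqueue(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire load of the consumer's index (LDAR or LDAPR on AArch64).
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;  // ring is full
        slots_[head & (N - 1)] = item;
        // Release store publishes the slot (typically STLR on AArch64).
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Called by the single consumer thread.
    bool dequeue(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire load of the producer's index.
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;  // ring is empty
        out = slots_[tail & (N - 1)];
        // Release store hands the slot back to the producer.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

In a loop of back-to-back enqueues, the release store to `head_` is immediately followed by an acquire load of `tail_`, a different address. With LDAR the load has to wait behind the earlier STLR, while LDAPR does not impose that ordering.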
If you are able to recompile your application for your particular target then benefiting from this work is as simple as adding the appropriate -march or -mcpu option to your build flags.
The popular open-source compilers support this new functionality from GCC 13.1 and LLVM 16 onwards, as does Arm Compiler for Linux 23.04 or later.
The 2016 architecture extensions introduced the first LDAPR instructions, and later additions extend this group. FEAT_LRCPC2 from Armv8.4-A, for example, adds unscaled offset addressing modes and sign-extending forms such as the LDAPURSB instruction. We expect compilers to be updated to make more aggressive use of LDAPR-style instructions in the future.