Enabling the LDAPR instructions for C/C++ compilers

November 15, 2023


The Arm architecture has supported the Release Consistent processor consistent (RCpc) model for load-acquires for a number of years. This feature provides the LDAPR family of instructions, which can be used to implement load-acquire semantics from C and C++. Recent compilers such as GCC 13.1 and LLVM 16 can now use these instructions automatically during code generation, resulting in significant performance gains on some multi-threaded workloads.

Mapping language constructs to the Arm architecture

Since the introduction of the Armv8-A architecture, the defined mapping for C++11 atomic loads with acquire memory ordering has been the LDAR instruction. For example:

#include <atomic>

std::atomic<unsigned long> data;

unsigned long foo() {
    return data.load(std::memory_order_acquire);
}

generates an LDAR instruction to perform the load from data. Such acquire loads are often paired with corresponding store-release operations using the STLR instruction. This mapping looks simple at the language level, but there are many interesting ordering guarantees to consider at the assembly level. In particular, the reorderings allowed at the hardware level look like:

Ordering requirements on LDAR-based acquire-release sequences

It is important for software to give enough freedom to the hardware to reorder memory accesses where possible according to the weak memory model, while still following the constraints of the source language memory model.
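The acquire/release pairing described above can be sketched as a simple handoff between two threads. This is a minimal illustration, not code from any particular project: the names payload, ready, producer, consumer and run_handoff are hypothetical, and the instruction names in the comments describe the typical AArch64 mapping rather than a guarantee for every compiler and option set.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, for illustration only.
unsigned long payload = 0;        // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;                                  // ordinary store
    ready.store(true, std::memory_order_release);  // typically STLR on AArch64
}

unsigned long consumer() {
    // Acquire load: LDAR by default, or LDAPR where the compiler enables it.
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    // The release store synchronizes with the acquire load that read true,
    // so the write to payload is visible here.
    return payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
unsigned long run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    unsigned long seen = 0;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

The language-level guarantee is the same whichever instruction the compiler picks for the acquire load. Only the amount of reordering the hardware is allowed to do around it differs.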

Enter LDAPR

We've been exploring the case for using the LDAPR instructions in place of the default LDAR instructions when they are available on the target CPU. The rationale is that the Release Consistent processor consistent model that these instructions implement should give the hardware freedom to reorder memory accesses past STLR instructions more aggressively, so long as there is no address dependency between them.

Ordering requirements on LDAPR-based acquire-release sequences

The freedom to move LDAPR instructions past preceding STLR instructions to unrelated addresses has the potential to improve performance in these sequences.
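The kind of sequence this applies to might look like the sketch below: a store-release to one atomic followed by a load-acquire from an unrelated one. The names flag_a, flag_b and publish_then_poll are hypothetical, and the comments describe the expected AArch64 mapping under the assumption that the compiler follows the scheme discussed in this post.

```cpp
#include <atomic>

// Two unrelated atomic locations (hypothetical names).
std::atomic<unsigned long> flag_a{0};
std::atomic<unsigned long> flag_b{0};

unsigned long publish_then_poll(unsigned long value) {
    // Store-release: typically an STLR instruction.
    flag_a.store(value, std::memory_order_release);

    // Load-acquire from a different address. With LDAR the hardware must
    // keep this load ordered after the preceding STLR; with LDAPR it is
    // free to perform the load earlier. The C++ memory model permits that
    // reordering for release/acquire (it would not for seq_cst).
    return flag_b.load(std::memory_order_acquire);
}
```

If the two operations used std::memory_order_seq_cst instead, the stronger ordering would have to be preserved, so the relaxation described here would not be expected to apply.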

Implementation and evaluation in compilers

Once we had confirmed that the above relaxation from LDAR to LDAPR is legal under the C and C++ language rules, implementing the new mapping was comparatively simple.

As of the LLVM 16 and GCC 13.1 releases of the popular open-source compilers for AArch64, the code generated for the snippet above uses LDAPR when compiled with an -march or -mcpu option that supports it. It can also be enabled explicitly through the +rcpc extension option. For example, compiling with -O2 -mcpu=neoverse-v1 generates:

foo():
        adrp    x8, data
        add     x8, x8, :lo12:data
        ldapr   x0, [x8]
        ret
data:
        .zero   8
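One way to check from source whether a build has the feature enabled is to test the ACLE feature macro __ARM_FEATURE_RCPC. The sketch below assumes that your compiler version defines this macro when RCpc is enabled, which is worth verifying against its documentation. The helper name rcpc_level is hypothetical.

```cpp
// Returns the value of the ACLE macro __ARM_FEATURE_RCPC if the compiler
// defines it for this translation unit, or 0 otherwise (for example when
// building for a target or with options that do not enable RCpc).
constexpr int rcpc_level() {
#if defined(__ARM_FEATURE_RCPC)
    return __ARM_FEATURE_RCPC;
#else
    return 0;
#endif
}
```

A non-zero result only says the compiler was allowed to use the instructions. Inspecting the generated assembly, as above, remains the direct way to confirm that LDAPR was actually emitted.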

To evaluate the performance impact of this change, we used the popular Data Plane Development Kit (DPDK) and the performance benchmarks it includes.

We are happy to report that on the ring sub-tests of DPDK that perform enqueue and dequeue operations on a shared queue across multiple cores we have seen impressive improvements. Sometimes they go up to 70% on modern CPUs like the Neoverse V1 that implement the LDAPR instructions! A short analysis of the code indicates that the hot parts of that workload indeed rely on the memory system of the CPU reordering the acquire memory accesses as aggressively as it can, and the extra freedom allowed by the LDAPR instruction allows that.
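To give a feel for why a ring workload is sensitive to this, here is a minimal single-producer/single-consumer ring sketch. It is not DPDK's ring implementation. The class SpscRing and its members are hypothetical and only illustrate the general pattern of acquire loads and release stores on separate index variables.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring, for illustration only.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

    T slots_[N];
    std::atomic<std::size_t> head_{0};  // next slot to write (producer side)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer side)

public:
    bool enqueue(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire load of the index owned by the other thread.
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) {
            return false;  // ring is full
        }
        slots_[head & (N - 1)] = value;
        // Release store publishes the slot to the consumer.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    bool dequeue(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire load of the index owned by the other thread.
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) {
            return false;  // ring is empty
        }
        out = slots_[tail & (N - 1)];
        // Release store hands the slot back to the producer.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
};
```

In a tight loop, each thread's release store to its own index is soon followed by an acquire load of the other thread's index, which is a different address. That is the store-release then load-acquire shape where LDAPR gives the hardware more room than LDAR.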

Conclusion and next steps

If you are able to recompile your application for your particular target, then benefiting from this work is as simple as adding the appropriate -march or -mcpu option to your build flags.

The popular open-source compilers support this new functionality from GCC 13.1 and LLVM 16 onwards, as does Arm Compiler for Linux 23.04 or later.

The 2016 architecture extensions introduced the first instances of the LDAPR instructions, and later additions extend this group of instructions. FEAT_LRCPC2 from Armv8.4-A, for example, adds unscaled offset addressing modes and sign-extending loads such as the LDAPURSB instruction. We expect compilers to be updated to make more aggressive use of LDAPR-style instructions in the future.
