Enabling the LDAPR instructions for C/C++ compilers

Kyrylo Tkachov
November 15, 2023

The Arm architecture has supported the Release Consistent processor consistent (RCpc) model for load-acquires for a number of years. This feature provides the LDAPR family of instructions, which can be used to implement load-acquire semantics from C and C++. Recent compilers such as GCC 13.1 and LLVM 16 can now use these instructions automatically during code generation, resulting in significant performance gains on some multi-threaded workloads.

Mapping language constructs to the Arm architecture

Since the introduction of the Armv8-A architecture, the defined mapping for C++11 atomic loads with acquire memory ordering has been the LDAR instruction. For example:

#include <atomic>

std::atomic<unsigned long> data;

unsigned long foo() {
    return data.load(std::memory_order_acquire);
}

generates an LDAR instruction to perform the load from data. Such acquire loads are often paired with corresponding store-release operations that use the STLR instruction. This mapping looks simple at the language level, but there are many interesting ordering guarantees to consider at the assembly level. In particular, the reorderings allowed at the hardware level look like this:

[Figure: Ordering requirements on LDAR-based acquire-release sequences]

It is important for software to give enough freedom to the hardware to reorder memory accesses where possible according to the weak memory model, while still following the constraints of the source language memory model.
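As a minimal sketch of the acquire/release pairing described above, the following C++ shows a producer publishing data with a store-release and a consumer waiting on it with a load-acquire. The names payload, ready, producer, consumer and run_message_passing are hypothetical and chosen only for illustration. The instruction names in the comments describe the mapping discussed in this post and depend on the compiler and target options used.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
static unsigned long payload = 0;       // plain, non-atomic data
static std::atomic<bool> ready{false};  // flag that guards the payload

// Producer: write the payload, then publish it with a store-release
// (an STLR instruction on AArch64).
void producer(unsigned long value) {
    payload = value;
    ready.store(true, std::memory_order_release);
}

// Consumer: spin on a load-acquire (LDAR, or LDAPR where the compiler
// and target support it) until the flag is set, then read the payload.
unsigned long consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // The acquire load synchronizes with the release store, so the
    // producer's write to payload is visible here.
    return payload;
}

// Run one producer and one consumer thread and return the value seen.
unsigned long run_message_passing(unsigned long value) {
    ready.store(false, std::memory_order_relaxed);
    unsigned long seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p([&] { producer(value); });
    p.join();
    c.join();
    assert(ready.load(std::memory_order_relaxed));  // flag was published
    return seen;
}
```

Calling run_message_passing(42) should return 42 regardless of whether the acquire load is compiled to LDAR or LDAPR, because both satisfy the C++ acquire semantics.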

Enter LDAPR

We've been exploring the case for using the LDAPR instructions in place of the default LDAR instructions when they are available on the target CPU. The rationale is that the Release Consistent processor consistent (RCpc) model that these instructions implement should give the hardware more freedom to reorder memory accesses past STLR instructions, so long as there is no address dependency between them.

[Figure: Ordering requirements on LDAPR-based acquire-release sequences]

The freedom to move LDAPR instructions past preceding STLR instructions to unrelated addresses has the potential to improve performance in these sequences.
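To make the affected pattern concrete, here is a small sketch of a store-release followed by a load-acquire from an unrelated address. The variables outbox and inbox and the function names are hypothetical. The second function is included for contrast: sequentially consistent loads need the stronger ordering, so we would expect compilers to keep using LDAR for them.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical variables for illustration only.
std::atomic<unsigned long> outbox{0};  // written by this thread
std::atomic<unsigned long> inbox{0};   // written by some other thread

// A store-release followed by a load-acquire from a different address.
// With LDAR the load must stay ordered after the preceding STLR; with
// LDAPR the hardware is free to perform the load earlier.
unsigned long publish_then_poll(unsigned long value) {
    outbox.store(value, std::memory_order_release);  // STLR
    return inbox.load(std::memory_order_acquire);    // LDAR or LDAPR
}

// The same shape with sequentially consistent ordering. Here the load
// is expected to remain an LDAR even when LDAPR is available.
unsigned long publish_then_poll_seq_cst(unsigned long value) {
    outbox.store(value, std::memory_order_seq_cst);
    return inbox.load(std::memory_order_seq_cst);
}

// Single-threaded smoke test of the helpers above.
void publish_then_poll_demo() {
    inbox.store(7, std::memory_order_relaxed);
    assert(publish_then_poll(3) == 7);
    assert(outbox.load(std::memory_order_relaxed) == 3);
}
```

Both functions return the same values in a single-threaded run. The difference is only in how much reordering the hardware is allowed to do between the store and the load.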

Implementation and evaluation in compilers

Once we confirmed that the above relaxation from LDAR to LDAPR is legal under the C and C++ language rules, implementing the new mapping was comparatively simple.

As of the LLVM 16 and GCC 13.1 releases of the popular open-source compilers for AArch64, the code generated for the snippet above uses LDAPR when compiled with an -march or -mcpu option that supports it. It can also be enabled explicitly through the +rcpc option extension. For example, compiling with -O2 -mcpu=neoverse-v1 generates:

foo():
        adrp    x8, data
        add     x8, x8, :lo12:data
        ldapr   x0, [x8]
        ret
data:
        .zero   8
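One way to check from source code whether a build was configured for these instructions is to test the ACLE feature macro. The sketch below assumes the compiler defines __ARM_FEATURE_RCPC when the selected -march or -mcpu option includes the feature. Older compiler releases may not define it even if they can emit LDAPR, so treat a result of 0 as "unknown" rather than "unsupported". The function names are hypothetical.

```cpp
// Reports the RCpc level this translation unit was compiled for.
// Assumption: the compiler defines the ACLE macro __ARM_FEATURE_RCPC
// when the target options enable the feature. Returns 0 when the macro
// is absent, for example on non-Arm targets or older compilers.
constexpr int compiled_rcpc_level() {
#if defined(__ARM_FEATURE_RCPC)
    return __ARM_FEATURE_RCPC;
#else
    return 0;
#endif
}

// True when the build options advertise at least the base LDAPR
// instructions.
constexpr bool build_advertises_ldapr() {
    return compiled_rcpc_level() >= 1;
}
```

Inspecting the generated assembly, as in the listing above, remains the most direct way to confirm which instruction was actually emitted.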

To evaluate the performance impact of this change, we used the popular Data Plane Development Kit (DPDK) and the performance benchmarks it includes.

We are happy to report that we have seen impressive improvements on the ring sub-tests of DPDK, which perform enqueue and dequeue operations on a shared queue across multiple cores. In some cases the improvement is as high as 70% on modern CPUs, like the Neoverse V1, that implement the LDAPR instructions! A short analysis of the code indicates that the hot parts of that workload do rely on the CPU's memory system reordering the acquire memory accesses as aggressively as it can, and the extra freedom offered by the LDAPR instruction allows that.
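To illustrate why a ring workload is sensitive to this, here is a simplified single-producer, single-consumer ring. It is a sketch written for this explanation and is not DPDK's rte_ring, which supports multiple producers and consumers and is considerably more involved. The class name SpscRing and its members are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring; NOT DPDK's rte_ring.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer side. Returns false if the ring is full.
    bool enqueue(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire load of the consumer's index; pairs with the release
        // store in dequeue().
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        slots_[head & (N - 1)] = item;
        // Release store publishes the slot to the consumer.
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false if the ring is empty.
    bool dequeue(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        // Acquire load of the producer's index; pairs with the release
        // store in enqueue().
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = slots_[tail & (N - 1)];
        // Release store hands the slot back to the producer.
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

private:
    T slots_[N]{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};

// Single-threaded smoke test: fill the ring, then drain it in order.
void spsc_ring_demo() {
    SpscRing<int, 4> ring;
    for (int i = 0; i < 4; ++i) assert(ring.enqueue(i));
    assert(!ring.enqueue(99));  // full
    int value = -1;
    for (int i = 0; i < 4; ++i) {
        assert(ring.dequeue(value));
        assert(value == i);
    }
    assert(!ring.dequeue(value));  // empty
}
```

When enqueue or dequeue is called in a tight loop, the release store at the end of one call is immediately followed by the acquire load of a different index at the start of the next. That is the store-release then load-acquire shape where LDAPR gives the hardware more room than LDAR.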

Conclusion and next steps

If you are able to recompile your application for your particular target, then benefiting from this work is as simple as adding the appropriate -march or -mcpu option to your build flags.

The popular open-source compilers support this new functionality from GCC 13.1 and LLVM 16 onwards, as does Arm Compiler for Linux 23.04 or later.

The 2016 architecture extensions introduced the first LDAPR instructions, and later additions extend this group. FEAT_LRCPC2 from Armv8.4-A, for example, adds unscaled offset addressing modes and sign-extending loads such as the LDAPURSB instruction. We expect compilers to be updated to make more aggressive use of LDAPR-style instructions in the future.
