Caches and Self-Modifying Code: Implementing `__clear_cache`

Jacob Bramley
January 20, 2025
5 minute read time.

Some time ago, I posted an article about cache maintenance in self-modifying code. I described the use of the __clear_cache function (in Linux) to synchronize the instruction and data caches so that the processor executes what you want it to execute after you have written some code.

Most of the time, using an abstraction (like __clear_cache) is the best solution. However, there may be times when you need to implement it yourself, perhaps because you are writing a similar library function, or because you want something slightly different and need a starting point. Perhaps you just want to know how it works. That is what I’ll discuss here.

Implementing __clear_cache for AArch64 using the A64 instruction set

The A64 instruction set provides the necessary instructions to perform the required cache maintenance operations in user space (or ‘EL0’ in Arm terminology). This allows self-modifying code updates to be performed directly from EL0, without a system call. For example, __clear_cache on Linux requires no system calls.

Avoiding a dependency on a kernel interface potentially makes this approach fairly portable across operating systems. However, operating system kernels have the ability to deny EL0 access to the necessary instructions. If that’s the case on your target platform, this approach will not work, and you will need to rely on the tools provided by the system.

The Arm ARM (section B2.7.4.2 in DDI0487L.a) tells you exactly what you need to do:

  1. Write the new instructions to memory.
  2. Use dc cvau to clean the data cache to the point of unification. Loosely, the point of unification is the point at which data and instruction accesses see the same value for a given memory location. I simplified this to “memory” in my earlier article, but it is more typically an L2 cache. The architecture does not specify exactly where it is, but application code using dc cvau does not need to know.
  3. Use a dsb barrier to ensure that the data is visible before we move on.
  4. Use ic ivau to invalidate the instruction cache to the point of unification.
  5. Use another dsb barrier to ensure that the ic completes before the next instruction.
  6. Finally, an isb barrier flushes the pipeline, ensuring that any subsequent instructions that the processor has already started working on are discarded and reloaded.

The following sequence synchronizes a single cache line at x0:

...   // Write code to the cache line at x0.
dc cvau, x0
dsb ish
ic ivau, x0
dsb ish
isb
...   // It is now safe to execute the code in the cache line at x0.

These steps are effectively the same as what the Linux kernel does for 32-bit programs.

In practice, code buffers are likely to vary in length, and span multiple cache lines, so functions like __clear_cache will need to loop. To do that, you need to know the size of the system’s cache lines, which you can determine by reading the cache type register, ctr_el0, described in section D24.2.37 in DDI0487L.a of the Arm ARM. Here’s a simple (but complete) example using GCC inline assembly:

#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    // ctr_el0 is a 64-bit system register, so read it into a 64-bit variable.
    uint64_t ctr;
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    // Work out the line sizes for the I and D caches.
    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);

    for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
        asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
    }

    asm("dsb ish\n\t" : : : "memory");

    for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
        asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
    }

    asm("dsb ish\n\t"
        "isb\n\t"
        : : : "memory");
}

Some other realistic examples can be found in VIXL’s CPU::EnsureIAndDCacheCoherency and in Google V8.
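
The shift arithmetic and the loop bounds in the example above can be checked on any host, without touching the cache. The following is a minimal sketch in plain C: LineSizeFromField and CountLines are hypothetical helper names introduced here for illustration, and the register value used in the note below is an illustrative number, not one read from a particular core.

```c
#include <stdint.h>
#include <stddef.h>

// Decode a 4-bit line-size field of ctr_el0. The field holds log2 of the
// line size in 4-byte words, so the size in bytes is 4 << field.
// `shift` is 16 for the D cache field and 0 for the I cache field,
// matching the shifts used in the example above.
uintptr_t LineSizeFromField(uint64_t ctr, unsigned shift) {
    return (uintptr_t)4 << ((ctr >> shift) & 0xf);
}

// Count how many lines the maintenance loop visits for [start, end).
// This mirrors the loop in the example: round start down to a line
// boundary, then step one line at a time until reaching end.
size_t CountLines(uintptr_t start, uintptr_t end, uintptr_t line_size) {
    size_t count = 0;
    for (uintptr_t line = start & ~(line_size - 1); line < end; line += line_size) {
        count++;
    }
    return count;
}
```

With the illustrative value 0x8444c004, both fields decode to 4, giving 64-byte lines. A buffer covering 0x1038 to 0x1088 then touches three lines (0x1000, 0x1040 and 0x1080), even though it is only 80 bytes long. That is why the loop rounds the start address down rather than stepping from start directly.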

Cache-coherent implementations

Although the Arm architecture does not require automatic coherency between data writes and instruction fetches, it does allow for implementations that provide it. On such implementations, one or both of the cache-maintenance operations can be omitted:

  • If ctr_el0.IDC is set, the implementation does not require an explicit dc cvau. The dsb is still required, to ensure that outstanding stores complete.
  • If ctr_el0.DIC is set, the implementation does not require an explicit ic ivau.

Taking this into account, the example implementation looks like this:

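#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    // ctr_el0 is a 64-bit system register, so read it into a 64-bit variable.
    uint64_t ctr;
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    // Work out the line sizes for the I and D caches.
    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);
    // IDC (bit 28) and DIC (bit 29) indicate which maintenance can be skipped.
    bool n_idc = ((ctr >> 28) & 0x1) == 0;
    bool n_dic = ((ctr >> 29) & 0x1) == 0;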

    if (n_idc) {
        for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
            asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
        }
    }

    asm("dsb ish\n\t" : : : "memory");

    if (n_dic) {
        for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
            asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
        }
        asm("dsb ish\n\t" : : : "memory");
    }

    asm("isb\n\t" : : : "memory");
}
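
A typical caller has a pointer and a length rather than two integer addresses. The sketch below shows one way to wrap the routine for that case. InstallCode is a hypothetical helper introduced here, the AArch64 branch is a compact copy of the basic loop (without the IDC and DIC shortcuts), and the fallback to the compiler's __builtin___clear_cache on other targets is an assumption made so that the sketch builds anywhere.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#if defined(__aarch64__) && defined(__linux__)
// Compact copy of the basic routine, assuming EL0 may use these instructions.
void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    uint64_t ctr;
    __asm__("mrs %[ctr], ctr_el0" : [ctr] "=r" (ctr));
    uintptr_t const dsize = (uintptr_t)4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = (uintptr_t)4 << ((ctr >> 0) & 0xf);
    for (uintptr_t d = start & ~(dsize - 1); d < end; d += dsize) {
        __asm__ volatile("dc cvau, %0" : : "r" (d) : "memory");
    }
    __asm__ volatile("dsb ish" : : : "memory");
    for (uintptr_t i = start & ~(isize - 1); i < end; i += isize) {
        __asm__ volatile("ic ivau, %0" : : "r" (i) : "memory");
    }
    __asm__ volatile("dsb ish\n\tisb" : : : "memory");
}
#else
// Assumption: on any other target, defer to the compiler builtin.
void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    __builtin___clear_cache((char *)start, (char *)end);
}
#endif

// Hypothetical helper: copy `len` bytes of machine code into `dst`, then
// synchronize the caches over exactly that range. Note that `end` is
// exclusive: it is the address one past the last byte written.
void *InstallCode(void *dst, const void *code, size_t len) {
    memcpy(dst, code, len);
    uintptr_t const start = (uintptr_t)dst;
    EnsureIAndDCacheCoherency(start, start + len);
    return dst;
}
```

The helper only makes the new bytes visible to instruction fetch. Mapping the destination as executable, and deciding when it is safe to overwrite code that another thread might be running, remain the caller's responsibility.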

Implementing __clear_cache for AArch32: Arm (A32) and Thumb (T32)

In AArch32, neither A32 nor T32 offers similar EL0 instructions, so __clear_cache works by calling into the Linux kernel. You can do this directly as follows:

ldr   r0, =start_address
ldr   r1, =end_address
mov   r2, #0              @ r2 _must_ be zero.
ldr   r7, =0x000f0002
svc   0                   @ The svc number is ignored.

The important thing is that the registers r0, r1, r2 and r7 are set properly when the svc executes; it doesn’t matter how you achieve this. If the arguments are already in the right registers, for example, you might not need to do anything. The Google V8 JavaScript engine uses GCC inline assembly to do it and lets the compiler worry about the best way to get the values where they need to be.
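
The same call can also be made from C without writing any assembly, by letting the C library's syscall() function load the registers. This is a minimal sketch rather than how any particular project does it. FlushCodeRange and ARM_CACHEFLUSH_SYSCALL are names introduced here for illustration, the syscall number is the value loaded into r7 above, and the fallback to __builtin___clear_cache on other targets is an assumption made so that the sketch builds anywhere.

```c
#if defined(__arm__) && defined(__linux__)
#define _GNU_SOURCE          // Needed for the syscall() declaration.
#include <unistd.h>
#include <sys/syscall.h>
#endif

// The number loaded into r7 in the assembly sequence above.
#define ARM_CACHEFLUSH_SYSCALL 0x000f0002

// Hypothetical wrapper: synchronize the caches for [start, end).
// Returns 0 on success and -1 on failure.
int FlushCodeRange(void *start, void *end) {
#if defined(__arm__) && defined(__linux__)
    // The arguments correspond to r0 (start), r1 (end) and r2, which
    // must be zero, as in the assembly sequence above.
    return (int)syscall(ARM_CACHEFLUSH_SYSCALL, start, end, 0);
#else
    // Assumption: on any other target, defer to the compiler builtin.
    __builtin___clear_cache((char *)start, (char *)end);
    return 0;
#endif
}
```

Going through syscall() costs an extra function call compared with inline assembly, but it keeps register allocation entirely in the hands of the compiler and the C library.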

    Jubilee Young 3 months ago

    Out of curiosity, why can't the implementation be this?

    dc cvau, x0
    ic ivau, x0
    dsb ish
    isb
    Jacob Bramley 3 months ago in reply to Jubilee Young

    We need the DC to complete — meaning actually write to point of unification — before the I cache refills.

    In your example, the first DSB is missing, so the IC can complete, and the I cache for x0 be refilled, before the DC itself completes. The second DSB waits for both, but the refill could already have occurred by then.
