Caches and Self-Modifying Code: Implementing `__clear_cache`

January 20, 2025

Some time ago, I posted an article about cache maintenance in self-modifying code. I described the use of the __clear_cache function (in Linux) to synchronize the instruction and data caches so that the processor executes what you want it to execute after you have written some code.

Most of the time, using an abstraction (like __clear_cache) is the best solution. However, there may be times when you need to implement it yourself, possibly because you are actually implementing a similar library function, or want something slightly different and want to know where to start from. Perhaps you just want to know how it works. That is what I’ll discuss here.
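For reference, this is roughly what using the abstraction looks like from C. It is a minimal sketch, assuming a GCC or Clang toolchain (which provide the `__builtin___clear_cache` builtin) and a destination buffer that the caller has already arranged to be writable. The helper name `publish_code` is hypothetical and only here for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of machine code into `dst`, then ask
 * the compiler runtime to synchronize the caches for that range.
 * Making `dst` executable (mmap/mprotect) is left to the caller. */
static void publish_code(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);
    char *begin = (char *)dst;
    /* The builtin takes the start and the (exclusive) end of the range. */
    __builtin___clear_cache(begin, begin + len);
}
```

On targets where no maintenance is needed, the builtin typically compiles to nothing, which is one reason the abstraction is usually the best choice.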

Implementing `__clear_cache` for AArch64 using the A64 instruction set

The A64 instruction set provides the necessary instructions to perform the required cache maintenance operations in user space (or ‘EL0’ in Arm terminology). This allows self-modifying code updates to be performed directly from EL0, without a system call. For example, __clear_cache on Linux requires no system calls.

Avoiding a dependency on a kernel interface potentially makes this approach fairly portable across operating systems. However, operating system kernels have the ability to deny EL0 access to the necessary instructions. If that’s the case on your target platform, this approach will not work, and you will need to rely on the tools provided by the system.

The Arm ARM (section B2.7.4.2 in DDI0487L.a) tells you exactly what you need to do:

1. Write the new instructions to memory.
2. Use dc cvau to clean the data cache to the point of unification. Loosely, the point of unification is the point at which data and instruction accesses see the same value for a given memory location. I simplified this to “memory” in my earlier article, but it is more typically an L2 cache. The architecture does not specify exactly where it is, but application code using dc cvau does not need to know.
3. Use a dsb barrier to ensure that the data is visible before we move on.
4. Use ic ivau to invalidate the instruction cache to the point of unification.
5. Use another dsb barrier to ensure that the ic completes before the next instruction.
6. Finally, an isb barrier flushes the pipeline, ensuring that any subsequent instructions that the processor has already started working on are discarded and reloaded.

The following sequence synchronizes a single cache line at x0:

...   // Write code to the cache line at x0.
dc cvau, x0
dsb ish
ic ivau, x0
dsb ish
isb
...   // It is now safe to execute the code in the cache line at x0.

These steps are effectively the same as what the Linux kernel does for 32-bit programs.

In practice, code buffers are likely to vary in length and span multiple cache lines, so functions like __clear_cache will need to loop. To do that, you need to know the size of the system’s cache lines, which you can determine by reading the cache type register, ctr_el0, described in section D24.2.37 of the Arm ARM (DDI0487L.a). Here’s a simple (but complete) example using GCC inline assembly:

#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    uint64_t ctr;  // mrs always writes a 64-bit (x) register.
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    // Work out the line sizes for the I and D caches.
    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);

    for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
        asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
    }

    asm("dsb ish\n\t" : : : "memory");

    for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
        asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
    }

    asm("dsb ish\n\t"
        "isb\n\t"
        : : : "memory");
}

Some other realistic examples can be found in VIXL’s CPU::EnsureIAndDCacheCoherency and in Google V8.
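To make the line-size arithmetic concrete, here is a small sketch that decodes the two fields from a ctr_el0 value passed in as a plain integer, so it can be tried on any machine. The helper names are hypothetical, and the value 0x84448004 used in the note below is only an illustrative register value, not one taken from a specific system.

```c
#include <stdint.h>

/* DminLine (bits [19:16]) and IminLine (bits [3:0]) hold log2 of the line
 * size in 4-byte words, so the size in bytes is 4 << field. */
static uint64_t dcache_line_bytes(uint64_t ctr) {
    return UINT64_C(4) << ((ctr >> 16) & 0xf);
}

static uint64_t icache_line_bytes(uint64_t ctr) {
    return UINT64_C(4) << (ctr & 0xf);
}

/* Number of iterations the maintenance loop performs for [start, end):
 * it begins at `start` rounded down to a line boundary and steps one line
 * at a time while the address is below `end`. */
static uint64_t lines_touched(uint64_t start, uint64_t end, uint64_t line) {
    uint64_t first = start & ~(line - 1);
    if (end <= first) {
        return 0;
    }
    return (end - first + line - 1) / line;
}
```

With the illustrative value 0x84448004, both fields are 4, giving 64-byte lines. A buffer from 0x1030 to 0x1090 then touches three lines (0x1000, 0x1040 and 0x1080), which is why the loop rounds the start address down rather than stepping from `start` itself.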

Cache-coherent implementations

Although the Arm architecture does not require automatic coherency between data writes and instruction fetches, it does allow for implementations that provide it. On such implementations, one or both of the cache-maintenance operations can be omitted:

If ctr_el0.IDC is set, the implementation does not require an explicit dc cvau. The dsb is still required, to ensure that outstanding stores complete.
If ctr_el0.DIC is set, the implementation does not require an explicit ic ivau.

Taking this into account, the example implementation looks like this:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    uint64_t ctr;  // mrs always writes a 64-bit (x) register.
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);
    bool n_idc = ((ctr >> 28) & 0x1) == 0;
    bool n_dic = ((ctr >> 29) & 0x1) == 0;

    if (n_idc) {
        for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
            asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
        }
    }

    asm("dsb ish\n\t" : : : "memory");

    if (n_dic) {
        for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
            asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
        }
        asm("dsb ish\n\t" : : : "memory");
    }

    asm("isb\n\t" : : : "memory");
}
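The IDC and DIC checks can also be exercised without Arm hardware. The sketch below takes a ctr_el0 value as a plain integer and reports which maintenance loops would run. The struct and function names are hypothetical, and any register values you feed it are your own test inputs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Which cache-maintenance loops are still required for a given ctr_el0. */
struct sync_plan {
    bool need_dc_cvau;  /* false when ctr_el0.IDC (bit 28) is set */
    bool need_ic_ivau;  /* false when ctr_el0.DIC (bit 29) is set */
};

static struct sync_plan plan_from_ctr(uint64_t ctr) {
    struct sync_plan plan;
    plan.need_dc_cvau = ((ctr >> 28) & 0x1) == 0;
    plan.need_ic_ivau = ((ctr >> 29) & 0x1) == 0;
    return plan;
}
```

Note that the plan only covers the dc and ic loops. As in the example above, the first dsb and the final isb are issued regardless of these bits.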

Implementing `__clear_cache` for AArch32: Arm (A32) and Thumb (T32)

In AArch32, neither A32 nor T32 offers equivalent EL0 instructions, so __clear_cache works by calling into the Linux kernel. You can do this directly as follows:

ldr   r0, =start_address
ldr   r1, =end_address
mov   r2, #0              @ r2 _must_ be zero.
ldr   r7, =0x000f0002
svc   0                   @ The svc number is ignored.

The important thing is that the registers r0, r1, r2 and r7 are set properly when the svc executes; it doesn’t matter how you achieve this. If the arguments are already in the right registers, for example, you might not need to do anything. The Google V8 JavaScript engine uses GCC inline assembly to do it and lets the compiler worry about the best way to get the values where they need to be.
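As a rough illustration of the inline assembly approach, here is a sketch that pins the arguments to r0, r1, r2 and r7 with GCC register variables and lets the compiler load them. This is not V8's code. The function name `flush_icache_arm32` is hypothetical, the non-Arm fallback exists only so that the sketch builds elsewhere, and I am assuming the kernel returns 0 in r0 on success.

```c
#include <stddef.h>

/* Hypothetical wrapper around the AArch32 Linux cache-flush call. */
static int flush_icache_arm32(void *start, void *end) {
#if defined(__arm__) && defined(__linux__)
    register long r0 asm("r0") = (long)start;
    register long r1 asm("r1") = (long)end;
    register long r2 asm("r2") = 0;            /* r2 must be zero. */
    register long r7 asm("r7") = 0x000f0002;   /* Same value as in the listing above. */
    asm volatile("svc 0"
                 : "+r" (r0)
                 : "r" (r1), "r" (r2), "r" (r7)
                 : "memory");
    return (int)r0;  /* Assumed: 0 on success. */
#else
    /* Not AArch32 Linux: defer to the compiler builtin so the sketch builds. */
    __builtin___clear_cache((char *)start, (char *)end);
    return 0;
#endif
}
```

One caveat: in Thumb code built with a frame pointer, the compiler may reserve r7 and reject this use of it. In that case you would need to save and restore r7 inside the assembly itself.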

Jubilee Young 3 months ago
Out of curiosity, why can't the implementation be this?
dc cvau, x0
ic ivau, x0
dsb ish
isb
Jacob Bramley 3 months ago in reply to Jubilee Young

We need the DC to complete — meaning actually write to point of unification — before the I cache refills.

In your example, the first DSB is missing, so the IC can complete, and the I cache for x0 be refilled, before the DC itself completes. The second DSB waits for both, but the refill could already have occurred by then.