Caches and Self-Modifying Code: Implementing `__clear_cache`

January 20, 2025


Some time ago, I posted an article about cache maintenance in self-modifying code. I described the use of the __clear_cache function (in Linux) to synchronize the instruction and data caches so that the processor executes what you want it to execute after you have written some code.
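As a reminder of what using the abstraction looks like, here is a minimal sketch built on the GCC and Clang builtin `__builtin___clear_cache`. The helper name `publish_code` is hypothetical, and the sketch assumes that `dst` already points to memory that is writable (and, for real use, executable); setting that up is out of scope here.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not from the earlier article): copy freshly generated
// machine code into a buffer, then synchronize the caches over that range.
static void publish_code(void *dst, const void *src, size_t len) {
    memcpy(dst, src, len);
    // The builtin takes the start address and the one-past-the-end address
    // of the range that was just written.
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

On targets where no maintenance is needed, the builtin typically compiles to nothing, so the call is cheap to leave in portable code.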

Most of the time, using an abstraction (like __clear_cache) is the best solution. However, there may be times when you need to implement it yourself, possibly because you are actually implementing a similar library function, or want something slightly different and want to know where to start from. Perhaps you just want to know how it works. That is what I’ll discuss here.

Implementing `__clear_cache` for AArch64 using the A64 instruction set

The A64 instruction set provides the necessary instructions to perform the required cache maintenance operations in user space (or ‘EL0’ in Arm terminology). This allows self-modifying code updates to be performed directly from EL0, without a system call. For example, __clear_cache on Linux requires no system calls.

Avoiding a dependency on a kernel interface potentially makes this approach fairly portable across operating systems. However, operating system kernels have the ability to deny EL0 access to the necessary instructions. If that’s the case on your target platform, this approach will not work, and you will need to rely on the tools provided by the system.

The Arm ARM (section B2.7.4.2 in DDI0487L.a) tells you exactly what you need to do:

1. Write the new instructions to memory.
2. Use dc cvau to clean the data cache to the point of unification. Loosely, the point of unification is the point at which data and instruction accesses see the same value for a given memory location. I simplified this to “memory” in my earlier article, but it is more typically an L2 cache. The architecture does not specify exactly where it is, but application code using dc cvau does not need to know.
3. Use a dsb barrier to ensure that the data is visible before we move on.
4. Use ic ivau to invalidate the instruction cache to the point of unification.
5. Use another dsb barrier to ensure that the ic completes before the next instruction.
6. Finally, an isb barrier flushes the pipeline, ensuring that any subsequent instructions that the processor has already started working on are discarded and reloaded.

The following sequence synchronizes a single cache line at x0:

...   // Write code to the cache line at x0.
dc cvau, x0
dsb ish
ic ivau, x0
dsb ish
isb
...   // It is now safe to execute the code in the cache line at x0.

These steps are effectively the same as what the Linux kernel does for 32-bit programs.

In practice, code buffers are likely to vary in length and span multiple cache lines, so functions like __clear_cache need to loop. To do that, you need to know the size of the system’s cache lines, which you can determine by reading the cache type register, ctr_el0, described in section D24.2.37 of the Arm ARM (DDI0487L.a). Here’s a simple (but complete) example using GCC inline assembly:

#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    uint32_t ctr;
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    // Work out the line sizes for the I and D caches.
    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);

    for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
        asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
    }

    asm("dsb ish\n\t" : : : "memory");

    for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
        asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
    }

    asm("dsb ish\n\t"
        "isb\n\t"
        : : : "memory");
}

Some other realistic examples can be found in VIXL’s CPU::EnsureIAndDCacheCoherency and in Google V8.
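To make the line-size arithmetic concrete, the following sketch isolates it in plain C so it can be checked without touching the real register. The helper names are hypothetical, and the value 0x84448004 used in the note below is an illustrative ctr_el0 value chosen for the example, not one read from any particular core.

```c
#include <stdint.h>

// Hypothetical helpers mirroring the field decoding in
// EnsureIAndDCacheCoherency. DminLine (bits [19:16]) and IminLine
// (bits [3:0]) hold log2 of the line size in 4-byte words, so the size in
// bytes is 4 << field.
static uintptr_t dcache_line_bytes(uint64_t ctr) {
    return (uintptr_t)4 << ((ctr >> 16) & 0xf);
}

static uintptr_t icache_line_bytes(uint64_t ctr) {
    return (uintptr_t)4 << (ctr & 0xf);
}

// Number of maintenance operations the loop issues for [start, end):
// it starts at the line containing start and steps one line at a time.
static uintptr_t lines_touched(uintptr_t start, uintptr_t end, uintptr_t line) {
    uintptr_t count = 0;
    for (uintptr_t p = start & ~(line - 1); p < end; p += line) {
        count++;
    }
    return count;
}
```

With the illustrative value 0x84448004, both fields are 4, which gives 64-byte lines. A buffer from 0x1010 to 0x1090 then touches three lines (0x1000, 0x1040 and 0x1080), because the start address is rounded down to a line boundary.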

Cache-coherent implementations

Although the Arm architecture does not require automatic coherency between data writes and instruction fetches, it does allow for implementations that provide it. On such implementations, one or both of the cache-maintenance operations can be omitted:

If ctr_el0.IDC is set, the implementation does not require an explicit dc cvau. The dsb is still required, to ensure that outstanding stores complete.
If ctr_el0.DIC is set, the implementation does not require an explicit ic ivau.

Taking this into account, the example implementation looks like this:

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
    uint32_t ctr;
    asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

    uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
    uintptr_t const isize = 4 << ((ctr >>  0) & 0xf);
    bool n_idc = ((ctr >> 28) & 0x1) == 0;
    bool n_dic = ((ctr >> 29) & 0x1) == 0;

    if (n_idc) {
        for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
            asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
        }
    }

    asm("dsb ish\n\t" : : : "memory");

    if (n_dic) {
        for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
            asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
        }
        asm("dsb ish\n\t" : : : "memory");
    }

    asm("isb\n\t" : : : "memory");
}
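The decision logic can also be separated from the assembly, which makes it easy to check on any host. This is only a sketch: the struct and function names are hypothetical, and the bit positions are the ones used in the example above (IDC at bit 28, DIC at bit 29).

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical summary of which maintenance loops a given ctr_el0 value
// still requires.
struct maintenance_plan {
    bool need_dc_cvau;  // true when ctr_el0.IDC (bit 28) is clear
    bool need_ic_ivau;  // true when ctr_el0.DIC (bit 29) is clear
};

static struct maintenance_plan plan_from_ctr(uint64_t ctr) {
    struct maintenance_plan plan;
    plan.need_dc_cvau = ((ctr >> 28) & 0x1) == 0;
    plan.need_ic_ivau = ((ctr >> 29) & 0x1) == 0;
    return plan;
}
```

Whatever the plan says, the barriers remain: the first dsb is always issued, and the final isb is needed even when both loops are skipped.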

Implementing `__clear_cache` for AArch32: Arm (A32) and Thumb (T32)

In AArch32, neither A32 nor T32 offers similar EL0 instructions, so __clear_cache works by calling into the Linux kernel. You can do this directly as follows:

ldr   r0, =start_address
ldr   r1, =end_address
mov   r2, #0              @ r2 _must_ be zero.
ldr   r7, =0x000f0002
svc   0                   @ The svc number is ignored.

The important thing is that the registers r0, r1, r2 and r7 are set properly when the svc executes; it doesn’t matter how you achieve this. If the arguments are already in the right registers, for example, you might not need to do anything. The Google V8 JavaScript engine uses GCC inline assembly to do it and lets the compiler worry about the best way to get the values where they need to be.
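If you would rather not write the assembly at all, one possible alternative is to let the C library place the values in registers. The sketch below assumes a Linux C library that provides the generic syscall() wrapper (as glibc does); the function name flush_code_range and the macro name are hypothetical, and the number 0x000f0002 is the one loaded into r7 above. On any other target the sketch falls back to the compiler builtin so that it still compiles.

```c
#define _GNU_SOURCE
#include <unistd.h>

// The value placed in r7 in the assembly sequence above.
#define ARM_CACHEFLUSH_SYSCALL 0x000f0002L

// Hypothetical wrapper: returns 0 on success.
static int flush_code_range(void *start, void *end) {
#if defined(__linux__) && defined(__arm__)
    // The third argument corresponds to r2 and must be zero.
    return (int)syscall(ARM_CACHEFLUSH_SYSCALL, start, end, 0);
#else
    // Assumption: elsewhere, defer to the GCC/Clang builtin.
    __builtin___clear_cache((char *)start, (char *)end);
    return 0;
#endif
}
```

This trades a little call overhead for not having to manage r7 by hand, which can be awkward in Thumb code where the compiler may want r7 as a frame pointer.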

Architectures and Processors blog

Jacob Bramley

How to implement `__clear_cache` using assembly.