Some time ago, I posted an article about cache maintenance in self-modifying code. I described the use of the __clear_cache function (in Linux) to synchronize the instruction and data caches so that the processor executes what you want it to execute after you have written some code.
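For reference, calling the abstraction typically looks something like the sketch below. The helper name publish_code is hypothetical, and the destination is assumed to be memory that is (or will later be made) executable; only the builtin itself is provided by GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: copy `size` bytes of freshly generated machine code
// into `dst`, then ask the compiler runtime to synchronize the caches.
// `dst` is assumed to point to memory that is, or will be made, executable.
static void publish_code(void *dst, const void *code, size_t size) {
    assert(dst != NULL && code != NULL);
    memcpy(dst, code, size);
    // The builtin behind __clear_cache; it takes the range [begin, end).
    __builtin___clear_cache((char *)dst, (char *)dst + size);
}
```

On targets with coherent instruction and data caches the builtin can compile to nothing, so the call is cheap to keep in portable code.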
Most of the time, using an abstraction (like __clear_cache) is the best solution. However, there may be times when you need to implement it yourself, possibly because you are actually implementing a similar library function, or want something slightly different and want to know where to start from. Perhaps you just want to know how it works. That is what I’ll discuss here.
The A64 instruction set provides the necessary instructions to perform the required cache maintenance operations in user space (or ‘EL0’ in Arm terminology). This allows self-modifying code updates to be performed directly from EL0, without a system call. For example, __clear_cache on Linux requires no system calls.
Avoiding a dependency on a kernel interface potentially makes this approach fairly portable across operating systems. However, operating system kernels have the ability to deny EL0 access to the necessary instructions. If that’s the case on your target platform, this approach will not work, and you will need to rely on the tools provided by the system.
The Arm ARM (section B2.7.4.2 in DDI0487L.a) tells you exactly what you need to do:
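1. dc cvau: clean the data cache line to the point of unification, so the newly written code is visible to instruction fetches.
2. dsb: ensure that the clean has completed.
3. ic ivau: invalidate the instruction cache line to the point of unification.
4. dsb: ensure that the ic invalidation has completed.
5. isb: synchronize the instruction stream, so that subsequent instructions are fetched again.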
The following sequence synchronizes a single cache line at x0:
    ...             // Write code to the cache line at x0.
    dc cvau, x0
    dsb ish
    ic ivau, x0
    dsb ish
    isb
    ...             // It is now safe to execute the code in the cache line at x0.
These steps are effectively the same as what the Linux kernel does for 32-bit programs.
In practice, code buffers are likely to vary in length and span multiple cache lines, so functions like __clear_cache will need to loop. To do that, you need to know the size of the system’s cache lines, which you can determine by reading the cache type register, ctr_el0, described in section D24.2.37 of the Arm ARM (DDI0487L.a). Its DminLine and IminLine fields hold the log2 of the smallest data and instruction cache line sizes, counted in 4-byte words. Here’s a simple (but complete) example using GCC inline assembly:
    #include <stdint.h>
    #include <stddef.h>

    void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
      uint32_t ctr;
      asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

      // Work out the line sizes for the I and D caches.
      uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
      uintptr_t const isize = 4 << ((ctr >> 0) & 0xf);

      for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
        asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
      }
      asm("dsb ish\n\t" : : : "memory");

      for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
        asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
      }
      asm("dsb ish\n\t"
          "isb\n\t"
          : : : "memory");
    }
Some other realistic examples can be found in VIXL’s CPU::EnsureIAndDCacheCoherency and in Google V8.
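To make the loop arithmetic easier to check without Arm hardware, here is a minimal, portable sketch that decodes the two line-size fields from a ctr_el0 value passed in as a plain integer, and counts the lines a maintenance loop would visit. The function names are hypothetical, and the value 0x8444c004 used in the note below is only an illustrative input, not something read from a particular system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Smallest D-cache line size in bytes. CTR_EL0.DminLine (bits [19:16])
// holds log2 of the line size in 4-byte words.
static uintptr_t ctr_dcache_line_bytes(uint64_t ctr) {
    return (uintptr_t)4 << ((ctr >> 16) & 0xf);
}

// Smallest I-cache line size in bytes. CTR_EL0.IminLine is bits [3:0].
static uintptr_t ctr_icache_line_bytes(uint64_t ctr) {
    return (uintptr_t)4 << (ctr & 0xf);
}

// Number of cache lines a maintenance loop visits for [start, end).
// The start address is rounded down to a line boundary, as in the article.
static size_t lines_touched(uintptr_t start, uintptr_t end, uintptr_t line) {
    assert(line != 0 && (line & (line - 1)) == 0);  // must be a power of two
    size_t count = 0;
    for (uintptr_t p = start & ~(line - 1); p < end; p += line) {
        count++;
    }
    return count;
}
```

With the illustrative value 0x8444c004, both fields decode to 4, giving 64-byte lines. A 32-byte buffer at 0x1030 then straddles two lines (0x1000 and 0x1040), which is why the loops round the start address down rather than stepping from start directly.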
Although the Arm architecture does not require automatic coherency between data writes and instruction fetches, it does allow for implementations that provide it. On such implementations, one or both of the cache-maintenance operations can be omitted:
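- If ctr_el0.IDC is set, the data cache clean (dc cvau) is not required.
- If ctr_el0.DIC is set, the instruction cache invalidation (ic ivau) is not required.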
Taking this into account, the example implementation looks like this:
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    void EnsureIAndDCacheCoherency(uintptr_t start, uintptr_t end) {
      uint32_t ctr;
      asm("mrs %[ctr], ctr_el0\n\t" : [ctr] "=r" (ctr));

      // Work out the line sizes for the I and D caches.
      uintptr_t const dsize = 4 << ((ctr >> 16) & 0xf);
      uintptr_t const isize = 4 << ((ctr >> 0) & 0xf);

      // IDC (bit 28) set: no dc cvau needed. DIC (bit 29) set: no ic ivau needed.
      bool n_idc = ((ctr >> 28) & 0x1) == 0;
      bool n_dic = ((ctr >> 29) & 0x1) == 0;

      if (n_idc) {
        for (uintptr_t dline = start & ~(dsize - 1); dline < end; dline += dsize) {
          asm("dc cvau, %[dline]\n\t" : : [dline] "r" (dline) : "memory");
        }
      }
      // Always wait for the data writes (and any cleans) to complete.
      asm("dsb ish\n\t" : : : "memory");

      if (n_dic) {
        for (uintptr_t iline = start & ~(isize - 1); iline < end; iline += isize) {
          asm("ic ivau, %[iline]\n\t" : : [iline] "r" (iline) : "memory");
        }
        asm("dsb ish\n\t" : : : "memory");
      }
      asm("isb\n\t" : : : "memory");
    }
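The IDC and DIC decision can also be isolated from the inline assembly, which makes it testable on any host. The sketch below is only an illustration: the struct and function names are hypothetical, and the ctr_el0 value is assumed to have been read elsewhere and passed in.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical summary of which maintenance loops a CTR_EL0 value requires.
struct maintenance_plan {
    bool need_dc_cvau;  // false when CTR_EL0.IDC (bit 28) is set
    bool need_ic_ivau;  // false when CTR_EL0.DIC (bit 29) is set
};

static struct maintenance_plan plan_from_ctr(uint64_t ctr) {
    struct maintenance_plan plan;
    plan.need_dc_cvau = ((ctr >> 28) & 0x1) == 0;
    plan.need_ic_ivau = ((ctr >> 29) & 0x1) == 0;
    return plan;
}
```

Whatever the plan says, the dsb after the data writes and the final isb are still needed, as in the implementation above; only the two loops (and the dsb that follows the ic loop) become optional.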
In AArch32, neither A32 nor T32 offers similar EL0 instructions, so __clear_cache works by calling into the Linux kernel. You can do this directly as follows:
    ldr r0, =start_address
    ldr r1, =end_address
    mov r2, #0              @ r2 _must_ be zero.
    ldr r7, =0x000f0002
    svc 0                   @ The svc number is ignored.
The important thing is that the registers r0, r1, r2 and r7 are set properly when the svc executes; it doesn’t matter how you achieve this. If the arguments are already in the right registers, for example, you might not need to do anything. The Google V8 JavaScript engine uses GCC inline assembly to do it and lets the compiler worry about the best way to get the values where they need to be.
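As a rough illustration of the inline assembly approach, the sketch below pins each argument to its register with GCC register variables and leaves the moves to the compiler. It is not V8's actual code: the function name flush_icache_range is hypothetical, and the fallback to the compiler builtin on non-Arm targets is an assumption added so that the example builds everywhere.

```c
#include <stdint.h>

// Hypothetical wrapper; returns 0 on success.
static int flush_icache_range(void *start, void *end) {
#if defined(__arm__) && defined(__linux__)
    // Pin each value to the register the kernel expects.
    register uintptr_t r0 __asm__("r0") = (uintptr_t)start;
    register uintptr_t r1 __asm__("r1") = (uintptr_t)end;
    register uintptr_t r2 __asm__("r2") = 0;           // Must be zero.
    register uintptr_t r7 __asm__("r7") = 0x000f0002;  // Same number as above.
    // Caveat: when building Thumb code with a frame pointer, GCC may
    // reserve r7, in which case it has to be saved and restored by hand.
    __asm__ volatile("svc 0"
                     : "+r"(r0)
                     : "r"(r1), "r"(r2), "r"(r7)
                     : "memory");
    return (int)r0;
#else
    // Assumption for other targets: defer to the compiler builtin.
    __builtin___clear_cache((char *)start, (char *)end);
    return 0;
#endif
}
```

The "memory" clobber stops the compiler from moving the code writes past the system call.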