Caches and Self-Modifying Code

September 11, 2013

6 minute read time.

Ideally, caches act as some magic make-it-go-faster logic, sitting between your processor core (or cores) and your memory bank. Whilst it can be beneficial to consider specific cache features when writing some performance-critical code, it is usually advisable to consider only general cache behaviour in mind. However, there are cases where the cache behaviour must be considered in order to get the result that you want, and self-modifying code is an excellent example.

Cached Arm architectures have a separate cache for data and instruction accesses; these are called the D-cache and the I-cache, respectively. For this reason, the Arm architecture is often considered to be a Modified Harvard Architecture, though I must admit that with most real processors existing somewhere between Harvard and von Neumann architectures, I do not find that label particularly useful. There are a few benefits of this design, but the one I have seen discussed the most often is that with two interfaces to the CPU, the core can load an instruction and some data at the same time.

Whilst employing this Harvard-style memory interface is useful for performance, it does have its own drawbacks. The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to Arm. On Arm, you can write instructions into memory, but because the D-cache and I-cache are not coherent, the newly-written instructions might be masked by the existing contents of the I-cache, causing the processor to execute old (or possibly invalid) instructions.

The Problem

Consider some hypothetical self-modifying code¹: Some existing JIT-compiled code was generated at run-time to load a function address into a regiser and then branch to it. The JIT compiler has moved the target function to a new location, and needs to update the original pointer. This is a typical operation in JIT compilers, either because the destination address is not known at the time of compilation or because the destination function has been re-compiled in a new location with some additional optimizations.

A highly simplified view of the processor before the new code is written might look like this:

Simplified view of processor before new code

In this case, the I-cache begins with the old code already loaded. This may not always be true; if the code has not been executed it is unlikely to be in the I-cache, but it is possible for code to be there even if it has not run before, and it is not practical to check this at run-time so we assume that the I-cache does contain the old code, just to be on the safe side.

The processor can only execute instructions that are in the I-cache and can only see data that are in the D-cache. Generally, it cannot directly access memory. Significantly for us, it cannot directly execute instructions that are in the D-cache and cannot programatically read or write data that are in the I-cache. Because we cannot directly write into the I-cache (or into memory), we will end up with something like this when we write out the new code:

If you now try to run the code you have just written, the processor will ignore it and will simply execute the old code because it is still in the I-cache and it does not know any different. This is a real nuisance for applications (such as JIT compilers) that employ self-modifying code!

The Solution

We need to get data from the D-cache into the I-cache. It is clear from the data-flow arrows on the diagram that there is only one way to achieve this: the data must move to the external memory and then into the I-cache from there.

At some point in the future, the processor may decide to write out the new data from the D-cache to memory, and it may also need to re-read from memory to the I-cache, but there is not a practical way to know when it will do those things, so we have to force it. The terminology surrounding caches varies between processor architectures, so I will emphasize important terms. Currently, the data we care about in the D-cache are new, and do not match the contents of memory. These are dirty data. Unsurprisingly, to push the data out to memory, we must clean it, then wait for the write to complete. The result looks something like this:

In order to execute the new instructions, we need to tell the processor that the contents of the I-cache are stale and need to be re-loaded from memory. We do this by invalidating the instructions in the I-cache. The results will look like this:

Invalidating instructions in I-cache esults

If you now attempt to run the code you have written out to memory, the instruction fetch will miss in the I-cache and the processor will have to get the new version from memory. The result is that your newly-emitted code gets executed, as you intended.

That is not the whole story, however; there are other things that you need to do. On processors with branch prediction, you need to clear out the branch target buffer. Generally, processors will queue up writes to memory in a write buffer, so this also must be drained when cleaning the D-cache. Of course, in practice these tasks are very specific to the processor you are using, and you will use a library function to do this stuff anyway. It is important to understand what your library functions do and why they are necessary; it is not important to understand the nitty-gritty details of each specific processor if you just want to write self-modifying code.

Finally, you might consider using a pli instruction to hint to the processor that it should preload the new code into the I-cache. This might give you a decent performance boost, as it will not have to stall on memory when you eventually branch to it. Of course, being a hint, it might have no effect at all, but it can be beneficial on some implementations.

The Code

Diagrams and memory models are all very good, but how does this all relate to real code?

As usual, the relevant instructions are CP15 (System Control Coprocessor) operations, and cannot be run from non-privileged modes (where most applications run). In practice, this means that the operating system has to perform the operations for you. Fortunately, most systems provide a mechanism for cleaning or invalidating the cache from non-privileged modes, and these mechanisms will hide the processor-specific complexities.

Linux (GCC)

In GCC on Linux, you should use the __clear_cache function:

void __clear_cache(char* beg, char* end);

Of course, there is little documentation for this important function, and you have to root around a fair bit to find out what it actually does. Essentially, __clear_cache does the following (using a system call):

Clean the specified data cache range.
Invalidate the specified instruction cache range.

The start address (char* beg) is inclusive, whilst the end address (char* end) is exclusive.

The function will also flush the write buffer and perform any other necessary processor-specific fiddling about that you, as the caller, do not want to worry about. If you really want to know exactly what it does, you will need to look in the Linux kernel, in arch/arm/mm/cache-v7.S (or the equivalent file for whichever architecture you are using).

There is an example which you can play with attached to this post.

Others

Operating System	Relevant Library Function
Linux (GCC)	`__clear_cache`
Google Android	`cacheflush`
Windows CE	`FlushInstructionCache`

¹
Whilst the use of self-modifying code is generally discouraged by many (including myself), there are a small number of cases where its use is essential. An excellent example is that of a JIT compiler, where fragments of code are compiled at run-time. A more common (though less obvious) example is that of an operating system kernel: from the point of view of the processor, some code in the system is modifying some other code in the system every time a process is swapped in or out.

clear_cache.tar.gz

13 comments
0 members are here

Architectures and Processors blog

Introducing GICv5: Scalable and secure interrupt management for Arm

Christoffer Dall

Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
- April 28, 2025
Getting started with AARCHMRS Features.json using Python

Joh

A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
- April 8, 2025
Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

Samer El-Haj-Mahmoud

Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
- January 28, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Caches and Self-Modifying Code

The Problem

The Solution

The Code

Linux (GCC)

Others

Introducing GICv5: Scalable and secure interrupt management for Arm

Getting started with AARCHMRS Features.json using Python

Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC