Caches and Self-Modifying Code

September 11, 2013

6 minute read time.

Ideally, caches act as some magic make-it-go-faster logic, sitting between your processor core (or cores) and your memory bank. Whilst it can be beneficial to consider specific cache features when writing some performance-critical code, it is usually advisable to consider only general cache behaviour in mind. However, there are cases where the cache behaviour must be considered in order to get the result that you want, and self-modifying code is an excellent example.

Cached Arm architectures have a separate cache for data and instruction accesses; these are called the D-cache and the I-cache, respectively. For this reason, the Arm architecture is often considered to be a Modified Harvard Architecture, though I must admit that with most real processors existing somewhere between Harvard and von Neumann architectures, I do not find that label particularly useful. There are a few benefits of this design, but the one I have seen discussed the most often is that with two interfaces to the CPU, the core can load an instruction and some data at the same time.

Whilst employing this Harvard-style memory interface is useful for performance, it does have its own drawbacks. The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to Arm. On Arm, you can write instructions into memory, but because the D-cache and I-cache are not coherent, the newly-written instructions might be masked by the existing contents of the I-cache, causing the processor to execute old (or possibly invalid) instructions.

The Problem

Consider some hypothetical self-modifying code¹: Some existing JIT-compiled code was generated at run-time to load a function address into a regiser and then branch to it. The JIT compiler has moved the target function to a new location, and needs to update the original pointer. This is a typical operation in JIT compilers, either because the destination address is not known at the time of compilation or because the destination function has been re-compiled in a new location with some additional optimizations.

A highly simplified view of the processor before the new code is written might look like this:

Simplified view of processor before new code

In this case, the I-cache begins with the old code already loaded. This may not always be true; if the code has not been executed it is unlikely to be in the I-cache, but it is possible for code to be there even if it has not run before, and it is not practical to check this at run-time so we assume that the I-cache does contain the old code, just to be on the safe side.

The processor can only execute instructions that are in the I-cache and can only see data that are in the D-cache. Generally, it cannot directly access memory. Significantly for us, it cannot directly execute instructions that are in the D-cache and cannot programatically read or write data that are in the I-cache. Because we cannot directly write into the I-cache (or into memory), we will end up with something like this when we write out the new code:

If you now try to run the code you have just written, the processor will ignore it and will simply execute the old code because it is still in the I-cache and it does not know any different. This is a real nuisance for applications (such as JIT compilers) that employ self-modifying code!

The Solution

We need to get data from the D-cache into the I-cache. It is clear from the data-flow arrows on the diagram that there is only one way to achieve this: the data must move to the external memory and then into the I-cache from there.

At some point in the future, the processor may decide to write out the new data from the D-cache to memory, and it may also need to re-read from memory to the I-cache, but there is not a practical way to know when it will do those things, so we have to force it. The terminology surrounding caches varies between processor architectures, so I will emphasize important terms. Currently, the data we care about in the D-cache are new, and do not match the contents of memory. These are dirty data. Unsurprisingly, to push the data out to memory, we must clean it, then wait for the write to complete. The result looks something like this:

In order to execute the new instructions, we need to tell the processor that the contents of the I-cache are stale and need to be re-loaded from memory. We do this by invalidating the instructions in the I-cache. The results will look like this:

Invalidating instructions in I-cache esults

If you now attempt to run the code you have written out to memory, the instruction fetch will miss in the I-cache and the processor will have to get the new version from memory. The result is that your newly-emitted code gets executed, as you intended.

That is not the whole story, however; there are other things that you need to do. On processors with branch prediction, you need to clear out the branch target buffer. Generally, processors will queue up writes to memory in a write buffer, so this also must be drained when cleaning the D-cache. Of course, in practice these tasks are very specific to the processor you are using, and you will use a library function to do this stuff anyway. It is important to understand what your library functions do and why they are necessary; it is not important to understand the nitty-gritty details of each specific processor if you just want to write self-modifying code.

Finally, you might consider using a pli instruction to hint to the processor that it should preload the new code into the I-cache. This might give you a decent performance boost, as it will not have to stall on memory when you eventually branch to it. Of course, being a hint, it might have no effect at all, but it can be beneficial on some implementations.

The Code

Diagrams and memory models are all very good, but how does this all relate to real code?

As usual, the relevant instructions are CP15 (System Control Coprocessor) operations, and cannot be run from non-privileged modes (where most applications run). In practice, this means that the operating system has to perform the operations for you. Fortunately, most systems provide a mechanism for cleaning or invalidating the cache from non-privileged modes, and these mechanisms will hide the processor-specific complexities.

Linux (GCC)

In GCC on Linux, you should use the __clear_cache function:

void __clear_cache(char* beg, char* end);

Of course, there is little documentation for this important function, and you have to root around a fair bit to find out what it actually does. Essentially, __clear_cache does the following (using a system call):

Clean the specified data cache range.
Invalidate the specified instruction cache range.

The start address (char* beg) is inclusive, whilst the end address (char* end) is exclusive.

The function will also flush the write buffer and perform any other necessary processor-specific fiddling about that you, as the caller, do not want to worry about. If you really want to know exactly what it does, you will need to look in the Linux kernel, in arch/arm/mm/cache-v7.S (or the equivalent file for whichever architecture you are using).

There is an example which you can play with attached to this post.

Others

Operating System	Relevant Library Function
Linux (GCC)	`__clear_cache`
Google Android	`cacheflush`
Windows CE	`FlushInstructionCache`

¹
Whilst the use of self-modifying code is generally discouraged by many (including myself), there are a small number of cases where its use is essential. An excellent example is that of a JIT compiler, where fragments of code are compiled at run-time. A more common (though less obvious) example is that of an operating system kernel: from the point of view of the processor, some code in the system is modifying some other code in the system every time a process is swapped in or out.

clear_cache.tar.gz

Jens Bauer over 11 years ago

I believe it's necessary, especially in that case.
Just think of the instruction cache as a short-term memory, and the RAM (no matter which segment you're in) as a long-term memory.
If you erase your long term memory, to exclude that you just prepared a cup of tea, you still remember in your short term memory that you're about to drink that cup of tea.
In other words: If you have a short routine you keep modifying and executing, chances that an old version will exist in your instruction cache are very big.
-And as soon as there is just a single possibility that this happens, you'll get strange behaviours (aka periodic bugs).
So I'd definitely just clear the caches every time I've generated or modified code.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jens Bauer over 11 years ago

Great article. I especially liked the details about flushing the D-cache before flushing the I-cache. It makes a really good overview.

Another example of self-modifying code is something that all operating systems have, and I bet there's no computer that does not have it.

-The ability to load executable programs. It does exactly the same thing. The only difference is where the code comes from. Whether it's stored on a harddisk, a SD-card, on a RAM-disk in memory or is generated - it really makes no difference. The real meaning of "self-modifying code" is actually that usually a larger block of instructions are patched with - say a constant byte value - in order to save space. On Arm such code is usually not necessary; we can normally just load the values from RAM.

I generate code on the fly in some cases. If that's the fastest way to do the operation (eg. time-critical operations), I'll use that. If loading from memory is faster, I'll use that.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Dan Milon over 12 years ago

Does this also apply when you directly write on the code segment (ie not generate new code, but modify and execute existing code).From what I understand, you don't have to flush the cache in that case, since you're referring to the code segment of the memory, which goes directly into the instruction cache. Is that right?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Chris Leary over 12 years ago

Why is it necessary to clear out the BTB?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jacob Bramley over 12 years ago

QUOTE (cdleary @ Oct 13 2010, 08:04 PM)

Why is it necessary to clear out the BTB?

Oh, that's a tricky question! Actually, the Cortex-A8 TRM says that you don't have to, but also states that the operation is implemented as a NOP and is essentially free. After a quick skim, I can't find anything in the Architecture Reference Manual that explains why you need to do it.

My statement that it was necessary was based on what the Linux kernel code does. If I find more information, I'll get back to you.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Chris Walsh

Arm is pleased to announce that the Arm Intrinsics Search Engine has been updated to include the Scalable Matrix Extension (SME) intrinsics, including both SME and SME2 intrinsics.
- October 3, 2025
Arm A-Profile Architecture developments 2025

Martin Weidmann

Each year, Arm publishes updates to the A-Profile architecture alongside full Instruction Set and System Register documentation. In 2025, the update is Armv9.7-A.
- October 2, 2025
When a barrier does not block: The pitfalls of partial order

Wathsala Vithanage

Acquire fences aren’t always enough. See how LDAPR exposed unsafe interleavings and what we did to patch the problem.
- September 15, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Caches and Self-Modifying Code

The Problem

The Solution

The Code

Linux (GCC)

Others

Scalable Matrix Extension: Expanding the Arm Intrinsics Search Engine

Arm A-Profile Architecture developments 2025

When a barrier does not block: The pitfalls of partial order