Caches and Self-Modifying Code

September 11, 2013

Ideally, caches act as some magic make-it-go-faster logic, sitting between your processor core (or cores) and your memory bank. Whilst it can be beneficial to consider specific cache features when writing performance-critical code, it is usually advisable to keep only general cache behaviour in mind. However, there are cases where cache behaviour must be considered in order to get the result that you want, and self-modifying code is an excellent example.

Cached Arm architectures have separate caches for data and instruction accesses; these are called the D-cache and the I-cache, respectively. For this reason, the Arm architecture is often considered to be a Modified Harvard Architecture, though I must admit that with most real processors existing somewhere between Harvard and von Neumann architectures, I do not find that label particularly useful. There are a few benefits of this design, but the one I have seen discussed most often is that with two interfaces to the CPU, the core can load an instruction and some data at the same time.

Whilst employing this Harvard-style memory interface is useful for performance, it does have its own drawbacks. The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to Arm. On Arm, you can write instructions into memory, but because the D-cache and I-cache are not coherent, the newly-written instructions might be masked by the existing contents of the I-cache, causing the processor to execute old (or possibly invalid) instructions.

The Problem

Consider some hypothetical self-modifying code¹: some existing JIT-compiled code was generated at run-time to load a function address into a register and then branch to it. The JIT compiler has moved the target function to a new location, and needs to update the original pointer. This is a typical operation in JIT compilers, either because the destination address is not known at the time of compilation or because the destination function has been re-compiled in a new location with some additional optimizations.

A highly simplified view of the processor before the new code is written might look like this:

Simplified view of processor before new code

In this case, the I-cache begins with the old code already loaded. This may not always be true: code that has not been executed is unlikely to be in the I-cache, but it can be there even if it has never run. It is not practical to check this at run-time, so to be on the safe side we assume that the I-cache does contain the old code.

The processor can only execute instructions that are in the I-cache and can only see data that are in the D-cache. Generally, it cannot directly access memory. Significantly for us, it cannot directly execute instructions that are in the D-cache and cannot programmatically read or write data that are in the I-cache. Because we cannot directly write into the I-cache (or into memory), we will end up with something like this when we write out the new code:

If you now try to run the code you have just written, the processor will ignore it and will simply execute the old code because it is still in the I-cache and it does not know any different. This is a real nuisance for applications (such as JIT compilers) that employ self-modifying code!

The Solution

We need to get data from the D-cache into the I-cache. It is clear from the data-flow arrows on the diagram that there is only one way to achieve this: the data must move to the external memory and then into the I-cache from there.

At some point in the future, the processor may decide to write out the new data from the D-cache to memory, and it may also need to re-read from memory to the I-cache, but there is not a practical way to know when it will do those things, so we have to force it. The terminology surrounding caches varies between processor architectures, so I will emphasize important terms. Currently, the data we care about in the D-cache are new, and do not match the contents of memory. These are dirty data. Unsurprisingly, to push the data out to memory, we must clean them, then wait for the write to complete. The result looks something like this:

In order to execute the new instructions, we need to tell the processor that the contents of the I-cache are stale and need to be re-loaded from memory. We do this by invalidating the instructions in the I-cache. The results will look like this:

Results of invalidating instructions in the I-cache

If you now attempt to run the code you have written out to memory, the instruction fetch will miss in the I-cache and the processor will have to get the new version from memory. The result is that your newly-emitted code gets executed, as you intended.
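The sequence above can be sketched with a toy model in C. This is only an illustration of the data flow, not a description of any real Arm core: the `ToyCore` structure and all of the `toy_*` functions are names I have made up for this sketch, and the model tracks a single instruction word rather than real cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical): one word of memory plus its cached copies. */
typedef struct {
    uint32_t memory;       /* contents of external memory            */
    uint32_t dcache;       /* copy held in the D-cache               */
    bool     dcache_dirty; /* true if dcache is newer than memory    */
    uint32_t icache;       /* copy held in the I-cache               */
    bool     icache_valid; /* false means the next fetch will miss   */
} ToyCore;

/* Start with the old instruction in memory and already in the I-cache. */
static ToyCore toy_init(uint32_t old_insn)
{
    ToyCore c = { old_insn, old_insn, false, old_insn, true };
    return c;
}

/* A store only reaches the D-cache, leaving the line dirty. */
static void toy_store(ToyCore *c, uint32_t new_insn)
{
    c->dcache = new_insn;
    c->dcache_dirty = true;
}

/* Clean: push dirty D-cache data out to memory. */
static void toy_clean_dcache(ToyCore *c)
{
    if (c->dcache_dirty) {
        c->memory = c->dcache;
        c->dcache_dirty = false;
    }
}

/* Invalidate: mark the I-cache copy stale so the next fetch misses. */
static void toy_invalidate_icache(ToyCore *c)
{
    c->icache_valid = false;
}

/* Fetch: hit in the I-cache, or refill from memory (never from the D-cache). */
static uint32_t toy_fetch(ToyCore *c)
{
    if (!c->icache_valid) {
        c->icache = c->memory;
        c->icache_valid = true;
    }
    return c->icache;
}

/* The full sequence: clean the D-cache first, then invalidate the I-cache. */
static void toy_sync(ToyCore *c)
{
    toy_clean_dcache(c);
    assert(!c->dcache_dirty); /* memory now holds the new instruction */
    toy_invalidate_icache(c);
}
```

In this model, calling `toy_store` and then `toy_fetch` still returns the old instruction. Invalidating the I-cache without cleaning the D-cache does not help either, because the refill comes from memory, which is still stale. Only `toy_sync` (clean, then invalidate) makes the next `toy_fetch` return the new instruction.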

That is not the whole story, however; there are other things that you need to do. On processors with branch prediction, you need to clear out the branch target buffer. Generally, processors will queue up writes to memory in a write buffer, so this also must be drained when cleaning the D-cache. Of course, in practice these tasks are very specific to the processor you are using, and you will use a library function to do this stuff anyway. It is important to understand what your library functions do and why they are necessary; it is not important to understand the nitty-gritty details of each specific processor if you just want to write self-modifying code.

Finally, you might consider using a pli instruction to hint to the processor that it should preload the new code into the I-cache. This might give you a decent performance boost, as it will not have to stall on memory when you eventually branch to it. Of course, being a hint, it might have no effect at all, but it can be beneficial on some implementations.

The Code

Diagrams and memory models are all very good, but how does this all relate to real code?

As usual, the relevant instructions are CP15 (System Control Coprocessor) operations, and cannot be run from non-privileged modes (where most applications run). In practice, this means that the operating system has to perform the operations for you. Fortunately, most systems provide a mechanism for cleaning or invalidating the cache from non-privileged modes, and these mechanisms will hide the processor-specific complexities.

Linux (GCC)

In GCC on Linux, you should use the __clear_cache function:

void __clear_cache(char* beg, char* end);

Of course, there is little documentation for this important function, and you have to root around a fair bit to find out what it actually does. Essentially, __clear_cache does the following (using a system call):

Clean the specified data cache range.
Invalidate the specified instruction cache range.

The start address (char* beg) is inclusive, whilst the end address (char* end) is exclusive.

The function will also flush the write buffer and perform any other necessary processor-specific fiddling about that you, as the caller, do not want to worry about. If you really want to know exactly what it does, you will need to look in the Linux kernel, in arch/arm/mm/cache-v7.S (or the equivalent file for whichever architecture you are using).
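As a minimal sketch of how this might be used, the helper below copies some freshly generated instructions into place and then synchronizes the caches over exactly that range. The name `publish_code` is hypothetical, and I am assuming a GCC-compatible compiler that provides the `__builtin___clear_cache` builtin (the form behind `__clear_cache`). The sketch does not allocate executable memory or branch to the new code; it only shows the inclusive-start, exclusive-end convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `len` bytes of generated code to `dst`, then
 * clean the D-cache and invalidate the I-cache over [dst, dst + len).
 * Returns the exclusive end address that was passed to the builtin.
 */
static void *publish_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    char *begin = (char *)dst; /* inclusive start */
    char *end = begin + len;   /* exclusive end   */

    memcpy(begin, src, len);

    /* Assumes GCC or a compatible compiler that provides this builtin. */
    __builtin___clear_cache(begin, end);

    return end;
}
```

Note that the end pointer is one past the last byte written, not the address of the last byte. On targets with coherent instruction and data caches the builtin may compile to nothing, so the call costs little where it is not needed.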

There is an example which you can play with attached to this post.

Others

Operating System	Relevant Library Function
Linux (GCC)	`__clear_cache`
Google Android	`cacheflush`
Windows CE	`FlushInstructionCache`

¹ Whilst the use of self-modifying code is generally discouraged by many (including myself), there are a small number of cases where its use is essential. An excellent example is that of a JIT compiler, where fragments of code are compiled at run-time. A more common (though less obvious) example is that of an operating system kernel: from the point of view of the processor, some code in the system is modifying some other code in the system every time a process is swapped in or out.

clear_cache.tar.gz

NOTAN over 7 years ago

The definition of the MOVT macro in the attachment might have a problem:

// MOVT(rd,x) emits this: "MOVT rd, #:upper16:x"

#define MOVT(rd,x) (0xe3400000 | (((x)>>12)&0xf0000) | ((rd)<<12) | ((x)>>16))

If x > 0x0fffffff, the upper 4 bits of x are ORed into the rd field. This leads to problems if the function address is very high.

Should be:

#define MOVT(rd,x) (0xe3400000 | (((x)>>12)&0xf0000) | ((rd)<<12) | (((x)>>16)&0xfff))
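The difference between the two macros can be checked with a small sketch. The names `MOVT_ORIGINAL`, `MOVT_FIXED`, `movt_rd`, `movt_imm16` and `movt_encode` are made up for this illustration, and the macros are transcribed with unsigned constants on the assumption that `x` is a 32-bit unsigned address.

```c
#include <assert.h>
#include <stdint.h>

/* Macro as quoted from the attachment (illustrative name). */
#define MOVT_ORIGINAL(rd, x) \
    (0xe3400000u | (((x) >> 12) & 0xf0000u) | ((uint32_t)(rd) << 12) | ((x) >> 16))

/* Corrected macro: only the low 12 bits of the upper half go into imm12. */
#define MOVT_FIXED(rd, x) \
    (0xe3400000u | (((x) >> 12) & 0xf0000u) | ((uint32_t)(rd) << 12) | (((x) >> 16) & 0xfffu))

/* Destination register field, bits 15:12 of the encoding. */
static uint32_t movt_rd(uint32_t insn)
{
    return (insn >> 12) & 0xfu;
}

/* The 16-bit immediate, split as imm4 (bits 19:16) and imm12 (bits 11:0). */
static uint32_t movt_imm16(uint32_t insn)
{
    return ((insn >> 4) & 0xf000u) | (insn & 0xfffu);
}

/* Hypothetical wrapper around the corrected macro with a range check on rd. */
static uint32_t movt_encode(unsigned rd, uint32_t x)
{
    assert(rd < 16);
    return MOVT_FIXED(rd, x);
}
```

For example, with `rd` = 3 and `x` = 0xF1234567, the corrected macro keeps the register field at 3 and encodes the immediate 0xF123, whereas the original macro lets the top four bits of `x` spill into the register field, which then reads as 15.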
