Caches and Self-Modifying Code

September 11, 2013


Ideally, caches act as some magic make-it-go-faster logic, sitting between your processor core (or cores) and your memory bank. Whilst it can be beneficial to consider specific cache features when writing performance-critical code, it is usually advisable to keep only general cache behaviour in mind. However, there are cases where the cache behaviour must be considered in order to get the result that you want, and self-modifying code is an excellent example.

Cached Arm architectures have a separate cache for data and instruction accesses; these are called the D-cache and the I-cache, respectively. For this reason, the Arm architecture is often considered to be a Modified Harvard Architecture, though I must admit that with most real processors existing somewhere between Harvard and von Neumann architectures, I do not find that label particularly useful. There are a few benefits of this design, but the one I have seen discussed the most often is that with two interfaces to the CPU, the core can load an instruction and some data at the same time.

Whilst employing this Harvard-style memory interface is useful for performance, it does have its own drawbacks. The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to Arm. On Arm, you can write instructions into memory, but because the D-cache and I-cache are not coherent, the newly-written instructions might be masked by the existing contents of the I-cache, causing the processor to execute old (or possibly invalid) instructions.

The Problem

Consider some hypothetical self-modifying code¹: some existing JIT-compiled code was generated at run-time to load a function address into a register and then branch to it. The JIT compiler has moved the target function to a new location, and needs to update the original pointer. This is a typical operation in JIT compilers, either because the destination address is not known at the time of compilation or because the destination function has been re-compiled in a new location with some additional optimizations.

A highly simplified view of the processor before the new code is written might look like this:

Simplified view of processor before new code

In this case, the I-cache begins with the old code already loaded. This may not always be true: code that has not been executed is unlikely to be in the I-cache, but it can be there even if it has never run. It is not practical to check this at run-time, so to be on the safe side we assume that the I-cache does contain the old code.

The processor can only execute instructions that are in the I-cache and can only see data that are in the D-cache. Generally, it cannot directly access memory. Significantly for us, it cannot directly execute instructions that are in the D-cache and cannot programmatically read or write data that are in the I-cache. Because we cannot directly write into the I-cache (or into memory), we will end up with something like this when we write out the new code:

If you now try to run the code you have just written, the processor will ignore it and will simply execute the old code because it is still in the I-cache and it does not know any different. This is a real nuisance for applications (such as JIT compilers) that employ self-modifying code!

The Solution

We need to get data from the D-cache into the I-cache. It is clear from the data-flow arrows on the diagram that there is only one way to achieve this: the data must move to the external memory and then into the I-cache from there.

At some point in the future, the processor may decide to write out the new data from the D-cache to memory, and it may also need to re-read from memory to the I-cache, but there is not a practical way to know when it will do those things, so we have to force it. The terminology surrounding caches varies between processor architectures, so I will emphasize important terms. Currently, the data we care about in the D-cache are new, and do not match the contents of memory. These are dirty data. Unsurprisingly, to push the data out to memory, we must clean them, then wait for the write to complete. The result looks something like this:

In order to execute the new instructions, we need to tell the processor that the contents of the I-cache are stale and need to be re-loaded from memory. We do this by invalidating the instructions in the I-cache. The results will look like this:

Invalidating instructions in I-cache results

If you now attempt to run the code you have written out to memory, the instruction fetch will miss in the I-cache and the processor will have to get the new version from memory. The result is that your newly-emitted code gets executed, as you intended.
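The sequence above can be sketched as a toy model in C. This is not how real hardware is programmed; it is a hypothetical single-word system (every name here, such as `toy_system` and `toy_fetch`, is invented for illustration) that only mimics the data-flow arrows in the diagrams: stores land in the D-cache, fetches come from the I-cache, and the two only meet in memory.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative placeholder values standing in for old and new instructions. */
#define TOY_OLD_INSN 0x1111u
#define TOY_NEW_INSN 0x2222u

/* Toy model: one word of "code" in memory, plus one D-cache line and one
 * I-cache line that may each hold a copy of it. */
typedef struct {
    uint32_t memory;       /* contents of external memory */
    uint32_t dcache;       /* D-cache copy of the word */
    int dcache_valid;
    int dcache_dirty;
    uint32_t icache;       /* I-cache copy of the word */
    int icache_valid;
} toy_system;

/* A data-side store updates only the D-cache and marks the line dirty. */
void toy_store(toy_system *s, uint32_t value) {
    s->dcache = value;
    s->dcache_valid = 1;
    s->dcache_dirty = 1;
}

/* An instruction fetch hits in the I-cache if the line is valid;
 * otherwise it refills the line from memory (never from the D-cache). */
uint32_t toy_fetch(toy_system *s) {
    if (!s->icache_valid) {
        s->icache = s->memory;
        s->icache_valid = 1;
    }
    return s->icache;
}

/* Clean: write dirty data out to memory. */
void toy_clean_dcache(toy_system *s) {
    if (s->dcache_valid && s->dcache_dirty) {
        s->memory = s->dcache;
        s->dcache_dirty = 0;
    }
}

/* Invalidate: discard the I-cache copy so the next fetch must refill. */
void toy_invalidate_icache(toy_system *s) {
    s->icache_valid = 0;
}

/* Walk through the problem and the solution; returns what finally executes. */
uint32_t toy_demo(void) {
    toy_system s = { .memory = TOY_OLD_INSN,
                     .icache = TOY_OLD_INSN, .icache_valid = 1 };

    toy_store(&s, TOY_NEW_INSN);            /* write out the new code */
    assert(toy_fetch(&s) == TOY_OLD_INSN);  /* the old code still runs */

    toy_clean_dcache(&s);                   /* memory now holds the new code */
    assert(toy_fetch(&s) == TOY_OLD_INSN);  /* but the I-cache line is stale */

    toy_invalidate_icache(&s);              /* force a refill from memory */
    return toy_fetch(&s);
}
```

In this model, cleaning alone is not enough and invalidating alone would simply re-read the old contents of memory; both steps are needed, in that order.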

That is not the whole story, however; there are other things that you need to do. On processors with branch prediction, you need to clear out the branch target buffer. Generally, processors will queue up writes to memory in a write buffer, so this also must be drained when cleaning the D-cache. Of course, in practice these tasks are very specific to the processor you are using, and you will use a library function to do this stuff anyway. It is important to understand what your library functions do and why they are necessary; it is not important to understand the nitty-gritty details of each specific processor if you just want to write self-modifying code.

Finally, you might consider using a pli instruction to hint to the processor that it should preload the new code into the I-cache. This might give you a decent performance boost, as it will not have to stall on memory when you eventually branch to it. Of course, being a hint, it might have no effect at all, but it can be beneficial on some implementations.
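As a rough sketch of what such a hint might look like from C, the hypothetical helper below (the name `preload_instructions` is an assumption, not part of any library) issues `pli` through GCC-style inline assembly when built for 32-bit Arm at Armv7 or later, uses the AArch64 equivalent `prfm plil1keep` on 64-bit Arm, and does nothing elsewhere. It assumes a compiler that supports GCC inline assembly and the ACLE `__ARM_ARCH` macro.

```c
#include <assert.h>

/* Hint that the code at 'addr' is likely to be executed soon.
 * Returns 1 if a preload hint was issued, 0 if this build has none.
 * The hint may be ignored by the processor, so nothing may depend on it. */
int preload_instructions(const void *addr) {
    assert(addr != 0);
#if defined(__aarch64__)
    __asm__ volatile("prfm plil1keep, [%0]" : : "r"(addr));
    return 1;
#elif defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    __asm__ volatile("pli [%0]" : : "r"(addr));
    return 1;
#else
    (void)addr;   /* no instruction-preload hint available in this sketch */
    return 0;
#endif
}
```

A JIT would call this after synchronising the caches and shortly before branching to the new code; whether it helps at all is something to measure on the target implementation.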

The Code

Diagrams and memory models are all very good, but how does this all relate to real code?

As usual, the relevant instructions are CP15 (System Control Coprocessor) operations, and cannot be run from non-privileged modes (where most applications run). In practice, this means that the operating system has to perform the operations for you. Fortunately, most systems provide a mechanism for cleaning or invalidating the cache from non-privileged modes, and these mechanisms will hide the processor-specific complexities.

Linux (GCC)

In GCC on Linux, you should use the __clear_cache function:

void __clear_cache(char* beg, char* end);

Of course, there is little documentation for this important function, and you have to root around a fair bit to find out what it actually does. Essentially, __clear_cache does the following (using a system call):

Clean the specified data cache range.
Invalidate the specified instruction cache range.

The start address (char* beg) is inclusive, whilst the end address (char* end) is exclusive.

The function will also flush the write buffer and perform any other necessary processor-specific fiddling about that you, as the caller, do not want to worry about. If you really want to know exactly what it does, you will need to look in the Linux kernel, in arch/arm/mm/cache-v7.S (or the equivalent file for whichever architecture you are using).
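A minimal sketch of how a code generator might use this is shown below. The helper name `write_code` is hypothetical, and the sketch calls the compiler builtin `__builtin___clear_cache` (provided by GCC and Clang) rather than declaring `__clear_cache` by hand. It assumes the destination is already mapped writable; making it executable is a separate concern that is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly generated instructions into place, then synchronise the
 * caches over exactly the bytes that were written. */
void write_code(void *dst, const void *src, size_t len) {
    char *beg = (char *)dst;   /* inclusive start of the range */
    char *end = beg + len;     /* exclusive end of the range */

    assert(dst != NULL && src != NULL);

    memcpy(dst, src, len);     /* the new code now sits in the D-cache */

    /* Clean the D-cache range and invalidate the I-cache range. On targets
     * with coherent instruction fetch this may compile to nothing. */
    __builtin___clear_cache(beg, end);
}
```

Only after `write_code` returns is it safe to branch into `dst`. Passing `beg + len` as the end address follows the inclusive/exclusive convention described above.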

There is an example which you can play with attached to this post.

Others

Operating System	Relevant Library Function
Linux (GCC)	`__clear_cache`
Google Android	`cacheflush`
Windows CE	`FlushInstructionCache`
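If the same code generator has to build on more than one of these systems, the differences can be hidden behind one small wrapper. The sketch below is an illustration only: `sync_code_range` is a hypothetical name, the Windows branch relies on `FlushInstructionCache` taking a process handle, a base address and a size, and every other platform falls back to the GCC/Clang builtin. Android's `cacheflush` is deliberately not called here, because its exact declaration is not given in this post.

```c
#include <assert.h>
#include <stddef.h>
#if defined(_WIN32)
#include <windows.h>
#endif

/* Make 'len' bytes of newly written code at 'start' visible to instruction
 * fetch. Returns 0 on success and -1 on failure. */
int sync_code_range(void *start, size_t len) {
    assert(start != NULL);
#if defined(_WIN32)
    /* FlushInstructionCache returns nonzero on success. */
    return FlushInstructionCache(GetCurrentProcess(), start, len) ? 0 : -1;
#else
    /* GCC/Clang builtin; the end address is exclusive. */
    __builtin___clear_cache((char *)start, (char *)start + len);
    return 0;
#endif
}
```

Keeping the range as a start address plus a length avoids mixing up the inclusive and exclusive ends at each call site.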

¹
Whilst the use of self-modifying code is generally discouraged by many (including myself), there are a small number of cases where its use is essential. An excellent example is that of a JIT compiler, where fragments of code are compiled at run-time. A more common (though less obvious) example is that of an operating system kernel: from the point of view of the processor, some code in the system is modifying some other code in the system every time a process is swapped in or out.

clear_cache.tar.gz

NOTAN over 7 years ago

Definition of MOVT Macro in attachment might have a problem:

// MOVT(rd,x) emits this: "MOVT rd, #:upper16:x"

#define MOVT(rd,x) (0xe3400000 | (((x)>>12)&0xf0000) | ((rd)<<12) | ((x)>>16))

If x > 0x0fffffff, the upper 4 bits of x are ORed with rd. This leads to problems if the function address is very high.

Should be:

#define MOVT(rd,x) (0xe3400000 | (((x)>>12)&0xf0000) | ((rd)<<12) | (((x)>>16)&0xfff))
jyothi over 8 years ago

Gr8 info.

I tried using __clear_cache() on mapped memory, but it didn't work, and I'm not sure what the issue may be. I'm trying this on Armv7-A Cortex-A9 processors, and used the PROT_READ, PROT_WRITE and PROT_EXEC flags when mapping the memory. My requirement is to flush and invalidate the data cache.

When I went through cache-v7.S, I found two procedures: v7_dma_inv_range (described as invalidating the data cache within the specified region) and v7_flush_cache_range (described as flushing a range of TLB entries in the specified address space).

How can I make use of those routines, or any similar routines, from user space?

By default, is user mode able to flush or invalidate the cache?

Do we need to set some permissions or rights for user mode?
Jacob Bramley over 9 years ago
Oops, sorry, I don't remember seeing a notification of your comment for some reason. I suspect that you've probably found your problem by now, but just in case, here are some comments:

The syscall number on Linux (__ARM_NR_cacheflush) is 0x000f0002, not 0x00f00002. This is most likely the cause of your SIGILL.

Are the start and end addresses calculated correctly? The range calculation refers to address 0xc844, but the start address (r0) is 0xc850, just after your code.

This looks like hand-written assembly. If it is, you should use labels rather than explicit addresses. This will keep the code maintainable.

Here's an example:

    push {r0-r2, r7}
    adr r0, start_label
    ldr r1, =end_label
    mov r2, #0
    ldr r7, =0x000f0002
    svc 0
    pop {r0-r2, r7}
start_label:
    ...             @ Patched code.
end_label:
linux_feixue over 10 years ago

When I insert the following shellcode to call cacheflush via an svc call:
.text:0000C82C 87 00 2D E9                 STMFD   SP!, {R0-R2,R7}
.text:0000C830 0F 00 A0 E1                 MOV     R0, PC
.text:0000C834 18 00 80 E2                 ADD     R0, R0, #0x18
.text:0000C838 28 10 9F E5                 LDR     R1, =(dword_CC50 - 0xC844)
.text:0000C83C 0F 10 81 E0                 ADD     R1, R1, PC ; dword_CC50
.text:0000C840 00 20 A0 E3                 MOV     R2, #0
.text:0000C844 14 70 9F E5                 LDR     R7, =0xF00002
.text:0000C848 00 00 00 EF                 SVC     0
.text:0000C84C 87 00 BD E8                 LDMFD   SP!, {R0-R2,R7}
When I use this shellcode once, the result is correct, but when I place the shellcode in two places, the patched ELF seems to run correctly but then crashes. I used logcat to capture the error:
I/DEBUG(22666): signal 4 (SIGILL), code 4 (ILL_ILLTRP), fault addr 0000c848
Does anybody know why?
Jacob Bramley over 10 years ago

Yep, indeed. It doesn't matter whether the code is new or was existing code that you are modifying. If you write to memory, it won't necessarily be seen immediately by the instruction side. Basically, instructions being executed are always loaded through the instruction side, but loads and stores using the ldr or str instructions will always go through the data side. This is what makes Arm a (modified) Harvard architecture.

For example, if you execute an instruction and then load it as data (using ldr), you will have two copies of the data in cache, one in the instruction cache and one in the data cache. If you then write to memory, the data cache will be updated, but the instruction cache might not. Synchronising the caches with __clear_cache updates the instruction cache to match whatever was written to the data cache.

Something else that's important to understand is that the processor (and caches) have no idea about what a "code segment" or a "data segment" actually is; they are just tool chain constructs. Once your code is running, it can access the code as if it is data and — if the OS allows — execute from the data segment. If you're executing from the data segment, the instructions are still loaded into the instruction cache. The processor doesn't know that it's a data segment, it just knows that the pc points there and that it needs to execute an instruction.
