Many years ago, I wrote about the need for self-modifying code to synchronize caches, and how to use a function like __clear_cache to do that. I recently described how that is actually implemented. However, those articles left out an important consideration: what happens if the code you are generating will be (or might be) executed by a different CPU core?
Moving away from a single-threaded execution environment adds complications, and there are many possible applications, with different constraints and goals.
At one extreme, it is acceptable to perform full synchronization. This is the easiest to prove correct, and the Arm ARM lists a suitable code sequence[1], but it requires the executing thread to explicitly cooperate with the synchronization process, and this is not always desirable.
At the other extreme, a JIT compiler might want to update some code in a best-effort fashion. This technique could be used when a function is optimized: the older, unoptimized code still has correct behaviour, and if optimization is a relatively rare event, it might be faster to run the old code a few more times than for the executing thread to explicitly synchronize for every function call.
A note on terminology: The Arm ARM uses terms such as “observers”, “Processing Elements” (PEs) and “shareability domains”, and assigns them specific meanings. All of these terms are important, but software engineers tend to think about higher-level ideas, such as threads. Threads might co-exist on a single core (which is a “processing element”), but we typically have to assume that they will run concurrently, on multiple cores. I will therefore talk about threads and cores, because they are terms familiar to most software engineers, but be aware that this is an abstract software model mapped onto the architecture by convention.
I am going to focus on AArch64 here, but the same concepts apply to AArch32 (with some translation of code sequences).
In my article about the single-threaded case, I showed that an Arm core has separate instruction and data caches. This was a simplification but was sufficiently accurate to illustrate the problem.
Perhaps the most obvious difference in a multi-core system is that each core has its own instruction and data caches. However, at least for the full-synchronization case, this does not change much, because the ic and dc operations are visible to the other cores, and the dsb barriers ensure that the operations are observed in program order.
The real obstacle to multi-threaded synchronization is the pipeline on the core that will eventually execute the code. For example, consider this example in a traditional three-stage — fetch, decode, execute — pipeline:
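             cycle 1    cycle 2    cycle 3    cycle 4    cycle 5
   insn 0:   fetch      decode     execute
   insn 1:              fetch      decode     execute
   insn 2:                         fetch      decode     execute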
This is another simplification, but it is sufficient to understand the general principle: the processor can be working on future instructions — fetching, decoding and executing them — before the current instruction (in program order) has completed. If we modify the instruction stream, the core executing those instructions has to be notified, just in case it has already fetched them.
We usually do this with an isb instruction. Conceptually, it discards any work already done to execute instructions that occur after the isb itself. The typical single-threaded synchronization sequence ends with an isb, for this reason. However, the isb instruction is not broadcast to other cores, so they do not see the isb that the code-generating core executes:
    str <new_instruction>, [x0]
    dc cvau, x0
    dsb ish
    ic ivau, x0
    dsb ish
    isb                            // <- This is not broadcast to other cores.
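For completeness: C compilers expose this whole sequence as a builtin, so most code never writes it by hand. A minimal sketch, assuming GCC or Clang (the function name is mine):

#include <stdint.h>

// Write one instruction, then perform the maintenance sequence shown above.
// __builtin___clear_cache performs the dc/dsb/ic/dsb/isb maintenance for the
// given range; the trailing isb still affects only the calling thread.
void write_insn(uint32_t *addr, uint32_t insn) {
    *addr = insn;
    __builtin___clear_cache((char *)addr, (char *)(addr + 1));
}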
The main obstacle, then, is how to ensure that the executing cores get the new instructions even if they have already fetched the old ones that were replaced.
The simplest solution to this is that given in the Arm ARM[1]: after the usual dc/ic sequence completes, execute an isb on the thread that will run the code.
Whilst true, this does not specify how to enforce a correct order of operations. Some kind of control-flow synchronization is required between the cores:

- The executing thread must not run the modified code until the compiling thread's dc/ic sequence has completed.
- The executing thread must then execute an isb before running any of the modified code.
Both concerns can be handled with a mutex, but the executing thread would additionally have to execute an isb before entering any function that may have been recompiled. The combined overhead is likely to be unacceptable if applied on a per-function basis, but the approach can still work if code is compiled in larger batches.
The advantage of this approach is that it is relatively simple and easy to validate, so there is little room for error. A minimal implementation might look like this:
// Compiling thread.
    str <new_instruction>, [x10]   // Overwrite `fn`.
    dc cvau, x10
    dsb ish
    ic ivau, x10
    dsb ish
    mov w2, #1
    str w2, [x1]                   // Notify that the new code is ready.

// Executing thread.
wait:
    ldr w0, [x1]                   // Wait for the code to be ready.
    cbz w0, wait
    dsb ish                        // Ensure that the ldr is complete. Note that a
                                   // 'dmb' is not sufficient.
    isb                            // Re-fetch subsequent instructions.
fn:
    ...                            // To be written by compiling thread.
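The same handshake can be sketched at the C level. This is an illustration under some assumptions, not a drop-in implementation: the names are hypothetical, it assumes GCC or Clang on AArch64, and the dsb/isb pair must be written as inline assembly, because a C11 acquire load maps to (at most) a dmb:

#include <stdatomic.h>
#include <stdint.h>

extern uint32_t *fn;           // Instruction slot to be overwritten (`fn` above).
extern atomic_int code_ready;  // Flag corresponding to [x1] above.

// Compiling thread.
void publish(uint32_t new_insn) {
    *fn = new_insn;                                         // Overwrite `fn`.
    __builtin___clear_cache((char *)fn, (char *)(fn + 1));  // dc/dsb/ic/dsb (plus a local isb).
    atomic_store_explicit(&code_ready, 1, memory_order_release);
}

// Executing thread.
void wait_and_run(void) {
    while (atomic_load_explicit(&code_ready, memory_order_acquire) == 0) {
        // Wait for the code to be ready.
    }
    __asm__ volatile("dsb ish\n\tisb" ::: "memory");  // As above: a dmb is not sufficient.
    ((void (*)(void))fn)();  // Run the new code. (The cast is a sketch-level liberty.)
}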
Although the isb instruction is often the easiest way to ensure that the new instructions take effect, it is not strictly the only way. isb is one of a few things that cause a “context synchronization event”[2], and any of these events will suffice for our purposes. Most notably, taking or returning from an exception has the same effect, which, for traditional user-space code, includes system calls (through svc), signals and task-scheduling interrupts, for example.
If you can somehow guarantee that every executing thread has called into (or been interrupted by) the kernel, then they have already experienced a context synchronization event, and do not need an explicit isb. It’s possible to achieve this in Linux using the membarrier system call.
The compromise is that membarrier adds overhead to the compiling thread, but if you expect the executing thread to run the code many times, it might be worthwhile.
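On Linux, this might look like the following sketch. The wrapper and function names are mine, error handling is omitted, and it assumes a kernel that supports the SYNC_CORE commands (Linux 4.16 or later):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc provides no wrapper for membarrier(2), so call it directly.
static int membarrier(int cmd, unsigned int flags, int cpu_id) {
    return (int)syscall(__NR_membarrier, cmd, flags, cpu_id);
}

// Once, at process start-up: opt in to the SYNC_CORE commands.
void init_code_sync(void) {
    membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}

// Compiling thread: call this after the usual dc/ic sequence completes. When
// it returns, every thread in the process has experienced a context
// synchronization event, so no executing thread needs an explicit isb.
void sync_all_threads(void) {
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}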
At the other extreme, the AArch64 architecture permits specific instructions to be exchanged concurrently, with no synchronization whatsoever. The set of exchangeable instructions is quite limited[1], but includes direct branches and nop, which can be used to build more complex patterns. Both the old and the new instructions must belong to this set, but if you can meet this constraint, the architecture guarantees that either the old or the new instruction is executed.
Note that if multiple instructions are modified in this way, it is possible for a mixture of old and new instructions to be executed, irrespective of the order in which they were updated. For example:
// Old code        New code         Other possible executions
   bl <fn_a>       cbz .+8          bl <fn_a>      cbz .+8
   nop             bl <fn_b>        bl <fn_b>      nop
In addition, it is architecturally possible for an old instruction to be executed even if the core has already executed a new instruction at a given address. In general, software should assume that the old-versus-new selection of each instruction is independently determined at the point that each instruction is executed. There is no time limit after which the new code can be assumed to be reliably executed by all threads.
Effective, safe use of this technique therefore relies on the updated instructions being logically independent from one another, or on periodic use of stronger synchronization to limit the set of possible executions.
For example, a JIT compiler might optimize a function and redirect existing bl instructions to the new version without synchronization. The new function would be faster, but functionally identical to the old one, so each call site is logically independent. The JIT compiler must keep the old function in memory until it can perform stronger synchronization (as described above) to guarantee that the old function won't be called again. This scheme allows the synchronization cost to be amortized, rather than paid each time a call site is updated.
In this example, you might wonder why the new function itself does not require an isb. Normally, any new code would require an explicit isb (or other context synchronization event) on the executing core, but “prefetch speculation protection” makes this unnecessary.
Prefetch speculation protection[3] is an architectural guarantee that makes some minimally-synchronized code updates possible.
Put simply:

- One thread writes a new function (call it A), then completes the usual dc/ic maintenance sequence for it.
- The same thread then overwrites an instruction (call it B) that leads to A, using one of the concurrently-modifiable instructions, such as a bl.

Any other thread that executes the new instruction B is guaranteed to also fetch the new function A. For example:
// Compiling thread.
    str <new_instruction>, [x10]   // Compile `new`.
    dc cvau, x10
    dsb ish
    ic ivau, x10
    dsb ish
    mov w1, #(`bl new`)            // Replace `bl old` with `bl new`.
    str w1, [x11]

// Executing thread.
    bl old
    ...

old:
    ...                            // Old function implementation.
new:
    ...                            // To be written by compiling thread.
The compiling thread writes the new function and performs the full dc/ic sequence before replacing the bl. The bl can be concurrently modified without synchronization, and if the executing thread takes the new branch, it is guaranteed to see an up-to-date new function too.
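At the C level, the call-site update might look like the sketch below. The helpers are hypothetical, it assumes the target is within bl's +/-128 MB range, and (as above) `new` must already have been written and cleaned before the call site is redirected:

#include <stdatomic.h>
#include <stdint.h>

// Encode `bl <target>` for a call site at `site`: opcode 0x94000000 with a
// signed 26-bit word offset, relative to the address of the bl itself.
static uint32_t encode_bl(uint32_t *site, void *target) {
    int64_t off = ((char *)target - (char *)site) >> 2;
    return 0x94000000u | ((uint32_t)off & 0x03ffffffu);
}

// Redirect one call site to `new_fn`. A single aligned 32-bit store means
// that an executing thread sees either the old bl or the new one, never a
// torn mixture.
void redirect_call_site(uint32_t *site, void *new_fn) {
    atomic_store_explicit((_Atomic uint32_t *)site, encode_bl(site, new_fn),
                          memory_order_relaxed);
    // Clean/invalidate so that instruction fetch (eventually) sees the update.
    __builtin___clear_cache((char *)site, (char *)(site + 1));
}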
This technique requires very little explicit synchronization, but it has one notable limitation: it can only be used if access to the new function depends on one of the few instructions that can be concurrently modified. If you need an indirect branch at the call site, for example, you will need to use a different (or combined) approach.
Multi-threaded code is notoriously difficult to test. Software faults might become apparent only on particular CPU implementations, or under particular load patterns. For this reason, we prefer to build high-level primitives (such as mutexes) that we can understand more intuitively. However, when we are implementing those primitives, we have to be confident that they are correct.
I validated the examples in this article using a combination of manual analysis (against the formal concurrency model in section B2.3 of DDI0487 L.a) and output from the herd7 tool, part of the herdtools7 suite. herd7 can validate sequences by checking that all possible executions meet some proposition that you define, and was described by Jade Alglave in an earlier article.
There is a range of cross-thread synchronization options available, each with its own advantages and disadvantages. The best choice will depend on the workload you need to support.
There are many possible variations, and many possible system interactions that are difficult to generalize. However, I've shown some specific examples that should help in some likely cases, and tools like herd7 can help in more complicated ones.
[1] See section B2.2.5 of DDI0487 L.a: “Concurrent modification and execution of instructions”
[2] “Context synchronization event” is defined in the glossary of DDI0487 L.a.
[3] “Prefetch speculation protection” is a name used in older editions of the Arm ARM (such as DDI0487 I.a). In recent versions, the required behaviours are covered by the formal concurrency model (as in section B2.3 of DDI0487 L.a), but this specific set of properties no longer has its own name.