Modern processors aggressively reorder instructions to maximize performance. However, instructions cannot be reordered arbitrarily: all data dependences within a single thread, whether through memory or registers, must be honored. Unfortunately, maintaining correctness in multi-threaded and emerging persistent applications is more complicated. Today, coarse-grain memory barriers must be inserted to constrain reordering, but they can significantly impair performance. To remedy this, in collaboration with the University of Illinois at Urbana-Champaign, we investigate a new set of instructions that lets a program precisely define its fine-grain ordering requirements. With these proposed instructions, we can reduce the need for memory barriers and thereby improve performance.
Current processors maximize performance by executing instructions concurrently across multiple functional units. To minimize stalling, processors often execute instructions out of order. All of this is done while preserving, for the end user, the illusion of sequential execution within a single thread. Without that guarantee, any program's output would be hopelessly scrambled.
Within a single thread, correctness is maintained by honoring all data dependences, which can be in the form of either register or memory dependences. Figure 1 shows a short sequence of instructions (whose functionality is described later in this article) and their data dependences. Within the figure, there are register dependences (green arrows) between instructions #1 -> #2 (x3), #1 -> #4 (x3), #3 -> #4 (x4), and #4 -> #5 (condition flags). In addition, there is a memory dependence (blue arrow) between instructions #1 -> #3.
Figure 1: Register and memory dependences
Overall, these dependences create a partial ordering, such as that in Figure 2, which guides the processor in deciding when to execute each instruction.
Figure 2: Dependence partial ordering
As shown in Figure 2, the processor must guarantee that instruction #1 executes before all the others. Instructions #2 and #3, however, are unordered with respect to each other: they can execute in parallel, or even in the opposite order. Overall, data dependences give the processor the guidance it needs to achieve maximal performance while maintaining correctness.
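Since Figure 1 is an image, here is a plausible AArch64 sequence consistent with the dependences it describes, based on the registers named above and the hazard-pointer code discussed later in this article (the figure's exact operands may differ):

    retry:
        ldr  x3, [x1]    // #1: load x3 from [x1]
        str  x3, [x2]    // #2: store x3 to [x2]      (register dep on #1 via x3)
        ldr  x4, [x1]    // #3: reload [x1] into x4   (memory dep on #1: same address)
        cmp  x3, x4      // #4: compare               (register deps on #1 and #3; sets the flags)
        b.ne retry       // #5: conditional branch    (register dep on #4 via the condition flags)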
While the processor ensures that a single thread's execution appears sequential, the same does not hold for multi-threaded applications. In multi-threaded applications, it is the program's responsibility to ensure that data races between threads do not lead to incorrect results. Likewise, in emerging persistent applications, it is the program's responsibility to ensure that data reaches the non-volatile media in an order from which a consistent state can be recovered.
Currently, beyond respecting a thread's register and memory dependences, the only other tool provided to restrict instruction reordering is the barrier. At a high level, barriers are a coarse way to order large swaths of instructions around the point where the barrier is inserted. For instance, when AArch64's full memory barrier, DMB ISH (Data Memory Barrier, Inner Shareable domain), is encountered, memory instructions after the barrier must wait to execute until all prior memory instructions have completed.
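As a classic illustration (our own minimal example, not taken from the figures), consider a producer that writes a payload and then sets a ready flag. The two stores must stay ordered, and the barrier achieves this:

    // Producer side of a message-passing pattern.
    // x0 points to the payload, x1 to the ready flag.
    str x5, [x0]    // write the payload
    dmb ish         // order ALL prior memory accesses before ALL later ones
    str x6, [x1]    // publish: set the ready flag

Note the coarseness: the barrier orders not just these two stores, but every memory access before it against every memory access after it.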
While barriers are effective for maintaining correctness, they can also significantly impact performance: they cause long stalls while the processor's functional units wait for all prior memory instructions to complete.
Given this performance impact, a program would be wise to limit the number of barrier instructions it uses. Unfortunately, inserting barriers is often unavoidable.
One scenario where a barrier is needed is when using hazard pointers, a technique that ensures shared data is not freed while a thread may still be accessing it. When using hazard pointers, a thread must first announce the location it is about to access, so that the location is not freed underneath it. Figure 3 shows how this announcement can be performed on AArch64.
Figure 3: Hazard pointer announcement
Within the figure, the location to access is first retrieved (inst #1). Next, the location is stored to the thread's hazard pointer, announcing what the thread is about to access (inst #2). Afterwards, the location is read again (inst #4) and checked to make sure it did not change (inst #5). If the location changed, the process is repeated (inst #6) until the thread successfully announces the location it will access.
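In AArch64 assembly, this description corresponds to a sequence along the following lines: the same code as Figure 1, with the barrier inserted as inst #3 (register assignments follow the text; the label is ours):

    retry:
        ldr  x3, [x1]    // #1: read the location to be accessed
        str  x3, [x2]    // #2: announce it in this thread's hazard-pointer slot
        dmb  ish         // #3: full barrier (discussed below)
        ldr  x4, [x1]    // #4: re-read the location
        cmp  x3, x4      // #5: did it change since the announcement?
        b.ne retry       // #6: if so, try again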
The key step of the algorithm is validating that the announced location did not change before the announcement was made. In other words, it must be guaranteed that the second load of the location (inst #4) happens after the announcing store (inst #2). Unfortunately, as the partial ordering in Figure 2 shows, these two operations are unordered. So, enforcing this store-to-load ordering currently requires a full barrier (DMB ISH) on AArch64 (inst #3).
In the prior example, a barrier is needed solely to guarantee that ldr x4, [x1] executes after str x3, [x2]. A better solution would be to give the program a way to tell the processor that this one ordering requirement exists. To this end, in our upcoming work presented at the 2021 International Symposium on Computer Architecture (ISCA), we introduce new instructions that allow a program to do exactly this. Extending the notion of data dependences, we expose a new class of dependences that we call execution dependences. We define an execution dependence as a required ordering between two individual instructions, where the effects of the sink instruction cannot be observed until the source instruction has completed.
Revisiting Figure 3, with EDE (our Execution Dependence Extension) the program can tell the processor that there is an execution dependence between instructions #2 -> #4. The barrier (inst #3) can therefore be removed, preventing any unnecessary stalling.
Within EDE, we have added instructions that allow execution dependences between instructions to be defined. Similar to how data dependences are expressed through registers, EDE provides a way to link two instructions together with Execution Dependence Keys (EDKs). Please see our paper for more details on how EDKs are used to define execution dependences.
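Our paper defines the actual instruction encodings; purely for illustration, the sketch below uses invented notation (str.src, ldr.snk, and the edk0 operand are our own, not the architected syntax) to show how an EDK could link the announcing store to the validating load:

    retry:
        ldr      x3, [x1]          // #1: read the location
        str.src  x3, [x2], edk0    // #2: hypothetical source: tag this store with EDK 0
        ldr.snk  x4, [x1], edk0    // #4: hypothetical sink: this load's effects cannot
                                   //     be observed until the EDK-0 source completes
        cmp  x3, x4                // #5: validate the announcement
        b.ne retry                 // #6: retry if it changed
    // No DMB ISH is needed: only the two linked instructions are ordered,
    // leaving every other memory access free to reorder.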
Given the execution orderings conveyed by EDE, it is the hardware's responsibility to honor these dependences. While many hardware implementations of EDE are possible, so far we have chosen to evaluate two practical designs. Figure 4 shows a high-level outline of how instructions traditionally propagate through a processor and which components our two implementations modify.
Figure 4: Processor components modified to support EDE
Our first implementation, IQ, enforces execution dependences at the issue queue. Within IQ, an execution dependence sink instruction is delayed at the issue queue until the corresponding execution dependence source instruction has completed.
Our second implementation, WB, caters to the persistent application domain, focusing on using EDE to order stores and cache-line writeback instructions. In WB, execution dependences are enforced within the write buffer. By delaying enforcement until after retirement, WB can further reduce the number of stalls.
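As a sketch of the kind of persist ordering WB targets, consider a minimal log append (our own example, assuming ARMv8.2's DC CVAP instruction, which cleans a cache line to the point of persistence). Today, each step must be fenced; under EDE, each barrier could instead become a single execution dependence between the writeback and the store that follows it:

    // Append a log entry, then mark it valid, with today's barriers.
    // x0 points to the entry, x1 to its valid flag.
    str x5, [x0]    // write the log entry
    dc  cvap, x0    // clean the entry's cache line to the point of persistence
    dsb ish         // stall until the clean completes
    str x6, [x1]    // only now is it safe to mark the entry valid
    dc  cvap, x1    // clean the flag's cache line as well
    dsb ish         // stall again before continuing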
To evaluate EDE, we adapted several persistent applications to use our new instructions, and implemented our two hardware designs within a simulator. Overall, we found that IQ and WB attained average workload speedups of 18% and 26%, respectively, confirming EDE's potential impact.
More details about the overhead of barriers in persistent applications, EDE's new instructions, and our hardware implementation can be found in our paper. However, there is still much work we can do. In particular, we believe that there are many other multi-threaded use cases for EDE beyond the hazard pointer example described previously. We also believe that if the concept of execution dependences were to be introduced into a compiler's internals, the compiler would be able to perform most of the heavy lifting associated with using EDE instructions to eliminate barriers. In this way, EDE would be able to provide application speedups transparently without user involvement.
If you are interested in learning more about EDE or have other compelling EDE use cases, please do not hesitate to reach out.
Read the full paper
Contact Ilias Vougioukas