In recent years, ISA vendors across the industry have been introducing increasingly expressive SIMD instruction sets incorporating features to better support auto-vectorization. For example, the Arm Scalable Vector Extension (SVE) [1] provides predication, a rich set of horizontal operations, vector partitioning, and gather and scatter memory accesses. This expands the range of programs that can be vectorized. However, limitations in compilers' memory alias analysis and the presence of infrequent memory dependencies can still unnecessarily restrict vectorization, resulting in suboptimal performance.
To address this problem, Arm Research have joined forces with the research group led by Timothy Jones at the University of Cambridge. Together, they are investigating a hardware-software codesign solution: Speculative Vectorization with Selective Replay (SRV). SRV requires only a lightweight ISA extension, and its implementation leverages the memory address disambiguation logic commonly found in modern out-of-order micro-architectures.
The results of this work have recently been presented at the ISCA 2021 conference.
Current auto-vectorization techniques generate vector code only when the code transformations are proven to be safe. In other words, the compiler or programmer must ensure that vectorization does not violate any of the data dependencies in the original sequential program. Among these, dependencies involving memory accesses have a more dynamic nature and are therefore harder to deal with. Sometimes the compiler is unable to prove that different memory accesses do not alias, mostly due to intrinsic limitations in alias analysis. In other cases, dependencies that are possible at compile time manifest rarely, or not at all, at runtime. Figure 1 illustrates a code example with some memory dependencies:
Figure 1: Code example with read-after-write (RAW) memory dependencies.
Assuming that the array x is initialized with the values {3, 0, 1, 2, 7, 4, 5, 6, 11, …}, a read-after-write (RAW) dependency occurs after every four vector elements. A compiler targeting a traditional vector architecture cannot make assumptions about the specific memory locations accessed at runtime, so it is forced to take a conservative approach and inhibit vectorization for this particular loop. This leaves performance on the table, as up to four vector elements could be safely processed in parallel.
FlexVec [2] is one of the earliest proposals to introduce a dedicated instruction (VPCONFLICTM) for checking conflicts between addresses at runtime. The result of VPCONFLICTM is a vector predicate, or mask, that can be used to partition the loop so that vectorization is applied safely. While this approach is a step forward, it comes with non-negligible runtime costs, which grow with the number of potentially conflicting memory access instructions in a vectorized code region. In addition, even in an ideal scenario the previous loop would require two iterations to process a vector’s worth of data, and vector partitioning using VPCONFLICTM may require more passes depending on the implemented vector length. In fact, to limit the overhead and the amount of metadata needed to track which lanes have been processed safely, it is simpler to generate code that processes only a single set of adjacent vector elements at a time.
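A scalar emulation makes the cost structure of this style of runtime check concrete. The sketch below is our illustration, not FlexVec's actual semantics: it computes how many leading lanes are conflict-free, so the loop advances by that many elements per pass, and every pass pays an all-pairs address comparison.

```c
/* Scalar emulation of a VPCONFLICTM-style check (an illustration only; the
 * real instruction produces a predicate over address vectors in hardware).
 * w[] holds each lane's write address and r[] its read address. The safe
 * partition ends at the first lane whose read address matches a write
 * from an older lane in the same group. */
int conflict_free_prefix(const int *w, const int *r, int vl) {
    for (int i = 1; i < vl; i++)
        for (int j = 0; j < i; j++)
            if (w[j] == r[i])
                return i;      /* lanes 0..i-1 can run together safely */
    return vl;                 /* no conflicts: the whole vector is safe */
}
```

Assuming a loop of the form a[x[i]] = a[i] + c (so w[i] = x[i] and r[i] = i), the first pass covers only lanes 0–2 for the example array above, and the quadratic comparison cost recurs on every pass and for every potentially conflicting memory instruction.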
To overcome these limitations, we started from a set of observations.
These observations drove us to develop SRV as a speculative hardware-software codesign technique and to mitigate its implementation costs by leveraging the existing memory disambiguation hardware. The next section describes SRV in more detail.
SRV relies on the compiler to identify regions of code, usually loop bodies, that are candidates for speculative vectorization, and to mark those regions with SRV-start and SRV-end instructions. A micro-architecture implementing SRV executes those regions of code (SRV-regions) while monitoring the accessed memory addresses to detect dependency violations across vector lanes. At the end of a region, if violations are detected, the code within the region is replayed with only those lanes enabled that operated on the wrong data. This is achieved using a predication mechanism: older and younger lanes that operated on the correct data need not be replayed.
Figure 2. In a baseline out-of-order micro-architecture, only a few components (grey-shaded) require modifications to support SRV.
In more detail, on execution of an SRV-start, the Program Counter (PC) of the following instruction is recorded to enable replaying the SRV-region. The execution of the SRV-region then proceeds in the same way as regular code. However, Read-After-Write (RAW) memory dependency violations across different vector lanes are recorded in the background using a predicate register called SRV-needs-replay. In addition, due to the speculative nature of this approach, data stored to memory cannot leave the core and enter the cache hierarchy until it becomes non-speculative. On completion of the SRV-region, when encountering an SRV-end instruction, if no violations are detected (indicated by an all-0 SRV-needs-replay register), then the speculative state can be committed, and execution can continue beyond the SRV-region. If, instead, some violations are detected, execution rolls back to the beginning of the SRV-region using the recorded PC, and only those lanes that have used incorrect data are replayed. Rollback may occur multiple times before all lanes are processed correctly.
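As a behavioural sketch only (the hardware keeps speculative stores in the load/store queue rather than committing lane by lane, and detects violations as addresses resolve), the detect-and-replay cycle can be modelled in software using a simple scatter-update loop, a[x[i]] = a[i] + 1, as the region body:

```c
/* Software model of one SRV region running a[x[i]] = a[i] + 1 over vl
 * lanes (vl <= 32). Lanes flagged in the SRV-needs-replay predicate are
 * rolled back and re-executed; lanes that used correct data commit and
 * are masked off on subsequent passes. Returns the number of passes. */
int srv_region(int *a, const int *x, int vl) {
    unsigned active = (1u << vl) - 1;   /* lanes still to be executed */
    int passes = 0;
    while (active) {
        unsigned needs_replay = 0;      /* the SRV-needs-replay predicate */
        for (int i = 0; i < vl; i++) {
            if (!(active & (1u << i)))
                continue;
            /* Horizontal RAW check: an older still-active lane writes a[i]. */
            for (int j = 0; j < i; j++)
                if ((active & (1u << j)) && x[j] == i)
                    needs_replay |= 1u << i;
        }
        for (int i = 0; i < vl; i++)    /* commit the lanes that were safe */
            if ((active & (1u << i)) && !(needs_replay & (1u << i)))
                a[x[i]] = a[i] + 1;
        active = needs_replay;          /* roll back and replay the rest */
        passes++;
    }
    return passes;
}
```

For the example index array {3, 0, 1, 2, 7, 4, 5, 6}, the model flags lanes 3 and 7 on the first pass, commits the other six lanes, and finishes in a second pass with only those two lanes enabled, reproducing the sequential result.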
The detection of memory dependency violations while executing an SRV-region is implemented by extending the canonical memory disambiguation logic. This already performs checks across different instructions (here referred to as vertical disambiguation), but, to support SRV, it is extended with checks performed across different vector lanes (horizontal disambiguation). This extension can be achieved relatively cheaply by adding extra lane-specific metadata to the CAM-based structures used for vertical disambiguation.
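To illustrate the idea (a simplified sketch, not the paper's exact structure), tagging each disambiguation entry with a lane field lets a single comparison establish logical age: inside an SRV-region, program order follows the original loop, so the lane (iteration) is compared first and the instruction sequence number second.

```c
/* Sketch of a lane-tagged disambiguation comparison. seq numbers
 * instructions in program order (smaller = older); lane numbers elements
 * within a vector instruction. Inside an SRV-region, logical order follows
 * the original scalar loop: iteration (lane) first, then instruction
 * order within the iteration. */
struct mem_op { int addr; int seq; int lane; };

int store_is_logically_older(struct mem_op st, struct mem_op ld) {
    if (st.addr != ld.addr)
        return 0;                   /* different address: no dependency */
    if (st.lane != ld.lane)
        return st.lane < ld.lane;   /* horizontal: across vector lanes */
    return st.seq < ld.seq;         /* vertical: across instructions */
}
```

A load that has already executed and later matches a logically older store in this comparison is exactly the cross-lane RAW case that sets the corresponding bit of SRV-needs-replay.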
While RAW violations trigger re-execution, Write-After-Read (WAR) and Write-After-Write (WAW) violations are handled immediately. For WAR, store-to-load forwarding is suppressed whenever the store belongs to a logically younger lane than the load, even if the store comes from an older instruction. For WAW, an augmented store queue records the lane index associated with each store so that the most recent version of the data (in program order) is eventually written back.
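For instance (a simplified software sketch, assuming a small lane-tagged store queue), WAW resolution needs only the lane index to pick the logically last writer for each address at write-back:

```c
/* Sketch of WAW resolution with a lane-tagged store queue: when the queue
 * drains, each address receives the value from the highest-numbered, i.e.
 * logically youngest, lane that stored to it, matching the order of the
 * original scalar loop. */
struct store_entry { int addr; int lane; int data; };

void drain_stores(const struct store_entry *sq, int n, int *mem) {
    for (int e = 0; e < n; e++) {
        int youngest = e;
        for (int k = 0; k < n; k++)
            if (sq[k].addr == sq[e].addr && sq[k].lane > sq[youngest].lane)
                youngest = k;
        if (youngest == e)          /* only the last writer reaches memory */
            mem[sq[e].addr] = sq[e].data;
    }
}
```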
Various implementation details are introduced to correctly handle the different vector element sizes and the different types of memory access found in vector code, such as contiguous, indexed, and load-and-broadcast accesses; their full descriptions can be found in the paper.
Using this selective-replay vectorization architecture, which augments a standard out-of-order pipeline, we can limit the overheads of detecting and correcting dependency violations, allowing compilers to vectorize freely, even in the presence of unknown dependencies.
For the evaluation of the proposed technique, we selected a mix of general-purpose, scientific, and HPC workloads from the following benchmark suites: SPEC CPU2006, NAS Parallel Benchmarks (NPB), Livermore Loops, SSCA2, and HPC Challenge.
A system supporting the SRV architecture was prototyped using the gem5 simulator, with SRV implemented as an extension to Arm SVE, assuming a typical high-end out-of-order micro-architecture baseline.
To support SRV code generation, we extended the LLVM vectorizer to identify loops with statically unknown memory dependencies. We then allowed the compiler to bypass the memory safety checks for these loops, using the same mechanism used for OpenMP hints, so that a later transformation stage can apply vectorization regardless of those dependencies. In addition, we added SRV-start and SRV-end instructions to mark the speculatively vectorized code region.
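The effect on the vectorizer is similar to what an OpenMP SIMD hint does today: the hint tells LLVM to assume the marked loop carries no loop-carried memory dependencies, so the usual runtime checks are skipped. The sketch below illustrates the analogy only; with a plain `#pragma omp simd` the programmer takes responsibility for correctness, whereas with SRV the hardware verifies the assumption and replays on a violation.

```c
/* With an OpenMP SIMD hint, the vectorizer skips its memory-dependence
 * checks and vectorizes unconditionally, with the programmer guaranteeing
 * safety. SRV reuses this hint mechanism internally, but brackets the
 * region with SRV-start/SRV-end so the hardware can detect and repair
 * violations instead. (Compiled without OpenMP, the pragma is ignored
 * and the loop runs with sequential semantics.) */
void kernel(int *a, const int *x, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[x[i]] = a[i] + 1;
}
```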
Figure 3 shows the per-loop speedup observed in our experiments, while Figure 4 shows the speedup achieved on whole programs.
Figure 3. Per-loop speedup for the SRV-vectorizable loops, together with their corresponding coverage over the whole program.
Figure 4. Whole-program speedup.
We aimed to produce a first-order estimate of the area and power overheads of SRV. To do this, we modified McPAT [3] to model the storage for the additional metadata and the power consumed by the extra CAM-based lookup operations, which contribute to the total dynamic power consumption of the processor. Figure 5 shows the estimated impact of SRV on dynamic power consumption.
Figure 5. Estimation of the change in dynamic power consumption introduced by SRV.
SRV is an effective approach for extending the coverage of auto-vectorization to loops with unknown memory dependencies. These findings are promising, and we believe that SRV is an important first step towards unleashing further performance and efficiency improvements through more advanced compiler techniques and additional architectural enhancements. This will be the focus of future work, so be sure to stay tuned for updates.
Read the full paper