In recent years, ISA vendors across the industry have been introducing increasingly expressive SIMD instruction sets incorporating features to better support auto-vectorization. For example, the Arm Scalable Vector Extension (SVE) [1] provides predication, a rich set of horizontal operations, vector partitioning, and gather and scatter memory accesses. This expands the range of programs that can be vectorized. However, limitations in compilers' memory alias analysis and the presence of infrequent memory dependencies can still unnecessarily restrict vectorization, resulting in suboptimal performance.
To address this problem, Arm Research have joined forces with the research group led by Timothy Jones at the University of Cambridge. Together, they are investigating a hardware-software codesign solution: Speculative Vectorization with Selective Replay (SRV). SRV requires only a lightweight ISA extension, and its implementation leverages the memory address disambiguation logic commonly found in modern out-of-order micro-architectures.
The results of this work have recently been presented at the ISCA 2021 conference.
Current auto-vectorization techniques generate vector code only when the code transformations are proven to be safe. In other words, the compiler or programmer must ensure that vectorization does not violate any of the data dependencies in the original sequential program. Among these, dependencies involving memory accesses have a more dynamic nature and are therefore harder to deal with. Sometimes the compiler is unable to prove that different memory accesses do not alias, mostly due to intrinsic limitations in alias analysis. In other cases, dependencies that are possible at compile time manifest rarely, or not at all, at runtime. Figure 1 illustrates a code example with some memory dependencies:
Figure 1: Code example with read-after-write (RAW) memory dependencies.
Assuming that the array x is initialized with the values {3, 0, 1, 2, 7, 4, 5, 6, 11, …}, a read-after-write (RAW) dependency occurs after every four vector elements. A compiler targeting a traditional vector architecture cannot make assumptions about the specific memory locations accessed at runtime, so it is forced to take a conservative approach and inhibit vectorization for this particular loop. This leaves performance on the table, as up to four vector elements could be safely processed in parallel.
FlexVec [2] is one of the earliest proposals to introduce a dedicated instruction (VPCONFLICTM) for checking conflicts between addresses at runtime. The result of VPCONFLICTM is a vector predicate, or mask, that can be used to partition the loop so that vectorization is applied safely. While this approach is a step forward, it comes with non-negligible runtime costs, which grow with the number of potentially conflicting memory access instructions in a vectorized code region. In addition, even in an ideal scenario the previous loop would require two iterations to process a vector’s worth of data, and vector partitioning using VPCONFLICTM may require more passes depending on the implemented vector length. In fact, to limit the overhead and the amount of metadata needed to track which lanes have been processed safely, it is simpler to generate code that processes only a single set of adjacent vector elements at a time.
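A scalar emulation makes the cost structure of this style of runtime check concrete. The sketch below is our illustration, not FlexVec's actual semantics: it computes how many leading lanes are conflict-free, so the loop advances by that many elements per pass, and every pass pays an all-pairs address comparison.

```c
/* Scalar emulation of a VPCONFLICTM-style check (an illustration only; the
 * real instruction produces a predicate over address vectors in hardware).
 * w[] holds each lane's write address and r[] its read address. The safe
 * partition ends at the first lane whose read address matches a write
 * from an older lane in the same group. */
int conflict_free_prefix(const int *w, const int *r, int vl) {
    for (int i = 1; i < vl; i++)
        for (int j = 0; j < i; j++)
            if (w[j] == r[i])
                return i;      /* lanes 0..i-1 can run together safely */
    return vl;                 /* no conflicts: the whole vector is safe */
}
```

Assuming a loop of the form a[x[i]] = a[i] + c (so w[i] = x[i] and r[i] = i), the first pass covers only lanes 0–2 for the example array above, and the quadratic comparison cost recurs on every pass and for every potentially conflicting memory instruction.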
To overcome these limitations, we started from a set of observations.
These observations drove us to develop SRV as a speculative hardware-software codesign technique and to mitigate its implementation costs by leveraging the existing memory disambiguation hardware. The next section describes SRV in more detail.
SRV relies on the compiler to identify regions of code, usually loop bodies, that are candidates for speculative vectorization, and to mark those regions with SRV-start and SRV-end instructions. A micro-architecture implementing SRV executes those regions of code (SRV-regions) while monitoring the accessed memory addresses to detect dependency violations across vector lanes. At the end of a region, if violations are detected, the code within the region is replayed with only those lanes enabled that operated on the wrong data. This is achieved using a predication mechanism: older and younger lanes that operated on the correct data need not be replayed.
Figure 2. In a baseline out-of-order micro-architecture, only a few components (grey-shaded) require modifications to support SRV.
In more detail, on execution of an SRV-start, the Program Counter (PC) of the following instruction is recorded to enable replaying the SRV-region. The execution of the SRV-region then proceeds in the same way as regular code. However, Read-After-Write (RAW) memory dependency violations across different vector lanes are recorded in the background using a predicate register called SRV-needs-replay. In addition, due to the speculative nature of this approach, data stored to memory cannot leave the core and enter the cache hierarchy until it becomes non-speculative. On completion of the SRV-region, when encountering an SRV-end instruction, if no violations are detected (indicated by an all-0 SRV-needs-replay register), then the speculative state can be committed, and execution can continue beyond the SRV-region. If, instead, some violations are detected, execution rolls back to the beginning of the SRV-region using the recorded PC, and only those lanes that have used incorrect data are replayed. Rollback may occur multiple times before all lanes are processed correctly.
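As a behavioural sketch only (the hardware keeps speculative stores in the load/store queue rather than committing lane by lane, and detects violations as addresses resolve), the detect-and-replay cycle can be modelled in software using a simple scatter-update loop, a[x[i]] = a[i] + 1, as the region body:

```c
/* Software model of one SRV region running a[x[i]] = a[i] + 1 over vl
 * lanes (vl <= 32). Lanes flagged in the SRV-needs-replay predicate are
 * rolled back and re-executed; lanes that used correct data commit and
 * are masked off on subsequent passes. Returns the number of passes. */
int srv_region(int *a, const int *x, int vl) {
    unsigned active = (1u << vl) - 1;   /* lanes still to be executed */
    int passes = 0;
    while (active) {
        unsigned needs_replay = 0;      /* the SRV-needs-replay predicate */
        for (int i = 0; i < vl; i++) {
            if (!(active & (1u << i)))
                continue;
            /* Horizontal RAW check: an older still-active lane writes a[i]. */
            for (int j = 0; j < i; j++)
                if ((active & (1u << j)) && x[j] == i)
                    needs_replay |= 1u << i;
        }
        for (int i = 0; i < vl; i++)    /* commit the lanes that were safe */
            if ((active & (1u << i)) && !(needs_replay & (1u << i)))
                a[x[i]] = a[i] + 1;
        active = needs_replay;          /* roll back and replay the rest */
        passes++;
    }
    return passes;
}
```

For the example index array {3, 0, 1, 2, 7, 4, 5, 6}, the model flags lanes 3 and 7 on the first pass, commits the other six lanes, and finishes in a second pass with only those two lanes enabled, reproducing the sequential result.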
The detection of memory dependency violations while executing an SRV-region is implemented by extending the canonical memory disambiguation logic. This already performs checks across different instructions (here referred to as vertical disambiguation), but, to support SRV, it is extended with checks performed across different vector lanes (horizontal disambiguation). This extension can be achieved relatively cheaply by adding extra lane-specific metadata to the CAM-based structures used for vertical disambiguation.
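To illustrate the idea (a simplified sketch, not the paper's exact structure), tagging each disambiguation entry with a lane field lets a single comparison establish logical age: inside an SRV-region, program order follows the original loop, so the lane (iteration) is compared first and the instruction sequence number second.

```c
/* Sketch of a lane-tagged disambiguation comparison. seq numbers
 * instructions in program order (smaller = older); lane numbers elements
 * within a vector instruction. Inside an SRV-region, logical order follows
 * the original scalar loop: iteration (lane) first, then instruction
 * order within the iteration. */
struct mem_op { int addr; int seq; int lane; };

int store_is_logically_older(struct mem_op st, struct mem_op ld) {
    if (st.addr != ld.addr)
        return 0;                   /* different address: no dependency */
    if (st.lane != ld.lane)
        return st.lane < ld.lane;   /* horizontal: across vector lanes */
    return st.seq < ld.seq;         /* vertical: across instructions */
}
```

A load that has already executed and later matches a logically older store in this comparison is exactly the cross-lane RAW case that sets the corresponding bit of SRV-needs-replay.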
While RAW violations trigger re-execution, Write-After-Read (WAR) and Write-After-Write (WAW) violations are handled immediately. For WAR, store-to-load forwarding is suppressed whenever the store belongs to a logically younger lane than the load, even if the store comes from an older instruction. For WAW, an augmented store queue records the lane index associated with each store so that the most recent version of the data (in program order) is eventually written back.
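For instance (a simplified software sketch, assuming a small lane-tagged store queue), WAW resolution needs only the lane index to pick the logically last writer for each address at write-back:

```c
/* Sketch of WAW resolution with a lane-tagged store queue: when the queue
 * drains, each address receives the value from the highest-numbered, i.e.
 * logically youngest, lane that stored to it, matching the order of the
 * original scalar loop. */
struct store_entry { int addr; int lane; int data; };

void drain_stores(const struct store_entry *sq, int n, int *mem) {
    for (int e = 0; e < n; e++) {
        int youngest = e;
        for (int k = 0; k < n; k++)
            if (sq[k].addr == sq[e].addr && sq[k].lane > sq[youngest].lane)
                youngest = k;
        if (youngest == e)          /* only the last writer reaches memory */
            mem[sq[e].addr] = sq[e].data;
    }
}
```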
Various implementation details are introduced to correctly handle the different vector element sizes and the different types of memory access found in vector code, such as contiguous, indexed, and load-and-broadcast accesses; their full descriptions can be found in the paper.
Using this selective-replay vectorization architecture, which augments a standard out-of-order pipeline, we can limit the overheads of detecting and correcting dependency violations, allowing compilers to vectorize freely, even in the presence of unknown dependencies.
For the evaluation of the proposed technique, we selected a mix of general-purpose, scientific, and HPC workloads from the following benchmark suites: SPEC CPU2006, NAS Parallel Benchmarks (NPB), Livermore Loops, SSCA2, and HPC Challenge.
A system supporting the SRV architecture was prototyped using the gem5 simulator, with SRV implemented as an extension to Arm SVE, assuming a typical high-end out-of-order micro-architecture baseline.
To support SRV code generation, we extended the LLVM vectorizer to identify loops with statically unknown memory dependencies. We then allowed the compiler to bypass the memory safety checks for these loops, using the same mechanism used for OpenMP hints, so that a later transformation stage can apply vectorization regardless of those dependencies. In addition, we added SRV-start and SRV-end instructions to mark the speculatively vectorized code region.
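The effect on the vectorizer is similar to what an OpenMP SIMD hint does today: the hint tells LLVM to assume the marked loop carries no loop-carried memory dependencies, so the usual runtime checks are skipped. The sketch below illustrates the analogy only; with a plain `#pragma omp simd` the programmer takes responsibility for correctness, whereas with SRV the hardware verifies the assumption and replays on a violation.

```c
/* With an OpenMP SIMD hint, the vectorizer skips its memory-dependence
 * checks and vectorizes unconditionally, with the programmer guaranteeing
 * safety. SRV reuses this hint mechanism internally, but brackets the
 * region with SRV-start/SRV-end so the hardware can detect and repair
 * violations instead. (Compiled without OpenMP, the pragma is ignored
 * and the loop runs with sequential semantics.) */
void kernel(int *a, const int *x, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[x[i]] = a[i] + 1;
}
```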
Figure 3 shows the per-loop speedup observed in our experiments, while Figure 4 shows the speedup achieved on whole programs.
Figure 3. Per-loop speedup for the SRV-vectorizable loops, together with their corresponding coverage over the whole program.
Figure 4. Whole-program speedup.
We aimed to produce a first-order estimate of the area and power overheads of SRV. To do this, we modified McPAT [3] to model the storage for the additional metadata and the power consumed by the extra CAM-based lookup operations, which contribute to the total dynamic power consumption of the processor. Figure 5 shows the estimated impact of SRV on dynamic power consumption.
Figure 5. Estimation of the change in dynamic power consumption introduced by SRV.
SRV is an effective approach for extending the coverage of auto-vectorization to loops with unknown memory dependencies. These findings are promising, and we believe that SRV is an important first step towards unleashing further performance and efficiency improvements through more advanced compiler techniques and additional architectural enhancements. This will be the focus of future work, so be sure to stay tuned for updates.
Read the full paper