Published in IEEE Micro, Vol. 37, Issue 2. Authors: Nigel Stephens, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, Grigorios Magklis, Alejandro Martinez, Nathanael Premillieu, Alastair Reid, Alejandro Rico, Paul Walker
In this paper, Nigel Stephens (Lead ISA Architect and Fellow) and his colleagues from groups across Arm introduce the Arm Scalable Vector Extension (SVE). SVE is the culmination of a multi-year project run between Arm Research and Arm's Architecture and Technology group together with many external collaborators; it is the latest in a long and successful line of single-instruction, multiple data (SIMD) features supported by Arm compatible processors.
The Scalable Vector Extension combines a number of architectural ingredients. A scalable vector length allows the architecture to be exploited at a number of different processor performance and cost points whilst reusing software investment. SVE also contains a number of features that remove barriers to vectorization, increasing the number of applications that can benefit from the extension.
Scalar instructions operate on individual data items. This means that to produce one result it is necessary to execute one instruction. To increase processing throughput, ISAs have adopted SIMD instructions. SIMD instructions operate on multiple data items, so a single instruction may generate more than one result. The use of SIMD instructions allows a processor to exploit data-level parallelism that is present in some workloads; by processing multiple data items simultaneously, SIMD instructions can increase compute performance.
The addition of SIMD support in a processor comes at a cost in terms of silicon area needed to implement the functionality and the energy needed to power it. This cost is a function of the 'vector width' - which expresses the number of data items that can be processed by one instruction. Classical SIMD instruction sets bake this vector width into the definition of the instruction set as a constant value; unfortunately the width that best suits one application domain may not suit others.
Processors compatible with the Arm instruction set find themselves deployed in the broadest range of systems of any processor architecture - ranging from hearing aids to supercomputers. This is made possible in part due to the flexibility of the Arm instruction set and the wide range of compatible processor implementations that target different performance, power and area budgets.
In contrast to classical SIMD architectures, SVE does not dictate a single vector width across all implementations. With SVE, the processor designer may choose any vector width from 128 to 2,048 bits, in 128-bit increments, taking into account the performance and cost constraints of the application domain for which the processor is intended.
Programs for classical SIMD architectures also bake in assumptions about the vector width for the architecture that they are compiled to. As a result, programs compiled for a shorter width cannot benefit from a later instruction set extension that specifies a wider vector width.
The Scalable Vector Extensions introduce a number of features that allow a compiler to generate code that is agnostic to the vector width of the processor implementation. An example is shown in the graph below illustrating the execution of the same program on 128-bit and 256-bit implementations. The 128-bit implementation runs two iterations of the loop to process all the data; in contrast the 256-bit implementation only needs a single iteration of the loop to process the data - with no modifications needed to the program code.
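The vector-length-agnostic pattern can be sketched in scalar C. In this illustrative sketch, `vl` stands in for the hardware vector length (in elements), which SVE code discovers at run time rather than fixing at compile time; `vla_add` is a hypothetical name, and the inner scalar loop models what a single predicated vector instruction would do in one step.

```c
#include <stddef.h>

/* Sketch of a vector-length-agnostic loop in scalar C.
   `vl` models the run-time hardware vector length; the same
   function gives the same results whatever value it takes,
   just as one SVE binary runs on any implementation width. */
void vla_add(const int *a, const int *b, int *out, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        /* One "vector iteration": process up to vl elements.
           The bound check masks off elements past n, mirroring
           the predicate SVE generates for the loop tail. */
        for (size_t j = i; j < i + vl && j < n; ++j)
            out[j] = a[j] + b[j];
    }
}
```

With `n = 5`, a `vl` of 2 needs three iterations of the outer loop while a `vl` of 8 needs one, yet both produce identical results; a wider implementation simply finishes sooner.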
By using vector length agnostic code, a program may automatically benefit from larger vector widths provided in higher end implementations.
This is illustrated in the graph below, which plots the SVE speedup compared to Arm AdvSIMD code for three different SVE implementation widths. The same SVE binary was used for evaluating the speedup at the three different vector widths. Not all programs considered could be sped up compared to AdvSIMD, but those that could exhibited increasing speedup with larger vector width. The results presented in the paper were achieved with an early development auto-vectorizing compiler; code generation will improve over time, yielding better vectorization quality.
Not all programs exhibit the right properties to be good targets for vectorization. Good candidates are often found in data intensive algorithms that have regular data access and processing patterns - for example those algorithms typically found in machine learning, media and image processing applications.
SVE introduces several new features that remove obstacles to vectorization and make the architecture applicable to a wider set of programs.
Classical SIMD architectures perform the same operation on every data item in the architected vector width; this provides the best performance in terms of data operations per instruction executed, but it can be hard to find problems that fit this requirement exactly.
Often the number of data items to be processed is not a multiple of the vector width, or the algorithm needs to apply operations to only some of the data items conditionally. The solution developed for this is predicated execution, where a vector of boolean values (predicates, one per data item) enables or disables the operation applied to each data item. Predication is central to SVE's design, with a predicate register file available as an operand to many SVE instructions.
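The effect of a predicate can be sketched in scalar C: each element of `pred` plays the role of one lane of an SVE predicate register, and only active lanes are updated. The function name `pred_add` is hypothetical.

```c
#include <stddef.h>

/* Sketch of predicated execution: pred[i] selects whether the
   operation updates element i, as one lane of an SVE predicate
   register would. Inactive lanes are left unchanged. */
void pred_add(int *dst, const int *src, const unsigned char *pred, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (pred[i])
            dst[i] += src[i];
}
```

On real hardware all lanes are processed in a single instruction; the predicate simply suppresses the result write-back for inactive lanes, which is also how SVE handles the partial final iteration of a loop.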
The exit condition of some uncounted loops depends on the value of the data being processed. A good example is string processing, which finishes when the value 0 is encountered. Executing such processing with vectors may load data items from memory beyond the value 0 and end up accessing addresses outside of allocated memory, causing a fault.
SVE introduces fault-tolerant speculative vectorization, using first-faulting loads to mask faults on any data item other than the first, i.e., those speculatively loaded from memory. Masked faults are detected and recorded in a predicate register, which is then used to process only the successfully loaded values before the faulting data item.
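The strlen-style loop that motivates this feature can be sketched in scalar C. The sketch shows only the data-dependent exit condition: `vl` bytes are examined per "vector iteration" and the loop stops at the first 0. It deliberately never reads past the terminator, which is exactly the guarantee a real vector load cannot give (it fetches the whole vector at once) and the reason SVE's first-faulting loads and their predicate result are needed. `vec_strlen_sketch` is a hypothetical name.

```c
#include <stddef.h>

/* Sketch of a vectorized strlen-style loop with a data-dependent
   exit. Each outer step models loading one vector of vl bytes;
   the lane scan models the predicate that marks lanes before the
   terminating 0. A hardware vector load would fetch all vl bytes
   unconditionally, possibly faulting past the end of the buffer,
   which SVE's first-faulting loads suppress and report instead. */
size_t vec_strlen_sketch(const char *s, size_t vl) {
    size_t len = 0;
    for (;;) {
        for (size_t lane = 0; lane < vl; ++lane) {
            if (s[len + lane] == 0)
                return len + lane;  /* predicate ends here */
        }
        len += vl;  /* whole vector was valid; continue */
    }
}
```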
Early SIMD architectures moved data to and from memory in contiguous blocks; this makes most efficient use of the underlying memory system. Not all programs have the data arranged in a straightforward manner, with the result that the program cannot be vectorized for the target SIMD instruction set.
SVE introduces gather-scatter load/store operations, where a vector of addresses is specified, permitting non-contiguous data access.
The use of gather-scatter memory operations may not yield the same efficiency as contiguous memory operations. However, it does permit more programs to vectorize for SVE, leaving data re-arrangement as an optimization option.
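The access pattern these operations provide can be sketched in scalar C: a vector of indices selects non-contiguous elements relative to a base address. On SVE hardware each function body would be a single gather or scatter instruction; the names `gather` and `scatter` here are illustrative, not Arm mnemonics.

```c
#include <stddef.h>

/* Sketch of a gather load: idx[] selects non-contiguous elements
   of base[], as an SVE gather would with a vector of offsets. */
void gather(int *dst, const int *base, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = base[idx[i]];
}

/* Sketch of the matching scatter store: each src element is
   written to a non-contiguous location in base[]. */
void scatter(int *base, const size_t *idx, const int *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        base[idx[i]] = src[i];
}
```

This is the pattern behind loops such as `out[i] = table[index[i]]`, which classical contiguous-only SIMD instruction sets cannot vectorize directly.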
SVE provides new capabilities to the Arm Architecture in terms of scalable and accessible vector processing; it is a key enabling technology for Arm compatible systems in the HPC market.
Read the whole paper