Published in IEEE Micro, Vol. 37, Issue 2. Authors: Nigel Stephens, Stuart Biles, Matthias Boettcher, Jacob Eapen, Mbou Eyole, Giacomo Gabrielli, Matt Horsnell, Grigorios Magklis, Alejandro Martinez, Nathanael Premillieu, Alastair Reid, Alejandro Rico, Paul Walker
In this paper, Nigel Stephens (Lead ISA Architect and Fellow) and his colleagues from groups across Arm introduce the Arm Scalable Vector Extension (SVE). SVE is the culmination of a multi-year project run between Arm Research and Arm's Architecture and Technology group together with many external collaborators; it is the latest in a long and successful line of single-instruction, multiple data (SIMD) features supported by Arm compatible processors.
The Scalable Vector Extension combines a number of architectural ingredients. A scalable vector length allows the architecture to be exploited at a number of different processor performance and cost points whilst reusing software investment. SVE also contains a number of features that remove barriers to vectorization, increasing the number of applications that can benefit from the extension.
Scalar instructions operate on individual data items. This means that to produce one result it is necessary to execute one instruction. To increase processing throughput, ISAs have adopted SIMD instructions. SIMD instructions operate on multiple data items, so a single instruction may generate more than one result. The use of SIMD instructions allows a processor to exploit data-level parallelism that is present in some workloads; by processing multiple data items simultaneously, SIMD instructions can increase compute performance.
The addition of SIMD support in a processor comes at a cost in terms of silicon area needed to implement the functionality and the energy needed to power it. This cost is a function of the 'vector width' - which expresses the number of data items that can be processed by one instruction. Classical SIMD instruction sets bake this vector width into the definition of the instruction set as a constant value; unfortunately the width that best suits one application domain may not suit others.
Processors compatible with the Arm instruction set find themselves deployed in the broadest range of systems of any processor architecture - ranging from hearing aids to supercomputers. This is made possible in part due to the flexibility of the Arm instruction set and the wide range of compatible processor implementations that target different performance, power and area budgets.
In contrast to classical SIMD architectures, SVE does not dictate a single vector width across all implementations. With SVE, the processor designer may choose any vector width from 128 to 2,048 bits, in 128-bit increments, taking into account the performance and cost constraints of the application domain for which the processor is intended.
Programs for classical SIMD architectures also bake in assumptions about the vector width for the architecture that they are compiled to. As a result, programs compiled for a shorter width cannot benefit from a later instruction set extension that specifies a wider vector width.
The Scalable Vector Extensions introduce a number of features that allow a compiler to generate code that is agnostic to the vector width of the processor implementation. An example is shown in the graph below illustrating the execution of the same program on 128-bit and 256-bit implementations. The 128-bit implementation runs two iterations of the loop to process all the data; in contrast the 256-bit implementation only needs a single iteration of the loop to process the data - with no modifications needed to the program code.
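The vector-length-agnostic pattern can be sketched in scalar C. In this illustrative sketch, `vl` stands in for the hardware vector length (in elements), which SVE code discovers at run time rather than fixing at compile time; `vla_add` is a hypothetical name, and the inner scalar loop models what a single predicated vector instruction would do in one step.

```c
#include <stddef.h>

/* Sketch of a vector-length-agnostic loop in scalar C.
   `vl` models the run-time hardware vector length; the same
   function gives the same results whatever value it takes,
   just as one SVE binary runs on any implementation width. */
void vla_add(const int *a, const int *b, int *out, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        /* One "vector iteration": process up to vl elements.
           The bound check masks off elements past n, mirroring
           the predicate SVE generates for the loop tail. */
        for (size_t j = i; j < i + vl && j < n; ++j)
            out[j] = a[j] + b[j];
    }
}
```

With `n = 5`, a `vl` of 2 needs three iterations of the outer loop while a `vl` of 8 needs one, yet both produce identical results; a wider implementation simply finishes sooner.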
By using vector length agnostic code, a program may automatically benefit from larger vector widths provided in higher end implementations.
This is illustrated in the graph below, which plots the SVE speedup compared to Arm AdvSIMD code for three different SVE implementation widths. The same SVE binary was used for evaluating the speedup at the three different vector widths. Not all programs considered could be sped up compared to AdvSIMD, but those that could exhibited increasing speedup with larger vector width. The results presented in the paper were achieved with an early development auto-vectorizing compiler; code generation will improve over time, yielding better vectorization quality.
Not all programs exhibit the right properties to be good targets for vectorization. Good candidates are often found in data intensive algorithms that have regular data access and processing patterns - for example those algorithms typically found in machine learning, media and image processing applications.
SVE introduces several new features that remove obstacles to vectorization and make the architecture applicable to a wider set of programs.
Classical SIMD architectures perform the same operation on every data item in the architected vector width; this provides the best performance in terms of data operations per instruction executed, but it can be hard to find problems that fit this requirement exactly.
Often the number of data items to be processed is not a multiple of the vector width, or the algorithm needs to apply operations to only some of the data items conditionally. The solution developed for this is predicated execution, where a vector of boolean values (predicates, one per data item) enables or disables the operation applied to each data item. Predication is central to SVE's design, with a predicate register file available as an operand to many SVE instructions.
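The effect of a predicate can be sketched in scalar C: each element of `pred` plays the role of one lane of an SVE predicate register, and only active lanes are updated. The function name `pred_add` is hypothetical.

```c
#include <stddef.h>

/* Sketch of predicated execution: pred[i] selects whether the
   operation updates element i, as one lane of an SVE predicate
   register would. Inactive lanes are left unchanged. */
void pred_add(int *dst, const int *src, const unsigned char *pred, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (pred[i])
            dst[i] += src[i];
}
```

On real hardware all lanes are processed in a single instruction; the predicate simply suppresses the result write-back for inactive lanes, which is also how SVE handles the partial final iteration of a loop.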
The exit condition of some uncounted loops depends on the value of the data being processed. A good example is string processing, which finishes when the value 0 is encountered. Executing such processing with vectors may load data items from memory beyond the value 0 and end up accessing addresses outside of allocated memory, causing a fault.
SVE introduces fault-tolerant speculative vectorization, using first-faulting loads to mask faults on any data item other than the first, i.e., those speculatively loaded from memory. Masked faults are detected and recorded in a predicate register, which is then used to process only the successfully loaded values before the faulting data item.
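The strlen-style loop that motivates this feature can be sketched in scalar C. The sketch shows only the data-dependent exit condition: `vl` bytes are examined per "vector iteration" and the loop stops at the first 0. It deliberately never reads past the terminator, which is exactly the guarantee a real vector load cannot give (it fetches the whole vector at once) and the reason SVE's first-faulting loads and their predicate result are needed. `vec_strlen_sketch` is a hypothetical name.

```c
#include <stddef.h>

/* Sketch of a vectorized strlen-style loop with a data-dependent
   exit. Each outer step models loading one vector of vl bytes;
   the lane scan models the predicate that marks lanes before the
   terminating 0. A hardware vector load would fetch all vl bytes
   unconditionally, possibly faulting past the end of the buffer,
   which SVE's first-faulting loads suppress and report instead. */
size_t vec_strlen_sketch(const char *s, size_t vl) {
    size_t len = 0;
    for (;;) {
        for (size_t lane = 0; lane < vl; ++lane) {
            if (s[len + lane] == 0)
                return len + lane;  /* predicate ends here */
        }
        len += vl;  /* whole vector was valid; continue */
    }
}
```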
Early SIMD architectures moved data to and from memory in contiguous blocks; this makes most efficient use of the underlying memory system. Not all programs have the data arranged in a straightforward manner, with the result that the program cannot be vectorized for the target SIMD instruction set.
SVE introduces gather-scatter load/store operations, where a vector of addresses is specified, permitting non-contiguous data access.
The use of gather-scatter memory operations may not yield the same efficiency as contiguous memory operations. However, it does permit more programs to vectorize for SVE, leaving data re-arrangement as an optimization option.
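The access pattern these operations provide can be sketched in scalar C: a vector of indices selects non-contiguous elements relative to a base address. On SVE hardware each function body would be a single gather or scatter instruction; the names `gather` and `scatter` here are illustrative, not Arm mnemonics.

```c
#include <stddef.h>

/* Sketch of a gather load: idx[] selects non-contiguous elements
   of base[], as an SVE gather would with a vector of offsets. */
void gather(int *dst, const int *base, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = base[idx[i]];
}

/* Sketch of the matching scatter store: each src element is
   written to a non-contiguous location in base[]. */
void scatter(int *base, const size_t *idx, const int *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        base[idx[i]] = src[i];
}
```

This is the pattern behind loops such as `out[i] = table[index[i]]`, which classical contiguous-only SIMD instruction sets cannot vectorize directly.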
SVE provides new capabilities to the Arm Architecture in terms of scalable and accessible vector processing; it is a key enabling technology for Arm compatible systems in the HPC market.
Read the whole paper