Technology Update: Scalable Vector Extension (SVE) for Armv8-A

August 22, 2016


Today at Hot Chips in Cupertino, I had the opportunity to present the latest update to our Armv8-A architecture, known as the Scalable Vector Extension or SVE. Before going into the technical details, here are the key points about Armv8-A SVE:

Arm is significantly extending the vector processing capabilities associated with AArch64 (64-bit) execution in the Arm architecture, now and into the future, enabling implementation choices for vector lengths that scale from 128 to 2048 bits.
High Performance Scientific Compute provides an excellent focus for the introduction of this technology and its associated ecosystem development.
SVE features will enable advanced vectorizing compilers to extract more fine-grain parallelism from existing code and so reduce software deployment effort.

I’ll first provide some historical context. Armv7 Advanced SIMD (aka the Arm NEON instructions) is ~12 years old, a technology originally intended to accelerate media processing tasks on the main processor. It operated on well-conditioned data in memory with fixed-point and single-precision floating-point elements in sixteen 128-bit vector registers. With the move to AArch64, NEON gained full IEEE double-precision floating-point and 64-bit integer operations, and its register file grew to thirty-two 128-bit vector registers. These evolutionary changes made NEON a better compiler target for general-purpose compute. SVE is a complementary extension that does not replace NEON, and was developed specifically for vectorization of HPC scientific workloads.

Immense amounts of data are being collected today in areas such as meteorology, geology, astronomy, quantum physics, fluid dynamics, and pharmaceutical research. Exascale computing (the execution of a billion billion floating-point operations per second, or one exaFLOPS) is the target that many HPC systems aspire to over the next 5-10 years. In addition, advances in data analytics and areas such as computer vision and machine learning are already increasing the demand for parallel program execution, and that demand will continue to grow.

Over the years, considerable research has gone into determining how best to extract more data-level parallelism from general-purpose programming languages such as C, C++ and Fortran. This has resulted in the inclusion of vectorization features such as gather load and scatter store, per-lane predication, and of course longer vectors.

A key choice to make is the most appropriate vector length, where many factors may influence the decision:

Current implementation technology and associated power, performance and area tradeoffs.
The specific application program characteristics.
The target market, which today is HPC. In common with general trends in computer architecture evolution, a growing need for longer vectors is expected in other markets in the future.

Rather than mandating a single vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register. SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length. Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future. This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.
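To make the VLA idea concrete, here is a minimal sketch in Python that emulates a vector-length agnostic loop. It is not SVE code: the helper names `whilelt` and `daxpy_vla` are hypothetical, the per-lane predicate is modeled as a list of booleans, and the check that the vector length is a multiple of 128 bits is an assumption of this sketch. The point it illustrates is that the same loop body gives the same result at any vector length, with the predicate masking off the lanes that fall past the end of the data so no separate scalar tail loop is needed.

```python
from typing import List


def whilelt(i: int, n: int, lanes: int) -> List[bool]:
    """Emulated loop predicate: lane k is active while i + k < n."""
    return [i + k < n for k in range(lanes)]


def daxpy_vla(a: float, x: List[float], y: List[float], vl_bits: int) -> List[float]:
    """Return y + a * x, processed one emulated vector at a time.

    vl_bits is the emulated vector length. The loop never hard-codes it,
    which is the essence of vector-length agnostic code.
    """
    # Assumption of this sketch: lengths are multiples of 128 bits in [128, 2048].
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    lanes = vl_bits // 64  # number of 64-bit elements per emulated vector
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)  # inactive lanes cover the loop tail
        for k, active in enumerate(pred):
            if active:  # per-lane predication: inactive lanes do nothing
                out[i + k] += a * x[i + k]
        i += lanes  # advance by however many lanes this "hardware" has
    return out


if __name__ == "__main__":
    x = [float(v) for v in range(7)]
    y = [1.0] * 7
    for vl in (128, 256, 512, 2048):
        print(vl, daxpy_vla(2.0, x, y, vl))
```

Running the script prints the same result vector for every emulated length; only the number of loop iterations changes.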

Scientific workloads, mentioned earlier, have traditionally been written to exploit as much data-level parallelism as possible, with careful use of OpenMP pragmas and other source code annotations. It’s therefore relatively straightforward for a compiler to vectorize such code and make good use of a wider vector unit. Supercomputers are also built with the wide, high-bandwidth memory systems necessary to feed a longer vector unit.

However, while HPC is a natural fit for SVE’s longer vectors, SVE also offers an opportunity to improve vectorizing compilers, which will be of general benefit over the longer term as other systems scale to support increased data-level parallelism.

It is worth noting at this point that Amdahl’s law tells us the theoretical limit of a task’s speedup is governed by the amount of unparallelizable code. If you succeed in vectorizing 10% of your execution and make that code run 4 times faster (e.g. a 256-bit vector allows four 64-bit operations in parallel), then you've reduced 1000 cycles down to 925 cycles (900 unvectorized cycles plus 100/4 = 25 vectorized cycles), providing a limited speedup for the power and area cost of the extra gates. Even if you could vectorize 50% of your execution infinitely (unlikely!) you've still only doubled the overall performance. You need to be able to vectorize much more of your program to realize the potential gains from longer vectors.
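The arithmetic in the paragraph above can be checked with a short Python sketch of Amdahl's law. The function name `amdahl_speedup` and its parameter names are hypothetical, chosen for this illustration; the formula itself is the standard one, speedup = 1 / ((1 - f) + f / g), where f is the fraction of execution that is vectorized and g is how much faster that fraction runs.

```python
def amdahl_speedup(vector_fraction: float, vector_gain: float) -> float:
    """Overall speedup when vector_fraction of the run time gets vector_gain faster."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_gain <= 0.0:
        raise ValueError("vector_gain must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_gain)


if __name__ == "__main__":
    baseline_cycles = 1000
    # 10% of execution vectorized and running 4x faster.
    print(baseline_cycles / amdahl_speedup(0.10, 4.0))
    # 50% of execution vectorized with an unbounded gain.
    print(amdahl_speedup(0.50, float("inf")))
    # How the overall speedup grows with the vectorized fraction at a fixed 4x gain.
    for fraction in (0.10, 0.50, 0.90, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 4.0), 3))
```

The last loop shows why the vectorized fraction matters more than the vector width: at a fixed gain, the overall speedup only approaches that gain as the fraction approaches 100%.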

So SVE also introduces novel features that begin to tackle some of the barriers to compiler vectorization. The general philosophy of SVE is to make it easier for a compiler to opportunistically vectorize code where it would not normally be possible or cost effective to do so.

What are the new features and the benefits of SVE compared to NEON?

Feature	Benefit
Scalable vector length (VL)	Increased parallelism while allowing implementation choice of VL
VL agnostic (VLA) programming	Supports a programming paradigm of write-once, run-anywhere scalable vector code
Gather-load & Scatter-store	Enables vectorization of complex data structures with non-linear access patterns
Per-lane predication	Enables vectorization of complex, nested control code containing side effects and avoidance of loop heads and tails (particularly for VLA)
Predicate-driven loop control and management	Reduces vectorization overhead relative to scalar code
Vector partitioning and SW managed speculation	Permits vectorization of uncounted loops with data-dependent exits
Extended integer and floating-point horizontal reductions	Allows vectorization of more types of reducible loop-carried dependencies
Scalarized intra-vector sub-loops	Supports vectorization of loops containing complex loop-carried dependencies
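
As a rough illustration of the "vector partitioning" row in the table, the following Python sketch emulates vectorizing an uncounted loop with a data-dependent exit, here a search for a terminator value in the style of strlen. It is an emulation under stated assumptions, not SVE code: `break_before` and `find_terminator` are hypothetical names, predicates are modeled as lists of booleans, and a Python slice stands in for the speculative load that real hardware would have to perform safely.

```python
from typing import List, Sequence


def break_before(cond: Sequence[bool]) -> List[bool]:
    """Partition a vector: keep only the lanes strictly before the first true lane."""
    pred = []
    active = True
    for c in cond:
        if c:
            active = False  # this lane and all later lanes become inactive
        pred.append(active)
    return pred


def find_terminator(data: Sequence[int], lanes: int, terminator: int = 0) -> int:
    """Return the index of the first terminator, or len(data) if there is none.

    The trip count is not known in advance; each step examines up to
    `lanes` elements and uses a partition to locate the exit point.
    """
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    i = 0
    n = len(data)
    while i < n:
        chunk = data[i:i + lanes]  # stand-in for a speculative vector load
        is_term = [value == terminator for value in chunk]
        before = break_before(is_term)
        i += sum(before)  # advance by the number of lanes before the exit
        if any(is_term):
            return i  # data-dependent exit found inside this vector
    return n


if __name__ == "__main__":
    text = list(b"hello\x00world")
    for lanes in (1, 2, 4, 16, 32):
        print(lanes, find_terminator(text, lanes))
```

The result does not depend on the number of lanes, which is what allows such a loop to be written once and run at any vector length.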

SVE is targeted at the A64 instruction set only, as a performance enhancement associated with 64-bit computing (known as AArch64 execution in the Arm architecture). A64 is a fixed-length instruction set, where all instructions are encoded in 32 bits. Currently 75% of the A64 encoding space is already allocated, making it a precious resource. SVE occupies just a quarter of the remaining 25%, in other words one sixteenth of the A64 encoding space, as follows:

The variable length aspect of SVE is managed through predication, meaning that it does not require any encoding space. Care was taken with respect to predicated execution to constrain that aspect of the encoding space. Load and store instructions are assigned half of the allocated SVE instruction space, limited by careful consideration of addressing modes. Nearly a quarter of this space remains unallocated and available for future expansion.
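
The encoding-space fractions quoted above can be worked through with exact arithmetic. This is a small sketch that treats the quoted figures (75% allocated, SVE taking a quarter of the remainder, loads and stores taking half of the SVE space) as exact values, which is an assumption; the function name `sve_encoding_budget` and the dictionary keys are hypothetical.

```python
from fractions import Fraction

A64_ENCODINGS = 2 ** 32  # every A64 instruction is a fixed 32-bit word


def sve_encoding_budget() -> dict:
    """Derive SVE's share of the A64 encoding space from the quoted figures."""
    allocated_before_sve = Fraction(3, 4)       # "75% ... already allocated"
    remaining = 1 - allocated_before_sve        # the free 25%
    sve_share = remaining / 4                   # "a quarter of the remaining 25%"
    load_store_share = sve_share / 2            # "half of the allocated SVE space"
    return {
        "sve_share": sve_share,
        "load_store_share": load_store_share,
        "sve_encodings": int(sve_share * A64_ENCODINGS),
        "remaining_after_sve": remaining - sve_share,
    }


if __name__ == "__main__":
    for name, value in sve_encoding_budget().items():
        print(name, value)
```

Under these assumptions SVE occupies one sixteenth of the encoding space, matching the figure in the text, and leaves three sixteenths of the A64 space free for other uses.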

In summary, SVE opens a new chapter for the Arm architecture in terms of the scale and opportunity for increasing levels of vector processing on Arm processor cores. It is early days for SVE tools and software, and it will take time for SVE compilers and the rest of the SVE software ecosystem to mature. HPC is the current focus and catalyst for this compiler work, and creates development momentum in areas such as Linux distributions and optimized libraries for SVE, as well as in Arm and third party tools and software.

We are already engaging with key members of the Arm partnership, and will now broaden that engagement across the open-source community and wider Arm ecosystem to support development of SVE and the HPC market, enabling a path to efficient Exascale computing.

Stay tuned for more information

Following on from the announcement and the details provided, initial engagement with the open-source community will start with the upstreaming and review of tools support and associated standards.

A Beta release of the SVE supplement to the Armv8-A Architecture Reference Manual is now available to download.

Annotated SVE VLA programming examples can be found here:

Download - A Sneak Peek into SVE and VLA Programming

ARMv8-A SVE technology Hot Chips v12.pdf

Top Comments

Yichao Yu over 7 years ago

I know this is likely not the right place to ask questions about SVE but I can't really find this info or a better place to ask about this anywhere.
My question is, how does the first-faulting class of instructions interact with paging? Does it always "fault" at a page boundary? Does it create different kinds of faults (i.e. a special fault needs to be set up for paging by the kernel)? Or does it leak paging info to userspace (i.e. a user process can use such an instruction to tell if a page is swapped out, without letting the kernel know, by doing a load on the page boundary)?
Edit: according to the PDF linked above, it seems that it's the last one. Could this possibly be a security problem?

