In March 2021, Arm introduced the next-generation Armv9 architecture with enhanced security and artificial intelligence (AI) capabilities. This was followed in May by the launch of the new Arm Total Compute solutions, which include the first ever Armv9 CPUs. The biggest new feature that developers will notice immediately is the enhanced vector processing, which enables increased machine learning (ML) and digital signal processing (DSP) capabilities across a wider range of applications. In this blog post, we share the advantages and benefits of version two of the Scalable Vector Extension (SVE2).
Figure 1. Extending Vector Processing for ML and DSP in Armv9 (from Arm Vision Day)
Applications that process large amounts of data can be sped up by taking advantage of parallel execution instructions, known as SIMD (Single Instruction Multiple Data) instructions. SVE was first introduced as an optional extension in the Armv8.2 architecture, following the existing Neon technology. SVE2 was introduced for Armv9 CPUs as a feature extension of SVE. The main difference between SVE2 and SVE is the functional coverage of the instruction set. SVE was designed for High Performance Computing (HPC) and ML applications. SVE2 extends the SVE instruction set to enable data-processing domains beyond HPC and ML, such as computer vision, multimedia, games, LTE baseband processing, and general-purpose software. We see SVE and SVE2 as an evolution of our SIMD architecture, bringing many useful features beyond those already provided by Neon.
As the "Scalable" in its name suggests, the SVE2 design concept enables developers to write and build software once, then run the same binaries on different AArch64 hardware regardless of the implemented SVE2 vector length. Since laptop and mobile devices may implement different vector lengths, SVE2 can reduce the cost of cross-platform support through shared code. Removing the requirement to rebuild binaries allows software to be ported more easily. This scalability and portability also means that developers do not need to know or care about the vector length of their target devices. This benefit of SVE2 grows when software is shared across platforms or used over an extended period of time.
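The vector-length-agnostic idea can be sketched in plain C. This is not SVE2 code, just a conceptual model: the parameter `vl` stands in for the hardware vector length that real SVE2 code queries at run time (for example with `svcntw()`), and the inner bound plays the role of the loop predicate. The same source handles any `vl` without recompilation.

```c
#include <stddef.h>

// Conceptual sketch of a vector-length-agnostic loop (plain C, not SVE2).
// `vl` models the hardware vector length discovered at run time; the same
// code works unchanged whether vl is 4, 8, or any other width.
void add_arrays(const float *x, float *y, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        // The "predicate": only lanes i..min(i+vl, n)-1 are active,
        // which is what a whilelt-generated predicate expresses in SVE2.
        size_t limit = (i + vl < n) ? i + vl : n;
        for (size_t j = i; j < limit; j++) {
            y[j] += x[j];
        }
    }
}
```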
In addition, SVE2 produces more concise and easier-to-understand assembler code than Neon. This significantly reduces the complexity of the generated code, making it easier to develop and maintain, and provides an overall better developer experience.
So, how can you make the most of SVE2? There are several ways to write or generate SVE2 code:
1) A library that uses SVE2
2) SVE2-enabled Compiler
3) SVE2 Intrinsics in C/C++
#include <arm_sve.h>

void saxpy(const float x[], float y[], float a, int n) {
    for (int i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t vec_x = svld1(pg, &x[i]);
        svfloat32_t vec_y = svld1(pg, &y[i]);
        vec_y = svmla_x(pg, vec_y, vec_x, a);
        svst1(pg, &y[i], vec_y);
    }
}
Code 1. SVE2 Intrinsic Example
4) SVE2 Assembly
.LBB0_1:
    ld1w    { z0.s }, p0/z, [x1, x8, lsl #2]
    ld1w    { z1.s }, p0/z, [x2, x8, lsl #2]
    sub     z0.s, z0.s, z1.s
    st1w    { z0.s }, p0, [x0, x8, lsl #2]
    incw    x8
    whilelo p0.s, x8, x9
    b.mi    .LBB0_1
Code 2. SVE2 Assembly Example
If there are SVE2-enabled libraries that provide the functionality you need, then using them may be the easiest option. Assembly can give impressive performance for certain applications, but it is more difficult to write and maintain due to register management and readability. Another approach is to use intrinsics, which generate the appropriate SVE2 instructions and can be called from C/C++ code, improving readability. In addition to libraries and intrinsics, SVE2 allows you to let compilers auto-vectorize your code, improving ease of use while maintaining high performance. More information about how to program for SVE2 can be found on this Arm Developer page.
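Auto-vectorization requires no source changes at all. As a sketch, a simple loop like the following is the kind that compilers can turn into a predicated SVE2 loop much like the assembly in Code 2; the exact output depends on your compiler version and flags (for example, clang with `-O2 -march=armv8-a+sve2`):

```c
#include <stddef.h>

// A straightforward elementwise subtraction: a good candidate for
// compiler auto-vectorization to SVE2. With suitable flags, the compiler
// can emit a whilelo-predicated loop, so no tail cleanup code is needed.
void vec_sub(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] - b[i];
    }
}
```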
SVE2 not only makes vector length scalable, but also has many other features. In this section, we will show you some examples of the benefit of using SVE2 and some of the new instructions that have been added.
Non-linear data-access patterns are common in a variety of applications. Many existing SIMD algorithms spend a lot of time re-arranging data structures into a vectorizable form. SVE2’s gather-load and scatter-store instructions allow direct data transfer between non-contiguous memory locations and SVE2 registers.
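In scalar terms, a gather-load/scatter-store pair does the work of the indexed loop below. The point of the SVE2 instructions is that a whole vector's worth of these indexed loads and stores happens in single instructions, instead of one element at a time. This is our own illustrative sketch, not SVE2 code:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar model of one gather-load plus scatter-store: read src at the
// positions in load_idx (gather) and write dst at the positions in
// store_idx (scatter). SVE2 performs a vector of these accesses per
// instruction rather than this element-wise loop.
void gather_scatter(float *dst, const float *src,
                    const uint32_t *load_idx, const uint32_t *store_idx,
                    size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[store_idx[i]] = src[load_idx[i]];
    }
}
```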
Figure 2. Gather-Load and Scatter-Store
An example of a process that can benefit from this is the FFT (Fast Fourier Transform). This operation is useful in many fields such as image compression and wireless communications. The gather-load and scatter-store feature is ideal for the non-contiguous butterfly addressing used in the FFT.
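To illustrate why FFT addressing is non-contiguous: the classic radix-2 FFT reorders its input into bit-reversed index order, exactly the kind of strided, scattered access pattern that gather-load and scatter-store handle directly. A minimal sketch of that index computation (our own helper, plain C):

```c
#include <stdint.h>

// Bit-reversed index used by the radix-2 FFT input reordering: for an
// 8-point FFT (3 bits), index 1 (binary 001) maps to 4 (binary 100).
// Loading elements at these permuted indices is a natural gather.
uint32_t bit_reverse(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1);  // shift the lowest bit of x into r
        x >>= 1;
    }
    return r;
}
```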
The SVE2 instruction set implements complex-valued integer operations. They are especially useful for calculations involving complex arithmetic, such as the quaternions used to represent the orientation and rotation of objects in games. For example, the multiplication of signed 16-bit complex vectors in SVE2 assembly can be up to 62% faster than in Neon assembly. Below is the C code version of the vector multiplication. Similarly, the computation of an 8x8 inverse matrix using complex numbers was found to be about 13% faster.
struct cplx_int16_t {
    int16_t re;
    int16_t im;
};

int16_t Sat(int32_t a) {
    int16_t b = (int16_t) a;
    if (a > MAX_INT16) b = 0x7FFF; // MAX_INT16 = 0x00007FFF
    if (a < MIN_INT16) b = 0x8000; // MIN_INT16 = 0xFFFF8000
    return b;
}

void vecmul(int64_t n, cplx_int16_t * a, cplx_int16_t * b, cplx_int16_t * c) {
    for (int64_t i = 0; i < n; i++) {
        c[i].re = Sat((((int32_t)(a[i].re * b[i].re) + (int32_t)0x4000) >> 15) -
                      (((int32_t)(a[i].im * b[i].im) + (int32_t)0x4000) >> 15));
        c[i].im = Sat((((int32_t)(a[i].re * b[i].im) + (int32_t)0x4000) >> 15) +
                      (((int32_t)(a[i].im * b[i].re) + (int32_t)0x4000) >> 15));
    }
}
Code 3. C code of Vector Multiply with Complex 16-bit Integer Elements
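The `+ 0x4000` and `>> 15` pattern in Code 3 is Q15 fixed-point multiplication with round-to-nearest: 16-bit values represent fractions in [-1, 1), and adding 0x4000 before the shift rounds instead of truncating. A standalone sketch of just that step (our own helper name; the saturation handled by `Sat` above is omitted here):

```c
#include <stdint.h>

// Q15 fractional multiply with rounding, as used in Code 3.
// 0x4000 represents 0.5 and 0x7FFF is just under 1.0; adding 0x4000
// (half an LSB at the Q30 intermediate scale) before >>15 rounds the
// result to the nearest Q15 value.
int16_t q15_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}
```

Note that the one overflowing case, -1.0 times -1.0, is why the full vecmul routine wraps each term in `Sat`.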
There are also several new instructions introduced in SVE2, such as bitwise permute, string processing, and cryptography. Among them, we would like to highlight the histogram acceleration instructions. The image histogram is widely used in the fields of computer vision and image processing, for example through libraries such as OpenCV. It can be used in techniques like image thresholding and image quality improvement. Modern cameras and smartphones use this kind of information to calculate exposure control and white balance to provide better picture quality.
Figure 3. Image Histogram
The histogram acceleration instructions, newly introduced in SVE2, provide, for each element of a vector register, a count of the matching elements in another vector register. With these instructions, a histogram can be computed with fewer instructions and faster than before. For example, the histogram calculation is conventionally coded as below. In our experiments with SVE2, compilers were not yet mature enough to recognize this loop pattern and emit the new instructions, so we prepared assembly code with the specialized instructions to vectorize the loop. The result was that the SVE2-optimized assembly is about 29% faster than the compiled C code. Neon does not offer a way to vectorize this kind of process. The assembly code used to measure the performance in this section, as well as the compiler version and options, can be found in the Appendix at the end of this blog.
void calc_histogram(unsigned int * histogram, uint8_t * records, unsigned int nb_records) {
    for (unsigned int i = 0; i < nb_records; i++) {
        histogram[records[i]] += 1;
    }
}
Code 4. C code of Histogram Computation for an Image
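As a usage sketch, the same histogram loop feeds typical image-statistics queries such as finding the most frequent pixel value. This is our own illustrative helper, not part of the blog's benchmark, with the histogram loop from Code 4 inlined so the example stands alone:

```c
#include <stdint.h>

// Build a 256-bin histogram of 8-bit pixels (same loop as Code 4,
// inlined here) and return the most frequent pixel value, the kind of
// statistic used in simple exposure or thresholding heuristics.
uint8_t most_frequent_pixel(const uint8_t *pixels, unsigned int n) {
    unsigned int histogram[256] = {0};
    for (unsigned int i = 0; i < n; i++) {
        histogram[pixels[i]] += 1;
    }
    uint8_t best = 0;
    for (unsigned int v = 1; v < 256; v++) {
        if (histogram[v] > histogram[best]) best = (uint8_t)v;
    }
    return best;
}
```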
SVE2 is a great instruction set for computer vision, games, and beyond. There are many other features that we have not mentioned here, so if you want to know more about SVE2, please have a look at this page. More detailed information on SVE2 programming examples can be found here. In the future, we expect to see more examples of effective use cases and applications of SVE2, not just highlights of the differences in primitive operations. Compiler optimization should also improve as time goes on.
The first SVE2-enabled hardware releases will be available at the start of 2022. We are really excited that SVE2 will provide better programmer productivity and enable enhanced ML and DSP capabilities across a wider range of devices and applications. We are very much looking forward to the wider deployment of SVE2.
[CTAToken URL = "https://developer.arm.com/documentation/102340/latest/" target="_blank" text="Learn more about SVE2" class ="green"]
Here is the assembly code used for the performance measurements in this blog post. We used clang version 12.0.0 to compile the code, with -march=armv8-a+sve2+sve2-bitperm as the compilation option. The platform environment is a simulator using a Cortex-A core with a 128-bit vector length.
size        .req x0  // int64_t n
aPtr        .req x1  // cplx_int16_t * a
bPtr        .req x2  // cplx_int16_t * b
outPtr      .req x3  // cplx_int16_t * c
aPtr_1st    .req x4
bPtr_1st    .req x5
outPtr_1st  .req x6
count       .req x7

    PTRUE   p2.h
    LSL     size, size, #1
    DUP     z31.s, #0
    CNTH    count
    WHILELT p4.h, count, size
    B.NFRST .L_tail_vecmul
    ADDVL   aPtr_1st, aPtr, #-1
    ADDVL   bPtr_1st, bPtr, #-1
    ADDVL   outPtr_1st, outPtr, #-1
.L_unrolled_loop_vecmul:
    LD1H    z0.h, p2/z, [aPtr_1st, count, LSL #1]
    LD1H    z2.h, p4/z, [aPtr, count, LSL #1]
    LD1H    z1.h, p2/z, [bPtr_1st, count, LSL #1]
    LD1H    z3.h, p4/z, [bPtr, count, LSL #1]
Code 5. Optimized SVE2 Assembler Code of Vector Multiply with Complex 16-bit Integer Elements
    MOV     w2, w2
    MOV     x4, #0
    WHILELO p1.s, x4, x2
    B.NFRST .L_return
.L_loop:
    LD1B    z1.s, p1/Z, [x1, x4]
    LD1W    z2.s, p1/Z, [x0, z1.s, UXTW #2]
    HISTCNT z0.s, p1/Z, z1.s, z1.s
    ADD     z2.s, p1/M, z2.s, z0.s
    ST1W    z2.s, p1, [x0, z1.s, UXTW #2]
    INCW    x4
    WHILELO p1.s, x4, x2
    B.FIRST .L_loop
.L_return:
    RET
Code 6. Vector Length Agnostic SVE2 Assembler Code of 8-bit Pixels Image Histogram