This blog describes how to add Arm Scalable Vector Extension (SVE) support to xsimd, an open-source SIMD library. The information is also helpful for porting existing SIMD code to SVE.
A SIMD library wraps compiler intrinsics to provide a single, architecture-independent API for SIMD programming.
The major benefit of a SIMD library is that it reduces code maintenance effort, especially for a complex project.
SIMD libraries are often implemented as C++ header-only libraries. We expect that your compiler will inline and optimize the SIMD library together with the application. This eliminates potential performance penalties from software abstractions.
There are many open-source SIMD libraries. Besides xsimd, this blog also references highway, another SIMD library implementation.
This section describes what a SIMD library does by comparing code that implements a SIMD "vertical add" function with and without a SIMD library.
The SIMD "vertical add" function adds two input arrays, x and y, element-wise and stores the results in the output array z. That is, z[i] = x[i] + y[i]. Assume that we want to support both x86 SSE and Arm Neon.
Review the source code with compiler explorer
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#elif defined(__aarch64__)
#include <arm_neon.h>
#endif

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process 4 integers (128 bits) with simd
    unsigned int i = 0;
    for (; i + 4 <= count; i += 4) {
#if defined(__SSE4_2__)
        const __m128i vx = _mm_lddqu_si128((const __m128i*)(x + i));
        const __m128i vy = _mm_lddqu_si128((const __m128i*)(y + i));
        const __m128i vz = _mm_add_epi32(vx, vy);
        _mm_storeu_si128((__m128i*)(z + i), vz);
#elif defined(__aarch64__)
        const int32x4_t vx = vld1q_s32(x + i);
        const int32x4_t vy = vld1q_s32(y + i);
        const int32x4_t vz = vaddq_s32(vx, vy);
        vst1q_s32(z + i, vz);
#endif
    }
    // tail loop
    for (; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
}
The source code is cluttered by architecture-dependent code wrapped inside predefined macros like "#ifdef __aarch64__".
This pattern is often used in open-source software supporting multiple architectures.
#include <xsimd/xsimd.hpp>

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process with simd
    using simd_batch = xsimd::simd_type<int>;
    const int lanes = simd_batch::size;
    unsigned int i = 0;
    for (; i + lanes <= count; i += lanes) {
        const auto vx = simd_batch::load_unaligned(x + i);
        const auto vy = simd_batch::load_unaligned(y + i);
        const auto vz = vx + vy;
        vz.store_unaligned(z + i);
    }
    // tail loop
    for (; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
}
The code above uses xsimd, an open-source SIMD library. This code is more concise than the intrinsics example, without any architecture-dependent code. Confusing names such as "_mm_lddqu_si128" and "vld1q_s32" are unified to "load_unaligned", which is much easier to read.
#include <hwy/highway.h>

namespace HWY_NAMESPACE {

using namespace hwy::HWY_NAMESPACE;

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process with simd
    const ScalableTag<int> tag;
    unsigned int i = 0;
    for (; i + Lanes(tag) <= count; i += Lanes(tag)) {
        const auto vx = Load(tag, x + i);
        const auto vy = Load(tag, y + i);
        const auto vz = Add(vx, vy);
        Store(vz, tag, z + i);
    }
    // tail loop
    for (; i < count; ++i) {
        z[i] = x[i] + y[i];
    }
}

}  // namespace HWY_NAMESPACE
highway is an open-source SIMD library from Google, though it is not an officially supported Google product. The code above is similar to the xsimd version. It has an additional "tag" argument for most operations because of the different design principles explained in the next section.
Detailed implementations of SIMD libraries often look complicated, but the aim is simple: to implement a thin wrapper to compiler intrinsics.
There are two main ways to implement a SIMD library: the Vector approach and the Zero-sized tag approach.
We can easily see the difference in the APIs of xsimd and highway.
As shown in the code below, load and store operations are class methods in xsimd. But they are global functions in highway. In highway, they require a tag argument, which encodes the SIMD register type, to select the correct overloaded function.
// memcpy from *src to *dest by a vector

// xsimd: vector approach
// "simd_batch" is a class encapsulating a SIMD register
using simd_batch = xsimd::simd_type<int>;
// "load, store" are methods of class "simd_batch"
const auto v = simd_batch::load_unaligned(src);
v.store_unaligned(dest);

// ===========================================================

// highway: zero-sized tag approach
// "tag" encodes the SIMD register type
// only the value type of "tag" is useful, not the value itself
const ScalableTag<int> tag;
// "load, store" are statically dispatched to the appropriate intrinsic per "tag"
const auto v = Load(tag, src);
Store(v, tag, dest);
The Vector approach is easy to follow, while the Zero-sized tag approach looks more complicated. highway explains their reasoning for using the Zero-sized tag approach in implementation details:
“The key to understanding Highway is to differentiate between vectors and zero-sized tag arguments. The former store actual data and are mapped by the compiler to vector registers. The latter (Simd<> and SizeTag<>) are only used to select among the various overloads of functions such as Set. This allows Highway to use builtin vector types without a class wrapper.”
The Zero-sized tag approach supports sizeless types, such as SVE sizeless register types. These cannot be embedded into a C++ class or C structure. The Vector approach requires declaring a SIMD register inside a class. This is incompatible with SVE variable sized registers and operations.
For example, the code below does not compile.
// Source code
#include <arm_sve.h>

struct S {
    // cannot declare SVE sizeless data member inside a struct
    svint32_t v;
};

// Compilation failed
<source>:5:15: error: member variables cannot have SVE type 'svint32_t'
    5 |     svint32_t v;
      |               ^
Compiler returned: 1
We describe this limitation, and why we use fixed-size SVE for xsimd, in later sections.
SIMD libraries do improve code readability and maintainability. However, they have limitations and might introduce some issues.
Though rare, we sometimes saw a significant performance gap between intrinsics and a SIMD library. Unless you are converting existing intrinsics code to a SIMD library and comparing the performance afterwards, this kind of issue is difficult to identify.
SIMD libraries cannot unify all the architecture-dependent code. Each Instruction Set Architecture (ISA) can implement some unique instructions which are good at solving specific problems. Even for common SIMD operations, minor gaps in details can lead to difficulties writing a unified code which gets the best performance on all architectures.
In general, we believe SIMD libraries are useful for larger scale software projects such as Apache Arrow, when maintainability is more important than extreme performance of some specific code path. For smaller software projects, with a dedicated purpose, such as xxHash, directly using compiler intrinsics might be a better choice.
SVE is the next-generation SIMD extension of the Armv8-A instruction set. SVE is not an extension of Neon. It is a new set of vector instructions that are developed to target High Performance Computing (HPC) and Machine Learning (ML) workloads.
SVE2 extends the SVE instruction set to enable data-processing domains beyond HPC and ML.
SVE has new features such as gather-load, scatter-store, and speculative vectorization. It also has some unconventional properties.
Unlike traditional SIMD ISAs with a fixed vector size, for example, 128 bits for Neon and SSE and 256 bits for AVX, the SVE vector size is determined at runtime. More importantly, a single binary can adapt to different CPUs with various SVE vector sizes. SVE vector sizes range from 128 to 2048 bits, in multiples of 128 bits.
Size-agnostic vectors introduce a programming model different from traditional fixed-size vectors. Existing software must be adapted to benefit from this feature.
A predicate is a bitmap that enables and disables specific lanes in a vector. Predication makes SIMD programming flexible and enables clever optimizations. One typical use case is to eliminate the tail loop often seen in SIMD code, which processes the last data block shorter than a vector, as shown above in the vertical add example code.
ACLE for SVE describes SVE intrinsics and programming tips.
Compared with traditional ISAs such as NEON and SSE, SVE intrinsics have some interesting properties. The following code rewrites the vertical add example with SVE to show the main difference.
#include <arm_sve.h>

// z[i] = x[i] + y[i]
void vadd(const int* x, const int* y, int* z, unsigned int count) {
    // process with sve
    const unsigned int lanes = svlen(svint32_t());
    for (unsigned int i = 0; i < count; i += lanes) {
        const svbool_t pred = svwhilelt_b32(i, count);
        const auto vx = svld1(pred, x + i);
        const auto vy = svld1(pred, y + i);
        const auto vz = svadd_z(pred, vx, vy);
        svst1(pred, z + i, vz);
    }
    // no tail loop
}
The previous code, built once, can run on machines with different SVE vector sizes. The intrinsic svlen returns the number of int32 lanes of a vector at runtime, unlike Neon, where int32x4_t is fixed to 4 lanes at compile time.
To represent underlying vectors which are size agnostic, SVE vector types (for example, svint32_t) are sizeless. A sizeless type is an extension to C/C++. For more information, see the ACLE document: sizeless types.
Compared with normal, sized data types, for example, int and char, sizeless types have many restrictions. For example, a sizeless type cannot be a member of a class or structure, sizeof cannot be applied to it, and it cannot be used as an array element type.
These restrictions make it impossible to use SVE sizeless types in SIMD libraries that use the Vector approach, such as xsimd, because xsimd cannot declare a sizeless SVE register inside a class.
In the previous code, with the help of the per-lane predicate, there is no tail loop to process a remaining data block shorter than a vector.
For the last iteration, the predicate generated by svwhilelt_b32 disables the lanes beyond the bound of the input buffer. Only partial, valid data is read and processed. This way, the code for the last loop (partial register) is consistent with the main loop (full register).
Like C++ overloading, SVE intrinsic functions are overloaded per vector type. This greatly simplifies the client code.
Based on vector type, Neon provides a separate intrinsic function for each numerical vertical add, for example, vadd_s8, vaddq_s8, vaddq_s16, vaddq_s32, vaddq_f32, and many more.
In SVE, a single overloaded intrinsic function, svadd_x, accepts all the different vector types.
Thanks to function overloading, in the xsimd source, the SVE implementation of ~1000 lines is much more concise than NEON’s ~4000 lines.
Currently, in mid-2023, Amazon and Alibaba provide Arm SVE-enabled cloud instances. Open-source projects can develop SVE code or run CI on these instances.
You can also use software emulators for SVE development.
The QEMU user space emulator can run Arm code, including SVE, on x86_64 and other architectures. We adopted this approach in the xsimd CI.
If the community CI has an AArch64 host without SVE support, it might be better to use Arm Instruction Emulator (ArmIE). ArmIE only emulates instructions not supported by the host, which is often faster than QEMU when running native code.
Though slower than real hardware, emulation has a big advantage: it can verify SVE with different vector sizes easily by passing the vector size as a command-line parameter.
Another method, used by highway, is to leverage Farm-SVE, where a single C++ header file implements ACLE for SVE. Farm-SVE enables you to develop SVE-based code without having an Arm CPU, without cross-compiling, and without the need of an emulator.
However, emulator-based approaches can only verify SVE functionalities, not the performance.
As described above, xsimd adopts the Vector approach. The core xsimd data structure, simd_register, declares a SIMD register as its data member. xsimd also extensively uses the register size as a compile-time constant. This means xsimd cannot support size-agnostic SVE: it is illegal to declare a variable of a sizeless type in a class or struct, and we cannot apply sizeof to a sizeless type.
For more information, see the xsimd GitHub issue: How to support Arm SVE. The conclusion is that xsimd only supports fixed-size SVE.
// from xsimd source code, simplified
struct simd_register<int8_t, xsimd::neon> {
    // okay, int8x16_t is fixed-size
    int8x16_t data;
};

struct simd_register<int8_t, xsimd::sve> {
    // oops! illegal as svint8_t is sizeless!
    svint8_t data;
};
Besides sizeless data types, ACLE also supports fixed-size SVE data types. Fixed-size SVE types are like normal C types, such as int and char, without the restrictions of sizeless types. For more information, see the ACLE document: Fixed-length SVE types.
More specifically, for gcc 10+ and clang 12+, we can pass the compiler option -msve-vector-bits to set the vector size to a fixed value. Then we can create fixed-size SVE data types by aliasing sizeless types with specific attributes, as the code below shows.
// compile with "-march=armv8-a+sve -msve-vector-bits=128"

// define fixed-size (128-bit) SVE int32 register type
typedef svint32_t svfixed_int32_t __attribute__((arm_sve_vector_bits(128)));

// okay to use sizeof() to get the size of an SVE fixed-size type
int bits = sizeof(svfixed_int32_t) * 8;

// okay to declare an SVE fixed-size type member in a struct
struct S {
    svfixed_int32_t v;
};
xsimd uses fixed-size SVE. Code based on xsimd is not vector-size agnostic. There is still a single copy of the source code, but we must compile it with the compiler option -msve-vector-bits to generate binaries matching different hardware.
Code written with fixed-size SVE can only run on hardware supporting that vector size. Running SVE code built with -msve-vector-bits=128 on a CPU with a 256-bit vector size results in undefined behavior. It may cause crashes, hangs, or silent bugs that are hard to debug. This is a big issue for Linux distributions, where we do not know what hardware the code will run on. A good practice is to check the underlying hardware at runtime and choose a compatible SIMD type, for example, falling back to Neon if the SVE vector size does not match. xsimd provides an API for runtime dispatching per CPU features.
xsimd supports Arm NEON and x86 up to AVX512. It is a C++ header-only library; xsimd does not generate object code. In the source code, xsimd extensively uses C++ template metaprogramming to keep the code generic over all ISAs.
Adding SVE support means implementing the interfaces of various SIMD operations using SVE intrinsics. The details are specific to the xsimd code base. Looking through all the SVE code is tedious and might not be useful for other projects. For more information, see the pull request for details.
SIMD libraries adopting the Vector approach, such as xsimd, often share a similar code structure. A trivial SIMD library is implemented to demonstrate the pattern. It supports x86 SSE, Arm Neon, and SVE, with only three operations: load, store, and vertical add. For more information, see the source code of this trivial implementation.
For xsimd, the SVE implementation is verified with the QEMU user mode emulator on an x86_64 host.
We can show the detailed steps with a simple SVE program that calculates the horizontal sum of an array, that is, the sum of all elements in the array.
# on an x86_64 host (cross toolchain + QEMU user mode)
$ sudo apt install qemu-user gcc-10-aarch64-linux-gnu
# on an AArch64 host
$ sudo apt install qemu-user gcc-10
#include <arm_sve.h>
#include <stdio.h>

// returns x[0] + x[1] + ... + x[count-1]
static int hsum(const int* x, int count) {
    svint32_t vs = svdup_s32(0);
    for (int i = 0; i < count; i += svlen(vs)) {
        const svbool_t pred = svwhilelt_b32(i, count);
        const svint32_t vx = svld1(pred, x + i);
        vs = svadd_m(pred, vs, vx);
    }
    return svaddv(svptrue_b32(), vs);
}

int main() {
    // 1 + 2 + 3 + ... + 10 = 55
    const int x[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%d\n", hsum(x, sizeof(x) / sizeof(x[0])));
}
# cross compile on an x86_64 host
$ aarch64-linux-gnu-gcc-10 -march=armv8-a+sve hsum.c -o hsum
# compile natively on an AArch64 host
$ gcc-10 -march=armv8-a+sve hsum.c -o hsum
The QEMU user mode emulator can set the SVE vector size with a command-line parameter, making it easy to test different vector sizes.
# vector size = 128 bits
$ qemu-aarch64 -cpu max,sve128=on -L /usr/aarch64-linux-gnu/ ./hsum
55

# vector size = 256 bits
$ qemu-aarch64 -cpu max,sve256=on -L /usr/aarch64-linux-gnu/ ./hsum
55
# on an AArch64 host
$ qemu-aarch64 -cpu max,sve128=on ./hsum
55