Arm and the partner ecosystem have been working hard to bring the Scalable Vector Extension (SVE) to life in recent years. We were delighted to see the Fugaku enter the Top500 list in first place. This machine is powered by the A64FX CPU from Fujitsu, featuring SVE. More importantly, it provides much-needed computational power to the global fight against the COVID-19 pandemic. Our own SVE and SVE2-enabled CPUs are also coming soon with Neoverse V1 and Neoverse N2 as follow-ups to the hugely successful Neoverse N1 platform addressing a wide range of power and performance design points for the infrastructure market.
It takes a huge amount of sustained planning, engineering, and innovation to produce a market leading CPU and preparing the enabling software to make the most of it is a significant effort too. The SVE ISA is, among other things, a great compiler auto-vectorization target. It is also flexible, allowing a CPU to implement a vector width best suited for its requirements, while keeping compatibility with the rest of the Arm software ecosystem. Indeed, the CPUs mentioned above all implement different hardware SVE vector widths, reflecting their target use cases.
Teaching a cutting edge, widely used, open-source compiler like GCC to target SVE effectively has been a fascinating ongoing journey for us and the GCC community. GCC first learned to target SVE for auto-vectorization with the GCC 8.1 release. It has has been picking up more SVE goodness since, like support for the SVE and SVE2 ACLE intrinsics in GCC 10.1. This post showcases some of the ways the GCC compiler can help you get the most out of SVE for your applications.
Do you prefer to write portable, high-level code and rely on the compiler to make the most of your hardware? Are willing to get your hands dirty with highly-optimised intrinsics code? GCC provides the facilities to help you.
GCC can automatically use many cool features from SVE designed for auto-vectorization like lane-predication, gather loads and scatter store, condition reduction instructions, hardware checking of read-after-write conflicts, and more. These are used, as appropriate when auto-vectorization is enabled, currently when using an optimization level of -O3 and above and when compiling for an SVE target. Let us give it a try.
f (float *x, int *filter, int n)
float res = 0.0f;
for (int i = 0; i < n; ++i)
res = x[i];
Compiling this code with -O3 -mcpu=neoverse-v1 generates the following assembly:
movi v0.2s, #0
cmp w2, 0
mov x3, 0
whilelo p0.s, wzr, w2
ld1w z1.s, p0/z, [x1, x3, lsl 2]
cmpne p0.s, p0/z, z1.s, #0
ld1w z1.s, p0/z, [x0, x3, lsl 2]
add x3, x3, x4
clastb s0, p0, s0, z1.s
whilelo p0.s, w3, w2
Here we can see GCC using the Z registers that are used by SVE instructions. If you are interested in how the loop control and predication features in SVE are used, check out our tutorials on SVE Vector length agnostic programming. The conditional control flow in the original code does not allow for efficient vectorization with Advanced SIMD. On SVE features like predication, conditional loads and the conditional reduction instruction CLASTB allow the compiler to generate quite an efficient vector sequence.
How to pick the right target option for your applications? It depends on how much you know about your target CPU. If you know your code will be running on only one type of CPU, it is as easy as using the -mcpu option for it. For example, if you are compiling code to run on a Neoverse V1 CPU you specify -mcpu=neoverse-v1. GCC knows that the Neoverse V1 CPU supports SVE and will make sure to use it when it is confident that it helps performance. If you are developing natively on the same system that you are running your code, you can use -mcpu=native when compiling. GCC detects the CPU it is running on and translate it into the most appropriate set of target flags. This allows you to reuse the same build rules and Makefiles for your project as long as you are compiling and running your application on the same target.
Some times you are compiling a binary that you expect to run on a wider range of SVE hardware, for example a Neoverse V1 or a Fujitsu A64FX system. You can use the -march option to specify the architecture features that your target CPUs have in common. GCC has switches that allow the user to enable desirable features that are not in the baseline Armv8-A architecture, like SVE. For example, if the range of CPUs you are targeting implement at least the extensions from Armv8.2-a and also include SVE you can use -march=armv8.2-a+sve. This will generate code that will be compatible across all CPUs implementing these features and optimize the code to give reasonable performance across that range of CPUs.
Note: Whether you use -mcpu or -march the SVE code produced follows the Vector length agnostic approach and will run correctly across any hardware SVE vector length. This is the Scalable part of SVE. As with any data processing system, the more precise you are with your requirements to the compiler, the better its optimization choices are. If you know the platform, you are targeting Arm recommends you use the -mcpu option for your target CPU for the best-tuned code generation.
For the power users that have a particular algorithm in mind and know which SVE and SVE2 features it maps down to, it is sometimes convenient to drop down to compiler intrinsics to express the algorithm, while still delegating low-level tasks like register allocation, scheduling, addressing mode selection to the compiler. As of GCC 10.1 you can access a large set of intrinsics for SVE and SVE2 as defined by the Arm C language extensions or ACLE. For example, a 3x3 Sobel horizontal filtering step may look like this:
sobel_hor_sve (float *input, // pointer to image[height, width]
float *out_hor, // pointer to horizontal filtering output
const float *kx, // Sobel horizontal filter coefficients
int64_t height, // image height
int64_t j, i;
svfloat32_t in1, in2, in3, res;
uint64_t vl = svcntw ();
for (j = 0; j < height; j++)
float32_t *in_ptr = &input[j * width];
float32_t *out_ptr = &out_hor[j * (width - 2)];
for (i = 0; i < width - 2; i += vl)
p_row = svwhilelt_b32 (i, width - 2);
in1 = svld1 (p_row, &in_ptr[i]);
in2 = svld1 (p_row, &in_ptr[i + 1]);
in3 = svld1 (p_row, &in_ptr[i + 2]);
res = svmul_x (p_row, in1, kx);
res = svmla_m (p_row, res, in2, kx);
res = svmla_m (p_row, res, in3, kx);
svst1 (p_row, &out_ptr[i], res);
The intrinsics are available when including the arm_sve.h header that comes automatically with your GCC distribution. You can use intrinsics to access all the interesting features in SVE including predication, loop control and partitioning, gather loads, scatter stores and more. Check out some popular algorithms ported to SVE in in our SVE programming examples document here.
SVE is an extension to the Armv8-A architecture, available in the AArch64 state. SVE2 is part of Armv9-a, again available for AArch64. Of course, AArch64 includes the Advanced SIMD architecture by default as part of the baseline, and there has been considerable investment by Arm and the ecosystem to use it efficiently. How does it tie in with SVE and SVE2? The good news is that when compiling normal portable code, you do not have to worry about it. When compiling with the appropriate options to GCC for your target, the compiler should know which tool to use for the job, taking into account properties of the program like estimated loop iteration count, the inherent parallelism in the required operations, CPU-specific parameters like the hardware SVE width (while still taking care to generate portable VLA code), latencies, and throughputs of the operations and properties of the instruction set like available instructions and auto-vectorization strategies possible with them. For example, if we take an integer rounding average loop:
avg (uint8_t *restrict x, uint8_t *restrict y, uint8_t *restrict z, int n)
for (int i = 0; i < n; i++)
z[i] = ((uint64_t)x[i] + y[i] + 1) >> 1;
a compiler can use any of a number of strategies to vectorize it with Advanced SIMD, SVE or SVE2:
Advanced SIMD loop using multiple vectorization factors
.L4: //16x vector factor
ldr q0, [x0, x4]
ldr q1, [x1, x4]
urhadd v0.16b, v0.16b, v1.16b
str q0, [x2, x4]
add x4, x4, 16
cmp x4, x5
... // 8x vector factor
... // scalar epilogue
// SVE loop avoiding epilogue loops by using predication
ld1b z0.b, p0/z, [x0, x4]
ld1b z2.b, p0/z, [x1, x4]
lsr z1.b, z0.b, #1
orr z0.d, z0.d, z2.d
lsr z2.b, z2.b, #1
and z0.b, z0.b, #0x1
add z1.b, z1.b, z2.b
add z0.b, z1.b, z0.b
st1b z0.b, p0, [x2, x4]
whilelo p0.b, w4, w3
// SVE2 loop using predication and SVE2 URHADD instruction
ld1b z0.b, p0/z, [x0, x4]
ld1b z1.b, p0/z, [x1, x4]
urhadd z0.b, p1/m, z0.b, z1.b
st1b z0.b, p0, [x2, x4]
whilelo p0.b, w4, w3
A compiler like GCC can even combine strategies, for example by using an Advanced SIMD sequence for the main vector body that uses the URHADD instruction that is not available in SVE (but added in SVE2). And using an SVE predicated epilogue, avoiding the need for multiple vectorization factors and unrolled epilogues for the tail of the loop.
If you are a power user writing SVE intrinsics code, you may find it useful to consult a software optimization guide for the processor you are targeting. For example, Arm publishes the Neoverse V1 software optimization guide that contains all the information you need to reason about the performance features of your SVE and Advanced SIMD code.
We have seen some of the ways you can use GCC 11 to enable SVE and SVE2 features in your application, from auto-vectorization to intrinsics. SVE and SVE2 are very Versatile instruction sets that allow you to get the most of your application. A compiler's job is never finished and we look forward to improving GCC with more SVE features and tuning in the future. We hope you are as excited as we are for the arrival of the first SVE-enabled Neoverse CPUs as we are. Check out our blog on some of the performance optimizations that went into the compiler over the past year.
In the meantime, have a look at the GCC 11 release notes for all the cool new features this major compiler release brings, try out some examples yourself on the excellent https://godbolt.org/ website, now with the latest AArch64 GCC builds, and the SVE Programmer's Guide for to learn how software can make use of SVE.
SVE Programmer's Guide