In March 2021, Arm introduced the next-generation Armv9 architecture with enhanced security and artificial intelligence (AI) capabilities. This was followed in May by the launch of the new Arm Total Compute solutions, which include the first ever Armv9 CPUs. The biggest new feature that developers will notice immediately is the enhanced vector processing, which enables increased machine learning (ML) and digital signal processing (DSP) capabilities across a wider range of applications. In this blog post, we share the advantages and benefits of version two of the Scalable Vector Extension (SVE2).
Figure 1. Extending Vector Processing for ML and DSP in Armv9 (from Arm Vision Day)
Applications that process large amounts of data can be sped up by taking advantage of parallel execution instructions, known as SIMD (Single Instruction Multiple Data) instructions. SVE was first introduced as an optional extension in the Armv8.2 architecture, following the existing Neon technology. SVE2 was introduced for Armv9 CPUs as a feature extension of SVE. The main difference between SVE2 and SVE is the functional coverage of the instruction set. SVE was designed for High Performance Computing (HPC) and ML applications. SVE2 extends the SVE instruction set to enable data-processing domains beyond HPC and ML, such as computer vision, multimedia, games, LTE baseband processing, and general-purpose software. We see SVE and SVE2 as an evolution of our SIMD architecture, bringing many useful features beyond those already provided by Neon.
As the "Scalable" in its name suggests, the SVE2 design concept enables developers to write and build software once, then run the same binaries on different AArch64 hardware regardless of the implemented SVE2 vector length. Since laptop and mobile devices may implement different vector lengths, SVE2 can reduce the cost of cross-platform support through shared code. Removing the requirement to rebuild binaries allows software to be ported more easily. This scalability and portability also means that developers do not need to know or care about the vector length of their target devices. This benefit of SVE2 grows when software is shared across platforms or used over an extended period of time.
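The vector-length-agnostic idea can be sketched in plain C. This is not SVE2 code, just a conceptual model: the parameter `vl` stands in for the hardware vector length that real SVE2 code queries at run time (for example with `svcntw()`), and the inner bound plays the role of the loop predicate. The same source handles any `vl` without recompilation.

```c
#include <stddef.h>

// Conceptual sketch of a vector-length-agnostic loop (plain C, not SVE2).
// `vl` models the hardware vector length discovered at run time; the same
// code works unchanged whether vl is 4, 8, or any other width.
void add_arrays(const float *x, float *y, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl) {
        // The "predicate": only lanes i..min(i+vl, n)-1 are active,
        // which is what a whilelt-generated predicate expresses in SVE2.
        size_t limit = (i + vl < n) ? i + vl : n;
        for (size_t j = i; j < limit; j++) {
            y[j] += x[j];
        }
    }
}
```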
In addition, SVE2 produces more concise and easier-to-understand assembler code than Neon. This significantly reduces the complexity of the generated code, making it easier to develop and maintain, and provides an overall better developer experience.
So, how can you make the most of SVE2? There are several ways to write or generate SVE2 code:
1) A library that uses SVE2
2) SVE2-enabled Compiler
3) SVE2 Intrinsics in C/C++
#include <arm_sve.h>

void saxpy(const float x[], float y[], float a, int n) {
    for (int i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t vec_x = svld1(pg, &x[i]);
        svfloat32_t vec_y = svld1(pg, &y[i]);
        vec_y = svmla_x(pg, vec_y, vec_x, a);
        svst1(pg, &y[i], vec_y);
    }
}
Code 1. SVE2 Intrinsic Example
4) SVE2 Assembly
.LBB0_1:
    ld1w    { z0.s }, p0/z, [x1, x8, lsl #2]
    ld1w    { z1.s }, p0/z, [x2, x8, lsl #2]
    sub     z0.s, z0.s, z1.s
    st1w    { z0.s }, p0, [x0, x8, lsl #2]
    incw    x8
    whilelo p0.s, x8, x9
    b.mi    .LBB0_1
Code 2. SVE2 Assembly Example
If there are SVE2-enabled libraries that provide the functionality you need, then using them may be the easiest option. Assembly can give impressive performance for certain applications, but it is more difficult to write and maintain due to register management and readability. Another approach is to use intrinsics, which generate the appropriate SVE2 instructions and can be called from C/C++ code, improving readability. In addition to libraries and intrinsics, SVE2 allows you to let compilers auto-vectorize your code, improving ease of use while maintaining high performance. More information about how to program for SVE2 can be found on this Arm Developer page.
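Auto-vectorization requires no source changes at all. As a sketch, a simple loop like the following is the kind that compilers can turn into a predicated SVE2 loop much like the assembly in Code 2; the exact output depends on your compiler version and flags (for example, clang with `-O2 -march=armv8-a+sve2`):

```c
#include <stddef.h>

// A straightforward elementwise subtraction: a good candidate for
// compiler auto-vectorization to SVE2. With suitable flags, the compiler
// can emit a whilelo-predicated loop, so no tail cleanup code is needed.
void vec_sub(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] - b[i];
    }
}
```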
SVE2 not only makes vector length scalable, but also has many other features. In this section, we will show you some examples of the benefit of using SVE2 and some of the new instructions that have been added.
Non-linear data-access patterns are common in a variety of applications. Many existing SIMD algorithms spend a lot of time re-arranging data structures into a vectorizable form. SVE2’s gather-load and scatter-store instructions allow direct data transfer between non-contiguous memory locations and SVE2 registers.
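In scalar terms, a gather-load/scatter-store pair does the work of the indexed loop below. The point of the SVE2 instructions is that a whole vector's worth of these indexed loads and stores happens in single instructions, instead of one element at a time. This is our own illustrative sketch, not SVE2 code:

```c
#include <stddef.h>
#include <stdint.h>

// Scalar model of one gather-load plus scatter-store: read src at the
// positions in load_idx (gather) and write dst at the positions in
// store_idx (scatter). SVE2 performs a vector of these accesses per
// instruction rather than this element-wise loop.
void gather_scatter(float *dst, const float *src,
                    const uint32_t *load_idx, const uint32_t *store_idx,
                    size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[store_idx[i]] = src[load_idx[i]];
    }
}
```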
Figure 2. Gather-Load and Scatter-Store
An example of a process that can benefit from this is the FFT (Fast Fourier Transform). This operation is useful in many fields such as image compression and wireless communications. The gather-load and scatter-store feature is ideal for the non-contiguous butterfly addressing used in the FFT.
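To illustrate why FFT addressing is non-contiguous: the classic radix-2 FFT reorders its input into bit-reversed index order, exactly the kind of strided, scattered access pattern that gather-load and scatter-store handle directly. A minimal sketch of that index computation (our own helper, plain C):

```c
#include <stdint.h>

// Bit-reversed index used by the radix-2 FFT input reordering: for an
// 8-point FFT (3 bits), index 1 (binary 001) maps to 4 (binary 100).
// Loading elements at these permuted indices is a natural gather.
uint32_t bit_reverse(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1);  // shift the lowest bit of x into r
        x >>= 1;
    }
    return r;
}
```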
The SVE2 instruction set implements complex-valued integer operations. They are especially useful for calculations involving complex arithmetic, such as the quaternions used to represent the orientation and rotation of objects in games. For example, the multiplication of signed 16-bit complex vectors in SVE2 assembly can be up to 62% faster than in Neon assembly. Below is the C code version of the vector multiplication. Similarly, the computation of an 8x8 inverse matrix using complex numbers was found to be about 13% faster.
struct cplx_int16_t {
    int16_t re;
    int16_t im;
};

int16_t Sat(int32_t a) {
    int16_t b = (int16_t) a;
    if (a > MAX_INT16) b = 0x7FFF; // MAX_INT16 = 0x00007FFF
    if (a < MIN_INT16) b = 0x8000; // MIN_INT16 = 0xFFFF8000
    return b;
}

void vecmul(int64_t n, cplx_int16_t * a, cplx_int16_t * b, cplx_int16_t * c) {
    for (int64_t i = 0; i < n; i++) {
        c[i].re = Sat((((int32_t)(a[i].re * b[i].re) + (int32_t)0x4000) >> 15) -
                      (((int32_t)(a[i].im * b[i].im) + (int32_t)0x4000) >> 15));
        c[i].im = Sat((((int32_t)(a[i].re * b[i].im) + (int32_t)0x4000) >> 15) +
                      (((int32_t)(a[i].im * b[i].re) + (int32_t)0x4000) >> 15));
    }
}
Code 3. C code of Vector Multiply with Complex 16-bit Integer Elements
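The `+ 0x4000` and `>> 15` pattern in Code 3 is Q15 fixed-point multiplication with round-to-nearest: 16-bit values represent fractions in [-1, 1), and adding 0x4000 before the shift rounds instead of truncating. A standalone sketch of just that step (our own helper name; the saturation handled by `Sat` above is omitted here):

```c
#include <stdint.h>

// Q15 fractional multiply with rounding, as used in Code 3.
// 0x4000 represents 0.5 and 0x7FFF is just under 1.0; adding 0x4000
// (half an LSB at the Q30 intermediate scale) before >>15 rounds the
// result to the nearest Q15 value.
int16_t q15_mul(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}
```

Note that the one overflowing case, -1.0 times -1.0, is why the full vecmul routine wraps each term in `Sat`.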
There are also several new instructions introduced in SVE2, such as bitwise permute, string processing, and cryptography. Among them, we would like to highlight the histogram acceleration instructions. The image histogram is widely used in the fields of computer vision and image processing, for example through libraries such as OpenCV. It can be used in techniques like image thresholding and image quality improvement. Modern cameras and smartphones use this kind of information to calculate exposure control and white balance to provide better picture quality.
Figure 3. Image Histogram
The histogram acceleration instructions, newly introduced in SVE2, provide, for each element of a vector register, a count of the matching elements in another vector register. With these instructions, a histogram can be computed with fewer instructions and faster than before. For example, the histogram calculation is conventionally coded as below. In our experiments with SVE2, compilers were not yet mature enough to recognize this loop pattern and emit the new instructions, so we prepared assembly code with the specialized instructions to vectorize the loop. The result was that the SVE2-optimized assembly is about 29% faster than the compiled C code. Neon does not offer a way to vectorize this kind of process. The assembly code used to measure the performance in this section, as well as the compiler version and options, can be found in the Appendix at the end of this blog.
void calc_histogram(unsigned int * histogram, uint8_t * records, unsigned int nb_records) {
    for (unsigned int i = 0; i < nb_records; i++) {
        histogram[records[i]] += 1;
    }
}
Code 4. C code of Histogram Computation for an Image
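As a usage sketch, the same histogram loop feeds typical image-statistics queries such as finding the most frequent pixel value. This is our own illustrative helper, not part of the blog's benchmark, with the histogram loop from Code 4 inlined so the example stands alone:

```c
#include <stdint.h>

// Build a 256-bin histogram of 8-bit pixels (same loop as Code 4,
// inlined here) and return the most frequent pixel value, the kind of
// statistic used in simple exposure or thresholding heuristics.
uint8_t most_frequent_pixel(const uint8_t *pixels, unsigned int n) {
    unsigned int histogram[256] = {0};
    for (unsigned int i = 0; i < n; i++) {
        histogram[pixels[i]] += 1;
    }
    uint8_t best = 0;
    for (unsigned int v = 1; v < 256; v++) {
        if (histogram[v] > histogram[best]) best = (uint8_t)v;
    }
    return best;
}
```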
SVE2 is a great instruction set for computer vision, games, and beyond. There are many other features that we have not mentioned here, so if you want to know more about SVE2, please have a look at this page. More detailed information on SVE2 programming examples can be found here. In the future, we expect to see more examples of effective use cases and applications of SVE2, not just highlights of the differences in primitive operations. Compiler optimization should also improve as time goes on.
The first SVE2-enabled hardware releases will be available at the start of 2022. We are really excited that SVE2 will provide better programmer productivity and enable enhanced ML and DSP capabilities across a wider range of devices and applications. We are very much looking forward to the wider deployment of SVE2.
[CTAToken URL = "https://developer.arm.com/documentation/102340/latest/" target="_blank" text="Learn more about SVE2" class ="green"]
Here is the assembly code used for the performance measurements in this blog post. We used clang version 12.0.0 to compile the code, with -march=armv8-a+sve2+sve2-bitperm as the compilation option. The platform environment is a simulator using a Cortex-A core with a 128-bit vector length.
size        .req x0  // int64_t n
aPtr        .req x1  // cplx_int16_t * a
bPtr        .req x2  // cplx_int16_t * b
outPtr      .req x3  // cplx_int16_t * c
aPtr_1st    .req x4
bPtr_1st    .req x5
outPtr_1st  .req x6
count       .req x7

    PTRUE   p2.h
    LSL     size, size, #1
    DUP     z31.s, #0
    CNTH    count
    WHILELT p4.h, count, size
    B.NFRST .L_tail_vecmul
    ADDVL   aPtr_1st, aPtr, #-1
    ADDVL   bPtr_1st, bPtr, #-1
    ADDVL   outPtr_1st, outPtr, #-1
.L_unrolled_loop_vecmul:
    LD1H    z0.h, p2/z, [aPtr_1st, count, LSL #1]
    LD1H    z2.h, p4/z, [aPtr, count, LSL #1]
    LD1H    z1.h, p2/z, [bPtr_1st, count, LSL #1]
    LD1H    z3.h, p4/z, [bPtr, count, LSL #1]
Code 5. Optimized SVE2 Assembler Code of Vector Multiply with Complex 16-bit Integer Elements
    MOV     w2, w2
    MOV     x4, #0
    WHILELO p1.s, x4, x2
    B.NFRST .L_return
.L_loop:
    LD1B    z1.s, p1/Z, [x1, x4]
    LD1W    z2.s, p1/Z, [x0, z1.s, UXTW #2]
    HISTCNT z0.s, p1/Z, z1.s, z1.s
    ADD     z2.s, p1/M, z2.s, z0.s
    ST1W    z2.s, p1, [x0, z1.s, UXTW #2]
    INCW    x4
    WHILELO p1.s, x4, x2
    B.FIRST .L_loop
.L_return:
    RET
Code 6. Vector Length Agnostic SVE2 Assembler Code of 8-bit Pixels Image Histogram