.NET 9 introduces support for the Arm Scalable Vector Extension (SVE), Arm's latest SIMD architecture. SVE allows developers to write simpler, more efficient vectorized code. This document explores some practical examples, comparing the new API with the existing Neon API (referred to as AdvSimd in C#) and demonstrating its advantages. There are also comparisons to the same routines written in C++ using the Arm C Language Extensions (ACLE). By the end of this blog, you should have an appreciation of the main conceptual differences between Neon and SVE.
The C# SVE API is the result of a yearlong collaboration between engineers at Microsoft and Arm. If you want to dig into the design choices and some details within the compiler itself, then I highly recommend you check out Kunal's blog Engineering the Scalable Vector Extension in .NET.
Launching alongside this release is the general availability of Cobalt 100 in the Azure cloud. These are Microsoft's Arm-based servers, built on CSS N2, which include SVE running at a 128-bit vector length. More details are available here.
All the routines in this blog have been written with scalar, Neon, and SVE variants for both C# and C++. For reasons of space, not all of them are included here. Instead, the full set can be found on GitLab.
For all routines covered, performance could be further improved by unrolling the loop, rearranging instruction order, and optimizing for specific microarchitectures by considering instruction latencies, cycle counts, and the number of vector pipelines. I have avoided optimizing at this depth and instead have concentrated on general concepts. For this reason, I've also avoided making performance comparisons. After reading the examples, it should become apparent that making such optimizations will be inherently easier on SVE.
Let's start with a simple example. Take two fixed-length arrays of shorts. Multiply each pair of corresponding entries, then sum the products and return the total.
```csharp
public static unsafe short MultiplyAddScalar(ref short* a, ref short* b, int length)
{
    short res = 0;
    for (int i = 0; i < length; i++)
    {
        res += (short)(a[i] * b[i]);
    }
    return res;
}
```
Vectorizing this code is simple. In a loop, load the next vector's worth of data from each array, then use a multiply-add instruction to accumulate the element-wise products into a running total. Once the loop is complete, add across the vector and return the result.
The length of the array may mean the final iteration of the loop does not fill an entire vector. Instead, a scalar tail must be written to handle the remaining elements.
In C#, using the AdvSimd API, this looks like:
```csharp
public static unsafe short MultiplyAddNeon(ref short* a, ref short* b, int length)
{
    Vector128<short> resVec = Vector128<short>.Zero;
    int i = 0;
    int incr = sizeof(Vector128<short>) / sizeof(short);

    // Increment by a vector's worth of data each loop.
    // Stop when there is less than a full vector's worth of data remaining.
    for (; i < length - incr; i += incr)
    {
        // Load the data.
        Vector128<short> aVec = AdvSimd.LoadVector128(a + i);
        Vector128<short> bVec = AdvSimd.LoadVector128(b + i);

        // For each element in the vectors, multiply the two inputs then add to the result.
        resVec = AdvSimd.MultiplyAdd(resVec, aVec, bVec);
    }

    // Sum all the results.
    short res = AdvSimd.Arm64.AddAcross(resVec).ToScalar();

    // Process any remaining elements.
    for (; i < length; i++)
    {
        res += (short)(a[i] * b[i]);
    }
    return res;
}
```
SVE removes the requirement for a tail loop. In SVE, the vector length is unknown at compile time; all the programmer knows is that it is a power of two between 128 bits and 2048 bits. Instead, predication is used to mask the lanes of a vector, which allows a partial vector to be processed at the end of the loop. Most operations take a predicate argument that dictates which elements are active; for inactive elements, the operation is not performed. Additional instructions are provided that work directly on predicates, some of which are designed for constructing loops.
More details on this can be found in the Arm resource Introduction to SVE.
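To make the masking concrete, here is a minimal sketch (assuming the 128-bit vector length available on current hardware, so eight 16-bit lanes per vector) of how the loop masks used in the examples below behave:

```csharp
// A sketch assuming a 128-bit vector length (eight 16-bit lanes).
// CreateWhileLessThanMask16Bit(i, length) activates lane k when i + k < length.
Vector<ushort> full = Sve.CreateWhileLessThanMask16Bit(0, 10); // all 8 lanes active
Vector<ushort> tail = Sve.CreateWhileLessThanMask16Bit(8, 10); // only lanes 0 and 1 active
```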
Note that currently only 128-bit vector lengths are supported in C#. However, this is a restriction inside CoreCLR, not the API. The API is vector length agnostic, and code written today will automatically work on wider vector length hardware once the restriction has been lifted in a future release.
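As a consequence, code should query the element count at run time rather than hard-coding it. A minimal sketch, using methods that appear in the examples below:

```csharp
if (Sve.IsSupported)
{
    // CNTH: the number of 16-bit elements per vector. Today this returns 8
    // (128-bit vectors), but the same binary would report a larger count on
    // wider hardware once the CoreCLR restriction is lifted.
    ulong shortsPerVector = Sve.Count16BitElements();
    Console.WriteLine($"16-bit elements per vector: {shortsPerVector}");
}
```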
Using the new C# SVE API, the loop now looks like:
```csharp
public static unsafe short MultiplyAddSve(ref short* a, ref short* b, int length)
{
    Vector<short> res = Vector<short>.Zero;
    Vector<short> pLoop = (Vector<short>)Sve.CreateWhileLessThanMask16Bit(0, length);

    // Increment by a vector's worth of data each loop.
    // Stop when pLoop contains no true elements.
    for (int i = 0; Sve.TestFirstTrue(Sve.CreateTrueMaskInt16(), pLoop); )
    {
        // Load the data, but only the elements marked in pLoop.
        Vector<short> aVec = Sve.LoadVector(pLoop, a + i);
        Vector<short> bVec = Sve.LoadVector(pLoop, b + i);

        // Multiply the two vectors, accumulating into the result.
        res = Sve.ConditionalSelect(pLoop, Sve.MultiplyAdd(res, aVec, bVec), res);

        // Advance, then create the predicate for the next iteration.
        // The predicate will be true for every iteration except the final
        // iteration, which may be partial.
        i += (int)Sve.Count16BitElements();
        pLoop = (Vector<short>)Sve.CreateWhileLessThanMask16Bit(i, length);
    }

    // Sum all the results.
    return (short)Sve.AddAcross(res).ToScalar();
}
```
Here we create a predicate (or, to use C# terminology, a mask), pLoop, which is set based on the number of elements left to process in the loop. This will be all true except potentially on the final iteration. The predicate is then used by each operation inside the loop.
For those familiar with implementations of SVE in other languages, there are a few choices in the C# API which may be unexpected. It's worth taking a bit of time to go through these and why they matter.
Written in C++ ACLE for SVE, the loop looks like:
```cpp
int16_t multiplyAddSve(const int16_t *a, const int16_t *b, int length)
{
    svint16_t resVec = svdup_s16(0);
    svbool_t pLoop = svwhilelt_b16(0, length);

    // Increment by a vector's worth of data each loop.
    // Stop when pLoop contains no true elements.
    for (int i = 0; svptest_first(svptrue_b16(), pLoop); )
    {
        // Load the data, but only the elements marked in pLoop.
        svint16_t aVec = svld1(pLoop, a + i);
        svint16_t bVec = svld1(pLoop, b + i);

        // Multiply the two vectors, accumulating into the result.
        resVec = svmla_m(pLoop, resVec, aVec, bVec);

        // Advance, then create the predicate for the next iteration.
        // The predicate will be true for every iteration except the final
        // iteration, which may be partial.
        i += svcnth();
        pLoop = svwhilelt_b16(i, length);
    }

    // Sum all the results.
    return svaddv(svptrue_b16(), resVec);
}
```
The first thing you will notice is the difference in function naming. C++ ACLE sticks closely to the SVE instruction names, whereas C# goes for readability. If you are not overly familiar with the SVE instructions, the C# code should be easier to parse and figure out what's going on. Both APIs follow the design that each method should resolve to a single SVE instruction, and there is a method for (almost) every instruction. For easy mapping, the summary blocks in the C# API include both the SVE instruction and the equivalent C++ ACLE method. For example:
```csharp
/// <summary>
/// <para>svbool_t svwhilelt_b16[_s32](int32_t op1, int32_t op2)</para>
/// <para>  WHILELT Presult.H, Wop1, Wop2</para>
/// </summary>
public static Vector<ushort> CreateWhileLessThanMask16Bit(int left, int right);
```
In C++ there is a special type for predicates, svbool_t. However, this type does not indicate the element type being masked. For example, svptrue_b16() creates a mask suitable for masking a vector of 16-bit values, yet it would be possible to pass that predicate to an Abs() call where all the other vectors contain 32-bit values, even though the underlying types do not match:
```cpp
svint32_t svabs_s32_x(svbool_t pg, svint32_t op);
```
In C#, a standard Vector<T> is used for masks instead. The equivalent in C# would be:
```csharp
Vector<int> Abs(Vector<int> mask, Vector<int> value);
```
The mask must have the correct element type to be used. However, it is now possible to pass a predicate in as one of the parameters. This works because the mask is interpreted as a vector of 1s and 0s. Under the hood, CoreCLR makes sure the variable is in the correct register type and converts where necessary.
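As a small illustration of this (a sketch; nothing here is required by the API), an all-true mask read as data is simply a vector of 1s, so it can be consumed by ordinary arithmetic:

```csharp
// CreateTrueMaskInt32 produces an all-true predicate. Interpreted as data,
// each active lane holds 1, so adding it to a zero vector yields all 1s.
Vector<int> mask = Sve.CreateTrueMaskInt32();
Vector<int> ones = Sve.Add(Vector<int>.Zero, mask);
```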
There is a further difference in C#. The API method for Abs shown above omits the mask parameter, becoming:
```csharp
Vector<int> Abs(Vector<int> value);
```
Instead, to use a masked operation, you wrap it in a ConditionalSelect(), as shown in the example above for the MultiplyAdd() call. This says: for each true element in the predicate, use the element from the first vector; otherwise, use the element from the second. Where possible, CoreCLR optimizes this down to a single predicated instruction. Where no mask is used, CoreCLR uses an unmasked version of the instruction if it exists; otherwise, it uses a mask containing all true elements. The mask is omitted from the API wherever it can instead be supplied via ConditionalSelect(). This keeps the C# API simple and abstracted away from masking where possible.
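For instance, a masked absolute value would be written as follows (a sketch; mask and value are placeholder variables):

```csharp
// For each lane where mask is true, take Sve.Abs(value); elsewhere, keep the
// original element. CoreCLR can fold this pattern into a single predicated ABS.
Vector<int> result = Sve.ConditionalSelect(mask, Sve.Abs(value), value);
```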
Let's consider a similar simple loop where we add all the elements in two arrays, storing the results to a third array and returning the final written element as a simple checksum. The crucial issue with this loop is that the two input arrays have different element sizes.
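For reference, a minimal scalar version might look like this (an assumed shape; the full set of variants is in the repository):

```csharp
public static unsafe int AddExtendScalar(ref int* res, ref int* a, ref sbyte* b, int length)
{
    for (int i = 0; i < length; i++)
    {
        // Each sbyte is implicitly widened to int before the addition.
        res[i] = a[i] + b[i];
    }
    return res[length - 1];
}
```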
The C# Neon code for this is:
```csharp
public static unsafe int AddExtendNeon(ref int* res, ref int* a, ref sbyte* b, int length)
{
    int i = 0;
    int incr = sizeof(Vector128<sbyte>) / sizeof(sbyte);

    // Process a Vector128<sbyte>'s worth of data at a time.
    for (i = 0; i < length - incr; i += incr)
    {
        // Load the next four instances of a.
        Vector128<int> aVec1 = AdvSimd.LoadVector128(a + i);
        Vector128<int> aVec2 = AdvSimd.LoadVector128(a + i + 4);
        Vector128<int> aVec3 = AdvSimd.LoadVector128(a + i + 8);
        Vector128<int> aVec4 = AdvSimd.LoadVector128(a + i + 12);

        // Load one instance of b.
        Vector128<sbyte> bVec = AdvSimd.LoadVector128(b + i);

        // Extend b to 16 bits.
        Vector128<short> b16Vec1 = AdvSimd.ShiftLeftLogicalWideningLower(bVec.GetLower(), 0);
        Vector128<short> b16Vec2 = AdvSimd.ShiftLeftLogicalWideningUpper(bVec, 0);

        // Add b into a, extending to 32 bits first.
        aVec1 = AdvSimd.AddWideningLower(aVec1, b16Vec1.GetLower());
        aVec2 = AdvSimd.AddWideningUpper(aVec2, b16Vec1);
        aVec3 = AdvSimd.AddWideningLower(aVec3, b16Vec2.GetLower());
        aVec4 = AdvSimd.AddWideningUpper(aVec4, b16Vec2);

        // Save out.
        AdvSimd.Store(res + i, aVec1);
        AdvSimd.Store(res + i + 4, aVec2);
        AdvSimd.Store(res + i + 8, aVec3);
        AdvSimd.Store(res + i + 12, aVec4);
    }

    // Process any remaining elements.
    for (; i < length; i++)
    {
        res[i] = a[i] + b[i];
    }
    return res[length - 1];
}
```
Each iteration, 128 bits of array b are loaded from memory. These elements are extended to 16 bits, then again to 32 bits. Thankfully, the second extension can be merged with the addition into a single widening add. However, to have enough elements from a, 512 bits must be loaded. As before, a loop tail processes any remaining elements.
The SVE version is simplified with the use of widening loads:
```csharp
public static unsafe int AddExtendSve(ref int* res, ref int* a, ref sbyte* b, int length)
{
    Vector<int> pLoop = (Vector<int>)Sve.CreateWhileLessThanMask32Bit(0, length);

    // Increment by a vector's worth of data each loop.
    // Stop when pLoop contains no true elements.
    for (int i = 0; Sve.TestFirstTrue(Sve.CreateTrueMaskInt32(), pLoop); )
    {
        // Load the data from a, but only the elements marked in pLoop.
        Vector<int> aVec = Sve.LoadVector(pLoop, a + i);

        // Load the data from b, extending each element to 32 bits.
        Vector<int> bVec = Sve.LoadVectorSByteSignExtendToInt32(pLoop, b + i);

        // Add the two arrays.
        aVec = Sve.ConditionalSelect(pLoop, Sve.Add(aVec, bVec), aVec);

        // Store back to memory.
        Sve.StoreAndZip(pLoop, res + i, aVec);

        // Advance, then create the predicate for the next iteration.
        // The predicate will be true for every iteration except the final
        // iteration, which may be partial.
        i += (int)Sve.Count32BitElements();
        pLoop = (Vector<int>)Sve.CreateWhileLessThanMask32Bit(i, length);
    }

    return res[length - 1];
}
```
When the data from b is loaded, only enough elements to fill a vector after widening to 32 bits are loaded (at a 128-bit vector length, that is four 8-bit elements).
This code is simpler to write than the Neon version, saving time and reducing bugs, as getting the offsets and widening correct is error prone. This is especially true for more complex loops, and when unrolling the loop to process more data per iteration.
Next is the standard library routine strlen: read bytes from an array until you hit a zero, then return the number of bytes read. The problem with this routine is that any address beyond the zero could be uninitialized memory, so the routine needs to make sure all vector loads are aligned to prevent faults, as an aligned 128-bit load will not cross a page boundary.
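For reference, a scalar version is trivial (an assumed shape; the full variants are in the repository):

```csharp
public static unsafe ulong StrlenScalar(ref byte* src)
{
    ulong i = 0;
    // Walk the bytes until the terminating zero is found.
    while (src[i] != 0)
    {
        i++;
    }
    return i;
}
```

The Neon version vectorizes this while keeping every load aligned: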
```csharp
public static unsafe ulong StrlenNeon(ref byte* src)
{
    Vector128<byte> data = Vector128<byte>.Zero;
    ulong cmp = 0;
    ulong i = 0;
    ulong alignOffset = 0;

    // Check the first 128 bits for a zero.
    for (i = 0; i < 16; i++)
    {
        if (src[i] == 0)
        {
            return i;
        }
    }

    // Align src to 128 bits within the string.
    // We've already checked this area for zeros, so it's safe to skip.
    if (((ulong)src & 0xf) != 0)
    {
        alignOffset = (((ulong)src + 0xf) & 0xfffffffffffffff0) - (ulong)src;
        src = (byte*)(((ulong)src + 0xf) & 0xfffffffffffffff0);
        // Restart from the newly aligned pointer; the bytes before it were
        // already checked by the scalar loop above.
        i = 0;
    }

    // Look for a vector of data containing a zero.
    while (cmp == 0)
    {
        data = AdvSimd.LoadVector128(src + i);
        Vector128<byte> min = AdvSimd.Arm64.MinPairwise(data, data);
        Vector64<byte> cmpVec = AdvSimd.CompareEqual(min.GetLower(), Vector64<byte>.Zero);
        cmp = cmpVec.AsUInt64().ToScalar();
        i = i + (ulong)(sizeof(Vector128<byte>) / sizeof(byte));
    }
    i = i - (ulong)(sizeof(Vector128<byte>) / sizeof(byte));

    // Look for the location of the zero.
    Vector128<byte> cmpVecLoc = AdvSimd.CompareEqual(data, Vector128<byte>.Zero);
    Vector64<byte> shifted = AdvSimd.ShiftRightLogicalNarrowingLower(cmpVecLoc.AsUInt16(), 4);
    ulong syncd = shifted.AsUInt64().ToScalar();
    int count = BitOperations.TrailingZeroCount(syncd);
    return i + (ulong)(count / 4) + alignOffset;
}
```
With SVE, the routine can be simplified by using first-faulting loads. If the first element in the vector would fault, then a fault is raised as normal. If any other element would fault, then instead of raising a fault, the corresponding bit in the FFR predicate register is set. When this happens, the routine must look for the zero in the successfully loaded data and, if nothing is found, iterate again so that the faulting address becomes the first element, which raises a real fault. There is now no requirement to align the data before the vectorized loop, which removes the initial scalar loop and the alignment code.
```csharp
public static unsafe ulong StrlenSve(ref byte* src)
{
    Vector<byte> ptrue = Sve.CreateTrueMaskByte();
    Vector<byte> pffr, pz;
    Vector<byte> data;
    ulong i = 0;
    ulong elemsInVector = Sve.Count8BitElements();

    // Reset the FFR register.
    Sve.SetFfr(ptrue);

    while (true)
    {
        // Read a vector's worth of bytes, stopping on first fault.
        data = Sve.LoadVectorFirstFaulting(ptrue, src + i);

        // Read and check the state of the FFR register.
        pffr = Sve.GetFfrByte();
        pz = Sve.ConditionalSelect(pffr, Sve.CompareEqual(data, Vector<byte>.Zero), Vector<byte>.Zero);

        if (Sve.TestAnyTrue(pz, pz))
        {
            // The compare found a zero in the data; end the loop.
            break;
        }
        else if (Sve.TestLastTrue(ptrue, pffr))
        {
            // There weren't any suppressed faults. Increment by the vector size.
            i += elemsInVector;
        }
        else
        {
            // There was a suppressed fault. Increment the counter up to the point the fault
            // would have happened and reset the FFR. On the next iteration, the ldff1 will
            // try to load from the faulting address and raise a fault. This ensures the user
            // sees the same behaviour as the scalar version.
            Sve.SetFfr(ptrue);
            i += Sve.GetActiveElementCount(pffr, pffr);
        }
    }

    pz = Sve.CreateBreakBeforeMask(ptrue, pz);
    i += Sve.GetActiveElementCount(pz, pz);
    return i;
}
```
The combination of predication and first faulting is a powerful tool when vectorizing code.
This final loop is one used by the sorting algorithm Quicksort. It takes an array of integers and two empty output arrays. The first entry of the input array acts as a pivot: every entry less than it is placed in the first output array, and all other entries (including the first) are placed in the second output array.
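A scalar version of this partition step might look like the following sketch (an assumed shape matching the description above):

```csharp
public static unsafe ulong PartitionScalar(ref uint* input, ref uint* left, ref uint* right, int length)
{
    uint pivot = input[0];
    ulong indexLeft = 0;
    ulong indexRight = 0;
    for (int i = 0; i < length; i++)
    {
        if (input[i] < pivot)
        {
            left[indexLeft++] = input[i];
        }
        else
        {
            // The pivot itself lands here, via the "all other entries" case.
            right[indexRight++] = input[i];
        }
    }
    return indexRight;
}
```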
Traditionally, this loop was not vectorized. A few years back, a version of this algorithm was created using vector table lookups, relying on a large table of binary data for the lookups. That code is not easy to read, debug, or optimize; implementing it is left as an exercise for the reader.
A much simpler version exists for SVE. A compare instruction finds the elements that belong in the first array, and its result feeds the compact instruction, which pushes those elements together into the lower lanes of the vector so they can simply be stored out. This is then repeated with the reversed compare to find the elements for the second array.
```csharp
public static unsafe ulong PartitionSve(ref uint* input, ref uint* left, ref uint* right, int length)
{
    long i = 0;

    // Position within the output arrays.
    ulong indexLeft = 0;
    ulong indexRight = 0;

    // Create a vector filled with the first element from input.
    Vector<uint> firstElemVec = Sve.DuplicateSelectedScalarToVector(
        Sve.LoadVector(Sve.CreateTrueMaskUInt32(), input), 0);

    // Create a predicate for the loop.
    Vector<uint> pLoop = Sve.CreateWhileLessThanMask32Bit(i, length);

    while (Sve.TestAnyTrue(Sve.CreateTrueMaskUInt32(), pLoop))
    {
        // Load from the input array based on the loop predicate.
        Vector<uint> data = Sve.LoadVector(pLoop, input + i);

        // Find all elements in the input array less than the first element.
        Vector<uint> pInner = Sve.ConditionalSelect(
            pLoop, Sve.CompareLessThan(data, firstElemVec), Vector<uint>.Zero);

        // Squash all found elements to the lower lanes of the vector.
        Vector<uint> compacted = Sve.Compact(pInner, data);

        // Store the squashed elements to the first output array. (This uses the loop
        // predicate, so some additional zeros may be stored.)
        Sve.StoreAndZip(pLoop, left + indexLeft, compacted);

        // Increment the position in the first output array by the number of elements found.
        indexLeft += Sve.GetActiveElementCount(Sve.CreateTrueMaskUInt32(), pInner);

        // Do the same for the elements greater than or equal, storing into the second output array.
        pInner = Sve.ConditionalSelect(
            pLoop, Sve.CompareGreaterThanOrEqual(data, firstElemVec), Vector<uint>.Zero);
        compacted = Sve.Compact(pInner, data);
        Sve.StoreAndZip(pLoop, right + indexRight, compacted);
        indexRight += Sve.GetActiveElementCount(Sve.CreateTrueMaskUInt32(), pInner);

        // Handle the loop.
        i = Sve.SaturatingIncrementBy32BitElementCount(i, 1);
        pLoop = Sve.CreateWhileLessThanMask32Bit(i, length);
    }

    return indexRight;
}
```
Interestingly, Quicksort is used by CoreCLR in the garbage collector to sort the addresses of objects on the heap for easy traversal. Today CoreCLR uses introsort::sort, but on AVX machines it uses Vxsort, which vectorizes the routine. There are plans underway to extend the sorting routine to also include an SVE version.
That is everything I wanted to cover in this blog post. I hope these examples have given you an appreciation of how the introduction of SVE support in .NET empowers developers to write simpler vectorized code for Arm-based platforms that will scale with hardware capabilities. SVE is a valuable tool for .NET developers seeking to optimize their applications.
In the future, we hope to add support for SVE2 and other SVE extensions. This will open up further vectorization opportunities.