Another benefit of the new load/store pass is that, when generating code, we no longer have to explicitly generate the load/store pair instructions; we can instead rely on the optimization pipeline to form them correctly.
Explicitly generating these special AArch64 instructions was problematic because passes that perform memory-based optimizations (for example, dead load/store elimination) would not know what the instructions are doing.
This often led to the compiler generating inefficient code for common memmove idioms:
void bar(char * a, char * b)
{
  char buffer[32];
  __builtin_memcpy(buffer, b, 32);
  __builtin_memcpy(a, buffer, 32);
}
Generates in GCC 13:
bar(char*, char*):
        ldp     q0, q1, [x1]
        sub     sp, sp, #32
        stp     q0, q1, [x0]
        stp     q0, q1, [sp]
        add     sp, sp, 32
        ret
And in GCC 14:
bar(char*, char*):
        ldp     q31, q30, [x1]
        sub     sp, sp, #32
        stp     q31, q30, [x0]
        add     sp, sp, 32
        ret
Speaking of memmove, GCC 14 now also inlines small memmoves, similarly to memcpy. For example, the following:
void foo(char * a, char * b)
{
  __builtin_memmove(a, b, 32);
}
No longer generates a library call like GCC 13 and earlier did:
foo(char*, char*):
        mov     x2, 32
        b       memmove
But now inlines the sequence resulting in faster memory moves:
foo(char*, char*):
        ldp     q31, q30, [x1]
        stp     q31, q30, [x0]
        ret
GCC 14 also adds support for the SME and SME2 architecture extensions. These extensions allow users to write routines that perform various matrix operations on large datasets. SME2 is a superset of SME, which in turn builds on SVE. However, unlike with other extensions, not all SVE and Advanced SIMD instructions are available while in SME streaming mode. For details, consult the SME documentation: https://developer.arm.com/documentation/ddi0616/latest/
The following shows an example of using SME to perform an outer product operation:
#include <arm_sme.h>

void f (svuint8_t z0, svint8_t z1, svint8_t z5, svint8_t z6, svint8_t z7,
        svbool_t p0, svbool_t p1) __arm_streaming __arm_inout("za")
{
  svusmopa_za32_u8_m (0, p0, p1, z0, z1);
}
Which generates:
f:
        usmopa  za0.s, p0/m, p1/m, z0.b, z1.b
        ret
More on Scalable Matrix Extension
As part of the implementation of SME/SME2 a new AArch64 pass has been added to deal with strided vector register operands in SME. The pass runs in addition to the standard register allocator.
This pass is enabled by default in order to fix some of the register allocation issues with Advanced SIMD intrinsics that occur when using instructions that require multiple consecutive registers, such as LD2/3/4.
As an example the following Advanced SIMD code:
#include <arm_neon.h>

int16x8x3_t bsl (const uint16x8x3_t *check, const int16x8x3_t *in1,
                 const int16x8x3_t *in2)
{
  int16x8x3_t out;
  for (uint32_t j = 0; j < 3; j++)
    out.val[j] = vbslq_s16 (check->val[j], in1->val[j], in2->val[j]);
  return out;
}
Used to generate:
bsl:
        ldp     q6, q16, [x1]
        ldp     q0, q4, [x2]
        ldp     q5, q7, [x0]
        bsl     v5.16b, v6.16b, v0.16b
        ldr     q0, [x2, 32]
        ldr     q6, [x1, 32]
        mov     v1.16b, v5.16b
        ldr     q5, [x0, 32]
        bsl     v7.16b, v16.16b, v4.16b
        bsl     v5.16b, v6.16b, v0.16b
        mov     v0.16b, v1.16b
        mov     v1.16b, v7.16b
        mov     v2.16b, v5.16b
        ret
But now generates:
bsl:
        ldr     q31, [x0, 32]
        ldr     q30, [x1, 32]
        ldr     q2, [x2, 32]
        ldp     q6, q4, [x0]
        ldp     q5, q3, [x1]
        ldp     q0, q1, [x2]
        bit     v2.16b, v30.16b, v31.16b
        bit     v0.16b, v5.16b, v6.16b
        bit     v1.16b, v3.16b, v4.16b
        ret
The remaining issues are planned to be addressed in GCC 15.
One common issue for vectorization is dealing with function calls in the middle of the code we want to vectorize. Such calls completely block vectorization, and we lose the speedup it typically gives.
As an example, the following sequence fails to vectorize in older GCCs because of the use of the cosf math function:
#include <math.h>

float double_sum (float *x, int n)
{
  float res = 0;
  for (int i = 0; i < (n & -8); i++)
    res += cosf (x[i] * x[i]);
  return res;
}
This is rather unfortunate: in such a trivial example the gains may not be much, but in larger code a single function call can block vectorization of the entire function.
GCC 7 through 14 now support vectorizing such calls with Advanced SIMD, provided a new enough glibc is used.
The example above now generates:
.L3:
        ldp     q0, q31, [x19], 32
        fmul    v23.4s, v31.4s, v31.4s
        fmul    v0.4s, v0.4s, v0.4s
        bl      _ZGVnN4v_cosf
        mov     v31.16b, v0.16b
        mov     v0.16b, v23.16b
        mov     v23.16b, v31.16b
        bl      _ZGVnN4v_cosf
        fadd    v22.4s, v22.4s, v23.4s
        fadd    v21.4s, v21.4s, v0.4s
        cmp     x20, x19
        bne     .L3
As can be seen, these new SIMD function calls have a special name mangling that differs from the normal AArch64 mangling. This mangling comes with a new procedure call standard (PCS) that is specifically designed to be efficient in vectorized loops. One major change is the number of caller- and callee-saved registers involved in such a call. This new PCS is specified in the Arm ABI documentation.
Read more on Arm Vector PCS
This ABI has been implemented in GCC as far back as GCC 7, but the missing part of the story was how to inform the compiler of the availability of the vector math routines. There is no magic involved: we use existing OpenMP annotations to decorate the header files, which, when parsed by the compiler, inform the vectorizer of the availability of the functions.
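As a rough sketch (the exact glibc declarations may differ in detail), a math header can advertise a vector variant of a scalar routine either through GCC's simd attribute or through the OpenMP declare simd pragma:

/* Sketch only: the exact form used by glibc's headers may differ.  */

/* GCC attribute form, honoured without any extra flags.  */
__attribute__ ((__simd__ ("notinbranch"))) float cosf (float);

/* OpenMP form, honoured with -fopenmp or -fopenmp-simd.  */
#pragma omp declare simd notinbranch
float sinf (float);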
These annotations and the associated vector math functions have now been added to glibc, which is why older GCC releases also benefit from work done years ago.
As there is nothing special built into the compiler for this, it also means that users can provide their own vector functions for the compiler to use in their own code or libraries.
For example, the following declares and uses a new vector function my_cos:
#include <arm_neon.h>

__attribute__ ((__simd__ ("notinbranch"), const)) double my_cos (double);

void foo (float *a, double *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      b[i] = my_cos (5.0 * a[i]);
    }
}

__attribute__ ((aarch64_vector_pcs)) float64x2_t
_ZGVnN2v_my_cos (float64x2_t s)
{
  // Do fancy cos calculation
  return s;
}
One thing to note is that by declaring support for a particular function, you must provide implementations for all possible vector sizes for that type; for a single-precision function that means both float32x4_t and float32x2_t.
Doing this is quite simple, and GCC has annotations to do the name mangling properly and, when using C++, to inform the compiler that the declared function uses the vector PCS.
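For a hypothetical single-precision function, providing both variants would look roughly like the sketch below; my_cosf, its placeholder bodies, and the mangled names (which follow the same vector function ABI scheme as in the earlier example) are purely illustrative:

#include <arm_neon.h>

/* Hypothetical single-precision function: declaring it as a SIMD function
   means both the 2-lane (64-bit) and 4-lane (128-bit) Advanced SIMD
   variants must be provided.  */
__attribute__ ((__simd__ ("notinbranch"), const)) float my_cosf (float);

__attribute__ ((aarch64_vector_pcs)) float32x2_t
_ZGVnN2v_my_cosf (float32x2_t s)
{
  // Do fancy cos calculation
  return s;
}

__attribute__ ((aarch64_vector_pcs)) float32x4_t
_ZGVnN4v_my_cosf (float32x4_t s)
{
  // Do fancy cos calculation
  return s;
}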
Measuring these changes on Neoverse V1 shows the following improvements in SPEC CPU 2017 fprate:
Libatomic in GCC 14 is extended with support for LSE128 atomics through ifuncs. This means that binaries using the new libatomic are automatically accelerated with the new sequences, without any change needed by the user, while preserving support for architectures without these extensions.
Among these additions are 128-bit atomic operations such as fetch_and/fetch_or and a 128-bit swap instruction.
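For example, a 128-bit atomic operation can be written with the usual __atomic built-ins on unsigned __int128; when linked against the GCC 14 libatomic (-latomic), these calls can be dispatched via ifuncs to the LSE128 instructions on CPUs that support them (the helper names below are just illustrative):

/* 128-bit atomics go through libatomic (link with -latomic).  With the
   GCC 14 libatomic these operations can be dispatched, via ifuncs, to the
   LSE128 instructions on CPUs that support them.  */
unsigned __int128
atomic_and_128 (unsigned __int128 *p, unsigned __int128 mask)
{
  return __atomic_fetch_and (p, mask, __ATOMIC_SEQ_CST);
}

unsigned __int128
atomic_swap_128 (unsigned __int128 *p, unsigned __int128 val)
{
  return __atomic_exchange_n (p, val, __ATOMIC_SEQ_CST);
}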
GCC 14 adds support for the following new Arm cores (-mcpu/-march/-mtune values between the brackets):
This release also adds support for the new generic tunings generic-armv8-a and generic-armv9-a.
Improvements in various optimization passes in GCC 13 ended up exposing a missing optimization in GCC's if-conversion. Because of this, workloads with many nested branches end up with a significant slowdown when if-converted and vectorized, compared to GCC 12 or earlier.
To illustrate this, the testcase:
void foo (int *f, int d, int e)
{
  for (int i = 0; i < 1024; i++)
    {
      int a = f[i];
      int t;
      if (a < 0)
        t = 1;
      else if (a < e)
        t = 1 - a * d;
      else
        t = 0;
      f[i] = t;
    }
}
Generated in GCC 12:
a_10 = *_3;
_44 = (unsigned int) a_10;
_46 = _44 * _45;
_48 = 1 - _46;
t_13 = (int) _48;
_18 = a_10 < 0;
_7 = a_10 >= 0;
_20 = a_10 < e_11(D);
_21 = _7 & _20;
_25 = a_10 >= e_11(D);
_26 = _7 & _25;
_ifc__42 = a_10 < 0 ? 1 : t_13;
_ifc__43 = _21 ? t_13 : _ifc__42;
t_6 = _26 ? 0 : _ifc__43;
*_3 = t_6;
Here we have 4 different comparisons to navigate through the if-else statements inside the loop. However, this is highly inefficient; we in fact only require two comparisons, since we do not need to test both a condition and its inverse.
When the previous code is vectorized, these 4 comparisons are materialized in the resulting code, which drastically slows down the vector loop:
.L2:
        ldr     q0, [x0]
        mov     v3.16b, v6.16b
        mls     v3.4s, v0.4s, v7.4s
        cmge    v4.4s, v0.4s, #0
        cmlt    v2.4s, v0.4s, #0
        cmgt    v1.4s, v5.4s, v0.4s
        cmge    v0.4s, v0.4s, v5.4s
        bsl     v2.16b, v6.16b, v3.16b
        and     v1.16b, v1.16b, v4.16b
        and     v0.16b, v0.16b, v4...
Because of how late this was discovered in the GCC 13 development cycle we were unable to fix it fully. However, a small improvement was made following the observation that when traversing nested conditions like the above we get several "paths" through the control flow. If we have 3 paths, we only need to test 2 of them: if neither has been taken, the third one must implicitly be true.
This means we can simply not test the last condition which results in the following CFG:
a_10 = *_3;
_7 = a_10 < 0;
_21 = a_10 >= 0;
_22 = a_10 < e_11(D);
_23 = _21 & _22;
_ifc__42 = _23 ? t_13 : 0;
t_6 = _7 ? 1 : _ifc__42;
*_3 = t_6;
And the following codegen:
.L2:
        ldr     q0, [x0]
        mov     v1.16b, v3.16b
        cmge    v2.4s, v0.4s, #0
        cmgt    v4.4s, v5.4s, v0.4s
        mls     v1.4s, v0.4s, v6.4s
        cmlt    v0.4s, v0.4s, #0
        and     v2.16b, v2.16b, v4.16b
        and     v1.16b, v1.16b, v2.16b
        bsl     v0.16b, v3.16b, v1.16b
        str     q0, [x0], 16
        cmp...
This has removed one comparison, but we still check both _7 and its inverse _21. In GCC 14 we finally fix this by tracking truth values. In essence, we keep a set of known "truths" for a particular path, and when we encounter a condition we evaluate it in the context of the known predicates.
Thus, if we have a path gated by condA and we take the branch where condA is true, we insert condA=True into the map. If we go down the opposite branch, we add ~condA=True. This allows us to drop inverse comparisons, and in GCC 14 we generate:
a_10 = *_3;
_7 = a_10 < 0;
_43 = (unsigned int) a_10;
_45 = _43 * _44;
_47 = 1 - _45;
t_13 = (int) _47;
_22 = a_10 < e_11(D);
_ifc__42 = _22 ? t_13 : 0;
t_6 = _7 ? 1 : _ifc__42;
*_3 = t_6;
And we get the optimal number of comparisons:
.L2:
        ldr     q28, [x0]
        mov     v27.16b, v29.16b
        mls     v27.4s, v28.4s, v31.4s
        cmgt    v26.4s, v30.4s, v28.4s
        cmlt    v28.4s, v28.4s, #0
        and     v26.16b, v27.16b, v26.16b
        bsl     v28.16b, v29.16b, v26.16b
        str     q28, [x0], 16
        cmp     x1, x0
        bne     .L2
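The truth tracking described above can be sketched, very loosely, as follows; this is a toy model for illustration and not GCC's internal representation:

#include <stdbool.h>

/* Toy model of the truth tracking described above (not GCC internals).
   A predicate is represented by an integer id, its inverse by the negated
   id.  Walking down a path records what is known to hold; a condition whose
   value is already implied on that path does not need to be tested again. */
#define MAX_TRUTHS 16

struct truths
{
  int known[MAX_TRUTHS];
  int n;
};

/* Record that PRED is known true on the current path.  */
static void
assume (struct truths *t, int pred)
{
  if (t->n < MAX_TRUTHS)
    t->known[t->n++] = pred;
}

/* Return false if PRED or its inverse is already known, i.e. no comparison
   needs to be emitted for it.  */
static bool
must_test (const struct truths *t, int pred)
{
  for (int i = 0; i < t->n; i++)
    if (t->known[i] == pred || t->known[i] == -pred)
      return false;
  return true;
}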
The curious reader might be wondering how we choose the order of the paths, to know which one we do not have to test. More importantly, does it matter which path we drop?
The answer is yes: when dropping a path, it is beneficial not to test the most complicated path, as that saves us the most work. In GCC 13 this path was determined by looking at the number of occurrences of an SSA value after all the paths converge. Put differently, we count the number of occurrences of each value inside the PHI node in the merge block and sort by increasing occurrence, the idea being that if a value can be produced from multiple paths, we must do extra work to disambiguate between them.
This however does not consider dependency chains and does not work when all the values in the PHI node are unique. In GCC 14 we additionally count the number of Boolean operators inside each condition and sort based on a tuple of (occurrences, num_of_operations).
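A rough sketch of that ordering (illustrative only, not GCC's actual implementation) is shown below: paths sort by increasing PHI occurrences and, as a tie-breaker, by the number of Boolean operations, so the last path after sorting is the most expensive one and is the one we avoid testing.

#include <stdlib.h>

/* Illustrative only: order if-conversion paths by (occurrences, num_of_ops)
   so that the most expensive path ends up last and is the one not tested. */
struct path_cost
{
  int occurrences;   /* uses of the path's value in the merge-block PHI  */
  int num_of_ops;    /* Boolean operators in the path's condition        */
};

static int
cmp_path_cost (const void *a, const void *b)
{
  const struct path_cost *pa = a, *pb = b;
  if (pa->occurrences != pb->occurrences)
    return pa->occurrences - pb->occurrences;
  return pa->num_of_ops - pb->num_of_ops;
}

/* Usage: qsort (paths, n_paths, sizeof (struct path_cost), cmp_path_cost); */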
As an example, for deeply nested compares such as:
_79 = vr_15 > 20;
_80 = _68 & _79;
_82 = vr_15 < -20;
...
_87 = vr_15 >= -20;
_88 = _73 & _87;
_ifc__111 = _55 ? 10 : 12;
_ifc__112 = _70 ? 7 : _ifc__111;
_ifc__113 = _85 ? 8 : _ifc__112;
_ifc__114 = _88 ? 9 : _ifc__113;
_ifc__115 = _45 ? 1 : _ifc__114;
_ifc__116...
We now generate one less compare, but crucially the longest dependency chain is also shorter. Prior to GCC 14, we would generate 5 compares as the longest chain:
        cmple   p7.s, p4/z, z29.s, z30.s
        cmpne   p7.s, p7/z, z29.s, #0
        cmple   p6.s, p7/z, z31.s, z30.s
        cmpge   p6.s, p6/z, z27.s, z25.s
        cmplt   p15.s, p6/z, z28.s, z21.s
and from GCC 14 we generate:
        cmple   p7.s, p3/z, z27.s, z30.s
        cmpne   p7.s, p7/z, z27.s, #0
        cmpgt   p7.s, p7/z, z31.s, z30.s
The resulting condition, (x <= y) && (x != 0) && (z > y), cannot be reduced further.
GCC 14 added support for the _BitInt(N) C2x standard type to support arbitrary precision integers. To accommodate this new type we had to design and implement a new AArch64 ABI.
AArch64 BitInt ABI
The following example shows how to create and use a 125-bit integer and perform addition on it:
_BitInt (125) foo (unsigned _BitInt(125) a, unsigned _BitInt(125) *p)
{
  return a + *p;
}
Note that a target may not support every precision. To aid with this, the compiler defines the macro __BITINT_MAXWIDTH__ to denote the maximum _BitInt precision it supports. To be complete, the above example is better written as:
#if __BITINT_MAXWIDTH__ >= 125
_BitInt (125) foo (unsigned _BitInt(125) a, unsigned _BitInt(125) *p)
{
  return a + *p;
}
#endif
For the given example, GCC 14 generates:
foo:
        ldp     x3, x2, [x2]
        and     x4, x1, 2305843009213693951
        adds    x3, x3, x0
        and     x1, x2, 2305843009213693951
        adc     x1, x1, x4
        and     x0, x3, 2305843009213693951
        extr    x1, x1, x3, 61
        orr     x0, x0, x1, lsl 61
        asr     x1, x1, 3
        ret
Zero extension is a common operation, which on AArch64 is typically done for vectors using the uxtl instruction. On most Arm microarchitectures this instruction is throughput limited; put differently, we cannot execute as many of them in parallel as we would like.
However, conceptually zero extension simply interleaves zeros with the data lanes. That is, zero extending the low half of a vector of 4 shorts to 2 ints requires interleaving a zero with each of the bottom two elements, which is essentially what the zip permutes do on AArch64. As we have covered before, most modern Arm microarchitectures give us a way to create a vector of zeros for free. Since permutes can use the whole vector unit, replacing the throughput-limited instructions with zips gives us much better performance overall.
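Written by hand with intrinsics, the trick looks roughly like this (little-endian assumed; widen_u16_to_u32 is just an illustrative helper, not something the compiler emits literally):

#include <arm_neon.h>

/* Illustrative helper: interleaving with a vector of zeros (zip1/zip2)
   produces the same result as the uxtl/uxtl2 widening instructions on a
   little-endian target.  */
void
widen_u16_to_u32 (uint32x4_t *lo, uint32x4_t *hi, uint16x8_t v)
{
  uint16x8_t zeros = vdupq_n_u16 (0);  /* free on most modern Arm cores  */
  *lo = vreinterpretq_u32_u16 (vzip1q_u16 (v, zeros));
  *hi = vreinterpretq_u32_u16 (vzip2q_u16 (v, zeros));
}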
In versions before GCC 14 this loop:
void f (int *b, unsigned short *a, int n)
{
  for (int i = 0; i < (n & -4); i++)
    b[i] = a[i];
}
Generates:
.L5:
        ldr     q0, [x4], 16
        uxtl    v1.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        stp     q1, q0, [x3]
        add     x3, x3, 32
        cmp     x5, x4
        bne     .L5
while in GCC 14 we now generate:
        movi    v31.4s, 0
.L5:
        ldr     q30, [x4], 16
        zip1    v29.8h, v30.8h, v31.8h
        zip2    v30.8h, v30.8h, v31.8h
        stp     q29, q30, [x3], 32
        cmp     x4, x5
        bne     .L5
With the improved autovectorization in GCC 14 we also wanted to provide the user with a way to explicitly tell the vectorizer not to vectorize a loop. In previous incarnations of the compiler users typically did this by adding an assembly statement inside the loop:
void f (int *b, unsigned short *a, int n)
{
  for (int i = 0; i < (n & -4); i++)
    {
      asm ("");
      b[i] = a[i];
    }
}
This aborts vectorization because the vectorizer cannot see what is happening inside the assembly block and so cannot know whether vectorization is safe. Unfortunately, this approach has two big downsides: the asm statement can have other effects on code generation, and it can require you to add explicit braces to your loop if they were not already there.
In GCC 14, we provide a new pragma to explicitly state to the vectorizer not to vectorize a loop:
void f (int *b, unsigned short *a, int n)
{
#pragma GCC novector
  for (int i = 0; i < (n & -4); i++)
    b[i] = a[i];
}
The pragma needs to be placed immediately before the loop.
In Part 3, we talk about the following topics:
Read GCC 14 Part 3
If you missed part 1, read here.