In GCC 15, Arm, the GCC community, and our partners have continued innovating and improving code generation for Arm-based platforms. GCC 15 continues the trend of supporting control-flow vectorization and mixing SVE and Adv. SIMD instructions, along with general improvements to fundamentals such as addressing modes and constant creation. This release also contains many improvements for our Neoverse cores, along with an estimated 3-5% improvement in SPECCPU 2017 Intrate on Neoverse CPUs.
As mentioned in the GCC 14 blog post, GCC has successfully transitioned to having only one loop vectorizer instead of two. With GCC 15, only the loop-aware SLP vectorizer is enabled. Having one vectorizer makes it easier to extend the vectorizer's capabilities as there's only one to maintain, so features can be delivered faster and with more flexibility. Much work was done on enhancing the SLP loop vectorizer with functionality that only the non-SLP loop vectorizer had. One such feature was the early break support added in GCC 14.
GCC 14 was the enabling release for vectorization of loops with early exits. An early exit is any construct that allows you to leave the loop, whether it's a goto, return, break, abort, etc. As an example, in GCC 14
int y[100], x[100]; int foo (int n) { int res = 0; for (int i = 0; i < n; i++) { y[i] = x[i] * 2; res += x[i] + y[i]; if (x[i] > 5) break; } return res; }
generated
.L4: st1w z27.s, p7, [x2, x1, lsl 2] add x1, x1, x3 whilelo p7.s, w1, w0 b.none .L14 .L5: ld1w z31.s, p7/z, [x4, x1, lsl 2] cmpgt p14.s, p7/z, z31.s, #5 lsl z27.s, z31.s, #1 add z31.s, z31.s, z27.s mov z29.d, z30.d ptest p15, p14.b add z30.s, p7/m, z30.s, z31.s mov z31.d, z28.d incw z28.s b.none .L4 umov w1, v31.s[0]
Where there were several inefficiencies in the code generation that made the loop more expensive than it needed to be.
With GCC 15 many of these have been cleaned up and we generate a much cleaner loop:
.L4: st1w z28.s, p7, [x2, x1, lsl 2] add z30.s, z30.s, z28.s add x1, x1, x3 add z29.s, p7/m, z29.s, z30.s incw z31.s whilelo p7.s, w1, w0 b.none .L15 .L5: ld1w z30.s, p7/z, [x4, x1, lsl 2] cmpgt p14.s, p7/z, z30.s, #5 add z28.s, z30.s, z30.s ptest p15, p14.b b.none .L4 umov w1, v31.s[0]
This fixes liveness issues with the loop invariants which necessitated copying values in GCC 14. There is still some room for improvement, such as eliminating the ptest, which is planned for GCC 16. Adding early break support to SLP also means that several new features coming in GCC 16 are easier to implement. For now though, let's continue with what is new in GCC 15.
One limitation of the GCC 14 implementation of early break vectorization was that it required the size of the buffer being read to be statically known, or it required additional context to know that the vector loop could never read beyond what would be safe to read (in other words, that the vector loop doesn't cross a page boundary when the scalar loop wouldn't have). As an example,
#define END 505 int foo (int *x) { for (unsigned int i = 0; i < END; ++i) { if (x[i] > 0) return 0; } return -1; }
Would not vectorize in GCC 14 because the runtime alignment of x is unknown, and thus it's not possible to tell whether it's safe to vectorize. In GCC 15 this limitation is lifted for Adv. SIMD and fixed vector size SVE. Generic vector-length-agnostic SVE requires first-faulting loads, which will be supported in GCC 16.
In GCC 15 the example generates
.L5: add v30.4s, v30.4s, v27.4s add v29.4s, v29.4s, v28.4s cmp x2, x1 beq .L24 .L7: ldr q31, [x1] add x1, x1, 16 cmgt v31.4s, v31.4s, #0 umaxp v31.4s, v31.4s, v31.4s fmov x5, d31 cbz x5, .L5
Peeling for alignment works by using a scalar loop to handle enough elements such that the pointer becomes aligned to the required vector alignment and only then starts the vector code. This is, however, only possible when there’s only a single pointer with a misalignment. When there are multiple pointers which have an unknown alignment GCC will instead version the loop and insert a mutual alignment check before allowing access to the vector loop:
#define END 505 int foo (int *x, int *y) { for (unsigned int i = 0; i < END; ++i) { if (x[i] > 0 || y[i] > 0) return 0; } return -1; }
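To make the versioning requirement concrete, the check only needs to establish that the two pointers share the same misalignment, because then peeling a few iterations aligns both at once. The following is a conceptual sketch of that condition (assuming a 16-byte vector width), not the literal code GCC emits:

#include <stdint.h>

/* Conceptual sketch only: if x and y have the same offset within a
   16-byte vector, a single peeling prologue can align both pointers,
   so the vector loop can be entered safely.  */
static inline int
mutually_aligned (const int *x, const int *y)
{
  return (((uintptr_t) x ^ (uintptr_t) y) & 15) == 0;
}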
When using fixed size SVE we can also vectorize such loops. Instead of a scalar peeling loop, GCC creates an initial mask that performs the peeling, which generates:
.L2: add x1, x1, 4 mov w3, 0 add z31.s, z31.s, #4 sub z30.s, z30.s, #4 whilelo p7.s, w1, w2 b.none .L11 .L5: ld1w z29.s, p7/z, [x4, x1, lsl 2] cmpgt p7.s, p7/z, z29.s, #0 ptest p15, p7.b b.none .L2
This enables GCC to vectorize many loops in HPC workloads such as GROMACS, showing the impact of control flow support in GCC.
Programmers sometimes insert prefetch hints in their code (using __builtin_prefetch) to inform the CPU that a block of memory is about to be read. On many Arm cores, the instructions used for the hints are mostly ignored as the CPU's own predictors can already do much better. Arm recommends that you do not insert prefetch hints unless you have benchmarked your code on the machine you intend to deploy it on.
Furthermore, the presence of these hints means that the vectorizer cannot vectorize these scalar loops because the hints touch memory, i.e. they have a side effect.
With GCC 14 the following
void foo (double *restrict a, double *restrict b, int n) { int i; for (i = 0; i < n; ++i) { a[i] = a[i] + b[i]; __builtin_prefetch (&(b[i + 8])); } }
Would not vectorize and instead stay scalar
.L3: ldr d30, [x1, -64] ldr d31, [x0] prfm PLDL1KEEP, [x1] add x1, x1, 8 fadd d31, d31, d30 str d31, [x0], 8 cmp x2, x0 bne .L3
With GCC 15 we now drop these prefetch calls during vectorization as the prefetches are not very useful for vector code. GCC 15 generates:
.L3: ld1d z31.d, p7/z, [x0, x3, lsl 3] ld1d z30.d, p7/z, [x1, x3, lsl 3] fadd z31.d, p7/m, z31.d, z30.d st1d z31.d, p7, [x0, x3, lsl 3] add x3, x3, 2 whilelo p7.d, w3, w2 b.any .L3
GCC 15 adds support for autovectorizing using SVE2.1’s two-way dot product. The example
#include <stdint.h> uint32_t udot2(int n, uint16_t* data) { uint32_t sum = 0; for (int i=0; i<n; i+=1) sum += data[i] * data[i]; return sum; }
In GCC 14 would vectorize using an MLA
.L3: ld1h z29.s, p7/z, [x1, x2, lsl 1] add x2, x2, x3 mla z30.s, p7/m, z29.s, z29.s whilelo p7.s, w2, w0 b.any .L3 uaddv d31, p6, z30.s
but in GCC 15 will use the two-way udot:
.L3: ld1h z28.h, p7/z, [x1, x2, lsl 1] add x2, x2, x3 sel z27.h, p7, z28.h, z29.h whilelo p7.h, w2, w0 udot z30.s, z27.h, z28.h b.any .L3 ptrue p7.b, all uaddv d31, p7, z30.s
There is still scope for improvement in this loop as the sel is not needed. Today GCC does not have a clear definition of the value of the inactive elements after a load. In this case it isn't known that these values are zero, and the sel explicitly tries to zero them. Improving this is on the backlog.
GCC also supports autovectorization of functions such as this for SME. As an example, adding the streaming-mode attribute to the function
#include <stdint.h> uint32_t udot2 (int n, uint16_t *data) __arm_streaming { uint32_t sum = 0; for (int i = 0; i < n; i += 1) sum += data[i] * data[i]; return sum; }
Enables GCC to autovectorize as
.L4: incb x3, all, mul #2 ld1h z26.h, p7/z, [x2] ld1h z25.h, p7/z, [x2, #1, mul vl] ld1h z2.h, p7/z, [x2, #2, mul vl] ld1h z1.h, p7/z, [x2, #3, mul vl] udot z0.s, z26.h, z26.h incb x2, all, mul #4 udot z28.s, z25.h, z25.h udot z29.s, z2.h, z2.h udot z30.s, z1.h, z1.h cmp w4, w3 bcs .L4
Enabling seamless usage of SME functionality through the autovectorizer.
Libmvec support for SVE
In GCC 14 support was added for auto-vectorizing math routines using glibc’s libmvec implementation.
The following
#include <stdint.h> #include <math.h> void test_fn3 (float *a, float *b, int n) { for (int i = 0; i < n; ++i) a[i] = sinf (b[i]); }
in GCC 15 at -Ofast vectorizes as
.L4: ld1w z0.s, p7/z, [x22, x19, lsl 2] mov p0.b, p7.b bl _ZGVsMxv_sinf st1w z0.s, p7, [x21, x19, lsl 2] add x19, x19, x23 whilelo p7.s, w19, w20 b.any .L4
This also extends to user defined functions, as an example using the appropriate attributes
#include <stdint.h> extern char __attribute__ ((simd, const)) fn3 (int, char); void test_fn3 (int *a, int *b, int *c, int n) { for (int i = 0; i < n; ++i) a[i] = (int) (fn3 (b[i], c[i]) + c[i]); }
Is vectorized in GCC 15 as
.L4: ld1w z23.s, p7/z, [x23, x19, lsl 2] ld1w z0.s, p7/z, [x22, x19, lsl 2] mov p0.b, p7.b mov z1.d, z23.d ptrue p6.b, all bl _ZGVsMxvv_fn3 uxtb z0.s, p6/m, z0.s add z0.s, z0.s, z23.s st1w z0.s, p7, [x21, x19, lsl 2] incw x19 whilelo p7.s, w19, w20 b.any .L4
It is expected that the user provides an implementation for this function following the expected name mangling outlined in the Arm vector calling convention.
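As a minimal sketch of what such an implementation could look like, one could hand-write the variant under the mangled name used in the call above. The parameter and return types below are an assumption for illustration only; the authoritative mapping of char fn3 (int, char) to SVE vector types is defined by the Arm vector function ABI.

#include <arm_sve.h>

/* Hypothetical hand-written SVE variant of fn3 matching the mangled
   name _ZGVsMxvv_fn3 seen in the generated code above.  The element
   types chosen here are illustrative only.  */
svint32_t _ZGVsMxvv_fn3 (svint32_t x, svint32_t c, svbool_t pg)
{
  /* Element-wise equivalent of the scalar fn3, e.g. x + c.  */
  return svadd_s32_m (pg, x, c);
}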
In addition, Arm's math routines in glibc have been improved, providing additional value in the vector routines.
If you are writing a program that makes heavy use of math or string functions, can't use the latest glibc, or aren't on Linux, these improved routines are also available in our Arm Optimized Routines GitHub repository.
In GCC 15 the vectorizer can now track when values are being conditionally stored and optimize the store through better use of predicates. As an example
#include <arm_neon.h> void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) { if (stride <= 1) return; for (int i = 0; i < n; i++) { int res = c[i]; int t = b[i + stride]; if (a[i] != 0) res = t; c[i] = res; } }
Generated in GCC 14:
.L3: ld1b z29.s, p7/z, [x0, x5] ld1w z31.s, p7/z, [x2, x5, lsl 2] ld1w z30.s, p7/z, [x1, x5, lsl 2] cmpne p15.b, p6/z, z29.b, #0 sel z30.s, p15, z30.s, z31.s st1w z30.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3
And now generates:
.L3: ld1b z30.s, p7/z, [x0, x5] ld1w z31.s, p7/z, [x4, x5, lsl 2] cmpne p7.b, p7/z, z30.b, #0 st1w z31.s, p7, [x2, x5, lsl 2] add x5, x5, x1 whilelo p7.s, w5, w3 b.any .L3
Eliminating the need to perform the load of the old values.
GCC 15 now has support for detecting and using saturating instructions, both in scalar and vector form.
#include <arm_neon.h>
#include <limits.h>

#define UT unsigned int
#define UMAX UINT_MAX
#define UMIN 0
#define VT uint32x4_t

UT uadd2 (UT a, UT b)
{
  UT c;
  if (!__builtin_add_overflow(a, b, &c))
    return c;
  return UMAX;
}

void uaddq (UT *out, UT *a, UT *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      UT sum = a[i] + b[i];
      out[i] = sum < a[i] ? UMAX : sum;
    }
}
Used to vectorize as:
.L11: ld1w z30.s, p7/z, [x1, x4, lsl 2] ld1w z29.s, p7/z, [x2, x4, lsl 2] add z29.s, z30.s, z29.s cmpls p15.s, p6/z, z30.s, z29.s sel z29.s, p15, z29.s, z31.s st1w z29.s, p7, [x0, x4, lsl 2] add x4, x4, x5 whilelo p7.s, w4, w3 b.any .L11
But in GCC 15 generates with SVE:
.L6: ld1w z31.s, p7/z, [x1, x4, lsl 2] ld1w z30.s, p7/z, [x2, x4, lsl 2] uqadd z30.s, z31.s, z30.s st1w z30.s, p7, [x0, x4, lsl 2] add x4, x4, x5 whilelo p7.s, w4, w3 b.any .L6
Or with Adv. SIMD:
.L6: ldr q31, [x1, x4] ldr q30, [x2, x4] uqadd v30.4s, v31.4s, v30.4s str q30, [x0, x4] add x4, x4, 16 cmp x5, x4 bne .L6
Additional care has been taken to only use the scalar saturating instructions in cases where they would be beneficial, and to otherwise use a sequence of plain scalar statements instead.
As an example, at -O2 the scalar code above has its arguments in general-purpose registers (GPRs), and so transferring them to FPRs to use the saturating instruction would make things slower. Instead, we generate a sequence using only GPRs and conditional instructions:
.L5: ldr w5, [x1, x4] ldr w6, [x2, x4] adds w5, w5, w6 csinv w5, w5, wzr, cc str w5, [x0, x4] add x4, x4, 4 cmp x3, x4 bne .L5
Population count is a popular operation in many codebases, and GCC has idiom recognition support for the various ways that programmers write popcount. Adv. SIMD only has a popcount instruction for vectors of byte elements. GCC supports other data types by reusing the byte popcount and summing the result, but this wasn't done in an optimal way. As an example:
void bar (unsigned int *__restrict b, unsigned int *__restrict d) { d[0] = __builtin_popcount (b[0]); d[1] = __builtin_popcount (b[1]); d[2] = __builtin_popcount (b[2]); d[3] = __builtin_popcount (b[3]); }
Prior to GCC 15 this generated:
bar: ldp s29, s28, [x0] ldp s31, s30, [x0, 8] cnt v29.8b, v29.8b cnt v28.8b, v28.8b cnt v31.8b, v31.8b cnt v30.8b, v30.8b addv b29, v29.8b addv b28, v28.8b addv b31, v31.8b addv b30, v30.8b stp s29, s28, [x1] stp s31, s30, [x1, 8] ret
But in GCC 15 without SVE:
bar: ldr q31, [x0] cnt v31.16b, v31.16b uaddlp v31.8h, v31.16b uaddlp v31.4s, v31.8h str q31, [x1] ret
And with SVE enabled:
bar: ldr q31, [x0] ptrue p7.b, vl16 cnt z31.s, p7/m, z31.s str q31, [x1] ret
And with dotproduct enabled:
bar: ldr q31, [x0] movi v30.16b, 0x1 movi v29.4s, 0 cnt v31.16b, v31.16b udot v29.4s, v31.16b, v30.16b str q29, [x1] ret
32-bit Arm now supports loop tail predication, a technique that avoids needing a scalar tail for vector loops when the number of elements processed by the loop is not a multiple of the vectorization factor. This is similar to loop masking in SVE. As an example, the loop:
#include <arm_mve.h> void fn (int32_t *a, int32_t *b, int32_t *c, int n) { for (int i = 0; i < n; i+=4) { mve_pred16_t p = vctp32q (n-i); int32x4_t va = vld1q_z (&a[i], p); int32x4_t vb = vld1q_z (&b[i], p); int32x4_t vc = vaddq (va, vb); vst1q_p (&c[i], vc, p); } }
In GCC 14 generated:
.L3: vctp.32 r3 vpst vldrwt.32 q3, [r0], #16 vpst vldrwt.32 q2, [r1], #16 vadd.i32 q3, q3, q2 subs r3, r3, #4 vpst vstrwt.32 q3, [r2] adds r2, r2, #16 le lr, .L3
And with loop predication:
.L3: vldrw.32 q3, [r0], #16 vldrw.32 q2, [r1], #16 vadd.i32 q3, q3, q2 vstrw.32 q3, [r2], #16 letp lr, .L3
If not desired, -mno-dlstp disables the optimization. Tail predication begins with the `dlstp` instruction, which sets the starting count of elements to be masked as true and the `size` of the elements. The `letp` instruction reduces the count, recomputes the mask, and branches back to the start if there are any elements left to work on.
As part of the 2023 Armv9 ISA update, GCC 15 now has support for FP8. FP8 is a floating-point format where the size of the type is 8 bits, but how these 8 bits are split between the exponent and mantissa is not fixed. The current version of this supports two formats: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
Which format is in use is determined by a new system register, FPMR. The format of the parameters and result value of an instruction is determined by the contents of FPMR, and the formats of the arguments and the result type can be set independently.
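As an illustration of setting the formats independently, the fpm_t helper intrinsics used in the example further below can configure each operand separately. This sketch assumes the __arm_set_fpm_dst_format helper from the same ACLE FP8 helper family alongside the source-format helpers:

#include <arm_neon.h>

/* Illustrative sketch: build an FPMR value where the two source
   operands use different FP8 formats and the destination format is
   set separately.  __arm_set_fpm_dst_format is assumed to be
   available from the same ACLE FP8 helper family.  */
fpm_t make_mode (void)
{
  fpm_t mode = __arm_fpm_init ();
  mode = __arm_set_fpm_src1_format (mode, __ARM_FPM_E4M3);
  mode = __arm_set_fpm_src2_format (mode, __ARM_FPM_E5M2);
  mode = __arm_set_fpm_dst_format (mode, __ARM_FPM_E5M2);
  return mode;
}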
These instructions are usable through intrinsics in GCC 15, where each intrinsic takes an fpm_t value which determines the formats for the instruction being requested. Because setting FPMR before every FP8 instruction would be rather slow, GCC 15 also includes a new pass to track the liveness of this special register and optimize its usage. As an example
#include <arm_neon.h>

void foo2 (float16_t ap[20000], mfloat8x16_t b, mfloat8x16_t c)
{
  fpm_t mode = __arm_fpm_init();
  mode = __arm_set_fpm_src1_format(mode, __ARM_FPM_E5M2);
  mode = __arm_set_fpm_src2_format(mode, __ARM_FPM_E5M2);

  for (int i = 0; i < 103; i++)
    {
      float16x8_t a = vld1q_f16 (ap + 8 * i);
      a = vmlalbq_f16_mf8_fpm (a, b, c, mode);
      a = vmlalbq_f16_mf8_fpm (a, b, c, mode);
      vst1q_f16 (ap + 8 * i, a);
    }
}
foo2: mrs x2, fpmr add x1, x0, 1648 cbz x2, .L2 msr fpmr, xzr .L2: ldr q31, [x0] fmlalb v31.8h, v0.16b, v1.16b fmlalb v31.8h, v0.16b, v1.16b str q31, [x0], 16 cmp x1, x0 bne .L2 ret
Here the FPMR value doesn't change within the loop, and so the write is moved out of the loop. By default, GCC assumes that writing the FPMR register is more expensive than reading it. This can be overridden by using -moverride=tune=cheap_fpmr_write, which then generates:
foo2: add x1, x0, 1648 msr fpmr, xzr .L2: ldr q31, [x0] fmlalb v31.8h, v0.16b, v1.16b fmlalb v31.8h, v0.16b, v1.16b str q31, [x0], 16 cmp x1, x0 bne .L2 ret
Giving GCC the ability to control FP8 codegen based on costing.
GCC 15 now has support for the RCPC3 atomic instructions in libatomic which will be used automatically whenever a core supports them.
The introduction of the optional RCPC3 architectural extension for Armv8.2-A upwards provides additional support for the release consistency model, introducing the Load-Acquire RCpc Pair Ordered and Store-Release Pair Ordered operations in the form of LDIAPP and STILP.
These operations are single-copy atomic on cores which also implement LSE2 and, as such, support for them has been added to libatomic and is employed when the LSE2 and RCPC3 features are detected at runtime.
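For illustration, the kind of operation that benefits is a 16-byte atomic access with acquire or release ordering. Such accesses are routed through libatomic on AArch64, which can now pick the new instructions at runtime; the sketch below uses the GCC __atomic builtins:

#include <stdint.h>

/* 16-byte atomic accesses like these go through libatomic
   (__atomic_load_16 / __atomic_store_16), which can now use the
   RCPC3 LDIAPP/STILP instructions for the acquire/release orderings
   on cores that also implement LSE2.  */
__int128 load_acquire (__int128 *p)
{
  __int128 v;
  __atomic_load (p, &v, __ATOMIC_ACQUIRE);
  return v;
}

void store_release (__int128 *p, __int128 v)
{
  __atomic_store (p, &v, __ATOMIC_RELEASE);
}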
GCC 15 adds support for the following new CPUs supported by -mcpu,-mtune (GCC identifiers in parentheses):
* NOTE: Support for these CPUs at this time only means that when running on Linux the compiler will enable the relevant architectural features and use a reasonable cost model.
The existing Neoverse cost models have been updated as well with various tunings around such things as predicate costing.
Prior to GCC 15 the port's default L1 cache line size was 256 bytes, due to the existence of an Armv8-A core which implements this size. Consequently, on cores where the actual line size is 64 bytes we ended up with more memory pressure than needed, particularly when running large multithreaded applications.
In GCC 15 the default cache line size has been changed to 64 bytes for all Neoverse cores, and the port default for Armv9-A has been changed to 64 bytes as well. This reduces the total memory usage of threaded applications.
With GCC 15 the Neoverse cost models have been updated with the natural reassociation width of the individual cores, and support for pipelined fused multiply-accumulate (FMA) has been added. This means that, based on how many parallel FMAs the core can execute, GCC will create reassociation chains that attempt to fill the core's FMA pipelines and take advantage of accumulator forwarding and pipelining.
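As an illustration (assuming -Ofast or similar so that floating-point reassociation is permitted), a reduction like the one below forms a single serial chain of FMAs; with the per-core reassociation width GCC can split it into several independent accumulators that are summed at the end:

/* Illustrative only: a dot-product style reduction whose FMA chain
   can be split across multiple accumulators to match the number of
   FMA pipes on the target core.  */
double dot (const double *a, const double *b, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}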
In GCC 15 support for new instruction fusions in Neoverse cores was added. This includes CMP+CSEL fusion, which means that GCC will endeavor to keep compares and conditional selects together in program order so that the CPU can fuse them.
As this has no downside on cores that don't support the fusion, it is enabled by default for all cores.
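For illustration, source like the following compiles to a compare followed by a conditional select, which is exactly the pair the new tuning tries to keep adjacent so that it can be fused:

/* Illustrative only: generates a CMP followed by a CSEL.  */
int clamp_min (int x, int lo)
{
  return x < lo ? lo : x;
}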
GCC 15 supports the following architecture by -march and other source level constructs (GCC identifiers in parentheses):
And the following features are now supported by -march, -mcpu, -mtune and other source level constructs (GCC identifiers in parentheses):
The features listed as being enabled by default for Armv8.7-A or earlier were previously only selectable using the associated architecture level. For example, FEAT_FCMA was previously selected by -march=armv8.3-a and above (as it still is), but it wasn't previously selectable independently.
GCC 15 adds support for using standard C++ overloaded operators to work on SVE ACLE types.
This addition allows you to write SVE intrinsics code in a more natural way rather than having to use specific intrinsics to perform the operation. As an example:
#include <arm_neon.h> #include <arm_sve.h> svint32_t f (svint32_t a, svint32_t b) { return a * b; }
Now works and generates the expected
f: ptrue p3.b, all mul z0.s, p3/m, z0.s, z1.s ret
(New and Improved) SVE/OpenMP interoperability support
GCC 15 now has better support for SVE and OpenMP offloading including support for OpenMP parallel sections, for and lastprivate.
As an example
#include <arm_neon.h>
#include <arm_sve.h>

extern int omp_get_thread_num (void);

static void __attribute__((noipa))
vec_compare (svint32_t *x, svint32_t y)
{
  svbool_t p = svnot_b_z (svptrue_b32 (),
                          svcmpeq_s32 (svptrue_b32 (), *x, y));
  if (svptest_any (svptrue_b32 (), p))
    __builtin_abort ();
}

void __attribute__ ((noipa))
omp_firstprivate_sections ()
{
  int b[8], c[8];
  svint32_t vb, vc;
  int i;

  #pragma omp parallel for
  for (i = 0; i < 8; i++)
    {
      b[i] = i;
      c[i] = i + 1;
    }

  vb = svld1_s32 (svptrue_b32 (), b);
  vc = svld1_s32 (svptrue_b32 (), c);

  #pragma omp parallel sections firstprivate (vb, vc)
  {
    #pragma omp section
    vec_compare (&vb, svindex_s32 (0, 1));
    vec_compare (&vc, svindex_s32 (1, 1));
    #pragma omp section
    vec_compare (&vb, svindex_s32 (0, 1));
    vec_compare (&vc, svindex_s32 (1, 1));
  }
}
Now vectorizes with SVE.
GCC 15 also supports the SVE2.1 and SME2.1 extensions through both intrinsics and autovectorization.
New intrinsics such as
#include <arm_neon.h> #include <arm_sve.h> svfloat64_t foo (svfloat64_t z0, svuint64_t z4) { z0 = svtblq (z0, z4); return z0; }
Are now supported and generate the expected
foo: tblq z0.d, {z0.d}, z1.d ret
Notably SVE2.1 makes more SME and SME2 extensions available outside of streaming mode.
GCC includes a pass called IVopts (induction variable optimizations) which optimizes aspects such as the addressing modes of memory accesses, based on what the target supports and the cost of using particular addressing modes.
One of the ways the pass does this is to compare IV expressions against each other and to determine whether they are computing the same thing, or whether one could be expressed in terms of the other.
As an example, if a loop calculates a * b + 8 and a * b + 12, then the base IV can be c = a * b + 8 and the two usages become c and c + 4.
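A loop with that shape might look like the following sketch, where both addresses can be derived from a single induction variable (the names and constants here are illustrative only):

/* Illustrative only: the two stores address a * b + 8 + i and
   a * b + 12 + i.  IVopts can keep one induction variable
   c = a * b + 8 + i and express the second address as c + 4.  */
void f (char *p, long a, long b, int n)
{
  for (int i = 0; i < n; i++)
    {
      p[a * b + 8 + i]  = 1;
      p[a * b + 12 + i] = 2;
    }
}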
Prior to GCC 15 this code did not deal with the case where one of the expressions is signed and the other unsigned. If you have ((int)(a * 3) + 8) * 4U and ((a * 3) + 8) * 4U, the unsigned version would be simplified to (a * 12) + 32, and so IVopts would then need to use two IVs.
The reason for this is that simplification of the signed expression doesn’t happen due to possible UB on overflow.
Normally this is not a problem because in most languages like C and C++ address calculations are unsigned. But some languages like Fortran don’t have an unsigned type and thus use signed address calculations. The compiler has various passes, however, that can convert a signed expression into an unsigned one, for example when it knows based on context or ranges that no overflow can occur. As the compiler improved this resulted in less efficient calculations when these cases occur.
As a result, in GCC 14 the following would be generated after vectorization:
mov x27, -108 mov x24, -72 mov x23, -36 add x21, x1, x0, lsl 2 add x19, x20, x22 .L5: add x0, x22, x19 add x19, x19, 324 ldr d1, [x0, x27] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x0, x24] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x0, x23] add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5
Which ended up creating a mix of complex and simple addressing modes.
In GCC 15 IV opts was improved to perform two things:
In GCC 15 the example above generates:
.L5: ldr d1, [x19, -108] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x19, -72] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x19, -36] add x19, x19, 324 add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5
Which results in much faster code for some Fortran workloads.
Support for the ILP32 (32-bit integer, long and pointer mode on 64-bit Arm) ABI has been deprecated in GCC 15. Attempting to use -mabi=ilp32 will result in a warning and support for ILP32 will be removed entirely in future releases.
Notably the embedded target of AArch64 (aarch64-*-elf) no longer builds the ILP32 multilib by default.
The option -mcpu=native is used to detect the current platform the compiler is being used on and generate code that is optimized for this platform. Historically this relied on being able to detect the core identifier. A limitation of this approach, however, is that since GCC is often used as a system compiler old versions tend to be quite long lived in distros. This means that an older compiler is often used on newer CPUs and the compiler would not be able to identify the CPU.
In GCC 14, this restriction was relaxed for homogeneous systems. If the compiler couldn't identify the CPU, it would instead examine individual feature bits to enable any detected extensions and apply one of the newer, more modern costing models.
In GCC 15 this handling was extended to heterogeneous systems. The compiler now detects the common set of features across any number of cores in order to support unknown big.LITTLE systems.
As more and more code is written in C++ we've started taking a closer look at the performance of code generated from the libstdc++ standard library. The community has done a tremendous amount of work this GCC release. Below are two examples of changes made in GCC 15:
The loop implementation for std::find in libstdc++ used to always be manually unrolled. Such manual unrolling has the benefit of reducing the number of checks that have to be performed every iteration. However, modern CPUs are pretty good at speculative execution, and while unrolling still gives some benefit, the manually unrolled loop prevents further optimization of the loop.
In GCC 15 this loop is no longer manually unrolled. For searches over 8-bit quantities libstdc++ now uses glibc's memchr, which has optimized variants for individual platforms. For other sizes libstdc++ now uses a simple non-unrolled loop and instead adds a #pragma GCC unroll directive to indicate to the compiler that it should unroll the loop.
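A simplified sketch of the new shape of the loop (this is not the actual libstdc++ source, just an illustration of the approach):

/* Simplified illustration only: a plain loop plus an unroll hint,
   which leaves the loop in a form the early-break vectorizer can
   handle.  */
template <typename It, typename T>
It find_sketch (It first, It last, const T &value)
{
#pragma GCC unroll 4
  for (; first != last; ++first)
    if (*first == value)
      break;
  return first;
}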
One main benefit of this approach is that it allows the loop to be vectorized. As mentioned earlier, GCC 15 now supports vectorization of loops with early breaks and unknown buffer sizes. In other words, we support vectorization of loops such as those in std::find.
#include <bits/stdc++.h> int foo(int *arr, int n) { // Search an element 6 auto it = std::find(arr, arr + n, 6); // Print index return std::distance(arr, it); }
Used to generate
.L3: ldr w3, [x2, 4] cmp w3, 6 beq .L25 ldr w3, [x2, 8] cmp w3, 6 beq .L26 ldr w3, [x2, 12] cmp w3, 6 beq .L27 add x2, x2, 16 cmp x4, x2 beq .L28 .L8: ldr w3, [x2] cmp w3, 6 bne .L3
GCC 15 now generates
.L57: lsl x3, x5, 4 ldr q31, [x6, x3] cmeq v31.4s, v31.4s, v29.4s umaxp v31.4s, v31.4s, v31.4s fmov x3, d31 cbnz x3, .L76 add v30.2d, v27.2d, v28.2d add x5, x5, 1
And with fixed length SVE:
.L57: ld1w z31.s, p6/z, [x5, x3, lsl 2] cmpeq p15.s, p7/z, z31.s, #6 b.any .L76 add x3, x3, 4 movprfx z30, z29 add z30.d, z30.d, #16
This support also extends to all derivatives of std::find, including std::find_if, where the user can specify a custom predicate to check.
#include <bits/stdc++.h> bool is_even(int i) { return i > 5 || i < 3; } int foo(int *arr, int n) { // Search an element 6 auto it = std::find_if(arr, arr + n, is_even); // Print index return std::distance(arr, it); }
Now vectorizes as
ldr q31, [x6, x3] add v31.4s, v31.4s, v29.4s cmhi v31.4s, v31.4s, v28.4s umaxp v31.4s, v31.4s, v31.4s fmov x3, d31 cbnz x3, .L60 add v26.2d, v26.2d, v27.2d
SVE support using first faulting loads will be added in GCC 16.
In GCC 12 there was a regression of approximately 40% in the performance of hashmap->find.
This regression came about accidentally:
Before GCC 12 the find function was small enough that IPA would inline it even though it wasn't marked inline. In GCC 12 an optimization was added to perform a linear search when the number of entries in the hashmap is small.
This increased the size of the function enough that IPA would no longer inline. Inlining had two benefits:
In GCC 15 this was fixed and hashmap->find is now considered for inlining again (subject to compiler heuristics). This results in large improvements across the board for various element counts and types.
The graph tests various hashmap sizes and cases such as finding none, one or multiple values.
Additional work was also done on the probabilities of the branches inside the loop. Prior to GCC 15 the loop was generated with more branches and compares than would typically be needed, as the loop was laid out assuming the common case is that the element is not found. This makes the branches within a single region quite dense, and branch predictors typically have issues with such dense branch regions.
By changing the probabilities to assume the common case is that an element is found we get ~0-10% performance improvements and the cases where the entry is not found exhibit no slowdowns.
Suppress default Cortex-A53 errata workarounds when they don't apply
Many distros enable the Cortex-A53 errata workarounds by default when configuring GCC. These erratum fixes (-mfix-cortex-a53-835769 and -mfix-cortex-a53-843419) can significantly reduce performance. As measured on SPECCPU 2017 Intrate they account for an estimated ~1% overall loss.
Because of this, GCC 15 will now suppress them when generating code that cannot execute on CPUs containing the erratum; for example, code using SVE, since the Cortex-A53 has no support for this extension. This includes detecting a CPU that can't be a Cortex-A53 when using -mcpu=native.
Particularly when used on Neoverse CPUs this results in significantly better performance and closes the gap between distro defaults and GCC defaults.
One of the differences between SVE and Adv. SIMD is the range of immediates that instructions such as XOR, ORR and AND take. By taking advantage of these instructions and the SVE bitmask move instructions we can extend immediate generation for Adv. SIMD by using SVE instructions.
The simplest example of this is
#include <stdint.h> #include <stdlib.h> #include <arm_neon.h> int32x4_t t1 () { return (int32x4_t) { 0xffc, 0xffc, 0xffc, 0xffc }; }
Where in GCC 14 we’d generate
t1(): adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 4092 .word 4092 .word 4092 .word 4092
But in GCC 15 with SVE enabled
t1(): mov z0.s, #4092 ret
Prior to GCC 15, when vector constants were being created we would only use instructions belonging to the same instruction class as the constant's type; i.e. when creating a 32-bit integer vector constant we would only use instructions that work on 32-bit integers, or fall back to a load.
In GCC 15 we instead use any instruction that produces the same bit pattern as the constant being created.
As an example, the following
#include <arm_neon.h> uint32x4_t f1() { return vdupq_n_u32(0x3f800000); }
Used to require a literal load in GCC 14
f1(): adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 1065353216 .word 1065353216 .word 1065353216 .word 1065353216
But in GCC 15 gets created using a floating point fmov
f1(): fmov v0.4s, 1.0e+0 ret
The SVE architecture has an instruction called index that is used to generate registers with a regular sequence of numbers.
For example the sequence [0,1,..,n] can be created using index .., #0, #1 with the first constant being the starting index and the second the step.
The snippet:
typedef int v4si __attribute__ ((vector_size (16))); v4si f_v4si (void) { return (v4si){ 0, 1, 2, 3 }; }
Used to generate:
f_v4si: adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 0 .word 1 .word 2 .word 3
But now, with SVE enabled, generates:
f_v4si: index z0.s, #0, #1 ret
Arm CPUs have various idioms which are considered zero latency moves. These are documented in the Arm Software Optimization guides. One such idiom is how registers are zeroed.
Prior to GCC 15 we would create zeros using the same size as the datatype being zeroed, i.e. if zeroing a vector of 4 ints we would use movi v4.4s, #0. While this works, it does mean that unless the compiler realizes that all zeros are the same we may accidentally create multiple zero values.
In GCC 15 we now create all zeros using the same base mode, which ensures that they are shared. Additionally for SVE when not in streaming mode we use Adv. SIMD instructions to zero the SVE register.
The example
#include <arm_sve.h> svint32_t f (svint32_t a, svbool_t pg) { return svsub_x (pg, a, a); }
Used to generate:
f(__SVInt32_t, __SVBool_t): mov z0.b, #0 ret
But now generates
f(__SVInt32_t, __SVBool_t): movi d0, #0 ret
GCC 15 contains various optimizations on permutes, far too numerous to mention. Below are some examples:
When a two reg TBL is performed with one operand being a zero vector we can instead use a single reg TBL and map the indices for accessing the zero vector to an out-of-range constant.
On AArch64, out-of-range indices into a TBL have defined semantics: they set the element to zero. This sequence is often generated by OpenMP, and aside from the direct performance impact of this change, it also gives better register allocation as we no longer have the consecutive-register limitation.
typedef unsigned int v4si __attribute__ ((vector_size (16))); v4si f1 (v4si a) { v4si zeros = {0,0,0,0}; return __builtin_shufflevector (a, zeros, 1, 5, 1, 6); }
Generated in GCC 14
f1(unsigned int __vector(4)): movi v30.4s, 0 adrp x0, .LC0 ldr q31, [x0, #:lo12:.LC0] mov v1.16b, v30.16b tbl v0.16b, {v0.16b - v1.16b}, v31.16b ret .LC0: .byte 4 .byte 5 .byte 6 .byte 7 .byte 20 .byte 21 .byte 22 .byte 23 .byte 4 .byte 5 .byte 6 .byte 7 .byte 24 .byte 25 .byte 26 .byte 27
Note that this optimization is not possible with SVE because at VL 2048 the index 255 is still within range.
In some circumstances GCC could not realize that it was permuting the same register and ended up copying the register to pass it to a 2 reg TBL.
This resulted in a needless register copy because two-register TBLs require consecutively numbered registers. The example
extern float a[32000], b[32000]; void s1112() { for (int i = 32000 - 1; i >= 0; i--) { a[i] = b[i] + (float)1.; } }
Generated in GCC 14:
.L2: add x2, x4, x0 add x1, x3, x0 add x2, x2, 65536 add x1, x1, 65536 sub x0, x0, #16 ldr q30, [x2, 62448] mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v29.16b fadd v30.4s, v30.4s, v28.4s mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v29.16b str q30, [x1, 62448] cmp x0, x5 bne .L2
But in GCC 15:
.L2: add x2, x4, x0 add x1, x3, x0 add x2, x2, 65536 add x1, x1, 65536 sub x0, x0, #16 ldr q29, [x2, 62448] tbl v29.16b, {v29.16b}, v31.16b fadd v29.4s, v29.4s, v30.4s tbl v29.16b, {v29.16b}, v31.16b str q29, [x1, 62448] cmp x0, x5 bne .L2
#include <stdint.h> #include <stdlib.h> #include <arm_neon.h> uint64x2_t foo (uint64x2_t r) { uint32x4_t a = vreinterpretq_u32_u64 (r); uint32_t t; t = a[0]; a[0] = a[1]; a[1] = t; t = a[2]; a[2] = a[3]; a[3] = t; return vreinterpretq_u64_u32 (a); }
Used to perform the reversals element-wise due to the reinterpretation of the 64-bit element vector as a 32-bit element vector:
foo(__Uint64x2_t): mov v31.16b, v0.16b ins v0.s[0], v0.s[1] ins v0.s[1], v31.s[0] ins v0.s[2], v31.s[3] ins v0.s[3], v31.s[2] ret
But in GCC 15 we now correctly generate
foo(__Uint64x2_t): rev64 v0.4s, v0.4s ret
GCC’s early register allocator pass was enhanced this year to help optimize sequences that are generated with TBLs.
typedef unsigned int v4si __attribute__((vector_size(16))); unsigned int foo (v4si *ptr) { v4si x = __builtin_shufflevector (ptr[0], ptr[2], 0, 7, 1, 5); v4si y = x ^ ptr[4]; return (y[0] + y[1] + y[2] + y[3]) >> 1; }
Used to generate additional moves around the TBLs, but in GCC 15 we generate
adrp x1, .LC0 ldr q30, [x0] ldr q31, [x0, 32] ldr q29, [x1, #:lo12:.LC0] ldr q0, [x0, 64] tbl v30.16b, {v30.16b - v31.16b}, v29.16b eor v0.16b, v0.16b, v30.16b
GCC 15 has a new late combine pass to help optimize instructions generated late in the compilation pipeline. It runs twice: once before register allocation, before the first instruction split pass, and once after register allocation.
The pass’s objective is to remove definitions by substituting them into all uses. This is particularly handy to force complex addressing modes back into loads that may have been split off earlier to try to share part of the computation. On modern Arm cores most complex addressing modes are cheap and so it’s more beneficial to use them.
extern const int constellation_64qam[64]; void foo(int nbits, const char *p_src, int *p_dst) { while (nbits > 0U) { char first = *p_src++; char index1 = ((first & 0x3) << 4) | (first >> 4); *p_dst++ = constellation_64qam[index1]; nbits--; } }
In GCC 15 this generates:
.L3: ldrb w4, [x1, x3] ubfiz w5, w4, 4, 2 orr w4, w5, w4, lsr 4 ldr w4, [x6, w4, sxtw 2] str w4, [x2, x3, lsl 2] add x3, x3, 1 cmp x0, x3 bne .L3
With the complex addressing mode guaranteed to be formed.
GCC has multiple instruction scheduling passes: one that happens before register allocation and one that happens after. Up to a third of the compilation time of an application can be taken up by early scheduling, but at lower optimization levels this does not provide any meaningful performance gain on modern out-of-order CPUs.
This is partly because modern OoO cores need far less scheduling, and partly because the early scheduler works on pseudo registers, where it's not yet able to tell that different pseudos refer to the same values. Because it lacks this context it can end up scheduling dependent instructions after, e.g., an increment. This results in the register allocator having to insert a spill to keep both the old and new values live, slowing down performance.
In GCC 15 the early scheduler is now disabled on AArch64 for optimization levels below -O3. This results in a compilation time reduction of up to ~50% and a code size reduction of ~0.2% on SPECCPU 2017.
The SVE intrinsics in GCC are mostly modelled as opaque structures in the front-end and it’s only the backend that knows what they do. A consequence of this is that simple operations working on constants were not being optimized. GCC 15 does a whole host of optimizations on SVE intrinsics, far too many to list. However, examples of optimizations being performed now are:
#include <stdint.h> #include "arm_sve.h" svint64_t s64_x_pg (svbool_t pg) { return svdiv_x (pg, svdup_s64 (5), svdup_s64 (3)); } svint64_t s64s_x_pg (svbool_t pg) { return svmul_x (pg, svdup_s64 (5), svdup_s64 (3)); } svint64_t s64_x_pg_op1 (svbool_t pg, svint64_t op2) { return svdiv_x (pg, svdup_s64 (0), op2); }
Which in GCC 14 generated
s64_x_pg: mov z31.d, #5 mov z0.d, #3 sdivr z0.d, p0/m, z0.d, z31.d ret s64s_x_pg: mov z0.d, #5 mul z0.d, z0.d, #3 ret s64_x_pg_op1: mov z31.b, #0 sdivr z0.d, p0/m, z0.d, z31.d ret
These are now optimized in GCC 15, as one would expect, to
s64_x_pg: mov z0.d, #1 ret s64s_x_pg: mov z0.d, #15 ret s64_x_pg_op1: movi d0, #0 ret
GCC 14 vectorizes CTZ for SVE as .CTZ (X) = (PREC - 1) - .CLZ (X & -X), which can be improved by using RBIT followed by CLZ. As an example, the sequence
#include <stdint.h> void ctz_uint8 (uint8_t *__restrict x, uint8_t *__restrict y, int n) { for (int i = 0; i < n; i++) x[i] = __builtin_ctzg (y[i]); }
In GCC 14 generated:
.L3: ld1b z31.b, p7/z, [x1, x3] movprfx z30, z31 add z30.b, z30.b, #255 bic z30.d, z30.d, z31.d clz z30.b, p6/m, z30.b subr z30.b, z30.b, #8 st1b z30.b, p7, [x0, x3] add x3, x3, x4 whilelo p7.b, w3, w2 b.any .L3
And in GCC 15 generates:
.L3: ld1b z31.b, p7/z, [x1, x3] rbit z31.b, p6/m, z31.b clz z31.b, p6/m, z31.b st1b z31.b, p7, [x0, x3] add x3, x3, x4 whilelo p7.b, w3, w2 b.any .L3
Some vector rotate operations can be implemented in a single instruction rather than using the fallback SHL+USRA sequence.
When the rotation is half the element size we can use a suitable REV instruction.
More generally, rotates by a byte amount can be implemented using vector permutes.
typedef unsigned long long __attribute__ ((vector_size (16))) v2di; v2di G1 (v2di r) { return (r >> 32) | (r << 32); }
In GCC 14 used to generate
G1(unsigned long long __vector(2)): shl v31.2d, v0.2d, 32 ushr v0.2d, v0.2d, 32 orr v0.16b, v0.16b, v31.16b ret
And in GCC 15
G1(unsigned long long __vector(2)): rev64 v0.4s, v0.4s ret
We can also make use of the integrated rotate step of the XAR instruction to implement most vector integer rotates, as long as we zero out one of its input registers. This allows for a lower-latency sequence than the fallback SHL+USRA, especially when we can hoist the zeroing operation out of loops and hot code.
typedef unsigned long long __attribute__ ((vector_size (16))) v2di; v2di G2 (v2di r) { return (r >> 39) | (r << 25); }
In GCC 14 generated:
G2(unsigned long long __vector(2)): shl v31.2d, v0.2d, 25 ushr v0.2d, v0.2d, 39 orr v0.16b, v0.16b, v31.16b ret
And in GCC 15 with SVE2
G2(unsigned long long __vector(2)): movi v31.4s, 0 xar z0.d, z0.d, z31.d, #39 ret
GCC 15 has a new pass to automatically detect sequences calculating CRCs and emit optimized sequences for architectures that have hardware-accelerated CRC instructions.
This pass can support two specific computations: bit-forward and bit-reversed (reflected) CRCs.
The following example
#include <stdint.h>
#include <stdlib.h>

uint32_t _crc32 (uint32_t crc, uint16_t data)
{
  int i;
  crc = crc ^ data;

  for (i = 0; i < 8; i++)
    {
      if (crc & 1)
        crc = (crc >> 1) ^ 0xEDB88320;
      else
        crc = (crc >> 1);
    }

  return crc;
}
Without the optimization this generated:
_crc32(unsigned int, unsigned short): and w1, w1, 65535 mov w3, 33568 eor w0, w1, w0 movk w3, 0xedb8, lsl 16 mov w1, 8 .L2: sbfx x2, x0, 0, 1 subs w1, w1, #1 and w2, w2, w3 eor w0, w2, w0, lsr 1 bne .L2 ret
And in GCC 15 with +crc
_crc32(unsigned int, unsigned short): and w1, w1, 65535 eor w1, w1, w0 mov w0, 0 crc32b w0, w1, w0 ret
The bit reverse sequence
#include <stdint.h>
#include <stdlib.h>

#define POLY (0x1070U << 3)
#define u8 uint8_t
#define u16 uint16_t

u8 crc8 (u16 data)
{
  int i;
  for (i = 0; i < 8; i++)
    {
      if (data & 0x8000)
        data = data ^ POLY;
      data = data << 1;
    }
  return (u8)(data >> 8);
}
Without the optimization this generated:
crc8(unsigned short): and w0, w0, 65535 mov w2, 8 mov w3, -31872 .L7: eor w1, w0, w3 tst x0, 32768 and w1, w1, 65535 csel w0, w1, w0, ne subs w2, w2, #1 ubfiz w0, w0, 1, 15 bne .L7 ubfx x0, x0, 8, 8 ret
and in GCC 15
crc8(unsigned short): mov x2, 1813 ubfx x1, x0, 8, 8 movk x2, 0x1, lsl 16 fmov d30, x1 fmov d31, x2 mov x1, 1792 ubfiz x0, x0, 8, 16 pmull v30.1q, v30.1d, v31.1d fmov d31, x1 ushr d30, d30, 16 pmull v30.1q, v30.1d, v31.1d fmov w1, s30 eor w0, w0, w1 ubfx x0, x0, 8, 8 ret
In GCC 14 the following expressions
#include <stdint.h> #include <stdlib.h> float test_ldexpf (float x, int i) { return __builtin_ldexpf (x, i); } float test_powif_1(float a, int i) { return a * __builtin_powif(2.0f, i); }
would result in calls to the library implementations of these functions
test_ldexpf(float, int): b ldexpf test_powif_1(float, int): stp x29, x30, [sp, -32]! mov x29, sp str d15, [sp, 16] fmov s15, s0 fmov s0, 2.0e+0 bl __powisf2 fmul s0, s0, s15 ldr d15, [sp, 16] ldp x29, x30, [sp], 32 ret
However, when SVE is enabled the SVE instruction FSCALE can be used to implement ldexpf and powif with a power-of-two base.
In GCC 15 with SVE enabled we generate:
test_ldexpf(float, int): fmov s31, w0 ptrue p7.b, vl4 fscale z0.s, p7/m, z0.s, z31.s ret test_powif_1(float, int): fmov s31, w0 ptrue p7.b, vl4 fscale z0.s, p7/m, z0.s, z31.s ret
And thus avoid the expensive library calls.
Continuing the trend that GCC 14 started we now use more SVE instructions to patch up inefficiencies in Adv. SIMD codegen. The following example
#include <stdint.h> int8_t M_int8_t_8[8]; void asrd_int8_t_8 () { for (int i = 0; i < 8; i++) { M_int8_t_8[i] /= 4; } }
When vectorized using Adv. SIMD used to generate
asrd_int8_t_8(): adrp x0, .LANCHOR0 movi v29.8b, 0x3 ldr d31, [x0, #:lo12:.LANCHOR0] cmlt v30.8b, v31.8b, #0 and v29.8b, v29.8b, v30.8b add v29.8b, v31.8b, v29.8b sshr v29.8b, v29.8b, 2 str d29, [x0, #:lo12:.LANCHOR0] ret
But in GCC 15 with SVE enabled will generate
asrd_int8_t_8(): adrp x0, .LANCHOR0 ptrue p7.b, vl8 ldr d31, [x0, #:lo12:.LANCHOR0] asrd z31.b, p7/m, z31.b, #2 str d31, [x0, #:lo12:.LANCHOR0] ret
GCC 15 has a new optimization for improving code layout for locality between callees and callers with LTO. The basic idea is to minimize branch distances between frequently called functions.
This is particularly useful for larger applications where you typically have a grouping of functions that work together in a closely related cluster. Keeping branch distances short can improve icache efficiency and make better use of internal CPU data structures.
This new optimization is also able to use PGO to guide this heuristic. In the absence of PGO it will try to use GCC’s static predictors to determine what the best locality could be.
The new pass can be turned on with the -fipa-reorder-for-locality flag.
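As a usage sketch (the file names here are hypothetical), the flag is intended to be used together with LTO:

$ gcc -O2 -flto -fipa-reorder-for-locality main.c util.c -o app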
Glibc now has an improved __libc_malloc, which has been split into 2 parts:
This results in significant performance gains since __libc_malloc doesn't need to set up a frame, and we delay tcache initialization and the setting of errno until later.
On Neoverse V2, bench-malloc-simple improves by 6.7% overall (up to 8.5% for ST case) and bench-malloc-thread improves by 20.3% for 1 thread and 14.4% for 32 threads.
GCC 15, along with Binutils 2.44 and Glibc 2.41, brings support for the Guarded Control Stack (GCS) extension for systems that have this hardware capability and run Linux kernel 6.13 or newer. This extension allows the use of shadow stacks on AArch64 systems. A shadow stack is maintained transparently for the programmer and has limited access, so it cannot be corrupted in the same manner as the main stack of the application. Each return from a function comes with a check of the return address in the link register against the one in the corresponding shadow stack frame. If a mismatch is detected, execution is interrupted instead of proceeding to an incorrect address.
GCS is opt-in and needs all object files and libraries of the application to be built with GCS support (similar to BTI, GCS relies on ELF marking), and it must be explicitly activated at runtime. GCS is currently not enabled by default.
To build your application with GCS support, use the `-mbranch-protection=gcs` compiler flag:
$ gcc -mbranch-protection=gcs hello.c -o hello
Of course in practice you would usually use `-mbranch-protection=standard` which includes GCS. To check that your binary is marked as supporting GCS, look at the GNU property notes:
$ readelf -n hello ... Properties: AArch64 feature: GCS ...
When running an application linked against Glibc, use the new GCS tunable `glibc.cpu.aarch64_gcs` to activate GCS runtime support. Glibc itself needs to be built with standard branch protection to support GCS; however, it will work both with applications that use GCS and with those that don't. The new tunable allows users to choose one of the following behaviours:
To run your app with one of these behaviours, use the `GLIBC_TUNABLES` environment variable:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hello
You can find more information about how Glibc enables GCS via the `prctl` system call in corresponding documentation.
It is important to make a note about `dlopen`. If GCS is enabled at the time of calling `dlopen` and the library being loaded is not GCS-marked, `dlopen` will return an error.
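For illustration (the library name below is a placeholder), an application observes this as a normal `dlopen` failure, with `dlerror` describing the reason:

#include <stdio.h>
#include <dlfcn.h>

/* Illustrative only: when GCS is active and "libplugin.so" is not
   GCS-marked, dlopen fails and returns NULL; dlerror gives the
   reason.  */
int main (void)
{
  void *handle = dlopen ("libplugin.so", RTLD_NOW);
  if (handle == NULL)
    fprintf (stderr, "dlopen failed: %s\n", dlerror ());
  return 0;
}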
In multi-threaded applications each thread will have its own shadow stack.
With return addresses being carefully checked against the shadow stack, it is important that the application does not manipulate its stack frame records in a way that is not PCS compliant. However, some tricky cases are already covered: Glibc 2.41 or newer has GCS-friendly implementations of `setjmp`, `longjmp`, `makecontext`, `setcontext`, `swapcontext`, and `vfork`. It is expected that for many applications and libraries GCS will be transparent and should not require any code changes.
We will demonstrate the effects of GCS in practice using an over-simplified example adapted from the Stack Buffer Overflow learning path. So, let's write some code.
GCC 15 implements ACLE builtins relevant to GCS; one of them is `__chkfeat`, which allows runtime checks for certain CPU features. Under the hood it is just a few instructions, so these checks are quick. To use it, include the `arm_acle.h` header:
// hack.c
#include <stdio.h>
#include <arm_acle.h>

int main (int argc, char *argv[])
{
  /* Show if GCS is enabled (uses ACLE). */
  if (__chkfeat (_CHKFEAT_GCS))
    printf ("GCS enabled\n");
  else
    printf ("GCS not enabled\n");
  return 0;
}
Build this with GCS branch protection:
$ aarch64-none-linux-gnu-gcc hack.c -mbranch-protection=gcs -o hack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu
And run using GCS tunable:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hack GCS enabled
and (notice value `0` that we use this time):
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=0 ./hack GCS not enabled
If you don't have access to hardware that supports GCS, you can run these examples using a tool called Shrinkwrap, which allows you to run programs on Arm Fixed Virtual Platform (FVP) models in a very user-friendly way.
Next we will write a program that has some explicit bugs and implements a ROP attack via stack buffer overflow, overwriting the return address of the `main` function and achieving execution of malicious code. This example is very crude for the sake of being brief and simple, but it will allow us to see how this program behaves with and without GCS.
When GCS detects a mismatch of return addresses, the program receives `SIGSEGV`, so to catch this we will set up a signal handler (some code omitted for brevity):
// hack.c
#include <signal.h>
<...>

/* We'll arrive here when GCS catches mismatch of return addresses. */
static void handler (int)
{
  puts ("something's phishy... exiting");
  exit (0);
}

int main (int argc, char *argv[])
{
  <...>
  /* Setup signal handler for GCS errors. */
  signal (SIGSEGV, handler);
  <...>
}
Now let's introduce our buggy code that may result in stack buffer overflow.
// hack.c
#include <stdlib.h>
<...>

/* Buffer which attacker managed to write to. */
unsigned long buffer[3] = {};

/* To make example simpler we use this crude mem copy implementation. */
static void copy (char *dst, const char *src, size_t len)
{
  const char *end = src + len;
  while (src < end)
    *dst++ = *src++;
}

/* Buggy function. */
static int fun (const char *src, size_t len)
{
  /* This is a bug as LEN can be more than 8. */
  char local[8];
  copy (local, src, len);
  return *(int *)local;
}

int main (int argc, char *argv[])
{
  <...>
  /* Do stuff.  After hack instead of returning from main we will jump
     to `hacked` due to return address corruption. */
  return fun ((const char *)buffer, sizeof (buffer));
}
Copying into the `local` buffer that is allocated on the stack may result in writing past the end of this buffer, which means we may overwrite other data stored on the stack above it.
Now for the hacker bits:
/* Pointer to data of attacker's choice. */
const char command[] = "whoami";

/* Code that attacker wants to run. */
static void hacked (const char *command)
{
  printf ("you've been hacked! executing `%s`\n", command);
  exit (1);
}
and in the `main` function, after we have installed the signal handler and before we call our `fun` function, let's use the arbitrary write gadget that the attacker managed to obtain:
int main (int argc, char *argv[])
{
  <...>
  /* Attacker uses their arbitrary write gadgets. */
  buffer[0] = (unsigned long)command;
  buffer[2] = (unsigned long)&hacked;
  <...>
}
Putting it all together and compiling it (note that this example is sensitive to code generation, which is why we use a specific optimisation level, `-O0`, below):
$ aarch64-none-linux-gnu-gcc hack.c -O0 -mbranch-protection=gcs -o hack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu
And run using GCS tunable with GCS disabled:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=0 ./hack GCS not enabled you've been hacked! executing `whoami`
and (notice value `1` that we use this time):
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hack GCS enabled something's phishy... exiting
With the help of GCS we managed to prevent execution of malicious code and handle the unexpected situation appropriately. You can of course run this example on a non-GCS AArch64 system, but GCS will not be enabled due to lack of hardware GCS support.
The guarded control stack is used under the hood for tracking return addresses but it can also be useful for other things. While userspace applications are not allowed to write to shadow stack directly, we can read from it, and this may be handy for gathering call stacks for debugging and profiling purposes in a potentially faster and more robust way than via previously available tools.
Let's write a simple example. We will use the new ACLE builtin `__gcspr`, which returns the shadow stack pointer. We will also use `AT_HWCAP` to check if the system we are running on supports GCS at all (otherwise the following code would not be able to run). Even if the system supports GCS, it doesn't mean that GCS is enabled at runtime; to check this we will use, as before, the `__chkfeat` ACLE builtin. If GCS is not enabled, the shadow stack pointer will be null. Finally, if all is good and GCS is enabled, we'll be able to access a stack of return addresses and print them in order (most recent call first).
// callstack.c
#include <stdio.h>
#include <arm_acle.h>
#include <sys/auxv.h>

int main(int argc, char const *argv[])
{
  /* Check if system may support GCS at all. */
  if (!(getauxval (AT_HWCAP) & HWCAP_GCS))
    {
      printf ("GCS is not supported on this system\n");
      return 1;
    }

  /* Get GCS pointer. */
  const unsigned long *gcspr = __gcspr ();

  /* Check if GCS is enabled. */
  if (!__chkfeat (_CHKFEAT_GCS))
    {
      printf ("GCS is not enabled so GCS pointer is null: %016lx\n",
              (unsigned long)gcspr);
      return 2;
    }
  else
    {
      printf ("GCS is enabled so GCS pointer is not null: %016lx\n",
              (unsigned long)gcspr);
    }

  /* Print callstack. */
  printf ("callstack:\n");
  do
    {
      printf ("  - return address: %08lx\n", *gcspr);
    }
  while (*gcspr++);

  return 0;
}
Build with standard branch protection:
$ aarch64-none-linux-gnu-gcc callstack.c -mbranch-protection=standard -o callstack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu -static
And run, as before with one of the values for the GCS tunable, for example:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./callstack
Use `objdump` to check disassembly of this program to confirm the return addresses shown in the output:
$ objdump -D callstack | less
GCC 15 was a significant step forward in support for Arm architectures and optimizations. As work on the internal structure of the compiler continues, the improvements made with each release will keep growing. Look out for much more to come in GCC 16! In the meantime, check out the previous year's entry for GCC 14.