In GCC 15, Arm, the GCC community, and our partners have continued innovating and improving code generation for Arm-based platforms. GCC 15 continues the trend of supporting control-flow vectorization and mixing SVE and Adv. SIMD instructions, along with general improvements to fundamentals such as addressing modes and constant creation. This release also contains many improvements for our Neoverse cores, along with an estimated 3-5% improvement in SPECCPU 2017 Intrate on Neoverse CPUs.
As mentioned in the GCC 14 blog post, GCC has successfully transitioned to having only one loop vectorizer instead of two. With GCC 15, only the loop-aware SLP vectorizer is enabled. Having one vectorizer makes it easier to extend the vectorizer's capabilities as there's only one to maintain, so features can be delivered faster and with more flexibility. Much work was done on enhancing the SLP loop vectorizer with functionality that only the non-SLP loop vectorizer had. One such feature was the early break support added in GCC 14.
GCC 14 was the enabling release for vectorization of loops with early exits. An early exit is any construct that allows you to leave the loop, whether it's a goto, return, break, abort, etc. As an example, in GCC 14
int y[100], x[100]; int foo (int n) { int res = 0; for (int i = 0; i < n; i++) { y[i] = x[i] * 2; res += x[i] + y[i]; if (x[i] > 5) break; } return res; }
generated
.L4: st1w z27.s, p7, [x2, x1, lsl 2] add x1, x1, x3 whilelo p7.s, w1, w0 b.none .L14 .L5: ld1w z31.s, p7/z, [x4, x1, lsl 2] cmpgt p14.s, p7/z, z31.s, #5 lsl z27.s, z31.s, #1 add z31.s, z31.s, z27.s mov z29.d, z30.d ptest p15, p14.b add z30.s, p7/m, z30.s, z31.s mov z31.d, z28.d incw z28.s b.none .L4 umov w1, v31.s[0]
Where there were several inefficiencies in the code generation that made the loop more expensive than it needed to be.
With GCC 15 many of these have been cleaned up and we generate a much cleaner loop:
.L4: st1w z28.s, p7, [x2, x1, lsl 2] add z30.s, z30.s, z28.s add x1, x1, x3 add z29.s, p7/m, z29.s, z30.s incw z31.s whilelo p7.s, w1, w0 b.none .L15 .L5: ld1w z30.s, p7/z, [x4, x1, lsl 2] cmpgt p14.s, p7/z, z30.s, #5 add z28.s, z30.s, z30.s ptest p15, p14.b b.none .L4 umov w1, v31.s[0]
This fixes liveness issues with the loop invariants which necessitated copying values in GCC 14. There is still some room for improvement, such as eliminating the ptest, which is planned for GCC 16. Adding early break support to SLP also means that several new features coming in GCC 16 are easier to implement. For now though, let's continue with what is new in GCC 15.
One limitation of the GCC 14 implementation of early break vectorization was that it required the size of the buffer being read to be statically known, or it required additional context to know that the vector loop could never read beyond what would be safe to read (in other words, that the vector loop doesn't cross a page boundary when the scalar loop wouldn't have). As an example,
#define END 505 int foo (int *x) { for (unsigned int i = 0; i < END; ++i) { if (x[i] > 0) return 0; } return -1; }
Would not vectorize in GCC 14 because the runtime alignment of x is unknown, and thus it's not possible to tell whether it's safe to vectorize. In GCC 15 this limitation is lifted for Adv. SIMD and fixed vector size SVE. Generic vector-length-agnostic SVE requires first-faulting loads, which will be supported in GCC 16.
In GCC 15 the example generates
.L5: add v30.4s, v30.4s, v27.4s add v29.4s, v29.4s, v28.4s cmp x2, x1 beq .L24 .L7: ldr q31, [x1] add x1, x1, 16 cmgt v31.4s, v31.4s, #0 umaxp v31.4s, v31.4s, v31.4s fmov x5, d31 cbz x5, .L5
Peeling for alignment works by using a scalar loop to handle enough elements such that the pointer becomes aligned to the required vector alignment and only then starts the vector code. This is, however, only possible when there’s only a single pointer with a misalignment. When there are multiple pointers which have an unknown alignment GCC will instead version the loop and insert a mutual alignment check before allowing access to the vector loop:
#define END 505 int foo (int *x, int *y) { for (unsigned int i = 0; i < END; ++i) { if (x[i] > 0 || y[i] > 0) return 0; } return -1; }
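To make the versioning requirement concrete, the check only needs to establish that the two pointers share the same misalignment, because then peeling a few iterations aligns both at once. The following is a conceptual sketch of that condition (assuming a 16-byte vector width), not the literal code GCC emits:

#include <stdint.h>

/* Conceptual sketch only: if x and y have the same offset within a
   16-byte vector, a single peeling prologue can align both pointers,
   so the vector loop can be entered safely.  */
static inline int
mutually_aligned (const int *x, const int *y)
{
  return (((uintptr_t) x ^ (uintptr_t) y) & 15) == 0;
}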
When using fixed size SVE we can also vectorize such loops. Instead of a scalar peeling loop, GCC creates an initial mask that performs the peeling, which generates:
.L2: add x1, x1, 4 mov w3, 0 add z31.s, z31.s, #4 sub z30.s, z30.s, #4 whilelo p7.s, w1, w2 b.none .L11 .L5: ld1w z29.s, p7/z, [x4, x1, lsl 2] cmpgt p7.s, p7/z, z29.s, #0 ptest p15, p7.b b.none .L2
This enables GCC to vectorize many loops in HPC workloads such as GROMACS, showing the impact of control flow support in GCC.
Programmers sometimes insert prefetch hints in their code (using __builtin_prefetch) to inform the CPU that a block of memory is about to be read. On many Arm cores, the instructions used for the hints are mostly ignored as the CPU's own predictors can already do much better. Arm recommends that you do not insert prefetch hints unless you have benchmarked your code on the machine you intend to deploy it on.
Furthermore, the presence of these hints means that the vectorizer cannot vectorize these scalar loops because the hints touch memory, i.e. they have a side effect.
With GCC 14 the following
void foo (double *restrict a, double *restrict b, int n) { int i; for (i = 0; i < n; ++i) { a[i] = a[i] + b[i]; __builtin_prefetch (&(b[i + 8])); } }
Would not vectorize and instead stay scalar
.L3: ldr d30, [x1, -64] ldr d31, [x0] prfm PLDL1KEEP, [x1] add x1, x1, 8 fadd d31, d31, d30 str d31, [x0], 8 cmp x2, x0 bne .L3
With GCC 15 we now drop these prefetch calls during vectorization as the prefetches are not very useful for vector code. GCC 15 generates:
.L3: ld1d z31.d, p7/z, [x0, x3, lsl 3] ld1d z30.d, p7/z, [x1, x3, lsl 3] fadd z31.d, p7/m, z31.d, z30.d st1d z31.d, p7, [x0, x3, lsl 3] add x3, x3, 2 whilelo p7.d, w3, w2 b.any .L3
GCC 15 adds support for autovectorizing using SVE2.1’s two-way dot product. The example
#include <stdint.h> uint32_t udot2(int n, uint16_t* data) { uint32_t sum = 0; for (int i=0; i<n; i+=1) sum += data[i] * data[i]; return sum; }
In GCC 14 would vectorize using an MLA
.L3: ld1h z29.s, p7/z, [x1, x2, lsl 1] add x2, x2, x3 mla z30.s, p7/m, z29.s, z29.s whilelo p7.s, w2, w0 b.any .L3 uaddv d31, p6, z30.s
but in GCC 15 will use the two-way udot:
.L3: ld1h z28.h, p7/z, [x1, x2, lsl 1] add x2, x2, x3 sel z27.h, p7, z28.h, z29.h whilelo p7.h, w2, w0 udot z30.s, z27.h, z28.h b.any .L3 ptrue p7.b, all uaddv d31, p7, z30.s
There is still scope for improvement in this loop as the sel is not needed. Today GCC does not have a clear definition of the value of the inactive elements after a load. In this case it isn't known that these values are zero, and the sel explicitly tries to zero them. Improving this is on the backlog.
GCC also supports autovectorization of functions such as this for SME. As an example, adding the streaming-mode attribute to the function
#include <stdint.h> uint32_t udot2 (int n, uint16_t *data) __arm_streaming { uint32_t sum = 0; for (int i = 0; i < n; i += 1) sum += data[i] * data[i]; return sum; }
Enables GCC to autovectorize as
.L4: incb x3, all, mul #2 ld1h z26.h, p7/z, [x2] ld1h z25.h, p7/z, [x2, #1, mul vl] ld1h z2.h, p7/z, [x2, #2, mul vl] ld1h z1.h, p7/z, [x2, #3, mul vl] udot z0.s, z26.h, z26.h incb x2, all, mul #4 udot z28.s, z25.h, z25.h udot z29.s, z2.h, z2.h udot z30.s, z1.h, z1.h cmp w4, w3 bcs .L4
Enabling seamless usage of SME functionality through the autovectorizer.
Libmvec support for SVE
In GCC 14 support was added for auto-vectorizing math routines using glibc’s libmvec implementation.
The following
#include <stdint.h> #include <math.h> void test_fn3 (float *a, float *b, int n) { for (int i = 0; i < n; ++i) a[i] = sinf (b[i]); }
in GCC 15 at -Ofast vectorizes as
.L4: ld1w z0.s, p7/z, [x22, x19, lsl 2] mov p0.b, p7.b bl _ZGVsMxv_sinf st1w z0.s, p7, [x21, x19, lsl 2] add x19, x19, x23 whilelo p7.s, w19, w20 b.any .L4
This also extends to user defined functions, as an example using the appropriate attributes
#include <stdint.h> extern char __attribute__ ((simd, const)) fn3 (int, char); void test_fn3 (int *a, int *b, int *c, int n) { for (int i = 0; i < n; ++i) a[i] = (int) (fn3 (b[i], c[i]) + c[i]); }
Is vectorized in GCC 15 as
.L4: ld1w z23.s, p7/z, [x23, x19, lsl 2] ld1w z0.s, p7/z, [x22, x19, lsl 2] mov p0.b, p7.b mov z1.d, z23.d ptrue p6.b, all bl _ZGVsMxvv_fn3 uxtb z0.s, p6/m, z0.s add z0.s, z0.s, z23.s st1w z0.s, p7, [x21, x19, lsl 2] incw x19 whilelo p7.s, w19, w20 b.any .L4
It is expected that the user provides an implementation for this function following the expected name mangling outlined in the Arm vector calling convention.
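As a minimal sketch of what such an implementation could look like, one could hand-write the variant under the mangled name used in the call above. The parameter and return types below are an assumption for illustration only; the authoritative mapping of char fn3 (int, char) to SVE vector types is defined by the Arm vector function ABI.

#include <arm_sve.h>

/* Hypothetical hand-written SVE variant of fn3 matching the mangled
   name _ZGVsMxvv_fn3 seen in the generated code above.  The element
   types chosen here are illustrative only.  */
svint32_t _ZGVsMxvv_fn3 (svint32_t x, svint32_t c, svbool_t pg)
{
  /* Element-wise equivalent of the scalar fn3, e.g. x + c.  */
  return svadd_s32_m (pg, x, c);
}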
In addition, Arm's math routines in glibc have been improved, providing additional value in the vector routines.
If you are writing a program that makes heavy use of math or string functions, can't use the latest glibc, or aren't on Linux, these improved routines are also available in our Arm Optimized Routines GitHub repository.
In GCC 15 the vectorizer can now track when values are being conditionally stored and optimize the store through better use of predicates. As an example
#include <arm_neon.h> void foo1 (char *restrict a, int *restrict b, int *restrict c, int n, int stride) { if (stride <= 1) return; for (int i = 0; i < n; i++) { int res = c[i]; int t = b[i + stride]; if (a[i] != 0) res = t; c[i] = res; } }
Generated in GCC 14:
.L3: ld1b z29.s, p7/z, [x0, x5] ld1w z31.s, p7/z, [x2, x5, lsl 2] ld1w z30.s, p7/z, [x1, x5, lsl 2] cmpne p15.b, p6/z, z29.b, #0 sel z30.s, p15, z30.s, z31.s st1w z30.s, p7, [x2, x5, lsl 2] add x5, x5, x4 whilelo p7.s, w5, w3 b.any .L3
And now generates:
.L3: ld1b z30.s, p7/z, [x0, x5] ld1w z31.s, p7/z, [x4, x5, lsl 2] cmpne p7.b, p7/z, z30.b, #0 st1w z31.s, p7, [x2, x5, lsl 2] add x5, x5, x1 whilelo p7.s, w5, w3 b.any .L3
Eliminating the need to perform the load of the old values.
GCC 15 now has support for detecting and using saturating instructions, both in scalar and vector form.
#include <arm_neon.h>
#include <limits.h>

#define UT unsigned int
#define UMAX UINT_MAX
#define UMIN 0
#define VT uint32x4_t

UT uadd2 (UT a, UT b)
{
  UT c;
  if (!__builtin_add_overflow(a, b, &c))
    return c;
  return UMAX;
}

void uaddq (UT *out, UT *a, UT *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      UT sum = a[i] + b[i];
      out[i] = sum < a[i] ? UMAX : sum;
    }
}
Used to vectorize as:
.L11: ld1w z30.s, p7/z, [x1, x4, lsl 2] ld1w z29.s, p7/z, [x2, x4, lsl 2] add z29.s, z30.s, z29.s cmpls p15.s, p6/z, z30.s, z29.s sel z29.s, p15, z29.s, z31.s st1w z29.s, p7, [x0, x4, lsl 2] add x4, x4, x5 whilelo p7.s, w4, w3 b.any .L11
But in GCC 15 generates with SVE:
.L6: ld1w z31.s, p7/z, [x1, x4, lsl 2] ld1w z30.s, p7/z, [x2, x4, lsl 2] uqadd z30.s, z31.s, z30.s st1w z30.s, p7, [x0, x4, lsl 2] add x4, x4, x5 whilelo p7.s, w4, w3 b.any .L6
Or with Adv. SIMD:
.L6: ldr q31, [x1, x4] ldr q30, [x2, x4] uqadd v30.4s, v31.4s, v30.4s str q30, [x0, x4] add x4, x4, 16 cmp x5, x4 bne .L6
Additional care has been taken to only use the scalar saturating instructions in cases where they would be beneficial, and to otherwise use a sequence of plain scalar statements instead.
As an example, at -O2 the scalar code above has its arguments in general-purpose registers (GPRs), and so transferring them to FPRs to use the saturating instruction would make things slower. Instead, we generate a sequence using only GPRs and conditional instructions:
.L5: ldr w5, [x1, x4] ldr w6, [x2, x4] adds w5, w5, w6 csinv w5, w5, wzr, cc str w5, [x0, x4] add x4, x4, 4 cmp x3, x4 bne .L5
Population count is a popular operation in many codebases, and GCC has idiom recognition support for the various ways that programmers write popcount. Adv. SIMD only has a popcount instruction for vectors of byte elements. GCC supports other data types by reusing the byte popcount and summing the result, but this wasn't done in an optimal way. As an example:
void bar (unsigned int *__restrict b, unsigned int *__restrict d) { d[0] = __builtin_popcount (b[0]); d[1] = __builtin_popcount (b[1]); d[2] = __builtin_popcount (b[2]); d[3] = __builtin_popcount (b[3]); }
Prior to GCC 15 this generated:
bar: ldp s29, s28, [x0] ldp s31, s30, [x0, 8] cnt v29.8b, v29.8b cnt v28.8b, v28.8b cnt v31.8b, v31.8b cnt v30.8b, v30.8b addv b29, v29.8b addv b28, v28.8b addv b31, v31.8b addv b30, v30.8b stp s29, s28, [x1] stp s31, s30, [x1, 8] ret
But in GCC 15 without SVE:
bar: ldr q31, [x0] cnt v31.16b, v31.16b uaddlp v31.8h, v31.16b uaddlp v31.4s, v31.8h str q31, [x1] ret
And with SVE enabled:
bar: ldr q31, [x0] ptrue p7.b, vl16 cnt z31.s, p7/m, z31.s str q31, [x1] ret
And with dotproduct enabled:
bar: ldr q31, [x0] movi v30.16b, 0x1 movi v29.4s, 0 cnt v31.16b, v31.16b udot v29.4s, v31.16b, v30.16b str q29, [x1] ret
32-bit Arm now supports loop tail predication, a technique that avoids needing a scalar tail for vector loops when the number of elements processed by the loop is not a multiple of the vectorization factor. This is similar to loop masking in SVE. As an example, the loop:
#include <arm_mve.h> void fn (int32_t *a, int32_t *b, int32_t *c, int n) { for (int i = 0; i < n; i+=4) { mve_pred16_t p = vctp32q (n-i); int32x4_t va = vld1q_z (&a[i], p); int32x4_t vb = vld1q_z (&b[i], p); int32x4_t vc = vaddq (va, vb); vst1q_p (&c[i], vc, p); } }
In GCC 14 generated:
.L3: vctp.32 r3 vpst vldrwt.32 q3, [r0], #16 vpst vldrwt.32 q2, [r1], #16 vadd.i32 q3, q3, q2 subs r3, r3, #4 vpst vstrwt.32 q3, [r2] adds r2, r2, #16 le lr, .L3
And with loop predication:
.L3: vldrw.32 q3, [r0], #16 vldrw.32 q2, [r1], #16 vadd.i32 q3, q3, q2 vstrw.32 q3, [r2], #16 letp lr, .L3
If not desired, -mno-dlstp disables the optimization. Tail predication begins with the `dlstp` instruction, which sets the starting count of elements to be masked as true and the `size` of the elements. The `letp` instruction reduces the count, recomputes the mask, and branches back to the start if there are any elements left to work on.
As part of the 2023 Armv9 ISA update, GCC 15 now has support for FP8. FP8 is a floating-point format where the size of the type is 8 bits, but how these 8 bits are split between the exponent and mantissa is not fixed. The current version of this supports two formats: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
Which format is in use is determined by a new system register, FPMR. The format of the parameters and result value of an instruction is determined by the contents of FPMR, and the formats of the arguments and the result type can be set independently.
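As an illustration of setting the formats independently, the fpm_t helper intrinsics used in the example further below can configure each operand separately. This sketch assumes the __arm_set_fpm_dst_format helper from the same ACLE FP8 helper family alongside the source-format helpers:

#include <arm_neon.h>

/* Illustrative sketch: build an FPMR value where the two source
   operands use different FP8 formats and the destination format is
   set separately.  __arm_set_fpm_dst_format is assumed to be
   available from the same ACLE FP8 helper family.  */
fpm_t make_mode (void)
{
  fpm_t mode = __arm_fpm_init ();
  mode = __arm_set_fpm_src1_format (mode, __ARM_FPM_E4M3);
  mode = __arm_set_fpm_src2_format (mode, __ARM_FPM_E5M2);
  mode = __arm_set_fpm_dst_format (mode, __ARM_FPM_E5M2);
  return mode;
}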
These instructions are usable through intrinsics in GCC 15, where each intrinsic takes an fpm_t value which determines the formats for the instruction being requested. Because setting FPMR before every FP8 instruction would be rather slow, GCC 15 also includes a new pass to track the liveness of this special register and optimize its usage. As an example
#include <arm_neon.h>

void foo2 (float16_t ap[20000], mfloat8x16_t b, mfloat8x16_t c)
{
  fpm_t mode = __arm_fpm_init();
  mode = __arm_set_fpm_src1_format(mode, __ARM_FPM_E5M2);
  mode = __arm_set_fpm_src2_format(mode, __ARM_FPM_E5M2);

  for (int i = 0; i < 103; i++)
    {
      float16x8_t a = vld1q_f16 (ap + 8 * i);
      a = vmlalbq_f16_mf8_fpm (a, b, c, mode);
      a = vmlalbq_f16_mf8_fpm (a, b, c, mode);
      vst1q_f16 (ap + 8 * i, a);
    }
}
foo2: mrs x2, fpmr add x1, x0, 1648 cbz x2, .L2 msr fpmr, xzr .L2: ldr q31, [x0] fmlalb v31.8h, v0.16b, v1.16b fmlalb v31.8h, v0.16b, v1.16b str q31, [x0], 16 cmp x1, x0 bne .L2 ret
Here the FPMR value doesn't change within the loop, and so the write is moved out of the loop. By default, GCC assumes that writing the FPMR register is more expensive than reading it. This can be overridden by using -moverride=tune=cheap_fpmr_write, which then generates:
foo2: add x1, x0, 1648 msr fpmr, xzr .L2: ldr q31, [x0] fmlalb v31.8h, v0.16b, v1.16b fmlalb v31.8h, v0.16b, v1.16b str q31, [x0], 16 cmp x1, x0 bne .L2 ret
Giving GCC the ability to control FP8 codegen based on costing.
GCC 15 now has support for the RCPC3 atomic instructions in libatomic which will be used automatically whenever a core supports them.
The introduction of the optional RCPC3 architectural extension for Armv8.2-A upwards provides additional support for the release consistency model, introducing the Load-Acquire RCpc Pair Ordered and Store-Release Pair Ordered operations in the form of LDIAPP and STILP.
These operations are single-copy atomic on cores which also implement LSE2 and, as such, support for them has been added to libatomic and is employed when the LSE2 and RCPC3 features are detected at runtime.
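For illustration, the kind of operation that benefits is a 16-byte atomic access with acquire or release ordering. Such accesses are routed through libatomic on AArch64, which can now pick the new instructions at runtime; the sketch below uses the GCC __atomic builtins:

#include <stdint.h>

/* 16-byte atomic accesses like these go through libatomic
   (__atomic_load_16 / __atomic_store_16), which can now use the
   RCPC3 LDIAPP/STILP instructions for the acquire/release orderings
   on cores that also implement LSE2.  */
__int128 load_acquire (__int128 *p)
{
  __int128 v;
  __atomic_load (p, &v, __ATOMIC_ACQUIRE);
  return v;
}

void store_release (__int128 *p, __int128 v)
{
  __atomic_store (p, &v, __ATOMIC_RELEASE);
}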
GCC 15 adds support for the following new CPUs supported by -mcpu,-mtune (GCC identifiers in parentheses):
* NOTE: Support for these CPUs at this time only means that when running on Linux the compiler will enable the relevant architectural features and use a reasonable cost model.
The existing Neoverse cost models have been updated as well with various tunings around such things as predicate costing.
Prior to GCC 15 the port's default L1 cache line size was 256 bytes, due to the existence of an Armv8-A core which implements this size. Consequently, on cores where the actual line size is 64 bytes we ended up with more memory pressure than needed, particularly when running large multithreaded applications.
In GCC 15 the default cache line size has been changed to 64 bytes for all Neoverse cores, and the port default for Armv9-A has been changed to 64 bytes as well. This reduces the total memory usage of threaded applications.
With GCC 15 the Neoverse cost models have been updated with the natural reassociation width of the individual cores, and support for pipelined fused multiply-accumulate (FMA) has been added. This means that, based on how many parallel FMAs the core can execute, GCC will create reassociation chains that attempt to fill the core's FMA pipelines and take advantage of accumulator forwarding and pipelining.
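As an illustration (assuming -Ofast or similar so that floating-point reassociation is permitted), a reduction like the one below forms a single serial chain of FMAs; with the per-core reassociation width GCC can split it into several independent accumulators that are summed at the end:

/* Illustrative only: a dot-product style reduction whose FMA chain
   can be split across multiple accumulators to match the number of
   FMA pipes on the target core.  */
double dot (const double *a, const double *b, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}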
In GCC 15 support for new instruction fusions in Neoverse cores was added. This includes CMP+CSEL fusion, which means that GCC will endeavor to keep compares and conditional selects together in program order so that the CPU can fuse them.
As this has no downside on cores that don't support the fusion, it is enabled by default for all cores.
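For illustration, source like the following compiles to a compare followed by a conditional select, which is exactly the pair the new tuning tries to keep adjacent so that it can be fused:

/* Illustrative only: generates a CMP followed by a CSEL.  */
int clamp_min (int x, int lo)
{
  return x < lo ? lo : x;
}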
GCC 15 supports the following architecture by -march and other source level constructs (GCC identifiers in parentheses):
And the following features are now supported by -march, -mcpu, -mtune and other source level constructs (GCC identifiers in parentheses):
The features listed as being enabled by default for Armv8.7-A or earlier were previously only selectable using the associated architecture level. For example, FEAT_FCMA was previously selected by -march=armv8.3-a and above (as it still is), but it wasn't previously selectable independently.
GCC 15 adds support for using standard C++ overloaded operators to work on SVE ACLE types.
This addition allows you to write SVE intrinsics code in a more natural way rather than having to use specific intrinsics to perform the operation. As an example:
#include <arm_neon.h> #include <arm_sve.h> svint32_t f (svint32_t a, svint32_t b) { return a * b; }
Now works and generates the expected
f: ptrue p3.b, all mul z0.s, p3/m, z0.s, z1.s ret
(New and Improved) SVE/OpenMP interoperability support
GCC 15 now has better support for SVE and OpenMP offloading including support for OpenMP parallel sections, for and lastprivate.
As an example
#include <arm_neon.h>
#include <arm_sve.h>

extern int omp_get_thread_num (void);

static void __attribute__((noipa))
vec_compare (svint32_t *x, svint32_t y)
{
  svbool_t p = svnot_b_z (svptrue_b32 (),
                          svcmpeq_s32 (svptrue_b32 (), *x, y));
  if (svptest_any (svptrue_b32 (), p))
    __builtin_abort ();
}

void __attribute__ ((noipa))
omp_firstprivate_sections ()
{
  int b[8], c[8];
  svint32_t vb, vc;
  int i;

  #pragma omp parallel for
  for (i = 0; i < 8; i++)
    {
      b[i] = i;
      c[i] = i + 1;
    }

  vb = svld1_s32 (svptrue_b32 (), b);
  vc = svld1_s32 (svptrue_b32 (), c);

  #pragma omp parallel sections firstprivate (vb, vc)
  {
    #pragma omp section
    vec_compare (&vb, svindex_s32 (0, 1));
    vec_compare (&vc, svindex_s32 (1, 1));
    #pragma omp section
    vec_compare (&vb, svindex_s32 (0, 1));
    vec_compare (&vc, svindex_s32 (1, 1));
  }
}
Now vectorizes with SVE.
GCC 15 also supports the SVE2.1 and SME2.1 extensions through both intrinsics and autovectorization.
New intrinsics such as
#include <arm_neon.h> #include <arm_sve.h> svfloat64_t foo (svfloat64_t z0, svuint64_t z4) { z0 = svtblq (z0, z4); return z0; }
Are now supported and generate the expected
foo: tblq z0.d, {z0.d}, z1.d ret
Notably SVE2.1 makes more SME and SME2 extensions available outside of streaming mode.
GCC includes a pass called IVopts (induction variable optimizations) which optimizes aspects such as the addressing modes of memory accesses, based on what the target supports and the cost of using particular addressing modes.
One of the ways the pass does this is to compare IV expressions against each other and to determine whether they are computing the same thing, or whether one could be expressed in terms of the other.
As an example, if a loop calculates a * b + 8 and a * b + 12, then the base IV can be c = a * b + 8 and the two usages become c and c + 4.
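A loop with that shape might look like the following sketch, where both addresses can be derived from a single induction variable (the names and constants here are illustrative only):

/* Illustrative only: the two stores address a * b + 8 + i and
   a * b + 12 + i.  IVopts can keep one induction variable
   c = a * b + 8 + i and express the second address as c + 4.  */
void f (char *p, long a, long b, int n)
{
  for (int i = 0; i < n; i++)
    {
      p[a * b + 8 + i]  = 1;
      p[a * b + 12 + i] = 2;
    }
}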
Prior to GCC 15 this code did not deal with the case where one of the expressions is signed and the other unsigned. If you have ((int)(a * 3) + 8) * 4U and ((a * 3) + 8) * 4U, the unsigned version would be simplified to (a * 12) + 32, and so IVopts would then need to use two IVs.
The reason for this is that simplification of the signed expression doesn’t happen due to possible UB on overflow.
Normally this is not a problem because in most languages like C and C++ address calculations are unsigned. But some languages like Fortran don’t have an unsigned type and thus use signed address calculations. The compiler has various passes, however, that can convert a signed expression into an unsigned one, for example when it knows based on context or ranges that no overflow can occur. As the compiler improved this resulted in less efficient calculations when these cases occur.
As a result, in GCC 14 the following would be generated after vectorization:
mov x27, -108 mov x24, -72 mov x23, -36 add x21, x1, x0, lsl 2 add x19, x20, x22 .L5: add x0, x22, x19 add x19, x19, 324 ldr d1, [x0, x27] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x0, x24] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x0, x23] add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5
Which ended up creating a mix of complex and simple addressing modes.
In GCC 15 IV opts was improved to perform two things:
In GCC 15 the example above generates:
.L5: ldr d1, [x19, -108] add v1.2s, v1.2s, v15.2s str d1, [x20, 216] ldr d0, [x19, -72] add v0.2s, v0.2s, v15.2s str d0, [x20, 252] ldr d31, [x19, -36] add x19, x19, 324 add v31.2s, v31.2s, v15.2s str d31, [x20, 288] bl digits_20_ cmp x21, x19 bne .L5
Which results in much faster code for some Fortran workloads.
Support for the ILP32 (32-bit integer, long and pointer mode on 64-bit Arm) ABI has been deprecated in GCC 15. Attempting to use -mabi=ilp32 will result in a warning and support for ILP32 will be removed entirely in future releases.
Notably the embedded target of AArch64 (aarch64-*-elf) no longer builds the ILP32 multilib by default.
The option -mcpu=native is used to detect the current platform the compiler is being used on and generate code that is optimized for this platform. Historically this relied on being able to detect the core identifier. A limitation of this approach, however, is that since GCC is often used as a system compiler old versions tend to be quite long lived in distros. This means that an older compiler is often used on newer CPUs and the compiler would not be able to identify the CPU.
In GCC 14, this restriction was relaxed for homogeneous systems. If the compiler couldn't identify the CPU, it would instead examine individual feature bits to enable any detected extensions and apply one of the newer, more modern costing models.
In GCC 15 this handling was extended to heterogeneous systems. The compiler now detects the common set of features across any number of cores in order to support unknown big.LITTLE systems.
As more and more code is written in C++ we've started taking a closer look at the performance of code generated from the libstdc++ standard library. The community has done a tremendous amount of work this GCC release. Below are two examples of changes made in GCC 15:
The loop implementation for std::find in libstdc++ used to always be manually unrolled. Such manual unrolling has the benefit of reducing the number of checks that have to be performed every iteration. However, modern CPUs are pretty good at speculative execution, and while unrolling still gives some benefit, the manually unrolled loop prevents further optimization of the loop.
In GCC 15 this loop is no longer manually unrolled. For searches over 8-bit quantities libstdc++ now uses glibc's memchr, which has optimized variants for individual platforms. For other sizes libstdc++ now uses a simple non-unrolled loop and instead adds a #pragma GCC unroll directive to indicate to the compiler that it should unroll the loop.
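A simplified sketch of the new shape of the loop (this is not the actual libstdc++ source, just an illustration of the approach):

/* Simplified illustration only: a plain loop plus an unroll hint,
   which leaves the loop in a form the early-break vectorizer can
   handle.  */
template <typename It, typename T>
It find_sketch (It first, It last, const T &value)
{
#pragma GCC unroll 4
  for (; first != last; ++first)
    if (*first == value)
      break;
  return first;
}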
One main benefit of this approach is that it allows the loop to be vectorized. As mentioned earlier, GCC 15 now supports vectorization of loops with early breaks and unknown buffer sizes. In other words, we support vectorization of loops such as those in std::find.
#include <bits/stdc++.h> int foo(int *arr, int n) { // Search an element 6 auto it = std::find(arr, arr + n, 6); // Print index return std::distance(arr, it); }
Used to generate
.L3: ldr w3, [x2, 4] cmp w3, 6 beq .L25 ldr w3, [x2, 8] cmp w3, 6 beq .L26 ldr w3, [x2, 12] cmp w3, 6 beq .L27 add x2, x2, 16 cmp x4, x2 beq .L28 .L8: ldr w3, [x2] cmp w3, 6 bne .L3
GCC 15 now generates
.L57: lsl x3, x5, 4 ldr q31, [x6, x3] cmeq v31.4s, v31.4s, v29.4s umaxp v31.4s, v31.4s, v31.4s fmov x3, d31 cbnz x3, .L76 add v30.2d, v27.2d, v28.2d add x5, x5, 1
And with fixed length SVE:
.L57: ld1w z31.s, p6/z, [x5, x3, lsl 2] cmpeq p15.s, p7/z, z31.s, #6 b.any .L76 add x3, x3, 4 movprfx z30, z29 add z30.d, z30.d, #16
This support also extends to all derivatives of std::find, including std::find_if, where the user can specify a custom predicate to check.
#include <bits/stdc++.h> bool is_even(int i) { return i > 5 || i < 3; } int foo(int *arr, int n) { // Search an element 6 auto it = std::find_if(arr, arr + n, is_even); // Print index return std::distance(arr, it); }
Now vectorizes as
ldr q31, [x6, x3] add v31.4s, v31.4s, v29.4s cmhi v31.4s, v31.4s, v28.4s umaxp v31.4s, v31.4s, v31.4s fmov x3, d31 cbnz x3, .L60 add v26.2d, v26.2d, v27.2d
SVE support using first faulting loads will be added in GCC 16.
In GCC 12 there was a regression of approximately 40% in the performance of hashmap->find.
This regression came about accidentally:
Before GCC 12 the find function was small enough that IPA would inline it even though it wasn't marked inline. In GCC 12 an optimization was added to perform a linear search when the number of entries in the hashmap is small.
This increased the size of the function enough that IPA would no longer inline. Inlining had two benefits:
In GCC 15 this was fixed and hashmap->find is now considered for inlining again (subject to compiler heuristics). This results in large improvements across the board for various element counts and types.
The graph tests various hashmap sizes and cases such as finding none, one or multiple values.
Additional work was also done on the probabilities of the branches inside the loop. Prior to GCC 15 the loop was generated with more branches and compares than would typically be needed, as the loop was laid out assuming the common case is that the element is not found. This makes the branches within a single region quite dense, and branch predictors typically have issues with such dense branch regions.
By changing the probabilities to assume the common case is that an element is found we get ~0-10% performance improvements and the cases where the entry is not found exhibit no slowdowns.
Suppress default Cortex-A53 errata workarounds when they don't apply
Many distros enable the Cortex-A53 errata workarounds by default when configuring GCC. These erratum fixes (-mfix-cortex-a53-835769 and -mfix-cortex-a53-843419) can significantly reduce performance. As measured on SPECCPU 2017 Intrate they account for an estimated ~1% overall loss.
Because of this, GCC 15 will now suppress them when generating code that cannot execute on CPUs containing the erratum; for example, code using SVE, since the Cortex-A53 has no support for this extension. This includes detecting a CPU that can't be a Cortex-A53 when using -mcpu=native.
Particularly when used on Neoverse CPUs this results in significantly better performance and closes the gap between distro defaults and GCC defaults.
One of the differences between SVE and Adv. SIMD is the range of immediates that instructions such as XOR, ORR and AND take. By taking advantage of these instructions and the SVE bitmask move instructions we can extend immediate generation for Adv. SIMD by using SVE instructions.
The simplest example of this is
#include <stdint.h> #include <stdlib.h> #include <arm_neon.h> int32x4_t t1 () { return (int32x4_t) { 0xffc, 0xffc, 0xffc, 0xffc }; }
Where in GCC 14 we’d generate
t1(): adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 4092 .word 4092 .word 4092 .word 4092
But in GCC 15 with SVE enabled
t1(): mov z0.s, #4092 ret
Prior to GCC 15, when vector constants were being created we would only use instructions belonging to the same instruction class as the constant's type; i.e. when creating a 32-bit integer vector constant we would only use instructions that work on 32-bit integers, or fall back to a load.
In GCC 15 we instead use any instruction that produces the same bit pattern as the constant being created.
As an example, the following
#include <arm_neon.h> uint32x4_t f1() { return vdupq_n_u32(0x3f800000); }
Used to require a literal load in GCC 14
f1(): adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 1065353216 .word 1065353216 .word 1065353216 .word 1065353216
But in GCC 15 gets created using a floating point fmov
f1(): fmov v0.4s, 1.0e+0 ret
The SVE architecture has an instruction called index that is used to generate registers with a regular sequence of numbers.
For example the sequence [0,1,..,n] can be created using index .., #0, #1 with the first constant being the starting index and the second the step.
The snippet:
typedef int v4si __attribute__ ((vector_size (16))); v4si f_v4si (void) { return (v4si){ 0, 1, 2, 3 }; }
Used to generate:
f_v4si: adrp x0, .LC0 ldr q0, [x0, #:lo12:.LC0] ret .LC0: .word 0 .word 1 .word 2 .word 3
But now, with SVE enabled, generates:
f_v4si: index z0.s, #0, #1 ret
Arm CPUs have various idioms which are considered zero latency moves. These are documented in the Arm Software Optimization guides. One such idiom is how registers are zeroed.
Prior to GCC 15 we would create zeros using the same size as the datatype being zeroed, i.e. if zeroing a vector of 4 ints we would use movi v4.4s, #0. While this works, it does mean that unless the compiler realizes that all zeros are the same we may accidentally create multiple zero values.
In GCC 15 we now create all zeros using the same base mode, which ensures that they are shared. Additionally for SVE when not in streaming mode we use Adv. SIMD instructions to zero the SVE register.
The example
#include <arm_sve.h> svint32_t f (svint32_t a, svbool_t pg) { return svsub_x (pg, a, a); }
Used to generate:
f(__SVInt32_t, __SVBool_t): mov z0.b, #0 ret
But now generates
f(__SVInt32_t, __SVBool_t): movi d0, #0 ret
GCC 15 contains various optimizations on permutes, far too numerous to mention. Below are some examples:
When a two reg TBL is performed with one operand being a zero vector we can instead use a single reg TBL and map the indices for accessing the zero vector to an out-of-range constant.
On AArch64, out-of-range indices into a TBL have defined semantics: they set the element to zero. This sequence is often generated by OpenMP, and aside from the direct performance impact of this change, it also gives better register allocation as we no longer have the consecutive-register limitation.
typedef unsigned int v4si __attribute__ ((vector_size (16))); v4si f1 (v4si a) { v4si zeros = {0,0,0,0}; return __builtin_shufflevector (a, zeros, 1, 5, 1, 6); }
Generated in GCC 14
f1(unsigned int __vector(4)): movi v30.4s, 0 adrp x0, .LC0 ldr q31, [x0, #:lo12:.LC0] mov v1.16b, v30.16b tbl v0.16b, {v0.16b - v1.16b}, v31.16b ret .LC0: .byte 4 .byte 5 .byte 6 .byte 7 .byte 20 .byte 21 .byte 22 .byte 23 .byte 4 .byte 5 .byte 6 .byte 7 .byte 24 .byte 25 .byte 26 .byte 27
Note that this optimization is not possible with SVE because at VL 2048 the index 255 is still within range.
In some circumstances GCC could not realize that it was permuting the same register and ended up copying the register to pass it to a 2 reg TBL.
This resulted in a needless register copy because two-register TBLs require consecutively numbered registers. The example
extern float a[32000], b[32000]; void s1112() { for (int i = 32000 - 1; i >= 0; i--) { a[i] = b[i] + (float)1.; } }
Generated in GCC 14:
.L2: add x2, x4, x0 add x1, x3, x0 add x2, x2, 65536 add x1, x1, 65536 sub x0, x0, #16 ldr q30, [x2, 62448] mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v29.16b fadd v30.4s, v30.4s, v28.4s mov v31.16b, v30.16b tbl v30.16b, {v30.16b - v31.16b}, v29.16b str q30, [x1, 62448] cmp x0, x5 bne .L2
But in GCC 15:
.L2: add x2, x4, x0 add x1, x3, x0 add x2, x2, 65536 add x1, x1, 65536 sub x0, x0, #16 ldr q29, [x2, 62448] tbl v29.16b, {v29.16b}, v31.16b fadd v29.4s, v29.4s, v30.4s tbl v29.16b, {v29.16b}, v31.16b str q29, [x1, 62448] cmp x0, x5 bne .L2
#include <stdint.h> #include <stdlib.h> #include <arm_neon.h> uint64x2_t foo (uint64x2_t r) { uint32x4_t a = vreinterpretq_u32_u64 (r); uint32_t t; t = a[0]; a[0] = a[1]; a[1] = t; t = a[2]; a[2] = a[3]; a[3] = t; return vreinterpretq_u64_u32 (a); }
Used to perform the reversals element-wise due to the reinterpretation of the 64-bit element vector as a 32-bit element vector:
foo(__Uint64x2_t): mov v31.16b, v0.16b ins v0.s[0], v0.s[1] ins v0.s[1], v31.s[0] ins v0.s[2], v31.s[3] ins v0.s[3], v31.s[2] ret
But in GCC 15 we now correctly generate
foo(__Uint64x2_t): rev64 v0.4s, v0.4s ret
GCC’s early register allocator pass was enhanced this year to help optimize sequences that are generated with TBLs.
typedef unsigned int v4si __attribute__((vector_size(16))); unsigned int foo (v4si *ptr) { v4si x = __builtin_shufflevector (ptr[0], ptr[2], 0, 7, 1, 5); v4si y = x ^ ptr[4]; return (y[0] + y[1] + y[2] + y[3]) >> 1; }
Used to generate additional moves around the TBLs, but in GCC 15 we generate
adrp x1, .LC0 ldr q30, [x0] ldr q31, [x0, 32] ldr q29, [x1, #:lo12:.LC0] ldr q0, [x0, 64] tbl v30.16b, {v30.16b - v31.16b}, v29.16b eor v0.16b, v0.16b, v30.16b
GCC 15 has a new late combine pass to help optimize instructions generated late in the compilation pipeline. It runs twice: once before register allocation, before the first instruction split pass, and once after register allocation.
The pass’s objective is to remove definitions by substituting them into all uses. This is particularly handy to force complex addressing modes back into loads that may have been split off earlier to try to share part of the computation. On modern Arm cores most complex addressing modes are cheap and so it’s more beneficial to use them.
extern const int constellation_64qam[64]; void foo(int nbits, const char *p_src, int *p_dst) { while (nbits > 0U) { char first = *p_src++; char index1 = ((first & 0x3) << 4) | (first >> 4); *p_dst++ = constellation_64qam[index1]; nbits--; } }
In GCC 15 this generates:
.L3: ldrb w4, [x1, x3] ubfiz w5, w4, 4, 2 orr w4, w5, w4, lsr 4 ldr w4, [x6, w4, sxtw 2] str w4, [x2, x3, lsl 2] add x3, x3, 1 cmp x0, x3 bne .L3
With the complex addressing mode guaranteed to be formed.
GCC has multiple instruction scheduling passes: one that happens before register allocation and one that happens after. Up to a third of the compilation time of an application can be taken up by early scheduling, but at lower optimization levels this does not provide any meaningful performance gain on modern out-of-order CPUs.
This is partly because modern OoO cores need far less scheduling, and partly because the early scheduler works on pseudo registers, where it's not yet able to tell that different pseudos refer to the same values. Because it lacks this context it can end up scheduling dependent instructions after, e.g., an increment. This results in the register allocator having to insert a spill to keep both the old and new values live, slowing down performance.
In GCC 15 the early scheduler is now disabled on AArch64 for optimization levels below -O3. This results in a compilation time reduction of up to ~50% and a code size reduction of ~0.2% on SPECCPU 2017.
The SVE intrinsics in GCC are mostly modelled as opaque structures in the front-end and it’s only the backend that knows what they do. A consequence of this is that simple operations working on constants were not being optimized. GCC 15 does a whole host of optimizations on SVE intrinsics, far too many to list. However, examples of optimizations being performed now are:
#include <stdint.h> #include "arm_sve.h" svint64_t s64_x_pg (svbool_t pg) { return svdiv_x (pg, svdup_s64 (5), svdup_s64 (3)); } svint64_t s64s_x_pg (svbool_t pg) { return svmul_x (pg, svdup_s64 (5), svdup_s64 (3)); } svint64_t s64_x_pg_op1 (svbool_t pg, svint64_t op2) { return svdiv_x (pg, svdup_s64 (0), op2); }
Which in GCC 14 generated
s64_x_pg: mov z31.d, #5 mov z0.d, #3 sdivr z0.d, p0/m, z0.d, z31.d ret s64s_x_pg: mov z0.d, #5 mul z0.d, z0.d, #3 ret s64_x_pg_op1: mov z31.b, #0 sdivr z0.d, p0/m, z0.d, z31.d ret
These are now optimized in GCC 15, as one would expect, to
s64_x_pg: mov z0.d, #1 ret s64s_x_pg: mov z0.d, #15 ret s64_x_pg_op1: movi d0, #0 ret
GCC 14 vectorizes CTZ for SVE as .CTZ (X) = (PREC - 1) - .CLZ (X & -X), which can be improved by using RBIT followed by CLZ. As an example, the sequence
#include <stdint.h> void ctz_uint8 (uint8_t *__restrict x, uint8_t *__restrict y, int n) { for (int i = 0; i < n; i++) x[i] = __builtin_ctzg (y[i]); }
In GCC 14 generated:
.L3: ld1b z31.b, p7/z, [x1, x3] movprfx z30, z31 add z30.b, z30.b, #255 bic z30.d, z30.d, z31.d clz z30.b, p6/m, z30.b subr z30.b, z30.b, #8 st1b z30.b, p7, [x0, x3] add x3, x3, x4 whilelo p7.b, w3, w2 b.any .L3
And in GCC 15 generates:
.L3: ld1b z31.b, p7/z, [x1, x3] rbit z31.b, p6/m, z31.b clz z31.b, p6/m, z31.b st1b z31.b, p7, [x0, x3] add x3, x3, x4 whilelo p7.b, w3, w2 b.any .L3
Some vector rotate operations can be implemented in a single instruction rather than using the fallback SHL+USRA sequence.
When the rotation is half the element size we can use a suitable REV instruction.
More generally, rotates by a byte amount can be implemented using vector permutes.
typedef unsigned long long __attribute__ ((vector_size (16))) v2di; v2di G1 (v2di r) { return (r >> 32) | (r << 32); }
In GCC 14 used to generate
G1(unsigned long long __vector(2)): shl v31.2d, v0.2d, 32 ushr v0.2d, v0.2d, 32 orr v0.16b, v0.16b, v31.16b ret
And in GCC 15
G1(unsigned long long __vector(2)): rev64 v0.4s, v0.4s ret
We can also make use of the integrated rotate step of the XAR instruction to implement most vector integer rotates, as long as we zero out one of its input registers. This allows for a lower-latency sequence than the fallback SHL+USRA, especially when we can hoist the zeroing operation out of loops and hot code.
typedef unsigned long long __attribute__ ((vector_size (16))) v2di; v2di G2 (v2di r) { return (r >> 39) | (r << 25); }
In GCC 14 generated:
G2(unsigned long long __vector(2)): shl v31.2d, v0.2d, 25 ushr v0.2d, v0.2d, 39 orr v0.16b, v0.16b, v31.16b ret
And in GCC 15 with SVE2
G2(unsigned long long __vector(2)): movi v31.4s, 0 xar z0.d, z0.d, z31.d, #39 ret
GCC 15 has a new pass to automatically detect sequences calculating CRCs and emit optimized sequences for architectures that have hardware-accelerated CRC instructions.
This pass can support two specific computations: bit-forward and bit-reversed (reflected) CRCs.
The following example
#include <stdint.h>
#include <stdlib.h>

uint32_t _crc32 (uint32_t crc, uint16_t data)
{
  int i;
  crc = crc ^ data;

  for (i = 0; i < 8; i++)
    {
      if (crc & 1)
        crc = (crc >> 1) ^ 0xEDB88320;
      else
        crc = (crc >> 1);
    }

  return crc;
}
Without the optimization this generated:
_crc32(unsigned int, unsigned short): and w1, w1, 65535 mov w3, 33568 eor w0, w1, w0 movk w3, 0xedb8, lsl 16 mov w1, 8 .L2: sbfx x2, x0, 0, 1 subs w1, w1, #1 and w2, w2, w3 eor w0, w2, w0, lsr 1 bne .L2 ret
And in GCC 15 with +crc
_crc32(unsigned int, unsigned short): and w1, w1, 65535 eor w1, w1, w0 mov w0, 0 crc32b w0, w1, w0 ret
The bit reverse sequence
#include <stdint.h>
#include <stdlib.h>

#define POLY (0x1070U << 3)
#define u8 uint8_t
#define u16 uint16_t

u8 crc8 (u16 data)
{
  int i;
  for (i = 0; i < 8; i++)
    {
      if (data & 0x8000)
        data = data ^ POLY;
      data = data << 1;
    }
  return (u8)(data >> 8);
}
Without the optimization this generated:
crc8(unsigned short): and w0, w0, 65535 mov w2, 8 mov w3, -31872 .L7: eor w1, w0, w3 tst x0, 32768 and w1, w1, 65535 csel w0, w1, w0, ne subs w2, w2, #1 ubfiz w0, w0, 1, 15 bne .L7 ubfx x0, x0, 8, 8 ret
and in GCC 15
crc8(unsigned short): mov x2, 1813 ubfx x1, x0, 8, 8 movk x2, 0x1, lsl 16 fmov d30, x1 fmov d31, x2 mov x1, 1792 ubfiz x0, x0, 8, 16 pmull v30.1q, v30.1d, v31.1d fmov d31, x1 ushr d30, d30, 16 pmull v30.1q, v30.1d, v31.1d fmov w1, s30 eor w0, w0, w1 ubfx x0, x0, 8, 8 ret
In GCC 14 the following expressions
#include <stdint.h> #include <stdlib.h> float test_ldexpf (float x, int i) { return __builtin_ldexpf (x, i); } float test_powif_1(float a, int i) { return a * __builtin_powif(2.0f, i); }
would result in calls to the library implementations of these functions
test_ldexpf(float, int): b ldexpf test_powif_1(float, int): stp x29, x30, [sp, -32]! mov x29, sp str d15, [sp, 16] fmov s15, s0 fmov s0, 2.0e+0 bl __powisf2 fmul s0, s0, s15 ldr d15, [sp, 16] ldp x29, x30, [sp], 32 ret
However, when SVE is enabled the SVE instruction FSCALE can be used to implement ldexpf and powif with a power-of-two base.
In GCC 15 with SVE enabled we generate:
test_ldexpf(float, int): fmov s31, w0 ptrue p7.b, vl4 fscale z0.s, p7/m, z0.s, z31.s ret test_powif_1(float, int): fmov s31, w0 ptrue p7.b, vl4 fscale z0.s, p7/m, z0.s, z31.s ret
And thus avoid the expensive library calls.
Continuing the trend that GCC 14 started we now use more SVE instructions to patch up inefficiencies in Adv. SIMD codegen. The following example
#include <stdint.h> int8_t M_int8_t_8[8]; void asrd_int8_t_8 () { for (int i = 0; i < 8; i++) { M_int8_t_8[i] /= 4; } }
When vectorized using Adv. SIMD used to generate
asrd_int8_t_8(): adrp x0, .LANCHOR0 movi v29.8b, 0x3 ldr d31, [x0, #:lo12:.LANCHOR0] cmlt v30.8b, v31.8b, #0 and v29.8b, v29.8b, v30.8b add v29.8b, v31.8b, v29.8b sshr v29.8b, v29.8b, 2 str d29, [x0, #:lo12:.LANCHOR0] ret
But in GCC 15 with SVE enabled will generate
asrd_int8_t_8(): adrp x0, .LANCHOR0 ptrue p7.b, vl8 ldr d31, [x0, #:lo12:.LANCHOR0] asrd z31.b, p7/m, z31.b, #2 str d31, [x0, #:lo12:.LANCHOR0] ret
GCC 15 has a new optimization for improving code layout for locality between callees and callers with LTO. The basic idea is to minimize branch distances between frequently called functions.
This is particularly useful for larger applications where you typically have a grouping of functions that work together in a closely related cluster. Keeping branch distances short can improve icache efficiency and make better use of internal CPU data structures.
This new optimization is also able to use PGO to guide this heuristic. In the absence of PGO it will try to use GCC’s static predictors to determine what the best locality could be.
The new pass can be turned on with the -fipa-reorder-for-locality flag.
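As a usage sketch (the file names here are hypothetical), the flag is intended to be used together with LTO:

$ gcc -O2 -flto -fipa-reorder-for-locality main.c util.c -o app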
Glibc now has an improved __libc_malloc, which has been split into 2 parts:
This results in significant performance gains since __libc_malloc doesn't need to set up a frame, and we delay tcache initialization and the setting of errno until later.
On Neoverse V2, bench-malloc-simple improves by 6.7% overall (up to 8.5% for ST case) and bench-malloc-thread improves by 20.3% for 1 thread and 14.4% for 32 threads.
GCC 15, along with Binutils 2.44 and Glibc 2.41, brings support for the Guarded Control Stack (GCS) extension for systems that have this hardware capability and run Linux kernel 6.13 or newer. This extension allows the use of shadow stacks on AArch64 systems. A shadow stack is maintained transparently for the programmer and has limited access, so it cannot be corrupted in the same manner as the main stack of the application. Each return from a function comes with a check of the return address in the link register against the one in the corresponding shadow stack frame. If a mismatch is detected, execution is interrupted instead of proceeding to an incorrect address.
GCS is opt-in and needs all object files and libraries of the application to be built with GCS support (similar to BTI, GCS relies on ELF marking), and it must be explicitly activated at runtime. GCS is currently not enabled by default.
To build your application with GCS support, use the `-mbranch-protection=gcs` compiler flag:
$ gcc -mbranch-protection=gcs hello.c -o hello
Of course in practice you would usually use `-mbranch-protection=standard` which includes GCS. To check that your binary is marked as supporting GCS, look at the GNU property notes:
$ readelf -n hello ... Properties: AArch64 feature: GCS ...
When running an application linked against Glibc, use the new GCS tunable `glibc.cpu.aarch64_gcs` to activate GCS runtime support. Glibc itself needs to be built with standard branch protection to support GCS; however, it will work both with applications that use GCS and with those that don't. The new tunable allows users to choose one of the following behaviours:
To run your app with one of these behaviours, use the `GLIBC_TUNABLES` environment variable:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hello
You can find more information about how Glibc enables GCS via the `prctl` system call in corresponding documentation.
It is important to make a note about `dlopen`. If GCS is enabled at the time of calling `dlopen` and the library being loaded is not GCS-marked, `dlopen` will return an error.
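For illustration (the library name below is a placeholder), an application observes this as a normal `dlopen` failure, with `dlerror` describing the reason:

#include <stdio.h>
#include <dlfcn.h>

/* Illustrative only: when GCS is active and "libplugin.so" is not
   GCS-marked, dlopen fails and returns NULL; dlerror gives the
   reason.  */
int main (void)
{
  void *handle = dlopen ("libplugin.so", RTLD_NOW);
  if (handle == NULL)
    fprintf (stderr, "dlopen failed: %s\n", dlerror ());
  return 0;
}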
In multi-threaded applications each thread will have its own shadow stack.
With return addresses being carefully checked against the shadow stack, it is important that the application does not manipulate its stack frame records in a way that is not PCS compliant. However, some tricky cases are already covered: Glibc 2.41 or newer has GCS-friendly implementations of `setjmp`, `longjmp`, `makecontext`, `setcontext`, `swapcontext`, and `vfork`. It is expected that for many applications and libraries GCS will be transparent and should not require any code changes.
We will demonstrate the effects of GCS in practice using an over-simplified example adapted from the Stack Buffer Overflow learning path. So, let's write some code.
GCC 15 implements ACLE builtins relevant to GCS; one of them is `__chkfeat`, which allows runtime checks for certain CPU features. Under the hood it is just a few instructions, so these checks are quick. To use it, include the `arm_acle.h` header:
// hack.c
#include <stdio.h>
#include <arm_acle.h>

int main (int argc, char *argv[])
{
  /* Show if GCS is enabled (uses ACLE). */
  if (__chkfeat (_CHKFEAT_GCS))
    printf ("GCS enabled\n");
  else
    printf ("GCS not enabled\n");
  return 0;
}
Build this with GCS branch protection:
$ aarch64-none-linux-gnu-gcc hack.c -mbranch-protection=gcs -o hack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu
And run using GCS tunable:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hack GCS enabled
and (notice value `0` that we use this time):
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=0 ./hack GCS not enabled
If you don't have access to hardware that supports GCS, you can run these examples using a tool called Shrinkwrap, which allows you to run programs on Arm Fixed Virtual Platform (FVP) models in a very user-friendly way.
Next we will write a program that has some explicit bugs and implements a ROP attack via stack buffer overflow, overwriting the return address of the `main` function and achieving execution of malicious code. This example is very crude for the sake of being brief and simple, but it will allow us to see how this program behaves with and without GCS.
When GCS detects a mismatch of return addresses, the program receives `SIGSEGV`, so to catch this we will set up a signal handler (some code omitted for brevity):
// hack.c
#include <signal.h>
<...>

/* We'll arrive here when GCS catches mismatch of return addresses. */
static void handler (int)
{
  puts ("something's phishy... exiting");
  exit (0);
}

int main (int argc, char *argv[])
{
  <...>
  /* Setup signal handler for GCS errors. */
  signal (SIGSEGV, handler);
  <...>
}
Now let's introduce our buggy code that may result in stack buffer overflow.
// hack.c
#include <stdlib.h>
<...>

/* Buffer which attacker managed to write to. */
unsigned long buffer[3] = {};

/* To make example simpler we use this crude mem copy implementation. */
static void copy (char *dst, const char *src, size_t len)
{
  const char *end = src + len;
  while (src < end)
    *dst++ = *src++;
}

/* Buggy function. */
static int fun (const char *src, size_t len)
{
  /* This is a bug as LEN can be more than 8. */
  char local[8];
  copy (local, src, len);
  return *(int *)local;
}

int main (int argc, char *argv[])
{
  <...>
  /* Do stuff.  After hack instead of returning from main we will jump
     to `hacked` due to return address corruption. */
  return fun ((const char *)buffer, sizeof (buffer));
}
Copying into the `local` buffer that is allocated on the stack may result in writing past the end of this buffer, which means we may overwrite other data stored on the stack above it.
Now for the hacker bits:
/* Pointer to data of attacker's choice. */
const char command[] = "whoami";

/* Code that attacker wants to run. */
static void hacked (const char *command)
{
  printf ("you've been hacked! executing `%s`\n", command);
  exit (1);
}
and in the `main` function, after we have installed the signal handler and before we call our `fun` function, let's use the arbitrary write gadget that the attacker managed to obtain:
int main (int argc, char *argv[])
{
  <...>
  /* Attacker uses their arbitrary write gadgets. */
  buffer[0] = (unsigned long)command;
  buffer[2] = (unsigned long)&hacked;
  <...>
}
Putting it all together and compiling it (note that this example is sensitive to code generation, which is why we use a specific optimisation level, `-O0`, below):
$ aarch64-none-linux-gnu-gcc hack.c -O0 -mbranch-protection=gcs -o hack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu
And run using GCS tunable with GCS disabled:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=0 ./hack GCS not enabled you've been hacked! executing `whoami`
and (notice value `1` that we use this time):
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./hack GCS enabled something's phishy... exiting
With the help of GCS we managed to prevent execution of malicious code and handle the unexpected situation appropriately. You can of course run this example on a non-GCS AArch64 system, but GCS will not be enabled due to lack of hardware GCS support.
The guarded control stack is used under the hood for tracking return addresses but it can also be useful for other things. While userspace applications are not allowed to write to shadow stack directly, we can read from it, and this may be handy for gathering call stacks for debugging and profiling purposes in a potentially faster and more robust way than via previously available tools.
Let's write a simple example. We will use the new ACLE builtin `__gcspr`, which returns the shadow stack pointer. We will also use `AT_HWCAP` to check if the system we are running on supports GCS at all (otherwise the following code would not be able to run). Even if the system supports GCS, it doesn't mean that GCS is enabled at runtime; to check this we will use, as before, the `__chkfeat` ACLE builtin. If GCS is not enabled, the shadow stack pointer will be null. Finally, if all is good and GCS is enabled, we'll be able to access a stack of return addresses and print them in order (most recent call first).
// callstack.c
#include <stdio.h>
#include <arm_acle.h>
#include <sys/auxv.h>

int main(int argc, char const *argv[])
{
  /* Check if system may support GCS at all. */
  if (!(getauxval (AT_HWCAP) & HWCAP_GCS))
    {
      printf ("GCS is not supported on this system\n");
      return 1;
    }

  /* Get GCS pointer. */
  const unsigned long *gcspr = __gcspr ();

  /* Check if GCS is enabled. */
  if (!__chkfeat (_CHKFEAT_GCS))
    {
      printf ("GCS is not enabled so GCS pointer is null: %016lx\n",
              (unsigned long)gcspr);
      return 2;
    }
  else
    {
      printf ("GCS is enabled so GCS pointer is not null: %016lx\n",
              (unsigned long)gcspr);
    }

  /* Print callstack. */
  printf ("callstack:\n");
  do
    {
      printf ("  - return address: %08lx\n", *gcspr);
    }
  while (*gcspr++);

  return 0;
}
Build with standard branch protection:
$ aarch64-none-linux-gnu-gcc callstack.c -mbranch-protection=standard -o callstack \ --sysroot=/path/to/sysroot/aarch64-none-linux-gnu -static
And run, as before with one of the values for the GCS tunable, for example:
$ GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1 ./callstack
Use `objdump` to check disassembly of this program to confirm the return addresses shown in the output:
$ objdump -D callstack | less
GCC 15 was a significant step forward in support for Arm architectures and optimizations. As work on the internal structure of the compiler continues, the improvements made with each release will keep growing. Look out for much more to come in GCC 16! In the meantime, check out the previous year's entry for GCC 14.