Another benefit of the new load/store pass is that, when generating code, we no longer have to explicitly generate the load/store pair instructions; we can instead rely on the optimization pipeline to form them correctly.
Explicitly generating these special AArch64 instructions was problematic because passes that perform memory-based optimizations (for example, dead load/store elimination) would not know what the instructions are doing.
This often led to the compiler generating inefficient code for common memmove idioms:
void bar(char * a, char * b)
{
  char buffer[32];
  __builtin_memcpy(buffer, b, 32);
  __builtin_memcpy(a, buffer, 32);
}
Generates in GCC 13:
bar(char*, char*):
        ldp     q0, q1, [x1]
        sub     sp, sp, #32
        stp     q0, q1, [x0]
        stp     q0, q1, [sp]
        add     sp, sp, 32
        ret
And in GCC 14:
bar(char*, char*):
        ldp     q31, q30, [x1]
        sub     sp, sp, #32
        stp     q31, q30, [x0]
        add     sp, sp, 32
        ret
Speaking of memmove, GCC 14 now also inlines small memmoves, similarly to memcpy. For example, the following:
void foo(char * a, char * b)
{
  __builtin_memmove(a, b, 32);
}
No longer generates a library call like GCC 13 and earlier did:
foo(char*, char*):
        mov     x2, 32
        b       memmove
But now inlines the sequence resulting in faster memory moves:
foo(char*, char*):
        ldp     q31, q30, [x1]
        stp     q31, q30, [x0]
        ret
GCC 14 also adds support for the SME and SME2 architecture extensions. These extensions allow users to write routines that perform various matrix operations on large datasets. SME2 is a superset of SME, which in turn builds on SVE. However, unlike with other extensions, not all SVE and Advanced SIMD instructions are available while in SME streaming mode. For details, consult the SME documentation: https://developer.arm.com/documentation/ddi0616/latest/
The following shows an example of using SME to perform an outer product operation:
#include <arm_sme.h>

void f (svuint8_t z0, svint8_t z1, svint8_t z5, svint8_t z6, svint8_t z7,
        svbool_t p0, svbool_t p1) __arm_streaming __arm_inout("za")
{
  svusmopa_za32_u8_m (0, p0, p1, z0, z1);
}
Which generates:
f:
        usmopa  za0.s, p0/m, p1/m, z0.b, z1.b
        ret
More on Scalable Matrix Extension
As part of the implementation of SME/SME2 a new AArch64 pass has been added to deal with strided vector register operands in SME. The pass runs in addition to the standard register allocator.
This pass is enabled by default in order to fix some of the register allocation issues with Advanced SIMD intrinsics that occur when using instructions that require multiple consecutive registers, such as LD2/3/4.
As an example the following Advanced SIMD code:
#include <arm_neon.h>

int16x8x3_t bsl (const uint16x8x3_t *check, const int16x8x3_t *in1,
                 const int16x8x3_t *in2)
{
  int16x8x3_t out;
  for (uint32_t j = 0; j < 3; j++)
    out.val[j] = vbslq_s16 (check->val[j], in1->val[j], in2->val[j]);
  return out;
}
Used to generate:
bsl:
        ldp     q6, q16, [x1]
        ldp     q0, q4, [x2]
        ldp     q5, q7, [x0]
        bsl     v5.16b, v6.16b, v0.16b
        ldr     q0, [x2, 32]
        ldr     q6, [x1, 32]
        mov     v1.16b, v5.16b
        ldr     q5, [x0, 32]
        bsl     v7.16b, v16.16b, v4.16b
        bsl     v5.16b, v6.16b, v0.16b
        mov     v0.16b, v1.16b
        mov     v1.16b, v7.16b
        mov     v2.16b, v5.16b
        ret
But now generates:
bsl:
        ldr     q31, [x0, 32]
        ldr     q30, [x1, 32]
        ldr     q2, [x2, 32]
        ldp     q6, q4, [x0]
        ldp     q5, q3, [x1]
        ldp     q0, q1, [x2]
        bit     v2.16b, v30.16b, v31.16b
        bit     v0.16b, v5.16b, v6.16b
        bit     v1.16b, v3.16b, v4.16b
        ret
The remaining issues are planned to be addressed in GCC 15.
One common issue for vectorization is dealing with function calls in the middle of the code we want to vectorize. Such calls completely block vectorization, and we lose the speedup it typically gives.
As an example, the following sequence fails to vectorize in older GCCs because of the use of the cosf math function:
#include <math.h>

float double_sum (float *x, int n)
{
  float res = 0;
  for (int i = 0; i < (n & -8); i++)
    res += cosf (x[i] * x[i]);
  return res;
}
This is rather unfortunate: in such a trivial example the gains may not be much, but in larger code a single function call can block vectorization of the entire function.
GCC 7 through 14 now support vectorizing such calls with Advanced SIMD, provided a new enough glibc is used.
The example above now generates:
.L3:
        ldp     q0, q31, [x19], 32
        fmul    v23.4s, v31.4s, v31.4s
        fmul    v0.4s, v0.4s, v0.4s
        bl      _ZGVnN4v_cosf
        mov     v31.16b, v0.16b
        mov     v0.16b, v23.16b
        mov     v23.16b, v31.16b
        bl      _ZGVnN4v_cosf
        fadd    v22.4s, v22.4s, v23.4s
        fadd    v21.4s, v21.4s, v0.4s
        cmp     x20, x19
        bne     .L3
As can be seen, these new SIMD function calls have a special name mangling that differs from the normal AArch64 mangling. This mangling comes with a new procedure call standard (PCS) that is specifically designed to be efficient in vectorized loops. One major change is the number of caller- and callee-saved registers involved in such a call. This new PCS is specified in the Arm ABI documentation.
Read more on Arm Vector PCS
This ABI has been implemented in GCC as far back as GCC 7, but the missing part of the story was how to inform the compiler of the availability of the vector math routines. There is no magic involved: we use existing OpenMP annotations to decorate the header files, which, when parsed by the compiler, inform the vectorizer of the availability of the functions.
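As a rough sketch (the exact glibc declarations may differ in detail), a math header can advertise a vector variant of a scalar routine either through GCC's simd attribute or through the OpenMP declare simd pragma:

/* Sketch only: the exact form used by glibc's headers may differ.  */

/* GCC attribute form, honoured without any extra flags.  */
__attribute__ ((__simd__ ("notinbranch"))) float cosf (float);

/* OpenMP form, honoured with -fopenmp or -fopenmp-simd.  */
#pragma omp declare simd notinbranch
float sinf (float);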
These annotations and the associated vector math functions have now been added to glibc, which is why older GCC releases also benefit from work done years ago.
As there is nothing special built into the compiler for this, it also means that users can provide their own vector functions for the compiler to use in their own code or libraries.
For example, the following declares and uses a new vector function my_cos:
#include <arm_neon.h>

__attribute__ ((__simd__ ("notinbranch"), const)) double my_cos (double);

void foo (float *a, double *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      b[i] = my_cos (5.0 * a[i]);
    }
}

__attribute__ ((aarch64_vector_pcs)) float64x2_t
_ZGVnN2v_my_cos (float64x2_t s)
{
  // Do fancy cos calculation
  return s;
}
One thing to note is that by declaring support for a particular function, you must provide implementations for all possible vector sizes for that type; for a single-precision function that means both float32x4_t and float32x2_t.
Doing this is quite simple, and GCC has annotations to do the name mangling properly and, when using C++, to inform the compiler that the declared function uses the vector PCS.
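For a hypothetical single-precision function, providing both variants would look roughly like the sketch below; my_cosf, its placeholder bodies, and the mangled names (which follow the same vector function ABI scheme as in the earlier example) are purely illustrative:

#include <arm_neon.h>

/* Hypothetical single-precision function: declaring it as a SIMD function
   means both the 2-lane (64-bit) and 4-lane (128-bit) Advanced SIMD
   variants must be provided.  */
__attribute__ ((__simd__ ("notinbranch"), const)) float my_cosf (float);

__attribute__ ((aarch64_vector_pcs)) float32x2_t
_ZGVnN2v_my_cosf (float32x2_t s)
{
  // Do fancy cos calculation
  return s;
}

__attribute__ ((aarch64_vector_pcs)) float32x4_t
_ZGVnN4v_my_cosf (float32x4_t s)
{
  // Do fancy cos calculation
  return s;
}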
Measuring these changes on Neoverse V1 shows the following improvements in SPEC CPU 2017 fprate:
Libatomic in GCC 14 is extended with support for LSE128 atomics through ifuncs. This means that binaries using the new libatomic are automatically accelerated with the new sequences, without any change needed by the user, while preserving support for architectures without these extensions.
Among these additions are 128-bit atomic operations such as fetch_and/fetch_or and a 128-bit swap instruction.
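For example, a 128-bit atomic operation can be written with the usual __atomic built-ins on unsigned __int128; when linked against the GCC 14 libatomic (-latomic), these calls can be dispatched via ifuncs to the LSE128 instructions on CPUs that support them (the helper names below are just illustrative):

/* 128-bit atomics go through libatomic (link with -latomic).  With the
   GCC 14 libatomic these operations can be dispatched, via ifuncs, to the
   LSE128 instructions on CPUs that support them.  */
unsigned __int128
atomic_and_128 (unsigned __int128 *p, unsigned __int128 mask)
{
  return __atomic_fetch_and (p, mask, __ATOMIC_SEQ_CST);
}

unsigned __int128
atomic_swap_128 (unsigned __int128 *p, unsigned __int128 val)
{
  return __atomic_exchange_n (p, val, __ATOMIC_SEQ_CST);
}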
GCC 14 adds support for the following new Arm cores (-mcpu/-march/-mtune values between the brackets):
This release also adds support for the new generic tunings generic-armv8-a and generic-armv9-a.
Improvements in various optimization passes in GCC 13 ended up exposing a missing optimization in GCC's if-conversion. Because of this, workloads with many nested branches end up with a significant slowdown when if-converted and vectorized, compared to GCC 12 or earlier.
To illustrate this, the testcase:
void foo (int *f, int d, int e)
{
  for (int i = 0; i < 1024; i++)
    {
      int a = f[i];
      int t;
      if (a < 0)
        t = 1;
      else if (a < e)
        t = 1 - a * d;
      else
        t = 0;
      f[i] = t;
    }
}
Generated in GCC 12:
a_10 = *_3;
_44 = (unsigned int) a_10;
_46 = _44 * _45;
_48 = 1 - _46;
t_13 = (int) _48;
_18 = a_10 < 0;
_7 = a_10 >= 0;
_20 = a_10 < e_11(D);
_21 = _7 & _20;
_25 = a_10 >= e_11(D);
_26 = _7 & _25;
_ifc__42 = a_10 < 0 ? 1 : t_13;
_ifc__43 = _21 ? t_13 : _ifc__42;
t_6 = _26 ? 0 : _ifc__43;
*_3 = t_6;
Here we have 4 different comparisons to navigate through the if-else statements inside the loop. However, this is highly inefficient; we in fact only require two comparisons, since we do not need to test both a condition and its inverse.
When the previous code is vectorized, these 4 comparisons are materialized in the resulting code, which drastically slows down the vector loop:
.L2:
        ldr     q0, [x0]
        mov     v3.16b, v6.16b
        mls     v3.4s, v0.4s, v7.4s
        cmge    v4.4s, v0.4s, #0
        cmlt    v2.4s, v0.4s, #0
        cmgt    v1.4s, v5.4s, v0.4s
        cmge    v0.4s, v0.4s, v5.4s
        bsl     v2.16b, v6.16b, v3.16b
        and     v1.16b, v1.16b, v4.16b
        and     v0.16b, v0.16b, v4...
Because of how late this was discovered in the GCC 13 development cycle we were unable to fix it fully. However, a small improvement was made following the observation that when traversing nested conditions like the above we get several "paths" through the control flow. If we have 3 paths, we only need to test 2 of them: if neither has been taken, the third one must implicitly be true.
This means we can simply not test the last condition which results in the following CFG:
a_10 = *_3;
_7 = a_10 < 0;
_21 = a_10 >= 0;
_22 = a_10 < e_11(D);
_23 = _21 & _22;
_ifc__42 = _23 ? t_13 : 0;
t_6 = _7 ? 1 : _ifc__42;
*_3 = t_6;
And the following codegen:
.L2:
        ldr     q0, [x0]
        mov     v1.16b, v3.16b
        cmge    v2.4s, v0.4s, #0
        cmgt    v4.4s, v5.4s, v0.4s
        mls     v1.4s, v0.4s, v6.4s
        cmlt    v0.4s, v0.4s, #0
        and     v2.16b, v2.16b, v4.16b
        and     v1.16b, v1.16b, v2.16b
        bsl     v0.16b, v3.16b, v1.16b
        str     q0, [x0], 16
        cmp...
This has removed one comparison, but we still check both _7 and its inverse _21. In GCC 14 we finally fix this by tracking truth values. In essence, we keep a set of known "truths" for a particular path, and when we encounter a condition we evaluate it in the context of the known predicates.
Thus, if we have a path gated by condA and we take the branch where condA is true, we insert condA=True into the map. If we go down the opposite branch, we add ~condA=True. This allows us to drop inverse comparisons, and in GCC 14 we generate:
a_10 = *_3;
_7 = a_10 < 0;
_43 = (unsigned int) a_10;
_45 = _43 * _44;
_47 = 1 - _45;
t_13 = (int) _47;
_22 = a_10 < e_11(D);
_ifc__42 = _22 ? t_13 : 0;
t_6 = _7 ? 1 : _ifc__42;
*_3 = t_6;
And we get the optimal number of comparisons:
.L2:
        ldr     q28, [x0]
        mov     v27.16b, v29.16b
        mls     v27.4s, v28.4s, v31.4s
        cmgt    v26.4s, v30.4s, v28.4s
        cmlt    v28.4s, v28.4s, #0
        and     v26.16b, v27.16b, v26.16b
        bsl     v28.16b, v29.16b, v26.16b
        str     q28, [x0], 16
        cmp     x1, x0
        bne     .L2
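The truth tracking described above can be sketched, very loosely, as follows; this is a toy model for illustration and not GCC's internal representation:

#include <stdbool.h>

/* Toy model of the truth tracking described above (not GCC internals).
   A predicate is represented by an integer id, its inverse by the negated
   id.  Walking down a path records what is known to hold; a condition whose
   value is already implied on that path does not need to be tested again. */
#define MAX_TRUTHS 16

struct truths
{
  int known[MAX_TRUTHS];
  int n;
};

/* Record that PRED is known true on the current path.  */
static void
assume (struct truths *t, int pred)
{
  if (t->n < MAX_TRUTHS)
    t->known[t->n++] = pred;
}

/* Return false if PRED or its inverse is already known, i.e. no comparison
   needs to be emitted for it.  */
static bool
must_test (const struct truths *t, int pred)
{
  for (int i = 0; i < t->n; i++)
    if (t->known[i] == pred || t->known[i] == -pred)
      return false;
  return true;
}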
The curious reader might be wondering how we choose the order of the paths, to know which one we do not have to test. More importantly, does it matter which path we drop?
The answer is yes: when dropping a path, it is beneficial not to test the most complicated path, as that saves us the most work. In GCC 13 this path was determined by looking at the number of occurrences of an SSA value after all the paths converge. Put differently, we count the number of occurrences of each value inside the PHI node in the merge block and sort by increasing occurrence, the idea being that if a value can be produced from multiple paths, we must do extra work to disambiguate between them.
This however does not consider dependency chains and does not work when all the values in the PHI node are unique. In GCC 14 we additionally count the number of Boolean operators inside each condition and sort based on a tuple of (occurrences, num_of_operations).
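A rough sketch of that ordering (illustrative only, not GCC's actual implementation) is shown below: paths sort by increasing PHI occurrences and, as a tie-breaker, by the number of Boolean operations, so the last path after sorting is the most expensive one and is the one we avoid testing.

#include <stdlib.h>

/* Illustrative only: order if-conversion paths by (occurrences, num_of_ops)
   so that the most expensive path ends up last and is the one not tested. */
struct path_cost
{
  int occurrences;   /* uses of the path's value in the merge-block PHI  */
  int num_of_ops;    /* Boolean operators in the path's condition        */
};

static int
cmp_path_cost (const void *a, const void *b)
{
  const struct path_cost *pa = a, *pb = b;
  if (pa->occurrences != pb->occurrences)
    return pa->occurrences - pb->occurrences;
  return pa->num_of_ops - pb->num_of_ops;
}

/* Usage: qsort (paths, n_paths, sizeof (struct path_cost), cmp_path_cost); */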
As an example, for deeply nested compares such as:
_79 = vr_15 > 20;
_80 = _68 & _79;
_82 = vr_15 < -20;
...
_87 = vr_15 >= -20;
_88 = _73 & _87;
_ifc__111 = _55 ? 10 : 12;
_ifc__112 = _70 ? 7 : _ifc__111;
_ifc__113 = _85 ? 8 : _ifc__112;
_ifc__114 = _88 ? 9 : _ifc__113;
_ifc__115 = _45 ? 1 : _ifc__114;
_ifc__116...
We now generate one less compare, but crucially the longest dependency chain is also shorter. Prior to GCC 14, we would generate 5 compares as the longest chain:
        cmple   p7.s, p4/z, z29.s, z30.s
        cmpne   p7.s, p7/z, z29.s, #0
        cmple   p6.s, p7/z, z31.s, z30.s
        cmpge   p6.s, p6/z, z27.s, z25.s
        cmplt   p15.s, p6/z, z28.s, z21.s
and from GCC 14 we generate:
        cmple   p7.s, p3/z, z27.s, z30.s
        cmpne   p7.s, p7/z, z27.s, #0
        cmpgt   p7.s, p7/z, z31.s, z30.s
The resulting condition, (x <= y) && (x != 0) && (z > y), cannot be reduced further.
GCC 14 added support for the _BitInt(N) C2x standard type to support arbitrary precision integers. To accommodate this new type we had to design and implement a new AArch64 ABI.
AArch64 BitInt ABI
The following example shows how to create and use a 125-bit integer and perform addition on it:
_BitInt (125) foo (unsigned _BitInt(125) a, unsigned _BitInt(125) *p)
{
  return a + *p;
}
Note that a target may not support every precision. To aid with this, the compiler defines the macro __BITINT_MAXWIDTH__ to denote the maximum _BitInt precision it supports. To be complete, the above example is better written as:
#if __BITINT_MAXWIDTH__ >= 125
_BitInt (125) foo (unsigned _BitInt(125) a, unsigned _BitInt(125) *p)
{
  return a + *p;
}
#endif
For the given example, GCC 14 generates:
foo:
        ldp     x3, x2, [x2]
        and     x4, x1, 2305843009213693951
        adds    x3, x3, x0
        and     x1, x2, 2305843009213693951
        adc     x1, x1, x4
        and     x0, x3, 2305843009213693951
        extr    x1, x1, x3, 61
        orr     x0, x0, x1, lsl 61
        asr     x1, x1, 3
        ret
Zero extension is a common operation, which on AArch64 is typically done for vectors using the uxtl instruction. On most Arm microarchitectures this instruction is throughput limited; put differently, we cannot execute as many of them in parallel as we would like.
However, conceptually zero extension simply interleaves zeros with the data lanes. That is, zero extending the low half of a vector of 4 shorts to 2 ints requires interleaving a zero with each of the bottom two elements, which is essentially what the zip permutes do on AArch64. As we have covered before, most modern Arm microarchitectures give us a way to create a vector of zeros for free. Since permutes can use the whole vector unit, replacing the throughput-limited instructions with zips gives us much better performance overall.
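Written by hand with intrinsics, the trick looks roughly like this (little-endian assumed; widen_u16_to_u32 is just an illustrative helper, not something the compiler emits literally):

#include <arm_neon.h>

/* Illustrative helper: interleaving with a vector of zeros (zip1/zip2)
   produces the same result as the uxtl/uxtl2 widening instructions on a
   little-endian target.  */
void
widen_u16_to_u32 (uint32x4_t *lo, uint32x4_t *hi, uint16x8_t v)
{
  uint16x8_t zeros = vdupq_n_u16 (0);  /* free on most modern Arm cores  */
  *lo = vreinterpretq_u32_u16 (vzip1q_u16 (v, zeros));
  *hi = vreinterpretq_u32_u16 (vzip2q_u16 (v, zeros));
}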
In versions before GCC 14 this loop:
void f (int *b, unsigned short *a, int n)
{
  for (int i = 0; i < (n & -4); i++)
    b[i] = a[i];
}
Generates:
.L5:
        ldr     q0, [x4], 16
        uxtl    v1.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        stp     q1, q0, [x3]
        add     x3, x3, 32
        cmp     x5, x4
        bne     .L5
while in GCC 14 we now generate:
        movi    v31.4s, 0
.L5:
        ldr     q30, [x4], 16
        zip1    v29.8h, v30.8h, v31.8h
        zip2    v30.8h, v30.8h, v31.8h
        stp     q29, q30, [x3], 32
        cmp     x4, x5
        bne     .L5
With the improved autovectorization in GCC 14 we also wanted to provide the user with a way to explicitly tell the vectorizer not to vectorize a loop. In previous incarnations of the compiler users typically did this by adding an assembly statement inside the loop:
void f (int *b, unsigned short *a, int n)
{
  for (int i = 0; i < (n & -4); i++)
    {
      asm ("");
      b[i] = a[i];
    }
}
This aborts vectorization because the vectorizer cannot see what is happening inside the assembly block and so cannot know whether vectorization is safe. Unfortunately, this approach has two big downsides: the asm statement can have other effects on code generation, and it can require you to add explicit braces to your loop if they were not already there.
In GCC 14, we provide a new pragma to explicitly state to the vectorizer not to vectorize a loop:
void f (int *b, unsigned short *a, int n)
{
#pragma GCC novector
  for (int i = 0; i < (n & -4); i++)
    b[i] = a[i];
}
The pragma needs to be placed immediately before the loop.
In Part 3, we talk about the following topics:
Read GCC 14 Part 3
If you missed part 1, read here.