Part 3: To bigger things in GCC-14

Tamar Christina
May 8, 2024
15 minute read time.
This year's release blog has been split into a three-part series. This is part 3.
See part 1 and part 2

Smarter SIMD constant generation

Advanced SIMD lacks some of the instructions that SVE can use to create bitmask patterns. For Advanced SIMD, we are often required to use creative means to generate vector constant immediates. In GCC 14, we added a small framework to allow us to generate these special constants more easily than before.

An example of such an immediate is one with the top bit set in each 64-bit lane of a vector register. Such patterns are typically used to manipulate the sign bit of vectors of doubles:

void g (double *a, int n)
{
  for (int i = 0; i < n; i++)
     a[i] = copysign (a[i], -1);
}

Prior to GCC 14, this used to generate:

.L13:
    ldr   q0, [x2]
    fabs  v0.2d, v0.2d
    fneg  v0.2d, v0.2d
    str  q0, [x2], 16
    cmp  x3, x2
    bne  .L13

but with GCC 14 we now generate:

    movi  v31.4s, 0
    fneg  v31.2d, v31.2d
.L9:
    ldr   q30, [x2]
    orr   v30.16b, v30.16b, v31.16b
    str   q30, [x2], 16
    cmp   x2, x3
    bne   .L9

COPYSIGN improvements

The example above also shows that for the pattern fneg (fabs (..)) we should be generating the same code as copysign.

In GCC, we made several changes, including recognizing this pattern as a copysign operation. This operation is quite common in scientific code. Concretely, we now generate for:

void g (double *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = -fabs (a[i]);
}

The same codegen as the above copysign example:

.L9:
    ldr   q30, [x2]
    orr   v30.16b, v30.16b, v31.16b
    str   q30, [x2], 16
    cmp   x3, x2
    bne   .L9

Combining SVE and Advanced SIMD

As mentioned before, SVE typically has better immediate ranges in vector instructions than Advanced SIMD has. Part of the costing the compiler does is determining whether it should autovectorize using Advanced SIMD or SVE. In previous releases, we have mentioned that better performance can often be obtained by using a mix of SVE and Advanced SIMD loops. A typical example is an unrolled Advanced SIMD main loop followed by an SVE loop as the epilogue.

In GCC 14, we continue pushing towards the trend of mixing SVE and Advanced SIMD code. When you are compiling code for an SVE-capable system but the compiler’s costing has determined that Advanced SIMD is the best ISA to use, we can now use a few SVE instructions to enhance the Advanced SIMD code.

The example above, compiled on an SVE-enabled system, now generates:

.L9:
    ldr   q31, [x2]
    orr   z31.d, z31.d, #-9223372036854775808
    str   q31, [x2], 16
    cmp   x3, x2
    bne   .L9

This uses an SVE bitwise OR with an immediate instead of the Advanced SIMD variant, which would require us to first construct the immediate in a register.

Another example is the optimization of 128-bit vector integer divides. Advanced SIMD lacks an instruction for doing vector division of 64-bit integers:

#include <arm_neon.h>

int64x2_t foo2 (int64x2_t a, int64x2_t b)
{
  return a / b;
}

is compiled in GCC 13 to:

foo2:
    fmov  x0, d0
    fmov  x3, d1
    umov  x1, v0.d[1]
    umov  x2, v1.d[1]
    sdiv  x0, x0, x3
    sdiv  x1, x1, x2
    fmov  d0, x0
    ins   v0.d[1], x1
    ret

but in GCC 14 on an SVE-capable system it is compiled to:

foo2:
    ptrue p7.b, all
    sdiv  z0.d, p7/m, z0.d, z1.d
    ret

There are several such cases where we mix SVE instructions into Advanced SIMD code, and this is a trend that will continue to grow in the future.

Neon SVE bridge

Mixing SVE and Advanced SIMD is something users are expected to want as well, and so Arm has developed an SVE-NEON bridge as part of the ACLE.

Read more about SVE-NEON ACLE bridge

This new interface can be used through the inclusion of the arm_neon_sve_bridge.h header file. The intrinsics defined here allow you to convert between SVE and Advanced SIMD types and then carry on using the normal SVE or Advanced SIMD intrinsics:

#include <arm_neon_sve_bridge.h>

void f (int *a, int *b)
{
  int32x4_t va = vld1q_s32 (a);
  svint32_t za = svset_neonq_s32 (svundef_s32 (), va);
  svint32_t zo = svorr_n_s32_x (svptrue_b32 (), za, 1 << 31);
  int32x4_t res = svget_neonq_s32 (zo);
  vst1q_s32 (b, res);
}

This generates code equivalent to the copysign example above. The loads are performed using Advanced SIMD instructions, and we then apply SVE instructions to the data held in the Advanced SIMD registers.

This bridge allows for seamless conversion between the two ISAs. The special SVE intrinsic svundef can be used to avoid explicitly setting the upper bits of the SVE register, allowing direct usage of the register overlap characteristics of the AArch64 register file.

New generic cost models

Prior to GCC 14, the default cost model for the compiler (that is, what you get if no -mcpu/-mtune is used) was based on the Cortex-A57. This cost model is quite old, and what was inefficient or did not matter for that specific micro-architecture is no longer accurate for modern micro-architectures.

This meant that GCC's default code generation was suboptimal for the majority of users today. This becomes especially evident when comparing GCC against other AArch64 compilers without changing the tuning model.

The tuning models in GCC are rather complex, but one big difference between the old and new tuning frameworks is that the old framework only modelled latency and did not consider throughput. One example of this is unrolling, which is beneficial on newer micro-architectures but would not really hurt older ones much.

Another example is better addressing modes. For SVE, we used to generate at -O3 -march=armv9-a in GCC 13:

.L14:
    lsl   x0, x1, 1
    add   x1, x1, 8
    add   x2, x3, x0
    add   x6, x0, x4
    ld1h  z2.h, p7/z, [x2]
    ld1h  z22.h, p7/z, [x6]
    add   x0, x0, x5
    fmad  z22.h, p7/m, z21.h, z2.h
    ld1h  z20.h, p7/z, [x0]
    fmad  z20.h, p7/m, z26.h, z22.h
    st1h  z20.h, p7, [x2]...

And now generates:

.L13:
    ld1h  z0.h, p0/z, [x1, x0, lsl 1]
    ld1h  z2.h, p0/z, [x4, x0, lsl 1]
    ld1h  z1.h, p0/z, [x3, x0, lsl 1]
    fmad  z2.h, p0/m, z4.h, z0.h
    ld1h  z0.h, p0/z, [x2, x0, lsl 1]
    fmad  z1.h, p0/m, z5.h, z2.h
    fmad  z0.h, p0/m, z3.h, z1.h
    st1h  z0.h, p0...

These newer cost models are not a single fixed cost model; instead, they are attached to architecture revisions. For instance, we may over time add a cost model for Armv8-a that differs from the one for Armv8.1-a. In other words, each -march can now have its own cost model based on what we think the types of CPUs using it will be.

Today we have added two new generic models, generic-armv8-a and generic-armv9-a. These are enabled by default and are selected based on the current -march setting. That is, using -march=armv8-a uses the new generic-armv8-a model and using -march=armv9-a uses generic-armv9-a. These models differ in the width of the micro-architectures on which they are expected to be used and in the codegen strategy for SVE.

Better ABDL recognition

GCC 14 added better support for the AArch64 Absolute Difference instruction and added support for the Absolute Difference Long instruction.

Concretely for:

void
f (unsigned short *a, unsigned char *b, unsigned int *restrict out)
{
  for (int i = 0; i < 1024; i++)
    out[i] = __builtin_abs (a[i] - b[i]);
}

Prior to GCC 14, we used to generate:

.L2:
    ldr    q2, [x1], 16
    ldp    q1, q0, [x0]
    add    x0, x0, 32
    uxtl   v3.8h, v2.8b
    uxtl2  v2.8h, v2.16b
    usubl  v4.4s, v1.4h, v3.4h
    usubl2 v1.4s, v1.8h, v3.8h
    usubl  v3.4s, v0.4h, v2.4h
    usubl2 v0.4s, v0.8h, v2.8h
    abs    v4.4s, v4.4s
    abs    v1.4s, v1...

And now in GCC-14:

.L2:
    ldr    q28, [x1], 16
    add    x2, x2, 64
    ldp    q0, q29, [x0], 32
    zip1   v27.16b, v28.16b, v31.16b
    zip2   v28.16b, v28.16b, v31.16b
    uabdl  v26.4s, v0.4h, v27.4h
    uabdl  v30.4s, v29.4h, v28.4h
    uabdl2 v27.4s, v0.8h, v27.8h
    uabdl2 v28.4s, v29.8h, v28...

FEAT decoupled from architecture

In versions of GCC prior to 14 we used to enforce that architecture extensions be paired with the architecture that introduced them.

As an example, to use the dot product intrinsics introduced in Armv8.2-a, you would have been obligated to use -march=armv8.2-a+dotprod or higher. The downside of such an approach is that, as the number of extensions grew, it became harder and harder for users to keep track of which extension belongs to which architecture.

Additionally, it forces you to accept every new mandatory extension for that architecture as well, instead of just the one extension you wanted.

If you did not do this, the error message was not particularly informative:

error: inlining failed in call to 'always_inline' 'uint32x4_t vdotq_u32(uint32x4_t, uint8x16_t, uint8x16_t)': target specific option mismatch

Starting with GCC 14, we no longer tie extensions to architectures, and we allow you to use any extension with the architecture revision you are currently at.

This means you can now use, for example, -march=armv8-a+dotprod.
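
For illustration, here is a minimal sketch (the function name dot_accumulate is hypothetical; vdotq_u32 is the real intrinsic from the error message above) of code that can now be built with -march=armv8-a+dotprod:

#include <arm_neon.h>

/* Hypothetical helper: accumulate the dot products of two byte vectors.
   With GCC 14 this compiles with -march=armv8-a+dotprod, without
   raising the architecture baseline to Armv8.2-a.  */
uint32x4_t dot_accumulate (uint32x4_t acc, uint8x16_t a, uint8x16_t b)
{
  return vdotq_u32 (acc, a, b);
}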

For this to work correctly, we will retroactively introduce new command-line names for features that previously had none because they were mandatory in the architecture that introduced them. One such feature is the Armv8.3-a Complex Number instructions. Some of this work has been deferred to GCC 15.

Function multi versioning

Often it would be handy to be able to write code that takes advantage of newer architecture features while still being compatible with older architectures, in effect creating a “fat” binary. GCC already does so automatically for atomics using the outline-atomics implementation, which is on by default, but so far it has not been possible for users to do the same without significant work.

In GCC 14, we have implemented support for Function Multi Versioning (FMV) based on the Arm specification.

Read the Function Multi Versioning specification

Using FMV allows the user to create special “clones” of a function, each with different architecture extensions enabled and a default implementation which represents the baseline version.

Any calls made to this function within the same translation unit (TU) result in the compiler using an ifunc resolver to insert a dynamic runtime check that automatically selects which of the clones to call. As an example:

__attribute__((target_version("sve")))
int foo ()
{
  return 3;
}
__attribute__((target_version("sve2")))
int foo ()
{
  return 5;
}

int foo ()
{
  return 1;
}

int bar()
{
  return foo ();
}
 

On an SVE2-enabled system, this will call the SVE2-optimized version; on an SVE-enabled system, the SVE version; and otherwise, the default one.

How it determines which one to check first is defined in the ACLE specification for FMV. Each architecture extension is given a priority, and the higher priority gets checked first.

Note that the FMV attribute names do not always align with the command-line option names currently. Users are encouraged to check the FMV spec for the attributes and their priorities.

The code above generates:

foo()._Msve:
    mov   w0, 3
    ret
foo()._Msve2:
    mov   w0, 5
    ret
foo() [clone .default]:
    mov   w0, 1
    ret
foo() [clone .resolver]:
    stp   x29, x30, [sp, -16]!
    mov   x29, sp
    bl    __init_cpu_features_resolver
    adrp  x0, __aarch64_cpu_features
    ldr   x0...

To make it easier to write these functions we have also introduced a new function attribute that can be used to have the compiler automatically generate the function clones by enabling different extensions for each:

__attribute__((target_clones("default", "sve", "sve2")))
int foo ()
{
  return 5;
}

int bar()
{
  return foo ();
}

This is functionally equivalent to the previous example (if all the return values were the same), and we did not have to keep multiple copies of the source code around. This is convenient when the compiler-generated code can take advantage of the different architecture extensions available in the clones without relying on intrinsics.
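
As a minimal sketch of that last point (the function scale below is hypothetical, not from the post): a plain scalar loop annotated with target_clones lets the compiler autovectorize each clone with whatever ISA that clone enables, with no intrinsics in the source:

/* Hypothetical example: the compiler may vectorize the "sve2" and "sve"
   clones with SVE instructions and the "default" clone with the baseline
   ISA, while the runtime resolver picks the right one automatically.  */
__attribute__((target_clones("default", "sve", "sve2")))
void scale (float *restrict out, const float *restrict in, float f, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i] * f;
}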

PAC+BTI benchmarking

Pointer Authentication (PAC) and Branch Target Identification (BTI) are two security features available on AArch64 that have been implemented in GCC for years now. They are intended to mitigate Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks respectively.

To enable these in GCC, the command-line flag -mbranch-protection=<option> can be used.

Read more about PAC+BTI mitigations
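
As a minimal sketch (the function below is hypothetical and only illustrates that no source changes are needed), an ordinary function is simply rebuilt with the flag, for example -mbranch-protection=standard, and the compiler inserts the PAC and BTI instructions itself:

/* Build with, for example: gcc -O2 -mbranch-protection=standard
   The compiler signs and authenticates the return address (PAC) and
   places BTI landing pads where indirect branches may land; the C
   source itself is unchanged.  */
int checked_sum (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}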

One thing often not talked about is the performance impact of these security mitigations. If we enable PAC+BTI on the system libraries (glibc, the GCC runtime libraries), we see that providing the extra security has very low overhead on Neoverse V2 systems:

GCC 14.x Neoverse-V2 slowdown graph.

If we enable all applications to use PAC+BTI then the impact increases somewhat but is still quite minimal:

GCC 14.x Neoverse-V2 slowdown graph.

This shows that security does not need to come at the expense of performance.

Match.pd partitioning

GCC writes mathematical optimizations using a custom Domain Specific Language (DSL) which aims to reduce the boilerplate needed to perform transformations. It also allows us to efficiently combine rules that have a shared matching prefix.

As an example, the following rule optimizes X / X to 1:

(simplify
 (div @0 @0)
 { build_one_cst (type); })

While this DSL is great to use, there is one downside: build time.

These rules generate a giant C file which contains the code to handle all the rules. Over time, as rules were added, this file has grown to the point where it is a significant compile-time and memory hog.

This file is also required by many other files in the compiler, and as such it is a parallel compilation bottleneck. Looking at the bootstrap time of the compiler, we see that of the 33 minutes it takes to build the compiler on a Neoverse N1 system, the majority of that time underutilizes the CPU:

The linear time is essentially spent waiting on points in the compilation that serialize the build process. To fix this, we have added an automatic partitioning scheme to the program that generates the C output.

There is, however, a point of diminishing returns. Each file must still parse many headers, and when a file gets too small the parsing overhead dominates instead of codegen. This is why it is important to pick a default number of partitions that is good today and for the next few years. To pick the default number of partitions, we can plot the compile-time improvement against the number of partitions:

Compile partition graph

This shows that the gains start dropping after 10 partitions, and that there is not much difference between 5 and 10 partitions today. Keeping future growth in mind, we went with 10 partitions.

The result of this work shaves 6 whole minutes off the total compiler bootstrap time (or, alternatively, a 27% decrease in build time):

It also significantly decreases the flat sequential part of the build time.

Compact syntax

Another DSL that GCC uses is the one that backend targets use to describe instructions and pattern combinations.

The exact details of the DSL are out of scope for this blog, but an example of such a pattern is the 32-bit data load/store/move pattern:

(define_insn_and_split "*movsi_aarch64"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  r,  r,  r, w,r,w, w")
	(match_operand:SI 1 "aarch64_mov_operand"  " r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
  "(register_operand (operands[0], SImode)
    || aarch64_reg_or_zero (operands[1], SImode))"
  "@
   mov\\t%w0, %w1
   mov\\t%w0, %w1
   mov\\t%w0, %w1
   mov\\t%w0, %1
   #
   * return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\", operands[1]);
   ldr\\t%w0, %1
   ldr\\t%s0, %1
   str\\t%w1, %0
   str\\t%s1, %0
   adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
   adr\\t%x0, %c1
   adrp\\t%x0, %A1
   fmov\\t%s0, %w1
   fmov\\t%w0, %s1
   fmov\\t%s0, %s1
   * return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), SImode)
    && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
   [(const_int 0)]
   "{
       aarch64_expand_mov_immediate (operands[0], operands[1]);
       DONE;
    }"
  ;; The "mov_imm" type for CNT is just a placeholder.
  [(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,
		    load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
   (set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
   (set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4,   4")
])

There are several problems with this pattern. For one thing, the number of rows in the output template needs to match the number of columns in things like the match_operand; in other words, the first “r” in “=r” matches the “mov\\t%w0, %w1”. It also means that modifying one alternative means having to find which columns and rows to modify. This makes changes quite error prone, and there is a large amount of repetition, which is the antithesis of what makes a good DSL.

Instead, we have now added a new syntax that takes into account all the things we have learned and done with these patterns over the years:

[(set (match_operand:SI 0 "nonimmediate_operand")
	(match_operand:SI 1 "aarch64_mov_operand"))]
  "(register_operand (operands[0], SImode)
    || aarch64_reg_or_zero (operands[1], SImode))"
  {@ [cons: =0, 1; attrs: type, arch, length]
     [r , r  ; mov_reg  , *   , 4] mov\t%w0, %w1
     [k , r  ; mov_reg  , *   , 4] ^
     [r , k  ; mov_reg  , *   , 4] ^
     [r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
     [r , n  ; mov_imm  , *   ,16] #
     /* The "mov_imm" type for CNT is just a placeholder.  */
     [r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate ("cnt", "%x0", operands[1]);
     [r , m  ; load_4   , *   , 4] ldr\t%w0, %1
     [w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
     [m , rZ ; store_4  , *   , 4] str\t%w1, %0
     [m , w  ; store_4  , fp  , 4] str\t%s1, %0
     [r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
     [r , Usa; adr      , *   , 4] adr\t%x0, %c1
     [r , Ush; adr      , *   , 4] adrp\t%x0, %A1
     [w , rZ ; f_mcr    , fp  , 4] fmov\t%s0, %w1
     [r , w  ; f_mrc    , fp  , 4] fmov\t%w0, %s1
     [w , w  ; fmov     , fp  , 4] fmov\t%s0, %s1
     [w , Ds ; neon_move, simd, 4] << aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);
  }
  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), SImode)
    && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
  [(const_int 0)]
  {
    aarch64_expand_mov_immediate (operands[0], operands[1]);
    DONE;
  })

This new syntax makes it possible to see more clearly which line is being edited and which alternative belongs to which attributes and constraints. Lastly, as a bonus, a git patch diff is now localized to the alternative that is being changed. We have converted the entirety of the AArch64 backend to this syntax, and using it we have been able to clean up duplicate patterns, make them easier to read and change, and notice missing alternatives or bugs in existing ones.

While it is not something that is user visible, it has allowed, and will continue to allow, us to move faster and make fewer mistakes when developing the compiler.

Summary

GCC 14 was a big release, not just in features but also in positioning us with a clear path towards more aggressive optimizations in the future. Stay tuned, there is much more to come.

In the meantime, check out the previous year's entry for GCC 13.

New features in GCC 13
