Advanced SIMD lacks some of the instructions that SVE can use to create bitmask patterns. For Advanced SIMD, we are often required to use creative means to generate vector constant immediates. In GCC 14, we added a small framework to allow us to generate these special constants more easily than before.
An example of such an immediate is one with only the top bit set in each 64-bit lane of a vector register. Such patterns are typically used to manipulate the sign bit of vectors of doubles:
void g (double *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = copysign (a[i], -1);
}
Prior to GCC 14 this used to generate:
.L13:
        ldr     q0, [x2]
        fabs    v0.2d, v0.2d
        fneg    v0.2d, v0.2d
        str     q0, [x2], 16
        cmp     x3, x2
        bne     .L13
but with GCC 14 we now generate:
        movi    v31.4s, 0
        fneg    v31.2d, v31.2d
.L9:
        ldr     q30, [x2]
        orr     v30.16b, v30.16b, v31.16b
        str     q30, [x2], 16
        cmp     x2, x3
        bne     .L9
The example above also shows that for the pattern fneg (fabs (..)) we should generate the same code as for copysign.
In GCC 14 we made several changes, including recognizing this pattern as a copysign operation, which is quite common in scientific code. Concretely, we now generate for:
void g (double *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = -fabs (a[i]);
}
the same codegen as the copysign example above:
.L9:
        ldr     q30, [x2]
        orr     v30.16b, v30.16b, v31.16b
        str     q30, [x2], 16
        cmp     x3, x2
        bne     .L9
As mentioned before, SVE typically has better immediate ranges in vector instructions than Advanced SIMD does. Part of the costing the compiler does is determining whether it should autovectorize using Advanced SIMD or SVE. In previous releases, we have mentioned that better performance can often be obtained by using a mix of SVE and Advanced SIMD loops. A typical example is an unrolled Advanced SIMD main loop followed by an SVE loop as the epilogue.
In GCC 14, we continue this trend of mixing SVE and Advanced SIMD code. When you are compiling code for an SVE-capable system but the compiler's costing has determined that Advanced SIMD is the best ISA to use, we can now use a few SVE instructions to enhance the Advanced SIMD code.
The example above, compiled on an SVE-enabled system, generates:
.L9:
        ldr     q31, [x2]
        orr     z31.d, z31.d, #-9223372036854775808
        str     q31, [x2], 16
        cmp     x3, x2
        bne     .L9
This uses the SVE bitwise inclusive OR with immediate instead of the Advanced SIMD variant, which would require us to first construct the immediate in a register.
Another example is the optimization of 128-bit vector integer divides. Advanced SIMD lacks an instruction for doing vector division of 64-bit integers:
#include <arm_neon.h>

int64x2_t foo2 (int64x2_t a, int64x2_t b)
{
  return a / b;
}
is compiled in GCC 13 to:
foo2:
        fmov    x0, d0
        fmov    x3, d1
        umov    x1, v0.d[1]
        umov    x2, v1.d[1]
        sdiv    x0, x0, x3
        sdiv    x1, x1, x2
        fmov    d0, x0
        ins     v0.d[1], x1
        ret
but in GCC 14 on an SVE-capable system it's compiled to:
foo2:
        ptrue   p7.b, all
        sdiv    z0.d, p7/m, z0.d, z1.d
        ret
There are several such cases where we generate Advanced SIMD instructions mixed with SVE and this is a trend that will continue to increase in the future.
Mixing SVE and Advanced SIMD is also something users are expected to want to do themselves, and as such Arm has developed an SVE-NEON bridge as part of the ACLE.
Read more about SVE-NEON ACLE bridge
This new interface can be used through the inclusion of the arm_neon_sve_bridge.h header file. The intrinsics defined here allow you to convert between SVE and Advanced SIMD types and then carry on using the normal SVE or Advanced SIMD intrinsics:
#include <arm_neon_sve_bridge.h>

void f (int *a, int *b)
{
  int32x4_t va = vld1q_s32 (a);
  svint32_t za = svset_neonq_s32 (svundef_s32 (), va);
  svint32_t zo = svorr_n_s32_x (svptrue_b32 (), za, 1 << 31);
  int32x4_t res = svget_neonq_s32 (zo);
  vst1q_s32 (b, res);
}
This generates code equivalent to the copysign example above. The loads are performed using Advanced SIMD instructions and we then apply SVE instructions to the data held in the Advanced SIMD registers.
This bridge allows for seamless conversion between the two ISAs. The special SVE intrinsic svundef can be used to avoid explicitly setting the upper bits of the SVE register, allowing direct usage of the register overlap characteristics of the AArch64 register file.
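The same bridge can be used to apply SVE-only operations, such as 64-bit integer division, to data held in Advanced SIMD registers. Below is a minimal sketch along those lines, assuming SVE is enabled at compile time; the function name is illustrative:

#include <arm_neon_sve_bridge.h>

/* Illustrative example: divide two Advanced SIMD vectors of 64-bit
   integers using the SVE SDIV instruction through the bridge.  */
int64x2_t div_s64x2 (int64x2_t a, int64x2_t b)
{
  svint64_t za = svset_neonq_s64 (svundef_s64 (), a);
  svint64_t zb = svset_neonq_s64 (svundef_s64 (), b);
  svint64_t zq = svdiv_s64_x (svptrue_b64 (), za, zb);
  return svget_neonq_s64 (zq);
}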
Prior to GCC 14, the default cost model for the compiler (that is, what you get if no -mcpu/-mtune is used) was based on the Cortex-A57. This cost model is quite old, and what was inefficient or didn't matter on that specific micro-architecture is no longer accurate for modern micro-architectures.
This meant that GCC's default code generation was suboptimal for the majority of users today. This becomes especially evident when comparing GCC against other AArch64 compilers without changing the tuning model.
The tuning models in GCC are rather complex, but one big difference between the old and the new tuning model framework is that the old framework only modelled latency and did not consider throughput. One example of this is a loop where unrolling is beneficial on newer micro-architectures but wouldn't really hurt older ones much, as in the sketch below.
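A simple illustrative loop of this kind might look like the following (a hypothetical example; the function name and shape are not taken from any particular benchmark):

/* Illustrative example: a throughput-bound loop.  On wide modern cores,
   unrolling the vectorized loop helps fill the extra execution ports and
   hide latency; on older, narrower cores it makes little difference.  */
void saxpy (float *restrict y, const float *restrict x, float a, int n)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}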
Another example is better addressing modes. For SVE, at -O3 -march=armv9-a, GCC 13 used to generate:
.L14:
        lsl     x0, x1, 1
        add     x1, x1, 8
        add     x2, x3, x0
        add     x6, x0, x4
        ld1h    z2.h, p7/z, [x2]
        ld1h    z22.h, p7/z, [x6]
        add     x0, x0, x5
        fmad    z22.h, p7/m, z21.h, z2.h
        ld1h    z20.h, p7/z, [x0]
        fmad    z20.h, p7/m, z26.h, z22.h
        st1h    z20.h, p7, [x2]...
And now generates:
.L13:
        ld1h    z0.h, p0/z, [x1, x0, lsl 1]
        ld1h    z2.h, p0/z, [x4, x0, lsl 1]
        ld1h    z1.h, p0/z, [x3, x0, lsl 1]
        fmad    z2.h, p0/m, z4.h, z0.h
        ld1h    z0.h, p0/z, [x2, x0, lsl 1]
        fmad    z1.h, p0/m, z5.h, z2.h
        fmad    z0.h, p0/m, z3.h, z1.h
        st1h    z0.h, p0...
These newer cost models are not a single fixed cost model; instead, they are attached to architecture revisions. For instance, we may over time add a cost model for Armv8-a that is different from the one for Armv8.1-a. In other words, each -march can now have its own cost model based on what we think the types of CPUs using it will be.
Today we have added two new generic models, generic-armv8-a and generic-armv9-a. These are enabled by default and are selected based on the current -march setting; that is, using -march=armv8-a uses the new generic-armv8-a model and using -march=armv9-a uses generic-armv9-a. These models differ in the width of the micro-architectures on which they are expected to be used and in their codegen strategy for SVE.
GCC 14 added better support for the AArch64 Absolute Difference instruction and added support for the Absolute Difference Long instruction.
Concretely for:
void f (unsigned short *a, unsigned char *b, unsigned int *restrict out)
{
  for (int i = 0; i < 1024; i++)
    out[i] = __builtin_abs (a[i] - b[i]);
}
We used to generate prior to GCC 14:
.L2:
        ldr     q2, [x1], 16
        ldp     q1, q0, [x0]
        add     x0, x0, 32
        uxtl    v3.8h, v2.8b
        uxtl2   v2.8h, v2.16b
        usubl   v4.4s, v1.4h, v3.4h
        usubl2  v1.4s, v1.8h, v3.8h
        usubl   v3.4s, v0.4h, v2.4h
        usubl2  v0.4s, v0.8h, v2.8h
        abs     v4.4s, v4.4s
        abs     v1.4s, v1...
And now in GCC 14:
.L2:
        ldr     q28, [x1], 16
        add     x2, x2, 64
        ldp     q0, q29, [x0], 32
        zip1    v27.16b, v28.16b, v31.16b
        zip2    v28.16b, v28.16b, v31.16b
        uabdl   v26.4s, v0.4h, v27.4h
        uabdl   v30.4s, v29.4h, v28.4h
        uabdl2  v27.4s, v0.8h, v27.8h
        uabdl2  v28.4s, v29.8h, v28...
In versions of GCC prior to 14 we used to enforce that architecture extensions be paired with the architecture that introduced them.
As an example, to use the dot product intrinsics introduced in Armv8.2-a you would have been obligated to use -march=armv8.2-a+dotprod or higher. The downside of such an approach is that as the number of extensions grew, it became harder and harder for users to keep track of which extension belongs to which architecture.
Additionally, it had the downside of forcing you to accept every new mandatory extension of that architecture as well, instead of just the one you as the user wanted.
Should you not do this, the error message was not particularly informative:
error: inlining failed in call to 'always_inline' 'uint32x4_t vdotq_u32(uint32x4_t, uint8x16_t, uint8x16_t)': target specific option mismatch
Starting with GCC 14, we no longer tie extensions to the architectures that introduced them, and allow you to use any extension with the architecture revision you are currently on.
This means you can now use, for example, -march=armv8-a+dotprod.
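As a minimal sketch of what this enables (an illustrative example, assuming the file is compiled with -march=armv8-a+dotprod):

#include <arm_neon.h>

/* Illustrative example: uses the dot product intrinsic while keeping the
   base architecture at Armv8-a; compile with -march=armv8-a+dotprod.  */
uint32x4_t dot_accumulate (uint32x4_t acc, uint8x16_t a, uint8x16_t b)
{
  return vdotq_u32 (acc, a, b);
}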
For this to work correctly, we will retroactively introduce new command-line names for features that previously had none because they were mandatory in the architecture that introduced them. One such feature is the Armv8.3-a Complex Number instructions. Some of this work has been deferred to GCC 15.
It is often handy to be able to write code that takes advantage of newer architecture features while still being compatible with older architectures, in effect creating a "fat" binary. GCC already does this automatically for atomics using the outline-atomics implementation, which is on by default, but so far it hasn't been possible for users to do the same without significant work.
In GCC 14, we have implemented support for Function Multi Versioning (FMV) based on the Arm specification.
Read the Function Multi Versioning specification
Using FMV allows the user to create special "clones" of a function, each with different architecture extensions enabled, along with a default implementation that represents the baseline version.
Any call made to this function within the same translation unit (TU) results in the compiler using an ifunc resolver to insert a dynamic runtime check that automatically selects which of the clones to call. As an example:
__attribute__((target_version("sve")))
int foo () { return 3; }

__attribute__((target_version("sve2")))
int foo () { return 5; }

int foo () { return 1; }

int bar() { return foo (); }
On an SVE2-enabled system this will call the SVE2-optimized version, on an SVE-enabled system it will call the SVE version, and otherwise it will call the default one.
How it determines which one to check first is defined in the ACLE specification for FMV. Each architecture extension is given a priority, and the higher priority gets checked first.
Note that the FMV attribute names do not always align with the command-line option names currently. Users are encouraged to check the FMV spec for the attributes and their priorities.
The code above generates:
foo()._Msve:
        mov     w0, 3
        ret
foo()._Msve2:
        mov     w0, 5
        ret
foo() [clone .default]:
        mov     w0, 1
        ret
foo() [clone .resolver]:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        bl      __init_cpu_features_resolver
        adrp    x0, __aarch64_cpu_features
        ldr     x0...
To make it easier to write these functions we have also introduced a new function attribute that can be used to have the compiler automatically generate the function clones by enabling different extensions for each:
__attribute__((target_clones("default", "sve", "sve2")))
int foo () { return 5; }

int bar() { return foo (); }
This is functionally equivalent to the previous example (if all the return values were the same), and we did not have to keep multiple copies of the source code around. This is convenient when the compiler-generated code can take advantage of the different architecture extensions available in the clones without relying on intrinsics.
Pointer Authentication (PAC) and Branch Target Identification (BTI) are two security features available on AArch64 and implemented in GCC for years now. They are intended to mitigate Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks respectively.
To enable these in GCC the command line flag -mbranch-protection=<option> can be used.
Read more about PAC+BTI mitigations
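As a minimal sketch of what enabling this looks like (a hypothetical example; the exact instructions emitted depend on the chosen -mbranch-protection level and the compiler version):

/* Illustrative example: compile with
     gcc -O2 -mbranch-protection=standard example.c
   For a non-leaf function like this, GCC typically signs the return
   address on entry (paciasp) and authenticates it before returning
   (autiasp); paciasp also acts as a valid BTI landing pad for indirect
   calls, while other indirectly callable functions get an explicit
   "bti c" at their entry point.  */
extern int work (int);

int wrapper (int x)
{
  return work (x) + 1;
}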
One thing often not talked about is the performance impact of these security mitigations. If we enable PAC+BTI on the system libraries (glibc, the GCC runtime libraries), we see that providing the extra security has very low overhead on Neoverse V2 systems:
If we enable all applications to use PAC+BTI then the impact increases somewhat but is still quite minimal:
This shows that security does not need to come at the expense of performance.
GCC writes mathematical optimizations using a custom Domain Specific Language (DSL) which aims to reduce the boilerplate needed to perform transformations. It also allows us to efficiently combine rules that have a shared matching prefix.
As an example the following rule optimizes X / X:
(simplify
 (div @0 @0)
 { build_one_cst (type); })
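At the source level, this rule folds any expression of the form X / X into the constant one of the matching type. As a small illustration (a hypothetical example; the optimization level at which the folding happens may vary):

/* Illustrative example: x / x is folded to 1 (division by zero is
   undefined behaviour, so x is assumed to be nonzero).  */
int f (int x)
{
  return x / x;   /* typically becomes 'return 1;' when optimizing */
}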
While this DSL is great to use, there is one downside: build time.
These rules generate a giant C file which contains the code to handle all the rules. Over time as rules were added this file has grown to a point where it is a significant compile time and memory hog.
This file is also required by many other files in the compiler, and as such it is a parallel compilation bottleneck. When looking at the bootstrap time of the compiler, we see that of the 33 minutes it takes to build the compiler on a Neoverse N1 system, the majority underutilizes the CPU:
The linear time is essentially spent waiting on points in the compilation that serialize the build process. To fix this, we've added an automatic partitioning scheme for the program that generates the C output.
There is, however, a point of diminishing returns. Each file still must parse many headers, and when the files get too small the overhead of parsing dominates over code generation, which is why it's important to pick a default number of partitions that is good both today and for the next few years. To pick the default number of partitions, we can plot the compile-time improvement against the number of partitions:
This shows that the gains start dropping off after 10 partitions, and that there's not much difference between 5 and 10 partitions today. Keeping future growth in mind, we went with 10 partitions.
The result of this work shaves 6 whole minutes off the total compiler bootstrap time (or alternatively, a 27% decrease in build time):
And significantly decreases the flat sequential part of the build time.
Another DSL that GCC uses is the one that targets (backends) use to describe instructions and pattern combinations.
The exact details of the DSL are out of scope for this blog, but an example of such a pattern is the 32-bit data load/store/move pattern:
(define_insn_and_split "*movsi_aarch64"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,k,r,r,r,r, r,w, m, m,  r,  r,  r, w,r,w, w")
        (match_operand:SI 1 "aarch64_mov_operand"  " r,r,k,M,n,Usv,m,m,rZ,w,Usw,Usa,Ush,rZ,w,w,Ds"))]
  "(register_operand (operands[0], SImode)
    || aarch64_reg_or_zero (operands[1], SImode))"
  "@
   mov\\t%w0, %w1
   mov\\t%w0, %w1
   mov\\t%w0, %w1
   mov\\t%w0, %1
   #
   * return aarch64_output_sve_cnt_immediate (\"cnt\", \"%x0\", operands[1]);
   ldr\\t%w0, %1
   ldr\\t%s0, %1
   str\\t%w1, %0
   str\\t%s1, %0
   adrp\\t%x0, %A1\;ldr\\t%w0, [%x0, %L1]
   adr\\t%x0, %c1
   adrp\\t%x0, %A1
   fmov\\t%s0, %w1
   fmov\\t%w0, %s1
   fmov\\t%s0, %s1
   * return aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);"
  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), SImode)
   && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
  [(const_int 0)]
  "{
     aarch64_expand_mov_immediate (operands[0], operands[1]);
     DONE;
   }"
  ;; The "mov_imm" type for CNT is just a placeholder.
  [(set_attr "type" "mov_reg,mov_reg,mov_reg,mov_imm,mov_imm,mov_imm,load_4,
                     load_4,store_4,store_4,load_4,adr,adr,f_mcr,f_mrc,fmov,neon_move")
   (set_attr "arch"   "*,*,*,*,*,sve,*,fp,*,fp,*,*,*,fp,fp,fp,simd")
   (set_attr "length" "4,4,4,4,*,  4,4, 4,4, 4,8,4,4, 4, 4, 4, 4")
  ])
There are several problems with this pattern. For one thing, the number of rows in the output template needs to match the number of columns in things like the match_operand; in other words, the first "r" in "=r" matches the "mov\\t%w0, %w1". It also means that modifying one alternative means having to find which columns and rows to modify. This makes it quite error prone to make changes, and there is a large amount of repetition, which is the antithesis of what makes a good DSL.
Instead, we have now added a new syntax that takes into account all the things we have learned and done with these patterns over the years:
(define_insn_and_split "*movsi_aarch64"
  [(set (match_operand:SI 0 "nonimmediate_operand")
        (match_operand:SI 1 "aarch64_mov_operand"))]
  "(register_operand (operands[0], SImode)
    || aarch64_reg_or_zero (operands[1], SImode))"
  {@ [cons: =0, 1; attrs: type, arch, length]
     [r , r  ; mov_reg  , *   , 4] mov\t%w0, %w1
     [k , r  ; mov_reg  , *   , 4] ^
     [r , k  ; mov_reg  , *   , 4] ^
     [r , M  ; mov_imm  , *   , 4] mov\t%w0, %1
     [r , n  ; mov_imm  , *   ,16] #
     /* The "mov_imm" type for CNT is just a placeholder. */
     [r , Usv; mov_imm  , sve , 4] << aarch64_output_sve_cnt_immediate ("cnt", "%x0", operands[1]);
     [r , m  ; load_4   , *   , 4] ldr\t%w0, %1
     [w , m  ; load_4   , fp  , 4] ldr\t%s0, %1
     [m , rZ ; store_4  , *   , 4] str\t%w1, %0
     [m , w  ; store_4  , fp  , 4] str\t%s1, %0
     [r , Usw; load_4   , *   , 8] adrp\t%x0, %A1;ldr\t%w0, [%x0, %L1]
     [r , Usa; adr      , *   , 4] adr\t%x0, %c1
     [r , Ush; adr      , *   , 4] adrp\t%x0, %A1
     [w , rZ ; f_mcr    , fp  , 4] fmov\t%s0, %w1
     [r , w  ; f_mrc    , fp  , 4] fmov\t%w0, %s1
     [w , w  ; fmov     , fp  , 4] fmov\t%s0, %s1
     [w , Ds ; neon_move, simd, 4] << aarch64_output_scalar_simd_mov_immediate (operands[1], SImode);
  }
  "CONST_INT_P (operands[1]) && !aarch64_move_imm (INTVAL (operands[1]), SImode)
   && REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
  [(const_int 0)]
  {
    aarch64_expand_mov_immediate (operands[0], operands[1]);
    DONE;
  }
)
This new syntax makes it easier to see which line is being edited and which alternative belongs to which attributes and constraints. Lastly, as a bonus, git patch diffs are now localized to the alternative that is being changed. We have converted the entirety of the AArch64 backend to this syntax, and using it we have been able to clean up duplicate patterns, make them easier to read and change, and notice missing alternatives or bugs in existing ones.
While this is not something that is user visible, it has allowed, and will continue to allow, us to move faster and make fewer mistakes when developing the compiler.
GCC 14 was a big release, not just in features but also in positioning us with a clear path towards more aggressive optimizations in the future. Stay tuned, there is much more to come. In the meantime, check out the previous year's entry for GCC 13.
New features in GCC 13