Welcome to the GCC 12 issue of Arm’s annual performance improvements blog. As always, this year’s changes are a combination of the work that Arm and the community have done in GCC. With GCC 12 we have focused on laying the groundwork, with changes that put us in a better position for future optimization work. That said, we still managed plenty of improvements for both Advanced SIMD and SVE.
This year’s GCC is the fastest GCC yet, as measured by SPECrate® 2017 Integer on Neoverse platforms:
GCC 7-12 SPECrate® 2017 integer on Neoverse platforms estimated relative improvements.
These are all rate=1 (single core) improvements. Let us dive in to see how we got here.
GCC 12 brings with it compiler support for various architecture features and improves auto-vectorization support for others. Chief among these are the following:
Armv8.6-A introduced a new dot-product instruction, usdot, for when the signs of the operands differ. In GCC this instruction is enabled with the +i8mm extension flag.
Starting with GCC 12, the auto-vectorizer can automatically recognize this pattern and use the instruction for SVE as well as for Advanced SIMD on AArch64 and AArch32. For example, the following sequence
#define N 480

unsigned int
f (unsigned int res, signed char *restrict a, unsigned char *restrict b)
{
  for (__INTPTR_TYPE__ i = 0; i < N; ++i)
    {
      int av = a[i];
      int bv = b[i];
      signed short mult = av * bv;
      res += mult;
    }
  return res;
}
Used to vectorize as:
f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q2, [x1, x3]
        ldr     q1, [x2, x3]
        add     x3, x3, 16
        sxtl    v4.8h, v2.8b
        sxtl2   v3.8h, v2.16b
        uxtl    v2.8h, v1.8b
        uxtl2   v1.8h, v1.16b
        mul     v2.8h, v2.8h, v4.8h
        mul     v1.8h, v1.8h, v3.8h
        saddw   v0.4s, v0.4s, v2.4h
        saddw2  v0.4s, v0.4s, v2.8h
        saddw   v0.4s, v0.4s, v1.4h
        saddw2  v0.4s, v0.4s, v1.8h
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret
Now with GCC 12 we get a much nicer:
f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q1, [x2, x3]
        ldr     q2, [x1, x3]
        usdot   v0.4s, v1.16b, v2.16b
        add     x3, x3, 16
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret
Armv8.8-A added several new memcpy instructions to the architecture to accelerate these often-used operations. When doing a small memcpy, memset or memmove, compilers typically inline the operation. However, for a large memcpy the compiler instead emits a call to the implementation in the C standard library. As an example:
void
copy (char * restrict a, char * restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i];
}
emits at -O2
copy:
        cmp     w2, 0
        ble     .L1
        uxtw    x2, w2
        b       memcpy
.L1:
        ret
which, unless n == 0, makes a call to the memcpy function. Starting with GCC 12, when using the +mops extension the compiler instead emits the new memcpy instructions:
copy:
        cmp     w2, 0
        ble     .L1
        uxtw    x2, w2
        cpyfp   [x0]!, [x1]!, x2!
        cpyfm   [x0]!, [x1]!, x2!
        cpyfe   [x0]!, [x1]!, x2!
.L1:
        ret
With GCC 12, we have continued to enhance GCC’s intrinsics support. This year the focus has been on starting to migrate intrinsic functions from RTL definitions to GIMPLE, so that the front end of the compiler can understand the semantics of the instructions. The eventual goal is to remove the physical arm_neon.h file from the compiler source tree entirely and instead synthesize the header in the compiler itself. While this has no direct effect for end users, it has great benefits for maintainability. It also allows us to tie together concepts that would be difficult to express when having to describe them in the C language. This GCC version starts by moving the structural types, for example int32x4x2_t and related types, out of the header file and directly into the compiler. One of the primary reasons for doing this is to fix register allocation issues when these types are used.
To handle the new types that were moved out of the header, we introduced several new full and partial structure types in the compiler. In versions before GCC 12, we used a generic “bag of bits” to represent these intrinsics. As an example, the vst2q_s32 intrinsic used a type that says the intrinsic produces a 256-bit “bag of bits”. We would then generate “inserts” into this bag of bits to fill it up piecewise. The issue with this is that the compiler needs to know the lifetime of all the bits in the bag the moment the bag is created. The second major downside is that the structures are copied piecewise and we relied on the register allocator to consolidate the copies. As a result, it would often manage to eliminate one copy but not the other. For the example:
#include <arm_neon.h>

void
foo (int *dst, int32x4x2_t a)
{
  vst2q_s32 (dst, a);
}
GCC used to generate:
foo:
        mov     v2.16b, v0.16b
        mov     v3.16b, v1.16b
        st2     {v2.4s - v3.4s}, [x0]
        ret
Here the register allocator was unable to consolidate all the copies. Starting with GCC 12, we now simply generate:
foo:
        st2     {v0.4s - v1.4s}, [x0]
        ret
This is achieved by using a new type that says “this type uses two sequential 128-bit vector registers starting at register n”. As a result, no additional copies are required to construct or deconstruct the type.
LD1, ST1 and left and right shift intrinsics are among the first to be described using GIMPLE (GCC's mid-end IR language) rather than RTL (GCC's back-end IR language). A simple example to show the benefits of this is:
#include <arm_neon.h>

int32x4_t
foo (int32x4_t a)
{
  int32_t temp[4];
  vst1q_s32 (temp, a);
  return vld1q_s32 (temp);
}
This code stores to and then loads from a local array and so is a no-op. However, because the front end could not see what the load and store intrinsics were doing, we could only eliminate the instructions late in RTL, by which point the stack frame had already been laid out. So before GCC 12 we would generate:
foo:
        sub     sp, sp, #16
        add     sp, sp, 16
        ret
GCC 12 can now see inside the definitions and correctly generate:
foo:
        ret
GCC has a very advanced register allocator that gets things right most of the time. To do so, it relies heavily on receiving correct costing input from each target back end and on information from mid-end passes. Two key pieces of information it requires are basic block frequencies and branch probabilities. An example situation where this becomes very apparent is high register pressure combined with function calls:
void bar (int, int, int, int);

int
foo (int x, char* foo)
{
  int tmp = x * 753;
  int t2 = tmp + 7;
  int t3 = tmp * 7;
  int c1 = 753;
  int c2 = c1 + 7;
  int c3 = c1 * 7;

  for (int i = 0; i < 1024; i++)
    {
      if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER))
        bar (x, tmp, t2, t3);
      c1 += foo[i+1];
      c2 *= foo[i+1];
      c3 += c2;
    }
  return c1 + c2 + c3;
}
With this example, we can tweak the branch probabilities by changing the value of SPILLER, and we can simulate high register pressure by taking registers out of consideration for register allocation. If we look at the output of this example compiled with -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3 -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks, we find several issues.
We can see that just by tweaking the static branch probabilities we could get the register allocator to generate better or worse code. As an example, with -DSPILLER=0.5 we get this snippet:
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldp     w1, w0, [sp, 72]
        stp     w2, w3, [sp, 56]
        str     x7, [sp, 64]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldr     x7, [sp, 64]
        add     w22, w22, w0
        ldp     w2, w3, [sp, 56]
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x19, x7
        bne     .L5
However with a simple 1% change (-DSPILLER=0.51) in the probability we suddenly get an additional reload:
.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldr     w0, [sp, 76]
        stp     w1, w2, [sp, 56]
        str     w3, [sp, 72]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldp     w1, w2, [sp, 56]
        add     w21, w21, w0
        ldr     w3, [sp, 72]    <<<< here
        mul     w20, w20, w0
        ldr     x0, [sp, 64]
        add     w22, w22, w20
        cmp     x19, x0
        bne     .L5
Part of this was because, historically, the AArch64 back end modeled the cost of a load and a store as the same. This is not correct, as modern CPUs have a store buffer which makes stores significantly cheaper than loads: the CPU does not need to wait for a store to complete before continuing. The register allocator is deciding between two options. The first is to spill a value outside of the loop for its entire live range, which frees up registers so that it can avoid having to spill at the call site. The second is to do spills and reloads around the call itself. Spilling outside of the loop requires multiple stores, so the allocator compares the cost of those stores against the loads and stores needed around the call. Because of the incorrect costs, we would choose to spill around the call instead.
The second fix to GCC’s register allocator is how it handles “soft” conflicts. A conflict is considered soft when a value is live throughout a range (for example, a loop) and the range contains branches where the value is not used. In the example above, the values c1, c2, t2, and c3 are live but not used inside the branch with the function call. Typically, a register allocator handles such situations by trying to split the live range, but since the values are live throughout, the range cannot be reduced. With the changes, we now choose to allocate such values to callee-saved registers. As a consequence, they no longer need to be spilled around the function call.
GCC used to emit a GOT access using separate ADRP and LDR instructions. This allowed them to be scheduled independently and to use different registers:
        ADRP    x0, :got: symbol
        ...     ; unrelated instructions
        LDR     x1, [x0, :got_lo12: symbol]
GCC 12 now always emits the ADRP and LDR as consecutive instructions using the same register. This reduces register pressure with -fPIC/-fPIE, resulting in more efficient code. For example, Perlbench is 1.8% faster with -fPIC and 0.9% smaller:
        ADRP    x1, :got: symbol
        LDR     x1, [x1, :got_lo12: symbol]
Another advantage is that linkers can now optimize GOT accesses without having to introduce new relocations, because the instructions are always emitted sequentially as a block.
When register pressure is high, the register allocator spills some values to the stack. Spilling is expensive due to the cost of loads and stores, so the register allocator tries hard to minimize the number of spills. Some values, such as immediates and addresses, can instead be recomputed. Recomputing is better than spilling since ALU instructions are very fast. This alternative to spilling is called rematerialization. GCC 12 rematerializes addresses more often on AArch64, which improves performance and reduces the code size of applications using many global variables.
With the previous approach we might use more registers than needed and allow code to be scheduled in between the two uses of the ADRP result. When the function has high register pressure, this could lead to spilling the ADRP value, for example:
        ADRP    x0, symbol
        LDR     x1, [x0, :lo12: symbol]
        STR     x0, [sp, 32]          // spill ADRP value
        ... code using many registers...
        LDR     x2, [sp, 32]          // restore ADRP
        STR     x3, [x2, :lo12: symbol]
Now, instead of spilling the ADRP value, we simply rematerialize it, which is much cheaper and also frees up an additional register:
        ADRP    x0, symbol
        LDR     x1, [x0, :lo12: symbol]
        ... code using many registers...
        ADRP    x2, symbol            // rematerialize ADRP
        STR     x3, [x2, :lo12: symbol]
With GCC 12, we have started beefing up GCC’s constant CSE. AArch64 has a limited range of constants that can fit in a single instruction. To create complex constants we have two options: either use a sequence of mov/movk instructions or load the constant from a literal pool.
GCC already does quite a lot of optimizations at parse time. One of these is pulling constants out of arrays. As an example:
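As a rough illustration (this snippet is not from the original article), consider a 64-bit constant that cannot be encoded in a single instruction:

#include <stdint.h>

/* Illustrative only: a 64-bit constant too wide for a single instruction.
   On the general-register side GCC typically builds it with a mov followed
   by up to three movk instructions; if the value is needed in a SIMD
   register, it can instead be loaded from a literal pool with adrp + ldr.  */
uint64_t
make_const (void)
{
  return 0x0942430810234076UL;
}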
#include <stdint.h>
#include <arm_neon.h>

uint64_t
test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
{
  uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL };
  uint64_t res = a | arr[0];
  uint64x2_t val = vld1q_u64 (arr);
  *rt = vaddq_u64 (val, b);
  return res;
}
The expression res is represented in GIMPLE as “a | 0x0942430810234076UL” and because the other occurrence of the constant is inside a vector we could not CSE the constant. As a result, we would materialize the same constant twice:
test:
        adrp    x2, .LC0
        sub     sp, sp, #16
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x2, 16502
        movk    x2, 0x1023, lsl 16
        movk    x2, 0x4308, lsl 32
        add     v1.2d, v1.2d, v0.2d
        movk    x2, 0x942, lsl 48
        orr     x0, x0, x2
        str     q1, [x1]
        add     sp, sp, 16
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798
GCC 12 can now not only CSE the constant but also decide where it is cheaper to materialize it: on either the SIMD or the general-register side. In this case, the vector use of the constant is on the SIMD side, so it is cheaper to materialize the constant there. With GCC 12 we now generate:
test:
        adrp    x2, .LC0
        ldr     q1, [x2, #:lo12:.LC0]
        add     v0.2d, v0.2d, v1.2d
        fmov    x2, d1
        str     q0, [x1]
        orr     x0, x0, x2
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798
In GCC 12, we have taught the compiler various bit-level optimizations with the goal of improving latency and throughput. Vector shifts on Arm CPUs are usually throughput limited, so avoiding them typically has great benefits. The following are a couple of examples:
In many image- and video-processing kernels it is common to perform operations that truncate or shift by half the width of the element type. As an example:
typedef short int16_t;
typedef unsigned short uint16_t;

void
foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (a[i] * a[i]) >> 16;
}
This is a common pattern: an operation on a narrow data type widens the result, and only the top bits of that result are needed. Previously, we would generate:
.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 16
        sshr    v0.4s, v0.4s, 16
        xtn     v2.4h, v1.4s
        xtn2    v2.8h, v0.4s
        str     q2, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4
With GCC 12 we now generate:
.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        uzp2    v0.8h, v1.8h, v0.8h
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
The uzp2 operation concatenates the odd-numbered elements of its two source vectors, which here selects the top 16 bits of each 32-bit product, as illustrated in the following image:
uzp operational semantics
Another common operation is creating a mask that is all 1s when the top bit is 1 and all 0s otherwise; in other words, checking whether the number is negative. An example is:
void
e (int * restrict a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    b[i] = a[i] >> 31;
}
Which used to generate:
.L4:
        ldr     q0, [x0, x3]
        sshr    v0.4s, v0.4s, 31
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4
Now generates:
.L4:
        ldr     q0, [x0, x3]
        cmlt    v0.4s, v0.4s, #0
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
While these operations have the same latency on almost all AArch64 CPUs, the comparison has higher throughput than the shift in virtually all cases.
On many Arm CPUs, a vector of zeros can be created very cheaply, or even for free, using movi with a zero immediate. This opens up a lot of optimizations, as we can use a vector of zeros to seed operations and turn them into more efficient forms of other operations. As an example, rounding right shifts by half the input type size can be optimized:
#include <arm_neon.h>

uint32x4_t
foo (uint64x2_t a, uint64x2_t b)
{
  return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
}
Used to generate:
foo:
        rshrn   v0.2s, v0.2d, 32
        rshrn2  v0.4s, v1.2d, 32
        ret
But today generates:
foo:
        movi    v2.4s, 0
        raddhn  v0.2s, v0.2d, v2.2d
        raddhn2 v0.4s, v1.2d, v2.2d
        ret
This sequence has both lower latency and higher throughput than the previous one using shifts. These are just three of the many optimizations we have added in GCC 12, with many more to come.
Whenever you have a conditional statement in C, the code can be vectorized with SVE using predication. This allows us to vectorize many more loops than we could using Neon. This is not limited to a single conditional: SVE also allows us to deal with nested conditionals. Before GCC 12 we handled such cases a bit suboptimally. As an example:
void
f (float * restrict z0, float * restrict z1, float *restrict x,
   float * restrict y, float c, int n)
{
  for (int i = 0; i < n; i++)
    {
      float a = x[i];
      float b = y[i];
      if (a > b)
        {
          z0[i] = a + b;
          if (a > c)
            z1[i] = a - b;
        }
    }
}
Would generate:
.L3:
        ld1w    z1.s, p1/z, [x2, x5, lsl 2]
        ld1w    z2.s, p1/z, [x3, x5, lsl 2]
        fcmgt   p0.s, p3/z, z1.s, z0.s
        fcmgt   p2.s, p1/z, z1.s, z2.s
        fcmgt   p0.s, p0/z, z1.s, z2.s
        movprfx z3, z1
        fadd    z3.s, p2/m, z3.s, z2.s
        and     p0.b, p0/z, p1.b, p1.b
        fsub    z1.s, p0/m, z1.s, z2.s
        st1w    z3.s, p2, [x0, x5, lsl 2]
        st1w    z1.s, p0, [x1, x5, lsl 2]
        add     x5, x5, x6
        whilelo p1.s, w5, w4
        b.any   .L3
Here we perform too many predicate comparisons. We can reuse the predicate for a > b to create the predicate for a > b && a > c by using it as the governing predicate when checking a > c. As such, in GCC 12 we now generate:
.L3:
        ld1w    z1.s, p0/z, [x2, x5, lsl 2]
        ld1w    z2.s, p0/z, [x3, x5, lsl 2]
        fcmgt   p0.s, p0/z, z1.s, z2.s
        movprfx z3, z1
        fadd    z3.s, p0/m, z3.s, z2.s
        fcmgt   p1.s, p0/z, z1.s, z0.s
        fsub    z1.s, p1/m, z1.s, z2.s
        st1w    z3.s, p0, [x0, x5, lsl 2]
        st1w    z1.s, p1, [x1, x5, lsl 2]
        add     x5, x5, x6
        whilelo p0.s, w5, w4
        b.any   .L3
This is an example of the kind of predicate optimizations introduced in GCC 12.
In older GCC versions, when using -mcpu=native or -march=native, if the CPU ID was unknown to the compiler we would fall back to Armv8-A without any extra extensions enabled. This was done even if we could tell which extensions the CPU supports through /proc/cpuinfo. Starting with GCC 12, we now enable any extension we find, using Armv8-A as the baseline. So using GCC 12+ on a CPU whose ID it does not know, but which reports known feature bits such as SVE or FP16, will still have these features enabled.
In GCC 12 auto-vectorization is finally enabled at -O2 instead of requiring -O3 or higher. This allows for a better comparison against other compilers. At -O2 the compiler by default uses the very-cheap cost model when deciding whether to vectorize. Note that this is a different cost model from the one used with -O2 -ftree-vectorize.
The very-cheap cost model only allows vectorization if the compiler is certain that it would result in a performance win and the code-size increase from vectorizing is small.
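As a hypothetical illustration (this example is not from the original article), a simple loop with a known trip count and straightforward accesses is the kind of candidate the very-cheap model is likely to accept, since no runtime versioning or peeling is needed:

/* Hypothetical example: with a compile-time-known trip count and simple,
   non-aliasing accesses, GCC 12 may vectorize this at plain -O2 under the
   very-cheap cost model.  Whether it actually does depends on the target
   and the cost model's judgement.  */
void
scale (int *restrict dst, const int *restrict src)
{
  for (int i = 0; i < 1024; i++)
    dst[i] = src[i] * 3;
}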
When talking about Arm’s two vector ISAs, the discussion is often about picking one over the other. However, the best performance can often be had by combining the two rather than pitting them against each other. These decisions are tied to the specific micro-architecture, as they require accurate throughput and latency information for both ISAs.
Before GCC 12 we would pick one or the other. Starting with GCC 12 the choice is a bit more nuanced. As an example:
typedef short int16_t;
typedef unsigned short uint16_t;

void
foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (a[i] * a[i]) >> 10;
}
Used to generate with SVE enabled:
.L3:
        ld1h    z0.s, p0/z, [x0, x3, lsl 1]
        mul     z0.s, p1/m, z0.s, z0.s
        asr     z0.s, z0.s, #10
        st1h    z0.s, p0, [x1, x3, lsl 1]
        add     x3, x3, x4
        whilelo p0.s, w3, w2
        b.any   .L3
.L1:
Here SVE was preferred over Neon. For loops where n is small, however, Neon may offer the best performance. Starting with GCC 12 we now generate a combination of Neon and SVE for this example. The compiler emits a runtime check on the trip count: if the trip count is large we jump directly into an SVE loop, and if it is low we use a Neon main loop followed by an SVE epilogue.
Because the codegen is large, we only show the Neon and SVE combination:
.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        shrn    v1.4h, v1.4s, 10
        shrn2   v1.8h, v0.4s, 10
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
        and     w3, w2, -8
        tst     x2, 7
        beq     .L1
.L3:
        sub     w2, w2, w3
        ptrue   p0.b, all
        whilelo p1.s, wzr, w2
        ld1h    z0.s, p1/z, [x0, x3, lsl 1]
        mul     z0.s, p0/m, z0.s, z0.s
        asr     z0.s, z0.s, #10
        st1h    z0.s, p1, [x1, x3, lsl 1]
        add     x0, x0, x3, lsl 1
        add     x3, x1, x3, lsl 1
        cntw    x1
        whilelo p1.s, w1, w2
        b.any   .L9
Loop unrolling is a common compiler technique, typically applied to scalar code. In GCC we have not done much vector-code unrolling before; however, as the number of vector pipelines increases, loop unrolling is needed to feed the pipelines with enough work to get optimal performance.
Indiscriminately unrolling vectorized loops while ignoring ISA and micro-architecture details will yield undesirable results. With GCC 12, the AArch64 back end uses the tuning information about the CPU’s width and throughput, selected by -mcpu=native or -mcpu=<cpu>, to decide whether to unroll and by how much.
This unrolling is also combined with the ability to use Neon and SVE together to handle a loop. As such, several different combinations are possible. A few examples:
When compiling for some micro-architectures, the previous example can result in an unrolled Neon loop with a single SVE fallback:
foo:
        cmp     w2, 0
        ble     .L1
        sub     w3, w2, #1
        mov     x6, 0
        cmp     w3, 6
        bls     .L3
        lsr     w4, w2, 3
        lsl     x5, x4, 4
        tbz     x4, 0, .L4
        ldr     q0, [x0]
        mov     x6, 16
        umull   v1.4s, v0.4h, v0.4h
        umull2  v2.4s, v0.8h, v0.8h
        shrn    v3.4h, v1.4s, 10
        shrn2   v3.8h, v2.4s, 10
        str     q3, [x1]
        cmp     x5, x6
        beq     .L13
        .p2align 5,,15
.L4:
        add     x7, x6, 16
        ldr     q4, [x0, x6]
        ldr     q5, [x0, x7]
        umull   v6.4s, v4.4h, v4.4h
        umull2  v7.4s, v4.8h, v4.8h
        umull   v16.4s, v5.4h, v5.4h
        umull2  v17.4s, v5.8h, v5.8h
        shrn    v18.4h, v6.4s, 10
        shrn    v1.4h, v16.4s, 10
        shrn2   v18.8h, v7.4s, 10
        shrn2   v1.8h, v17.4s, 10
        str     q18, [x1, x6]
        add     x6, x6, 32
        str     q1, [x1, x7]
        cmp     x5, x6
        bne     .L4
.L13:
        and     w6, w2, -8
        tst     x2, 7
        beq     .L1
.L3:
        sub     w2, w2, w6
        ptrue   p0.b, all
        whilelo p1.s, wzr, w2
        ld1h    z19.s, p1/z, [x0, x6, lsl 1]
        mul     z19.s, p0/m, z19.s, z19.s
        asr     z20.s, z19.s, #10
        st1h    z20.s, p1, [x1, x6, lsl 1]
        cntw    x8
        add     x0, x0, x6, lsl 1
        whilelo p2.s, w8, w2
        add     x1, x1, x6, lsl 1
        b.none  .L1
        ld1h    z21.s, p2/z, [x0, #1, mul vl]
        mul     z21.s, p0/m, z21.s, z21.s
        asr     z22.s, z21.s, #10
        st1h    z22.s, p2, [x1, #1, mul vl]
.L1:
        ret
If we change the sequence into one where SVE is always beneficial, we get only SVE code, with an unrolled SVE main loop.
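The modified source is not reproduced here; a variant along the following lines, with unsigned 16-bit inputs and outputs and a shift by the full element width (matching the umulh instructions in the output below), is assumed:

typedef unsigned short uint16_t;

/* Assumed variant of the earlier example: keeping only the high half of
   the 16x16-bit product maps naturally to the umulh instruction.  */
void
foo (uint16_t * restrict a, uint16_t * restrict d, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = ((unsigned int) a[i] * a[i]) >> 16;
}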
generates:
.L4:
        ld1h    z0.h, p0/z, [x9]
        ld1h    z1.h, p0/z, [x9, #1, mul vl]
        umulh   z0.h, p0/m, z0.h, z0.h
        umulh   z1.h, p0/m, z1.h, z1.h
        add     w5, w5, w7
        st1h    z1.h, p0, [x10, #1, mul vl]
        st1h    z0.h, p0, [x10]
        add     x9, x9, x6
        add     x10, x10, x6
        cmp     w8, w5
        bcs     .L4
        cmp     w2, w5
        beq     .L1
.L3:
        ubfiz   x11, x5, 1, 32
        sub     w2, w2, w5
        ptrue   p1.b, all
        whilelo p2.h, wzr, w2
        add     x0, x0, x11
        add     x1, x1, x11
        ld1h    z2.h, p2/z, [x0]
        umulh   z2.h, p1/m, z2.h, z2.h
        st1h    z2.h, p2, [x1]
        cntb    x12
        cnth    x13
        whilelo p3.h, w13, w2
        add     x14, x0, x12
        add     x15, x1, x12
        b.none  .L1
        ld1h    z3.h, p3/z, [x14]
        umulh   z3.h, p1/m, z3.h, z3.h
        st1h    z3.h, p3, [x15]
.L1:
        ret
This unrolling is more than simply repeating instructions. The goal of the unrolling is to increase pipeline usage inside the loop. To accomplish this, we try to maintain as much parallelism as possible. One way we do this is to split the accumulator across the unrolled copies when the loop performs an accumulation. For example:
double
f (double *x, double *y, long n)
{
  double res = 0;
  for (long i = 0; i < n; ++i)
    res += x[i] * y[i];
  return res;
}
generates when unrolled:
.L4:
        ld1d    z4.d, p0/z, [x6, x3, lsl 3]
        ld1d    z5.d, p0/z, [x5, x3, lsl 3]
        ld1d    z2.d, p0/z, [x0, x3, lsl 3]
        ld1d    z3.d, p0/z, [x1, x3, lsl 3]
        add     x3, x3, x4
        fmla    z1.d, p0/m, z4.d, z5.d
        fmla    z0.d, p0/m, z2.d, z3.d
        cmp     x7, x3
        bcs     .L4
        fadd    z0.d, z0.d, z1.d
        cmp     x2, x3
        beq     .L6
which keeps the fmla chains separate and only does the final accumulation after the loop.
With GCC 12 we can now vectorize fmin/fmax without the need for -ffast-math. For example:
double
f (double *x, int n)
{
  double res = 100.0;
  for (int i = 0; i < n; ++i)
    res = __builtin_fmin (res, x[i]);
  return res;
}
Before GCC 12 we would fail vectorization at -O3 and produce scalar code. With GCC 12 we now generate:
.L3:
        ld1d    z1.d, p0/z, [x0, x2, lsl 3]
        add     x2, x2, x4
        fminnm  z0.d, p0/m, z0.d, z1.d
        whilelo p0.d, w2, w1
        b.any   .L3
        ptrue   p0.b, all
        fminnmv d0, p0, z0.d
        ret
so we now produce vectorized code even at -O3.
In GCC 12, support for gathers and scatters was added to the SLP vectorizer, which allows for greater flexibility when these operations are required. As an example:
void
f (int *restrict y, int *restrict x, int *restrict indices)
{
  for (int i = 0; i < 16; ++i)
    {
      y[i * 2] = x[indices[i * 2]] + 1;
      y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
    }
}
fails to vectorize before GCC 12. Starting with GCC 12 we can handle such cases and use the existing costing infrastructure for SLP loops to determine when gathers and scatters are beneficial. For the example we now generate:
.L2:
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        ld1w    z0.s, p0/z, [x1, z0.s, sxtw 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x3, lsl 2]
        add     x3, x3, x5
        whilelo p0.s, x3, x4
        b.any   .L2
        ret
As part of Armv8.7-A, we have added support for the atomic 64-byte load and store instructions to GCC. These can be used with the +ls64 extension. The extension comes with a new ACLE type, data512_t, which the resulting data can be stored into. The example:
#include <arm_acle.h>

void
func (const void * addr, data512_t *data)
{
  *data = __arm_ld64b (addr);
}

generates:
func:
        ld64b   x8, [x0]
        stp     x8, x9, [x1]
        stp     x10, x11, [x1, 16]
        stp     x12, x13, [x1, 32]
        stp     x14, x15, [x1, 48]
        ret
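The ACLE also defines matching store intrinsics such as __arm_st64b. A minimal sketch of the store direction, assuming the same +ls64 target support, could look like this:

#include <arm_acle.h>

/* Sketch (not from the original article): atomically store 64 bytes from a
   data512_t value.  Requires a target with the +ls64 extension enabled.  */
void
store_func (void *addr, const data512_t *data)
{
  __arm_st64b (addr, *data);
}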
GCC 12 also adds CPU support for the following Arm CPUs:
These can be selected with the -march, -mcpu, and -mtune compiler options to target and tune the compiler for these CPUs.
With a lot of the foundational pieces in place, we can now push for more complex optimizations in GCC. The combination of SVE and Neon promises to deliver much greater performance by giving us flexibility without losing any performance at low trip counts.
In the meantime, check out last year's entry for GCC 11.
Performance improvements in GCC 11