The GCC 13 compiler is now out and contains a few goodies for everyone. Arm engineers and the GCC community have been hard at work to produce this release and this blog covers some of the things to expect in GCC 13. Some of the optimizations here are generic and some only apply to Arm architectures.
On AArch64, the ABI specifies that when a value of a type smaller than the register it is passed in, the top bits of that register are undefined. This means that when, for example, a char is passed as an argument to a function, we cannot assume the top bits of the register are zero; only the bottom 8 bits hold the value.
This is also true for smaller types like Booleans, which are extended to a byte in the AArch64 ABI. In C and C++, Booleans are defined to be 0 or 1, so only the bottom bit matters and the next 7 bits are required to be 0.
As an example, this simple function:
#include <stdbool.h>

extern int bar();

int foo (bool a)
{
  if (a)
    return bar ();
  return 5;
}
Generates in GCC-12:
foo:
        tst     w0, 255
        bne     .L4
        mov     w0, 5
        ret
.L4:
        b       bar
Here we test all 8 bits explicitly and then branch on the result. However, the Arm architecture has a handy instruction that tests only a single bit. In GCC 13, we now generate:
foo:
        tbnz    w0, 0, .L4
        mov     w0, 5
        ret
.L4:
        b       bar
This combines the tst + b[ne] pair into a single tb[n]z. Modern code with many Booleans sees a significant performance improvement and a code-size reduction after this change. The improvement is not restricted to Booleans, however: it hooks into GCC's extensive range-calculation framework and so works any time the compiler can see that only a single bit is relevant for the result.
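As an illustration of the non-Boolean case (a hypothetical example, not taken from the original post), a value the compiler can prove holds at most one relevant bit gets the same treatment:

extern int bar (void);

int foo (unsigned a)
{
  unsigned x = a & 1;   /* the range framework knows only bit 0 of x can be set */
  if (x)                /* so this test only needs to examine a single bit      */
    return bar ();
  return 5;
}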
Using the above-mentioned range-based framework, we've started doing more optimizations in the vectorizer as well. One such example is a sequence commonly used in image processing:
#include <stdint.h>

void f(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}
In this example, a pixel value is multiplied by a level, producing an intermediate result that needs twice the number of bits to express. The result is then scaled back into the range of a pixel by dividing by 255.
In GCC 12 such sequences would generate:
.L3:
        ldr     q1, [x0]
        umull   v2.8h, v1.8b, v5.8b
        umull2  v1.8h, v1.16b, v5.16b
        umull   v0.4s, v2.4h, v3.4h
        umull2  v2.4s, v2.8h, v3.8h
        umull   v4.4s, v1.4h, v3.4h
        umull2  v1.4s, v1.8h, v3.8h
        uzp2    v0.8h, v0.8h, v2.8h
        uzp2    v1.8h, v4.8h, v1.8h
        shrn    v0.8b, v0.8h, 7
        shrn2   v0.16b, v1.8h, 7
        str     q0, [x0], 16
        cmp     x1, x0
        bne     .L3
This sequence benefits from the work in GCC-12.1 to recognize truncations as permutations (read more about that here), but it's still inefficient. The biggest issue is that multiplications are quite expensive, and this sequence uses six of them.
This is one case where SVE does better than Advanced SIMD:
.L3:
        ld1b    z0.h, p1/z, [x0, x3]
        mul     z0.h, p0/m, z0.h, z1.h
        umulh   z0.h, p0/m, z0.h, z2.h
        lsr     z0.h, z0.h, #7
        st1b    z0.h, p1, [x0, x3]
        add     x3, x3, x5
        whilelo p1.h, w3, w2
        b.any   .L3
This takes advantage of SVE's widening loads and narrowing stores to produce a much better sequence. However, it is still far from optimal.
In GCC-13 such sequences now generate:
.L3:
        ldr     q29, [x0]
        umull   v0.8h, v29.8b, v31.8b
        umull2  v29.8h, v29.16b, v31.16b
        addhn   v27.8b, v0.8h, v30.8h
        addhn   v28.8b, v29.8h, v30.8h
        uaddw   v27.8h, v0.8h, v27.8b
        uaddw   v28.8h, v29.8h, v28.8b
        uzp2    v28.16b, v27.16b, v28.16b
        str     q28, [x0], 16
        cmp     x1, x0
        bne     .L3
for Advanced SIMD, and for SVE2:
.L3:
        ld1b    z29.h, p7/z, [x0, x3]
        mul     z29.h, z29.h, z31.h
        addhnb  z28.b, z29.h, z30.h
        addhnb  z28.b, z28.h, z29.h
        st1b    z28.h, p7, [x0, x3]
        add     x3, x3, x4
        whilelo p7.h, w3, w2
        b.any   .L3
In both cases we now have a sequence that gets rid of the extra multiplication and right-shift bottleneck of the division. But how does it work?
The vectorizer uses the range-based framework to detect whenever we are dividing a value of double the element precision by a bitmask constant of the form 2^n - 1, such as 0xff here.
The general transformation is recognizing:
x = y / (2^(precision(y) / 2) - 1)

where precision(y) is the bit width of y; for a 16-bit y the divisor is 2^8 - 1 = 255. For 8-bit result types this is then rewritten into:

x = (y + ((y + 257) >> 8)) >> 8
Normally this would require that the additions be done in double the precision of y so that we don't lose any bits to overflow. However, if we can tell that y + 257 won't overflow, then we can perform the calculation in the same precision.
Essentially, the sequence decomposes the division into two smaller divisions, one for the top part of the number and one for the bottom, and adds the results back together.
To account for the fact that a shift by 8 divides by 256 rather than 255, 1 is added to both parts so that when y is 255 we still get 1 as the answer.
And because the shift amount is half the width of the original data type, we can use the halving instructions the ISA provides (such as the addhn above) to do the operation instead of using actual shifts.
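To see that the rewrite really is a division by 255, here is a scalar model of it (a sketch with an illustrative function name, assuming the input is the product of two 8-bit values so the additions cannot overflow 16 bits):

#include <assert.h>
#include <stdint.h>

/* Scalar model of the rewrite (illustrative name): divide the 16-bit
   product of two 8-bit values by 255 without using a division.  */
static uint8_t div255 (uint16_t y)
{
  /* y is at most 255 * 255 = 65025, so neither addition overflows
     16 bits of precision.  */
  uint16_t top = (uint16_t) ((y + 257) >> 8);  /* contribution of the top half */
  return (uint8_t) ((y + top) >> 8);           /* combine and narrow           */
}

int main (void)
{
  /* Check the identity for every product of two 8-bit values.  */
  for (unsigned a = 0; a <= 255; a++)
    for (unsigned b = 0; b <= 255; b++)
      assert (div255 ((uint16_t) (a * b)) == (a * b) / 255);
  return 0;
}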
In C and C++, the storage size in bits of a member inside a struct can be specified using bit-fields. Often you only need one or a few bits for a value, and doing so reduces the overall storage required for each struct entry.
Before GCC-13 we didn’t support vectorizing such structures:
struct s
{
  int i : 24;
};

int f (struct s *ptr, unsigned n)
{
  int res = 0;
  for (int i = 0; i < n; ++i)
    res += ptr[i].i;
  return res;
}
Here every value of s.i only requires 24 bits. Vectorizing this with Advanced SIMD is tricky because it requires us to eventually mask the loaded values to extract the bits we want.
Advanced SIMD generates in GCC-13:
.L4:
        ldr     q31, [x2], 16
        shl     v31.4s, v31.4s, 8
        ssra    v30.4s, v31.4s, 8
        cmp     x2, x3
        bne     .L4
And for SVE2:
.L3:
        ld1w    z29.s, p7/z, [x0, x2, lsl 2]
        add     x2, x2, x3
        lsl     z29.s, z29.s, #8
        asr     z29.s, z29.s, #8
        add     z30.s, p7/m, z30.s, z29.s
        whilelo p7.s, w2, w1
        b.any   .L3
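The shl/ssra and lsl/asr pairs above perform the usual shift-based sign extension of the 24-bit field held in each 32-bit lane (ssra then also accumulates the extended value into the running sum). A rough scalar model of the extension step (the helper name is illustrative, not from GCC):

#include <stdint.h>

/* Sign-extend the 24-bit bit-field held in the low bits of a 32-bit lane.
   Relies on >> of a negative value being an arithmetic shift, as GCC
   implements it on AArch64.  */
static inline int32_t sign_extend_24 (uint32_t lane)
{
  return (int32_t) (lane << 8) >> 8;
}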
One of the nice things about the AArch64 register file is how registers of different sizes overlap. We can use this overlap creatively to optimize sequences.
One such example is add/sub:
void f (float *restrict a, float *restrict b, float *res, int n)
{
  for (int i = 0; i < (n & -4); i+=2)
    {
      res[i+0] = a[i+0] + b[i+0];
      res[i+1] = a[i+1] - b[i+1];
    }
}
This pattern is common in scientific computations, where it occurs when dealing with complex numbers. The sequence alternates between the addition and the subtraction results.
In GCC-12, we would vectorize such sequences as:
.L3:
        ldr     q1, [x0, x3]
        ldr     q2, [x1, x3]
        fadd    v0.4s, v1.4s, v2.4s
        fsub    v1.4s, v1.4s, v2.4s
        tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
        str     q0, [x2, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L3
Here we do both the subtraction and the addition and then select the appropriate values with the TBL. But because on AArch64 the 32-bit and 64-bit floating-point register files overlap, we can instead perform an fneg at double the element size of the add. Remember that, unlike integer negate, floating-point negate just flips the sign bit: negating a 64-bit element flips its top bit, which is the sign bit of the odd-numbered 32-bit lane inside it, so the following 32-bit fadd effectively subtracts in those lanes.
As such, in GCC-13 we vectorize such sequences using some creative instructions:
.L3:
        ldr     q31, [x1, x3]
        ldr     q30, [x0, x3]
        fneg    v31.2d, v31.2d
        fadd    v30.4s, v30.4s, v31.4s
        str     q30, [x2, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L3
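To see why a 64-bit fneg gives the alternating sign pattern, here is a small stand-alone sketch of the lane layout (assuming the little-endian layout AArch64 uses; the program is illustrative, not from the original post):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
  /* Two floats as they would sit in one 64-bit vector element.  */
  float lane[2] = { 1.5f, 2.5f };
  uint64_t bits;
  memcpy (&bits, lane, sizeof bits);

  bits ^= 1ULL << 63;            /* what a 64-bit fneg does: flip bit 63 */
  memcpy (lane, &bits, sizeof bits);

  printf ("%f %f\n", lane[0], lane[1]);   /* prints 1.500000 -2.500000 */
  return 0;
}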
When initializing complex numbers (or complex integers using GCC's _Complex extension), a lot of the compiler's code was written with x86-64's internal representation in mind. As a result, on Arm architectures we have suffered from not having an efficient way to initialize such values.
As an example, the following function:
_Complex int f(int a, int b)
{
  _Complex int t = {a, b};
  return t;
}
which creates a complex int from the two values a and b, generates the following on GCC-12 and earlier:
f(int, int):
        bfi     x2, x0, 0, 32
        bfi     x2, x1, 32, 32
        mov     x0, x2
        ret
This is quite slow and is symptomatic of the compiler's internal representation: there is not enough context for architectures other than x86-64 to initialize such values efficiently. In GCC-13 we've fixed this by generating a representation that allows architectures like AArch64 to optimize such full-structure writes while not degrading the code generated for other architectures. In GCC-13 we now generate:
f(int, int):
        bfi     x0, x1, 32, 32
        ret
The result is a sequence that is a third of the size of the GCC-12 one and 66% faster.
Optimizing compilers typically support recognizing sequences such as min/max from various inputs. One case where GCC was lacking was nested min/max sequences.
As an example:
#include <stdint.h>

uint32_t three_min (uint32_t xc, uint32_t xm, uint32_t xy)
{
  uint32_t xk;
  if (xc < xm)
    {
      xk = (uint32_t) (xc < xy ? xc : xy);
    }
  else
    {
      xk = (uint32_t) (xm < xy ? xm : xy);
    }
  return xk;
}
In GCC-12 such sequences generate:
three_min(unsigned int, unsigned int, unsigned int):
        cmp     w0, w1
        bcs     .L2
        cmp     w0, w2
        csel    w0, w0, w2, ls
        ret
.L2:
        cmp     w1, w2
        csel    w0, w1, w2, ls
        ret
because we only recognized one level of nesting. From GCC-13 onwards we can recognize multiple levels and such sequences now generate:
three_min(unsigned int, unsigned int, unsigned int):
        cmp     w1, w2
        csel    w1, w1, w2, ls
        cmp     w1, w0
        csel    w0, w1, w0, ls
        ret
While the scalar code may not look that different, the vector code does, since AArch64 supports vector MIN/MAX instructions.
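The vector benefit shows up in loops. For instance (a hypothetical element-wise variant, not from the original post), the same nested comparison across arrays can now be vectorized using the unsigned-minimum instructions:

#include <stdint.h>

/* Element-wise variant of three_min: with nested min/max recognized, the
   comparisons reduce to two minimum operations, which the vectorizer can
   map onto vector umin instructions.  */
void three_min_loop (uint32_t *restrict out, const uint32_t *restrict c,
                     const uint32_t *restrict m, const uint32_t *restrict y,
                     int n)
{
  for (int i = 0; i < n; i++)
    {
      uint32_t xc = c[i], xm = m[i], xy = y[i];
      out[i] = xc < xm ? (xc < xy ? xc : xy) : (xm < xy ? xm : xy);
    }
}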
GCC-13 also supports code generation for the CSSC (Common Short Sequence Compression) extension, new to the Arm architecture. This extension adds general-purpose register (GPR) instructions for operations we previously only had vector variants of.
One such example is min/max. The code above compiled with the +cssc extension in GCC-13 generates:
three_min(unsigned int, unsigned int, unsigned int):
        umin    w1, w1, w2
        umin    w0, w1, w0
        ret
AArch64 instructions have a limited immediate range, so longer immediates require multiple instructions to materialize. However, we have several instructions that can encode bitmask immediates (i.e. values with a fixed repeating bit pattern).
Better sequences can often be generated by finding the nearest repeating bit pattern and calculating the correction needed to get from that pattern to the desired constant.
As an example, take this function:
unsigned long long foo (void)
{
  return 0x7efefefefefefeffULL;
}
In GCC-12 it generates the following 4 instructions:
foo:
        mov     x0, 65279
        movk    x0, 0xfefe, lsl 16
        movk    x0, 0xfefe, lsl 32
        movk    x0, 0x7efe, lsl 48
        ret
thus incrementally building up the constant. Using the trick mentioned above, in GCC-13 we generate:
foo:
        mov     x0, -72340172838076674
        eor     x0, x0, -9223372036854775807
        ret
This is perhaps better visualized as hexadecimal:
foo:
        mov     x0, 0xFEFEFEFEFEFEFEFE
        eor     x0, x0, 0x8000000000000001
        ret
The nearest repeating constant is 0xFE repeated across the register; the remaining difference is just the top and bottom bits, and that difference (0x8000000000000001) is itself a suitable immediate for the EOR instruction.
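A stand-alone way to check that the pattern-plus-EOR pair reconstructs the original constant (just a compile-time assertion, not something GCC emits):

/* 0xFEFEFEFEFEFEFEFE xor 0x8000000000000001 gives back the original constant.  */
_Static_assert ((0xFEFEFEFEFEFEFEFEULL ^ 0x8000000000000001ULL)
                == 0x7efefefefefefeffULL,
                "repeating pattern xor correction == original constant");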
New to libatomic is support for LSE2 and RCPC through ifuncs. This means that binaries using this new libatomic will be automatically accelerated using the new sequences without any change needed by the user. It also preserves support for systems without these extensions.
One addition in these extensions is that 128-bit loads and stores no longer need a barrier for RELAXED atomics.
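For example, a relaxed 128-bit atomic load such as the one below (a minimal sketch; the function name is illustrative) goes through libatomic and can now resolve, via the ifunc, to a barrier-free sequence on hardware with the relevant extensions:

/* Minimal sketch of a relaxed 128-bit atomic load.  With the new libatomic,
   the library call this expands to is resolved through an ifunc to an
   accelerated, barrier-free sequence on hardware that supports it.  */
unsigned __int128 load128_relaxed (unsigned __int128 *p)
{
  return __atomic_load_n (p, __ATOMIC_RELAXED);
}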
GCC-13 adds support for the following new Arm cores (-mcpu/-march/-mtune values are in brackets):
This release also adds support for Armv9.1-A, Armv9.2-A, and Armv9.3-A.
This release also adds support for Decimal Floating Point in the Binary Integer Decimal format (BID):
struct z
{
  _Decimal64 x[4];
};

_Decimal64 d1 = 25.0dd;
struct z a = { 5.0dd, 6.0dd, 7.0dd, 8.0dd };
struct z b = { 9.0dd, 10.0dd, 11.0dd, 12.0dd };

_Decimal64 f (_Decimal64 x1, _Decimal64 x2)
{
  return x1 + x2;
}
Decimal Floating Point is a format that uses decimal (base-10) fractions for calculations instead of binary (base-2) ones. This avoids the rounding errors that occur when decimal values are represented in the usual binary floating-point formats.
AArch64 now supports this through the BID format and library. The function above generates:
f:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        bl      __bid_adddd3
        ldp     x29, x30, [sp], 16
        ret
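As a quick illustration of why this matters (a hypothetical snippet, assuming a GCC build with decimal floating point enabled), the classic 0.1 + 0.2 case rounds in binary but is exact in decimal:

#include <stdio.h>

int main (void)
{
  double bin = 0.1 + 0.2;              /* binary: not exactly 0.3 */
  _Decimal64 dec = 0.1dd + 0.2dd;      /* decimal: exactly 0.3    */

  printf ("binary  0.1 + 0.2 == 0.3 : %d\n", bin == 0.3);    /* 0 */
  printf ("decimal 0.1 + 0.2 == 0.3 : %d\n", dec == 0.3dd);  /* 1 */
  return 0;
}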
GCC 13 adds support for the Armv8.1-M Pointer Authentication and Branch Target Identification Extension.
This extension, like the A-profile variant, offers some security hardening against Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks. These attacks let an attacker chain together existing, legitimate pieces of code in a program to exploit a system.
To see this in action you can use the same -mbranch-protection= option as on A-profile. For example, the following fragment:
void foo (void);

int main()
{
  foo ();
  return 0;
}
Normally generates with -march=armv8.1-m.main:
main:
        push    {r3, lr}
        bl      foo()
        movs    r0, #0
        pop     {r3, pc}
But with -march=armv8.1-m.main -mbranch-protection=pac-ret it generates:
main:
        pac     ip, lr, sp
        push    {ip, lr}
        bl      foo()
        pop     {ip, lr}
        movs    r0, #0
        aut     ip, lr, sp
        bx      lr
This shows the mitigation at work. Notice the insertion of the pac and aut instructions, which sign and then authenticate the return address.
Read more about M-profile PAC
GCC 13 has a bunch of new features for Arm architectures, but it also sets us up for more aggressive future work to further take advantage of the features available in Advanced SIMD and SVE.
In the meantime, check out last year's entry for GCC 12:
GCC 12 performance features and improvements