The GCC 13 compiler is now out and contains a few goodies for everyone. Arm engineers and the GCC community have been hard at work to produce this release and this blog covers some of the things to expect in GCC 13. Some of the optimizations here are generic and some only apply to Arm architectures.
On AArch64, the ABI specifies that when a value of a type smaller than the register it is passed in, the top bits of that register are undefined. This means that when, for example, a char is passed as an argument to a function, we cannot assume the top bits of the register are zero; only the bottom 8 bits hold the value.
This is also true for smaller types like Booleans, which are extended to a byte in the AArch64 ABI. In C and C++, Booleans are defined to be 0 or 1, so only the bottom bit matters and the next 7 bits are required to be 0.
As an example, this simple function:
#include <stdbool.h>

extern int bar();

int foo (bool a)
{
  if (a)
    return bar ();
  return 5;
}
Generates in GCC-12:
foo:
        tst     w0, 255
        bne     .L4
        mov     w0, 5
        ret
.L4:
        b       bar
Here we test all 8 bits explicitly and then branch on the result. However, the Arm architecture has a handy instruction that tests only a single bit. In GCC 13, we now generate:
foo:
        tbnz    w0, 0, .L4
        mov     w0, 5
        ret
.L4:
        b       bar
This combines the tst + b[ne] pair into a single tb[n]z. Modern code with many Booleans sees a significant performance improvement and a code-size reduction after this change. The improvement is not restricted to Booleans, however: it hooks into GCC's extensive range-calculation framework and so works any time the compiler can see that only a single bit is relevant for the result.
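As an illustration of the non-Boolean case (a hypothetical example, not taken from the original post), a value the compiler can prove holds at most one relevant bit gets the same treatment:

extern int bar (void);

int foo (unsigned a)
{
  unsigned x = a & 1;   /* the range framework knows only bit 0 of x can be set */
  if (x)                /* so this test only needs to examine a single bit      */
    return bar ();
  return 5;
}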
Using the above-mentioned range-based framework, we've started doing more optimizations in the vectorizer as well. One such example is a sequence commonly used in image processing:
#include <stdint.h>

void f(uint8_t* restrict pixel, uint8_t level, int n)
{
  for (int i = 0; i < (n & -16); i+=1)
    pixel[i] = (pixel[i] * level) / 0xff;
}
In this example, a pixel value is multiplied by a level, producing an intermediate result that needs twice the number of bits to express. The result is then scaled back into the range of a pixel by dividing by 255.
In GCC 12 such sequences would generate:
.L3:
        ldr     q1, [x0]
        umull   v2.8h, v1.8b, v5.8b
        umull2  v1.8h, v1.16b, v5.16b
        umull   v0.4s, v2.4h, v3.4h
        umull2  v2.4s, v2.8h, v3.8h
        umull   v4.4s, v1.4h, v3.4h
        umull2  v1.4s, v1.8h, v3.8h
        uzp2    v0.8h, v0.8h, v2.8h
        uzp2    v1.8h, v4.8h, v1.8h
        shrn    v0.8b, v0.8h, 7
        shrn2   v0.16b, v1.8h, 7
        str     q0, [x0], 16
        cmp     x1, x0
        bne     .L3
This sequence benefits from the work in GCC-12.1 to recognize truncations as permutations (read more about that here), but it's still inefficient. The biggest issue is that multiplications are quite expensive, and this sequence uses six of them.
This is one case where SVE does better than Advanced SIMD:
.L3:
        ld1b    z0.h, p1/z, [x0, x3]
        mul     z0.h, p0/m, z0.h, z1.h
        umulh   z0.h, p0/m, z0.h, z2.h
        lsr     z0.h, z0.h, #7
        st1b    z0.h, p1, [x0, x3]
        add     x3, x3, x5
        whilelo p1.h, w3, w2
        b.any   .L3
This takes advantage of SVE's widening loads and narrowing stores to produce a much better sequence. However, it is still far from optimal.
In GCC-13 such sequences now generate:
.L3:
        ldr     q29, [x0]
        umull   v0.8h, v29.8b, v31.8b
        umull2  v29.8h, v29.16b, v31.16b
        addhn   v27.8b, v0.8h, v30.8h
        addhn   v28.8b, v29.8h, v30.8h
        uaddw   v27.8h, v0.8h, v27.8b
        uaddw   v28.8h, v29.8h, v28.8b
        uzp2    v28.16b, v27.16b, v28.16b
        str     q28, [x0], 16
        cmp     x1, x0
        bne     .L3
for Advanced SIMD, and for SVE2:
.L3:
        ld1b    z29.h, p7/z, [x0, x3]
        mul     z29.h, z29.h, z31.h
        addhnb  z28.b, z29.h, z30.h
        addhnb  z28.b, z28.h, z29.h
        st1b    z28.h, p7, [x0, x3]
        add     x3, x3, x4
        whilelo p7.h, w3, w2
        b.any   .L3
In both cases we now have a sequence that gets rid of the extra multiplication and right-shift bottleneck of the division. But how does it work?
The vectorizer uses the range-based framework to detect whenever we are dividing a value of double the element precision by a bitmask constant of the form 2^n - 1, such as 0xff here.
The general transformation is recognizing:
x = y / (2^(precision(y) / 2) - 1)

where precision(y) is the bit width of y; for a 16-bit y the divisor is 2^8 - 1 = 255. For 8-bit result types this is then rewritten into:

x = (y + ((y + 257) >> 8)) >> 8
Normally this would require that the additions be done in double the precision of y so that we don't lose any bits to overflow. However, if we can tell that y + 257 won't overflow, then we can perform the calculation in the same precision.
Essentially, the sequence decomposes the division into two smaller divisions, one for the top part of the number and one for the bottom, and adds the results back together.
To account for the fact that a shift by 8 divides by 256 rather than 255, 1 is added to both parts so that when y is 255 we still get 1 as the answer.
And because the shift amount is half the width of the original data type, we can use the halving instructions the ISA provides (such as the addhn above) to do the operation instead of using actual shifts.
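To see that the rewrite really is a division by 255, here is a scalar model of it (a sketch with an illustrative function name, assuming the input is the product of two 8-bit values so the additions cannot overflow 16 bits):

#include <assert.h>
#include <stdint.h>

/* Scalar model of the rewrite (illustrative name): divide the 16-bit
   product of two 8-bit values by 255 without using a division.  */
static uint8_t div255 (uint16_t y)
{
  /* y is at most 255 * 255 = 65025, so neither addition overflows
     16 bits of precision.  */
  uint16_t top = (uint16_t) ((y + 257) >> 8);  /* contribution of the top half */
  return (uint8_t) ((y + top) >> 8);           /* combine and narrow           */
}

int main (void)
{
  /* Check the identity for every product of two 8-bit values.  */
  for (unsigned a = 0; a <= 255; a++)
    for (unsigned b = 0; b <= 255; b++)
      assert (div255 ((uint16_t) (a * b)) == (a * b) / 255);
  return 0;
}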
In C and C++, the storage size in bits of a member inside a struct can be specified using bit-fields. Often you only need one or a few bits for a value, and doing so reduces the overall storage required for each struct entry.
Before GCC-13 we didn’t support vectorizing such structures:
struct s
{
  int i : 24;
};

int f (struct s *ptr, unsigned n)
{
  int res = 0;
  for (int i = 0; i < n; ++i)
    res += ptr[i].i;
  return res;
}
Here every value of s.i only requires 24 bits. Vectorizing this with Advanced SIMD is tricky because it requires us to eventually mask the loaded values to extract the bits we want.
Advanced SIMD generates in GCC-13:
.L4:
        ldr     q31, [x2], 16
        shl     v31.4s, v31.4s, 8
        ssra    v30.4s, v31.4s, 8
        cmp     x2, x3
        bne     .L4
And for SVE2:
.L3:
        ld1w    z29.s, p7/z, [x0, x2, lsl 2]
        add     x2, x2, x3
        lsl     z29.s, z29.s, #8
        asr     z29.s, z29.s, #8
        add     z30.s, p7/m, z30.s, z29.s
        whilelo p7.s, w2, w1
        b.any   .L3
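The shl/ssra and lsl/asr pairs above perform the usual shift-based sign extension of the 24-bit field held in each 32-bit lane (ssra then also accumulates the extended value into the running sum). A rough scalar model of the extension step (the helper name is illustrative, not from GCC):

#include <stdint.h>

/* Sign-extend the 24-bit bit-field held in the low bits of a 32-bit lane.
   Relies on >> of a negative value being an arithmetic shift, as GCC
   implements it on AArch64.  */
static inline int32_t sign_extend_24 (uint32_t lane)
{
  return (int32_t) (lane << 8) >> 8;
}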
One of the nice things about the AArch64 register file is how registers of different sizes overlap. We can use this overlap creatively to optimize sequences.
One such example is add/sub:
void f (float *restrict a, float *restrict b, float *res, int n)
{
  for (int i = 0; i < (n & -4); i+=2)
    {
      res[i+0] = a[i+0] + b[i+0];
      res[i+1] = a[i+1] - b[i+1];
    }
}
This pattern is common in scientific computations, where it occurs when dealing with complex numbers. The sequence alternates between the addition and the subtraction results.
In GCC-12, we would vectorize such sequences as:
.L3:
        ldr     q1, [x0, x3]
        ldr     q2, [x1, x3]
        fadd    v0.4s, v1.4s, v2.4s
        fsub    v1.4s, v1.4s, v2.4s
        tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
        str     q0, [x2, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L3
Here we do both the subtraction and the addition and then select the appropriate values with the TBL. But because on AArch64 the 32-bit and 64-bit floating-point register files overlap, we can instead perform an fneg at double the element size of the add. Remember that, unlike integer negate, floating-point negate just flips the sign bit: negating a 64-bit element flips its top bit, which is the sign bit of the odd-numbered 32-bit lane inside it, so the following 32-bit fadd effectively subtracts in those lanes.
As such, in GCC-13 we vectorize such sequences using some creative instructions:
.L3:
        ldr     q31, [x1, x3]
        ldr     q30, [x0, x3]
        fneg    v31.2d, v31.2d
        fadd    v30.4s, v30.4s, v31.4s
        str     q30, [x2, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L3
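To see why a 64-bit fneg gives the alternating sign pattern, here is a small stand-alone sketch of the lane layout (assuming the little-endian layout AArch64 uses; the program is illustrative, not from the original post):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
  /* Two floats as they would sit in one 64-bit vector element.  */
  float lane[2] = { 1.5f, 2.5f };
  uint64_t bits;
  memcpy (&bits, lane, sizeof bits);

  bits ^= 1ULL << 63;            /* what a 64-bit fneg does: flip bit 63 */
  memcpy (lane, &bits, sizeof bits);

  printf ("%f %f\n", lane[0], lane[1]);   /* prints 1.500000 -2.500000 */
  return 0;
}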
When initializing complex numbers (or complex integers using GCC's _Complex extension), a lot of the compiler's code was written with x86-64's internal representation in mind. As a result, on Arm architectures we have suffered from not having an efficient way to initialize such values.
As an example, the following function:
_Complex int f(int a, int b)
{
  _Complex int t = {a, b};
  return t;
}
which creates a complex int from the two values a and b, generates the following on GCC-12 and earlier:
f(int, int):
        bfi     x2, x0, 0, 32
        bfi     x2, x1, 32, 32
        mov     x0, x2
        ret
This is quite slow and is symptomatic of the compiler's internal representation: there is not enough context for architectures other than x86-64 to initialize such values efficiently. In GCC-13 we've fixed this by generating a representation that allows architectures like AArch64 to optimize such full-structure writes while not degrading the code generated for other architectures. In GCC-13 we now generate:
f(int, int):
        bfi     x0, x1, 32, 32
        ret
The result is a sequence that is a third of the size of the GCC-12 one and 66% faster.
Optimizing compilers typically support recognizing sequences such as min/max from various inputs. One case where GCC was lacking was nested min/max sequences.
As an example:
#include <stdint.h>

uint32_t three_min (uint32_t xc, uint32_t xm, uint32_t xy)
{
  uint32_t xk;
  if (xc < xm)
    {
      xk = (uint32_t) (xc < xy ? xc : xy);
    }
  else
    {
      xk = (uint32_t) (xm < xy ? xm : xy);
    }
  return xk;
}
In GCC-12 such sequences generate:
three_min(unsigned int, unsigned int, unsigned int):
        cmp     w0, w1
        bcs     .L2
        cmp     w0, w2
        csel    w0, w0, w2, ls
        ret
.L2:
        cmp     w1, w2
        csel    w0, w1, w2, ls
        ret
because we only recognized one level of nesting. From GCC-13 onwards we can recognize multiple levels and such sequences now generate:
three_min(unsigned int, unsigned int, unsigned int):
        cmp     w1, w2
        csel    w1, w1, w2, ls
        cmp     w1, w0
        csel    w0, w1, w0, ls
        ret
While the scalar code may not look that different, the vector code does, since AArch64 supports vector MIN/MAX instructions.
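The vector benefit shows up in loops. For instance (a hypothetical element-wise variant, not from the original post), the same nested comparison across arrays can now be vectorized using the unsigned-minimum instructions:

#include <stdint.h>

/* Element-wise variant of three_min: with nested min/max recognized, the
   comparisons reduce to two minimum operations, which the vectorizer can
   map onto vector umin instructions.  */
void three_min_loop (uint32_t *restrict out, const uint32_t *restrict c,
                     const uint32_t *restrict m, const uint32_t *restrict y,
                     int n)
{
  for (int i = 0; i < n; i++)
    {
      uint32_t xc = c[i], xm = m[i], xy = y[i];
      out[i] = xc < xm ? (xc < xy ? xc : xy) : (xm < xy ? xm : xy);
    }
}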
GCC-13 also supports code generation for the CSSC (Common Short Sequence Compression) extension, new to the Arm architecture. This extension adds general-purpose register (GPR) instructions for operations we previously only had vector variants of.
One such example is min/max. The code above compiled with the +cssc extension in GCC-13 generates:
three_min(unsigned int, unsigned int, unsigned int):
        umin    w1, w1, w2
        umin    w0, w1, w0
        ret
AArch64 instructions have a limited immediate range, so longer immediates require multiple instructions to materialize. However, we have several instructions that can encode bitmask immediates (i.e. values with a fixed repeating bit pattern).
Better sequences can often be generated by finding the nearest repeating bit pattern and calculating the correction needed to get from that pattern to the desired constant.
As an example, take this function:
unsigned long long foo (void)
{
  return 0x7efefefefefefeffULL;
}
In GCC-12 it generates the following 4 instructions:
foo:
        mov     x0, 65279
        movk    x0, 0xfefe, lsl 16
        movk    x0, 0xfefe, lsl 32
        movk    x0, 0x7efe, lsl 48
        ret
thus incrementally building up the constant. Using the trick mentioned above, in GCC-13 we generate:
foo:
        mov     x0, -72340172838076674
        eor     x0, x0, -9223372036854775807
        ret
This is perhaps better visualized as hexadecimal:
foo:
        mov     x0, 0xFEFEFEFEFEFEFEFE
        eor     x0, x0, 0x8000000000000001
        ret
The nearest repeating constant is 0xFE repeated across the register; the remaining difference is just the top and bottom bits, and that difference (0x8000000000000001) is itself a suitable immediate for the EOR instruction.
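A stand-alone way to check that the pattern-plus-EOR pair reconstructs the original constant (just a compile-time assertion, not something GCC emits):

/* 0xFEFEFEFEFEFEFEFE xor 0x8000000000000001 gives back the original constant.  */
_Static_assert ((0xFEFEFEFEFEFEFEFEULL ^ 0x8000000000000001ULL)
                == 0x7efefefefefefeffULL,
                "repeating pattern xor correction == original constant");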
New to libatomic is support for LSE2 and RCPC through ifuncs. This means that binaries using this new libatomic will be automatically accelerated using the new sequences without any change needed by the user. It also preserves support for systems without these extensions.
One addition in these extensions is that 128-bit loads and stores no longer need a barrier for RELAXED atomics.
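For example, a relaxed 128-bit atomic load such as the one below (a minimal sketch; the function name is illustrative) goes through libatomic and can now resolve, via the ifunc, to a barrier-free sequence on hardware with the relevant extensions:

/* Minimal sketch of a relaxed 128-bit atomic load.  With the new libatomic,
   the library call this expands to is resolved through an ifunc to an
   accelerated, barrier-free sequence on hardware that supports it.  */
unsigned __int128 load128_relaxed (unsigned __int128 *p)
{
  return __atomic_load_n (p, __ATOMIC_RELAXED);
}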
GCC-13 adds support for the following new Arm cores (-mcpu/-march/-mtune values are in brackets):
This release also adds support for Armv9.1-A, Armv9.2-A, and Armv9.3-A.
This release also adds support for Decimal Floating Point in the Binary Integer Decimal format (BID):
struct z
{
  _Decimal64 x[4];
};

_Decimal64 d1 = 25.0dd;
struct z a = { 5.0dd, 6.0dd, 7.0dd, 8.0dd };
struct z b = { 9.0dd, 10.0dd, 11.0dd, 12.0dd };

_Decimal64 f (_Decimal64 x1, _Decimal64 x2)
{
  return x1 + x2;
}
Decimal Floating Point is a format that uses decimal (base-10) fractions for calculations instead of binary (base-2) ones. This avoids the rounding errors that occur when decimal values are represented in the usual binary floating-point formats.
AArch64 now supports this through the BID format and library. The function above generates:
f:
        stp     x29, x30, [sp, -16]!
        mov     x29, sp
        bl      __bid_adddd3
        ldp     x29, x30, [sp], 16
        ret
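As a quick illustration of why this matters (a hypothetical snippet, assuming a GCC build with decimal floating point enabled), the classic 0.1 + 0.2 case rounds in binary but is exact in decimal:

#include <stdio.h>

int main (void)
{
  double bin = 0.1 + 0.2;              /* binary: not exactly 0.3 */
  _Decimal64 dec = 0.1dd + 0.2dd;      /* decimal: exactly 0.3    */

  printf ("binary  0.1 + 0.2 == 0.3 : %d\n", bin == 0.3);    /* 0 */
  printf ("decimal 0.1 + 0.2 == 0.3 : %d\n", dec == 0.3dd);  /* 1 */
  return 0;
}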
GCC 13 adds support for the Armv8.1-M Pointer Authentication and Branch Target Identification Extension.
This extension, like the A-profile variant, offers some security hardening against Return-Oriented Programming (ROP) and Jump-Oriented Programming (JOP) attacks. These attacks let an attacker chain together existing, legitimate pieces of code in a program to exploit a system.
To see this in action you can use the same -mbranch-protection= option as on A-profile. For example, the following fragment:
void foo (void);

int main()
{
  foo ();
  return 0;
}
Normally generates with -march=armv8.1-m.main:
main:
        push    {r3, lr}
        bl      foo()
        movs    r0, #0
        pop     {r3, pc}
But with -march=armv8.1-m.main -mbranch-protection=pac-ret it generates:
main:
        pac     ip, lr, sp
        push    {ip, lr}
        bl      foo()
        pop     {ip, lr}
        movs    r0, #0
        aut     ip, lr, sp
        bx      lr
This shows the mitigation at work. Notice the insertion of the pac and aut instructions, which sign and then authenticate the return address.
Read more about M-profile PAC
GCC 13 has a bunch of new features for Arm architectures, but it also sets us up for more aggressive future work to further take advantage of the features available in Advanced SIMD and SVE.
In the meantime, check out last year's entry for GCC 12:
GCC 12 performance features and improvements