Performance improvements in GCC 11

May 6, 2021

9 minute read time.

Welcome back. This year, as with those before it, the new GCC release comes with a slew of performance improvements. As always Arm has worked together with the GCC community to cook up some nice things. Some general, some Arm specific. This year we have focused on enabling the use of more Arm instructions in the vectorizer along with improving our codegen for intrinsics and heavily optimized applications can use them with confidence. Read on for all the goodies in this year's GCC 11 bag.

Improved inlined memcpy and memset

Optimizing compilers like GCC try to avoid emitting library calls whenever it is faster to handle the behavior of the call inline. These built-in functions, as GCC calls them, include some well-known ISO C90 functions including memset and memcpy.
When the call to these two functions is made using a constant size the compiler may choose to instead emit the equivalent instructions directly in the assembly. This is done with a trade-off between code-size and performance.

As an example, the following function:

void fc (char *restrict a, char *restrict b)
{
    memcpy (a, b, 48);
}

Is inlined by the compiler and instead of a call to memcpy GCC 10 generates:

fc:
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        ldp     x2, x3, [x1, 16]
        stp     x2, x3, [x0, 16]
        ldp     x2, x3, [x1, 32]
        stp     x2, x3, [x0, 32]
        ret

Which is performing the memcpy directly using loads and stores. However, this is not the most optimal form as it uses integer registers which have a maximum size of 64-bits to do the copy. If we were to use SIMD registers, we could instead copy 128-bits using a single instruction or 256-bits using a single pair instruction.

GCC 11 now favors using SIMD registers for doing inlined memset and memcpy whenever the amount to be copied or set is smaller than 256 bytes (unless optimizing for code size, in which case the limit is set to 128 bytes).

For the example GCC 11 now generates:

fc:
        ldp     q1, q2, [x1]
        ldr     q0, [x1, 32]
        stp     q1, q2, [x0]
        str     q0, [x0, 32]
        ret

It is worth noting that even though you may not have explicitly typed memcpy or memset the compiler may have recognized the sequence during optimizations. As an example, at -O2 the compiler will automatically idiom recognize a loop such as:

void f (char *restrict a, char *restrict b)
{
    for (int x = 0; x < 48; x++)
      a[x] = b[x];
}

As a memcpy between a and b.

Using conditional selects to perform conditional stores

AArch64 does not have conditional stores as part of the ISA, however we can make a conditional store by using a conditional select (csel) and then using an unconditional store.

That would allow us to remove more branches in the output.

Consider the example:

unsigned test(unsigned k, unsigned b) {
        unsigned a[2];
        // Initialize a as needed
        if (b < a[k]) {
                a[k] = b;
        }
        return a[0]+a[2];
}

Where the store of a[k] depends on whether you enter the conditional or not. The array a is intentionally left uninitialized in this example to keep the output easier to follow.

As such GCC 10 generated:

test:
        sub     sp, sp, #16
        uxtw    x0, w0
        add     x2, sp, 8
        ldr     w3, [x2, x0, lsl 2]
        cmp     w3, w1
        bls     .L6
        str     w1, [x2, x0, lsl 2]
.L6:
        ldr     w1, [sp, 8]
        ldr     w0, [sp, 16]
        add     sp, sp, 16
        add     w0, w1, w0
        ret

Where it tests w1 and w3 to determine whether it needs to perform the store.

However, notice that the value being stored a[k] is governed by a load that we know must also be writable. That is, the array a is stored on the stack and the stack cannot be read-only. Because of these two conditions, we know it is safe to unconditionally store to a[k] and so the branch is converted into a conditional select of the original value of a[k] and the new value b:

test:
        sub     sp, sp, #16
        uxtw    x0, w0
        add     x3, sp, 8
        ldr     w2, [x3, x0, lsl 2]
        cmp     w2, w1
        csel    w2, w2, w1, ls
        str     w2, [x3, x0, lsl 2]
        ldr     w1, [sp, 8]
        ldr     w0, [sp, 16]
        add     sp, sp, 16
        add     w0, w1, w0
        ret

Better support for widening instructions

AArch64 has various SIMD instructions that perform an arithmetic operation and a widening and shortening in one. By using these instructions, we can reduce the widening operations emitted during vectorization. Some were implemented before (such as widening multiply) but plenty more such as widening shifts were not.

Consider the following example:

void wide1 (char *restrict a, short *restrict b, int n)
{
    for (int x = 0; x < 16; x++)
      b[x] = a[x] << 8;
}

GCC 10 generated:

wide1:
        ldr     q0, [x0]
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        shl     v1.8h, v1.8h, 8
        shl     v0.8h, v0.8h, 8
        stp     q1, q0, [x1]
        ret

Where we emit explicit zero extensions. This has a secondary effect in that it also increases the costs of vectorization. When weighing the cost of the vector vs scalar versions of the function if, there are enough of these superfluous extensions the vectorizer decides it is too expensive to vectorize. Adding support for these instructions results in more and better vectorization.

In GCC 11 we now generate:

wide1:
        ldr     q0, [x0]
        shll    v1.8h, v0.8b, 8
        shll2   v0.8h, v0.16b, 8
        stp     q1, q0, [x1]
        ret

Improved AArch64 intrinsics code-gen

Many of the intrinsics in the arm_neon.h header were defined using simple inline assembly. The problem with inline assembly is that it is completely opaque to the compiler.
The compiler has no idea about the operational semantics of the instructions and does not know what type of instruction it is and so cannot do any sort of instruction scheduling.

As an example, if vadd_f32 and vmul_f32 were defined as inline assembly, the compiler would never be able to transform them into vfma_f32. The compiler would also always emit them grouped just as specified in the source.

Lastly, when the instruction is accumulating this often causes the register allocator to generate unneeded register copies.

This caused AArch64 intrinsics code generated by GCC to be larger and perform worse than other AArch64 compilers.
Consider the simple example:

int32x4_t intrinsics(int32x4_t acc, int16x4_t b, int16x4_t c) {
  return vmlal_n_s16(acc, b, c[3]);
}

Where GCC 10 would generate:

intrinsics:
        dup     h2, v2.h[3]
        smlal v0.4s,v1.4h,v2.h[0]
        ret

But with GCC 11 happily generates:

intrinsics:
        smlal   v0.4s, v1.4h, v2.h[3]
        ret

Many intrinsics have been converted and semantically described in GCC 11 with the rest to be done in GCC 12.
An example of the impact of this work can be seen in the popular libjpeg-turbo library:

libjpeg-turbo benchmarks neoverse-n1

To get started with optimizing your applications using intrinsics do check out the resources Arm makes available for this:

View Neon app developers resources

Visit the Neon intrinsics search engine

Auto-vectorization of complex numbers arithmetic

Instructions performing complex arithmetic exist in Armv8.3-a (AArch32 and AArch64), Armv8-R AArch64, SVE, SVE2, and MVE, covering the spread of Arm architecture profiles.
As a basic example, consider the complex addition where the second operand is rotated by 90* around the argand plane:

#include <complex.h>

void cadd (complex float *restrict a, complex float *restrict b, complex float *restrict c, int n)
{
    for (int x = 0; x < n; x++)
      c[x] = a[x] + (b[x] * I);
}

Which for Neon on AArch64 generates:

cadd:
        … loop prologue …
.L14:
        ld2     {v2.4s - v3.4s}, [x4], 32
        ld2     {v4.4s - v5.4s}, [x6], 32
        fsub    v0.4s, v2.4s, v5.4s
        fadd    v1.4s, v4.4s, v3.4s
        st2     {v0.4s - v1.4s}, [x5], 32
        cmp     x7, x4
        bne     .L14
        and     w6, w3, -4
        tst     x3, 3
        beq     .L11
.L13:
        … scalar loop end …
.L11:
        ret
.L16:
        mov     w6, 0
        b       .L13

Essentially to vectorize the compiler must use gathers to de-interleave the imaginary and real components of the number, perform the desired computation, and then interleave the values again to store them in c.
This loop also has a higher minimum element count to hit the vector code. It requires at least 8 complex floats for the vectorized loop to be entered.
For GCC 11 with -Ofast -march=armv8.3-a the compiler now generates:

cadd:
        … loop prologue …
.L13:
        ldr     q0, [x0, x4]
        ldr     q1, [x1, x4]
        fcadd   v0.4s, v0.4s, v1.4s, #90
        str     q0, [x2, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L13
        and     w4, w3, -2
        tbz     x3, 0, .L10
.L12:
        uxtw    x3, w4
        ldr     d0, [x0, x3, lsl 3]
        ldr     d1, [x1, x3, lsl 3]
        fcadd   v0.2s, v0.2s, v1.2s, #90
        str     d0, [x2, x3, lsl 3]
.L10:
        ret
.L15:
        mov     w4, 0
        b       .L12

This loop has a couple of advantages:

It requires only a single complex number to enter the vectorized loop.
It does not generate a scalar loop epilogue. Complex numbers are always a pair of numbers and only a single pair is required to enter the vector code. Because of this there cannot ever be an uneven number of elements in the array. As such a scalar epilogue is not needed.

GCC 11 recognizes the following sequences:

Addition rotated by 90*
Addition rotated by 270*
Multiply and Subtract
Conjugate, Multiple and Subtract
Multiply and Add
Conjugate, Multiply and Add
Multiply
Conjugate Multiply

The implementation in the compiler does not rely on the elements being complex numbers. The matching is performed based on the shape of the computation and not the complex type.
As an example, the following two examples are both recognized as complex additions:

void cadd_pat (float *restrict a, float *restrict b, float *restrict c, int n)
{
  for (int i=0; i < n; i+=2)
    {
      c[i] = a[i] - b[i+1];
      c[i+1] = a[i+1] + b[i];
    }
}

void cadd_pat_unrolled (float *restrict a, float *restrict b, float *restrict c, int n)
{
  for (int i=0; i < n; i+=4)
    {
      c[i] = a[i] - b[i+1];
      c[i+1] = a[i+1] + b[i];
      c[i+2] = a[i+2] - b[i+3];
      c[i+3] = a[i+3] + b[i+2];
    }
}

Because of this, the implementation is not restricted to just floating point but can also handle integer versions of these sequences. As an example:

void cadd_int (int *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i=0; i < n; i+=2)
    {
      c[i] = a[i] - b[i+1];
      c[i+1] = a[i+1] + b[i];
    }
}

Generates with -O3 -march=armv8.2-a+sve2:

cadd_int:
        cmp     w3, 0
        ble     .L28
        sub     w4, w3, #1
        mov     x3, 0
        lsr     w4, w4, 1
        add     w4, w4, 1
        lsl     x4, x4, 1
        whilelo p0.s, xzr, x4
.L30:
        ld1w    z0.s, p0/z, [x0, x3, lsl 2]
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        cadd    z0.s, z0.s, z1.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L30
.L28:
        ret

While the floating-point cases are more common, these integer versions are used quite often in video and image processing. One particularly common use-case is motion estimation for videos when implementing sum of absolute transformed differences (SATD).

More to come

This year’s GCC has continued the focus on real world user performance issues and optimizing for our recently announced CPUs.

But this is not the end, in GCC 12 expect more intrinsics improvements along with more SVE optimizations.
Speaking of SVE, stay tuned for a blog post on the exciting SVE specific changes in GCC 11.

In the meantime, be sure to check out which performance changes were done in GCC 10

GCC 10 better and faster than ever

Tools, Software and IDEs blog

CPython Core Dev Sprint 2025 at Arm Cambridge: The biggest one yet

Diego Russo

For one week, Arm’s Cambridge HQ became the heart of Python development. Contributors globally came together for the CPython Core Developer Sprint.
- October 9, 2025
Python on Arm: 2025 Update

Diego Russo

Python powers applications across Machine Learning (ML), automation, data science, DevOps, web development, and developer tooling.
- August 21, 2025
Product update: Arm Development Studio 2025.0 now available

Stephen Theobald

Arm Development Studio 2025.0 now available with Arm Toolchain for Embedded Professional.
- July 18, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog