New performance features and improvements in GCC 12

Tamar Christina
May 10, 2022
24 minute read time.

Welcome to the GCC 12 issue of Arm’s annual performance improvements blog. As always, this year’s changes are a combination of the work that Arm and the community have done in GCC. With GCC 12, we have focused on laying the groundwork with changes that put us in a better position for future optimization work. That said, we still managed plenty of improvements for both Advanced SIMD and SVE.

This year’s GCC is the fastest GCC yet, as measured with SPECrate® 2017 Integer on Neoverse platforms:

[Figure: GCC 7-12 SPECrate® 2017 Integer on Neoverse platforms, estimated relative improvements]

These are all rate=1 (single core) improvements. Let us dive in to see how we got here.

New performance extensions

GCC 12 brings with it compiler support for various architecture features and improves auto-vectorization support for others. Chief among these are the following:

Mixed sign dot-product

Armv8.6-A introduced a new dot-product instruction, usdot, for when the signs of the operands differ. This instruction is enabled with the +i8mm compiler flag.

Starting with GCC 12, the auto-vectorizer can automatically recognize and use this instruction for SVE, AArch64, and AArch32. For example, the following sequence

#define N 480

unsigned int
f (unsigned int res, signed char *restrict a,
   unsigned char *restrict b)
{
  for (__INTPTR_TYPE__ i = 0; i < N; ++i)
    {
      int av = a[i];
      int bv = b[i];
      signed short mult = av * bv;
      res += mult;
    }
  return res;
}

used to vectorize as:

f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q2, [x1, x3]
        ldr     q1, [x2, x3]
        add     x3, x3, 16
        sxtl    v4.8h, v2.8b
        sxtl2   v3.8h, v2.16b
        uxtl    v2.8h, v1.8b
        uxtl2   v1.8h, v1.16b
        mul     v2.8h, v2.8h, v4.8h
        mul     v1.8h, v1.8h, v3.8h
        saddw   v0.4s, v0.4s, v2.4h
        saddw2  v0.4s, v0.4s, v2.8h
        saddw   v0.4s, v0.4s, v1.4h
        saddw2  v0.4s, v0.4s, v1.8h
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret

Now with GCC 12 we get a much nicer sequence:

f:
        movi    v0.4s, 0
        mov     x3, 0
        .p2align 3,,7
.L2:
        ldr     q1, [x2, x3]
        ldr     q2, [x1, x3]
        usdot   v0.4s, v1.16b, v2.16b
        add     x3, x3, 16
        cmp     x3, 480
        bne     .L2
        addv    s0, v0.4s
        fmov    w1, s0
        add     w0, w0, w1
        ret

Instructions to accelerate memory operations

Armv8.8-A added several new memory copy and memory set instructions to the architecture to accelerate these often-used operations. When doing a small memcpy, memset, or memmove, compilers typically inline the operation. However, for a large memcpy the compiler instead emits a call to the implementation in the C standard library. As an example:

void copy (char * restrict a, char * restrict b, int n)
{
    for (int i = 0; i < n; i++)
      a[i] = b[i];
}

emits at -O2

copy:
        cmp     w2, 0
        ble     .L1
        uxtw    x2, w2
        b       memcpy
.L1:
        ret

which, unless n == 0, makes a call to the memcpy function. Starting with GCC 12, when using the +mops extension, the compiler instead emits the new memcpy instructions:

copy:
        cmp     w2, 0
        ble     .L1
        uxtw    x2, w2
        cpyfp   [x0]!, [x1]!, x2!
        cpyfm   [x0]!, [x1]!, x2!
        cpyfe   [x0]!, [x1]!, x2!
.L1:
        ret

More intrinsics optimizations

With GCC 12, we have continued to enhance GCC’s intrinsics support. This year has focused on starting to migrate intrinsic functions from RTL definitions into GIMPLE, to allow the front end of the compiler to understand the semantics of the instructions. The goal is to effectively remove the physical arm_neon.h file from the compiler source tree entirely and instead synthesize the header in the compiler itself. While this does not have any direct effect for end users, it has great benefits for maintainability. It also allows us to tie together concepts that would be difficult to express when having to use the C language to describe them. This GCC version starts by moving the structural types, for example int32x4x2_t and related types, out of the header file and into the compiler directly. One of the primary reasons for doing this is to fix register allocation issues that arose when these types were used.

Register allocation fixes

To handle the new types that were moved out of the header, we introduced several new full and partial structure types in the compiler. In versions before GCC 12, we used a generic “bag of bits” to represent these intrinsics. As an example, the vst2q_s32 intrinsic used a type that says the intrinsic generates a 256-bit “bag of bits”. We would then generate inserts into this bag of bits so we could fill it up piecewise. The issue with this is that the compiler needs to know the lifetime of all the bits in the bag the moment the bag is created. The second major downside is that the structures are copied piecewise, and we relied on the register allocator to consolidate the copies. As a result, it would often manage to eliminate one copy but not the other. For the example:

#include <arm_neon.h>

void foo(int *dst, int32x4x2_t a)
{
    vst2q_s32(dst, a);
}

GCC used to generate:

foo:
        mov     v2.16b, v0.16b
        mov     v3.16b, v1.16b
        st2     {v2.4s - v3.4s}, [x0]
        ret

where the register allocator was unable to consolidate all the copies. Starting with GCC 12, we now simply generate:

foo:
        st2     {v0.4s - v1.4s}, [x0]
        ret

This is done by using a new type that says "this type uses two sequential 128-bit vector registers starting at register n", which means no additional copies are needed to construct or deconstruct the type.

GIMPLEfy loads and shifts

LD1, ST1 and left and right shift intrinsics are among the first to be described using GIMPLE (GCC's mid-end IR language) rather than RTL (GCC's back-end IR language). A simple example to show the benefits of this is:

#include <arm_neon.h>

int32x4_t foo(int32x4_t a)
{
    int32_t temp[4];
    vst1q_s32(temp, a);
    return vld1q_s32(temp);
}

This code loads from and stores to a local array and so is a no-op. However, because the front end could not see what the load and store intrinsics were doing, we could only eliminate the instructions late in RTL. By that point the stack frame had already been laid out, so before GCC 12 we would generate:

foo:
        sub     sp, sp, #16
        add     sp, sp, 16
        ret

GCC 12 can now see inside the definition and generate:

foo:
        ret

Register allocation under high register pressure

GCC has a very advanced register allocator that gets things right most of the time. To do so, it relies heavily on receiving correct costing input from each target backend and on information from mid-end passes. Two key pieces of information it requires are basic block frequencies and branch probabilities. An example situation where this becomes very apparent is under high register pressure combined with function calls:

void bar (int, int, int, int);

int foo (int x, char* foo) {
  int tmp = x * 753;
  int t2 = tmp + 7;
  int t3 = tmp * 7;
  int c1 = 753;
  int c2 = c1 + 7;
  int c3 = c1 * 7;
  for (int i = 0; i < 1024; i++) {
	if (__builtin_expect_with_probability (foo[i] != 0, 1, SPILLER))
	  bar(x, tmp, t2, t3);
	c1 += foo[i+1];
	c2 *= foo[i+1];
	c3 += c2;
  }
  return c1 + c2 + c3;
}

With this example, we can tweak the branch probabilities by changing the value of SPILLER, and we can simulate high register pressure by taking other registers out of consideration for register allocation. If we look at the output of this example compiled with -DSPILLER=0.5 -fno-shrink-wrap -fno-schedule-insns -O3 -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -fno-reorder-blocks, we find several issues.

We can see that just by tweaking the static branch probabilities we could get the register allocator to generate better or worse code. As an example, with -DSPILLER=0.5 we get this snippet:

.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldp     w1, w0, [sp, 72]
        stp     w2, w3, [sp, 56]
        str     x7, [sp, 64]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldr     x7, [sp, 64]
        add     w22, w22, w0
        ldp     w2, w3, [sp, 56]
        mul     w20, w20, w0
        add     w21, w21, w20
        cmp     x19, x7
        bne     .L5

However, with a simple 1% change in the probability (-DSPILLER=0.51), we suddenly get an additional reload:

.L5:
        ldrb    w0, [x19]
        cbz     w0, .L2
        ldr     w0, [sp, 76]
        stp     w1, w2, [sp, 56]
        str     w3, [sp, 72]
        bl      bar
        ldrb    w0, [x19, 1]!
        ldp     w1, w2, [sp, 56]
        add     w21, w21, w0
        ldr     w3, [sp, 72] <<<< here
        mul     w20, w20, w0
        ldr     x0, [sp, 64]
        add     w22, w22, w20
        cmp     x19, x0
        bne     .L5

Part of this was because, historically, the AArch64 back-end costed a load and a store the same. This is not correct, as modern CPUs have a store buffer. The store buffer makes stores significantly cheaper than loads, because the CPU does not need to wait for the store to complete before continuing. The register allocator is deciding between two options: spill a value outside of the loop’s live range, which frees up registers and avoids spilling at the call site, or do reloads around the call itself. Spilling outside of the loop requires multiple stores, so the allocator compares the cost of these stores against the loads and stores needed around the call. Because of the incorrect costs, we would choose to spill around the call instead.

The second fix to GCC’s register allocator is how it handles "soft" conflicts. A conflict is considered soft when a value is live throughout a range (for example, a loop) and the range contains branches where the value is not used. In our example, the values c1, c2, t2, and c3 are live but not used inside the branch with the function call. Typically, a register allocator handles such situations by trying to split the live range, but since the values are live throughout, the range cannot be reduced. With the changes, we now choose to allocate them to callee-saved registers. As a consequence, they no longer need to be spilled before the function call.

More efficient GOT accesses

GCC used to emit a GOT access as separate ADRP and LDR instructions. This allowed them to be scheduled independently and to use different registers:

ADRP   x0, :got: symbol
... ; unrelated instructions
LDR      x1, [x0, :got_lo12: symbol]

GCC 12 now always emits the ADRP and LDR as consecutive instructions using the same register. This reduces register pressure with -fPIC/-fPIE, resulting in more efficient code. For example, Perlbench is 1.8% faster and 0.9% smaller with -fPIC.

ADRP   x1, :got: symbol
LDR      x1, [x1, :got_lo12: symbol]

Another advantage is that linkers can now optimize GOT accesses without having to introduce new relocations, because the two instructions always appear as a sequential block.
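
As a minimal illustrative sketch (the variable and function names below are ours, not from the post), reading an external global with -O2 -fPIC is expected to go through the GOT and produce an ADRP/LDR pair of the kind shown above:

extern long counter;   /* defined elsewhere, e.g. in a shared library */

long
read_counter (void)
{
  /* With -O2 -fPIC the address of 'counter' is loaded from the GOT
     (ADRP + LDR with :got:/:got_lo12: relocations) before the value
     itself is loaded.  */
  return counter;
}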

Improved address rematerialization

When register pressure is high, the register allocator spills some values to the stack. Spilling is expensive due to the cost of loads and stores, so the register allocator tries hard to minimize the number of spills. Some values, such as immediates and addresses, can be recomputed instead. Recomputing is better than spilling since ALU instructions are very fast. This alternative to spilling is called rematerialization. GCC 12 rematerializes addresses more often on AArch64, which improves performance and reduces the code size of applications that use many global variables.

Spilling of an ADRP

With the previous approach, we may use more registers than needed and allow code to be scheduled in between the ADRP and its uses. When the function has high register pressure, this could lead to spilling the ADRP value, for example:

ADRP x0, symbol
LDR    x1, [x0, :lo12: symbol]
STR   x0, [sp, 32]  // spill ADRP value
... code using many registers...
LDR   x2, [sp, 32]  // restore ADRP
STR   x3, [x2, :lo12: symbol]

Rematerialization of an ADRP

Now, instead of spilling the ADRP, we simply rematerialize it, which is much cheaper and also frees up an additional register:

ADRP x0, symbol 
LDR    x1, [x0, :lo12: symbol]
... code using many registers...
ADRP x2, symbol   // rematerialize ADRP
STR   x3, [x2, :lo12: symbol]

Common subexpression elimination (CSE) of constants

With GCC 12, we have started beefing up GCC’s CSE of constants. AArch64 has a limited range of constants that can fit in a single instruction. To create complex constants, we have two options: either use a sequence of mov/movk instructions or use a literal pool.

GCC already does quite a lot of optimization at parse time. One of these optimizations is pulling constants out of arrays. As an example:

#include <stdint.h>
#include <arm_neon.h>

uint64_t
test (uint64_t a, uint64x2_t b, uint64x2_t* rt)
{
  uint64_t arr[2] = { 0x0942430810234076UL, 0x0942430810234076UL};
  uint64_t res = a | arr[0];
  uint64x2_t val = vld1q_u64 (arr);
  *rt = vaddq_u64 (val, b);
  return res;
}

The expression res is represented in GIMPLE as “a | 0x0942430810234076UL”, and because the other occurrence of the constant is inside a vector, we could not CSE the constant. As a result, we would materialize the same constant twice:

test:
        adrp    x2, .LC0
        sub     sp, sp, #16
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x2, 16502
        movk    x2, 0x1023, lsl 16
        movk    x2, 0x4308, lsl 32
        add     v1.2d, v1.2d, v0.2d
        movk    x2, 0x942, lsl 48
        orr     x0, x0, x2
        str     q1, [x1]
        add     sp, sp, 16
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798

GCC 12 can now not only CSE the constant, but it can also decide where it is cheaper to materialize it: on either the SIMD or the general register side. In this case, the constant is needed as a vector on the SIMD side, so it is cheaper to materialize it there. With GCC 12 we now generate:

test:
        adrp    x2, .LC0
        ldr     q1, [x2, #:lo12:.LC0]
        add     v0.2d, v0.2d, v1.2d
        fmov    x2, d1
        str     q0, [x1]
        orr     x0, x0, x2
        ret
.LC0:
        .xword  667169396713799798
        .xword  667169396713799798

Bit optimizations

In GCC 12, we have taught the compiler to do various bit optimizations with the goal of improving latency and throughput. Vector shifts on Arm CPUs are usually throughput limited, so avoiding them often has great benefits. The following are a couple of examples:

Faster unsigned narrowing

In many image and video processing workloads it is common to do operations that truncate or shift by half the width of a vector. As an example:

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 16;
}

This is a common pattern where an operation on a small data type widens the result and only the top bits of that wider result are needed. Previously, we would generate:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        sshr    v1.4s, v1.4s, 16
        sshr    v0.4s, v0.4s, 16
        xtn     v2.4h, v1.4s
        xtn2    v2.8h, v0.4s
        str     q2, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4

With GCC 12 we now generate:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        uzp2    v0.8h, v1.8h, v0.8h
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

The uzp operation is described by the following image:

[Figure: uzp2 operational semantics]
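
For readers who prefer intrinsics, here is a small illustrative sketch (ours, not from the post) of the same idea: on a little-endian target, the high 16 bits of each 32-bit product are the odd-numbered 16-bit lanes, which is exactly what uzp2 selects:

#include <arm_neon.h>

/* Pack the top 16 bits of each 32-bit lane of 'lo' and 'hi' into one
   vector.  On little-endian AArch64 the high half of each 32-bit lane
   is the odd-numbered 16-bit lane, so uzp2 can select them directly.  */
int16x8_t
take_high_halves (int32x4_t lo, int32x4_t hi)
{
  return vuzp2q_s16 (vreinterpretq_s16_s32 (lo),
                     vreinterpretq_s16_s32 (hi));
}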

Extract and replicate sign bit

Another common operation is creating a mask that is all 1s when the top bit is 1 and all 0s otherwise; in other words, checking whether the number is negative. An example is:

void e (int * restrict a, int *b, int n)
{
    for (int i = 0; i < n; i++)
      b[i] = a[i] >> 31;
}

Which used to generate:

.L4:
        ldr     q0, [x0, x3]
        sshr    v0.4s, v0.4s, 31
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x3, x4
        bne     .L4

Now generates:

.L4:
        ldr     q0, [x0, x3]
        cmlt    v0.4s, v0.4s, #0
        str     q0, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4

While these operations have the same latency on almost all AArch64 CPUs, the comparison has a higher throughput than the shift in virtually all cases.
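
As an illustrative intrinsics-level sketch (ours, not from the post), the two forms of the mask can be written as:

#include <arm_neon.h>

/* Build an all-ones/all-zeros mask from the sign bit of each lane.  The
   shift form maps to SSHR, the compare form to CMLT #0; the compare
   typically has higher throughput on AArch64 cores.  */
uint32x4_t
sign_mask_shift (int32x4_t v)
{
  return vreinterpretq_u32_s32 (vshrq_n_s32 (v, 31));
}

uint32x4_t
sign_mask_compare (int32x4_t v)
{
  return vcltzq_s32 (v);   /* lane < 0 ? all-ones : all-zeros */
}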

Zero cost zeros

On many Arm CPUs, a vector of zeros can be created very cheaply, or even for free, using movi with a zero immediate. This opens up a lot of optimizations, as we can use a vector of zeros to seed operations and turn them into more efficient forms of other operations. As an example, rounding right shifts by half the input type size can be optimized:

#include <arm_neon.h>

uint32x4_t foo (uint64x2_t a, uint64x2_t b)
{
  return vrshrn_high_n_u64 (vrshrn_n_u64 (a, 32), b, 32);
}

Used to generate:

foo:
        rshrn   v0.2s, v0.2d, 32
        rshrn2  v0.4s, v1.2d, 32
        ret

But today generates:

foo:
        movi    v2.4s, 0
        raddhn  v0.2s, v0.2d, v2.2d
        raddhn2 v0.4s, v1.2d, v2.2d
        ret

This sequence has both lower latency and higher throughput than the previously generated one using shifts. These are just three examples of the many optimizations we have added in GCC 12, and there are many more to come.

SVE predicates

Whenever you have a conditional statement in C, the code can be vectorized with SVE using predication. This allows us to vectorize many more loops than we could using Neon. This is not limited to a single conditional: SVE also allows us to deal with nested conditionals. Before GCC 12, we handled such cases a bit suboptimally. As an example:

void f(float * restrict z0, float * restrict z1, float *restrict x,
    float * restrict y, float c, int n)
{
    for (int i = 0; i < n; i++) {
        float a = x[i];
        float b = y[i];
        if (a > b) {
            z0[i] = a + b;
            if (a > c) {
                z1[i] = a - b;
            }
        }
    }
}

Would generate:

.L3:
        ld1w    z1.s, p1/z, [x2, x5, lsl 2]
        ld1w    z2.s, p1/z, [x3, x5, lsl 2]
        fcmgt   p0.s, p3/z, z1.s, z0.s
        fcmgt   p2.s, p1/z, z1.s, z2.s
        fcmgt   p0.s, p0/z, z1.s, z2.s
        movprfx z3, z1
        fadd    z3.s, p2/m, z3.s, z2.s
        and     p0.b, p0/z, p1.b, p1.b
        fsub    z1.s, p0/m, z1.s, z2.s
        st1w    z3.s, p2, [x0, x5, lsl 2]
        st1w    z1.s, p0, [x1, x5, lsl 2]
        add     x5, x5, x6
        whilelo p1.s, w5, w4
        b.any   .L3

Here we perform too many predicate comparisons. We can combine the predicate for a > b with the check for a > c, by using it as the input predicate of that comparison, to directly create the predicate for a > b && a > c. As such, in GCC 12 we now generate:

.L3:
        ld1w    z1.s, p0/z, [x2, x5, lsl 2]
        ld1w    z2.s, p0/z, [x3, x5, lsl 2]
        fcmgt   p0.s, p0/z, z1.s, z2.s
        movprfx z3, z1
        fadd    z3.s, p0/m, z3.s, z2.s
        fcmgt   p1.s, p0/z, z1.s, z0.s
        fsub    z1.s, p1/m, z1.s, z2.s
        st1w    z3.s, p0, [x0, x5, lsl 2]
        st1w    z1.s, p1, [x1, x5, lsl 2]
        add     x5, x5, x6
        whilelo p0.s, w5, w4
        b.any   .L3

This is an example of the kind of predicate optimizations introduced in GCC 12.
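
The same predicate chaining can be expressed directly with the SVE ACLE intrinsics. The sketch below is illustrative only (the function name and loop structure are ours, and it assumes the compiler is targeting SVE, for example with -march=armv8.2-a+sve): the a > b predicate governs the a > c compare, so the nested predicate is produced in a single step.

#include <arm_sve.h>

void
f_sketch (float *restrict z0, float *restrict z1, float *restrict x,
          float *restrict y, float c, int n)
{
  for (int i = 0; i < n; i += svcntw ())
    {
      svbool_t pg = svwhilelt_b32 (i, n);           /* active lanes       */
      svfloat32_t a = svld1 (pg, x + i);
      svfloat32_t b = svld1 (pg, y + i);
      svbool_t p_ab  = svcmpgt (pg, a, b);          /* a > b              */
      svbool_t p_abc = svcmpgt (p_ab, a, c);        /* (a > b) && (a > c) */
      svst1 (p_ab,  z0 + i, svadd_x (p_ab, a, b));
      svst1 (p_abc, z1 + i, svsub_x (p_abc, a, b));
    }
}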

-mcpu=native unlocked

In older GCC versions, when using -mcpu=native or -march=native on a CPU whose ID is unknown to the compiler, we would fall back to Armv8-A without any extra extensions enabled. This happened even if we could tell which extensions the CPU supports through /proc/cpuinfo. Starting with GCC 12, we now enable every extension we can detect, using Armv8-A as the baseline. So using GCC 12+ on a CPU whose ID it does not recognize, but which reports known feature bits such as SVE or FP16, will still have those features enabled.
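
One illustrative way (not part of the original post) to check which extensions the compiler actually enabled, for example with -mcpu=native, is to test the ACLE feature macros:

#include <stdio.h>

int
main (void)
{
  /* These macros are defined by the compiler when the corresponding
     extension is enabled for this compilation.  */
#ifdef __ARM_FEATURE_SVE
  puts ("SVE enabled");
#else
  puts ("SVE not enabled");
#endif
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
  puts ("FP16 scalar arithmetic enabled");
#endif
  return 0;
}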

Auto-vectorization at -O2

In GCC 12, auto-vectorization has finally been enabled at -O2 instead of needing -O3 or higher. This allows for a better comparison against other compilers. At -O2, the compiler by default uses the very-cheap cost model when deciding whether to vectorize. Note that this is a different model from the one used with -O2 -ftree-vectorize.

The very-cheap cost model only allows vectorization if the compiler is certain that vectorization would result in a performance win. Additionally, the code-size increase from vectorizing must stay small.
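
As a small illustrative example (ours, not from the post), a loop like the following, whose trip count is a compile-time constant, is a good candidate for auto-vectorization at plain -O2 in GCC 12:

/* The trip count is a compile-time constant, so the very-cheap cost model
   can be confident that the vector code fully replaces the scalar loop.  */
void
add_one (int *restrict dst, const int *restrict src)
{
  for (int i = 0; i < 1024; i++)
    dst[i] = src[i] + 1;
}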

SVE vs Advanced SIMD

Often when talking about Arm’s two vector ISAs, the discussion is about picking one over the other. However, the best performance can often be had by combining the two rather than pitting them against each other. These decisions are tied to the specific micro-architecture, as they require accurate throughput and latency information for both ISAs.

Before GCC 12 we would pick one or the other. Starting with GCC 12 the choice is a bit more nuanced. As an example:

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 10;
}

Used to generate with SVE enabled:

.L3:
        ld1h    z0.s, p0/z, [x0, x3, lsl 1]
        mul     z0.s, p1/m, z0.s, z0.s
        asr     z0.s, z0.s, #10
        st1h    z0.s, p0, [x1, x3, lsl 1]
        add     x3, x3, x4
        whilelo p0.s, w3, w2
        b.any   .L3
.L1:

Here SVE is preferred over Neon. For loops where n is small, however, Neon may offer the best performance. Starting with GCC 12, we now generate a combination of Neon and SVE for this example. The compiler emits a runtime check on the trip count: if the trip count is large, we jump directly into an SVE loop; if it is low, we use a Neon main loop followed by an SVE epilogue.

Because the codegen is large, we only show the Neon and SVE combination:

.L4:
        ldr     q0, [x0, x3]
        umull   v1.4s, v0.4h, v0.4h
        umull2  v0.4s, v0.8h, v0.8h
        shrn    v1.4h, v1.4s, 10
        shrn2   v1.8h, v0.4s, 10
        str     q1, [x1, x3]
        add     x3, x3, 16
        cmp     x4, x3
        bne     .L4
        and     w3, w2, -8
        tst     x2, 7
        beq     .L1
.L3:
        sub     w2, w2, w3
        ptrue   p0.b, all
        whilelo p1.s, wzr, w2
        ld1h    z0.s, p1/z, [x0, x3, lsl 1]
        mul     z0.s, p0/m, z0.s, z0.s
        asr     z0.s, z0.s, #10
        st1h    z0.s, p1, [x1, x3, lsl 1]
        add     x0, x0, x3, lsl 1
        add     x3, x1, x3, lsl 1
        cntw    x1
        whilelo p1.s, w1, w2
        b.any   .L9

Loop unrolling

Loop unrolling is a common compiler optimization, typically done on scalar code. In GCC, we had not done much vector code unrolling before. However, as the number of vector pipelines increases, loop unrolling is needed to feed the pipelines with enough work to get optimal performance.

Indiscriminately unrolling vectorized loops while ignoring ISA and micro-architecture details will yield undesirable results. With GCC 12, the AArch64 backend leverages the tuning information about the CPU's width and throughput, selected by -mcpu=native or -mcpu=<cpu>, to decide whether to unroll and by how much.

This unrolling is also combined with the ability to use Neon and SVE together to handle loops. As such, several different combinations can result. A few examples:

  • Unrolled Neon (no SVE)
  • Unrolled Neon + SVE
  • Unrolled SVE (no Neon)

When compiling the previous example for some micro-architectures, this can result in an unrolled Neon loop with a single SVE fallback:

foo:
        cmp     w2, 0
        ble     .L1
        sub     w3, w2, #1
        mov     x6, 0
        cmp     w3, 6
        bls     .L3
        lsr     w4, w2, 3
        lsl     x5, x4, 4
        tbz     x4, 0, .L4
        ldr     q0, [x0]
        mov     x6, 16
        umull   v1.4s, v0.4h, v0.4h
        umull2  v2.4s, v0.8h, v0.8h
        shrn    v3.4h, v1.4s, 10
        shrn2   v3.8h, v2.4s, 10
        str     q3, [x1]
        cmp     x5, x6
        beq     .L13
        .p2align 5,,15
.L4:
        add     x7, x6, 16
        ldr     q4, [x0, x6]
        ldr     q5, [x0, x7]
        umull   v6.4s, v4.4h, v4.4h
        umull2  v7.4s, v4.8h, v4.8h
        umull   v16.4s, v5.4h, v5.4h
        umull2  v17.4s, v5.8h, v5.8h
        shrn    v18.4h, v6.4s, 10
        shrn    v1.4h, v16.4s, 10
        shrn2   v18.8h, v7.4s, 10
        shrn2   v1.8h, v17.4s, 10
        str     q18, [x1, x6]
        add     x6, x6, 32
        str     q1, [x1, x7]
        cmp     x5, x6
        bne     .L4
.L13:
        and     w6, w2, -8
        tst     x2, 7
        beq     .L1
.L3:
        sub     w2, w2, w6
        ptrue   p0.b, all
        whilelo p1.s, wzr, w2
        ld1h    z19.s, p1/z, [x0, x6, lsl 1]
        mul     z19.s, p0/m, z19.s, z19.s
        asr     z20.s, z19.s, #10
        st1h    z20.s, p1, [x1, x6, lsl 1]
        cntw    x8
        add     x0, x0, x6, lsl 1
        whilelo p2.s, w8, w2
        add     x1, x1, x6, lsl 1
        b.none  .L1
        ld1h    z21.s, p2/z, [x0, #1, mul vl]
        mul     z21.s, p0/m, z21.s, z21.s
        asr     z22.s, z21.s, #10
        st1h    z22.s, p2, [x1, #1, mul vl]
.L1:
        ret

If we change the sequence into one where SVE is always beneficial, we get only SVE code, with an unrolled SVE main loop:

typedef short int16_t;
typedef unsigned short uint16_t;

void foo (uint16_t * restrict a, int16_t * restrict d, int n)
{
    for( int i = 0; i < n; i++ )
      d[i] = (a[i] * a[i]) >> 16;
}

generates:

.L4:
        ld1h    z0.h, p0/z, [x9]
        ld1h    z1.h, p0/z, [x9, #1, mul vl]
        umulh   z0.h, p0/m, z0.h, z0.h
        umulh   z1.h, p0/m, z1.h, z1.h
        add     w5, w5, w7
        st1h    z1.h, p0, [x10, #1, mul vl]
        st1h    z0.h, p0, [x10]
        add     x9, x9, x6
        add     x10, x10, x6
        cmp     w8, w5
        bcs     .L4
        cmp     w2, w5
        beq     .L1
.L3:
        ubfiz   x11, x5, 1, 32
        sub     w2, w2, w5
        ptrue   p1.b, all
        whilelo p2.h, wzr, w2
        add     x0, x0, x11
        add     x1, x1, x11
        ld1h    z2.h, p2/z, [x0]
        umulh   z2.h, p1/m, z2.h, z2.h
        st1h    z2.h, p2, [x1]
        cntb    x12
        cnth    x13
        whilelo p3.h, w13, w2
        add     x14, x0, x12
        add     x15, x1, x12
        b.none  .L1
        ld1h    z3.h, p3/z, [x14]
        umulh   z3.h, p1/m, z3.h, z3.h
        st1h    z3.h, p3, [x15]
.L1:
        ret

This unrolling is more than simply repeating instructions. The goal of the unrolling is to increase pipeline usage inside the loop. To accomplish this, we try to maintain as much parallelism as possible. One way we do this is to split the accumulator when the loop performs an accumulation.
For example:

double f(double *x, double *y, long n) {
  double res = 0;
  for (long i = 0; i < n; ++i)
    res += x[i] * y[i];
  return res;
}

generates when unrolled:

.L4:
        ld1d    z4.d, p0/z, [x6, x3, lsl 3]
        ld1d    z5.d, p0/z, [x5, x3, lsl 3]
        ld1d    z2.d, p0/z, [x0, x3, lsl 3]
        ld1d    z3.d, p0/z, [x1, x3, lsl 3]
        add     x3, x3, x4
        fmla    z1.d, p0/m, z4.d, z5.d
        fmla    z0.d, p0/m, z2.d, z3.d
        cmp     x7, x3
        bcs     .L4
        fadd    z0.d, z0.d, z1.d
        cmp     x2, x3
        beq     .L6

which keeps the fmla chains separate and only does the final accumulation after the loop.
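
The idea is the same as manually splitting the accumulator in scalar code. The sketch below is illustrative only (ours, not from the post); note that reassociating floating-point additions like this changes rounding, which is why the vectorizer only does it under relaxed floating-point semantics such as -ffast-math:

/* Two independent accumulator chains inside the loop, combined only at
   the end, mirroring the two fmla chains in the vectorized code above.  */
double
dot_split_accumulators (const double *x, const double *y, long n)
{
  double acc0 = 0.0, acc1 = 0.0;
  long i;
  for (i = 0; i + 1 < n; i += 2)
    {
      acc0 += x[i] * y[i];
      acc1 += x[i + 1] * y[i + 1];
    }
  if (i < n)                    /* odd trailing element */
    acc0 += x[i] * y[i];
  return acc0 + acc1;           /* final reduction after the loop */
}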

FMIN/FMAX reduction support

With GCC 12 we can now vectorize fmin/fmax without the need to use -ffast-math.
For example:

double f (double *x, int n)
{
 double res = 100.0;
 for (int i = 0; i < n; ++i)
 res = __builtin_fmin (res, x[i]);
 return res;
}

Before GCC 12 we would fail vectorization at -O3 and produce scalar code. With GCC 12 we now generate:

.L3:
        ld1d    z1.d, p0/z, [x0, x2, lsl 3]
        add     x2, x2, x4
        fminnm  z0.d, p0/m, z0.d, z1.d
        whilelo p0.d, w2, w1
        b.any   .L3
        ptrue   p0.b, all
        fminnmv d0, p0, z0.d
        ret

which is vectorized at -O3 without requiring -ffast-math.

Better support for gathers and scatters with SVE

In GCC 12, support for gathers and scatters was added to the SLP vectorizer, which allows for greater flexibility when these operations are required.
As an example:

void f (int *restrict y, int *restrict x, int *restrict indices)
{
  for (int i = 0; i < 16; ++i)
    {
      y[i * 2] = x[indices[i * 2]] + 1;
      y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
    }
}

This fails to vectorize before GCC 12. Starting with GCC 12, we can handle such cases and use the existing costing infrastructure for SLP loops to determine when using gathers and scatters is beneficial. For the example we now generate:

.L2:
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        ld1w    z0.s, p0/z, [x1, z0.s, sxtw 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x3, lsl 2]
        add     x3, x3, x5
        whilelo p0.s, x3, x4
        b.any   .L2
        ret

Atomic 64-byte loads and stores

As part of Armv8.7-A, we have added support for the atomic 64-byte load and store instructions to GCC.
These can be used with the +ls64 extension. The extension comes with a new ACLE type, data512_t, which the resulting data can be stored into.
The example:

#include <arm_acle.h>

void
func(const void * addr, data512_t *data) {
  *data = __arm_ld64b (addr);
}

generates:

func:
        ld64b   x8, [x0]
        stp     x8, x9, [x1]
        stp     x10, x11, [x1, 16]
        stp     x12, x13, [x1, 32]
        stp     x14, x15, [x1, 48]
        ret

New CPU support

GCC 12 also adds CPU support for the following Arm CPUs:

  • Cortex-A510
  • Cortex-R52+
  • Cortex-A710
  • Cortex-X2

These can be used with the -march, -mcpu, and -mtune compiler options to target these CPUs.

We are only just getting started

With a lot of the foundational pieces in place, we can now push for more complex optimizations in GCC. The combination of SVE and Neon promises to deliver much greater performance by giving us flexibility without losing performance at low trip counts.

In the meantime, check out last year's entry for GCC 11:

Performance improvements in GCC 11
