In the GCC 14 release, as with every release, Arm has worked closely with the community to deliver a much-improved compiler. This release, however, has been somewhat different: we took a step back and re-evaluated where we are and where we want to go, not just in this release but in future ones, with the compiler’s release cycle in mind. As a result, GCC 14, now fresh out, has been a larger release than normal. Read on for what Arm and the community have been up to, and for a glimpse of the direction of travel.
As with any project, GCC has a lot of technical debt. However, due to its size, its age, and the large number of architectures it supports, this technical debt ranges from minor annoyances to major blockers for development. With every release, the goal is to improve the compiler without making it harder to maintain, and every so often we must take a step back and focus on maintenance before adding new features.
The end of this blog also discusses two areas of technical debt that we addressed this year to make contributing to GCC itself easier.
An often-occurring pattern is that an operation is performed on a 64-bit vector, which is then extended to a 128-bit vector by padding the top bits with zeros.
The example:
#include <arm_neon.h>

int8x16_t foo (int8x8_t a, int8x8_t b)
{
    int8x8_t x = vadd_s8 (a, b);
    int8x8_t zeros = vcreate_s8 (0);
    return vcombine_s8 (x, zeros);
}
In GCC 13, this generates:
foo:
        add     v0.8b, v0.8b, v1.8b
        fmov    d0, d0
        ret
However, most AArch64 instructions that write a 64-bit vector result implicitly zero the top bits of the destination register. In GCC 14, we have modelled this correctly and now generate:
foo:
        add     v0.8b, v0.8b, v1.8b
        ret
GCC has two auto-vectorization strategies. One is the classical statement-based loop vectorizer, which vectorizes statements in place; the other is a loop-aware SLP (superword-level parallelism) vectorizer, which has the effect of “rerolling” statements during vectorization.
The statement-based vectorizer dates from when vectorization was first added to GCC, and due to its age it still handles cases that the loop-aware SLP vectorizer does not. The problem with having two vectorization strategies is that every new vectorization feature must effectively be added twice, which makes it quite time consuming to extend the GCC vectorizer with radical new features. This technical debt is a weight on the project. In GCC 14, we have started the groundwork to remove the statement-based vectorizer and instead rely entirely on the SLP-based vectorizer.
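As a hypothetical illustration (not an example from the release itself), the loop-aware SLP vectorizer can handle a loop like the following by grouping the two interleaved stores into a single wide vector store per iteration:

/* The two stores to out[2*i] and out[2*i + 1] form an SLP group
   that can be turned into one vector store per iteration.  */
void interleave (int *restrict out, const int *restrict a,
                 const int *restrict b, int n)
{
    for (int i = 0; i < n; i++)
      {
        out[2 * i]     = a[i] + b[i];
        out[2 * i + 1] = a[i] - b[i];
      }
}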
Addressing this technical debt will allow us to move much more quickly when implementing new and exciting features inside the vectorizer, and will let us add some of the advanced vectorization technologies offered by SVE. One of these features is support for control-flow vectorization, and in GCC 14 we have made a start by refactoring the code. The following sections describe the refactoring done in GCC 14 to support early-break vectorization.
The first piece of technical debt we had to address is how loop prologue and epilogue peeling are performed. During vectorization the loop typically has to be peeled at least once, and often multiple times. Peeling is the act of copying the loop and placing the new copy either before or after the original loop (prologue or epilogue peeling, respectively).
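Conceptually, peeling looks like the following sketch (simplified pseudo-C under the assumption of 4-element vectors, not the vectorizer’s actual output):

/* Original scalar loop.  */
void scale (float *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0f;
}

/* After vectorization with an epilogue peel: the main loop handles
   whole vectors and the peeled copy handles the n % 4 remainder.  */
void scale_peeled (float *x, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)      /* vector loop (one op per 4 lanes) */
        for (int lane = 0; lane < 4; lane++)
            x[i + lane] *= 2.0f;
    for (; i < n; i++)              /* peeled scalar epilogue */
        x[i] *= 2.0f;
}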
When doing the peeling we must maintain the control flow from the peeled loop to the original loop. Loops in GCC are in Static Single Assignment (SSA) form, as is the rest of GIMPLE after lowering. However, the vectorizer has a stronger requirement here: loops must be in Loop-Closed SSA (LCSSA) form. Described simply, LCSSA means that any value defined inside a loop and used outside of it must flow through a dedicated PHI node at the loop’s exit, giving an explicit connection between the first loop’s outputs and the second loop’s inputs.
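As a rough illustration, consider the following loop; in LCSSA form the value of sum leaving the loop is routed through a dedicated exit PHI (the PHI shown in the comment is pseudo-GIMPLE, not real compiler output):

/* `sum` is defined inside the loop and used outside it.  In LCSSA
   the exit use goes through a PHI node, roughly:
     sum_lcssa = PHI <sum_inner>;
     return sum_lcssa;  */
int sum_array (const int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;   /* uses the loop-closed value of `sum` */
}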
Historically, GCC’s peeling code did not maintain this invariant and instead relied on later code to patch up the values as it modified the connection between the two loops. This worked well enough when there was only one possible way to exit the loop, but with multiple exits the patch-up code would have to be spread across multiple places. Additionally, it meant that combining early-exit/control-flow vectorization with other GCC features, such as alignment peeling, became hard to support.
To make things simpler, in GCC 14 we now always maintain LCSSA form, even after peeling. During peeling we have enough context to know how the values will be used, so leaving it to later passes to fix them up is unnecessary. An additional benefit is that we can rely on the use-def chains to determine where and when the second loop uses values from the first loop.
This refactoring, at the expense of making peeling slightly more complicated, allowed us to remove many hundreds of lines of code and gave us a more flexible vectorizer in return.
Conventionally, the vectorizer assumed that what it considers the loop’s exit must be the same as the scalar loop’s exit. With one exit this was trivially true. With multiple exits, we wanted to decouple this notion, freeing the vectorizer to pick its own main exit independently.
The first step was removing all the hardcoded assumptions that there is a single exit. These were replaced by code that identifies which exit is most beneficial for vectorization and stores that information in the vectorizer’s bookkeeping, which is then kept up to date as vectorization progresses.
GCC previously rejected loops for vectorization based on the number of basic blocks inside the loop. For inner loops, we only allowed vectorization of loops with two basic blocks, and for outer-loop vectorization we only allowed loops with up to four basic blocks.
These restrictions served as an approximation for whether the loop contains control flow, since a loop without control flow in GCC typically consists of just the loop body and a latch. However, this misses cases where the loop has a simple fall-through block in between that, for whatever reason, was not removed before the vectorizer runs. More importantly, the number of basic blocks cannot distinguish between loops with control flow that we can vectorize and those we cannot.
In GCC 14, we have removed the restrictions on the number of basic blocks and instead inspect the control flow of the loop. In GCC 14 we only support control flow that leaves the loop: control may not transfer from the outer loop into the inner loop, but it can go from the inner loop to the outer loop, or leave the loop or the function entirely.
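For example (an illustrative sketch, not taken from the release notes), the following nested loop has control flow of the supported shape, since both transfers leave a loop and nothing jumps into the inner loop:

int foo (int (*x)[100], int n, int m)
{
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        {
          if (x[i][j] < 0)
            return -1;      /* leaves the function entirely */
          if (x[i][j] > 5)
            break;          /* leaves the inner loop for the outer one */
        }
    return 0;
}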
With these steps in place, the implementation of early-break support was added to GCC 14. By vectorization of early breaks we mean vectorization of loops containing break, goto, exit, return, or other calls or constructs that end the loop’s execution early. In addition, the condition on which the exit is taken must access memory. That is:
void foo (int *x, int n)
{
    for (int i = 0; i < n; i++)
      {
        if (x[i] > 5)
          break;
      }
}
This is the shape of the loops we are looking to handle. Loops such as:
void foo (int *x, int n)
{
    for (int i = 0; i < n; i++)
      {
        if (i > 5)
          break;
      }
}
are already handled by the compiler, because such an exit simply adds an upper bound on the loop’s iteration count.
As this is the first implementation, we have adopted a few limitations and design constraints to fit the work into a single GCC development cycle. The most important of these are:
To vectorize a loop, we need to be able to tell how often the loop iterates and how the loop’s induction variables evolve from iteration to iteration. In practice, this means that for us to vectorize a loop today, we require the loop’s exit to be a counted one (SVE can also handle some uncounted loops, but that is outside the scope of this release).
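To illustrate the distinction (a hypothetical sketch):

/* Counted exit: the trip count (n) is computable before the loop
   runs, so the vectorizer can use this exit as the main exit.  */
int sum_counted (const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Uncounted exit: the trip count depends on the data itself.  */
int sum_until_zero (const int *x)
{
    int s = 0;
    while (*x != 0)
        s += *x++;
    return s;
}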
However, the vectorizer doesn’t need all exits to be counted; it just needs one of them to be. In the case of multiple exits, this gives us the freedom during analysis to pick a different exit to use as the main one.
As a side effect, this means that the vectorizer can now vectorize loops that it previously could not, because of the presence of things such as asserts inside the loop.
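For instance, a loop like the following (a hypothetical example) can now be vectorized even though the assert introduces an extra, uncounted exit:

#include <assert.h>

int sum_checked (const int *x, int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        assert (x[i] >= 0);   /* early exit via abort on failure */
        res += x[i];
      }
    return res;
}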
When the vectorizer picks an exit other than the one connected to the loop’s latch, it treats all exits in the loop as early exits. In those cases, the final iteration is always performed in the scalar loop so that its side effects are applied correctly.
Any side effects before an early exit are not safe to perform in the vector loop until we know that none of the early exits will be taken. That is to say, in a loop such as:
void foo (int *x, int *y, int n)
{
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;
        if (x[i] > 5)
          break;
      }
}
The store to y is not safe to perform until after the break statement. Before we vectorize loops that have such side effects before the breaks, we perform a couple of checks: the side effects must be movable past all the early exits, and moving them must not violate any data dependence (for example, a moved store must not alias a load that feeds an exit condition).
If either of these conditions does not hold, we abort vectorization. If they do hold, we move all the side effects to a safe block, preserving their order. This also means that not only can the vector loop’s CFG differ from that of the scalar loop, but the order of side effects can differ too.
This is safe due to the “as-if” rule: we do not guarantee that the order of side effects within a vector iteration matches the scalar loop, but we do guarantee that it matches by the end of each vector iteration.
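Conceptually, the vectorized form of the earlier loop behaves like the following pseudo-C sketch (assuming 4 lanes; this illustrates the transformation and is not actual vectorizer output):

void foo_vec (int *x, int *y, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
      {
        /* Check the exit condition for all lanes first.  */
        int any_break = 0;
        for (int lane = 0; lane < 4; lane++)
            any_break |= (x[i + lane] > 5);
        if (any_break)
            break;                       /* redo this iteration below */
        /* No lane exits: the moved side effects are now safe.  */
        for (int lane = 0; lane < 4; lane++)
            y[i + lane] = x[i + lane] * 2;
      }
    for (; i < n; i++)                   /* scalar loop (final iteration) */
      {
        y[i] = x[i] * 2;
        if (x[i] > 5)
          break;
      }
}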
The following examples show how the vectorizer handles loops with two exits:
int z[100], y[100], x[100];

int foo (int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;
        res += x[i] + y[i];
        if (x[i] > 5)
          break;
        if (z[i] > 5)
          break;
      }
    return res;
}
For SVE, this generates:
.L19:
        ld1w    z26.s, p7/z, [x3, x1, lsl 2]
        cmpgt   p14.s, p7/z, z26.s, #5
        ptest   p15, p14.b
        b.any   .L4
        st1w    z27.s, p7, [x2, x1, lsl 2]
        incw    x1
        whilelo p7.s, w1, w0
        b.none  .L18
.L6:
        ld1w    z31.s, p7/z, [x4, x1, lsl 2]
        cmpgt   p14.s, p7/z, z31.s, #5
        lsl     z27.s, z31.s, #1
        add     z31.s, z31.s, z27.s
        mov     z28.d, z30.d
        ptest   p15, p14.b
        add     z30.s, p7/m, z30.s, z31.s
        mov     z31.d, z29.d
        incw    z29.s
        b.none  .L19
And Advanced SIMD:
.L18:
        ldr     q31, [x1, x4]
        cmgt    v31.4s, v31.4s, v26.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x0, d31
        cbnz    x0, .L4
        str     q27, [x1, x5]
        add     x1, x1, 16
        cmp     x1, x3
        beq     .L17
.L8:
        ldr     q31, [x2, x1]
        mov     v23.16b, v29.16b
        mov     v25.16b, v28.16b
        cmgt    v30.4s, v31.4s, v26.4s
        shl     v27.4s, v31.4s, 1
        add     v28.4s, v28.4s, v24.4s
        umaxp   v30.4s, v30.4s, v30.4s
        add     v31.4s, v31.4s, v27.4s
        fmov    x0, d30
        add     v29.4s, v29.4s, v31.4s
        cbz     x0, .L18
And AArch32 Advanced SIMD:
.L19:
        vld1.64   {d16-d17}, [ip:64]!
        vcgt.s32  q8, q8, q13
        vpmax.u32 d7, d16, d17
        vpmax.u32 d7, d7, d7
        vmov      r3, s14  @ int
        cbnz      r3, .L4
        cmp       r2, r0
        vst1.64   {d24-d25}, [lr:64]!
        beq       .L18
.L8:
        vld1.64   {d16-d17}, [r1:64]!
        adds      r2, r2, #1
        vmov      q2, q10  @ v4si
        vmov      q14, q9  @ v4si
        vcgt.s32  q11, q8, q13
        vadd.i32  q12, q8, q8
        vadd.i32  q10, q10, q15
        vpmax.u32 d7, d22, d23
        vadd.i32  q8, q8, q12
        vpmax.u32 d7, d7, d7
        vadd.i32  q9, q9, q8
        vmov      r3, s14  @ int
        cmp       r3, #0
        beq       .L19
The generated sequences can be optimized further which is on the roadmap for GCC 15.
The AArch64 architecture has helpful instructions that can load and store two consecutive registers from/to memory (LDP and STP). These instructions are not exposed as intrinsics, as the expectation is that the compiler should be able to form them on its own.
In GCC, this has historically been done by the sched-fusion pass, which relies on the scheduler to move candidate instructions next to each other to reduce the search space. While initially sufficient, this approach has proven not to be very flexible.
There are two main issues. First, any limitations the scheduler has are automatically inherited by the pass. For instance, the scheduler does not do as extensive an alias analysis as we would like, which prevents merging loads and stores that are further apart. The scheduler is also currently limited to scheduling instructions within a single basic block, whereas load/store-pair forming is an example of an optimization that could be done on extended basic blocks instead. The current implementation of the new pass is limited to a single basic block as well, with plans to extend this in the future.
The second issue is that in certain cases we ended up with worse code because the scheduler had placed the index-updating instruction between the two memory accesses. This forced the register allocator to create a copy of the old index value before incrementing it, resulting in higher register pressure and an additional instruction, instead of removing two.
To fix this, in GCC 14 we have introduced a new AArch64 pass that forms these pairs more aggressively, using a framework that maintains SSA (static single assignment) information on top of GCC’s RTL representation.
Having SSA information allows us to be more aggressive without incurring a large compile-time overhead. The new pass is enabled by default and catches many more cases than we previously did. The example:
unsigned __int128 *f(unsigned __int128 *p)
{
    unsigned __int128 ab;
    while ((ab = *p++))
      ;
    return p;
}
used to generate:
f(unsigned __int128*):
.L2:
        ldr     x1, [x0], 16
        ldr     x2, [x0, -8]
        orr     x1, x1, x2
        cbnz    x1, .L2
        ret
and now in GCC 14 generates:
f(unsigned __int128*):
.L2:
        ldp     x1, x2, [x0], 16
        orr     x1, x1, x2
        cbnz    x1, .L2
        ret
Part of the power of the new pass is that it is lightweight enough for us to run it twice: once before register allocation and once after. Running it before register allocation prevents issues such as scheduling from creating suboptimal allocations.
Running it after register allocation catches any stack-based loads and stores, which only gain a fixed stack location once register allocation is done.
In Part 2, we cover further topics from this release.
Read GCC-14 Part 2