In the GCC 14 release, as with every release, Arm has worked closely with the community to deliver a much-improved compiler. This release, however, has been somewhat different: we took a step back and re-evaluated where we are and where we want to go, not just in this release but in future ones, with the compiler’s release cycle in mind. As a result, GCC 14, now fresh out, has been a larger release than normal. Read on for what Arm and the community have been up to, and for a glimpse of the direction of travel.
As with any project, GCC has a lot of technical debt. However, due to its size, its age, and the large number of architectures it supports, this technical debt ranges from minor annoyances to major blockers for development. With every release, the goal is to improve the compiler without making it harder to maintain, and every so often we must take a step back and focus on maintenance before adding new features.
The end of this blog also discusses two areas of technical debt that we addressed this year to make contributing to GCC itself easier.
An often-occurring pattern is that an operation is performed on a 64-bit vector, which is then extended to a 128-bit vector by padding the top bits with zeros.
The example:
#include <arm_neon.h>

int8x16_t foo (int8x8_t a, int8x8_t b)
{
    int8x8_t x = vadd_s8 (a, b);
    int8x8_t zeros = vcreate_s8 (0);
    return vcombine_s8 (x, zeros);
}
In GCC 13, this generates:
foo:
        add     v0.8b, v0.8b, v1.8b
        fmov    d0, d0
        ret
However, most AArch64 instructions that write a 64-bit vector result implicitly zero the top bits of the destination register. In GCC 14, we have modelled this correctly and now generate:
foo:
        add     v0.8b, v0.8b, v1.8b
        ret
GCC has two auto-vectorization strategies. One is the classical statement-based loop vectorizer, which vectorizes statements in place; the other is a loop-aware SLP (superword-level parallelism) vectorizer, which has the effect of “rerolling” statements during vectorization.
The statement-based vectorizer dates from when vectorization was first added to GCC, and due to its age it still handles cases that the loop-aware SLP vectorizer does not. The problem with having two vectorization strategies is that every new vectorization feature must effectively be added twice, which makes it quite time consuming to extend the GCC vectorizer with radical new features. This technical debt is a weight on the project. In GCC 14, we have started the groundwork to remove the statement-based vectorizer and instead rely entirely on the SLP-based vectorizer.
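As a hypothetical illustration (not an example from the release itself), the loop-aware SLP vectorizer can handle a loop like the following by grouping the two interleaved stores into a single wide vector store per iteration:

/* The two stores to out[2*i] and out[2*i + 1] form an SLP group
   that can be turned into one vector store per iteration.  */
void interleave (int *restrict out, const int *restrict a,
                 const int *restrict b, int n)
{
    for (int i = 0; i < n; i++)
      {
        out[2 * i]     = a[i] + b[i];
        out[2 * i + 1] = a[i] - b[i];
      }
}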
Addressing this technical debt will allow us to move much more quickly when implementing new and exciting features inside the vectorizer, and will let us add some of the advanced vectorization technologies offered by SVE. One of these features is support for control-flow vectorization, and in GCC 14 we have made a start by refactoring the code. The following sections describe the refactoring done in GCC 14 to support early-break vectorization.
The first piece of technical debt we had to address is how loop prologue and epilogue peeling are performed. During vectorization the loop typically has to be peeled at least once, and often multiple times. Peeling is the act of copying the loop and placing the new copy either before or after the original loop (prologue or epilogue peeling, respectively).
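Conceptually, peeling looks like the following sketch (simplified pseudo-C under the assumption of 4-element vectors, not the vectorizer’s actual output):

/* Original scalar loop.  */
void scale (float *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0f;
}

/* After vectorization with an epilogue peel: the main loop handles
   whole vectors and the peeled copy handles the n % 4 remainder.  */
void scale_peeled (float *x, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)      /* vector loop (one op per 4 lanes) */
        for (int lane = 0; lane < 4; lane++)
            x[i + lane] *= 2.0f;
    for (; i < n; i++)              /* peeled scalar epilogue */
        x[i] *= 2.0f;
}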
When doing the peeling we must maintain the control flow from the peeled loop to the original loop. Loops in GCC are in Static Single Assignment (SSA) form, as is the rest of GIMPLE after lowering. However, the vectorizer has a stronger requirement here: loops must be in Loop-Closed SSA (LCSSA) form. Described simply, LCSSA means that any value defined inside a loop and used outside of it must flow through a dedicated PHI node at the loop’s exit, giving an explicit connection between the first loop’s outputs and the second loop’s inputs.
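As a rough illustration, consider the following loop; in LCSSA form the value of sum leaving the loop is routed through a dedicated exit PHI (the PHI shown in the comment is pseudo-GIMPLE, not real compiler output):

/* `sum` is defined inside the loop and used outside it.  In LCSSA
   the exit use goes through a PHI node, roughly:
     sum_lcssa = PHI <sum_inner>;
     return sum_lcssa;  */
int sum_array (const int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;   /* uses the loop-closed value of `sum` */
}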
Historically, GCC’s peeling code did not maintain this invariant and instead relied on later code to patch up the values as it modified the connection between the two loops. This worked well enough when there was only one possible way to exit the loop, but with multiple exits the patch-up code would have to be spread across multiple places. Additionally, it meant that combining early-exit/control-flow vectorization with other GCC features, such as alignment peeling, became hard to support.
To make things simpler, in GCC 14 we now always maintain LCSSA form, even after peeling. During peeling we have enough context to know how the values will be used, so leaving it to later passes to fix them up is unnecessary. An additional benefit is that we can rely on the use-def chains to determine where and when the second loop uses values from the first loop.
This refactoring, at the expense of making peeling slightly more complicated, allowed us to remove many hundreds of lines of code and gave us a more flexible vectorizer in return.
Conventionally, the vectorizer assumed that what it considers the loop’s exit must be the same as the scalar loop’s exit. With one exit this was trivially true. With multiple exits, we wanted to decouple this notion, freeing the vectorizer to pick its own main exit independently.
The first step was removing all the hardcoded assumptions that there is a single exit. These were replaced by code that identifies which exit is most beneficial for vectorization and stores that information in the vectorizer’s bookkeeping, which is then kept up to date as vectorization progresses.
GCC previously rejected loops for vectorization based on the number of basic blocks inside the loop. For inner loops, we only allowed vectorization of loops with two basic blocks, and for outer-loop vectorization we only allowed loops with up to four basic blocks.
These restrictions served as an approximation for whether the loop contains control flow, since a loop without control flow in GCC typically consists of just the loop body and a latch. However, this misses cases where the loop has a simple fall-through block in between that, for whatever reason, was not removed before the vectorizer runs. More importantly, the number of basic blocks cannot distinguish between loops with control flow that we can vectorize and those we cannot.
In GCC 14, we have removed the restrictions on the number of basic blocks and instead inspect the control flow of the loop. In GCC 14 we only support control flow that leaves the loop: control may not transfer from the outer loop into the inner loop, but it can go from the inner loop to the outer loop, or leave the loop or the function entirely.
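For example (an illustrative sketch, not taken from the release notes), the following nested loop has control flow of the supported shape, since both transfers leave a loop and nothing jumps into the inner loop:

int foo (int (*x)[100], int n, int m)
{
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        {
          if (x[i][j] < 0)
            return -1;      /* leaves the function entirely */
          if (x[i][j] > 5)
            break;          /* leaves the inner loop for the outer one */
        }
    return 0;
}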
With these steps in place, the implementation of early-break support was added to GCC 14. By vectorization of early breaks we mean vectorization of loops containing break, goto, exit, return, or other calls or constructs that end the loop’s execution early. In addition, the condition on which the exit is taken must access memory. That is:
void foo (int *x, int n)
{
    for (int i = 0; i < n; i++)
      {
        if (x[i] > 5)
          break;
      }
}
This is the shape of the loops we are looking to handle. Loops such as:
void foo (int *x, int n)
{
    for (int i = 0; i < n; i++)
      {
        if (i > 5)
          break;
      }
}
are already handled by the compiler, because such an exit simply adds an upper bound on the loop’s iteration count.
As this is the first implementation, we have adopted a few limitations and design constraints to fit the work into a single GCC development cycle. The most important of these are:
To vectorize a loop, we need to be able to tell how often the loop iterates and how the loop’s induction variables evolve from iteration to iteration. In practice, this means that for us to vectorize a loop today, we require the loop’s exit to be a counted one (SVE can also handle some uncounted loops, but that is outside the scope of this release).
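To illustrate the distinction (a hypothetical sketch):

/* Counted exit: the trip count (n) is computable before the loop
   runs, so the vectorizer can use this exit as the main exit.  */
int sum_counted (const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Uncounted exit: the trip count depends on the data itself.  */
int sum_until_zero (const int *x)
{
    int s = 0;
    while (*x != 0)
        s += *x++;
    return s;
}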
However, the vectorizer doesn’t need all exits to be counted; it just needs one of them to be. In the case of multiple exits, this gives us the freedom during analysis to pick a different exit to use as the main one.
As a side effect, this means that the vectorizer can now vectorize loops that it previously could not, because of the presence of things such as asserts inside the loop.
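For instance, a loop like the following (a hypothetical example) can now be vectorized even though the assert introduces an extra, uncounted exit:

#include <assert.h>

int sum_checked (const int *x, int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        assert (x[i] >= 0);   /* early exit via abort on failure */
        res += x[i];
      }
    return res;
}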
When the vectorizer picks an exit other than the one connected to the loop’s latch, it treats all exits in the loop as early exits. In those cases, the final iteration is always performed in the scalar loop so that its side effects are applied correctly.
Any side effects before an early exit are not safe to perform in the vector loop until we know that none of the early exits will be taken. That is to say, in a loop such as:
void foo (int *x, int *y, int n)
{
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;
        if (x[i] > 5)
          break;
      }
}
The store to y is not safe to perform until after the break statement. Before we vectorize loops that have such side effects before the breaks, we perform a couple of checks: the side effects must be movable past all the early exits, and moving them must not violate any data dependence (for example, a moved store must not alias a load that feeds an exit condition).
If either of these conditions does not hold, we abort vectorization. If they do hold, we move all the side effects to a safe block, preserving their order. This also means that not only can the vector loop’s CFG differ from that of the scalar loop, but the order of side effects can differ too.
This is safe due to the “as-if” rule: we do not guarantee that the order of side effects within a vector iteration matches the scalar loop, but we do guarantee that it matches by the end of each vector iteration.
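Conceptually, the vectorized form of the earlier loop behaves like the following pseudo-C sketch (assuming 4 lanes; this illustrates the transformation and is not actual vectorizer output):

void foo_vec (int *x, int *y, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
      {
        /* Check the exit condition for all lanes first.  */
        int any_break = 0;
        for (int lane = 0; lane < 4; lane++)
            any_break |= (x[i + lane] > 5);
        if (any_break)
            break;                       /* redo this iteration below */
        /* No lane exits: the moved side effects are now safe.  */
        for (int lane = 0; lane < 4; lane++)
            y[i + lane] = x[i + lane] * 2;
      }
    for (; i < n; i++)                   /* scalar loop (final iteration) */
      {
        y[i] = x[i] * 2;
        if (x[i] > 5)
          break;
      }
}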
The following examples show how the vectorizer handles loops with two exits:
int z[100], y[100], x[100];

int foo (int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;
        res += x[i] + y[i];
        if (x[i] > 5)
          break;
        if (z[i] > 5)
          break;
      }
    return res;
}
For SVE, this generates:
.L19:
        ld1w    z26.s, p7/z, [x3, x1, lsl 2]
        cmpgt   p14.s, p7/z, z26.s, #5
        ptest   p15, p14.b
        b.any   .L4
        st1w    z27.s, p7, [x2, x1, lsl 2]
        incw    x1
        whilelo p7.s, w1, w0
        b.none  .L18
.L6:
        ld1w    z31.s, p7/z, [x4, x1, lsl 2]
        cmpgt   p14.s, p7/z, z31.s, #5
        lsl     z27.s, z31.s, #1
        add     z31.s, z31.s, z27.s
        mov     z28.d, z30.d
        ptest   p15, p14.b
        add     z30.s, p7/m, z30.s, z31.s
        mov     z31.d, z29.d
        incw    z29.s
        b.none  .L19
And Advanced SIMD:
.L18:
        ldr     q31, [x1, x4]
        cmgt    v31.4s, v31.4s, v26.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x0, d31
        cbnz    x0, .L4
        str     q27, [x1, x5]
        add     x1, x1, 16
        cmp     x1, x3
        beq     .L17
.L8:
        ldr     q31, [x2, x1]
        mov     v23.16b, v29.16b
        mov     v25.16b, v28.16b
        cmgt    v30.4s, v31.4s, v26.4s
        shl     v27.4s, v31.4s, 1
        add     v28.4s, v28.4s, v24.4s
        umaxp   v30.4s, v30.4s, v30.4s
        add     v31.4s, v31.4s, v27.4s
        fmov    x0, d30
        add     v29.4s, v29.4s, v31.4s
        cbz     x0, .L18
And AArch32 Advanced SIMD:
.L19:
        vld1.64   {d16-d17}, [ip:64]!
        vcgt.s32  q8, q8, q13
        vpmax.u32 d7, d16, d17
        vpmax.u32 d7, d7, d7
        vmov      r3, s14  @ int
        cbnz      r3, .L4
        cmp       r2, r0
        vst1.64   {d24-d25}, [lr:64]!
        beq       .L18
.L8:
        vld1.64   {d16-d17}, [r1:64]!
        adds      r2, r2, #1
        vmov      q2, q10  @ v4si
        vmov      q14, q9  @ v4si
        vcgt.s32  q11, q8, q13
        vadd.i32  q12, q8, q8
        vadd.i32  q10, q10, q15
        vpmax.u32 d7, d22, d23
        vadd.i32  q8, q8, q12
        vpmax.u32 d7, d7, d7
        vadd.i32  q9, q9, q8
        vmov      r3, s14  @ int
        cmp       r3, #0
        beq       .L19
The generated sequences can be optimized further which is on the roadmap for GCC 15.
The AArch64 architecture has helpful instructions that can load and store two consecutive registers from/to memory (LDP and STP). These instructions are not exposed as intrinsics, as the expectation is that the compiler should be able to form them on its own.
In GCC, this has historically been done by the sched-fusion pass, which relies on the scheduler to move candidate instructions next to each other to reduce the search space. While initially sufficient, this approach has proven not to be very flexible.
There are two main issues. First, any limitations the scheduler has are automatically inherited by the pass. For instance, the scheduler does not do as extensive an alias analysis as we would like, which prevents merging loads and stores that are further apart. The scheduler is also currently limited to scheduling instructions within a single basic block, whereas load/store-pair forming is an example of an optimization that could be done on extended basic blocks instead. The current implementation of the new pass is limited to a single basic block as well, with plans to extend this in the future.
The second issue is that in certain cases we ended up with worse code because the scheduler had placed the index-updating instruction between the two memory accesses. This forced the register allocator to create a copy of the old index value before incrementing it, resulting in higher register pressure and an additional instruction, instead of removing two.
To fix this, in GCC 14 we have introduced a new AArch64 pass that forms these pairs more aggressively, using a framework that maintains SSA (static single assignment) information on top of GCC’s RTL representation.
Having SSA information allows us to be more aggressive without incurring a large compile-time overhead. The new pass is enabled by default and catches many more cases than we previously did. The example:
unsigned __int128 *f(unsigned __int128 *p)
{
    unsigned __int128 ab;
    while ((ab = *p++))
      ;
    return p;
}
used to generate:
f(unsigned __int128*):
.L2:
        ldr     x1, [x0], 16
        ldr     x2, [x0, -8]
        orr     x1, x1, x2
        cbnz    x1, .L2
        ret
and now in GCC 14 generates:
f(unsigned __int128*):
.L2:
        ldp     x1, x2, [x0], 16
        orr     x1, x1, x2
        cbnz    x1, .L2
        ret
Part of the power of the new pass is that it is lightweight enough for us to run it twice: once before register allocation and once after. Running it before register allocation prevents issues such as scheduling from creating suboptimal allocations.
Running it after register allocation catches any stack-based loads and stores, which only gain a fixed stack location once register allocation is done.
In Part 2, we cover further topics from this release.
Read GCC-14 Part 2