This year's GNU Compiler Collection (GCC) release not only contains a plethora of new architecture features, but is also the most performant GCC release to date.
Arm, together with various partners, has been hard at work improving the performance of the code generated by the compiler. This collaborative effort across the Arm partnership, working closely with the wider GCC community throughout GCC 10 development, has produced one of the greatest increases in GCC performance in recent years.
You may be asking yourself what is new this year for GCC 10. Well, this year we put a 10 on the box: an estimated 10.5% performance improvement for AArch64, as measured using SPECrate® 2017 Integer on the Neoverse N1 SDP platform:
These are all rate=1 (single-core) improvements. They come from high-level transformations that benefit all CPUs: most notably those based on the Neoverse N1, such as Amazon's Graviton2 and Ampere's Altra, but also those based on other architectures.
How was this achieved? Can I expect to see any improvement in my own code? Keep reading to find out.
Auto-vectorization is the process by which scalar code, as written in your input language, is automatically translated into SIMD instructions by the compiler. After this transformation each instruction processes multiple data points at the same time instead of sequentially, allowing the computation to finish sooner than it otherwise would.
GCC's auto-vectorizer has undergone a lot of changes this year. Along the way, issues in areas such as the cost model and load permutes were fixed. Of all the changes that happened in the vectorizer, let us talk about some of the standout ones.
When vectorizing code for a fixed-vector-size ISA that does not support masking, GCC has a choice to make when the iteration count is unknown: it vectorizes using the widest profitable vector length. As an example, GCC 9 generates for
void loop(int n, char * restrict a, char * restrict b, char * restrict res)
{
  for (int i = 0; i < n; i++)
    {
      res[i] = a[i] + b[i];
    }
}
the following AArch64 assembly:
.L4:
        ldr     q0, [x1, x4]
        ldr     q1, [x2, x4]
        add     v0.16b, v0.16b, v1.16b
        str     q0, [x3, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L4
        …
This vector loop is followed by a scalar epilogue. The compiler picked 128-bit vectors as the vector length to use, since the loop bound n does not give it any information about the iteration count of the code. This is fine if the value of n is greater than or equal to 16 for each call to loop. As the size of the elements of a and b gets bigger, the minimum n needed to enter the vector code shrinks, because the vectorization factor (the number of elements processed in a single loop iteration) decreases.
If all calls to loop are made with a value of n between 8 and 16, then you never execute the vector code at all. GCC 10 has a new feature, vect-epilogues-nomask, which additionally vectorizes the loop epilogue, without masking, using the first profitable vector length smaller than the one used in the main loop. For loop, GCC 10 generates:
.L4:
        ldr     q0, [x1, x4]
        ldr     q1, [x2, x4]
        add     v0.16b, v0.16b, v1.16b
        str     q0, [x3, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L4
        ...
        beq     .L1
.L3:
        ...
        ldr     d0, [x1, w5, uxtw]
        and     w6, w7, -8
        ldr     d1, [x2, w5, uxtw]
        add     w4, w4, w6
        add     v0.8b, v0.8b, v1.8b
        str     d0, [x3, w5, uxtw]
        cmp     w7, w6
        beq     .L1
        …
That is, you have one loop as before that computes 16 elements at a time, followed by a single (non-looping) copy that computes 8 elements at a time. The vector code is then followed by a scalar loop that handles the remaining fewer-than-8 elements. This maximizes the amount of time spent in vector code.
This option is enabled by default when vectorization is enabled and can be turned off with --param vect-epilogues-nomask=0.
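To make the benefit concrete, here is a small, hypothetical driver (my own illustration, not from the original measurements) that only ever calls loop with n = 12; with GCC 10 the bulk of the work is then done by the 8-byte vector epilogue instead of falling all the way back to scalar code:

#include <string.h>

/* loop() as defined earlier: res[i] = a[i] + b[i] for i < n. */
void loop(int n, char * restrict a, char * restrict b, char * restrict res);

int main (void)
{
  char a[12], b[12], res[12];
  memset (a, 1, sizeof a);
  memset (b, 2, sizeof b);

  /* n = 12 never reaches the 16-byte main vector loop.  With the
     vectorized epilogue the first 8 elements are still handled by an
     8-byte vector add; only the last 4 go through the scalar tail.  */
  loop (12, a, b, res);
  return res[0];   /* expected to be 3 */
}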
GCC 9 was unable to vectorize a reduction (such as a sum) in a loop if the reduction variable has a different signedness from the values it accumulates. As an example, this loop was not vectorized in GCC 9:
#include <stdint.h>

int32_t bar(int n, uint32_t * restrict x)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    {
      sum += x[i];
    }
  return sum;
}
and that is because sum has a different signedness from the elements of x. However, because the additions in the loop body are performed in unsigned arithmetic, and unsigned overflow is well defined, the reduction can be vectorized safely. GCC 10 now generates:
.L14:
        ldr     q1, [x2], 16
        add     v0.4s, v1.4s, v0.4s
        cmp     x2, x0
        bne     .L14
        addv    s0, v0.4s
        umov    w0, v0.s[0]
Such reductions no longer prevent vectorization of the entire loop like in GCC 9.
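To spell out why this is safe, the accumulation in bar is performed after the usual arithmetic conversions, so it behaves as if it were written with explicit casts as below (this rewritten form is my own illustration of the language rules, not compiler output):

#include <stdint.h>

int32_t bar_explicit (int n, uint32_t * restrict x)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    /* sum is converted to uint32_t, the addition wraps modulo 2^32
       (well defined for unsigned types), and GCC defines the conversion
       of the result back to int32_t as modulo reduction, so there is no
       signed overflow anywhere in the loop.  */
    sum = (int32_t) ((uint32_t) sum + x[i]);
  return sum;
}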
GCC 9's straight-line vectorizer (for code that is not part of a loop) was able to recognize vectorizable sequences starting from reductions or grouped stores, but not from vector constructors.
To illustrate the problem, consider this loop:
char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)
{
  char *d = g_d, *s1 = g_s1, *s2 = g_s2;

  for ( int y = 0; y < 128; y++ )
    {
      for ( int x = 0; x < 16; x++ )
        d[x] = s1[x] + s2[x];
      d += 16;
    }
}
The inner loop has a constant iteration count of 16, which leads to the loop being fully unrolled by the tree-level unroller in GCC's mid-end into something like:
char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)
{
  char *d = g_d, *s1 = g_s1, *s2 = g_s2;

  for ( int y = 0; y < 128; y++ )
    {
      d[0] = s1[0] + s2[0];
      d[1] = s1[1] + s2[1];
      d[2] = s1[2] + s2[2];
      d[3] = s1[3] + s2[3];
      ...
      d[15] = s1[15] + s2[15];
      d += 16;
    }
}
However, what the compiler actually does, due to constraints on the intermediate representation, is:
char g_d[1024], g_s1[1024], g_s2[1024];

void test_loop(void)
{
  char *d = g_d, *s1 = g_s1, *s2 = g_s2;

  for ( int y = 0; y < 128; y++ )
    {
      d0 = s1[0] + s2[0];
      d1 = s1[1] + s2[1];
      d2 = s1[2] + s2[2];
      d3 = s1[3] + s2[3];
      ...
      d15 = s1[15] + s2[15];
      d_constr = {d0, d1, d2, d3, ..., d15};
      *d = d_constr;   // store the entire thing
      d += 16;
    }
}
The store to d can end up in a different basic block due to unrolling and loop unswitching, but GCC's straight-line vectorizer only looks for stores within a single basic block. To fix this, the vectorizer has been taught to treat the vector constructor as a vectorization seed, just like a store. The added benefit is that even if you never create an explicit store, you still get vectorization.
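One way to see the "constructor without a store" case from C is with GCC's generic vector extensions. The sketch below is my own illustration of the kind of shape that can now be vectorized, rather than an example taken from the GCC test suite:

/* A 16-byte vector of chars using GCC's vector extensions.  */
typedef char v16qi __attribute__ ((vector_size (16)));

v16qi add_bytes (const char * restrict s1, const char * restrict s2)
{
  v16qi r;
  /* After full unrolling, each element assignment feeds a vector
     constructor rather than a memory store, which the GCC 10
     straight-line vectorizer can now use as a starting point.  */
  for (int i = 0; i < 16; i++)
    r[i] = s1[i] + s2[i];
  return r;   /* no explicit store to memory */
}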
GCC has a long list of idioms and patterns that it recognizes such as last year's Sum of Absolute Differences. This year we decided it's only fair to add some more.
Population count, or popcount, counts the number of bits that are set to 1 in a value. A common C idiom for computing the popcount of a 64-bit number is
int foo (unsigned long long a)
{
  unsigned long long b = a;
  b -= ((b>>1) & 0x5555555555555555ULL);
  b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
  b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
  b *= 0x0101010101010101ULL;
  return (int)(b >> 56);
}
Architectures like AArch64 provide dedicated instructions for calculating the popcount of a value, and GCC 10 can now recognize this idiom for 32-bit and 64-bit popcounts and replace it with an optimal sequence. For the example above, GCC 10 generates:
foo:
        fmov    d0, x0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        ret
For comparison, GCC 9 emitted the full bit-manipulation sequence:

foo:
        lsr     x1, x0, 1
        mov     x3, 0x101010101010101
        and     x1, x1, 0x5555555555555555
        sub     x1, x0, x1
        and     x2, x1, 0x3333333333333333
        lsr     x1, x1, 2
        and     x1, x1, 0x3333333333333333
        add     x1, x2, x1
        add     x1, x1, x1, lsr 4
        and     x0, x1, 0xF0F0F0F0F0F0F0F
        mul     x0, x0, x3
        lsr     x0, x0, 56
        ret
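If you would rather not rely on idiom recognition at all, the same result can be requested directly through GCC's __builtin_popcountll, which on AArch64 should compile down to the same cnt/addv sequence shown above:

/* Equivalent to the hand-written idiom above. */
int foo_builtin (unsigned long long a)
{
  return __builtin_popcountll (a);
}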
Similar to the popcount idiom, there is also a reasonably common trick used to speed up counting trailing zeros (ctz):
int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}
which does the ctz lookup using a table. The table above is for a 32-bit number and the operation is essentially doing
array[((x & -x) * C) >> SHIFT]
where C is a magic constant which, when multiplied by a power of two, produces a unique value in the top 5 or 6 bits. Matching this idiom is reasonably expensive in the compiler, as it needs to verify each table entry; however, most code bases will not have more than one of these tables.
With GCC 10 this entire sequence is compiled down to:
        rbit    w0, w0
        clz     w0, w0
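As with popcount, the portable way to ask for this directly is GCC's __builtin_ctz, which on AArch64 produces the same rbit/clz pair (note that, unlike the table lookup, the builtin's result is undefined when x is zero):

int ctz2 (unsigned x)
{
  /* Undefined if x == 0, so guard the call if zero is possible. */
  return __builtin_ctz (x);
}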
Inter-procedural analysis (IPA) in GCC is the set of phases that perform cross-function optimizations, such as cloning and specialization of functions. GCC 10 comes with a few improvements here, particularly in the areas of constant propagation and switch statement specialization. Let us discuss some of the changes that made their way into this release.
Given the following simplified example:
int callee(int v)
{
  // if (v < 3)     // will make a constprop copy with v = 5
  if (v * 2 < 6)    // will not make a constprop copy
    {
      ...
    }
  else
    return 1;
}

int caller()
{
  return callee (5);
}
GCC 9 (with inlining disabled) will not propagate the value 5 into the if statement when the expression contains multiple operations, so the function is not specialized and you still have a branch at runtime. GCC drives many of its optimizations by cost models, and in this particular case there is a conservative limit on how much of an expression it will constant fold.
With GCC 10, there is a more precise model, which can be controlled by --param ipa-max-param-expr-ops. Consequently, the compiler can fully propagate constants in cases like the one shown above.
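Conceptually, the specialized copy that GCC 10 can now create for callee(5) looks something like the following; the clone name and simplified body are a hand-written sketch of the transformation, not actual compiler output:

/* 5 * 2 < 6 is false, so only the else branch survives in the clone. */
static int callee_constprop_v5 (void)
{
  return 1;
}

int caller ()
{
  return callee_constprop_v5 ();
}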
Like the if case above, switches had a similar limitation. Consider the example:
int callee(int v)
{
  int i = 0;

  switch (v)
    {
    case 1:
      i = 3;
      break;
    case 2:
      i = 9;
      break;
    default:
      <large body>;
    }

  return i;
}

int caller()
{
  return callee (1);
}
In GCC 9, IPA constant propagation would always include the cost of the default case when calculating the cost of the individual case statements. Because of this, it was unable to fold away the switch in cases like the one above. GCC 10 has a new parameter, --param ipa-max-switch-predicate-bounds, which specifies the maximum number of boundary endpoints of the case ranges of a switch statement, letting you control how aggressively the compiler performs this optimization.
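With the new bound in place, the clone for callee(1) can again collapse to something trivial, roughly along these lines (again a hand-written sketch rather than compiler output):

/* For v == 1, case 1 sets i = 3, so the whole switch, including the
   large default body, folds away in the clone. */
static int callee_constprop_v1 (void)
{
  return 3;
}

int caller ()
{
  return callee_constprop_v1 ();
}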
In GCC 9, IPA only supported constant propagation through by-reference arguments when the value assigned was itself a constant. If the value is instead the result of a unary or binary operation on a function argument, no constant propagation was done. As an example, the following:
struct S
{
  int a, b;
};

void callee2 (struct S *s)
{
  // use s->a;
  // use s->b;
}

void callee1 (int a, int *p)
{
  struct S s;
  s.a = a;
  s.b = *p + 1;
  callee2 (&s);
}

void caller ()
{
  int a = 1;
  int b = 2;
  callee1 (a, &b);
}
will not be optimized by IPA in GCC 9: no specialization is created because the compiler does not know that the s.b field of the s passed to callee2 is 3, since constant propagation fails on the addition. In GCC 10 this limitation has been lifted and such by-reference cases are now handled correctly.
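Put differently, GCC 10 can now work out that callee2 always sees s->a == 1 and s->b == *p + 1 == 3 here; a hand-written equivalent of the specialization it can build is roughly:

/* Hypothetical specialized clone: both fields of *s are known. */
static void callee2_constprop (struct S *s)
{
  // use 1 instead of s->a;
  // use 3 instead of s->b;
}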
Another area of improvement is constant propagation of a parameter that is used to control the recursion of a function.
int recur_fn (int i)
{
  if (i == 6)
    {
      do_work ();
      return 0;
    }
  do_prepare ();
  recur_fn (i + 1);
  do_post ();
  return 0;
}

int foo()
{
  ...
  recur_fn (1);
  ...
}
In such cases, recur_fn can be duplicated six times for the call in foo, since the compiler knows when the recursion terminates. After duplication, the recursive calls can be optimized away into branches to labels, removing any function-call overhead.
This transformation can be controlled by three parameters: --param ipa-cp-eval-threshold, --param ipa-cp-unit-growth, and the new --param ipa-cp-max-recursive-depth. However, do keep in mind that using more aggressive values may trigger worst-case behavior in certain parts of the compiler, such as register allocation. This tends to happen when you also have rather large functions that are only called once; in such cases, using -fno-inline-functions-called-once will get you the best of both worlds.
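Sketching the effect by hand (with invented clone names), the call recur_fn(1) can be expanded into a chain of specialized clones, each of which knows its own value of i; the calls between them are then plain direct calls that can be inlined or turned into fallthrough:

static int recur_fn_6 (void) { do_work ();    return 0; }
static int recur_fn_5 (void) { do_prepare (); recur_fn_6 (); do_post (); return 0; }
static int recur_fn_4 (void) { do_prepare (); recur_fn_5 (); do_post (); return 0; }
static int recur_fn_3 (void) { do_prepare (); recur_fn_4 (); do_post (); return 0; }
static int recur_fn_2 (void) { do_prepare (); recur_fn_3 (); do_post (); return 0; }
static int recur_fn_1 (void) { do_prepare (); recur_fn_2 (); do_post (); return 0; }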
One particular area where GCC 10 now does better is C++ loops using iterators.
A new feature in GCC 10 is the ability to split a loop based on a condition that is only changed by some of the branches inside it. As an example, consider the following loop:
void foo (std::map<int, int> m)
{
  for (auto it = m.begin (); it != m.end (); ++it)
    {
      /* if (b) is semi-invariant. */
      if (b)
        {
          b = do_something();  /* Has effect on b */
        }
      else
        {
          /* No effect on b */
        }
      statements;              /* Also no effect on b */
    }
}
Here b is only changed in one branch inside the loop. However, as soon as b is no longer true, that branch can never be entered again, yet the value still has to be tested in each iteration. Instead, such loops can be split into a form that avoids this penalty, such as:
void foo (std::map<int, int> m)
{
  for (auto it = m.begin (); it != m.end (); ++it)
    {
      if (b)
        {
          b = do_something();
        }
      else
        {
          ++it;
          statements;
          break;
        }
      statements;
    }

  for (; it != m.end (); ++it)
    {
      statements;
    }
}
Aside from avoiding the branch check in each iteration, there is the extra benefit that if the statements block is empty then the second loop can be completely eliminated. If it is not empty, the simpler form of the second loop makes it easier to vectorize.
This transformation is enabled by default at -O3 and can be turned off with -fno-split-loops. When using feedback-directed optimization, the transformation is not applied if the execution frequency of the loop is too low for it to be profitable. This cut-off percentage can be controlled by the new --param min-loop-cond-split-prob parameter.
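The transformation is not limited to C++ iterator loops; any loop whose condition has this "once false, always false" shape is a candidate. A minimal C sketch of such a loop (my own example, not from the original post) would be:

/* 'active' can only go from 1 to 0 inside the loop, never back, so
   the condition is semi-invariant and -fsplit-loops can split the
   loop at the point where it becomes 0. */
int sum_all (const int *data, int n)
{
  int active = 1, total = 0;
  for (int i = 0; i < n; i++)
    {
      if (active)
        active = (data[i] > 0);  /* the only statement that changes 'active' */
      total += data[i];          /* independent of 'active' */
    }
  return total;
}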
As a consequence of the previous transformation, we can end up with a lot of empty loops. In general, given the following loop:
void foo (std::map<int, int> m)
{
  for (auto it = m.begin (); it != m.end (); ++it)
    ;
}
If your iterator is a pure function (or, more generally, the loop is finite) then the loop can be removed. To do this, a new option -ffinite-loops was added, which assumes that a loop written with an exit will eventually take that exit. This, together with a pure iterator, allows the compiler to remove empty loops safely.
This option is on by default at -O2 for C++11 and later.
Thanks to the hard work of everyone involved, we have an impressive estimated 10.5% improvement in SPECrate® 2017 Integer on Neoverse N1. These improvements are applicable to a wide range of applications, not just this benchmark.
To give an indication of the scale of these improvements, we can compare the estimated improvements in integer performance from GCC 7 to GCC 10:
To read the previous installments in the GCC performance series, visit the 2019 or 2018 entries.
Stay tuned, we will be back with more in GCC 11!
Read about new Arm architecture features in GCC 10