
Significant performance improvements in GCC 10 through Vectorization and In-lining

Tamar Christina
May 22, 2020
13 minute read time.

This year's GNU Compiler Collection (GCC) release not only contains a plethora of new architecture features, but is also the most performant GCC release to date.

Arm, together with various partners and the GCC community, has been hard at work improving the performance of the code generated by the compiler. This collaboration between the Arm partnership and the wider GCC community during GCC 10 development has produced one of the greatest increases in GCC performance in recent years.

You may be asking yourself what is new this year in GCC 10. Well, this year we put a 10 on the box: an estimated 10.5% performance improvement on AArch64, as measured with SPECrate® 2017 Integer on the Neoverse N1 SDP platform:

 
[Figure: GCC 10 SPECrate® 2017 Integer on Neoverse N1, estimated relative improvements vs GCC 9.]
 

These are all rate=1 (single-core) improvements. They come from high-level transformations that benefit all CPUs: most notably those based on the Neoverse N1, such as Amazon's Graviton2 and Ampere's Altra, but also those based on other architectures.

How was this achieved? Can I expect to see any improvement in my own code? Keep reading to find out.

GCC auto-vectorizer improvements

Auto-vectorization is the process by which scalar code, as written in your input language, is automatically translated into SIMD instructions by the compiler. After this transformation, each instruction processes multiple data points at the same time instead of sequentially, allowing your computation to finish sooner than it otherwise would.

GCC's auto-vectorizer has undergone a lot of changes this year. Along the way, issues with the cost model and load permutes were fixed. Of all the changes that happened in the vectorizer, let us talk about some of the standout ones.

Vector epilogues

When vectorizing code for a fixed-vector-size ISA that does not support masking, GCC has a choice to make when the iteration count is unknown: it vectorizes using the widest profitable vector length. As an example, for

void loop(int n, char * restrict a, char * restrict b, char * restrict res) {
  for (int i = 0; i < n; i++) {
    res[i] = a[i] + b[i];
  }
}

the following AArch64 assembly:

.L4:
        ldr     q0, [x1, x4]
        ldr     q1, [x2, x4]
        add     v0.16b, v0.16b, v1.16b
        str     q0, [x3, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L4
        …

And a scalar epilogue. The compiler picked 128-bit vectors as the vector length to use, since the range of the loop counter n gives it no information about the iteration count of the code. This is fine if, for each call to loop, the value of n is greater than or equal to 16. Note that as the size of the elements of a and b gets bigger, the vectorization factor (the number of elements processed in a single loop iteration) decreases, and with it the minimum n needed to reach the vector code.

If all calls to loop are made with a value of n between 8 and 16, then you never execute the vector code at all. GCC 10 has a new feature, controlled by --param vect-epilogues-nomask, which creates an additional vectorized loop epilogue. This epilogue uses the first profitable vector length smaller than the one used in the main loop. For loop, GCC 10 generates:

.L4:
        ldr     q0, [x1, x4]
        ldr     q1, [x2, x4]
        add     v0.16b, v0.16b, v1.16b
        str     q0, [x3, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L4
       ...
        beq     .L1
.L3:
        ...
        ldr     d0, [x1, w5, uxtw]
        and     w6, w7, -8
        ldr     d1, [x2, w5, uxtw]
        add     w4, w4, w6
        add     v0.8b, v0.8b, v1.8b
        str     d0, [x3, w5, uxtw]
        cmp     w7, w6
        beq     .L1
              …

That is, you have one loop as before that computes 16 elements at a time, then a single (non-loop) copy that computes 8 elements at a time. The vector code is followed by a scalar loop that handles the remaining fewer-than-8 elements. This maximizes the amount of time spent in vector code.
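The overall shape of the generated code can be sketched in plain C (a conceptual model, not the code GCC actually emits; the helper name is illustrative):

```c
#include <assert.h>
#include <string.h>

/* Conceptual sketch of the GCC 10 epilogue strategy for the loop above:
   a 16-wide vector body, a single 8-wide vector epilogue step, then a
   scalar tail.  The "vector" steps are modelled with plain loops here. */
static void loop_epilogue_sketch(int n, const char *a, const char *b,
                                 char *res) {
    int i = 0;
    for (; i + 16 <= n; i += 16)          /* main loop: 16 elements/iter */
        for (int j = 0; j < 16; j++)
            res[i + j] = a[i + j] + b[i + j];
    if (i + 8 <= n) {                     /* vector epilogue: one 8-wide step */
        for (int j = 0; j < 8; j++)
            res[i + j] = a[i + j] + b[i + j];
        i += 8;
    }
    for (; i < n; i++)                    /* scalar epilogue: < 8 elements */
        res[i] = a[i] + b[i];
}
```

For any n, this produces the same result as the original scalar loop; the only difference is how much of the work runs through the wide, medium, and scalar paths.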

This option is enabled by default when vectorization is enabled and can be turned off with --param vect-epilogues-nomask=0.

Sign reduction

GCC 9 was unable to vectorize a reduction (sum) in a loop if the reduction variable had a different signedness than the values being accumulated. As an example, this loop was not vectorized in GCC 9:

int32_t bar(int n, uint32_t * restrict x) {
  int32_t sum = 0;
  for (int i = 0; i < n; i++) {
     sum += x[i];
  }
  return sum;
}
 

and that is because sum has a different signedness from the elements of x. However, because the additions in the loop body are done on an unsigned value, and unsigned overflow is well-defined, we can safely vectorize this reduction. GCC 10 now generates:

.L14:
        ldr     q1, [x2], 16
        add     v0.4s, v1.4s, v0.4s
        cmp     x2, x0
        bne     .L14
        addv    s0, v0.4s
        umov    w0, v0.s[0]
 

Such reductions no longer prevent vectorization of the entire loop like in GCC 9.
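Why this is safe can be demonstrated directly: unsigned addition wraps modulo 2^32, so the vectorizer may reassociate the additions (into per-lane partial sums) without changing the final bit pattern. A small sketch (the partial-sum version mimics a 4-lane vector reduction; the unsigned-to-signed conversion relies on GCC's wrapping behavior):

```c
#include <assert.h>
#include <stdint.h>

/* Same reduction as bar() in the article: signed accumulator,
   unsigned elements. */
static int32_t bar_ref(int n, const uint32_t *x) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i];                 /* addition happens in uint32_t */
    return sum;
}

/* The same sum in a different association order: 4 partial sums, as a
   4-lane vector reduction would compute, combined at the end.  Modular
   unsigned addition is associative and commutative, so the result is
   bit-identical to bar_ref(). */
static int32_t bar_reassoc(int n, const uint32_t *x) {
    uint32_t lane[4] = {0, 0, 0, 0};
    for (int i = 0; i < n; i++)
        lane[i % 4] += x[i];
    return (int32_t)(lane[0] + lane[1] + lane[2] + lane[3]);
}
```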

Vectorize vector constructors

GCC 9's straight-line vectorizer (which handles code that is not part of a loop) was able to recognize sequences starting from reductions or grouped stores, but not from vector constructors.

To illustrate the problem, consider this loop:

char g_d[1024], g_s1[1024], g_s2[1024];
void test_loop(void)
{
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;
   for ( int y = 0; y < 128; y++ )
   {
     for ( int x = 0; x < 16; x++ )
       d[x] = s1[x] + s2[x];
     d += 16;
   }
}
 

The inner loop has a small constant iteration count (16), which leads to it being fully unrolled by the tree-level unroller in GCC's mid-end into something like:

 
char g_d[1024], g_s1[1024], g_s2[1024];  
void test_loop(void)  
{  
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;  
    for ( int y = 0; y < 128; y++ )  
    {  
        d[0] = s1[0] + s2[0];  
        d[1] = s1[1] + s2[1];  
        d[2] = s1[2] + s2[2];  
        d[3] = s1[3] + s2[3];  
        ...  
        d[15] = s1[15] + s2[15];  
        d += 16;  
    }  
}
 

However, what the compiler actually does, due to constraints on the intermediate representation, is:

 
char g_d[1024], g_s1[1024], g_s2[1024];  
void test_loop(void)  
{  
    char *d = g_d, *s1 = g_s1, *s2 = g_s2;  
    for ( int y = 0; y < 128; y++ )  
    {  
        d0 = s1[0] + s2[0];  
        d1 = s1[1] + s2[1];  
        d2 = s1[2] + s2[2];  
        d3 = s1[3] + s2[3];  
        ...  
        d15 = s1[15] + s2[15];  
        d_constr = {d0, d1, d2, d3, ..., d15};
        *d = d_constr; //store the entire thing
        d += 16;  
    }  
}


The store to d has been moved to a different basic block by unrolling and loop unswitching. However, GCC's straight-line vectorizer only looks for stores within a single basic block. To fix this, the vectorizer has been taught to also recognize vector constructors such as d_constr as a starting point. The added benefit is that even if you do not end up creating an explicit store, you still get vectorization.

More advanced pattern recognition

GCC has a long list of idioms and patterns that it recognizes such as last year's Sum of Absolute Differences. This year we decided it's only fair to add some more.

Popcount

Population count, or popcount, is an algorithm that counts the number of bits set to 1 in a value. A common C idiom for computing the popcount of a 64-bit number is:

 
int
foo (unsigned long long a)
{
    unsigned long long b = a;
    b -= ((b>>1) & 0x5555555555555555ULL);
    b = ((b>>2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
    b = ((b>>4) + b) & 0x0F0F0F0F0F0F0F0FULL;
    b *= 0x0101010101010101ULL;
    return (int)(b >> 56);
}
 

Architectures like AArch64 provide a single instruction for calculating the popcount of a value, and GCC 10 can now recognize this idiom for 32-bit and 64-bit popcounts and replace it with an optimal sequence. For the example above, GCC 10 now generates:

foo:
        fmov    d0, x0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        ret
 
instead of
 
foo:
        lsr     x1, x0, 1
        mov     x3, 0x101010101010101
        and     x1, x1, 0x5555555555555555
        sub     x1, x0, x1
        and     x2, x1, 0x3333333333333333
        lsr     x1, x1, 2
        and     x1, x1, 0x3333333333333333
        add     x1, x2, x1
        add     x1, x1, x1, lsr 4
        and     x0, x1, 0xF0F0F0F0F0F0F0F
        mul     x0, x0, x3
        lsr     x0, x0, 56
        ret
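The bit-manipulation idiom really does compute a popcount, which can be checked against GCC's __builtin_popcountll (itself compiled to the cnt/addv sequence shown above on AArch64; a quick sketch, assuming a GCC-compatible compiler for the builtin):

```c
#include <assert.h>

/* The SWAR popcount idiom from the article: sum bits in 2-bit groups,
   then 4-bit groups, then 8-bit groups, and finally gather all byte
   counts into the top byte with a multiply. */
static int popcount64(unsigned long long b) {
    b -= (b >> 1) & 0x5555555555555555ULL;
    b = ((b >> 2) & 0x3333333333333333ULL) + (b & 0x3333333333333333ULL);
    b = ((b >> 4) + b) & 0x0F0F0F0F0F0F0F0FULL;
    b *= 0x0101010101010101ULL;
    return (int)(b >> 56);
}
```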

Table based count trailing zeros (ctz)

Similar to the popcount trick, there is a reasonably common trick used to speed up counting trailing zeros (ctz).

 Consider:
 
int ctz1 (unsigned x)
{
  static const char table[32] =
    {
      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
}
 

which does the ctz lookup using a table. The table above is for a 32-bit number and the operation is essentially doing

 
array[((x & -x) * C) >> SHIFT]
 

where C is a magic constant which, when multiplied by a power of two, creates a unique value in the top 5 or 6 bits. Matching this table is reasonably expensive in the compiler, as it needs to verify each table entry. However, most code bases will not have more than one of these tables.

With GCC 10 this entire sequence is compiled down to:

 
 rbit w0, w0
 clz w0, w0
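The claim about the magic constant can be checked directly: 0x077CB531 is a de Bruijn constant, so multiplying it by each of the 32 powers of two yields 32 distinct top-5-bit indices, which is exactly what makes the 32-entry table lookup in ctz1 well-defined. A small verification sketch:

```c
#include <assert.h>

/* Verify the de Bruijn property of 0x077CB531: each power of two,
   multiplied by the constant and shifted right by 27, produces a
   distinct 5-bit index covering all of 0..31. */
static int debruijn_indices_distinct(void) {
    unsigned seen = 0;                     /* bitmask of indices hit */
    for (int k = 0; k < 32; k++) {
        unsigned idx = ((1u << k) * 0x077CB531u) >> 27;
        if (seen & (1u << idx))
            return 0;                      /* collision: not distinct */
        seen |= 1u << idx;
    }
    return seen == 0xFFFFFFFFu;            /* all 32 indices covered */
}
```

Since x & -x isolates the lowest set bit (a power of two), this property guarantees the table lookup returns the right answer for every nonzero input.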

IPA improvements

Inter-procedural analysis (IPA) in GCC is the set of phases that perform cross-function optimizations, such as cloning/specialization of functions. GCC 10 comes with a few improvements here, particularly in the areas of constant propagation and switch-statement specialization. Let us discuss some of the changes that made their way into the release this year.

Improved constant propagation

Given the following simplified example:

 
int callee(int v)
{
  // if (v < 3)      // will make a constprop copy with v = 5
  // if (v * 2 < 6)  // will not make a constprop copy
    {
       ...
    }
  else
    return 1;
}

int caller()
{
  return callee (5);
}
 

GCC 9 (with in-lining disabled) will not propagate the value 5 into the if statement when the expression contains multiple operations. Because of this, the function is not specialized and you still have a branch at runtime. GCC does a lot of optimizations based on costs, and in this particular case there is a conservative limit on how far it constant-folds an expression.

With GCC 10, there is a more precise model, which can be controlled by --param ipa-max-param-expr-ops. Consequently, the compiler can fully propagate constants in cases like the one shown above.

More aggressive switch statement specializations

Like the if case above, switches had a similar limitation. Consider the example:

 
int callee(int v)
{
  int i = 0;
  switch (v)
   {
     case 1:
       i = 3;
       break;
     case 2:
       i = 9;
       break;
     default: 
        <large body>;
  }
  return i;
}

int caller()
{
  return callee (1);
}
 

In GCC 9, inter-procedural constant propagation would always include the cost of the default case when calculating the cost of the individual case statements. Because of this, it was not able to fold away the switch in cases like the one above.

GCC 10 has a new parameter, --param ipa-max-switch-predicate-bounds, which specifies the maximum number of boundary endpoints of the case ranges of a switch statement. This lets you, as the user, control how aggressively the compiler performs this optimization.

By-ref arguments constant propagation

In GCC 9, inter-procedural analysis (IPA) only supported constant propagation for by-reference arguments assigned a constant directly. If the value was instead the result of a unary or binary operation on a function argument, no constant propagation was done. As an example, the following:

struct S
{
  int a, b;
};

void callee2 (struct S *s) {
  // use s->a;
  // use s->b; 
}

void callee1 (int a, int *p) {
  struct S s;
  s.a = a;
  s.b = *p + 1;
  callee2 (&s);
}
 
when called with
 
void caller () {
  int a = 1;
  int b = 2;
  callee1 (a, &b);
}


will not be optimized by GCC 9's IPA. No specialization is created because the compiler does not know that the s.b field of the s passed to callee2 is 3, since constant propagation fails on the addition. In GCC 10 this limitation has been lifted and such by-reference cases are now handled correctly.
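The value IPA can now deduce is simple to spell out: with caller passing a = 1 and a pointer to b = 2, the aggregate reaching callee2 has the compile-time-known contents {1, 3}. A minimal sketch of that arithmetic (the helper name is illustrative):

```c
#include <assert.h>

struct S {
    int a, b;
};

/* Mirrors callee1 from the article: one field set from an argument
   directly, the other from a binary operation on a dereferenced
   argument.  GCC 10's IPA can now track both through the call. */
static struct S build_s(int a, const int *p) {
    struct S s;
    s.a = a;        /* propagated directly, handled in GCC 9 already */
    s.b = *p + 1;   /* the "+ 1" is what GCC 9 gave up on */
    return s;
}
```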

Recursive in-lining

Another area of improvement is allowing constant propagation of a parameter that is used to control the recursion of a function.

 
int recur_fn (int i)
{
  if (i == 6) {
      do_work ();
      return 0;
  }
  do_prepare ();
  recur_fn (i + 1);
  do_post ();
  return 0;
}

int foo() {
  ...
  recur_fn (1);
  ...
}
 

In such cases, recur_fn can be duplicated into foo six times, since we know when the recursion terminates. After duplication, the recursive calls can be optimized away into branches to labels, which removes any function-call overhead.
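A hand-written model of what the cloning produces can show the two forms are equivalent. Here do_prepare, do_work, and do_post are stand-ins that count their calls (the article leaves them undefined), so the recursive original and a fully specialized version can be compared:

```c
#include <assert.h>

static int n_prepare, n_work, n_post;
static void do_prepare(void) { n_prepare++; }
static void do_work(void)    { n_work++; }
static void do_post(void)    { n_post++; }

/* Original recursive version from the article. */
static int recur_fn(int i) {
    if (i == 6) {
        do_work();
        return 0;
    }
    do_prepare();
    recur_fn(i + 1);
    do_post();
    return 0;
}

/* What recursive cloning can unroll recur_fn(1) into: the recursion
   depth is known, so all calls collapse into straight-line code. */
static int recur_fn_specialized(void) {
    do_prepare(); do_prepare(); do_prepare(); do_prepare(); do_prepare();
    do_work();
    do_post(); do_post(); do_post(); do_post(); do_post();
    return 0;
}
```

Both versions call do_prepare five times, do_work once, and do_post five times; the specialized form simply has no call or return overhead for the recursion itself.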

This transformation can be controlled by three new parameters: --param ipa-cp-eval-threshold, --param ipa-cp-unit-growth, and --param ipa-cp-max-recursive-depth. However, keep in mind that more aggressive values may trigger worst-case behavior in certain parts of the compiler, such as register allocation. This tends to happen when you also have rather large functions that are only called once; in such cases, -fno-inline-functions-called-once will get you the best of both worlds.

Loop optimization improvements

One particular area where GCC 10 now does better is C++ loops using iterators.

Split loops on semi-invariant conditionals

A new feature in GCC 10 is the ability to split a loop based on a condition that is only affected by some branches inside the loop.

As an example consider the following loop:

 
void foo (std::map<int, int> m) {
 for (auto it = m.begin (); it != m.end (); ++it) {
  /* if (b) is semi-invariant. */
  if (b) {
   b = do_something();    /* Has effect on b */
  } else {
           /* No effect on b */
  }
  statements;                /* Also no effect on b */
 }
}
 

Here b is only changed in one branch inside the loop. As soon as b becomes false, that branch can never be entered again, yet the value has to be tested in every iteration.
Instead, such loops can be split into a form that avoids this penalty:

void foo (std::map<int, int> m) {
 for (auto it = m.begin (); it != m.end (); ++it) {
  if (b) {
   b = do_something();
  } else {
   ++it;
   statements;
   break;
  }
  statements;
 }
 for (; it != m.end (); ++it) {
  statements;
 }
}
 

Aside from avoiding the branch test in each iteration, there is the extra benefit that if statements is empty, the second loop can be eliminated completely.
And if statements is not empty, the simpler loop form makes the second loop easier to vectorize.

This transformation is enabled by default at -O3 and can be turned off with -fno-split-loops. When using feedback-directed optimization, the transformation is not applied if the execution frequency of the loop is too low to be profitable. This cut-off percentage can be controlled by the new --param min-loop-cond-split-prob parameter.

C++ empty loop elimination

As a consequence of the previous transformation, we can end up with a lot of empty loops. In general, given the following loop:

 
void foo (std::map<int, int> m) {
  for (auto it = m.begin (); it != m.end (); ++it);
}
 

If your iterator is a pure function (or, more generally, the loop is finite), then the loop can be removed. To allow this, a new option -ffinite-loops was added, which assumes that a loop written with an exit will eventually take that exit. Together with pure-function analysis, this lets the compiler remove empty loops safely.

This option is on by default at -O2 for C++11 and later.

Performance impact past, present and future

Thanks to the hard work of everyone involved, we have an impressive estimated 10.5% improvement on SPECrate® 2017 Integer on Neoverse N1. These improvements apply to a wide range of applications, not just this benchmark.

To give an indication of the scale of these improvements, we can compare the estimated improvements in integer performance from GCC 7 to GCC 10:

 
[Figure: Estimated GCC improvements on SPECrate® 2017 Integer on Neoverse N1, from GCC 7 to GCC 10.]
 

To read the previous installments in the GCC performance series visit the 2019 or 2018 entries.

Stay tuned, we are back with more in GCC 11!

Read about new Arm architecture features in GCC 10
