A lot has happened since the last GNU toolchain performance update. The GCC and glibc communities have been hard at work since then, improving the compiler and the library with new usability, language, debugging, security and a myriad of other features.
The GNU toolchain team here at Arm is deeply involved in both projects, making sure that all these vital components take full advantage of the awesome features in the Arm Architecture and Arm-based processors. I'd like to update you on some of the performance improvements we've been working on. These are queued for the GCC 9 and glibc 2.29 release in 2019. Let's dive in.
Popular media codecs like x264 have many performance-critical loops operating on 8-bit values, usually representing the RGB or alpha channels, or other properties of a pixel. The loops themselves may calculate various aggregates used in a filter, like a sum of absolute differences of adjacent pixels, or their average. Let's have a look at some examples.
A simple, commonly-used way of computing the "difference" between two images is to compute the absolute differences between the corresponding pixels in each image and sum them up.
In C, this can be expressed as:
#include <stdlib.h>

#define N 16

/* The two 16-pixel images.  */
unsigned char img1[N];
unsigned char img2[N];

int
sum_of_absolute_differences (void)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
    sum += abs (img1[i] - img2[i]);
  return sum;
}
Let's compile that with an AArch64-targeted GCC 8.1 compiler at the -O3 level. We get the following assembly:
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        ldr     q0, [x1]
        ldr     q4, [x0]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        usubl   v3.4s, v1.4h, v2.4h
        usubl2  v1.4s, v1.8h, v2.8h
        usubl   v2.4s, v0.4h, v4.4h
        usubl2  v0.4s, v0.8h, v4.8h
        abs     v3.4s, v3.4s
        abs     v1.4s, v1.4s
        abs     v2.4s, v2.4s
        abs     v0.4s, v0.4s
        add     v1.4s, v3.4s, v1.4s
        add     v1.4s, v2.4s, v1.4s
        add     v0.4s, v0.4s, v1.4s
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
At first glance, we see that the generated code uses the Advanced SIMD registers, which means the compiler has auto-vectorized the code. This is a great start. However, if you know your AArch64 assembly, you'll notice that the abs instruction is operating on vectors of 32-bit values, yet our input values are 8-bit chars! What's going on? The reason is the implicit promotion rules of the C language, combined with the fact that the abs standard function takes an int rather than a char. So, in reality, the calculation in the simple loop given above is actually equivalent to:
for (int i = 0; i < N; i++)
  {
    int pix1 = (int) img1[i]; /* Zero-extend to 32 bits.  */
    int pix2 = (int) img2[i]; /* Zero-extend to 32 bits.  */
    int diff = pix1 - pix2;   /* 32-bit subtraction.  */
    sum += abs (diff);        /* 32-bit abs and addition.  */
  }
So, even though the inputs are two 128-bit vectors (16x8) that fit neatly into a vector register each, in order to compute the sum of absolute differences we end up widening the input values to a larger type before performing the subtraction, absolute value and accumulation in 32 bits. Can the compiler do better?
As it turns out, yes it can. GCC has pattern-matching capabilities in its auto-vectorizer and can detect this idiom in order to emit something smarter. GCC 9 knows how to emit a much more efficient sequence for the above operation. Check it out:
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        movi    v0.4s, 0
        ldr     q3, [x0]
        ldr     q2, [x1]
        uabdl2  v1.8h, v2.16b, v3.16b
        uabal   v1.8h, v2.8b, v3.8b
        uadalp  v0.4s, v1.8h
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
We've reduced the computation to just 3 Advanced SIMD instructions, but there is a lot going on in those 3 operations! What we're after is eliminating as many explicit extension operations as possible. Thankfully, the Advanced SIMD instruction set offers many operations that combine vectorized arithmetic with the necessary extensions. As you can see in the diagram below, we eliminate all the explicit extension instructions. With the UABDL2, UABAL and UADALP instruction sequence we can calculate the sums of absolute differences between each pair of elements in the input 16-byte vectors, add them up, and widen them to the appropriate 32-bit aggregates, ready for a final reduction with the ADDV instruction.
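To make the sequence concrete, here is a minimal sketch of the same computation written by hand with the arm_neon.h (ACLE) intrinsics that map to these instructions. The function name sad16_neon is chosen for illustration; the auto-vectorizer makes writing this by hand unnecessary.

/* Hand-written illustration of the UABDL2/UABAL/UADALP/ADDV sequence;
   sad16_neon is a hypothetical name, not from the original code.  */
#include <arm_neon.h>
#include <stdint.h>

int
sad16_neon (const uint8_t *img1, const uint8_t *img2)
{
  uint8x16_t a = vld1q_u8 (img1);   /* Load 16 pixels from each image.  */
  uint8x16_t b = vld1q_u8 (img2);
  /* UABDL2: absolute differences of the high 8 bytes, widened to 16 bits.  */
  uint16x8_t d = vabdl_high_u8 (a, b);
  /* UABAL: accumulate the absolute differences of the low 8 bytes.  */
  d = vabal_u8 (d, vget_low_u8 (a), vget_low_u8 (b));
  /* UADALP: pairwise-add the 16-bit sums into 32-bit accumulators.  */
  uint32x4_t acc = vpadalq_u16 (vdupq_n_u32 (0), d);
  /* ADDV: final reduction across the four 32-bit lanes.  */
  return (int) vaddvq_u32 (acc);
}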
Another common operation encountered in filters is computing the average of two images. This can be expressed in C as:
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    a[i] = (b[i] + c[i] + 1) >> 1; /* Average, rounding upwards.  */
}

As with the previous example, appearances can be deceiving. The implicit promotion rules result in more code than you'd expect. GCC 8.1 generates this assembly:
avg:
        ldr     q0, [x1]
        ldr     q4, [x2]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        movi    v5.4s, 0x1
        uaddl   v3.4s, v1.4h, v2.4h
        uaddl2  v1.4s, v1.8h, v2.8h
        uaddl   v2.4s, v0.4h, v4.4h
        uaddl2  v0.4s, v0.8h, v4.8h
        add     v3.4s, v3.4s, v5.4s
        add     v1.4s, v1.4s, v5.4s
        add     v2.4s, v2.4s, v5.4s
        add     v0.4s, v0.4s, v5.4s
        sshr    v3.4s, v3.4s, 1
        sshr    v1.4s, v1.4s, 1
        sshr    v2.4s, v2.4s, 1
        sshr    v0.4s, v0.4s, 1
        xtn     v4.4h, v3.4s
        xtn2    v4.8h, v1.4s
        xtn     v1.4h, v2.4s
        xtn2    v1.8h, v0.4s
        xtn     v0.8b, v4.8h
        xtn2    v0.16b, v1.8h
        str     q0, [x0]
        ret
As you can see, the implicit casts have snuck in again. And now there's an explicit narrowing operation at the end too! Let's make them explicit:
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    {
      int bel = (int) b[i];
      int cel = (int) c[i];
      int sum = bel + cel + 1;
      int int_avg = sum >> 1;
      a[i] = (uint8_t) int_avg;
    }
}
Even though the inputs and outputs to the function are all 8-bit char types, the language forces costly conversions to the wider int type. This is a common problem with code dealing with short data types. In GCC 9 we improved the auto-vectorizer to detect such excessive extensions and limit them to only the precision strictly needed for the computation. Combine that with a bit more pattern detection and GCC 9 can now generate:
avg:
        ldr     q0, [x1]
        ldr     q1, [x2]
        urhadd  v0.16b, v0.16b, v1.16b
        str     q0, [x0]
        ret
The computation is collapsed into a single URHADD instruction that computes the average of each pair of vector elements, rounding upwards. The arithmetic is performed in 9-bit precision, which the compiler has now determined is the minimum precision needed to give the correct result: the largest possible intermediate value, 255 + 255 + 1 = 511, fits in 9 bits. No expensive two-step conversions to 32-bit ints.
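For comparison, the same rounding average written by hand with the arm_neon.h intrinsic for URHADD looks like this. The name avg16_neon is chosen for illustration; with GCC 9 the plain C loop above already compiles to this instruction.

/* Hand-written illustration of URHADD via intrinsics; avg16_neon is a
   hypothetical name, not from the original code.  */
#include <arm_neon.h>
#include <stdint.h>

void
avg16_neon (uint8_t *restrict a, const uint8_t *restrict b,
            const uint8_t *restrict c)
{
  /* URHADD computes (b[i] + c[i] + 1) >> 1 per lane, using the internal
     carry bit so the 9-bit intermediate sum never overflows.  */
  vst1q_u8 (a, vrhaddq_u8 (vld1q_u8 (b), vld1q_u8 (c)));
}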
With the above and other vectorizer improvements, GCC 9-generated code shows an improvement of 17% on the 525.x264_r benchmark from the SPEC CPU 2017 suite over GCC 8.1, as measured on an Arm Cortex-A72 processor. Not bad!
Many scientific workloads depend heavily on standard math functions to perform fundamental operations like exponentiation and logarithms. To that end, we've worked hard to improve the performance of the exp, log and pow double-precision floating-point functions, as well as related functions such as log2 and exp2. Szabolcs Nagy has reimplemented the routines, resulting in lower latency, higher throughput, and simpler, more maintainable code. Good accuracy and performance were achieved through carefully designed table-based methods, optimized special-case handling, and attention to the instruction-level parallelism of modern CPUs.
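Latency and throughput are distinct measurements: latency chains each call on the previous result, while throughput lets independent calls overlap on an out-of-order core. Below is a minimal sketch of how one might measure both for exp; the iteration count and timing approach are illustrative, not the actual glibc benchmark harness.

/* Illustrative latency/throughput micro-benchmark for exp; compile with
   something like: gcc -O2 bench.c -lm.  Not the glibc benchtests.  */
#include <math.h>
#include <stdio.h>
#include <time.h>

#define ITERS 10000000

static double
now_sec (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main (void)
{
  volatile double sink;
  double x = 0.5, acc = 0.0;

  /* Latency: each call depends on the previous result, so calls serialize.  */
  double t0 = now_sec ();
  for (int i = 0; i < ITERS; i++)
    acc = exp (acc * 1e-9 + x);
  double latency = (now_sec () - t0) / ITERS;

  /* Throughput: independent calls, free to overlap in the pipeline.  */
  double t1 = now_sec ();
  for (int i = 0; i < ITERS; i++)
    sink = exp (x + i * 1e-9);
  double throughput = (now_sec () - t1) / ITERS;

  printf ("latency %.2f ns, throughput %.2f ns (acc=%f)\n",
          latency * 1e9, throughput * 1e9, acc);
  return 0;
}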
Wilco Dijkstra rewrote the single-precision trigonometric routines sinf, cosf and sincosf, resulting in significant speedups there as well. The performance improvement on these routines can vary significantly depending on the input value.
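If sincosf is unfamiliar: it computes the sine and cosine of its argument in a single call, which is why DSP and codec kernels favor it over separate sinf and cosf calls. A minimal usage sketch (sincosf is a GNU extension, so _GNU_SOURCE is needed with glibc):

/* Minimal sincosf usage example; link with -lm.  */
#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>

int
main (void)
{
  float s, c;
  sincosf (0.5f, &s, &c); /* One call instead of separate sinf/cosf.  */
  printf ("sin: %f cos: %f\n", s, c);
  return 0;
}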
Improving the math routines in glibc is a large, ongoing theme, with big rewrites such as this and continued incremental improvements. For example, the exp, log, pow, sin and cos functions used to be correctly rounded in the nearest rounding mode, which meant they could spend a huge amount of time doing multi-precision arithmetic to decide which way to round a result that happened to be nearly halfway between two floating-point values. The minuscule gain in accuracy was not worth the huge performance penalty for the vast majority of users of these routines. This was fixed in glibc 2.28, speeding up these routines in the process. Check out the numbers below.
Notice how the throughput of functions such as pow, log and exp is more than 3x that of the previous implementation. With such dramatic improvements in the functions themselves, the effect on full programs is very noticeable. We see gains of about 15% on the 527.cam4_r and 544.nab_r benchmarks from the SPEC CPU 2017 suite, while the 465.tonto benchmark from SPEC CPU 2006 sees a whopping 30% improvement on an Arm Cortex-A57 system.
Szabolcs Nagy presented this work at the GNU Cauldron 2018 in Manchester, UK. The feedback from the glibc community helped get these routines into production-ready shape.
We presented some of the auto-vectorization improvements in the upcoming GCC 9 that help media-encoding workloads, and introduced some of the problems a compiler must solve to make the best use of a SIMD instruction set. We also showed the math-library improvements in the upcoming glibc 2.29 release and the latency and throughput gains we expect. As you can see, it's a very exciting time for the GNU toolchain.
What improvements would you like to see or read about? Let us know in the comments!
Find out more about the impact of GCC: https://www.linux.com/blog/2018/10/gcc-optimizing-linux-internet-and-everything