## Welcome back!

A lot has happened since the last GNU Toolchain performance update. The **GCC** and **glibc** communities have been hard at work since, improving the compiler and the library with new usability, language, debugging, security and a myriad of other features.

The GNU toolchain team here at Arm is deeply involved in both projects, making sure that all these vital components take full advantage of the awesome features in the Arm Architecture and Arm-based processors. I'd like to update you on some of the performance improvements we've been working on. These are queued for the **GCC 9** and **glibc 2.29** release in 2019. Let's dive in.

## GCC auto-vectorization improvements

Popular media codecs like x264 have many performance-critical loops operating on 8-bit values, usually representing the RGB, alpha channels, or other properties of a pixel. The loops themselves may calculate various aggregates used in a filter, like a sum of absolute differences of adjacent pixels, or their average. Let's have a look at some examples.

### Sum of absolute differences

A simple, commonly-used way of computing the "difference" between two images is to compute the absolute differences between the corresponding pixels in each image and sum them up.

In C, this can be expressed as:

```c
#include <stdlib.h>

#define N 16

/* The two 16-pixel images.  */
unsigned char img1[N];
unsigned char img2[N];

int
sum_of_absolute_differences (void)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
    sum += abs (img1[i] - img2[i]);
  return sum;
}
```

Let's compile that with an AArch64-targeted **GCC 8.1** compiler at the `-O3` optimization level. We get the following assembly:

```asm
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        ldr     q0, [x1]
        ldr     q4, [x0]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        usubl   v3.4s, v1.4h, v2.4h
        usubl2  v1.4s, v1.8h, v2.8h
        usubl   v2.4s, v0.4h, v4.4h
        usubl2  v0.4s, v0.8h, v4.8h
        abs     v3.4s, v3.4s
        abs     v1.4s, v1.4s
        abs     v2.4s, v2.4s
        abs     v0.4s, v0.4s
        add     v1.4s, v3.4s, v1.4s
        add     v1.4s, v2.4s, v1.4s
        add     v0.4s, v0.4s, v1.4s
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
```

At first glance, we see that the generated code uses the Advanced SIMD registers, which means the compiler has auto-vectorized the code. This is a great start. However, if you know your AArch64 assembly, you'll notice that the `abs` instruction is operating on vectors of 32-bit values. But our input values are 8-bit chars! What's going on? The reason is the implicit promotion rules of the C language. Additionally, the `abs` standard function takes an `int`, rather than a `char`. So, in reality, the calculation in the simple loop given above is equivalent to:

```c
for (int i = 0; i < N; i++)
  {
    int pix1 = (int) img1[i];  /* zero-extend to 32 bits */
    int pix2 = (int) img2[i];  /* zero-extend to 32 bits */
    int diff = pix1 - pix2;    /* 32-bit subtraction */
    sum += abs (diff);         /* 32-bit abs and addition */
  }
```

So, even though the inputs are two 128-bit vectors (16x8) that fit neatly into each vector register, in order to compute the sum of absolute differences we end up sign-extending the input values to a wider type before performing the subtraction, absolute value and accumulation on the wider type.

#### Can we do better?

As it turns out, yes we can. **GCC** has pattern-matching capabilities in its auto-vectorizer and can detect this pattern in order to emit something smarter. **GCC 9** knows how to emit a much more efficient sequence for the above operation. Check it out:

```asm
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        movi    v0.4s, 0
        ldr     q3, [x0]
        ldr     q2, [x1]
        uabdl2  v1.8h, v2.16b, v3.16b
        uabal   v1.8h, v2.8b, v3.8b
        uadalp  v0.4s, v1.8h
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
```

We've reduced the computation to just 3 Advanced SIMD instructions, but there is a lot going on in those 3 operations! What we're after is eliminating as many explicit extension operations as possible. Thankfully, the Advanced SIMD instruction set offers many operations that combine vectorized arithmetic with the necessary extensions, letting us eliminate all the explicit extension instructions. With the UABDL2, UABAL and UADALP instruction sequence we calculate the absolute differences between each pair of elements in the 16-byte input vectors, accumulate them, and widen them into 32-bit partial sums, ready for a final reduction with the ADDV instruction.

### Vectorized average

Another common operation encountered in filters is computing the average of two images. This can be expressed in C as:

```c
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    a[i] = (b[i] + c[i] + 1) >> 1;  /* Average rounding upwards */
}
```

As with the previous example, appearances can be deceiving. The implicit promotion rules result in more code than you'd expect. **GCC 8.1** generates this assembly:

```asm
avg:
        ldr     q0, [x1]
        ldr     q4, [x2]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        movi    v5.4s, 0x1
        uaddl   v3.4s, v1.4h, v2.4h
        uaddl2  v1.4s, v1.8h, v2.8h
        uaddl   v2.4s, v0.4h, v4.4h
        uaddl2  v0.4s, v0.8h, v4.8h
        add     v3.4s, v3.4s, v5.4s
        add     v1.4s, v1.4s, v5.4s
        add     v2.4s, v2.4s, v5.4s
        add     v0.4s, v0.4s, v5.4s
        sshr    v3.4s, v3.4s, 1
        sshr    v1.4s, v1.4s, 1
        sshr    v2.4s, v2.4s, 1
        sshr    v0.4s, v0.4s, 1
        xtn     v4.4h, v3.4s
        xtn2    v4.8h, v1.4s
        xtn     v1.4h, v2.4s
        xtn2    v1.8h, v0.4s
        xtn     v0.8b, v4.8h
        xtn2    v0.16b, v1.8h
        str     q0, [x0]
        ret
```

As you can see, the implicit casts have snuck in again. And now there's an explicit narrowing operation at the end too! Let's make them explicit:

```c
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    {
      int bel = (int) b[i];
      int cel = (int) c[i];
      int sum = bel + cel + 1;
      int int_avg = sum >> 1;
      a[i] = (uint8_t) int_avg;
    }
}
```

Even though the inputs and outputs to the function are all 8-bit types, the language forces costly conversions to the wider `int` type. This is a common problem in code dealing with short data types. In **GCC 9** we improved the auto-vectorizer to detect such excessive extensions and limit them to only the precision strictly needed for the computation. Combine that with a bit more pattern detection and **GCC 9** can now generate:

```asm
avg:
        ldr     q0, [x1]
        ldr     q1, [x2]
        urhadd  v0.16b, v0.16b, v1.16b
        str     q0, [x0]
        ret
```

The computation is collapsed into a single URHADD instruction that computes the average of each pair of vector elements, rounding upwards. The computation is performed in 9-bit precision, which the compiler has now determined is the minimum precision needed to give the correct result. No expensive two-step conversions to 32-bit `int`s.

### Performance impact

With the above, and other vectorizer improvements, the GCC 9-generated code shows an improvement of **17%** on the 525.x264_r benchmark from the SPEC CPU 2017 suite over **GCC 8.1**, as measured on an Arm Cortex-A72 processor. Not bad!

## Math library improvements

Many scientific workloads depend heavily on standard math functions to perform fundamental operations like exponentiation and logarithms. To that end, we've worked hard to improve the performance of the `exp`, `log` and `pow` double-precision floating-point functions, as well as derived functions such as `log2` and `exp2`. Szabolcs Nagy has reimplemented these routines, resulting in lower latency, higher throughput, and simpler, more maintainable code. Good accuracy and performance were achieved through carefully designed table-based methods, optimized special-case handling, and attention to the instruction-level parallelism of modern CPUs.

Wilco Dijkstra rewrote the single-precision trigonometric routines `sinf`, `cosf` and `sincosf`, resulting in significant speedups there as well. The performance improvement on these routines can vary significantly depending on the input value.

Improving the math routines in **glibc** is a large ongoing theme, with big rewrites such as this alongside continued incremental improvements. For example, the `exp`, `log`, `pow`, `sin` and `cos` functions used to be correctly rounded in nearest rounding mode, which meant they could spend a huge amount of time doing multi-precision arithmetic to decide which way to round a result that happened to fall near the halfway point between two floating-point values. The minuscule gain in accuracy was not worth the huge performance penalty for the vast majority of users of these routines. This was fixed in **glibc 2.28**, speeding up these routines in the process. Check out the numbers below.

Notice how the throughput of functions such as `pow`, `log` and `exp` is more than **3x** that of the previous implementation. With such dramatic improvements in the functions themselves, the effect on full programs is very noticeable. We see gains of about **15%** on the 527.cam4_r and 544.nab_r benchmarks from the SPEC CPU 2017 suite, whereas the 465.tonto benchmark from SPEC CPU 2006 sees a whopping **30%** improvement on an Arm Cortex-A57 system.

Szabolcs Nagy presented this work at the GNU Tools Cauldron 2018 in Manchester, UK. The feedback from the glibc community helped get these routines into production-ready shape.

## Conclusion

We presented some of the auto-vectorization improvements in the upcoming **GCC 9** that help media-encoding workloads, and introduced some of the problems a compiler must solve to make the best of a SIMD instruction set. We also showed the math library improvements in the upcoming **glibc 2.29** release and the latency and throughput improvements we expect. As you can see, it's a very exciting time for the GNU toolchain.

What improvements would you like to see or read about? Let us know in the comments!