A lot has happened since the last GNU toolchain performance update. The GCC and glibc communities have been hard at work since then, improving the compiler and the library with new usability, language, debugging, security and a myriad of other features.
The GNU toolchain team here at Arm is deeply involved in both projects, making sure that all these vital components take full advantage of the awesome features in the Arm Architecture and Arm-based processors. I'd like to update you on some of the performance improvements we've been working on. These are queued for the GCC 9 and glibc 2.29 release in 2019. Let's dive in.
Popular media codecs like x264 have many performance-critical loops operating on 8-bit values, usually representing the RGB or alpha channels, or other properties of a pixel. The loops themselves may calculate various aggregates used in a filter, like a sum of absolute differences of adjacent pixels, or their average. Let's have a look at some examples.
A simple, commonly-used way of computing the "difference" between two images is to compute the absolute differences between the corresponding pixels in each image and sum them up.
In C, this can be expressed as:
#include <stdlib.h>

#define N 16

/* The two 16-pixel images.  */
unsigned char img1[N];
unsigned char img2[N];

int
sum_of_absolute_differences (void)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
    sum += abs (img1[i] - img2[i]);
  return sum;
}
Let's compile that with an AArch64-targeted GCC 8.1 compiler at the -O3 level. We get the following assembly:
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        ldr     q0, [x1]
        ldr     q4, [x0]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        usubl   v3.4s, v1.4h, v2.4h
        usubl2  v1.4s, v1.8h, v2.8h
        usubl   v2.4s, v0.4h, v4.4h
        usubl2  v0.4s, v0.8h, v4.8h
        abs     v3.4s, v3.4s
        abs     v1.4s, v1.4s
        abs     v2.4s, v2.4s
        abs     v0.4s, v0.4s
        add     v1.4s, v3.4s, v1.4s
        add     v1.4s, v2.4s, v1.4s
        add     v0.4s, v0.4s, v1.4s
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
At first glance, we see that the generated code uses the Advanced SIMD registers, which means the compiler has auto-vectorized the code. This is a great start. However, if you know your AArch64 assembly, you'll notice that the abs instruction is operating on vectors of 32-bit values, yet our input values are 8-bit chars! What's going on? The reason is the implicit promotion rules of the C language, combined with the fact that the abs standard function takes an int rather than a char. So, in reality, the calculation in the simple loop given above is actually equivalent to:
for (int i = 0; i < N; i++)
  {
    int pix1 = (int) img1[i]; /* Zero-extend to 32 bits.  */
    int pix2 = (int) img2[i]; /* Zero-extend to 32 bits.  */
    int diff = pix1 - pix2;   /* 32-bit subtraction.  */
    sum += abs (diff);        /* 32-bit abs and addition.  */
  }
So, even though the inputs are two 128-bit vectors (16x8) that fit neatly into a vector register each, in order to compute the sum of absolute differences we end up widening the input values to a larger type before performing the subtraction, absolute value and accumulation in 32 bits. Can the compiler do better?
As it turns out, yes it can. GCC has pattern-matching capabilities in its auto-vectorizer and can detect this idiom in order to emit something smarter. GCC 9 knows how to emit a much more efficient sequence for the above operation. Check it out:
sum_of_absolute_differences:
        adrp    x1, img1
        adrp    x0, img2
        add     x1, x1, :lo12:img1
        add     x0, x0, :lo12:img2
        movi    v0.4s, 0
        ldr     q3, [x0]
        ldr     q2, [x1]
        uabdl2  v1.8h, v2.16b, v3.16b
        uabal   v1.8h, v2.8b, v3.8b
        uadalp  v0.4s, v1.8h
        addv    s0, v0.4s
        umov    w0, v0.s[0]
        ret
We've reduced the computation to just 3 Advanced SIMD instructions, but there is a lot going on in those 3 operations! What we're after is eliminating as many explicit extension operations as possible. Thankfully, the Advanced SIMD instruction set offers many operations that combine vectorized arithmetic with the necessary extensions. As you can see in the diagram below, we eliminate all the explicit extension instructions. With the UABDL2, UABAL and UADALP instruction sequence we can calculate the sums of absolute differences between each pair of elements in the input 16-byte vectors, add them up, and widen them to the appropriate 32-bit aggregates, ready for a final reduction with the ADDV instruction.
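To make the sequence concrete, here is a minimal sketch of the same computation written by hand with the arm_neon.h (ACLE) intrinsics that map to these instructions. The function name sad16_neon is chosen for illustration; the auto-vectorizer makes writing this by hand unnecessary.

/* Hand-written illustration of the UABDL2/UABAL/UADALP/ADDV sequence;
   sad16_neon is a hypothetical name, not from the original code.  */
#include <arm_neon.h>
#include <stdint.h>

int
sad16_neon (const uint8_t *img1, const uint8_t *img2)
{
  uint8x16_t a = vld1q_u8 (img1);   /* Load 16 pixels from each image.  */
  uint8x16_t b = vld1q_u8 (img2);
  /* UABDL2: absolute differences of the high 8 bytes, widened to 16 bits.  */
  uint16x8_t d = vabdl_high_u8 (a, b);
  /* UABAL: accumulate the absolute differences of the low 8 bytes.  */
  d = vabal_u8 (d, vget_low_u8 (a), vget_low_u8 (b));
  /* UADALP: pairwise-add the 16-bit sums into 32-bit accumulators.  */
  uint32x4_t acc = vpadalq_u16 (vdupq_n_u32 (0), d);
  /* ADDV: final reduction across the four 32-bit lanes.  */
  return (int) vaddvq_u32 (acc);
}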
Another common operation encountered in filters is computing the average of two images. This can be expressed in C as:
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    a[i] = (b[i] + c[i] + 1) >> 1; /* Average, rounding upwards.  */
}

As with the previous example, appearances can be deceiving. The implicit promotion rules result in more code than you'd expect. GCC 8.1 generates this assembly:
avg:
        ldr     q0, [x1]
        ldr     q4, [x2]
        ushll   v1.8h, v0.8b, 0
        ushll2  v0.8h, v0.16b, 0
        ushll   v2.8h, v4.8b, 0
        ushll2  v4.8h, v4.16b, 0
        movi    v5.4s, 0x1
        uaddl   v3.4s, v1.4h, v2.4h
        uaddl2  v1.4s, v1.8h, v2.8h
        uaddl   v2.4s, v0.4h, v4.4h
        uaddl2  v0.4s, v0.8h, v4.8h
        add     v3.4s, v3.4s, v5.4s
        add     v1.4s, v1.4s, v5.4s
        add     v2.4s, v2.4s, v5.4s
        add     v0.4s, v0.4s, v5.4s
        sshr    v3.4s, v3.4s, 1
        sshr    v1.4s, v1.4s, 1
        sshr    v2.4s, v2.4s, 1
        sshr    v0.4s, v0.4s, 1
        xtn     v4.4h, v3.4s
        xtn2    v4.8h, v1.4s
        xtn     v1.4h, v2.4s
        xtn2    v1.8h, v0.4s
        xtn     v0.8b, v4.8h
        xtn2    v0.16b, v1.8h
        str     q0, [x0]
        ret
As you can see, the implicit casts have snuck in again. And now there's an explicit narrowing operation at the end too! Let's make them explicit:
#include <stdint.h>

void
avg (uint8_t *restrict a, uint8_t *restrict b, uint8_t *restrict c)
{
  for (int i = 0; i < 16; ++i)
    {
      int bel = (int) b[i];
      int cel = (int) c[i];
      int sum = bel + cel + 1;
      int int_avg = sum >> 1;
      a[i] = (uint8_t) int_avg;
    }
}
Even though the inputs and outputs to the function are all 8-bit char types, the language forces costly conversions to the wider int type. This is a common problem with code dealing with short data types. In GCC 9 we improved the auto-vectorizer to detect such excessive extensions and limit them to only the precision strictly needed for the computation. Combine that with a bit more pattern detection and GCC 9 can now generate:
avg:
        ldr     q0, [x1]
        ldr     q1, [x2]
        urhadd  v0.16b, v0.16b, v1.16b
        str     q0, [x0]
        ret
The computation is collapsed into a single URHADD instruction that computes the average of each pair of vector elements, rounding upwards. The arithmetic is performed in 9-bit precision, which the compiler has now determined is the minimum precision needed to give the correct result: the largest possible intermediate value, 255 + 255 + 1 = 511, fits in 9 bits. No expensive two-step conversions to 32-bit ints.
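For comparison, the same rounding average written by hand with the arm_neon.h intrinsic for URHADD looks like this. The name avg16_neon is chosen for illustration; with GCC 9 the plain C loop above already compiles to this instruction.

/* Hand-written illustration of URHADD via intrinsics; avg16_neon is a
   hypothetical name, not from the original code.  */
#include <arm_neon.h>
#include <stdint.h>

void
avg16_neon (uint8_t *restrict a, const uint8_t *restrict b,
            const uint8_t *restrict c)
{
  /* URHADD computes (b[i] + c[i] + 1) >> 1 per lane, using the internal
     carry bit so the 9-bit intermediate sum never overflows.  */
  vst1q_u8 (a, vrhaddq_u8 (vld1q_u8 (b), vld1q_u8 (c)));
}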
With the above and other vectorizer improvements, GCC 9-generated code shows an improvement of 17% on the 525.x264_r benchmark from the SPEC CPU 2017 suite over GCC 8.1, as measured on an Arm Cortex-A72 processor. Not bad!
Many scientific workloads depend heavily on standard math functions to perform fundamental operations like exponentiation and logarithms. To that end, we've worked hard to improve the performance of the exp, log and pow double-precision floating-point functions, as well as related functions such as log2 and exp2. Szabolcs Nagy has reimplemented the routines, resulting in lower latency, higher throughput, and simpler, more maintainable code. Good accuracy and performance were achieved through carefully designed table-based methods, optimized special-case handling, and attention to the instruction-level parallelism of modern CPUs.
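Latency and throughput are distinct measurements: latency chains each call on the previous result, while throughput lets independent calls overlap on an out-of-order core. Below is a minimal sketch of how one might measure both for exp; the iteration count and timing approach are illustrative, not the actual glibc benchmark harness.

/* Illustrative latency/throughput micro-benchmark for exp; compile with
   something like: gcc -O2 bench.c -lm.  Not the glibc benchtests.  */
#include <math.h>
#include <stdio.h>
#include <time.h>

#define ITERS 10000000

static double
now_sec (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main (void)
{
  volatile double sink;
  double x = 0.5, acc = 0.0;

  /* Latency: each call depends on the previous result, so calls serialize.  */
  double t0 = now_sec ();
  for (int i = 0; i < ITERS; i++)
    acc = exp (acc * 1e-9 + x);
  double latency = (now_sec () - t0) / ITERS;

  /* Throughput: independent calls, free to overlap in the pipeline.  */
  double t1 = now_sec ();
  for (int i = 0; i < ITERS; i++)
    sink = exp (x + i * 1e-9);
  double throughput = (now_sec () - t1) / ITERS;

  printf ("latency %.2f ns, throughput %.2f ns (acc=%f)\n",
          latency * 1e9, throughput * 1e9, acc);
  return 0;
}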
Wilco Dijkstra rewrote the single-precision trigonometric routines sinf, cosf and sincosf, resulting in significant speedups there as well. The performance improvement on these routines can vary significantly depending on the input value.
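If sincosf is unfamiliar: it computes the sine and cosine of its argument in a single call, which is why DSP and codec kernels favor it over separate sinf and cosf calls. A minimal usage sketch (sincosf is a GNU extension, so _GNU_SOURCE is needed with glibc):

/* Minimal sincosf usage example; link with -lm.  */
#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>

int
main (void)
{
  float s, c;
  sincosf (0.5f, &s, &c); /* One call instead of separate sinf/cosf.  */
  printf ("sin: %f cos: %f\n", s, c);
  return 0;
}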
Improving the math routines in glibc is a large, ongoing theme, with big rewrites such as this and continued incremental improvements. For example, the exp, log, pow, sin and cos functions used to be correctly rounded in the nearest rounding mode, which meant they could spend a huge amount of time doing multi-precision arithmetic to decide which way to round a result that happened to be nearly halfway between two floating-point values. The minuscule gain in accuracy was not worth the huge performance penalty for the vast majority of users of these routines. This was fixed in glibc 2.28, speeding up these routines in the process. Check out the numbers below.
Notice how the throughput of functions such as pow, log and exp is more than 3x that of the previous implementation. With such dramatic improvements in the functions themselves, the effect on full programs is very noticeable. We see gains of about 15% on the 527.cam4_r and 544.nab_r benchmarks from the SPEC CPU 2017 suite, while the 465.tonto benchmark from SPEC CPU 2006 sees a whopping 30% improvement on an Arm Cortex-A57 system.
Szabolcs Nagy presented this work at the GNU Cauldron 2018 in Manchester, UK. The feedback from the glibc community helped get these routines into production-ready shape.
We presented some of the auto-vectorization improvements in the upcoming GCC 9 that help media-encoding workloads, and introduced some of the problems a compiler must solve to make the best use of a SIMD instruction set. We also showed the math-library improvements in the upcoming glibc 2.29 release and the latency and throughput gains we expect. As you can see, it's a very exciting time for the GNU toolchain.
What improvements would you like to see or read about? Let us know in the comments!
Find out more about the impact of GCC: https://www.linux.com/blog/2018/10/gcc-optimizing-linux-internet-and-everything