Welcome back. This year, as with those before it, the new GCC release comes with a slew of performance improvements. As always, Arm has worked together with the GCC community to cook up some nice things: some general, some Arm specific. This year we have focused on enabling the use of more Arm instructions in the vectorizer and on improving our code generation for intrinsics, so that heavily optimized applications can use them with confidence. Read on for all the goodies in this year's GCC 11 bag.
Optimizing compilers like GCC try to avoid emitting library calls whenever it is faster to handle the behavior of the call inline. These built-in functions, as GCC calls them, include well-known ISO C90 functions such as memset and memcpy. When a call to either of these two functions is made with a constant size, the compiler may choose to emit the equivalent instructions directly in the assembly instead. This is a trade-off between code size and performance.
As an example, the following function:
void fc (char *restrict a, char *restrict b)
{
  memcpy (a, b, 48);
}
Is inlined by the compiler and, instead of a call to memcpy, GCC 10 generates:
fc:
        ldp     x2, x3, [x1]
        stp     x2, x3, [x0]
        ldp     x2, x3, [x1, 16]
        stp     x2, x3, [x0, 16]
        ldp     x2, x3, [x1, 32]
        stp     x2, x3, [x0, 32]
        ret
Which performs the memcpy directly using loads and stores. However, this is not the optimal form, as it uses integer registers, which have a maximum size of 64 bits, to do the copy. If we were to use SIMD registers, we could instead copy 128 bits with a single instruction, or 256 bits with a single pair instruction.
GCC 11 now favors using SIMD registers for doing inlined memset and memcpy whenever the amount to be copied or set is smaller than 256 bytes (unless optimizing for code size, in which case the limit is set to 128 bytes).
For the example above, GCC 11 now generates:
fc:
        ldp     q1, q2, [x1]
        ldr     q0, [x1, 32]
        stp     q1, q2, [x0]
        str     q0, [x0, 32]
        ret
It is worth noting that even though you may not have explicitly written memcpy or memset, the compiler may still recognize the sequence as an idiom during optimization. As an example, at -O2 the compiler will automatically recognize a loop such as:
void f (char *restrict a, char *restrict b)
{
  for (int x = 0; x < 48; x++)
    a[x] = b[x];
}
As a memcpy between a and b.
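memset benefits in the same way. As a minimal sketch (the function name fs is ours, and the exact instruction sequence depends on the size and on tuning options), a small constant-size clear such as the following can also be expanded inline using SIMD stores rather than pairs of integer registers:

#include <string.h>

/* Sketch: a 64-byte clear that GCC can expand inline; with GCC 11 the
   expansion may use SIMD stores (for example stp of q registers)
   instead of pairs of integer registers.  */
void fs (char *restrict a)
{
  memset (a, 0, 64);
}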
AArch64 does not have conditional stores as part of the ISA; however, we can construct one by using a conditional select (csel) followed by an unconditional store.
This allows us to remove more branches from the output.
Consider the example:
unsigned test (unsigned k, unsigned b)
{
  unsigned a[2];   /* intentionally left uninitialized */
  if (b < a[k])
    a[k] = b;
  return a[0] + a[2];
}
Where the store to a[k] depends on whether the conditional is entered or not. The array a is intentionally left uninitialized in this example to keep the output easier to follow.
As such, GCC 10 generated:
test:
        sub     sp, sp, #16
        uxtw    x0, w0
        add     x2, sp, 8
        ldr     w3, [x2, x0, lsl 2]
        cmp     w3, w1
        bls     .L6
        str     w1, [x2, x0, lsl 2]
.L6:
        ldr     w1, [sp, 8]
        ldr     w0, [sp, 16]
        add     sp, sp, 16
        add     w0, w1, w0
        ret
Where it compares w3 with w1 to determine whether it needs to perform the store, branching over it when it does not.
However, notice that the location being stored to, a[k], is also read by a load, and we know that location must be writable: the array a lives on the stack, and the stack cannot be read-only. Because of these two conditions, we know it is safe to store to a[k] unconditionally, so the branch is converted into a conditional select between the original value of a[k] and the new value b:
test:
        sub     sp, sp, #16
        uxtw    x0, w0
        add     x3, sp, 8
        ldr     w2, [x3, x0, lsl 2]
        cmp     w2, w1
        csel    w2, w2, w1, ls
        str     w2, [x3, x0, lsl 2]
        ldr     w1, [sp, 8]
        ldr     w0, [sp, 16]
        add     sp, sp, 16
        add     w0, w1, w0
        ret
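At the source level, the if-conversion is roughly equivalent to the following sketch (the function cond_store is illustrative and not taken from the compiler):

/* Illustrative only: the store is now unconditional and a conditional
   select chooses which value gets written back, mapping to csel + str.  */
void cond_store (unsigned *p, unsigned b)
{
  unsigned old = *p;
  *p = (b < old) ? b : old;
}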
AArch64 has various SIMD instructions that perform an arithmetic operation together with a widening or narrowing in a single instruction. By using these instructions, we can reduce the number of widening operations emitted during vectorization. Some were already supported (such as widening multiply), but plenty more, such as widening shifts, were not.
Consider the following example:
void wide1 (char *restrict a, short *restrict b, int n)
{
  for (int x = 0; x < 16; x++)
    b[x] = a[x] << 8;
}
GCC 10 generated:
wide1:
        ldr     q0, [x0]
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        shl     v1.8h, v1.8h, 8
        shl     v0.8h, v0.8h, 8
        stp     q1, q0, [x1]
        ret
Where we emit explicit zero extensions. This has a secondary effect: it also increases the cost of vectorization. When weighing the cost of the vector versus scalar versions of the function, if there are enough of these superfluous extensions the vectorizer decides it is too expensive to vectorize. Adding support for these instructions therefore results in more, and better, vectorization.
In GCC 11 we now generate:
wide1:
        ldr     q0, [x0]
        shll    v1.8h, v0.8b, 8
        shll2   v0.8h, v0.16b, 8
        stp     q1, q0, [x1]
        ret
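The narrowing direction mentioned above works along the same lines. As a sketch (the function name narrow1 is ours, and whether a single shift-right-narrow instruction is actually used depends on the GCC version and options), a loop like this is a candidate for such combined operations:

/* Sketch of a narrowing shift: each 16-bit element is shifted right and
   truncated to 8 bits, which AArch64 can express as a single
   shift-right-narrow (shrn) instruction.  */
void narrow1 (short *restrict a, char *restrict b)
{
  for (int x = 0; x < 16; x++)
    b[x] = a[x] >> 8;
}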
Many of the intrinsics in the arm_neon.h header were defined using simple inline assembly. The problem with inline assembly is that it is completely opaque to the compiler: the compiler has no idea about the operational semantics of the instruction, does not know what kind of instruction it is, and so cannot do any kind of instruction scheduling.
As an example, if vadd_f32 and vmul_f32 were defined as inline assembly, the compiler would never be able to combine them into vfma_f32. It would also always emit them grouped exactly as specified in the source. Lastly, when the instruction is an accumulating one, this often causes the register allocator to generate unneeded register copies.
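To make that concrete, here is a minimal sketch (the function name madd is ours; whether the contraction actually happens depends on options such as -ffp-contract or -Ofast):

#include <arm_neon.h>

/* With the intrinsics modeled as proper operations rather than opaque
   inline assembly, the compiler is free to contract the separate
   multiply and add into a single fused multiply-add (fmla) when
   floating-point contraction is permitted.  */
float32x2_t madd (float32x2_t acc, float32x2_t b, float32x2_t c)
{
  return vadd_f32 (acc, vmul_f32 (b, c));
}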
These limitations caused AArch64 intrinsics code generated by GCC to be larger and slower than that produced by other AArch64 compilers. Consider this simple example:
int32x4_t intrinsics (int32x4_t acc, int16x4_t b, int16x4_t c)
{
  return vmlal_n_s16 (acc, b, c[3]);
}
Where GCC 10 would generate:
intrinsics:
        dup     h2, v2.h[3]
        smlal   v0.4s, v1.4h, v2.h[0]
        ret
But GCC 11 happily generates:
intrinsics:
        smlal   v0.4s, v1.4h, v2.h[3]
        ret
Many intrinsics have been converted and semantically described in GCC 11, with the rest to be done in GCC 12. An example of the impact of this work can be seen in the popular libjpeg-turbo library.
To get started with optimizing your applications using intrinsics, do check out the resources Arm makes available for this:
View Neon app developers resources
Visit the Neon intrinsics search engine
Instructions performing complex arithmetic exist in Armv8.3-a (AArch32 and AArch64), Armv8-R AArch64, SVE, SVE2, and MVE, covering the spread of Arm architecture profiles. As a basic example, consider the complex addition where the second operand is rotated by 90° around the Argand plane:
#include <complex.h>

void cadd (complex float *restrict a, complex float *restrict b,
           complex float *restrict c, int n)
{
  for (int x = 0; x < n; x++)
    c[x] = a[x] + (b[x] * I);
}
Which for Neon on AArch64 generates:
cadd:
        … loop prologue …
.L14:
        ld2     {v2.4s - v3.4s}, [x4], 32
        ld2     {v4.4s - v5.4s}, [x6], 32
        fsub    v0.4s, v2.4s, v5.4s
        fadd    v1.4s, v4.4s, v3.4s
        st2     {v0.4s - v1.4s}, [x5], 32
        cmp     x7, x4
        bne     .L14
        and     w6, w3, -4
        tst     x3, 3
        beq     .L11
.L13:
        … scalar loop end …
.L11:
        ret
.L16:
        mov     w6, 0
        b       .L13
Essentially, to vectorize this the compiler must use de-interleaving structure loads (ld2) to separate the imaginary and real components of the numbers, perform the desired computation, and then interleave the values again (st2) to store them in c. This loop also has a higher minimum element count before the vector code is hit: it requires at least 8 complex floats for the vectorized loop to be entered. For GCC 11 with -Ofast -march=armv8.3-a the compiler now generates:
cadd:
        … loop prologue …
.L13:
        ldr     q0, [x0, x4]
        ldr     q1, [x1, x4]
        fcadd   v0.4s, v0.4s, v1.4s, #90
        str     q0, [x2, x4]
        add     x4, x4, 16
        cmp     x4, x5
        bne     .L13
        and     w4, w3, -2
        tbz     x3, 0, .L10
.L12:
        uxtw    x3, w4
        ldr     d0, [x0, x3, lsl 3]
        ldr     d1, [x1, x3, lsl 3]
        fcadd   v0.2s, v0.2s, v1.2s, #90
        str     d0, [x2, x3, lsl 3]
.L10:
        ret
.L15:
        mov     w4, 0
        b       .L12
This loop has a couple of advantages. It no longer needs the de-interleaving ld2 and st2 structure operations; plain loads and stores are used together with fcadd instead. The vector loop can also be entered with far fewer elements, and a single leftover complex float is handled by a 64-bit fcadd in the tail rather than by scalar code.
GCC 11 recognizes complex addition (with rotations of 90° and 270°), complex multiplication, complex fused multiply-add, and complex fused multiply-subtract sequences.
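As a sketch of the multiply-accumulate case (the function name cmla is ours; with -Ofast -march=armv8.3-a a loop of this shape can map to the fcmla instructions, though the exact output depends on types and options):

#include <complex.h>

/* A complex multiply-accumulate loop of the kind the new pattern
   matching targets: c[x] += a[x] * b[x] over complex floats.  */
void cmla (complex float *restrict a, complex float *restrict b,
           complex float *restrict c, int n)
{
  for (int x = 0; x < n; x++)
    c[x] += a[x] * b[x];
}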
The implementation in the compiler does not rely on the elements being complex numbers; the matching is performed based on the shape of the computation rather than on the complex type. As an example, the following two functions are both recognized as complex additions:
void cadd_pat (float *restrict a, float *restrict b,
               float *restrict c, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      c[i]   = a[i]   - b[i+1];
      c[i+1] = a[i+1] + b[i];
    }
}

void cadd_pat_unrolled (float *restrict a, float *restrict b,
                        float *restrict c, int n)
{
  for (int i = 0; i < n; i += 4)
    {
      c[i]   = a[i]   - b[i+1];
      c[i+1] = a[i+1] + b[i];
      c[i+2] = a[i+2] - b[i+3];
      c[i+3] = a[i+3] + b[i+2];
    }
}
Because of this, the implementation is not restricted to just floating point but can also handle integer versions of these sequences. As an example:
void cadd_int (int *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      c[i]   = a[i]   - b[i+1];
      c[i+1] = a[i+1] + b[i];
    }
}
Generates with -O3 -march=armv8.2-a+sve2:
cadd_int:
        cmp     w3, 0
        ble     .L28
        sub     w4, w3, #1
        mov     x3, 0
        lsr     w4, w4, 1
        add     w4, w4, 1
        lsl     x4, x4, 1
        whilelo p0.s, xzr, x4
.L30:
        ld1w    z0.s, p0/z, [x0, x3, lsl 2]
        ld1w    z1.s, p0/z, [x1, x3, lsl 2]
        cadd    z0.s, z0.s, z1.s, #90
        st1w    z0.s, p0, [x2, x3, lsl 2]
        incw    x3
        whilelo p0.s, x3, x4
        b.any   .L30
.L28:
        ret
While the floating-point cases are more common, these integer versions are used quite often in video and image processing. One particularly common use case is motion estimation in video codecs when implementing the sum of absolute transformed differences (SATD).
This year's GCC has continued the focus on real-world user performance issues and on optimizing for our recently announced CPUs.
But this is not the end: in GCC 12, expect more intrinsics improvements along with more SVE optimizations. Speaking of SVE, stay tuned for a blog post on the exciting SVE-specific changes in GCC 11.
In the meantime, be sure to check out the performance changes that were made in GCC 10:
GCC 10 better and faster than ever