The GNU Compiler Collection is used to program a rich variety of systems: from the fastest of supercomputers to the tiniest of micro-controllers. We at Arm love ecosystems. The recently released GCC 10.1 is the culmination of a year of hard work from the GCC community. And the Arm partnership has played its part. This blog gives you an insight into some of the new Arm-related features we are most excited about new CPU support, architecture support, portable software deployment aids, and performance optimizations. For a deeper dive, do check out the official release notes.
More than 25000 lines of code were added in the past year to implement the SVE ACLE. These statistics are now easy to gather since the GCC project moved to using git for version control in 2020. We are proud to announce that GCC 10.1 fully supports the Arm C Language Extensions for SVE. This gives you access to more than 4000 intrinsics to use any of SVE's many advanced features in your kernels. For example let's try to compile a slightly modified example from the Arm Scalable Vector Extensions and application to Machine Learning whitepaper:
#error "Must use SVE for this example!"
#endif /* __ARM_FEATURE_SVE */
vla_add_arrays (double *dst, double *src, double c, long N)
for (long i = 0; i < N; i += svcntd ())
svbool_t Pg = svwhilelt_b64 (i, N);
svfloat64_t vsrc = svld1 (Pg, &src[i]);
svfloat64_t vdst = svadd_x (Pg, vsrc, c);
svst1 (Pg, &dst[i], vdst);
Compiling it with an AArch64 GCC 10.1:
$ aarch64-none-linux-gnu-gcc -O2 -march=armv8.2-a+sve example.c
Compiles without a problem and gives us the SVE assembly:
cmp x2, 0
mov x3, 0
mov z1.d, d0
whilelt p0.d, x3, x2
ld1d z0.d, p0/z, [x1, x3, lsl 3]
fadd z0.d, p0/m, z0.d, z1.d
st1d z0.d, p0, [x0, x3, lsl 3]
cmp x2, x3
You can see here the use of the SVE per-lane predication feature to vectorize the loop. This avoids the scalar epilogues and fallbacks that would otherwise be necessary with traditional SIMD instruction sets.
Preparing for the deployment of the Future Architecture Technologies, GCC 10.1 provides support for the SVE2 ACLE intrinsics as well. Compilation for SVE2 can be enabled with the +sve2 extension to the -march and -mcpu options. For example:
$ aarch64-none-linux-gnu -march=armv8.5-a+sve2
Being a SIMD ISA, SVE is a great target for compiler auto-vectorization. Though the job of an optimizing compiler is never finished, GCC 10.1 has improved auto-vectorization capabilities when targeting SVE. Look out for more details coming soon.
The Armv8.1-A architecture introduced the Large System Extensions (LSE). These include instructions to perform commonly used operations like compare-and-swap (CAS) and atomic load and increment (LDADD). They can be used to efficiently map high-level language constructs like __atomic_compare_exchange and __atomic_fetch_add down to instruction sequences respecting the Arm memory model. These instructions can be vital for getting the best performance scaling in large core count systems. Indeed, GCC will use the LSE instructions automatically when compiling for -march=armv8.1-a or higher rather than using a load-exclusive, operation, store-exclusive loop as when compiling for -march=armv8-a.
GNU/Linux distributions compile for a baseline -march=armv8-a architecture to ensure it runs correctly on every AArch64 implementation out there. But they still want to take advantage of LSE instructions when they are available. To that end Richard Henderson of Linaro contributed into GCC 10.1 the -moutline-atomics option, which is on by default in GCC 10.1. When compiling for an Armv8-A baseline with this option the compiler will generate a stub calling a runtime helper function rather than emitting a load-exclusive-store-exclusive loop. The helper function performs a runtime check of the availability of LSE instructions through the HWCAP mechanism (caching the outcome for faster subsequent checks). It then dispatches to an LSE instruction sequence if available, or to a load-exclusive-store-exclusive loop. If this sounds complicated, here is a simple example in C utilizing a language-level atomic construct:
test_cas_atomic_int (int *val, int *foo, int *bar)
return __atomic_compare_exchange_n (val, foo, bar, 0, 0, 0);
Compiled with -march=armv8-a -O2 GCC 9 generates a load-exclusive-store-exclusive loop:
ldr w3, [x1]
ldxr w4, [x0]
cmp w4, w3
stxr w5, w2, [x0]
cbnz w5, .L4
cset w0, eq
str w4, [x1]
where you see the loop that retries the exclusive stores with STXR until it succeeds atomically. Compiled with -O2 -march=armv8.1-a (which has an implicit +lse) it generates:
ldr w4, [x1]
mov w3, w4
cas w3, w2, [x0]
cmp w3, w4
cset w0, eq
str w3, [x1]
where you see a simpler sequence utilising the CAS instruction from LSE. Now, with GCC 10.1 we for the options -O2 -march=armv8-a we get:
stp x29, x30, [sp, -32]!
mov x29, sp
stp x19, x20, [sp, 16]
mov x19, x1
mov w1, w2
mov x2, x0
ldr w20, [x19]
mov w0, w20
cmp w0, w20
mov w1, w0
cset w0, eq
str w1, [x19]
ldp x19, x20, [sp, 16]
ldp x29, x30, [sp], 32
There is some complexity going on here related to preparing the arguments for a function call to __aarch64_cas4_relax which is a helper function provided by the libgcc runtime library. In there, the runtime can test for the presence of LSE instructions and dispatch to either of the two previous sequences. This indirection allows this function to run correctly on all AArch64 systems, even Armv8-A ones, while still using the LSE instructions from Armv8.1-A where possible. Various members in the Arm ecosystem have measured the performance impact of this indirection on a diverse set of systems and we were happy to find out that it was minimal compared to the benefit of using the LSE instructions for better scalability at large core counts.
The Armv8.6-A architecture update introduced a number of innovations to accelerate Machine Learning workloads. These include instructions for general matrix multiplication (GEMM) and the bfloat16 data type for training and inference. Underscoring the importance of these workloads, we introduced these extensions to both the AArch32 and AArch64 states, with the latter also getting an SVE variant.
You can use these extensions in GCC 10.1 through ACLE intrinsics and the -march=armv8.6-a option and associated extensions.
The announcement of Arm Custom Instructions by Simon Segars at Arm TechCon 2019 made a big splash. Behind the scenes, the engineering teams have been hard at work to make it happen. With the initial architecture specification in place we defined a number of ACLE intrinsics to provide access to the new instructions made available through the Custom Datapath Extension (CDE). The nature of these instructions allows an M-profile processor vendor to customize the behavior of these instructions for their particular application. The compiler only needs to be aware of the input and output registers used and some basic guarantees about the absence of unpredictable side-effects of these instructions to properly model the data flow of the program. You can now generate these custom instructions with a GCC 10.1 compiler for Armv8.1-M by specifying the coprocessor you want to use like so:
$ arm-none-eabi-gcc -march=armv8.1-m.main+cdecp0
This compile command enables the CP0 CDE coprocessor and associated instructions. A toy example using the intrinsics defined in the new header arm_cde.h such as:
test_cde_cx1 (uint32_t a)
return __arm_cx2 (0, a, 33);
then compiles to the assembly:
cx2 p0, r0, r0, #33
We would love to hear your feedback on how to best expose the Arm Custom Instructions for your favorite use case. Please email firstname.lastname@example.org with your ideas.
GCC 10.1 brings support for the Armv8.1-M Mainline architecture through the -march=armv8.1-m.mainline option and its extensions. This includes code generation for the updated CMSE specification with the -mcmse option and initial support for the MVE SIMD architecture. This includes support for the ACLE intrinsics that are available when including the new arm_mve.h header file.
Along with the architecture support comes a brand new -mcpu=cortex-m55 option for the Cortex-M55 processor.
This was just a small Arm-specific taste of the many features in the GCC 10.1 release. There are many conformance improvements to the latest language standards, new developer aids, CPU support and more. Plus an experimental integrated static analyzer.
Beyond supporting the latest Arm IP features, the Arm partnership also works hard on performance optimizations. Watch this space for a deep dive into that world.
Read about the new features in the GCC 10.1 release
If like me you can't wait to try out these new features, you can always build yourself a gcc 10.1 cross-toolchain from source. Preshing has a great tutorial here:
The only thing I had to do, to update this was to configure gcc with an additional --disable-libsanitizer
For those who are wondering how to get GCC10 based toolchain, Arm will release two GCC10 based cross-toolchains before the end of the year 2020 (1) GNU Arm Embedded release (cross-toolchain for 32-bit bare-metal Arm hardware) (2) GNU-A toolchain (cross-toolchain for 64/32-bit A-Profile hardware). Both these releases are scheduled for Q4 CY20 and will be based of GCC 10.2. For users who are interested in MVE, we plan to provide a preview GNU Arm Embedded release based on GCC 10.1 soon. We also expect native GCC 10.x based toolchain to be available in popular Linux distributions by the end of 2020. Watch this space for more updates!