Making the most of the Arm architecture with GCC 10

May 14, 2020

7 minute read time.

The GCC 10.1 release

The GNU Compiler Collection is used to program a rich variety of systems: from the fastest of supercomputers to the tiniest of micro-controllers. We at Arm love ecosystems. The recently released GCC 10.1 is the culmination of a year of hard work from the GCC community. And the Arm partnership has played its part. This blog gives you an insight into some of the new Arm-related features we are most excited about new CPU support, architecture support, portable software deployment aids, and performance optimizations. For a deeper dive, do check out the official release notes.

Scalable Vector Extensions

More than 25000 lines of code were added in the past year to implement the SVE ACLE. These statistics are now easy to gather since the GCC project moved to using git for version control in 2020. We are proud to announce that GCC 10.1 fully supports the Arm C Language Extensions for SVE. This gives you access to more than 4000 intrinsics to use any of SVE's many advanced features in your kernels. For example let's try to compile a slightly modified example from the Arm Scalable Vector Extensions and application to Machine Learning whitepaper:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#else
#error "Must use SVE for this example!"
#endif /* __ARM_FEATURE_SVE */ 

void
vla_add_arrays (double *dst, double *src, double c, long N)
{
  for (long i = 0; i < N; i += svcntd ())
    {
      svbool_t Pg = svwhilelt_b64 (i, N);
      svfloat64_t vsrc = svld1 (Pg, &src[i]);
      svfloat64_t vdst = svadd_x (Pg, vsrc, c);
      svst1 (Pg, &dst[i], vdst);
    }
}

Compiling it with an AArch64 GCC 10.1:

$ aarch64-none-linux-gnu-gcc -O2 -march=armv8.2-a+sve example.c

Compiles without a problem and gives us the SVE assembly:

vla_add_arrays:
        cmp     x2, 0
        ble     .L1
        mov     x3, 0
        mov     z1.d, d0
        .p2align 3,,7
.L3:
        whilelt p0.d, x3, x2
        ld1d    z0.d, p0/z, [x1, x3, lsl 3]
        fadd    z0.d, p0/m, z0.d, z1.d
        st1d    z0.d, p0, [x0, x3, lsl 3]
        incd    x3
        cmp     x2, x3
        bgt     .L3
.L1:
        ret

You can see here the use of the SVE per-lane predication feature to vectorize the loop. This avoids the scalar epilogues and fallbacks that would otherwise be necessary with traditional SIMD instruction sets.

Preparing for the deployment of the Future Architecture Technologies, GCC 10.1 provides support for the SVE2 ACLE intrinsics as well. Compilation for SVE2 can be enabled with the +sve2 extension to the -march and -mcpu options. For example:

$ aarch64-none-linux-gnu -march=armv8.5-a+sve2

Being a SIMD ISA, SVE is a great target for compiler auto-vectorization. Though the job of an optimizing compiler is never finished, GCC 10.1 has improved auto-vectorization capabilities when targeting SVE. Look out for more details coming soon.

Out of line atomics for LSE deployment

The Armv8.1-A architecture introduced the Large System Extensions (LSE). These include instructions to perform commonly used operations like compare-and-swap (CAS) and atomic load and increment (LDADD). They can be used to efficiently map high-level language constructs like __atomic_compare_exchange and __atomic_fetch_add down to instruction sequences respecting the Arm memory model. These instructions can be vital for getting the best performance scaling in large core count systems. Indeed, GCC will use the LSE instructions automatically when compiling for -march=armv8.1-a or higher rather than using a load-exclusive, operation, store-exclusive loop as when compiling for -march=armv8-a.

GNU/Linux distributions compile for a baseline -march=armv8-a architecture to ensure it runs correctly on every AArch64 implementation out there. But they still want to take advantage of LSE instructions when they are available. To that end Richard Henderson of Linaro contributed into GCC 10.1 the -moutline-atomics option, which is on by default in GCC 10.1. When compiling for an Armv8-A baseline with this option the compiler will generate a stub calling a runtime helper function rather than emitting a load-exclusive-store-exclusive loop. The helper function performs a runtime check of the availability of LSE instructions through the HWCAP mechanism (caching the outcome for faster subsequent checks). It then dispatches to an LSE instruction sequence if available, or to a load-exclusive-store-exclusive loop. If this sounds complicated, here is a simple example in C utilizing a language-level atomic construct:

int
test_cas_atomic_int (int *val, int *foo, int *bar)
{
  return __atomic_compare_exchange_n (val, foo, bar, 0, 0, 0);
}

Compiled with -march=armv8-a -O2 GCC 9 generates a load-exclusive-store-exclusive loop:

test_cas_atomic_int:
        ldr     w3, [x1]
.L4:
        ldxr    w4, [x0]
        cmp     w4, w3
        bne     .L5
        stxr    w5, w2, [x0]
        cbnz    w5, .L4
.L5:
        cset    w0, eq
        beq     .L2
        str     w4, [x1]
.L2:
        ret

where you see the loop that retries the exclusive stores with STXR until it succeeds atomically. Compiled with -O2 -march=armv8.1-a (which has an implicit +lse) it generates:

test_cas_atomic_int:
        ldr     w4, [x1]
        mov     w3, w4
        cas     w3, w2, [x0]
        cmp     w3, w4
        cset    w0, eq
        beq     .L2
        str     w3, [x1]
.L2:
        ret

where you see a simpler sequence utilising the CAS instruction from LSE. Now, with GCC 10.1 we for the options -O2 -march=armv8-a we get:

test_cas_atomic_int:
        stp     x29, x30, [sp, -32]!
        mov     x29, sp
        stp     x19, x20, [sp, 16]
        mov     x19, x1
        mov     w1, w2
        mov     x2, x0
        ldr     w20, [x19]
        mov     w0, w20
        bl      __aarch64_cas4_relax
        cmp     w0, w20
        mov     w1, w0
        cset    w0, eq
        beq     .L2
        str     w1, [x19]
.L2:
        ldp     x19, x20, [sp, 16]
        ldp     x29, x30, [sp], 32
        ret

There is some complexity going on here related to preparing the arguments for a function call to __aarch64_cas4_relax which is a helper function provided by the libgcc runtime library. In there, the runtime can test for the presence of LSE instructions and dispatch to either of the two previous sequences. This indirection allows this function to run correctly on all AArch64 systems, even Armv8-A ones, while still using the LSE instructions from Armv8.1-A where possible. Various members in the Arm ecosystem have measured the performance impact of this indirection on a diverse set of systems and we were happy to find out that it was minimal compared to the benefit of using the LSE instructions for better scalability at large core counts.

Annual updates to the Arm architecture: Armv8.6-A

The Armv8.6-A architecture update introduced a number of innovations to accelerate Machine Learning workloads. These include instructions for general matrix multiplication (GEMM) and the bfloat16 data type for training and inference. Underscoring the importance of these workloads, we introduced these extensions to both the AArch32 and AArch64 states, with the latter also getting an SVE variant.

You can use these extensions in GCC 10.1 through ACLE intrinsics and the -march=armv8.6-a option and associated extensions.

Arm Custom Instructions and the Custom Datapath Extensions

The announcement of Arm Custom Instructions by Simon Segars at Arm TechCon 2019 made a big splash. Behind the scenes, the engineering teams have been hard at work to make it happen. With the initial architecture specification in place we defined a number of ACLE intrinsics to provide access to the new instructions made available through the Custom Datapath Extension (CDE). The nature of these instructions allows an M-profile processor vendor to customize the behavior of these instructions for their particular application. The compiler only needs to be aware of the input and output registers used and some basic guarantees about the absence of unpredictable side-effects of these instructions to properly model the data flow of the program. You can now generate these custom instructions with a GCC 10.1 compiler for Armv8.1-M by specifying the coprocessor you want to use like so:

$ arm-none-eabi-gcc -march=armv8.1-m.main+cdecp0

This compile command enables the CP0 CDE coprocessor and associated instructions. A toy example using the intrinsics defined in the new header arm_cde.h such as:

#include "arm_cde.h"

uint32_t
test_cde_cx1 (uint32_t a)
{

  return __arm_cx2 (0, a, 33);
}

then compiles to the assembly:

test_cde_cx1:
        cx2     p0, r0, r0, #33
        bx      lr

We would love to hear your feedback on how to best expose the Arm Custom Instructions for your favorite use case. Please email arm.acle@arm.com with your ideas.

Armv8.1-M

GCC 10.1 brings support for the Armv8.1-M Mainline architecture through the -march=armv8.1-m.mainline option and its extensions. This includes code generation for the updated CMSE specification with the -mcmse option and initial support for the MVE SIMD architecture. This includes support for the ACLE intrinsics that are available when including the new arm_mve.h header file.

Along with the architecture support comes a brand new -mcpu=cortex-m55 option for the Cortex-M55 processor.

There's more!

This was just a small Arm-specific taste of the many features in the GCC 10.1 release. There are many conformance improvements to the latest language standards, new developer aids, CPU support and more. Plus an experimental integrated static analyzer.

Beyond supporting the latest Arm IP features, the Arm partnership also works hard on performance optimizations. Watch this space for a deep dive into that world.

[CTAToken URL = "" target="https://gcc.gnu.org/gcc-10/changes.html" text="Read about the new features in the GCC 10.1 release" class ="green"]

Viatorus over 3 years ago in reply to Ashok Bhat

Will there be any official release for a new GCC ARM none eabi version in the next few weeks? In the last years, a GCC update happend twice a year.
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Xiangdong Ji over 4 years ago

Just wondering if the built-in function __atomic_compare_exchange_n referenced below should be __atomic_compare_exchange (without _n) instead, as the third argument of the former is not expected to be a pointer. Thanks.

int test_cas_atomic_int (int *val, int *foo, int *bar)

{

return __atomic_compare_exchange_n (val, foo, bar, 0, 0, 0);

}
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Matt Horsnell over 4 years ago in reply to Ashok Bhat

If like me you can't wait to try out these new features, you can always build yourself a gcc 10.1 cross-toolchain from source. Preshing has a great tutorial here:

https://preshing.com/20141119/how-to-build-a-gcc-cross-compiler/

The only thing I had to do, to update this was to configure gcc with an additional --disable-libsanitizer

Enjoy!
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Ashok Bhat over 4 years ago

For those who are wondering how to get GCC10 based toolchain, Arm will release two GCC10 based cross-toolchains before the end of the year 2020 (1) GNU Arm Embedded release (cross-toolchain for 32-bit bare-metal Arm hardware) (2) GNU-A toolchain (cross-toolchain for 64/32-bit A-Profile hardware). Both these releases are scheduled for Q4 CY20 and will be based of GCC 10.2. For users who are interested in MVE, we plan to provide a preview GNU Arm Embedded release based on GCC 10.1 soon. We also expect native GCC 10.x based toolchain to be available in popular Linux distributions by the end of 2020. Watch this space for more updates!
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Tools, Software and IDEs blog

Arm Toolchain for Embedded: next-generation Arm C/C++ embedded compiler

Paul Black

Arm is launching Arm Toolchain for Embedded (ATfE), an embedded C/C++ cross-compiler. The toolchain is expected to be launched in April 2025, but a beta version is available now.
- January 9, 2025
Product update: Arm Development Studio 2024.1 now available

Ronan Synnott

Arm Development Studio 2024.1 is now available with support for Cortex-A725 and Cortex-X925.
- January 2, 2025
Part 3: Leveraging Rust with Rich Operating Systems on Arm

Jonathan Pallant

Understand how Rust can take full advantage of running on a full-blown operating system such as Linux.
- November 15, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog