Here at Arm, we use LLVM a lot. It is an important toolchain in its own right, plus it forms the basis of many open source and commercial toolchains targeting Arm, including our own Arm Compiler for Embedded and Arm Compiler for Linux. It is used across the Arm ecosystem - from the development of code for tiny embedded processors, through mobile phone app development on Android, right up to High Performance Computing (HPC) development on the biggest systems in the world.
LLVM 14 has just been released, and has some great new enhancements for developers targeting Arm-based systems. This blog post will introduce some of those enhancements, and talk about how you might want to make use of them.
SVE and SVE2 are vector enhancements to the Arm A-profile architecture. Many readers will be aware of them in detail already, but to learn more please visit our newly updated SVE/SVE2 Programmer's Guide.
Side note: for the sake of brevity, I'll use 'SVE2' for the rest of this article to refer to both SVE and SVE2. The majority of the work is applicable to both architecture features.
For the past few years, our LLVM compiler development team - along with our ecosystem partners - have been adding enhancements to allow LLVM to make use of SVE2 instructions. The primary mechanism in LLVM for making use of SVE2 is scalable auto-vectorisation. This allows the compiler to turn scalar loops into SVE2 vector instructions that execute correctly, and with good performance, on CPUs with any SVE2 vector length, up to 2048 bits.
In LLVM 14, for the first time, the vectorizer performs scalable auto-vectorisation by default, generating SVE2 instructions when compiling for targets that support them.
Along the way, the team have made significant improvements to SVE2 code quality, improved the LLVM cost model, and made a number of compiler features, such as stack protection, aware of SVE2.
Making use of scalable auto-vectorisation is easy. You just need to:

1. Enable the vectorizer, typically by giving an optimisation flag of -O2 or higher. -O3 or -Ofast is recommended where appropriate.
2. Enable SVE2 code generation, typically by specifying an appropriate CPU target, such as -mcpu=neoverse-v1 (alternatively, this can be achieved using a suitable -march flag). A complete example invocation is sketched below.
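For example, assuming a source file called foo.c (the file name here is purely illustrative), a complete invocation might look like this:

    clang --target=aarch64-linux-gnu -O3 -mcpu=neoverse-v1 foo.c -o foo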
For suitable workloads, we have found that LLVM 14 gives performance improvements over NEON-only auto-vectorisation on SVE-capable systems, such as those using the Neoverse V1 CPU.
There's more work coming in this space - we intend to make significant further enhancements to LLVM SVE2 code quality in LLVM 15, later this year.
We can see an example of SVE auto-vectorisation in practice by looking at the following code snippet:

void foo(short * __restrict__ dst, short *src, int N) {
  for (int i = 0; i < N; i++)
    dst[i] = src[i] + 42;
}
Note: In the code examples below, I have disabled unrolling by adding -fno-unroll-loops, purely to make the resulting assembly code easier to read. By default, when targeting Neoverse V1 systems, the compiler will choose to unroll this code to improve runtime performance.
If we compile this with -mcpu=neoverse-v1 -O3 -g0 -fno-unroll-loops, we see code that looks like this:
_Z3fooPsS_i:                            // @_Z3fooPsS_i
        .cfi_startproc
// %bb.0:
        cmp     w2, #1
        b.lt    .LBB0_8
// %bb.1:
        mov     w8, w2
        cnth    x10
        cmp     x10, x8
        b.ls    .LBB0_3
// %bb.2:
        mov     x9, xzr
        b       .LBB0_6
.LBB0_3:
        udiv    x9, x8, x10
        mov     x11, xzr
        ptrue   p0.h
        mul     x9, x9, x10
        sub     x12, x8, x9
        .p2align        5, 0x0, 16
.LBB0_4:                                // =>This Inner Loop Header: Depth=1
        ld1h    { z0.h }, p0/z, [x1, x11, lsl #1]
        add     z0.h, z0.h, #42         // =0x2a
        st1h    { z0.h }, p0, [x0, x11, lsl #1]
        add     x11, x11, x10
        cmp     x11, x9
        b.ne    .LBB0_4
// %bb.5:
        cbz     x12, .LBB0_8
.LBB0_6:
        lsl     x11, x9, #1
        sub     x8, x8, x9
        add     x10, x0, x11
        add     x11, x1, x11
        .p2align        5, 0x0, 16
.LBB0_7:                                // =>This Inner Loop Header: Depth=1
        ldrh    w9, [x11], #2
        subs    x8, x8, #1
        add     w9, w9, #42
        strh    w9, [x10], #2
        b.ne    .LBB0_7
.LBB0_8:
        ret
In the middle of this code block, you can see a section labelled .LBB0_4. This is the SVE scalable vectorised loop.
Another new capability in LLVM 14 is to fold the scalar tail (shown in sections .LBB0_6 and .LBB0_7) into the main loop.
Using a non-production flag to the compiler, we can see the effect this has on the code above. If we compile this with -mcpu=neoverse-v1 -O3 -g0 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -fno-unroll-loops, we see code that looks like this:
_Z3fooPsS_i:                            // @_Z3fooPsS_i
        .cfi_startproc
// %bb.0:
        cmp     w2, #1
        b.lt    .LBB0_3
// %bb.1:
        mov     w9, w2
        cnth    x10
        mov     x8, xzr
        add     x11, x10, x9
        sub     x11, x11, #1
        udiv    x11, x11, x10
        mul     x11, x11, x10
        .p2align        5, 0x0, 16
.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        whilelo p0.h, x8, x9
        ld1h    { z0.h }, p0/z, [x1, x8, lsl #1]
        add     z0.h, z0.h, #42         // =0x2a
        st1h    { z0.h }, p0, [x0, x8, lsl #1]
        add     x8, x8, x10
        cmp     x8, x11
        b.ne    .LBB0_2
.LBB0_3:
        ret
Here, we can see the compiler making full use of the predication features of SVE, to allow the scalar tail to be folded into the main loop body. Although I have 'forced' this using a non-production flag, the compiler will apply this optimization automatically where it believes it to be beneficial. Further work in LLVM 15 will increase the code quality of this optimization.
We expect most users of SVE2 to use the vector-length-agnostic auto-vectorisation described above. However, in some specialised fields, such as HPC, Arm has partners with LLVM-based commercial toolchains that can safely assume a specific SVE vector width. To help support these partners, LLVM supports a style of auto-vectorisation known as vector-length-specific auto-vectorisation. By invoking the compiler with -msve-vector-bits=256 or similar, the user gives the compiler freedom to assume that the code will only ever be executed on systems with exactly 256-bit SVE vector registers.
In LLVM 14, the code quality of fixed-length auto-vectorisation has been significantly improved, with support for interleaved accesses when using wider-than-NEON vectors, and better code quality when vectorising loops containing mixed data types.
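To give a flavour of what vector-length-specific compilation enables at the source level, here is a minimal sketch (the type and function names are illustrative) using the ACLE arm_sve_vector_bits attribute, which is only available when -msve-vector-bits is given:

#include <arm_sve.h>

// Compile with, for example: -march=armv8-a+sve -msve-vector-bits=256
// With a known vector length, an SVE type can be given a fixed size
// (here 256 bits, i.e. 8 floats), so it behaves like an ordinary sized
// C type and can be stored in structs, arrays, and so on.
typedef svfloat32_t fixed_float32_t
    __attribute__((arm_sve_vector_bits(256)));

fixed_float32_t double_it(fixed_float32_t v) {
    // Fixed-length values convert implicitly to and from the scalable
    // types, so the normal ACLE intrinsics can operate on them.
    return svmul_n_f32_x(svptrue_b32(), v, 2.0f);
}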
The Arm C Language extensions (or ACLE) for SVE provide an API to enable C/C++ programmers to exploit the Arm architecture with minimal restrictions on source code portability. The language extensions have two main purposes: to provide a set of types and accessors for SVE vectors and predicates, and to provide a function interface for all relevant SVE and SVE2 instructions. For more information on the ACLE, see our ACLE home page.
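To give a flavour of the intrinsics, here is a hand-written SVE version of the earlier loop; a minimal sketch, assuming a compiler invocation that enables SVE (for example, -march=armv8-a+sve):

#include <arm_sve.h>

void add42(short *dst, const short *src, int n) {
    // svcnth() returns the number of 16-bit elements per vector,
    // so this loop is vector-length agnostic.
    for (int i = 0; i < n; i += svcnth()) {
        svbool_t pg = svwhilelt_b16_s32(i, n); // predicate for the active lanes
        svint16_t v = svld1_s16(pg, src + i);  // predicated load
        v = svadd_n_s16_x(pg, v, 42);          // add 42 to the active lanes
        svst1_s16(pg, dst + i, v);             // predicated store
    }
}

Note how the whilelt predicate handles the final, partial iteration, mirroring the whilelo-based tail folding the compiler generated above.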
A new feature of the ACLE, known as the NEON/SVE bridge, has recently been added to the ACLE specification, and implemented in LLVM 14. The bridge allows for conversions between a NEON vector type and an SVE type. Since the NEON and SVE register files overlap, this conversion is typically free, reusing the existing in-register data without causing copies via memory.
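As a minimal sketch of the bridge in use (the function name is illustrative), the arm_neon_sve_bridge.h header provides intrinsics such as svdup_neonq, svget_neonq, and svset_neonq for moving values between the two register views:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

// Broadcast a 128-bit NEON vector across a full SVE register and
// accumulate it. Because the NEON registers alias the low 128 bits of
// the SVE registers, no copy through memory is required.
svfloat32_t accumulate_neon(svfloat32_t acc, float32x4_t v) {
    svfloat32_t sv = svdup_neonq_f32(v);        // splat the NEON value
    return svadd_f32_x(svptrue_b32(), acc, sv); // add across all lanes
}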
As a side note: the Arm ACLE specification is now open source, and the addition above was made via our public contribution process. Further contributions are very welcome!
Additionally, we have made a number of improvements to LLVM's implementation of the SVE ACLE, including work involving svbool_t and the PTEST, ASRD, SUBR, ABD, and FABD instructions.
In combination, we believe these improvements mean that LLVM 14 is ready for production use of the SVE2 ACLE. Please let us know if you have any issues.
LLVM 14 expands support for both Arm architectural features and Arm Cortex CPUs.
On the architecture side, this includes:
For more information on architectural features, including the new Architectural Exploration tools for A-profile, please see our CPU Architecture homepage.
The following CPUs are now supported in LLVM:
- Cortex-A510 (-mcpu=cortex-a510)
- Cortex-A710 (-mcpu=cortex-a710)
- Cortex-X2 (-mcpu=cortex-x2)
- Cortex-X1C (-mcpu=cortex-x1c)
Since LLVM 11, the compiler team at Arm have been steadily improving the performance of SPECINT 2017 for big AArch64 cores, such as those seen in Neoverse and high-end mobile Cortex CPUs, by around 2% per release. In LLVM 14, we achieved a 1.5% improvement in the SPEC2017 intrate geomean score over LLVM 13.
This is primarily down to two pieces of work:
We expect to continue to make improvements to SPECINT in LLVM 15 and beyond.
In addition to our focus on the SPEC CPU benchmark, we have made a number of other improvements, including:
In partnership with Linaro, the LLDB debugger has a number of new features in LLVM 14, including:
I'd like to end this post with a shout-out to our awesome development team. The following engineers at Arm contributed to LLVM 14:
Aaron DeBattista, Alban Bridonneau, Alexandros Lamprineas, Anastasia Stulova, Andre Simoes Dias Vieira, Andrzej Warzynski, Bradley Smith, Caroline Concatto, Cullen Rhodes, Daniel Kiss, David Candler, David Green, David Sherwood, David Spickett, David Truby, Graham Hunter, Igor Kirillov, Ivan Zhechev, James Greenhalgh, Javier Setoain, Jingu Kang, John Brawn, Jolanta Jensen, Josh Mottley, Justas Janickas, Karl Meakin, Keith Walker, Kerry McLaughlin, Kevin Cheng, Kevin Petit, Kiran Chandramohan, Kristof Beyls, Kyrylo Tkachov, Lucas Prates, Maciej Gabka, Malhar Jajoo, Mark Murray, Mats Petersson, Matthew Devereau, Mikhail Maltsev, Momchil Velikov, Mubashar Ahmad, Nicholas Guy, Ole Marius Strohm, Oliver Stannard, Paul Walker, Pawel Osmialowski, Peter Smith, Peter Waller, Ranjeet Singh, Richard Barton, Rosie Sumpter, Sam Elliott, Sam Parker-Haynes, Sander de Smalen, Simon Tatham, Sjoerd Meijer, Son Tuan Vu, Steve Suzuki, Stuart Brady, Stuart Ellis, Suraj Sudhir, Sven van Haastregt, Ties Stuij, Tomas Matheson, Victor Campos, Volodymyr Turanskyy