LLVM 15.0.0 was released on September 6, followed by a series of minor bug-fixing releases. In addition to the regular architecture enablement work, Arm contributed several pieces of new functionality, including support for frame chains in AArch32, SVE whole-loop scalable auto-vectorization, and C/C++ operator support for ACLE types. The release also benefits from numerous performance improvements, both for A-profile and M-profile cores.
LLVM 15 expands support for the A-profile architecture with the new Armv8.8-A and Armv9.3-A extensions. The most relevant change from a toolchain point of view is the addition of new instructions to improve the performance and portability of memcpy() and memset(). These can now be accessed through a series of ACLE intrinsics of the form __builtin_arm_mops*(). For more information on the full extensions, refer to the Arm A-Profile Architecture Developments 2021 blog post.
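As a minimal sketch (not taken from the release notes), the example below uses one of these builtins, __builtin_arm_mops_memset_tag(), which performs a MOPS-based memset while also updating MTE allocation tags. It assumes a target with both the MOPS and MTE extensions enabled, for example -march=armv8.8-a+memtag; refer to the ACLE for the full list of intrinsics and their exact semantics.

#include <stddef.h>

/* Sketch only: zero 'size' bytes at 'buf' and update the MTE allocation
   tags using the new memcpy/memset (MOPS) instructions. Assumes the
   target enables both MOPS and MTE, e.g. -march=armv8.8-a+memtag. */
void *clear_and_tag(void *buf, size_t size) {
  return __builtin_arm_mops_memset_tag(buf, 0, size);
}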
Another addition worth noting is support for the Arm Cortex-M85 processor. This is the highest-performance M-profile CPU to date, and is also the first CPU to support the optional PACBTI security extension. More information on the Cortex-M85 is available here. You can build for this CPU by adding -mcpu=cortex-m85 to your command line.
LLVM 15 bundles several performance improvements. One of the main areas of work has been vectorization with SVE.
The new SVE tail-folding support allows the vectorizer to handle all iterations within the vectorized loop, removing the need for a scalar epilogue loop. Consider the following example:
void foo(int* __restrict__ dst, int* __restrict__ src, int N) {
  #pragma nounroll
  for (int i = 0; i < 12; i++) {
    dst[i] = 1000 * src[i];
  }
}
Compile with -O3 -g0 -march=armv9-a.
Before LLVM 15, without tail folding, the compiler would refuse to vectorize:
foo(int*, int*, int):                   // @foo(int*, int*, int)
        mov     x8, xzr
        mov     w9, #1000
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ldr     w10, [x1, x8]
        mul     w10, w10, w9
        str     w10, [x0, x8]
        add     x8, x8, #4
        cmp     x8, #48
        b.ne    .LBB0_1
        ret
In LLVM 15, thanks to tail folding, vectorization no longer requires a scalar epilogue. LLVM estimates the vector version of this loop to be faster than the scalar one, and the loop is successfully vectorized:
foo(int*, int*, int):                   // @foo(int*, int*, int)
        mov     w9, #12
        mov     w11, #1000
        mov     x8, xzr
        cntw    x10
        whilelo p0.s, xzr, x9
        mov     z0.s, w11
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ld1w    { z1.s }, p0/z, [x1, x8, lsl #2]
        mul     z1.s, z1.s, z0.s
        st1w    { z1.s }, p0, [x0, x8, lsl #2]
        add     x8, x8, x10
        whilelo p0.s, x8, x9
        b.mi    .LBB0_1
        ret
Tail folding is enabled by default only in certain circumstances, such as loops with known trip counts. Otherwise, it can be enabled with the option -mllvm -sve-tail-folding=<option>.
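As an illustration (a sketch, not taken from the original example), a variant of the loop above whose trip count is only known at run time falls outside the default heuristics, so the option above is needed to request tail folding for it:

void foo_runtime(int* __restrict__ dst, int* __restrict__ src, int N) {
  // The trip count N is not known at compile time, so SVE tail folding is
  // not applied by default; it can be requested with
  // -mllvm -sve-tail-folding=<option>.
  for (int i = 0; i < N; i++) {
    dst[i] = 1000 * src[i];
  }
}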
Some AArch64 cores, like Neoverse N1, benefit when sequences of 256-bit STP instructions (store pairs of Q registers) are ordered by ascending offset. This affects the performance of large memset() calls, and of any equivalent code whose hot path runs through a series of STP Q instructions. In LLVM 15, we added scheduling support to reorder such stores after register allocation on AArch64. For example:
void init_one (char *a)
{
  __builtin_memset (a, 1, 128);
}
With the older scheduling, LLVM generated:
init_one:
        movi    v0.16b, #1
        stp     q0, q0, [x0, #96]
        stp     q0, q0, [x0, #64]
        stp     q0, q0, [x0, #32]
        stp     q0, q0, [x0]
        ret
The improved scheduler now generates:
init_one:
        movi    v0.16b, #1
        stp     q0, q0, [x0]
        stp     q0, q0, [x0, #32]
        stp     q0, q0, [x0, #64]
        stp     q0, q0, [x0, #96]
        ret
LLVM 15 brings various new features for Arm users.
LLVM 15 adds an option for ensuring the generation of AAPCS-compliant frame records in AArch32. Frame records allow applications to easily analyze and traverse the complete call stack starting from a frame pointer.
Without this option, the compiler might choose slightly different layouts and registers for the frame chain. In Thumb-1 mode, for example, using r11 (a high register) as the frame pointer can have a negative impact on performance and code size, so the compiler uses r7 by default. The new option ensures that the compiled program is compatible with tools and applications that rely on the behavior specified in the AAPCS.
The AAPCS32 standard can be found here. You can turn on AAPCS-compliant frame chains by passing -mframe-chain=aapcs to the compiler.
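As an illustration of what frame records enable, the sketch below walks a frame chain to print a backtrace. This is a minimal sketch, not part of the release: it assumes the whole program was built with -mframe-chain=aapcs, that each frame record is a {saved frame pointer, return address} pair of words pointed to by the frame pointer, and that the chain terminates with a null previous pointer; consult the AAPCS32 document for the authoritative layout.

#include <stdio.h>

/* Hypothetical frame-record layout, assuming a {previous FP, return address}
   pair as described by the AAPCS32 frame-chain rules. */
struct frame_record {
  struct frame_record *prev;  /* caller's frame pointer   */
  void *return_address;       /* saved return address     */
};

/* Walk the frame chain starting from a given frame pointer and print each
   return address. Assumes the chain ends with a null previous pointer. */
void dump_call_stack(const struct frame_record *fp) {
  for (const struct frame_record *fr = fp; fr != NULL; fr = fr->prev)
    printf("return address: %p\n", fr->return_address);
}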
Support for the macro __ARM_FEATURE_SVE_VECTOR_OPERATORS has been added to LLVM 15. This macro indicates that the GNU vector extensions can be used with length-agnostic vector types such as svint32_t. The language extension is described in more detail as part of the ACLE specification.
Among other uses, this allows C/C++ operators to be applied directly to length-agnostic vectors. The following example illustrates the addition of two vectors:
#include <arm_sve.h>

#if (__ARM_FEATURE_SVE_VECTOR_OPERATORS == 2)
int32_t foo(svint32_t x, svint32_t y) {
  return (x + y)[3];
}
#endif
foo(__SVInt32_t, __SVInt32_t):          // @foo(__SVInt32_t, __SVInt32_t)
        add     z0.s, z1.s, z0.s
        mov     w0, v0.s[3]
        ret
Note that, although the element size of the vectors is 32 bits, the total length of the vector is unknown at compile time.
Over the past months, Arm has been heavily involved in the main LLVM Fortran front end, Flang. Up until LLVM 14, the development of the Flang front end was split between two repositories: llvm-project and f18-llvm-project. Having two repositories caused maintenance overhead, confusion when merging patches, and inconsistencies between the repositories. In preparation for LLVM 15, Arm participated in a community effort to merge the second repository into llvm-project's main branch. The completion of this upstreaming initiative gives us a single repository that contains a mostly functional Fortran 95 compiler. Executables can be built by passing the option -flang-experimental-exec to the flang-new driver.
The most significant parts of the move were the OpenMP lowering code, the Fortran loop-related constructs and intrinsics, and support for CMake. In the area of OpenMP, we contributed support for reductions, and we are now close to complete support for OpenMP 1.1.
Some features of Fortran 95 are not yet supported, especially in the area of derived type components such as arrays and strings. Performance is also an area for future improvement. We continue to lead and invest further in the Flang driver, and we look forward to delivering more improvements in future releases of LLVM.