LLVM 18.1.0 was released on March 6th, 2024. Among multiple new features and improvements, Arm contributed support for the latest Armv9.5-A architecture as well as numerous performance and security enhancements further detailed below.
To find out more about the previous LLVM release, you can read the What is new in LLVM 17? blog post.
By Lucas Prates
Support for the Armv9.5-A extensions is now available in LLVM. You can learn more about the new extensions, notably Checked Pointer Arithmetic, 8-bit floating point (FP8), live VM migration, and others, in the announcement blog.
Assembly and disassembly support are now available for all the extensions introduced as part of the 2023 updates.
The PAC Enhancement (FEAT_PAuth_LR) is one of the main security improvements coming with Armv9.5-A. The extension is part of the Memory System Extensions and is designed to harden the PAC for return address signing by using the value of PC as a second diversifier. The new +pc modifier to the -mbranch-protection option has been introduced in LLVM 18 to enable the new FEAT_PAuth_LR instructions for return address signing. For example, when compiled with -march=armv9.5a+pauth-lr -mbranch-protection=pac-ret+pc, the code:
void example_leaf();

void example_function() {
  example_leaf();
}
Results in the following assembly:
example_function():
.Ltmp0:
        paciasppc
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
        bl      example_leaf()
        ldp     x29, x30, [sp], #16
        retaasppc .Ltmp0
Checked Pointer Arithmetic (FEAT_CPA) is the second major security improvement coming with Armv9.5-A. Part of the Memory Tagging Extensions, the extension is aimed at detecting and preventing modifications of bits [62:56] of virtual addresses during pointer manipulation based on user-controlled data. Assembly and disassembly support have been added for the new Checked Pointer Arithmetic instructions, and code generation will be supported in the next LLVM release.
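As a rough illustration of the assembly-level support, the snippet below is an assumed example based on the FEAT_CPA documentation rather than the LLVM release notes; the file name, register choices and exact instruction selection are ours.

// cpa_example.s -- assumed example of the new FEAT_CPA instructions.
// Assemble with something like:
//   clang --target=aarch64-linux-gnu -march=armv9.5a+cpa -c cpa_example.s
        addpt   x0, x1, x2              // checked pointer addition
        subpt   x0, x1, x2              // checked pointer subtraction
        maddpt  x0, x1, x2, x3          // checked multiply-add producing a pointer
        msubpt  x0, x1, x2, x3          // checked multiply-subtract producing a pointer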
By Jonathan Thackray
On the CPU side, this release extends the lineup of Armv9.2-A cores with support for our Cortex-A520, Cortex-A720 and Cortex-X4, and the Armv8.1-M Cortex-M52.
By Kyrylo Tkachov
Several optimizations have been added that benefit popular benchmarks such as SPEC2017. These include major improvements to code generation in Flang, a new loop idiom recognition pass, optimizations to vectorization checks, PGO improvements and more.
This leads to an overall geomean improvement of more than 10% in the estimated SPEC2017 intrate score.
By David Sherwood & Kerry McLaughlin
The AArch64LoopIdiomTransform pass was introduced in LLVM 18. It is intended to recognize common idioms that would benefit from vectorization but are not currently handled by the vectorizer. The idiom this pass currently targets is a simple loop that compares bytes of memory and breaks when a mismatch is found. The motivation for targeting this idiom was to improve the performance of the 557.xz_r workload in SPEC2017, which contains multiple occurrences of this pattern.
Below is a simple example of such a loop:
void foo1_u32(unsigned char *p1, unsigned char *p2, unsigned *startp, unsigned end) {
  unsigned start = *startp;
  while (++start != end)
    if (p1[start] != p2[start])
      break;
  *startp = start;
}
We first experimented with trying to improve the performance of xz by replacing the source with handwritten Neon and SVE loops where this pattern occurs. The best-performing form was found to be a predicated SVE loop using regular load instructions, with runtime memory checks before the loop that fall back on a scalar version if the memory accesses in the loop are found to cross a page boundary.
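For illustration, here is a rough sketch of what such a predicated SVE loop can look like when written by hand with the ACLE intrinsics from arm_sve.h. The function name and interface are ours, the runtime page-crossing checks mentioned above are omitted, and the start/end handling is simplified compared with the original loop; it returns the index of the first mismatching byte, mirroring the brkb/incp sequence visible in the generated assembly further below.

#include <arm_sve.h>
#include <stdint.h>

/* Hypothetical hand-written form of the byte-mismatch idiom using SVE ACLE
   intrinsics. Returns the index of the first mismatching byte in [start, end),
   or end if the two buffers are equal over that range.
   Compile with e.g.: clang -O2 -march=armv8-a+sve -c find_mismatch.c */
unsigned find_mismatch_sve(const unsigned char *p1, const unsigned char *p2,
                           unsigned start, unsigned end) {
  uint64_t i = start;
  svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)end);
  while (svptest_any(svptrue_b8(), pg)) {
    svuint8_t a = svld1_u8(pg, (const uint8_t *)p1 + i);
    svuint8_t b = svld1_u8(pg, (const uint8_t *)p2 + i);
    svbool_t mismatch = svcmpne_u8(pg, a, b);
    if (svptest_any(pg, mismatch)) {
      /* Keep only the lanes before the first mismatch, then count them to
         recover the exact byte index. */
      svbool_t before = svbrkb_b_z(pg, mismatch);
      return (unsigned)(i + svcntp_b8(pg, before));
    }
    i += svcntb();                                 /* one full vector of bytes */
    pg = svwhilelt_b8_u64(i, (uint64_t)end);
  }
  return end;                                      /* no mismatch found */
}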
The hand-written SVE loop was then used as the basis of the new pass, which transforms the pattern when it is matched. During LLVM 18 development this was expanded to cover more variants of the pattern described above. The new pass allows us to generate the optimal form until LoopVectorise can vectorize multi-exit loops such as this one. The pass gives a 6-7% improvement on Neoverse V1 and around a 5% improvement on Neoverse V2.
Below is the output from the example above after the AArch64LoopIdiomTransform pass, built with clang -O3 -mcpu=neoverse-v1 -S example.c:
foo1_u32:
        ldr     w8, [x2]
        add     w8, w8, #1
        cmp     w8, w3
        b.hi    .LBB0_7
        // Runtime checks
        mov     w9, w3
        add     x10, x0, x8
        add     x11, x1, x8
        add     x12, x0, x9
        add     x13, x1, x9
        eor     x10, x10, x12
        eor     x11, x11, x13
        orr     x10, x10, x11
        cmp     x10, #4095
        b.hi    .LBB0_7
        whilelo p1.b, x8, x9
        rdvl    x10, #1
        ptrue   p0.b
        .p2align 5, , 16
.LBB0_3:                              // Main SVE loop
        ld1b    { z0.b }, p1/z, [x0, x8]
        ld1b    { z1.b }, p1/z, [x1, x8]
        cmpne   p1.b, p1/z, z0.b, z1.b
        b.ne    .LBB0_9
        add     x8, x8, x10
        whilelo p1.b, x8, x9
        b.mi    .LBB0_3
.LBB0_5:
        str     w3, [x2]
        ret
        .p2align 5, , 16
.LBB0_6:                              // Fallback scalar loop
        add     w8, w8, #1
        cmp     w3, w8
        b.eq    .LBB0_5
.LBB0_7:
        ldrb    w9, [x0, w8, uxtw]
        ldrb    w10, [x1, w8, uxtw]
        cmp     w9, w10
        b.eq    .LBB0_6
        mov     w3, w8
        str     w3, [x2]
        ret
.LBB0_9:                              // SVE code to find correct index of mismatch
        brkb    p0.b, p0/z, p1.b
        mov     w3, w8
        incp    x3, p0.b
        str     w3, [x2]
        ret
By Kyrylo Tkachov
The Cortex-A510 scheduling model has been added and is now used as the default scheduling model for AArch64. This brings in better scheduling for more modern cores, including much improved scheduling for SVE instructions.
By Maciej Gabka
LLVM 18 now supports automatic vectorization of popular standard math functions, making use of vector math libraries like ArmPL.
Starting with LLVM 18, clang can vectorize loops that contain calls to standard math functions. The vector variants do not modify the value of the errno variable, hence the need to explicitly disable errno handling by passing -fno-math-errno, which is implied when the -Ofast optimization level is used.
The following examples show how to use vector routines from the Arm Performance Libraries, available at https://developer.arm.com/downloads/-/arm-performance-libraries. Keep in mind that, in order to produce an executable binary, the user also needs to link against the libamath library (part of ArmPL).
Input source code:
#include <math.h>

void compute_sin(double * a, double * b, unsigned N) {
  for (unsigned i = 0; i < N; ++i) {
    a[i] = sin(b[i]);
  }
}
Vectorize using Neon instructions:
clang -fveclib=ArmPL -O2 -fno-math-errno compute_sin.c -S -o -

.LBB0_4:                              // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x23, #-16]
        str     q1, [sp, #16]         // 16-byte Folded Spill
        bl      armpl_vsinq_f64
        str     q0, [sp]              // 16-byte Folded Spill
        ldr     q0, [sp, #16]         // 16-byte Folded Reload
        bl      armpl_vsinq_f64
        ldr     q1, [sp]              // 16-byte Folded Reload
        subs    x25, x25, #4
        add     x23, x23, #32
        stp     q1, q0, [x24, #-16]
        add     x24, x24, #32
        b.ne    .LBB0_4
Vectorize using SVE instructions:
clang -fveclib=ArmPL -O2 -mcpu=neoverse-v1 -fno-math-errno compute_sin.c -S -o -

.LBB0_10:                             // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ld1d    { z0.d }, p4/z, [x20, x24, lsl #3]
        ld1d    { z16.d }, p4/z, [x25, x24, lsl #3]
        mov     p0.b, p4.b
        bl      armpl_svsin_f64_x
        mov     z17.d, z0.d
        mov     z0.d, z16.d
        mov     p0.b, p4.b
        bl      armpl_svsin_f64_x
        st1d    { z17.d }, p4, [x19, x24, lsl #3]
        st1d    { z0.d }, p4, [x26, x24, lsl #3]
        add     x24, x24, x23
        cmp     x22, x24
        b.ne    .LBB0_10
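As noted above, producing a runnable executable also requires linking against libamath. A possible compile-and-link line is sketched below; the ARMPL_DIR environment variable and the extra main.c driver file are assumptions about a typical ArmPL installation, not requirements of the toolchain.

clang -O2 -fveclib=ArmPL -fno-math-errno compute_sin.c main.c \
      -L"${ARMPL_DIR}/lib" -lamath -lm -o compute_sin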
By Kiran Chandramohan
One of the major features enabled for the LLVM 18 release is the HLFIR (High-Level Fortran IR) dialect flow. The HLFIR dialect retains more information from the Fortran source, including details of array expressions. Fortran array expressions must, in general, be evaluated into temporary buffers, but in many cases these buffers are not necessary, and removing them is essential for high performance. The HLFIR flow enables the removal of these buffers, and we participated in the community effort to enable it. The HLFIR flow provided huge benefits for 527.cam4.

We also modeled several Fortran intrinsics in the HLFIR dialect. Modeling intrinsics as operations helped combine the Matmul and Transpose intrinsics into a MatmulTranspose operation. A call to a dedicated runtime function for MatmulTranspose provided significant improvements for 178.galgel in SPEC2000.

Fortran procedure arguments do not alias by default, unlike in C/C++, yet LLVM alias analysis assumes that the arguments may alias. We added a pass and other improvements to Flang to provide more alias information to LLVM, which gave significant speedups for 549.fotonik3d_r and a few other benchmarks.
Another performance feature we worked on was enabling the vector library for Flang through the -fveclib= option. This allows the vectorization of loops containing arithmetic functions by calling their vector variants in the ArmPL library. The feature provided significant benefits for 521.wrf_r and 527.cam4_r and minor benefits for other benchmarks. All of these features combined give LLVM 18 a 17.5% improvement over LLVM 17 on the fprate Fortran benchmarks.
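For reference, a hypothetical Flang invocation enabling these vector routines could look like the line below; flang-new is the Flang driver binary shipped with LLVM 18, while the source file name and the libamath link step are our assumptions.

flang-new -O3 -fveclib=ArmPL compute.f90 -lamath -lm -o compute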
We also added support for new driver flags for vscale-range and frame-pointer; this work also enabled them as attributes in the MLIR LLVM dialect. The addition of vscale-range information provided speedups for 548.exchange2. We also added some intrinsics to improve language conformance and to make sure that commonly used extensions which are not part of the standard also work with Flang. This includes the Fortran 2008 execute_command_line intrinsic, which can be used to run any system command from a Fortran program. We also added support for the non-standard system intrinsic, which is similar to execute_command_line, as well as the fdate, getlog, and getpid intrinsic extensions. Some work was carried out to better support the different Windows CRTs (C run-time libraries); there is now a selection flag (-fms-runtime-lib=) to choose which CRT to use.
By Andrzej Warzynski
The support for scalable vectors and SVE has improved significantly since the previous release of LLVM. In particular, the vectorizer for the Linalg dialect has gained support for scalable auto-vectorization, which means that it can target SVE and SVE2 using the VLA (vector-length agnostic) programming paradigm. At the moment, this support is limited to linalg.generic and a few other named operations, for example linalg.matmul.
Building on top of this, we have also added support for targeting SME so that every Machine Learning framework that lowers with the Linalg dialect can leverage SME for matrix multiplication operations. In future releases of LLVM we will extend this work to other Linalg Ops, improve the quality and the performance of the generated code, and add support for scalable vectors in other areas of MLIR. Note that this is work-in-progress and should be used for testing and evaluation purposes only.
Read LLVM 18 Part 2