LLVM 16 was announced on March 17, 2023. As usual, Arm added support for new architectures and CPUs, along with significant performance improvements. This time around, we also brought exciting new functionality such as function multi-versioning and full support for strict floating-point, and several existing features have been improved. llvm-objdump is now a better substitute for GNU objdump. We fixed support for the older Armv4 architecture, and improvements to the Fortran front-end mean that we can now build SPEC2017.
Many thanks to all the people who contributed content to this blog post. Most notably:
If you want to know more about the previous release, you can read the blog about what is new in LLVM 15.
LLVM now supports the Armv8.9-A and Armv9.4-A extensions. You can learn more about the new extensions in the announcement blog.
In addition to the standard support for this year's architecture updates, we completed assembly support for the Scalable Matrix Extension (SME and SME2). On the CPU side, this release extends the line-up of Armv9-A cores with support for our Cortex-A715 and Cortex-X3 CPUs.
Assembly and disassembly are now available for all extensions except Guarded Call Stacks (GCS), which will be supported in the next LLVM release. The Arm C Language Extensions (ACLE) have also been extended with two new intrinsics, __rsr128 and __wsr128. These intrinsics make the new 128-bit System registers easier to access and are now supported in LLVM.
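As a minimal sketch of how this looks in source code (assuming the ACLE spellings __arm_rsr128 and __arm_wsr128 from <arm_acle.h>; the register name below is purely illustrative):

#include <arm_acle.h>

// Read and write a 128-bit System register through the ACLE intrinsics.
// "ttbr0_el1" is only an illustrative register name here.
__uint128_t read_example(void) {
    return __arm_rsr128("ttbr0_el1");
}

void write_example(__uint128_t value) {
    __arm_wsr128("ttbr0_el1", value);
}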
The Translation Hardening Extension (THE) is one of the main security improvements coming with Armv9.4-A and it is part of the Virtual Memory System Architecture (VMSA). Its purpose is to prevent arbitrary changes to the virtual memory's translation tables in situations where an attacker has gained kernel privileges. The new Read-Check-Write (RCW) instructions have been added to the architecture to allow controlled modification of such tables while disabling ordinary writes.
Even though these are intended for kernel rather than user-space developers, the RCW instructions map nicely to various atomic operations on 128-bit datatypes in C++. More specifically, fetch_and, fetch_or, and exchange can be implemented directly with these instructions. This functionality is useful for anyone using atomics, so we added code generation support in LLVM 16. On targets where the LRCPC3 and LSE128 extensions are also available, these specialized instructions are generated directly from C++ code without the need for assembly or intrinsics. The following code is an example for std::atomic::fetch_and:
#include <atomic>

std::atomic<__uint128_t> global;
void sink(__uint128_t);

void ldclrpal_example(__uint128_t x) {
    __uint128_t res = global.fetch_and(x);
    sink(res);
}

void ldclrp_example(__uint128_t x) {
    __uint128_t res = global.fetch_and(x, std::memory_order_relaxed);
    sink(res);
}
Compiling with -march=armv9.4a+lse128+rcpc3 -O3, the resulting assembly shows the new instructions being generated:
ldclrpal_example(unsigned __int128):
        mvn     x1, x1
        mvn     x0, x0
        adrp    x8, global
        add     x8, x8, :lo12:global
        ldclrpal x0, x1, [x8]
        b       sink(unsigned __int128)
ldclrp_example(unsigned __int128):
        mvn     x1, x1
        mvn     x0, x0
        adrp    x8, global
        add     x8, x8, :lo12:global
        ldclrp  x0, x1, [x8]
        b       sink(unsigned __int128)
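The other 128-bit read-modify-write operations can be exercised in the same way. The following sketch is ours rather than verified compiler output, but with the same options we would expect fetch_or and exchange to lower to the corresponding LSE128 instructions instead of a compare-and-swap loop:

#include <atomic>

std::atomic<__uint128_t> flags;

// Expected to use the 128-bit atomic OR instruction.
__uint128_t set_bits(__uint128_t mask) {
    return flags.fetch_or(mask);
}

// Expected to use the 128-bit atomic swap instruction.
__uint128_t swap_all(__uint128_t value) {
    return flags.exchange(value);
}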
Nowadays, many platforms have a single-binary deployment model: each application is distributed through exactly one binary. This model makes it hard for developers to target multiple architectural features. To solve this problem, LLVM 16 provides a convenient way to target specific architectural features without the need to deal with feature detection and other details. This new feature is called function multi-versioning. A new macro __HAVE_FUNCTION_MULTI_VERSIONING is provided to detect the availability of the feature. If present, we can ask the compiler to generate multiple versions of the given function by marking it with __attribute__((target_clones(...))). The most appropriate version of the function is called at runtime.
In the example below, a function has been marked to be built for both Advanced SIMD (Neon) and SVE. The SVE version is used if SVE is available on the target.
#ifdef __HAVE_FUNCTION_MULTI_VERSIONING
__attribute__((target_clones("sve", "simd")))
#endif
float foo(float *a, float *b) {
    // ...
}
In some cases, developers want to provide different code for each feature. This is also possible by using __attribute__((target_version())). In the following example, we provide two versions of the same function. Again, the SVE version will be called if SVE is available. The __HAVE_FUNCTION_MULTI_VERSIONING macro allows writing code that is compatible with compilers both with and without function multi-versioning.
#include <stdio.h>

#ifdef __HAVE_FUNCTION_MULTI_VERSIONING
__attribute__((target_version("sve")))
static void foo(void) { printf("FMV uses SVE\n"); }
#endif

// this attribute is optional
// __attribute__((target_version("default")))
static void foo(void) { printf("FMV default\n"); }
This feature depends on compiler-rt (-rtlib=compiler-rt) and is enabled by default, but it can be disabled with the -mno-fmv flag. Be aware that function multi-versioning is still in beta. Feedback is very welcome on the ACLE specification, either by opening a new issue or by creating a pull request.
LLVM 16 includes support for the autovectorization of common operations on complex numbers. These use instructions available in the Advanced SIMD (Neon) and MVE instruction sets for the Armv8-A and Armv8-M architectures, respectively. For example, the code:
#include <complex.h>

#define N 512

void fma(_Complex float a[restrict N],
         _Complex float b[restrict N],
         _Complex float c[restrict N]) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}
results in the following assembly:
fma:                                    // @fma
        mov     x8, xzr
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        add     x9, x0, x8
        add     x10, x1, x8
        movi    v2.2d, #0000000000000000
        movi    v3.2d, #0000000000000000
        ldp     q1, q0, [x9]
        add     x9, x2, x8
        add     x8, x8, #32
        cmp     x8, #1, lsl #12         // =4096
        ldp     q5, q4, [x10]
        fcmla   v3.4s, v1.4s, v5.4s, #0
        fcmla   v2.4s, v0.4s, v4.4s, #0
        fcmla   v3.4s, v1.4s, v5.4s, #90
        fcmla   v2.4s, v0.4s, v4.4s, #90
        stp     q3, q2, [x9]
        b.ne    .LBB0_1
        ret
Note the use of the FCMLA instruction, which performs a fused-multiply-add vector operation with an optional complex rotation on vectors of complex numbers.
Specialization of functions has been enabled by default at all optimization levels when optimizing for speed. The optimization heuristics and compile-time cost of the pass have been improved, and it is now deemed generally beneficial enough to be enabled by default. This optimization particularly improves the 505.mcf_r benchmark in SPEC2017 intrate by about 10% on various AArch64 platforms, and it contributes to an estimated 3% geomean improvement of the SPEC2017 intrate C and C++ benchmarks on AArch64. Note that the SPEC2017 performance uplift is also aided by the tuning and default enablement of the SelectOpt pass and other advanced pattern recognition.
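As a rough illustration of what the pass does (our own example, not code from the benchmark), a function that is only ever called with particular constant arguments, such as function pointers, can be cloned and each clone optimized for its constant:

static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

// `combine` is only ever called with `add` or `mul`. Function specialization
// can create one copy of `combine` per constant callee, turning the indirect
// call into a direct (and inlinable) one in each copy.
static int combine(int (*op)(int, int), const int *v, int n) {
    int acc = v[0];
    for (int i = 1; i < n; i++)
        acc = op(acc, v[i]);
    return acc;
}

int sum(const int *v, int n)     { return combine(add, v, n); }
int product(const int *v, int n) { return combine(mul, v, n); }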
Autovectorization with SVE has been a very active area of development. For example, up until now, vectorization of pointers accessed in different branches of a conditional was very basic. Most of the time, the cost would be computed as too high. Now, the cost model of the vectorizer includes basic arithmetic on the pointer. This means the following code is now vectorized when it is profitable to do so:
void foo(float *dst, float *src, int *cond, long disp) {
    for (long i = 0; i < 1024; i++) {
        if (cond[i] != 0) {
            dst[i] = src[i];
        } else {
            dst[i] = src[i + disp];
        }
    }
}
That said, hitting the right circumstances to make vectorization profitable is tricky on a synthetic example, and the generated code is very long. If you want to see what the vectorized code looks like, you can tweak the cost model by compiling the previous example with -march=armv9-a -O3 -Rpass=loop-vectorize -mllvm -force-target-instruction-cost=1.
Vectorization of tail-folded loops has also been improved by reducing the need for explicit merging operations. For example, the following code:
float foo(float *a, float *b) {
    float sum = 0.0;
    for (int i = 0; i < 1024; ++i)
        sum += a[i] * b[i];
    return sum;
}
compiled with -march=armv9-a -Ofast -mllvm -sve-tail-folding=all shows that a predicated FMLA is now emitted:
.LLVM_15_LOOP:
        ld1w    { z2.s }, p1/z, [x0, x8, lsl #2]
        ld1w    { z3.s }, p1/z, [x1, x8, lsl #2]
        add     x8, x8, x10
        fmul    z2.s, z3.s, z2.s
        sel     z2.s, p1, z2.s, z0.s
        whilelo p1.s, x8, x9
        fadd    z1.s, z1.s, z2.s
        b.mi    .LLVM_15_LOOP

.LLVM_16_LOOP:
        ld1w    { z1.s }, p1/z, [x0, x8, lsl #2]
        ld1w    { z2.s }, p1/z, [x1, x8, lsl #2]
        add     x8, x8, x10
        fmla    z0.s, p1/m, z2.s, z1.s
        whilelo p1.s, x8, x9
        b.mi    .LLVM_16_LOOP
Also, vectorization of loops with reverse iteration counts is improved by reducing the need for explicit reverse operations. Take this loop as an example:
void foo(int *a, int *b, int *c) {
    for (int i = 1024; i >= 0; --i) {
        if (c[i] > 10)
            a[i] = b[i] + 5;
    }
}
Compiled with -march=armv9-a -O3, the LLVM 16 output no longer reverses the loaded data nor the predicate used for the conditional:
.LLVM_15_LOOP:
        ld1w    { z0.s }, p0/z, [x16, x9, lsl #2]
        ld1w    { z1.s }, p0/z, [x17, x9, lsl #2]
        rev     z0.s, z0.s
        rev     z1.s, z1.s
        cmpgt   p1.s, p0/z, z0.s, #10
        cmpgt   p2.s, p0/z, z1.s, #10
        rev     p1.s, p1.s
        rev     p2.s, p2.s
        ld1w    { z0.s }, p1/z, [x14, x9, lsl #2]
        ld1w    { z1.s }, p2/z, [x15, x9, lsl #2]
        add     z0.s, z0.s, #5          // =0x5
        add     z1.s, z1.s, #5          // =0x5
        st1w    { z0.s }, p1, [x12, x9, lsl #2]
        st1w    { z1.s }, p2, [x13, x9, lsl #2]
        sub     x9, x9, x10
        cmp     x18, x9
        b.ne    .LLVM_15_LOOP

.LLVM_16_LOOP:
        ld1w    { z0.s }, p0/z, [x13, x9, lsl #2]
        ld1w    { z1.s }, p0/z, [x14, x9, lsl #2]
        cmpgt   p1.s, p0/z, z0.s, #10
        cmpgt   p2.s, p0/z, z1.s, #10
        ld1w    { z0.s }, p1/z, [x15, x9, lsl #2]
        ld1w    { z1.s }, p2/z, [x16, x9, lsl #2]
        add     z0.s, z0.s, #5          // =0x5
        add     z1.s, z1.s, #5          // =0x5
        st1w    { z0.s }, p1, [x17, x9, lsl #2]
        st1w    { z1.s }, p2, [x18, x9, lsl #2]
        sub     x9, x9, x10
        cmp     x12, x9
        b.ne    .LLVM_16_LOOP
Other performance improvements to SVE on LLVM 16 include:
Last December, we met the milestone of all SPEC2017 Fortran rate benchmarks working at O3 with LLVM and Flang. The main focus has been on enabling the four benchmarks (521.wrf_r, 527.cam4_r, 549.fotonik3d_r, 554.roms_r) that were failing. One of the main improvements was removing the dependency on external complex math libraries by using the complex dialect.
Also, some performance has been gained by improving information sharing between the front-end and LLVM, and by improving support for fast math.
You can build Flang by passing -DLLVM_ENABLE_PROJECTS="flang;clang;mlir" to CMake. The Flang executable is called flang-new; make sure to pass the -flang-experimental-exec option to generate executables.
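As a rough sketch of what this looks like in practice (the paths, generator choice, and test source name are illustrative and will depend on your setup):

# Configure and build from an llvm-project checkout.
cmake -G Ninja -S llvm -B build \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLVM_ENABLE_PROJECTS="flang;clang;mlir"
ninja -C build flang-new

# Compile a Fortran program into an executable.
build/bin/flang-new -flang-experimental-exec -O3 hello.f90 -o hello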
Initially sparked by the Highway library, the target("<string>") attributes have seen some improvements in the latest Clang, aimed at bringing them in line with GCC's implementation.
The supported formats are now:
- arch=<arch>, which behaves like -march=arch+feature on the command line
- cpu=<cpu>, which behaves like -mcpu=cpu+feature
- tune=<cpu>, which behaves like -mtune
- +<feature>, +no<feature>, <feature>, and no-<feature>, which enable or disable individual features
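For illustration (a sketch of ours; the architecture, CPU, and feature names below are placeholders rather than recommendations), these formats can be used as follows:

// arch=<arch>: like -march on the command line
__attribute__((target("arch=armv8.2-a+sve"))) void f_arch(void) {}

// cpu=<cpu>: like -mcpu
__attribute__((target("cpu=cortex-a715"))) void f_cpu(void) {}

// tune=<cpu>: like -mtune
__attribute__((target("tune=cortex-a710"))) void f_tune(void) {}

// individual features, enabled or disabled
__attribute__((target("sve2"))) void f_feature(void) {}
__attribute__((target("no-sve2"))) void f_no_feature(void) {}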
Along with the changes above, the implementation of ACLE intrinsics has been modified so that they are no longer based on preprocessor macros. Instead, they are enabled based on the current target. This allows making intrinsics available in individual functions without requiring the entire file to be compiled for the same target. The following example illustrates the use of the attributes on a function sve2_log:
#include <math.h>
#include <arm_sve.h>

void base_log(float *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = log2f(src[i]);
}

void __attribute__((target("sve2")))
sve2_log(float *src, int *dst, int n) {
    int i = 0;
    svbool_t p = svwhilelt_b32(i, n);
    while (svptest_any(svptrue_b32(), p)) {
        svfloat32_t d = svld1_f32(p, src + i);
        svint32_t l = svlogb_f32_z(p, d);
        svst1_s32(p, dst + i, l);
        i += svcntw();   // advance by the number of 32-bit lanes
        p = svwhilelt_b32(i, n);
    }
}
In LLVM 16, the output of llvm-objdump for Arm targets has been improved for readability and correctness, making it a more suitable replacement for GNU objdump in LLVM-based toolchains.
Disassembly of big-endian object files now works correctly. Previously, each instruction word was accidentally byte-swapped and disassembled as something entirely different.
Also, unrecognized instructions encountered during disassembly are handled in a more useful manner. Previously, the disassembler would advance by just one byte and try again from an odd-numbered address. This policy makes sense on architectures with variable-length instructions, but never on Arm. The new behavior is to advance by a whole instruction, so that the rest of the file is likely to be disassembled correctly.
LLVM 16 includes other quality improvements on Arm architectures, including bug fixes around Thumb vs. Arm disassembly and .byte directives now emitting the correct byte. Instruction encodings are also printed more readably, making Arm and 32-bit Thumb easier to tell apart: Arm instructions are shown as a single 8-digit number, while 32-bit Thumb instructions are shown as two 4-digit numbers separated by a space.
Strict floating-point semantics have been implemented for AArch64. The clang command-line option -ffp-model=strict is now accepted on AArch64 targets instead of being ignored with a warning. Take this example where an FP division is executed only if it is safe to do so:
float fn(int n, float x, float y) {
    if (n == 0) {
        x += 1;
    } else {
        x += y / n;
    }
    return x;
}
On LLVM 15, compiling with -O2 resulted in the following generated code:
fn(int, float, float):                  // @fn(int, float, float)
        scvtf   s3, w0
        fmov    s2, #1.00000000
        cmp     w0, #0
        fdiv    s1, s1, s3
        fadd    s1, s1, s0
        fadd    s0, s0, s2
        fcsel   s0, s1, s0, ne
        ret
which executes both branches, including the divide, and selects the right result afterwards with the fcsel. Although the functionality of the code is preserved, this results in a spurious FE_DIVBYZERO floating-point exception when n==0. On LLVM 16, compiling with -O2 -ffp-model=strict results in the following code:
fn(int, float, float):                  // @fn(int, float, float)
        cbz     w0, .LBB0_2
        scvtf   s2, w0
        fdiv    s1, s1, s2
        fadd    s0, s0, s1
        ret
.LBB0_2:
        mov     w8, #1
        scvtf   s1, w8
        fadd    s0, s0, s1
        ret
where the two different branches of execution are kept separate, preventing the FP exception from happening.
As a result of supporting strict FP, the options -ftrapping-math and -frounding-math are now also accepted. -ftrapping-math ensures that the code does not introduce or remove side effects that could be caused by any kind of FP exception, including exceptions that software can detect asynchronously by inspecting the FPSR. Similarly, -frounding-math avoids applying optimizations that assume a specific FP rounding behavior.
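As an illustrative sketch of why this matters (our own example, not taken from LLVM's tests), consider code that changes the rounding mode at run time; with -frounding-math the compiler must not merge or reorder the additions across the fesetround calls:

#include <fenv.h>

#pragma STDC FENV_ACCESS ON

// The two additions observe different rounding modes, so they must not be
// combined into a single addition or moved past the fesetround calls.
double rounding_gap(double x) {
    fesetround(FE_DOWNWARD);
    double lo = x + 0.1;
    fesetround(FE_UPWARD);
    double hi = x + 0.1;
    fesetround(FE_TONEAREST);
    return hi - lo;
}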
LLD can now be used as a linker for Armv4 and Armv4T: it emits thunks compatible with these architectures instead of BX instructions, which are not available on Armv4, or BLX instructions, which are not available on either.
On a related note, support for compiler-rt built-ins was added for Armv4T, Armv5TE, and Armv6, unlocking runtime support for these architectures.
Thanks to this enabling work, it is now possible to have a full LLVM-based toolchain for these 32-bit Arm architectures. As a result, the Linux kernel has added support for building with Clang and LLD for these architectures, and Rust programs no longer need to depend on the GNU linker.