LLVM 18.1.0 was released on March 6th, 2024. Among multiple new features and improvements, Arm contributed support for the latest Armv9.5-A architecture as well as numerous performance and security enhancements further detailed below.
To find out more about the previous LLVM release, you can read the What is new in LLVM 17? blog post.
By Lucas Prates
Support for the Armv9.5-A extensions is now available in LLVM. You can learn more about the new extensions, notably Checked Pointer Arithmetic, 8-bit floating point (FP8), live VM migration, and others, in the announcement blog.
Assembly and disassembly support are now available for all the extensions introduced as part of the 2023 updates.
The PAC Enhancement (FEAT_PAuth_LR) is one of the main security improvements coming with Armv9.5-A. The extension is part of the Memory System Extensions and is designed to harden the PAC for return address signing by using the value of PC as a second diversifier. The new +pc modifier to the -mbranch-protection option has been introduced in LLVM 18 to enable the new FEAT_PAuth_LR instructions for return address signing. For example, when compiled with -march=armv9.5a+pauth-lr -mbranch-protection=pac-ret+pc, the code:
void example_leaf();

void example_function() {
  example_leaf();
}
Results in the following assembly:
example_function():
.Ltmp0:
        paciasppc
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
        bl      example_leaf()
        ldp     x29, x30, [sp], #16
        retaasppc .Ltmp0
Checked Pointer Arithmetic (FEAT_CPA) is the second major security improvement coming with Armv9.5-A. Part of the Memory Tagging Extensions, the extension is aimed at detecting and preventing modifications of bits [62:56] of virtual addresses during pointer manipulation based on user-controlled data. Assembly and disassembly support have been added for the new Checked Pointer Arithmetic instructions, and code generation will be supported in the next LLVM release.
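As a rough illustration of the assembly-level support, the snippet below is an assumed example based on the FEAT_CPA documentation rather than the LLVM release notes; the file name, register choices and exact instruction selection are ours.

// cpa_example.s -- assumed example of the new FEAT_CPA instructions.
// Assemble with something like:
//   clang --target=aarch64-linux-gnu -march=armv9.5a+cpa -c cpa_example.s
        addpt   x0, x1, x2              // checked pointer addition
        subpt   x0, x1, x2              // checked pointer subtraction
        maddpt  x0, x1, x2, x3          // checked multiply-add producing a pointer
        msubpt  x0, x1, x2, x3          // checked multiply-subtract producing a pointer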
By Jonathan Thackray
On the CPU side, this release extends the lineup of Armv9.2-A cores with support for our Cortex-A520, Cortex-A720 and Cortex-X4, and the Armv8.1-M Cortex-M52.
By Kyrylo Tkachov
Several optimizations have been added that benefit popular benchmarks such as SPEC2017. These include major improvements to code generation in Flang, a new loop idiom recognition pass, optimizations to vectorization checks, PGO improvements and more.
This leads to an overall geomean improvement of more than 10% in the estimated SPEC2017 intrate score.
By David Sherwood & Kerry McLaughlin
The AArch64LoopIdiomTransform pass was introduced in LLVM 18. It is intended to recognize common idioms that would benefit from vectorization but are not currently handled by the vectorizer. The idiom this pass currently targets is a simple loop that compares bytes of memory and breaks when a mismatch is found. The motivation for targeting this idiom was to improve the performance of the 557.xz_r workload in SPEC2017, which contains multiple occurrences of this pattern.
Below is a simple example of such a loop:
void foo1_u32(unsigned char *p1, unsigned char *p2, unsigned *startp, unsigned end) {
  unsigned start = *startp;
  while (++start != end)
    if (p1[start] != p2[start])
      break;
  *startp = start;
}
We first experimented with trying to improve the performance of xz by replacing the source with handwritten Neon and SVE loops where this pattern occurs. The best-performing form was found to be a predicated SVE loop using regular load instructions, with runtime memory checks before the loop that fall back on a scalar version if the memory accesses in the loop are found to cross a page boundary.
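For illustration, here is a rough sketch of what such a predicated SVE loop can look like when written by hand with the ACLE intrinsics from arm_sve.h. The function name and interface are ours, the runtime page-crossing checks mentioned above are omitted, and the start/end handling is simplified compared with the original loop; it returns the index of the first mismatching byte, mirroring the brkb/incp sequence visible in the generated assembly further below.

#include <arm_sve.h>
#include <stdint.h>

/* Hypothetical hand-written form of the byte-mismatch idiom using SVE ACLE
   intrinsics. Returns the index of the first mismatching byte in [start, end),
   or end if the two buffers are equal over that range.
   Compile with e.g.: clang -O2 -march=armv8-a+sve -c find_mismatch.c */
unsigned find_mismatch_sve(const unsigned char *p1, const unsigned char *p2,
                           unsigned start, unsigned end) {
  uint64_t i = start;
  svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)end);
  while (svptest_any(svptrue_b8(), pg)) {
    svuint8_t a = svld1_u8(pg, (const uint8_t *)p1 + i);
    svuint8_t b = svld1_u8(pg, (const uint8_t *)p2 + i);
    svbool_t mismatch = svcmpne_u8(pg, a, b);
    if (svptest_any(pg, mismatch)) {
      /* Keep only the lanes before the first mismatch, then count them to
         recover the exact byte index. */
      svbool_t before = svbrkb_b_z(pg, mismatch);
      return (unsigned)(i + svcntp_b8(pg, before));
    }
    i += svcntb();                                 /* one full vector of bytes */
    pg = svwhilelt_b8_u64(i, (uint64_t)end);
  }
  return end;                                      /* no mismatch found */
}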
The hand-written SVE loop was then used as the basis of the new pass, which transforms the pattern when it is matched. During LLVM 18 development this was expanded to cover more variants of the pattern described above. The new pass allows us to generate the optimal form until LoopVectorise can vectorize multi-exit loops such as this one. The pass gives a 6-7% improvement on Neoverse V1 and around a 5% improvement on Neoverse V2.
Below is the output from the example above after the AArch64LoopIdiomTransform pass, built with clang -O3 -mcpu=neoverse-v1 -S example.c:
foo1_u32:
        ldr     w8, [x2]
        add     w8, w8, #1
        cmp     w8, w3
        b.hi    .LBB0_7
        // Runtime checks
        mov     w9, w3
        add     x10, x0, x8
        add     x11, x1, x8
        add     x12, x0, x9
        add     x13, x1, x9
        eor     x10, x10, x12
        eor     x11, x11, x13
        orr     x10, x10, x11
        cmp     x10, #4095
        b.hi    .LBB0_7
        whilelo p1.b, x8, x9
        rdvl    x10, #1
        ptrue   p0.b
        .p2align 5, , 16
.LBB0_3:                              // Main SVE loop
        ld1b    { z0.b }, p1/z, [x0, x8]
        ld1b    { z1.b }, p1/z, [x1, x8]
        cmpne   p1.b, p1/z, z0.b, z1.b
        b.ne    .LBB0_9
        add     x8, x8, x10
        whilelo p1.b, x8, x9
        b.mi    .LBB0_3
.LBB0_5:
        str     w3, [x2]
        ret
        .p2align 5, , 16
.LBB0_6:                              // Fallback scalar loop
        add     w8, w8, #1
        cmp     w3, w8
        b.eq    .LBB0_5
.LBB0_7:
        ldrb    w9, [x0, w8, uxtw]
        ldrb    w10, [x1, w8, uxtw]
        cmp     w9, w10
        b.eq    .LBB0_6
        mov     w3, w8
        str     w3, [x2]
        ret
.LBB0_9:                              // SVE code to find correct index of mismatch
        brkb    p0.b, p0/z, p1.b
        mov     w3, w8
        incp    x3, p0.b
        str     w3, [x2]
        ret
By Kyrylo Tkachov
The Cortex-A510 scheduling model has been added and is now used as the default scheduling model for AArch64. This brings in better scheduling for more modern cores, including much improved scheduling for SVE instructions.
By Maciej Gabka
LLVM 18 now supports automatic vectorization of popular standard math functions, making use of vector math libraries like ArmPL.
Starting with LLVM 18, clang can vectorize loops that contain calls to standard math functions. The vector variants do not modify the value of the errno variable, hence the need to explicitly disable errno handling by passing -fno-math-errno, which is implied when the -Ofast optimization level is used.
The following examples show how to use vector routines from the Arm Performance Libraries, available at https://developer.arm.com/downloads/-/arm-performance-libraries. Keep in mind that, in order to produce an executable binary, the user also needs to link against the libamath library (part of ArmPL).
Input source code:
#include <math.h>

void compute_sin(double * a, double * b, unsigned N) {
  for (unsigned i = 0; i < N; ++i) {
    a[i] = sin(b[i]);
  }
}
Vectorize using Neon instructions:
clang -fveclib=ArmPL -O2 -fno-math-errno compute_sin.c -S -o -

.LBB0_4:                              // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ldp     q0, q1, [x23, #-16]
        str     q1, [sp, #16]         // 16-byte Folded Spill
        bl      armpl_vsinq_f64
        str     q0, [sp]              // 16-byte Folded Spill
        ldr     q0, [sp, #16]         // 16-byte Folded Reload
        bl      armpl_vsinq_f64
        ldr     q1, [sp]              // 16-byte Folded Reload
        subs    x25, x25, #4
        add     x23, x23, #32
        stp     q1, q0, [x24, #-16]
        add     x24, x24, #32
        b.ne    .LBB0_4
Vectorize using SVE instructions:
clang -fveclib=ArmPL -O2 -mcpu=neoverse-v1 -fno-math-errno compute_sin.c -S -o -

.LBB0_10:                             // %vector.body
                                      // =>This Inner Loop Header: Depth=1
        ld1d    { z0.d }, p4/z, [x20, x24, lsl #3]
        ld1d    { z16.d }, p4/z, [x25, x24, lsl #3]
        mov     p0.b, p4.b
        bl      armpl_svsin_f64_x
        mov     z17.d, z0.d
        mov     z0.d, z16.d
        mov     p0.b, p4.b
        bl      armpl_svsin_f64_x
        st1d    { z17.d }, p4, [x19, x24, lsl #3]
        st1d    { z0.d }, p4, [x26, x24, lsl #3]
        add     x24, x24, x23
        cmp     x22, x24
        b.ne    .LBB0_10
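As noted above, producing a runnable executable also requires linking against libamath. A possible compile-and-link line is sketched below; the ARMPL_DIR environment variable and the extra main.c driver file are assumptions about a typical ArmPL installation, not requirements of the toolchain.

clang -O2 -fveclib=ArmPL -fno-math-errno compute_sin.c main.c \
      -L"${ARMPL_DIR}/lib" -lamath -lm -o compute_sin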
By Kiran Chandramohan
One of the major features enabled for the LLVM 18 release is the HLFIR (High-Level Fortran IR) dialect flow. The HLFIR dialect retains more information from the Fortran source, including details of array expressions. Fortran array expressions must, in general, be evaluated into temporary buffers, but in many cases these buffers are not necessary, and removing them is essential for high performance. The HLFIR flow enables the removal of these buffers, and we participated in the community effort to enable it. The HLFIR flow provided huge benefits for 527.cam4.

We also modeled several Fortran intrinsics in the HLFIR dialect. Modeling intrinsics as operations helped combine the Matmul and Transpose intrinsics into a MatmulTranspose operation. A call to a dedicated runtime function for MatmulTranspose provided significant improvements for 178.galgel in SPEC2000.

Fortran procedure arguments do not alias by default, unlike in C/C++, yet LLVM alias analysis assumes that the arguments may alias. We added a pass and other improvements to Flang to provide more alias information to LLVM, which gave significant speedups for 549.fotonik3d_r and a few other benchmarks.
Another performance feature we worked on was enabling the vector library for Flang through the -fveclib= option. This allows the vectorization of loops containing arithmetic functions by calling their vector variants in the ArmPL library. The feature provided significant benefits for 521.wrf_r and 527.cam4_r and minor benefits for other benchmarks. All of these features combined give LLVM 18 a 17.5% improvement over LLVM 17 on the fprate Fortran benchmarks.
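For reference, a hypothetical Flang invocation enabling these vector routines could look like the line below; flang-new is the Flang driver binary shipped with LLVM 18, while the source file name and the libamath link step are our assumptions.

flang-new -O3 -fveclib=ArmPL compute.f90 -lamath -lm -o compute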
We also added support for new driver flags for vscale-range and frame-pointer; this work also enabled them as attributes in the MLIR LLVM dialect. The addition of vscale-range information provided speedups for 548.exchange2. We also added some intrinsics to improve language conformance and to make sure that commonly used extensions which are not part of the standard also work with Flang. This includes the Fortran 2008 execute_command_line intrinsic, which can be used to run any system command from a Fortran program. We also added support for the non-standard system intrinsic, which is similar to execute_command_line, as well as the fdate, getlog, and getpid intrinsic extensions. Some work was carried out to better support the different Windows CRTs (C run-time libraries); there is now a selection flag (-fms-runtime-lib=) to choose which CRT to use.
By Andrzej Warzynski
The support for scalable vectors and SVE has improved significantly since the previous release of LLVM. In particular, the vectorizer for the Linalg dialect has gained support for scalable auto-vectorization, which means that it can target SVE and SVE2 using the VLA (vector-length agnostic) programming paradigm. At the moment, this support is limited to linalg.generic and a few other named operations, for example linalg.matmul.
Building on top of this, we have also added support for targeting SME so that every Machine Learning framework that lowers with the Linalg dialect can leverage SME for matrix multiplication operations. In future releases of LLVM we will extend this work to other Linalg Ops, improve the quality and the performance of the generated code, and add support for scalable vectors in other areas of MLIR. Note that this is work-in-progress and should be used for testing and evaluation purposes only.
Read LLVM 18 Part 2