What is new in LLVM 20?

April 29, 2025

8 minute read time.

LLVM 20.1.0 was released on March 11th, 2025. Arm teams focused on integrating support making sure LLVM has the latest Arm architecture support as well as gets best performance out of it. Below is the summary of key contributions to LLVM 20.

To find out more about the previous LLVM release, read What is new in LLVM 19?

New architecture and CPU support

Armv9.6-A architecture support

By Maciej Gabka

LLVM 20 implements full assembly support and relevant command line options for the Armv9.6-A architecture. Learn more about the new extensions in the annual announcement blog post. The new architecture extensions implemented in LLVM are documented also on the Armv9.6-A architecture extension page. Moreover, LLVM 20 assembler is compatible with the December 2024 release of the architecture XML published as part of the Exploration Tools, the announced changes to architecture are documented in the release notes. If you are interested in the full instruction set, see the December 2024 release of Arm A-profile A64 Instruction Set Architecture.

Scalable Vector Extension 2.1 (SVE2.1) ACLE support

By Maciej Gabka

LLVM 20 includes support for the beta version of the SVE2.1 and Scalable Matrix Extension 2.1 (SME2.1) ACLE which are now available in the ACLE Q4 2024 specification release.

It enables LLVM users to generate instructions for those architecture features directly from C and C++ code.

SME ACLE ZA-agnostic support

By Sander De Smalen

To avoid the cost of setting up the lazy save scheme around calls to functions that don't use SME ZA state, the __arm_agnostic keyword attribute was defined in the Q4 2024 ACLE. This enables the user to specify that the function will preserve all SME ZA state if it exists. In practice this means such functions can be called from any function regardless of their ZA state without incurring the cost of a lazy-save. More details can be found in the clang attributes documentation.

Example:

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// `my_compatible_fn` can be called in any context using the `__arm_agnostic`
// and `__arm_streaming_compatible` attributes.
void my_compatible_fn(void) 
        __arm_streaming_compatible
        __arm_agnostic("sme_za_state");
 
void init_sme(void) __arm_streaming __arm_out("za", "zt0");
void use_sme(void) __arm_streaming __arm_inout("za", "zt0");
 
__arm_new("za", "zt0")
__attribute__((target("armv9-a+sme")))
void some_sme_function(void) __arm_streaming {
  init_sme();
  my_compatible_fn();  // no lazy-save is required
  use_sme();
}
 
__attribute__((target("armv8-a")))
void some_non_sme_function(void) {
  my_compatible_fn();  // The caller doesn't require +sme to call a __arm_agnostic("sme_za_state")
                       // or __arm_streaming_compatible function.
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

// `my_compatible_fn` can be called in any context using the `__arm_agnostic`
// and `__arm_streaming_compatible` attributes.
void my_compatible_fn(void) 
        __arm_streaming_compatible
        __arm_agnostic("sme_za_state");
 
void init_sme(void) __arm_streaming __arm_out("za", "zt0");
void use_sme(void) __arm_streaming __arm_inout("za", "zt0");
 
__arm_new("za", "zt0")
__attribute__((target("armv9-a+sme")))
void some_sme_function(void) __arm_streaming {
  init_sme();
  my_compatible_fn();  // no lazy-save is required
  use_sme();
}
 
__attribute__((target("armv8-a")))
void some_non_sme_function(void) {
  my_compatible_fn();  // The caller doesn't require +sme to call a __arm_agnostic("sme_za_state")
                       // or __arm_streaming_compatible function.
}

Performance improvements

Spec 2017 performance improvements

By Kiran Chandramohan

A set of LLVM transformations developed by our team has delivered significant performance improvements across multiple SPEC2017 benchmarks.

Canonicalizing getelementptr (GEP) instructions with constant base pointers and offsets improved performance by over 3% in the 548.exchange2_r benchmark by enabling better constant folding and shared offset computation. Enhancements to the Superword Level Parallelism (SLP) vectorizer, particularly in load instruction ordering and clustering, yielded a 3% improvement in 525.x264_r.

Loop-Invariant Code Motion (LICM) was improved by teaching Clang to emit nuw (no unsigned wrap) GEPs and enhancing BasicAA to leverage them for alias analysis, enabling better hoisting/sinking of struct field updates and delivering a 0.89% gain in 557.xz_r. Function specialization for constants passed by reference, commonly seen in Flang code, significantly boosted performance in the 503.bwaves_r benchmark. Additionally, the AArch64 backend was optimized to avoid generating expensive ld4 instructions for uitofp(ld4) patterns, preferring more efficient and/shift operations, resulting in a 22% gain in 538.imagick_r.

Overall, these and other community-driven improvements led to a 3% uplift in SPEC2017 floating-point rate (fprate) and a 1% gain in integer rate (intrate) on Graviton 4.

Our team also improved the scheduling model for fdiv and square root instructions. These improvements provided a 2.5% improvement for 519.lbm_r.

Code generation improvements

Function multiversioning improvements

By Alexandros Lamprineas

Function multiversioning (FMV) support has been added in LLVM 16, in this release the following improvements were made based on the feedback from users:

We have removed FMV features without compiler support and consolidated others following the ACLE specification.
We now generate FMV feature dependencies using TableGen.
We have added runtime detection for depend-on FMV features.
We have introduced new metadata info for multi-versioned functions.
We have added support in GlobalOpt for optimizing calls that can be statically resolved, without routing them through an ifunc resolver function.

Better SVE code generation for add reduction loops

By Samuel Tebbs

LLVM can now generate different variants of NEON dot product instructions (udot, sdot and usdot). For example, the following code sample:

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#include <stdint.h>
 
uint32_t udot(uint8_t *a, uint8_t *b, unsigned N) {
    uint32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
        total += a[i] * b[i];
    }
    return total;
}
 
int32_t sdot(int8_t *a, int8_t *b, unsigned N) {
    int32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
        total += a[i] * b[i];
    }
    return total;
}
 
int32_t usdot(uint8_t *a, int8_t *b, unsigned N) {
    int32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

#include <stdint.h>
 
uint32_t udot(uint8_t *a, uint8_t *b, unsigned N) {
    uint32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
        total += a[i] * b[i];
    }
    return total;
}
 
int32_t sdot(int8_t *a, int8_t *b, unsigned N) {
    int32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
        total += a[i] * b[i];
    }
    return total;
}
 
int32_t usdot(uint8_t *a, int8_t *b, unsigned N) {
    int32_t total = 0;
    for (unsigned i = 0; i < N; i++) {
        total += a[i] * b[i];
    }
    return total;
}


compiled with -O3 -march=armv8-a+dotprod+i8mm flags to:

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
udot:                                   // @udot
        [...]
.LBB0_11:                               // %vector.body
        ldr     q1, [x13], #16
        ldr     q2, [x12], #16
        subs    x8, x8, #16
        udot    v0.4s, v2.16b, v1.16b
        [...]
 
 
sdot:                                   // @sdot
        [...]
.LBB1_11:                               // %vector.body
        ldr     q1, [x13], #16
        ldr     q2, [x12], #16
        subs    x8, x8, #16
        sdot    v0.4s, v2.16b, v1.16b
        [...]
 
usdot:                                  // @usdot
        [...]
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

udot:                                   // @udot
        [...]
.LBB0_11:                               // %vector.body
        ldr     q1, [x13], #16
        ldr     q2, [x12], #16
        subs    x8, x8, #16
        udot    v0.4s, v2.16b, v1.16b
        [...]
 
 
sdot:                                   // @sdot
        [...]
.LBB1_11:                               // %vector.body
        ldr     q1, [x13], #16
        ldr     q2, [x12], #16
        subs    x8, x8, #16
        sdot    v0.4s, v2.16b, v1.16b
        [...]
 
usdot:                                  // @usdot
        [...]
.LBB2_11:                               // %vector.body
        ldr q1, [x13], #16
        ldr q2, [x12], #16
        subs x8, x8, #16
usdot v0.4s, v1.16b, v2.16b
        [...]


SVE dot products require the addition of -mllvm -vectorizer-maximize-bandwidth to the compile command and further tweaks to the cost model are required for SVE dot products to be produced in all cases:

clang -O3 -march=armv8-a+sve+i8mm -mllvm -vectorizer-maximize-bandwidth=1 -mllvm -force-target-instruction-cost=1

There is an ongoing effort in the LLVM community to go further and reduce the need to use extra flags.

Support for SVE2 complex dot-product vectorization

By Nicholas Guy

Using some of the functionality added to LLVM to enable generation of the dot product instructions, the complex deinterleaving pass (introduced in LLVM 16) was extended to support complex dot product instructions found in the SVE2 instruction set.

Current support only extends to cdot operations where the rotation would be either 90° or 180°. Support for 0° and 270° (and thus full complex dot product) operations will be implemented in a future LLVM release.

The following C++ snippet performs a partial complex dot product, with a rotation of 180°:

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#include<stdint.h>
 
struct cint16_t {
    int8_t real;
    int8_t imag;
};
 
int32_t cdot(cint16_t *a, cint16_t *b, uint32_t N) {
    int32_t res = 0;
    #pragma clang loop vectorize_width(16, scalable) unroll(disable)
    for(uint32_t i = 0; i < N; i++)
        res += (a[i].real * b[i].real) + (a[i].imag * b[i].imag);
    return res;
}
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

#include<stdint.h>
 
struct cint16_t {
    int8_t real;
    int8_t imag;
};
 
int32_t cdot(cint16_t *a, cint16_t *b, uint32_t N) {
    int32_t res = 0;
    #pragma clang loop vectorize_width(16, scalable) unroll(disable)
    for(uint32_t i = 0; i < N; i++)
        res += (a[i].real * b[i].real) + (a[i].imag * b[i].imag);
    return res;
}

when compiled with -O2 -march=armv9-a, compiler will generate the following assembly for the loop body:

Fullscreen

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x12, x0, x8
        ld1b    { z2.b }, p0/z, [x0, x8]
        ld1b    { z4.b }, p0/z, [x1, x8]
        addvl   x11, x11, #-1
        ldr     z3, [x12, #1, mul vl]
        add     x12, x1, x8
        addvl   x8, x8, #2
        ldr     z5, [x12, #1, mul vl]
        cdot    z0.s, z2.b, z4.b, #180
        cdot    z1.s, z3.b, z5.b, #180
        cbnz    x11, .LBB0_5
// %bb.6:                               // %middle.block
        add     z0.s, z0.s, z1.s
        ptrue   p0.s
        uaddv   d0, p0, z0.s
        fmov    w8, s0
        cmp     x10, x9
        b.eq    .LBB0_9
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

.LBB0_5:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x12, x0, x8
        ld1b    { z2.b }, p0/z, [x0, x8]
        ld1b    { z4.b }, p0/z, [x1, x8]
        addvl   x11, x11, #-1
        ldr     z3, [x12, #1, mul vl]
        add     x12, x1, x8
        addvl   x8, x8, #2
        ldr     z5, [x12, #1, mul vl]
        cdot    z0.s, z2.b, z4.b, #180
        cdot    z1.s, z3.b, z5.b, #180
        cbnz    x11, .LBB0_5
// %bb.6:                               // %middle.block
        add     z0.s, z0.s, z1.s
        ptrue   p0.s
        uaddv   d0, p0, z0.s
        fmov    w8, s0
        cmp     x10, x9
        b.eq    .LBB0_9

Base loop vectorization support for SVE2 histcnt instruction

By Graham Hunter

LLVM is now able to vectorize some loops containing histogram-like operations using the histcnt instruction from SVE2.

Given a loop like the following (with the restrict so that the compiler knows the buckets don't alias with the indices):

Fullscreen

1
2
3
4
void simple_histogram(int *restrict buckets, unsigned *indices, int N, int inc) {
  for (int i = 0; i < N; ++i)
    buckets[indices[i]] += inc;
}
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

void simple_histogram(int *restrict buckets, unsigned *indices, int N, int inc) {
  for (int i = 0; i < N; ++i)
    buckets[indices[i]] += inc;
}

If we compile that using the following optimization flags: -O3 -march=armv9-a -mllvm -force-vector-interleave=1 -mllvm -enable-histogram-loop-vectorization then we get a loop using histcnt:

Fullscreen

1
2
3
4
5
6
7
8
9
10
.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
    ld1w    { z1.s }, p0/z, [x1, x10, lsl #2]
    incw    x10
    ld1w    { z2.s }, p0/z, [x0, z1.s, uxtw #2]
    histcnt z3.s, p0/z, z1.s, z1.s
    mla z2.s, p0/m, z3.s, z0.s
    st1w    { z2.s }, p0, [x0, z1.s, uxtw #2]
    cmp x9, x10
    b.ne    .LBB0_4
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

.LBB0_4:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
    ld1w    { z1.s }, p0/z, [x1, x10, lsl #2]
    incw    x10
    ld1w    { z2.s }, p0/z, [x0, z1.s, uxtw #2]
    histcnt z3.s, p0/z, z1.s, z1.s
    mla z2.s, p0/m, z3.s, z0.s
    st1w    { z2.s }, p0, [x0, z1.s, uxtw #2]
    cmp x9, x10
    b.ne    .LBB0_4

We hope to enable this transformation by default in a later LLVM release, once the cost model is tested.

Tools improvements

Flang

By Kiran Chandramohan

Recent updates to Flang include expanded driver support for flags such as -B, config file options, loop unrolling (-f[no-]unroll-loops), OpenMP disabling (-fno-openmp), zero-initialization of global variables, and automatic ArmPL linkage with -fveclib=ArmPL. New intrinsics have been added, including GETUID, GETGID, MALLOC, FREE (for Cray pointers), a functional form of the SYSTEM intrinsic, and the non-standard LNBLNK.

Parser and semantics support was introduced for various OpenMP constructs such as scope, dispatch, error directives, map and declare mapper, atomic compare, and clauses including master, align, fail, grainsize, and num_tasks. These currently generate TODO messages during lowering. On the lowering and code generation front, support was added for pointers and allocatables in the copying clause, along with improvements to privatization and reduction handling. Several bug fixes were also made to strengthen OpenMP support in Flang.

In Fortran, marking a derived type with the VALUE attribute in a BIND(C) interface specifies that the argument or return value should be passed by value, akin to a C struct. Flang’s target-specific code has been updated to support both passing and returning such values in compliance with the AArch64 ABI, ensuring correct behavior and adherence to platform calling conventions.

Try LLVM 20 version from the upstream LLVM project https://github.com/llvm/llvm-project.

Arm Toolchain project https://github.com/arm/arm-toolchain toolchains for Embedded and Linux use cases.

0 comments
0 members are here

Tools, Software and IDEs blog

GitHub and Arm are transforming development on Windows for developers

Pareena Verma

Develop, test, and deploy natively on Windows on Arm with GitHub-hosted Arm runners—faster CI/CD, AI tooling, and full dev stack, no emulation needed.
- May 20, 2025
What is new in LLVM 20?

Volodymyr Turanskyy

Discover what's new in LLVM 20, including Armv9.6-A support, SVE2.1 features, and key performance and code generation improvements.
- April 29, 2025
Running KleidiAI MatMul kernels in a bare-metal Arm environment

Paul Black

Benchmarking Arm®︎ KleidiAI MatMul kernels on bare-metal with AC6, GCC, and ATfE compilers.
- April 17, 2025

AI blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded and Microcontrollers blog

Internet of Things (IoT) blog

Laptops and Desktops blog

Mobile, Graphics, and Gaming blog

Operating Systems blog

Servers and Cloud Computing blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

What is new in LLVM 20?

New architecture and CPU support

Armv9.6-A architecture support

Scalable Vector Extension 2.1 (SVE2.1) ACLE support

SME ACLE ZA-agnostic support

Performance improvements

Spec 2017 performance improvements

Code generation improvements

Function multiversioning improvements

Better SVE code generation for add reduction loops

Support for SVE2 complex dot-product vectorization

Base loop vectorization support for SVE2 histcnt instruction

Tools improvements

Flang

GitHub and Arm are transforming development on Windows for developers

What is new in LLVM 20?

Running KleidiAI MatMul kernels in a bare-metal Arm environment