LLVM 15.0.0 was released on September 6, followed by a series of minor bug-fixing releases. In addition to the regular architecture enablement work, Arm contributed several pieces of new functionality, including support for frame chains in AArch32, SVE whole-loop scalable auto-vectorization, and C/C++ operator support for ACLE types. The release also benefits from numerous performance improvements, both for A-profile and M-profile cores.
LLVM 15 expands support for the A-profile architecture with the new Armv8.8-A and Armv9.3-A extensions. The most relevant change from a toolchain point of view is the addition of new instructions to improve the performance and portability of memcpy() and memset(). These can now be accessed through a series of ACLE intrinsics of the form __builtin_arm_mops*(). For more information on the full extensions, refer to the Arm A-Profile Architecture Developments 2021 blog post.
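As a minimal sketch (not taken from the release notes), the example below uses one of these builtins, __builtin_arm_mops_memset_tag(), which performs a MOPS-based memset while also updating MTE allocation tags. It assumes a target with both the MOPS and MTE extensions enabled, for example -march=armv8.8-a+memtag; refer to the ACLE for the full list of intrinsics and their exact semantics.

#include <stddef.h>

/* Sketch only: zero 'size' bytes at 'buf' and update the MTE allocation
   tags using the new memcpy/memset (MOPS) instructions. Assumes the
   target enables both MOPS and MTE, e.g. -march=armv8.8-a+memtag. */
void *clear_and_tag(void *buf, size_t size) {
  return __builtin_arm_mops_memset_tag(buf, 0, size);
}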
Another addition worth noting is support for the Arm Cortex-M85 processor. This is the highest-performance M-profile CPU to date, and is also the first CPU to support the optional PACBTI security extension. More information on the Cortex-M85 is available here. You can build for this CPU by adding -mcpu=cortex-m85 to your command line.
LLVM 15 bundles several performance improvements. One of the main areas of work has been vectorization with SVE.
The new SVE tail-folding support allows the vectorizer to handle all iterations within the vectorized loop, removing the need for a scalar epilogue loop. Consider the following example:
void foo(int* __restrict__ dst, int* __restrict__ src, int N) {
  #pragma nounroll
  for (int i = 0; i < 12; i++) {
    dst[i] = 1000 * src[i];
  }
}
Compile with -O3 -g0 -march=armv9-a.
Before LLVM 15, without tail folding, the compiler would refuse to vectorize:
foo(int*, int*, int):                   // @foo(int*, int*, int)
        mov     x8, xzr
        mov     w9, #1000
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ldr     w10, [x1, x8]
        mul     w10, w10, w9
        str     w10, [x0, x8]
        add     x8, x8, #4
        cmp     x8, #48
        b.ne    .LBB0_1
        ret
In LLVM 15, thanks to tail folding, vectorization no longer requires a scalar epilogue. LLVM estimates the vector version of this loop to be faster than the scalar one, and the loop is successfully vectorized:
foo(int*, int*, int):                   // @foo(int*, int*, int)
        mov     w9, #12
        mov     w11, #1000
        mov     x8, xzr
        cntw    x10
        whilelo p0.s, xzr, x9
        mov     z0.s, w11
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ld1w    { z1.s }, p0/z, [x1, x8, lsl #2]
        mul     z1.s, z1.s, z0.s
        st1w    { z1.s }, p0, [x0, x8, lsl #2]
        add     x8, x8, x10
        whilelo p0.s, x8, x9
        b.mi    .LBB0_1
        ret
Tail folding is enabled by default only in certain circumstances, such as loops with known trip counts. Otherwise, it can be enabled with the option -mllvm -sve-tail-folding=<option>.
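As an illustration (a sketch, not taken from the original example), a variant of the loop above whose trip count is only known at run time falls outside the default heuristics, so the option above is needed to request tail folding for it:

void foo_runtime(int* __restrict__ dst, int* __restrict__ src, int N) {
  // The trip count N is not known at compile time, so SVE tail folding is
  // not applied by default; it can be requested with
  // -mllvm -sve-tail-folding=<option>.
  for (int i = 0; i < N; i++) {
    dst[i] = 1000 * src[i];
  }
}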
Some AArch64 cores, like Neoverse N1, benefit when sequences of 256-bit STP instructions (store pairs of Q registers) are ordered by ascending offset. This affects the performance of large memset() calls, and of any equivalent code whose hot path runs through a series of STP Q instructions. In LLVM 15, we added scheduling support to reorder such stores after register allocation on AArch64. For example:
void init_one (char *a)
{
  __builtin_memset (a, 1, 128);
}
With the older scheduling, LLVM generated:
init_one:
        movi    v0.16b, #1
        stp     q0, q0, [x0, #96]
        stp     q0, q0, [x0, #64]
        stp     q0, q0, [x0, #32]
        stp     q0, q0, [x0]
        ret
The improved scheduler now generates:
init_one:
        movi    v0.16b, #1
        stp     q0, q0, [x0]
        stp     q0, q0, [x0, #32]
        stp     q0, q0, [x0, #64]
        stp     q0, q0, [x0, #96]
        ret
LLVM 15 brings various new features for Arm users.
LLVM 15 adds an option for ensuring the generation of AAPCS-compliant frame records in AArch32. Frame records allow applications to easily analyze and traverse the complete call stack starting from a frame pointer.
Without this option, the compiler might choose slightly different layouts and registers for the frame chain. In Thumb-1 mode, for example, using r11 (a high register) as the frame pointer can have a negative impact on performance and code size, so the compiler uses r7 by default. The new option ensures that the compiled program is compatible with tools and applications that rely on the behavior specified in the AAPCS.
The AAPCS32 standard can be found here. You can turn on AAPCS-compliant frame chains by passing -mframe-chain=aapcs to the compiler.
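As an illustration of what frame records enable, the sketch below walks a frame chain to print a backtrace. This is a minimal sketch, not part of the release: it assumes the whole program was built with -mframe-chain=aapcs, that each frame record is a {saved frame pointer, return address} pair of words pointed to by the frame pointer, and that the chain terminates with a null previous pointer; consult the AAPCS32 document for the authoritative layout.

#include <stdio.h>

/* Hypothetical frame-record layout, assuming a {previous FP, return address}
   pair as described by the AAPCS32 frame-chain rules. */
struct frame_record {
  struct frame_record *prev;  /* caller's frame pointer   */
  void *return_address;       /* saved return address     */
};

/* Walk the frame chain starting from a given frame pointer and print each
   return address. Assumes the chain ends with a null previous pointer. */
void dump_call_stack(const struct frame_record *fp) {
  for (const struct frame_record *fr = fp; fr != NULL; fr = fr->prev)
    printf("return address: %p\n", fr->return_address);
}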
Support for the macro __ARM_FEATURE_SVE_VECTOR_OPERATORS has been added to LLVM 15. This macro indicates that the GNU vector extensions can be used with length-agnostic vector types such as svint32_t. The language extension is described in more detail as part of the ACLE specification.
Among other uses, this allows C/C++ operators to be applied directly to length-agnostic vectors. The following example illustrates the addition of two vectors:
#include <arm_sve.h>

#if (__ARM_FEATURE_SVE_VECTOR_OPERATORS == 2)
int32_t foo(svint32_t x, svint32_t y) {
  return (x + y)[3];
}
#endif
foo(__SVInt32_t, __SVInt32_t):          // @foo(__SVInt32_t, __SVInt32_t)
        add     z0.s, z1.s, z0.s
        mov     w0, v0.s[3]
        ret
Note that, although the element size of the vectors is 32 bits, the total length of the vector is unknown at compile time.
Over the past months, Arm has been heavily involved in the main LLVM Fortran front end, Flang. Up until LLVM 14, the development of the Flang front end was split between two repositories: llvm-project and f18-llvm-project. Having two repositories caused maintenance overhead, confusion when merging patches, and inconsistencies between the repositories. In preparation for LLVM 15, Arm participated in a community effort to merge the second repository into llvm-project's main branch. The completion of this upstreaming initiative gives us a single repository that contains a mostly functional Fortran 95 compiler. Executables can be built by passing the option -flang-experimental-exec to the flang-new driver.
The most significant parts of the move were the OpenMP lowering code, the Fortran loop-related constructs and intrinsics, and support for CMake. In the area of OpenMP, we contributed support for reductions, and we are now close to complete support for OpenMP 1.1.
Some features of Fortran 95 are not yet supported, especially in the area of derived type components such as arrays and strings. Performance is also an area for future improvement. We continue to lead and invest further in the Flang driver, and we look forward to delivering more improvements in future releases of LLVM.