What is new in LLVM 15?

Pablo Barrio
February 27, 2023

LLVM 15.0.0 was released on September 6, 2022, followed by a series of bug-fix point releases. Beyond the regular architecture enablement work, Arm contributed several pieces of functionality, including support for frame chains in AArch32, SVE whole-loop scalable auto-vectorization, and C/C++ operator support for ACLE types. The release also benefits from numerous performance improvements for both A-profile and M-profile cores.

Architecture and IP support

LLVM 15 expands support for the A-profile architecture with the new Armv8.8-A and Armv9.3-A extensions. The most relevant change from a toolchain point of view is the addition of new instructions to improve the performance and portability of memcpy() and memset(). These can now be accessed through a series of ACLE intrinsics of the form __builtin_arm_mops*(). For more information on the full extensions, refer to the Arm A-Profile Architecture Developments 2021 blog post. 

Another addition worth noting is support for the Arm Cortex-M85 processor. This is the highest-performance M-profile CPU to date, and the first CPU to support the optional PACBTI Security Extension. More info on the Cortex-M85 is available here. You can build for this CPU by adding -mcpu=cortex-m85 to your command line.

Performance improvements

LLVM 15 bundles several performance improvements. One of the main areas of work has been vectorization with SVE.

Tail folding for scalable auto-vectorization

This new feature lets the vectorizer handle every iteration inside the vectorized loop by predicating the final, partial vector iteration, which removes the need for a scalar epilogue loop.

void foo(int* __restrict__ dst, int* __restrict__ src, int N) {
    #pragma nounroll
    for (int i = 0; i < 12; i++) {
        dst[i] = 1000 * src[i];
    }
}

Compile with -O3 -g0 -march=armv9-a.

Before LLVM 15, without tail folding, the compiler did not vectorize this loop:

foo(int*, int*, int): // @foo(int*, int*, int)
    mov x8, xzr
    mov w9, #1000
.LBB0_1: // =>This Inner Loop Header: Depth=1
    ldr w10, [x1, x8]
    mul w10, w10, w9
    str w10, [x0, x8]
    add x8, x8, #4
    cmp x8, #48
    b.ne .LBB0_1
    ret

With tail folding, vectorization does not require an epilogue. LLVM estimates the vector version of this loop to be faster than the scalar one, so the loop is vectorized:

foo(int*, int*, int): // @foo(int*, int*, int)
    mov w9, #12
    mov w11, #1000
    mov x8, xzr
    cntw x10
    whilelo p0.s, xzr, x9
    mov z0.s, w11
.LBB0_1: // =>This Inner Loop Header: Depth=1
    ld1w { z1.s }, p0/z, [x1, x8, lsl #2]
    mul z1.s, z1.s, z0.s
    st1w { z1.s }, p0, [x0, x8, lsl #2]
    add x8, x8, x10
    whilelo p0.s, x8, x9
    b.mi .LBB0_1
    ret

Tail folding is enabled by default only in certain circumstances, such as loops with known trip counts. Otherwise, it can be enabled with the option -mllvm -sve-tail-folding=<option>.
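To make the idea concrete, here is a minimal scalar sketch of what the tail-folded loop does. It is not what the compiler emits and it does not use SVE. The function name scale_tail_folded and the vl parameter are hypothetical: vl stands in for the number of 32-bit lanes that cntw would report, and the inner per-lane test models the predicate that whilelo computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a tail-folded loop (illustrative only, no SVE involved).
 * `vl` is an assumed vector length in 32-bit lanes, standing in for `cntw`.
 * Returns the number of "vector" iterations that were executed. */
size_t scale_tail_folded(int32_t *restrict dst, const int32_t *restrict src,
                         size_t n, size_t vl) {
    assert(vl > 0);
    size_t iterations = 0;
    for (size_t base = 0; base < n; base += vl) {
        /* One pass over the lanes models one predicated vector iteration. */
        for (size_t lane = 0; lane < vl; lane++) {
            /* Models `whilelo`: a lane is active while base + lane < n. */
            int active = (base + lane) < n;
            if (active) {
                dst[base + lane] = 1000 * src[base + lane];
            }
        }
        iterations++;
    }
    return iterations;
}
```

With n = 12 and an assumed vl of 8, the model runs two iterations: the first with all eight lanes active, the second with only four. No separate scalar loop is needed for the remaining elements, which is the point of tail folding.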

Store-pair sequence scheduling

Some AArch64 cores, such as Neoverse N1, benefit when sequences of 256-bit STP instructions are ordered by ascending offset. This affects the performance of large memsets, or any equivalent code with a hot path through a series of STP instructions on Q registers. In LLVM 15, we added a scheduler to reorder such stores after register allocation for AArch64. For example:

void init_one (char *a) { __builtin_memset (a, 1, 128); }

With the older scheduling, LLVM generated:

init_one:
	movi v0.16b, #1
	stp q0, q0, [x0, #96]
	stp q0, q0, [x0, #64]
	stp q0, q0, [x0, #32]
	stp q0, q0, [x0]
	ret

The improved scheduler now generates:

init_one:
	movi v0.16b, #1
	stp q0, q0, [x0]
	stp q0, q0, [x0, #32]
	stp q0, q0, [x0, #64]
	stp q0, q0, [x0, #96]
	ret
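The effect of the reordering can be pictured with a toy model. This is only a sketch of the outcome, not of how the LLVM scheduler is implemented: the stp_store type and the order_stores_ascending function are hypothetical, and each store is reduced to its byte offset from a common base register.

```c
#include <stddef.h>

/* Hypothetical stand-in for one `stp q0, q0, [x0, #offset]` instruction.
 * Only the byte offset from the shared base register is modelled. */
typedef struct {
    int offset;
} stp_store;

/* Toy model of the scheduling result: arrange a run of independent stores
 * to the same base in ascending offset order (simple insertion sort). */
void order_stores_ascending(stp_store *stores, size_t count) {
    for (size_t i = 1; i < count; i++) {
        stp_store key = stores[i];
        size_t j = i;
        /* Shift stores with larger offsets one slot to the right. */
        while (j > 0 && stores[j - 1].offset > key.offset) {
            stores[j] = stores[j - 1];
            j--;
        }
        stores[j] = key;
    }
}
```

Applied to the offsets 96, 64, 32, 0 from the older output above, the model yields 0, 32, 64, 96, matching the order in the improved output.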

Support for Arm ACLE features and ABI

LLVM 15 brings various new features for Arm users.

AAPCS32 frame chain support

LLVM 15 adds an option for ensuring the generation of AAPCS-compliant frame records in AArch32. Frame records allow applications to easily analyze and traverse the complete call stack starting from a frame pointer.

Without this option, the compiler might choose slightly different layouts and registers for the frame chain. In Thumb1 mode, for example, using r11 (a high register) as the frame pointer can hurt performance and code size, so the compiler uses r7 by default. The new option ensures that the compiled program is compatible with tools and applications that rely on the behavior specified in the AAPCS.

The AAPCS32 standard can be found here. You can turn on AAPCS-compliant frame chains by passing -mframe-chain=aapcs to the compiler.
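As a rough illustration of why a standard frame record layout is useful, the sketch below walks a chain of records and collects the return addresses. It assumes that a frame record is a pair of values, a link to the caller's record followed by the saved return address, and that a null link ends the chain. The struct and function names are hypothetical, and the chain is meant to be built by hand rather than read from a live stack, so the code can run on any host.

```c
#include <stddef.h>

/* Assumed shape of a frame record: the caller's frame record first,
 * then the saved return address (LR). Names are illustrative. */
struct frame_record {
    const struct frame_record *prev; /* NULL marks the end of the chain */
    const void *return_address;      /* saved link register value */
};

/* Walk the chain starting at `fp`, storing up to `max` return addresses
 * in `out`. Returns the number of frames visited. */
size_t walk_frame_chain(const struct frame_record *fp,
                        const void **out, size_t max) {
    size_t depth = 0;
    while (fp != NULL && depth < max) {
        out[depth++] = fp->return_address;
        fp = fp->prev;
    }
    return depth;
}
```

A profiler or unwinder that relies on this layout only needs the current frame pointer to recover the whole call stack, which is why a compiler-specific layout or register choice can break such tools.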

C/C++ operator support for ACLE vector types

LLVM 15 adds support for the __ARM_FEATURE_SVE_VECTOR_OPERATORS macro. It indicates that the operators from the GNU vector extensions can be applied to length-agnostic vector types such as svint32_t. The language extension is described in more detail as part of the ACLE specification.

Among other uses, when this macro is defined, C/C++ operators can be applied directly to length-agnostic vectors. The following example adds two vectors and extracts one element of the result:

#include <arm_sve.h>

#if (__ARM_FEATURE_SVE_VECTOR_OPERATORS == 2)
int32_t foo(svint32_t x, svint32_t y) {
    return (x + y)[3];
}
#endif

Compile with -O3 -g0 -march=armv9-a.

foo(__SVInt32_t, __SVInt32_t): // @foo(__SVInt32_t, __SVInt32_t)
    add z0.s, z1.s, z0.s
    mov w0, v0.s[3]
    ret

Note that, although the element size of the vectors is 32 bits, the total length of the vector is unknown at compile time.
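For comparison, the same operator syntax has long been available on fixed-length vectors through the GNU vector extensions. The sketch below is a fixed-length analogue that builds with GCC or Clang on any target and does not use SVE. The int32x4 typedef and the add_and_extract function are names chosen for this example.

```c
#include <stdint.h>

/* Fixed-length GNU vector type: four 32-bit lanes (16 bytes in total). */
typedef int32_t int32x4 __attribute__((vector_size(16)));

/* Element-wise addition with the ordinary `+` operator, followed by a
 * subscript to read lane 3 of the result. */
int32_t add_and_extract(int32x4 x, int32x4 y) {
    return (x + y)[3];
}
```

The difference with svint32_t is that here the number of lanes is fixed by the typedef, whereas the SVE version works for whatever vector length the hardware provides.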

Flang updates

Over the past months, Arm has been heavily involved in Flang, the main LLVM Fortran front end. Up until LLVM 14, development of Flang was split between two repositories: llvm-project and f18-llvm-project. Having two repositories caused maintenance overhead, confusion when merging patches, and inconsistencies between the repositories. In preparation for LLVM 15, Arm participated in a community effort to merge the second repository into the main branch of llvm-project. The completion of this upstreaming initiative gives us a single repository that contains a mostly functional Fortran 95 compiler. Executables can be built with the option -flang-experimental-exec.

The most significant parts of the move were the OpenMP lowering code, the Fortran loop-related constructs and intrinsics, and support for CMake. In the area of OpenMP, we contributed support for reductions, and we are now close to complete support for OpenMP 1.1.

Some features of Fortran 95 are not yet supported, especially in the area of derived type components such as arrays and strings. Performance is also an area for future improvement. We continue to lead and invest further in the Flang driver, and we look forward to delivering more improvements in future releases of LLVM.
