Here at Arm, we use LLVM a lot. It is an important toolchain in its own right, plus it forms the basis of many open source and commercial toolchains targeting Arm, including our own Arm Compiler for Embedded and Arm Compiler for Linux. It is used across the Arm ecosystem - from the development of code for tiny embedded processors, through mobile phone app development on Android, right up to High Performance Computing (HPC) development on the biggest systems in the world.
LLVM 14 has just been released, and has some great new enhancements for developers targeting Arm-based systems. This blog post will introduce some of those enhancements, and talk about how you might want to make use of them.
SVE and SVE2 are vector enhancements to the Arm A-profile architecture. Many readers will be aware of them in detail already, but to learn more please visit our newly updated SVE/SVE2 Programmer's Guide.
Side note: for the sake of brevity, I'll use 'SVE2' for the rest of this article to refer to both SVE and SVE2. The majority of the work is applicable to both architecture features.
For the past few years, our LLVM compiler development team - along with our ecosystem partners - have been adding enhancements to allow LLVM to make use of SVE2 instructions. The primary mechanism in LLVM for making use of SVE2 is scalable auto-vectorisation. This allows the compiler to turn scalar loops into SVE2 vector instructions that execute correctly, and with good performance, on CPUs with any SVE2 vector length, up to 2048 bits.
In LLVM 14, for the first time, the vectorizer performs scalable auto-vectorisation by default, generating SVE2 instructions when compiling for targets that support them.
Along the way, the team have made significant improvements to SVE2 code quality, improved the LLVM cost model, and made a number of compiler features, such as stack protection, aware of SVE2.
Making use of scalable auto-vectorisation is easy. You just need to:

1. Enable the vectorizer, typically by giving an optimisation flag of -O2 or higher. -O3 or -Ofast is recommended where appropriate.
2. Enable SVE2 code generation, typically by specifying an appropriate CPU target, such as -mcpu=neoverse-v1 (alternatively, this can be achieved using a suitable -march flag). A complete example invocation is sketched below.
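For example, assuming a source file called foo.c (the file name here is purely illustrative), a complete invocation might look like this:

    clang --target=aarch64-linux-gnu -O3 -mcpu=neoverse-v1 foo.c -o foo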
For suitable workloads, we have found that LLVM 14 gives performance improvements over NEON-only auto-vectorisation on SVE-capable systems, such as those using the Neoverse V1 CPU.
There's more work coming in this space - we intend to make significant further enhancements to LLVM SVE2 code quality in LLVM 15, later this year.
We can see an example of SVE auto-vectorisation in practice by looking at the following code snippet:

void foo(short * __restrict__ dst, short *src, int N) {
  for (int i = 0; i < N; i++)
    dst[i] = src[i] + 42;
}
Note: In the code examples below, I have disabled unrolling by adding -fno-unroll-loops, purely to make the resulting assembly code easier to read. By default, when targeting Neoverse V1 systems, the compiler will choose to unroll this code to improve runtime performance.
If we compile this with -mcpu=neoverse-v1 -O3 -g0 -fno-unroll-loops, we see code that looks like this:
_Z3fooPsS_i:                            // @_Z3fooPsS_i
        .cfi_startproc
// %bb.0:
        cmp     w2, #1
        b.lt    .LBB0_8
// %bb.1:
        mov     w8, w2
        cnth    x10
        cmp     x10, x8
        b.ls    .LBB0_3
// %bb.2:
        mov     x9, xzr
        b       .LBB0_6
.LBB0_3:
        udiv    x9, x8, x10
        mov     x11, xzr
        ptrue   p0.h
        mul     x9, x9, x10
        sub     x12, x8, x9
        .p2align        5, 0x0, 16
.LBB0_4:                                // =>This Inner Loop Header: Depth=1
        ld1h    { z0.h }, p0/z, [x1, x11, lsl #1]
        add     z0.h, z0.h, #42         // =0x2a
        st1h    { z0.h }, p0, [x0, x11, lsl #1]
        add     x11, x11, x10
        cmp     x11, x9
        b.ne    .LBB0_4
// %bb.5:
        cbz     x12, .LBB0_8
.LBB0_6:
        lsl     x11, x9, #1
        sub     x8, x8, x9
        add     x10, x0, x11
        add     x11, x1, x11
        .p2align        5, 0x0, 16
.LBB0_7:                                // =>This Inner Loop Header: Depth=1
        ldrh    w9, [x11], #2
        subs    x8, x8, #1
        add     w9, w9, #42
        strh    w9, [x10], #2
        b.ne    .LBB0_7
.LBB0_8:
        ret
In the middle of this code block, you can see a section labelled .LBB0_4. This is the SVE scalable vectorised loop.
Another new capability in LLVM 14 is to fold the scalar tail (shown in sections .LBB0_6 and .LBB0_7) into the main loop.
Using a non-production flag to the compiler, we can see the effect this has on the code above. If we compile this with -mcpu=neoverse-v1 -O3 -g0 -mllvm -prefer-predicate-over-epilogue=predicate-dont-vectorize -fno-unroll-loops, we see code that looks like this:
_Z3fooPsS_i:                            // @_Z3fooPsS_i
        .cfi_startproc
// %bb.0:
        cmp     w2, #1
        b.lt    .LBB0_3
// %bb.1:
        mov     w9, w2
        cnth    x10
        mov     x8, xzr
        add     x11, x10, x9
        sub     x11, x11, #1
        udiv    x11, x11, x10
        mul     x11, x11, x10
        .p2align        5, 0x0, 16
.LBB0_2:                                // =>This Inner Loop Header: Depth=1
        whilelo p0.h, x8, x9
        ld1h    { z0.h }, p0/z, [x1, x8, lsl #1]
        add     z0.h, z0.h, #42         // =0x2a
        st1h    { z0.h }, p0, [x0, x8, lsl #1]
        add     x8, x8, x10
        cmp     x8, x11
        b.ne    .LBB0_2
.LBB0_3:
        ret
Here, we can see the compiler making full use of the predication features of SVE, to allow the scalar tail to be folded into the main loop body. Although I have 'forced' this using a non-production flag, the compiler will apply this optimization automatically where it believes it to be beneficial. Further work in LLVM 15 will increase the code quality of this optimization.
We expect most users of SVE2 to use the vector-length-agnostic auto-vectorisation described above. However, in some specialised fields, such as HPC, Arm has partners with LLVM-based commercial toolchains that can safely assume a specific SVE vector width. To help support these partners, LLVM supports a style of auto-vectorisation known as vector-length-specific auto-vectorisation. By invoking the compiler with -msve-vector-bits=256 or similar, the user gives the compiler freedom to assume that the code will only ever be executed on systems with exactly 256-bit SVE vector registers.
In LLVM 14, the code quality of fixed-length auto-vectorisation has been significantly improved, with support for interleaved accesses when using wider-than-NEON vectors, and better code quality when vectorising loops containing mixed data types.
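To give a flavour of what vector-length-specific compilation enables at the source level, here is a minimal sketch (the type and function names are illustrative) using the ACLE arm_sve_vector_bits attribute, which is only available when -msve-vector-bits is given:

#include <arm_sve.h>

// Compile with, for example: -march=armv8-a+sve -msve-vector-bits=256
// With a known vector length, an SVE type can be given a fixed size
// (here 256 bits, i.e. 8 floats), so it behaves like an ordinary sized
// C type and can be stored in structs, arrays, and so on.
typedef svfloat32_t fixed_float32_t
    __attribute__((arm_sve_vector_bits(256)));

fixed_float32_t double_it(fixed_float32_t v) {
    // Fixed-length values convert implicitly to and from the scalable
    // types, so the normal ACLE intrinsics can operate on them.
    return svmul_n_f32_x(svptrue_b32(), v, 2.0f);
}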
The Arm C Language extensions (or ACLE) for SVE provide an API to enable C/C++ programmers to exploit the Arm architecture with minimal restrictions on source code portability. The language extensions have two main purposes: to provide a set of types and accessors for SVE vectors and predicates, and to provide a function interface for all relevant SVE and SVE2 instructions. For more information on the ACLE, see our ACLE home page.
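To give a flavour of the intrinsics, here is a hand-written SVE version of the earlier loop; a minimal sketch, assuming a compiler invocation that enables SVE (for example, -march=armv8-a+sve):

#include <arm_sve.h>

void add42(short *dst, const short *src, int n) {
    // svcnth() returns the number of 16-bit elements per vector,
    // so this loop is vector-length agnostic.
    for (int i = 0; i < n; i += svcnth()) {
        svbool_t pg = svwhilelt_b16_s32(i, n); // predicate for the active lanes
        svint16_t v = svld1_s16(pg, src + i);  // predicated load
        v = svadd_n_s16_x(pg, v, 42);          // add 42 to the active lanes
        svst1_s16(pg, dst + i, v);             // predicated store
    }
}

Note how the whilelt predicate handles the final, partial iteration, mirroring the whilelo-based tail folding the compiler generated above.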
A new feature of the ACLE, known as the NEON/SVE bridge, has recently been added to the ACLE specification, and implemented in LLVM 14. The bridge allows for conversions between a NEON vector type and an SVE type. Since the NEON and SVE register files overlap, this conversion is typically free, reusing the existing in-register data without causing copies via memory.
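As a minimal sketch of the bridge in use (the function name is illustrative), the arm_neon_sve_bridge.h header provides intrinsics such as svdup_neonq, svget_neonq, and svset_neonq for moving values between the two register views:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

// Broadcast a 128-bit NEON vector across a full SVE register and
// accumulate it. Because the NEON registers alias the low 128 bits of
// the SVE registers, no copy through memory is required.
svfloat32_t accumulate_neon(svfloat32_t acc, float32x4_t v) {
    svfloat32_t sv = svdup_neonq_f32(v);        // splat the NEON value
    return svadd_f32_x(svptrue_b32(), acc, sv); // add across all lanes
}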
As a side note: the Arm ACLE specification is now open source, and the addition above was made via our public contribution process. Further contributions are very welcome!
Additionally, we have made a number of improvements to LLVM's implementation of the SVE ACLE, including work involving svbool_t and the PTEST, ASRD, SUBR, ABD, and FABD instructions.
In combination, we believe these improvements mean that LLVM 14 is ready for production use of the SVE2 ACLE. Please let us know if you have any issues.
LLVM 14 expands support for both Arm architectural features and Arm Cortex CPUs.
On the architecture side, this includes:
For more information on architectural features, including the new Architectural Exploration tools for A-profile, please see our CPU Architecture homepage.
The following CPUs are now supported in LLVM:
- Cortex-A510 (-mcpu=cortex-a510)
- Cortex-A710 (-mcpu=cortex-a710)
- Cortex-X2 (-mcpu=cortex-x2)
- Cortex-X1C (-mcpu=cortex-x1c)
Since LLVM 11, the compiler team at Arm have been steadily improving the performance of SPECINT 2017 for big AArch64 cores, such as those seen in Neoverse and high-end mobile Cortex CPUs, by around 2% per release. In LLVM 14, we achieved a 1.5% improvement in the SPEC2017 intrate geomean score over LLVM 13.
This is primarily down to two pieces of work:
We expect to continue to make improvements to SPECINT in LLVM 15 and beyond.
In addition to our focus on the SPEC CPU benchmark, we have made a number of other improvements, including:
In partnership with Linaro, the LLDB debugger has a number of new features in LLVM 14, including:
I'd like to end this post with a shout-out to our awesome development team. The following engineers at Arm contributed to LLVM 14:
Aaron DeBattista, Alban Bridonneau, Alexandros Lamprineas, Anastasia Stulova, Andre Simoes Dias Vieira, Andrzej Warzynski, Bradley Smith, Caroline Concatto, Cullen Rhodes, Daniel Kiss, David Candler, David Green, David Sherwood, David Spickett, David Truby, Graham Hunter, Igor Kirillov, Ivan Zhechev, James Greenhalgh, Javier Setoain, Jingu Kang, John Brawn, Jolanta Jensen, Josh Mottley, Justas Janickas, Karl Meakin, Keith Walker, Kerry McLaughlin, Kevin Cheng, Kevin Petit, Kiran Chandramohan, Kristof Beyls, Kyrylo Tkachov, Lucas Prates, Maciej Gabka, Malhar Jajoo, Mark Murray, Mats Petersson, Matthew Devereau, Mikhail Maltsev, Momchil Velikov, Mubashar Ahmad, Nicholas Guy, Ole Marius Strohm, Oliver Stannard, Paul Walker, Pawel Osmialowski, Peter Smith, Peter Waller, Ranjeet Singh, Richard Barton, Rosie Sumpter, Sam Elliott, Sam Parker-Haynes, Sander de Smalen, Simon Tatham, Sjoerd Meijer, Son Tuan Vu, Steve Suzuki, Stuart Brady, Stuart Ellis, Suraj Sudhir, Sven van Haastregt, Ties Stuij, Tomas Matheson, Victor Campos, Volodymyr Turanskyy