What is new in LLVM 15?

Pablo Barrio
February 27, 2023

LLVM 15.0.0 was released on September 6, 2022, followed by a series of minor bug-fix releases. Beyond the regular architecture enablement work, Arm contributed several pieces of functionality, including support for frame chains in AArch32, SVE whole-loop scalable auto-vectorization, and C/C++ operator support for ACLE types. The release also benefits from numerous performance improvements for both A-profile and M-profile cores.

Architecture and IP support

LLVM 15 expands support for the A-profile architecture with the new Armv8.8-A and Armv9.3-A extensions. The most relevant change from a toolchain point of view is the addition of new instructions to improve the performance and portability of memcpy() and memset(). These can now be accessed through a series of ACLE intrinsics of the form __builtin_arm_mops*(). For more information on the full extensions, refer to the Arm A-Profile Architecture Developments 2021 blog post. 

Another addition worth noting is support for the Arm Cortex-M85 processor. This is the highest-performance M-profile CPU to date, and the first CPU to offer the optional PACBTI security extension. More info on the Cortex-M85 is available here. You can build for this CPU by adding -mcpu=cortex-m85 to your command line.

Performance improvements

LLVM 15 bundles several performance improvements. One of the main areas of work has been vectorization with SVE.

Tail folding for scalable auto-vectorization

This new feature allows the vectorizer to handle all iterations within a vectorized loop, removing the need for a scalar epilogue loop.

void foo(int* __restrict__ dst, int* __restrict__ src, int N) {
    #pragma nounroll
    for (int i = 0; i < 12; i++) {
        dst[i] = 1000 * src[i];
    }
}

Compile with -O3 -g0 -march=armv9-a.

Before LLVM 15, without tail folding, the compiler would refuse to vectorize:

foo(int*, int*, int): // @foo(int*, int*, int)
    mov x8, xzr
    mov w9, #1000
.LBB0_1: // =>This Inner Loop Header: Depth=1
    ldr w10, [x1, x8]
    mul w10, w10, w9
    str w10, [x0, x8]
    add x8, x8, #4
    cmp x8, #48
    b.ne .LBB0_1
    ret

Because of tail folding, vectorization does not require an epilogue. LLVM estimates the vector version of this loop to be faster than the scalar one, and the loop is successfully vectorized:

foo(int*, int*, int): // @foo(int*, int*, int)
    mov w9, #12
    mov w11, #1000
    mov x8, xzr
    cntw x10
    whilelo p0.s, xzr, x9
    mov z0.s, w11
.LBB0_1: // =>This Inner Loop Header: Depth=1
    ld1w { z1.s }, p0/z, [x1, x8, lsl #2]
    mul z1.s, z1.s, z0.s
    st1w { z1.s }, p0, [x0, x8, lsl #2]
    add x8, x8, x10
    whilelo p0.s, x8, x9
    b.mi .LBB0_1
    ret

The pass is enabled by default only in certain circumstances, such as with known trip counts. Otherwise, it can be enabled with option -mllvm -sve-tail-folding=<option>.
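To make the predicated loop above easier to follow, here is a minimal portable C model of what the tail-folded code computes. It is only a sketch, not what the compiler emits: the names model_whilelo, scale_tail_folded and MAX_LANES are hypothetical, and the lanes parameter stands in for the run-time vector length that cntw returns on real SVE hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the number of 32-bit lanes in this model.
   On SVE the real lane count is only known at run time. */
enum { MAX_LANES = 64 };

/* Model of `whilelo p.s, i, n`: lane k is active when i + k < n. */
static void model_whilelo(uint8_t pred[], size_t lanes, size_t i, size_t n) {
    for (size_t k = 0; k < lanes; k++)
        pred[k] = (uint8_t)(i + k < n);
}

/* Tail-folded loop: every step processes a full vector of `lanes` elements,
   with the lanes past the end of the array masked off by the predicate.
   There is no scalar epilogue. Returns the number of vector steps taken. */
size_t scale_tail_folded(int32_t *dst, const int32_t *src,
                         size_t n, size_t lanes) {
    assert(lanes > 0 && lanes <= MAX_LANES);
    uint8_t pred[MAX_LANES];
    size_t steps = 0;
    for (size_t i = 0; i < n; i += lanes) {
        model_whilelo(pred, lanes, i, n);
        for (size_t k = 0; k < lanes; k++) {
            if (pred[k])  /* predicated load, multiply and store */
                dst[i + k] = 1000 * src[i + k];
        }
        steps++;
    }
    return steps;
}
```

With 12 elements and 8 lanes, the model takes two steps: the first with all lanes active, the second with only four. Inactive lanes never touch memory, which is why the remainder needs no separate scalar loop.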

Store-pair sequence scheduling

Some AArch64 cores, such as Neoverse N1, perform better when sequences of 256-bit STP instructions are ordered by ascending offset (each STP of two Q registers writes 2 x 128 bits, or 32 bytes). This affects the performance of large memsets and of any equivalent code with a hot path through a series of STP Q stores. In LLVM 15, we added a scheduler to reorder such stores after register allocation for AArch64. For example:

void init_one (char *a) { __builtin_memset (a, 1, 128); }

With the older scheduling, LLVM generated:

init_one:
	movi v0.16b, #1
	stp q0, q0, [x0, #96]
	stp q0, q0, [x0, #64]
	stp q0, q0, [x0, #32]
	stp q0, q0, [x0]
	ret

The improved scheduler now generates:

init_one:
	movi v0.16b, #1
	stp q0, q0, [x0]
	stp q0, q0, [x0, #32]
	stp q0, q0, [x0, #64]
	stp q0, q0, [x0, #96]
	ret

Support for Arm ACLE features and ABI

LLVM 15 brings various new features for Arm users.

AAPCS32 frame chain support

LLVM 15 adds an option for ensuring the generation of AAPCS-compliant frame records in AArch32. Frame records allow applications to easily analyze and traverse the complete call stack starting from a frame pointer.

Without this option, the compiler might choose slightly different layouts and registers for the frame chain. In Thumb1 mode, for example, using r11 (a high register) as the frame pointer can hurt performance and code size, so the compiler uses r7 by default. The new option ensures that the compiled program is compatible with tools and applications that rely on the behavior specified in the AAPCS.

The AAPCS32 standard can be found here. You can turn on AAPCS-compliant frame chains by passing -mframe-chain=aapcs to the compiler.
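As an illustration of why a standard frame record layout is useful, the following C sketch walks a chain of frame records and collects return addresses. It is a host-side model rather than a real unwinder. It assumes each record is a pair holding a pointer to the caller's record followed by the saved return address, with a null pointer marking the end of the chain. The names frame_record and walk_frame_chain are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a frame record (assumed layout): a link to the caller's
   record, then the return address saved on function entry. */
struct frame_record {
    const struct frame_record *prev;  /* NULL marks the end of the chain */
    const void *return_addr;
};

/* Follows the chain starting at `fp` and stores up to `max` return
   addresses in `out`, innermost frame first. Returns the number stored. */
size_t walk_frame_chain(const struct frame_record *fp,
                        const void *out[], size_t max) {
    assert(out != NULL || max == 0);
    size_t depth = 0;
    while (fp != NULL && depth < max) {
        out[depth++] = fp->return_addr;
        fp = fp->prev;
    }
    return depth;
}
```

A real tool would obtain the starting frame pointer from the target (r11 when AAPCS-compliant frame chains are enabled) and would validate each pointer before dereferencing it. The model skips both steps to keep the traversal itself visible.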

C/C++ operator support for ACLE vector types

LLVM 15 adds support for the macro __ARM_FEATURE_SVE_VECTOR_OPERATORS. This macro indicates that the operators from the GNU vector extensions can be used with length-agnostic vector types such as svint32_t. The language extension is described in more detail as part of the ACLE specification.

When the macro is defined, ordinary C/C++ operators can be applied directly to length-agnostic vectors. The following example adds two vectors and returns one element of the result:

#include <arm_sve.h>

#if (__ARM_FEATURE_SVE_VECTOR_OPERATORS == 2)
int32_t foo(svint32_t x, svint32_t y) {
    return (x + y)[3];
}
#endif

Compile with -O3 -g0 -march=armv9-a.

foo(__SVInt32_t, __SVInt32_t): // @foo(__SVInt32_t, __SVInt32_t)
    add z0.s, z1.s, z0.s
    mov w0, v0.s[3]
    ret

Note that, although the element size of the vectors is 32 bits, the total length of the vector is unknown at compile time.
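For comparison, the same operator syntax has long been available on fixed-length vectors through the GNU vector extensions. The sketch below uses a 16-byte vector type so that it builds with GCC or Clang on any target. The names v4i32 and add_and_pick are hypothetical, and unlike the SVE example, the vector length here is fixed at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-length GNU vector type: four 32-bit lanes (16 bytes). */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Adds two vectors with the + operator and returns the chosen lane. */
int32_t add_and_pick(v4i32 x, v4i32 y, int lane) {
    assert(lane >= 0 && lane < 4);
    v4i32 sum = x + y;  /* element-wise addition */
    return sum[lane];   /* subscript reads a single lane */
}
```

Calling add_and_pick with {1, 2, 3, 4} and {10, 20, 30, 40} and lane 3 returns 44. The SVE version reads the same way, but the compiler cannot assume how many lanes the vector holds.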

Flang updates

In the past months, Arm has been heavily involved in the main LLVM Fortran front end, Flang. Up until LLVM 14, the development of the Flang front end was split between two repositories: llvm-project and f18-llvm-project. Having two repositories caused maintenance overhead, confusion when merging patches, and inconsistencies between the repositories. In preparation for LLVM 15, Arm participated in a community effort to merge the second repository into the main branch of llvm-project. The completion of this upstreaming initiative gives us a single repository that contains a mostly functional Fortran 95 compiler. Executables can be built with the option -flang-experimental-exec.

The most significant parts of the move were the OpenMP lowering code, the Fortran loop-related constructs and intrinsics, and support for CMake. In the area of OpenMP, we contributed support for reductions, and we are now close to complete support for OpenMP 1.1.

Some features of Fortran 95 are not yet supported, especially in the area of derived type components such as arrays and strings. Performance is also an area for future improvement. We continue to lead and invest further in the Flang driver, and we look forward to delivering more improvements in future releases of LLVM.
