What is new in LLVM 16?

Pablo Barrio
May 1, 2023
13 minute read time.

LLVM 16 was announced on March 17, 2023. As usual, Arm contributed support for new architectures and CPUs, along with significant performance improvements. This time around, we also brought exciting new functionality such as function multi-versioning and full support for strict floating point, and several existing features have been improved. llvm-objdump is now a better substitute for GNU objdump, we fixed support for the older Armv4 architecture, and improvements to the Fortran front-end mean we can now build SPEC2017.

Many thanks to all the people who contributed content to this blog post. Most notably Daniel Kiss, Kyrylo Tkachov, Paul Walker, Kiran Chandramohan, David Green, Simon Tatham, John Brawn, and Ties Stuij.

If you want to know more about the previous release, you can read the blog about what is new in LLVM 15.

New architecture and CPU support

The Armv8.9-A and Armv9.4-A extensions are now supported in LLVM. You can learn more about the new extensions in the announcement blog.

Beyond the standard support for this year's architecture updates, we completed assembly support for the Scalable Matrix Extension (SME and SME2). On the CPU side, this release extends the line-up of Armv9-A cores with support for our Cortex-A715 and Cortex-X3 CPUs.

A-profile 2022 updates: Armv8.9-A and Armv9.4-A

Assembly and disassembly are now available for all extensions except the Guarded Control Stack (GCS), which will be supported in the next LLVM release. The Arm C Language Extensions (ACLE) have also gained two new intrinsics, __arm_rsr128 and __arm_wsr128, which make the new 128-bit System registers easier to access. These intrinsics are now supported in LLVM.
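As a minimal sketch of how these intrinsics might be used (the register name and command-line options below are illustrative assumptions, not taken from the release itself), a 128-bit System register can be read and written as follows:

#include <arm_acle.h>

/* Read a 128-bit System register. "TTBR0_EL1" is purely an illustrative name. */
__uint128_t read_sysreg128(void) {
    return __arm_rsr128("TTBR0_EL1");
}

/* Write a 128-bit System register. */
void write_sysreg128(__uint128_t value) {
    __arm_wsr128("TTBR0_EL1", value);
}

Building this presumably requires a target with 128-bit System register support, for example something like -march=armv9.4-a+d128.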

The Translation Hardening Extension (THE) is one of the main security improvements coming with Armv9.4-A and it is part of the Virtual Memory System Architecture (VMSA). Its purpose is to prevent arbitrary changes to the virtual memory's translation tables in situations where an attacker has gained kernel privileges. The new Read-Check-Write (RCW) instructions have been added to the architecture to allow controlled modification of such tables while disabling ordinary writes.

Even though these are intended for kernel rather than user-space developers, the RCW instructions map nicely to various atomic operations on 128-bit datatypes in C++. More specifically, fetch_and, fetch_or, and exchange can be implemented directly with these instructions. This functionality is useful for anyone using atomics, so we added code generation support in LLVM 16. On targets where the LRCPC3 and LSE2 extensions are also available, these specialized instructions are generated directly from C++ code without the need for assembly or intrinsics. The following code is an example for std::atomic::fetch_and:

#include <atomic>

std::atomic<__uint128_t> global;

void sink(__uint128_t);

void ldclrpal_example(__uint128_t x) {
    __uint128_t res = global.fetch_and(x);
    sink(res);
}

void ldclrp_example(__uint128_t x) {
    __uint128_t res = global.fetch_and(x, std::memory_order_relaxed);
    sink(res);
}

Compiling with -march=armv9.4a+lse128+rcpc3 -O3, the resulting assembly shows the new instructions being generated:

ldclrpal_example(unsigned __int128):
        mvn     x1, x1
        mvn     x0, x0
        adrp    x8, global
        add     x8, x8, :lo12:global
        ldclrpal        x0, x1, [x8]
        b       sink(unsigned __int128)
ldclrp_example(unsigned __int128):
        mvn     x1, x1
        mvn     x0, x0
        adrp    x8, global
        add     x8, x8, :lo12:global
        ldclrp  x0, x1, [x8]
        b       sink(unsigned __int128)

Function multi-versioning

Nowadays, many platforms have a single-binary deployment model: each application is distributed through exactly one binary. This makes it hard for developers to target multiple architectural features. To solve this problem, LLVM 16 provides a convenient way to target specific architectural features without the need to deal with feature detection and other details. This new feature is called function multi-versioning.

A new macro, __HAVE_FUNCTION_MULTI_VERSIONING, is provided to detect the availability of the feature. If it is present, we can ask the compiler to generate multiple versions of a given function by marking it with __attribute__((target_clones(...))). The most appropriate version of the function is then called at runtime.

In the example below, a function is marked to be built for Advanced SIMD (Neon) and SVE. The SVE version is used if SVE is available on the target.

#ifdef __HAVE_FUNCTION_MULTI_VERSIONING
__attribute__((target_clones("sve", "simd")))
#endif
float foo(float *a, float *b) {
   // 
}

In some cases, developers want to provide different code for each feature. This is also possible by using __attribute__((target_version())). In the following example, we provide two versions of the same function. Again, the SVE version is called if SVE is available. The __HAVE_FUNCTION_MULTI_VERSIONING macro allows writing code that is compatible with compilers both with and without function multi-versioning.

#ifdef __HAVE_FUNCTION_MULTI_VERSIONING
__attribute__((target_version("sve")))
static void foo(void) {
    printf("FMV uses SVE\n");
}
#endif

// this attribute is optional
// __attribute__((target_version("default")))
static void foo(void) {    
    printf("FMV default\n");
    return;
}

This feature depends on compiler-rt (-rtlib=compiler-rt) and is enabled by default, but it can be disabled with the -mno-fmv flag. Be aware that function multi-versioning is still in a beta state. Feedback is very welcome on the ACLE spec, either by opening a new issue or by creating a pull request.

Performance improvements

Complex number autovectorization

LLVM 16 includes support for autovectorization of common operations on complex numbers. These make use of instructions available in the Advanced SIMD (Neon) and MVE instruction sets for the Armv8-A and Armv8-M architectures, respectively. For example, the code:

#include <complex.h>
#define N 512

void fma (_Complex float a[restrict N], _Complex float b[restrict N],
           _Complex float c[restrict N]) {
  for (int i=0; i < N; i++)
    c[i] = a[i] * b[i];
} 

results in the following assembly:

fma: // @fma
  mov x8, xzr
.LBB0_1: // =>This Inner Loop Header: Depth=1
  add x9, x0, x8
  add x10, x1, x8
  movi v2.2d, #0000000000000000
  movi v3.2d, #0000000000000000
  ldp q1, q0, [x9]
  add x9, x2, x8
  add x8, x8, #32
  cmp x8, #1, lsl #12 // =4096
  ldp q5, q4, [x10]
  fcmla v3.4s, v1.4s, v5.4s, #0
  fcmla v2.4s, v0.4s, v4.4s, #0
  fcmla v3.4s, v1.4s, v5.4s, #90
  fcmla v2.4s, v0.4s, v4.4s, #90
  stp q3, q2, [x9]
  b.ne .LBB0_1
  ret

Note the use of the FCMLA instruction, which performs a fused-multiply-add vector operation with an optional complex rotation on vectors of complex numbers.

Function specialization enabled by default and SPEC2017 intrate improvements

Specialization of functions has been enabled by default at all optimization levels when optimizing for speed. The optimization heuristics and compile-time properties of the pass have been improved, and it is now deemed beneficial enough to be enabled by default. This optimization particularly improves the 505.mcf_r benchmark in SPEC2017 intrate by about 10% on various AArch64 platforms, contributing to an estimated 3% geomean improvement of the SPEC2017 intrate C/C++ benchmarks on AArch64. Note that the SPEC2017 performance uplift is also aided by the tuning and enabling by default of the SelectOpt pass and other advanced pattern recognition.

[Figure: LLVM 16 vs LLVM 15 performance]
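As a rough illustration (this example is not from the release notes), function specialization targets calls where an argument is a compile-time constant, such as a function pointer. In the sketch below, the compiler may clone compute into one specialized copy per callee and turn the indirect calls into direct, inlinable ones:

long compute(long *x, int n, long (*op)(long)) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += op(x[i]);    // indirect call through 'op'
    return sum;
}

static long plus_one(long v) { return v + 1; }
static long times_two(long v) { return v * 2; }

long caller(long *x, int n) {
    // Each call site passes a constant function pointer, making 'compute'
    // a candidate for specialization.
    return compute(x, n, plus_one) + compute(x, n, times_two);
}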

Improvements to SVE and autovectorization

Autovectorization with SVE has been a very active area of development. For example, until now, vectorization of pointers accessed in different branches of a conditional was very basic: most of the time, its cost was computed as too high. Now, basic arithmetic on the pointer is included in the vectorizer's cost model, which means the following code is vectorized when it is profitable to do so:

void foo(float *dst, float *src, int *cond, long disp) {
  for (long i=0; i<1024; i++) {
    if (cond[i] != 0) {
      dst[i] = src[i];
    } else {
      dst[i] = src[i+disp];
    }
  }
}

That said, hitting the right circumstances to make vectorization profitable is tricky on a synthetic example, and the generated code is very long. If you want to see what the vectorized code looks like, you can tweak the cost model by compiling the previous example with -march=armv9-a -O3 -Rpass=loop-vectorize -mllvm -force-target-instruction-cost=1.

Vectorization of tail-folded loops has also been improved by reducing the need for explicit merging operations. For example, the following code:

float foo(float *a, float *b) {
  float sum = 0.0;
  for (int i = 0; i < 1024; ++i)
    sum += a[i] * b[i];
  return sum;
}

compiled with -march=armv9-a -Ofast -mllvm -sve-tail-folding=all shows that a predicated FMLA is now emitted:

.LLVM_15_LOOP:
    ld1w    { z2.s }, p1/z, [x0, x8, lsl #2]
    ld1w    { z3.s }, p1/z, [x1, x8, lsl #2]
    add    x8, x8, x10
    fmul    z2.s, z3.s, z2.s
    sel    z2.s, p1, z2.s, z0.s
    whilelo    p1.s, x8, x9
    fadd    z1.s, z1.s, z2.s
    b.mi    .LLVM_15_LOOP
 
.LLVM_16_LOOP:
    ld1w    { z1.s }, p1/z, [x0, x8, lsl #2]
    ld1w    { z2.s }, p1/z, [x1, x8, lsl #2]
    add    x8, x8, x10
    fmla    z0.s, p1/m, z2.s, z1.s
    whilelo    p1.s, x8, x9
    b.mi    .LLVM_16_LOOP

Vectorization of loops that iterate in reverse has also been improved by reducing the need for explicit reverse operations. Take this loop as an example:

void foo(int *a, int *b, int* c) {
  for (int i = 1024; i >= 0; --i) {
    if (c[i] > 10)
      a[i] = b[i] + 5;
  }
}

Compiled with -march=armv9-a -O3, the LLVM 16 output no longer reverses the loaded data nor the predicate used for the conditional:

.LLVM_15_LOOP:
    ld1w    { z0.s }, p0/z, [x16, x9, lsl #2]
    ld1w    { z1.s }, p0/z, [x17, x9, lsl #2]
    rev    z0.s, z0.s
    rev    z1.s, z1.s
    cmpgt    p1.s, p0/z, z0.s, #10
    cmpgt    p2.s, p0/z, z1.s, #10
    rev    p1.s, p1.s
    rev    p2.s, p2.s
    ld1w    { z0.s }, p1/z, [x14, x9, lsl #2]
    ld1w    { z1.s }, p2/z, [x15, x9, lsl #2]
    add    z0.s, z0.s, #5                  // =0x5
    add    z1.s, z1.s, #5                  // =0x5
    st1w    { z0.s }, p1, [x12, x9, lsl #2]
    st1w    { z1.s }, p2, [x13, x9, lsl #2]
    sub    x9, x9, x10
    cmp    x18, x9
    b.ne    .LLVM_15_LOOP
 
.LLVM_16_LOOP:
    ld1w    { z0.s }, p0/z, [x13, x9, lsl #2]
    ld1w    { z1.s }, p0/z, [x14, x9, lsl #2]
    cmpgt    p1.s, p0/z, z0.s, #10
    cmpgt    p2.s, p0/z, z1.s, #10
    ld1w    { z0.s }, p1/z, [x15, x9, lsl #2]
    ld1w    { z1.s }, p2/z, [x16, x9, lsl #2]
    add    z0.s, z0.s, #5                  // =0x5
    add    z1.s, z1.s, #5                  // =0x5
    st1w    { z0.s }, p1, [x17, x9, lsl #2]
    st1w    { z1.s }, p2, [x18, x9, lsl #2]
    sub    x9, x9, x10
    cmp    x12, x9
    b.ne    .LLVM_16_LOOP

Other performance improvements to SVE on LLVM 16 include:

  • The use of DUP has been greatly improved in various scenarios, especially for 128-bit LD1RQ variants.
  • Multiply-add and multiply-sub instructions can be used more extensively (see the sketch after this list).
  • The need for PTEST instructions has been greatly reduced.
  • Extended loop load elimination is now type-agnostic and so detects more cases.
  • The SLP cost model has been improved.
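As a brief, hedged illustration of the multiply-add point (this example is not from the release notes), a loop of the following shape is the kind of code where the vectorizer can now use SVE multiply-accumulate instructions such as MLA instead of a separate multiply and add, for example when compiled with -march=armv9-a -O3:

void muladd(int *restrict acc, const int *restrict a,
            const int *restrict b, int n) {
    for (int i = 0; i < n; i++)
        acc[i] += a[i] * b[i];   // multiply-accumulate per element
}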

SPEC2017 builds with Flang

Last December, we met the milestone of having all SPEC2017 Fortran rate benchmarks working at -O3 with LLVM/Flang. The main focus was enabling the four benchmarks (521.wrf_r, 527.cam4_r, 549.fotonik3d_r, 554.roms_r) that were failing. One of the main improvements was removing the dependency on external complex math libraries by using the MLIR complex dialect.

Also, some performance has been gained by improving information sharing between the front-end and LLVM, and by improving support for fast math.

You can build Flang by passing -DLLVM_ENABLE_PROJECTS="flang;clang;mlir" to CMake. The Flang executable is called flang-new; make sure to pass the -flang-experimental-exec option to generate executables.

Target-gated ACLE intrinsics

Initially sparked by the Highway library, the target("<string>") attributes have seen some improvements in the latest Clang, aiming to bring them in line with GCC's implementation.

The supported formats are now:

  • arch=<arch> strings specify the architecture features for a function as per the -march=arch+feature command-line option.
  • cpu=<cpu> strings specify the target CPU and any implied attribute as per the -mcpu=cpu+feature command-line option.
  • tune=<cpu> strings specify the CPU to tune for, as per the -mtune=cpu command-line option.
  • +<feature>, +no<feature> enable or disable the specific feature, for compatibility with GCC target attributes.
  • <feature>, no-<feature> enable or disable the specific feature, for backward compatibility with previous clang releases.

Along with the changes above, the implementation of ACLE intrinsics has been modified so that they are no longer gated on preprocessor macros. Instead, they are enabled based on the current target. This allows intrinsics to be made available in individual functions without requiring the entire file to be compiled for the same target. The following example illustrates the use of the attributes on a function sve2_log:

#include <math.h>
#include <arm_sve.h>

void base_log(float *src, int *dst, int n) {
    for(int i = 0; i < n; i++)
        dst[i] = log2f(src[i]);
}

void __attribute__((target("sve2")))
sve2_log(float *src, int *dst, int n) {
    int i = 0;
    svbool_t p = svwhilelt_b32(i, n);
    while(svptest_any(svptrue_b32(), p)) {
        svfloat32_t d = svld1_f32(p, src+i);
        svint32_t l = svlogb_f32_z(p, d);
        svst1_s32(p, dst+i, l);
        i += svcntw();   // advance by the number of 32-bit elements per vector
        p = svwhilelt_b32(i, n);
    }
}

Improvements to llvm-objdump

In LLVM 16, the output of llvm-objdump for Arm targets has been improved for readability and correctness, making it a more suitable replacement for GNU objdump in LLVM-based toolchains.

Disassembly of big-endian object files now works correctly. Previously, each instruction word was accidentally byte-swapped and disassembled as something entirely different.

Also, unrecognized instructions encountered during disassembly are handled in a more useful manner. Previously, the disassembler would advance by just one byte and try again from an odd-numbered address. That policy makes sense on architectures with variable-length instructions, but never on Arm. The new behavior is to advance by a whole instruction, so that the rest of the file is likely to be disassembled correctly.

LLVM 16 includes other quality improvements on Arm architectures, including bug fixes around Thumb vs. Arm disassembly and .byte directives now emitting the correct byte. Some readability improvements have also been made to instruction encodings to make Arm and 32-bit Thumb easier to tell apart: Arm instructions are now printed as one 8-digit number, and Thumb instructions as two 4-digit numbers with a space in between.

Support for strict floating point on AArch64

Strict floating point semantics have been implemented for AArch64. The clang command-line option -ffp-model=strict is now accepted on AArch64 targets instead of being ignored with a warning. Take this example where an FP division is executed only if it is safe to do so:

float fn(int n, float x, float y) {
  if (n == 0) {
    x += 1;
  } else {
    x += y/n;
  }
  return x;
}

On LLVM 15, compiling with -O2 resulted in the following generated code:

fn(int, float, float):                               // @fn(int, float, float)
        scvtf   s3, w0
        fmov    s2, #1.00000000
        cmp     w0, #0
        fdiv    s1, s1, s3
        fadd    s1, s1, s0
        fadd    s0, s0, s2
        fcsel   s0, s1, s0, ne
        ret

which will execute both branches, including the divide, and select the right result afterwards in the fcsel. Although the functionality of the code is preserved, it results in a spurious FE_DIVBYZERO floating-point exception when n==0. On LLVM 16, compiling with -O2 -ffp-model=strict results in the following code:

fn(int, float, float):                               // @fn(int, float, float)
        cbz     w0, .LBB0_2
        scvtf   s2, w0
        fdiv    s1, s1, s2
        fadd    s0, s0, s1
        ret
.LBB0_2:
        mov     w8, #1
        scvtf   s1, w8
        fadd    s0, s0, s1
        ret

where the two different branches of execution are kept separate, preventing the FP exception from happening.

As a result of supporting strict FP, the options -ftrapping-math and -frounding-math are now also accepted. On one hand, -ftrapping-math ensures that optimizations do not introduce or remove side effects that could be caused by any kind of FP exception, including exceptions that software can detect asynchronously by inspecting the FPSR. Similarly, -frounding-math avoids applying optimizations that assume a specific FP rounding behavior.
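As a small sketch of why this matters (this example is not from the release notes), the following code changes the rounding mode at run time; under -frounding-math or -ffp-model=strict the compiler must not constant-fold the addition or move it across the calls that change the rounding mode:

#include <fenv.h>
#pragma STDC FENV_ACCESS ON

double add_rounding_up(double a, double b) {
    int old = fegetround();
    fesetround(FE_UPWARD);   // round subsequent FP operations toward +infinity
    double sum = a + b;
    fesetround(old);         // restore the previous rounding mode
    return sum;
}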

Support for early Arm architectures in compiler-rt and LLD

LLD can now be used as a linker for Armv4 and Armv4T: it emits thunks compatible with these architectures instead of BX instructions, which are not available on Armv4, or BLX instructions, which are not available on either.

On a related note, support for compiler-rt built-ins was added for ARMv4T, ARMv5TE, and ARMv6, unlocking runtime support for these architectures.

Thanks to this enabling work, it is now possible to have a full LLVM-based toolchain for these 32-bit Arm architectures. As a result, the Linux kernel has added support for building with Clang and LLD on these targets, and Rust programs no longer need to depend on the GNU linker.
