By Kyrylo Tkachov
We are happy to announce that LLVM 18 contains an implementation of the ACLE and ABI support for SME and SME2. This major feature includes support for the intrinsics as well as the ABI and language extensions necessary to make use of the new Streaming SVE state and functionality.
#include <arm_sme.h>

void mopa(svfloat32_t v1, svbool_t p1, const float *src, int N)
    __arm_out("za")  // Ignores incoming ZA state, returns new ZA state.
    __arm_streaming  // This function enters and returns in streaming-SVE mode.
{
  // Initialize ZA by zeroing all tiles.
  svzero_za();
  for (int i = 0; i < N; ++i) {
    svbool_t p2 = svwhilelt_b32(0, N);
    svfloat32_t v2 = svld1_f32(p2, &src[i + N]);
    // Predicated outer product of v1 x v2. For 32-bit elements there
    // are 4 tiles (za0.s .. za3.s). This intrinsic stores its results
    // in ZA tile 0.
    svmopa_za32_m(/*tile*/0, p1, p2, v1, v2);
  }
  // ZA state is returned from this function.
}
Above is a C example utilizing the new intrinsics and attributes from the ACLE to write SME and Streaming SVE code.
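For reference, a file like this can be compiled along the following lines (the file name, target triple, and architecture baseline are illustrative; any recent toolchain and target with the +sme extension enabled should do):

clang -O2 --target=aarch64-linux-gnu -march=armv9-a+sme -c mopa.c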
In addition, LLDB now supports debugging of SME and SME2.
By Pavel Iliin & Alexandros Lamprineas
Function Multi Versioning has been extended to support new instructions according to the beta ACLE specification:
rcpc3
mops
Here is an example of using FEAT_MOPS:
#include <string.h>

__attribute__((target_clones("mops")))
void myMemCopy(void *s1, const void *s2, size_t n) {
  memcpy(s1, s2, n);
}

void caller(void *s1, const void *s2, size_t n) {
  myMemCopy(s1, s2, n);
}
Compiling the above example on AArch64 targets with --rtlib=compiler-rt generates a resolver function, myMemCopy.resolver, which checks whether FEAT_MOPS is supported on the target and dispatches to the corresponding version, myMemCopy._Mmops or myMemCopy.default.
myMemCopy(void*, void const*, unsigned long) (._Mmops):
        cpyfp   [x0]!, [x1]!, x2!
        cpyfm   [x0]!, [x1]!, x2!
        cpyfe   [x0]!, [x1]!, x2!
        ret
myMemCopy(void*, void const*, unsigned long) (.default):
        b       memcpy
myMemCopy(void*, void const*, unsigned long) (.resolver):
        str     x30, [sp, #-16]!
        bl      __init_cpu_features_resolver
        adrp    x8, __aarch64_cpu_features+7
        adrp    x9, myMemCopy(void*, void const*, unsigned long) (._Mmops)
        add     x9, x9, :lo12:myMemCopy(void*, void const*, unsigned long) (._Mmops)
        ldrb    w8, [x8, :lo12:__aarch64_cpu_features+7]
        tst     w8, #0x8
        adrp    x8, myMemCopy(void*, void const*, unsigned long) (.default)
        add     x8, x8, :lo12:myMemCopy(void*, void const*, unsigned long) (.default)
        csel    x0, x8, x9, eq
        ldr     x30, [sp], #16
        ret
caller(void*, void const*, unsigned long):
        b       myMemCopy(void*, void const*, unsigned long)
By John Brawn
clang now supports the -mbranch-protection=gcs option (enabled by default when using -mbranch-protection=standard), which marks the emitted object as compatible with the Guarded Control Stack (GCS) 2022 A-profile extension. When all input objects to a link are compatible with GCS, lld will also mark the output executable as compatible with GCS. This does not change code generation, as the code clang generates is already GCS-compatible, but the marking is required for a GCS-aware operating system to enable GCS when running the executable.
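As a quick, hedged illustration (the file name is made up, and the exact note text varies between binutils/LLVM versions), the GCS marking can be inspected in the GNU property note of an object file:

clang --target=aarch64-linux-gnu -mbranch-protection=standard -c foo.c
readelf -n foo.o    # the .note.gnu.property section should report GCS
                    # alongside BTI and PAC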
By Momchil Velikov
To help the security hardening efforts ongoing in the ecosystem, LLVM 18 now supports stack clash protection for AArch64 targets.
A common approach to allowing the stack area allocated to a thread to grow is to place a guard region at the end of the stack which is inaccessible to the thread. When a thread attempts to grow its stack (for example, when establishing the activation frame after a function call), the kernel or runtime system takes an exception and can either extend the area or terminate the process.

It is possible, however, for a thread to allocate so much memory in one step that it skips over and bypasses the guard area. In this case the stack effectively clashes with other areas (for example, with the heap), leading to potentially exploitable data corruption.

The stack clash protection mechanism (-fstack-clash-protection) ensures that the guard area at the top of the stack cannot be bypassed as described above. This is done by emitting stack allocation instructions and instructions that access the stack (stack probes) in such a way that the stack grows in increments no larger than the guard region, and each increment is accessed before the next one is allocated. Any attempt to move the stack pointer past the guard region therefore faults in the guard region instead of skipping over it. The probe interval defaults to 4096 bytes (as in the example below) and can be adjusted with -mstack-probe-size=N.
For example, when this source file, in which a function allocates a variable-length array:
int g(char *, unsigned);

int f(unsigned n) {
  char v[n];
  return g(v, n);
}
is compiled using clang -target aarch64-linux -O2 -fstack-clash-protection, the compiler will generate code like the following.
f:
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
// Compute the 16-byte aligned size of the VLA and the final value of
// the stack pointer into x0.
        mov     w9, w0
        mov     x8, sp
        mov     w1, w0
        add     x9, x9, #15
        and     x9, x9, #0x1fffffff0
        sub     x0, x8, x9
// Allocate the stack in chunks of no more than 4096 bytes, probing
// each chunk, so the guard region cannot be skipped over.
.LBB0_1:
        sub     sp, sp, #1, lsl #12            // =4096
        cmp     sp, x0
        b.le    .LBB0_3
        str     xzr, [sp]                      // stack probe
        b       .LBB0_1
.LBB0_3:
// Allocate the remainder and probe it.
        mov     sp, x0
        ldr     xzr, [sp]                      // stack probe
        bl      g
        mov     sp, x29
        ldp     x29, x30, [sp], #16
        ret
By Kyrylo Tkachov & Hassnaa Hamdi
Consider the following pattern:
Min(Max(element1, element3), element2);
When SVE2.1 is available, this pattern can be optimized into a single clamp instruction. This was implemented by adding the corresponding instruction selection patterns.
For example, take this code:
define <vscale x 16 x i8> @uclampi8(<vscale x 16 x i8> %c, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
  %min = tail call <vscale x 16 x i8> @llvm.umax.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)
  %res = tail call <vscale x 16 x i8> @llvm.umin.nxv16i8(<vscale x 16 x i8> %min, <vscale x 16 x i8> %c)
  ret <vscale x 16 x i8> %res
}
The generated code before optimization was:
uclampi8:
        ptrue   p0.b
        umax    z1.b, p0/m, z1.b, z2.b
        umin    z0.b, p0/m, z0.b, z1.b
        ret
The generated code after optimization is:
uclampi8:
        uclamp  z0.b, z1.b, z2.b
        ret
So a pair of min/max instructions is replaced by a single clamp instruction.
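As a source-level sketch (the function below is made up for illustration), the same pattern can be written with SVE intrinsics; compiled with SVE2.1 enabled, for example with -march=armv9-a+sve2p1, it should now lower to a single uclamp:

#include <arm_sve.h>

// Computes min(max(a, b), c) per element, matching the pattern that
// the new instruction selection patterns recognize.
svuint8_t clamp_u8(svuint8_t c, svuint8_t a, svuint8_t b) {
  svbool_t pg = svptrue_b8();
  return svmin_u8_x(pg, svmax_u8_x(pg, a, b), c);
}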
By Ties Stuij
For security and performance reasons, some systems provide hardware mechanisms to mark certain memory address ranges as eXecute-Only (XOM): the processor is only allowed to execute code from this memory, not read it. Programs produced by compilers are normally not set up for this, as code sections also contain data that instructions need, for example addresses to jump to. Accessing this data requires read permission in addition to execute permission.
LLVM could already handle XOM for Armv7-M and Armv8-M. These architectures have a number of instructions that can load immediate data with relative ease. For example, instead of loading a branch target from a constant island, we can use the MOVT and MOVW instructions to materialize the address in two instructions. Other helpful instructions are the table branch instruction TBB and the wide branch instruction B.W. To generate this execute-only compliant code, you pass -mexecute-only on the command line (or the GCC-compatible alias -mpure-code), which places the code in an ELF section marked with the SHF_ARM_PURECODE attribute.
For LLVM 18, we implemented execute-only support for Armv6-M. For Armv6-M, branching via an immediate address is more involved, as the architecture lacks instructions like MOVT, MOVW, and TBB. The size of immediate values that Armv6-M instructions can materialize is limited to a byte, so quite a bit of immediate loading and shifting is needed to construct an address.
A typical sequence would be something like this:
movs    r3, #:upper8_15:#.LC0
lsls    r3, #8
adds    r3, #:upper0_7:#.LC0
lsls    r3, #8
adds    r3, #:lower8_15:#.LC0
lsls    r3, #8
adds    r3, #:lower0_7:#.LC0
ldr     r3, [r3]
The :upperX_Y:#<label> syntax extracts bits X through Y of the address denoted by <label>. For this to work we also needed to implement the relocations that correspond to these designators in LLVM and LLD:

R_ARM_THM_ALU_ABS_G3 (:upper8_15:)
R_ARM_THM_ALU_ABS_G2_NC (:upper0_7:)
R_ARM_THM_ALU_ABS_G1_NC (:lower8_15:)
R_ARM_THM_ALU_ABS_G0_NC (:lower0_7:)
See the Arm 32-bit ELF ABI extension document for more details: https://github.com/ARM-software/abi-aa/blob/main/aaelf32/aaelf32.rst.
Besides this, some work was needed in LLVM and the LLD linker to stop emitting data into code sections, for example in switch tables or thunks.
Just as with Armv7-M and Armv8-M, to emit XOM-compliant code for Armv6-M pass -mexecute-only on the command-line.
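For example (the file name and target triple are illustrative):

clang --target=armv6m-none-eabi -mexecute-only -c foo.c
llvm-readelf -S foo.o   # text sections should carry the SHF_ARM_PURECODE flag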
By Simon Tatham
In LLVM 17, the clang driver added support for run-time YAML-based configuration of a "multilib" setup, that is, more than one set of headers and libraries selected by the compiler flags. Previously, the set of libraries had to be hard-coded into the clang driver at compile time.
LLVM 18 extends this system by providing a means of marking library subdirectories as mutually exclusive, so that even if two directories match the user's compile flags, only one of them will be selected. This streamlines the process of writing a multilib.yaml file, by allowing you to specify the compile settings that each library is compatible with in principle, which is often simpler than listing precisely the settings that it would be optimal for. Then you can sort your libraries by desirability (for example, performance), and the result will be that if multiple libraries are compatible with the settings, the most desirable of those will be chosen.
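As a sketch of what this looks like (the directory layout, flags, and group name below are invented for illustration), a multilib.yaml fragment using an exclusive group might be:

MultilibVersion: 1.0

Groups:
- Name: stdlib
  Type: Exclusive            # select at most one variant from this group

Variants:
# Both variants below are compatible with the same flags, so both would
# normally match; the exclusive group ensures only one is used. Sort the
# variants by desirability, with the most desirable library last, as a
# later matching variant takes priority within the group.
- Dir: thumb/v6-m/size
  Flags: [--target=thumbv6m-unknown-none-eabi]
  Group: stdlib
- Dir: thumb/v6-m/speed
  Flags: [--target=thumbv6m-unknown-none-eabi]
  Group: stdlib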