By Kyrylo Tkachov
We are happy to announce that LLVM 18 contains an implementation of the ACLE and ABI support for SME and SME2. This major feature includes support for the intrinsics as well as the ABI and language extensions necessary to make use of the new Streaming SVE state and functionality.
#include <arm_sme.h>

void mopa(svfloat32_t v1, svbool_t p1, const float *src, int N)
    __arm_out("za")  // Ignores incoming ZA state, returns new ZA state.
    __arm_streaming  // This function enters and returns in streaming-SVE mode.
{
  // Initialize ZA by zeroing all tiles.
  svzero_za();
  for (int i = 0; i < N; ++i) {
    svbool_t p2 = svwhilelt_b32(0, N);
    svfloat32_t v2 = svld1_f32(p2, &src[i + N]);
    // Predicated outer product of v1 x v2. For 32-bit elements there
    // are 4 tiles (za0.s .. za3.s). This intrinsic stores its results
    // in ZA tile 0.
    svmopa_za32_m(/*tile*/0, p1, p2, v1, v2);
  }
  // ZA state is returned from this function.
}
Above is a C example utilizing the new intrinsics and attributes from the ACLE to write SME and Streaming SVE code.
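For reference, a file like this can be compiled along the following lines (the file name, target triple, and architecture baseline are illustrative; any recent toolchain and target with the +sme extension enabled should do):

clang -O2 --target=aarch64-linux-gnu -march=armv9-a+sme -c mopa.c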
In addition, LLDB now supports debugging of SME and SME2.
By Pavel Iliin & Alexandros Lamprineas
Function Multi Versioning has been extended to support new instructions according to the beta ACLE specification:
rcpc3
mops
Here is an example of using FEAT_MOPS:
#include <string.h>

__attribute__((target_clones("mops")))
void myMemCopy(void *s1, const void *s2, size_t n) {
  memcpy(s1, s2, n);
}

void caller(void *s1, const void *s2, size_t n) {
  myMemCopy(s1, s2, n);
}
Compiling the above example on AArch64 targets with --rtlib=compiler-rt generates a resolver function, myMemCopy.resolver, which checks whether FEAT_MOPS is supported on the target and dispatches to the corresponding version, myMemCopy._Mmops or myMemCopy.default.
myMemCopy(void*, void const*, unsigned long) (._Mmops):
        cpyfp   [x0]!, [x1]!, x2!
        cpyfm   [x0]!, [x1]!, x2!
        cpyfe   [x0]!, [x1]!, x2!
        ret
myMemCopy(void*, void const*, unsigned long) (.default):
        b       memcpy
myMemCopy(void*, void const*, unsigned long) (.resolver):
        str     x30, [sp, #-16]!
        bl      __init_cpu_features_resolver
        adrp    x8, __aarch64_cpu_features+7
        adrp    x9, myMemCopy(void*, void const*, unsigned long) (._Mmops)
        add     x9, x9, :lo12:myMemCopy(void*, void const*, unsigned long) (._Mmops)
        ldrb    w8, [x8, :lo12:__aarch64_cpu_features+7]
        tst     w8, #0x8
        adrp    x8, myMemCopy(void*, void const*, unsigned long) (.default)
        add     x8, x8, :lo12:myMemCopy(void*, void const*, unsigned long) (.default)
        csel    x0, x8, x9, eq
        ldr     x30, [sp], #16
        ret
caller(void*, void const*, unsigned long):
        b       myMemCopy(void*, void const*, unsigned long)
By John Brawn
clang now supports the -mbranch-protection=gcs option (enabled by default when using -mbranch-protection=standard), which marks the emitted object as compatible with the Guarded Control Stack (GCS) 2022 A-profile extension. When all input objects to a link are compatible with GCS, lld will also mark the output executable as compatible with GCS. This does not change code generation, as the code clang generates is already GCS-compatible, but the marking is required for a GCS-aware operating system to enable GCS when running the executable.
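As a quick, hedged illustration (the file name is made up, and the exact note text varies between binutils/LLVM versions), the GCS marking can be inspected in the GNU property note of an object file:

clang --target=aarch64-linux-gnu -mbranch-protection=standard -c foo.c
readelf -n foo.o    # the .note.gnu.property section should report GCS
                    # alongside BTI and PAC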
By Momchil Velikov
To help the security hardening efforts ongoing in the ecosystem, LLVM 18 now supports stack clash protection for AArch64 targets.
A common approach to allowing the stack area allocated to a thread to grow is to place a guard region at the end of the stack which is inaccessible to the thread. When a thread attempts to grow its stack (for example, when establishing the activation frame after a function call), the kernel or runtime system takes an exception and can either extend the area or terminate the process.

It is possible, however, for a thread to allocate so much memory in one step that it skips over and bypasses the guard area. In this case the stack effectively clashes with other areas (for example, with the heap), leading to potentially exploitable data corruption.

The stack clash protection mechanism (-fstack-clash-protection) ensures that the guard area at the top of the stack cannot be bypassed as described above. This is done by emitting stack allocation instructions and instructions that access the stack (stack probes) in such a way that the stack grows in increments no larger than the guard region, and each increment is accessed before the next one is allocated. Any attempt to move the stack pointer past the guard region therefore faults in the guard region instead of skipping over it. The probe interval defaults to 4096 bytes (as in the example below) and can be adjusted with -mstack-probe-size=N.
For example, when this source file, in which a function allocates a variable-length array:
int g(char *, unsigned);

int f(unsigned n) {
  char v[n];
  return g(v, n);
}
is compiled using clang -target aarch64-linux -O2 -fstack-clash-protection, the compiler will generate code like the following.
f:
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
// Compute the 16-byte aligned size of the VLA and the final value of
// the stack pointer into x0.
        mov     w9, w0
        mov     x8, sp
        mov     w1, w0
        add     x9, x9, #15
        and     x9, x9, #0x1fffffff0
        sub     x0, x8, x9
// Allocate the stack in chunks of no more than 4096 bytes, probing
// each chunk, so the guard region cannot be skipped over.
.LBB0_1:
        sub     sp, sp, #1, lsl #12            // =4096
        cmp     sp, x0
        b.le    .LBB0_3
        str     xzr, [sp]                      // stack probe
        b       .LBB0_1
.LBB0_3:
// Allocate the remainder and probe it.
        mov     sp, x0
        ldr     xzr, [sp]                      // stack probe
        bl      g
        mov     sp, x29
        ldp     x29, x30, [sp], #16
        ret
By Kyrylo Tkachov & Hassnaa Hamdi
Consider the following pattern:
Min(Max(element1, element3), element2);
When SVE2.1 is available, this pattern can be optimized into a single clamp instruction. This was implemented by adding the corresponding instruction selection patterns.
For example, take this code:
define <vscale x 16 x i8> @uclampi8(<vscale x 16 x i8> %c, <vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
  %min = tail call <vscale x 16 x i8> @llvm.umax.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b)
  %res = tail call <vscale x 16 x i8> @llvm.umin.nxv16i8(<vscale x 16 x i8> %min, <vscale x 16 x i8> %c)
  ret <vscale x 16 x i8> %res
}
The generated code before optimization was:
uclampi8:
        ptrue   p0.b
        umax    z1.b, p0/m, z1.b, z2.b
        umin    z0.b, p0/m, z0.b, z1.b
        ret
The generated code after optimization is:
uclampi8:
        uclamp  z0.b, z1.b, z2.b
        ret
So a pair of min/max instructions is replaced by a single clamp instruction.
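As a source-level sketch (the function below is made up for illustration), the same pattern can be written with SVE intrinsics; compiled with SVE2.1 enabled, for example with -march=armv9-a+sve2p1, it should now lower to a single uclamp:

#include <arm_sve.h>

// Computes min(max(a, b), c) per element, matching the pattern that
// the new instruction selection patterns recognize.
svuint8_t clamp_u8(svuint8_t c, svuint8_t a, svuint8_t b) {
  svbool_t pg = svptrue_b8();
  return svmin_u8_x(pg, svmax_u8_x(pg, a, b), c);
}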
By Ties Stuij
For security and performance reasons, some systems provide hardware mechanisms to mark certain memory address ranges as eXecute-Only (XOM): the processor is only allowed to execute code from this memory, not read it. Programs produced by compilers are normally not set up for this, as code sections also contain data that instructions need, for example addresses to jump to. Accessing this data requires read permission in addition to execute permission.
LLVM could already handle XOM for Armv7-M and Armv8-M. These architectures have a number of instructions that can load immediate data with relative ease. For example, instead of loading a branch target from a constant island, we can use the MOVT and MOVW instructions to materialize the address in two instructions. Other helpful instructions are the table branch instruction TBB and the wide branch instruction B.W. To generate this execute-only compliant code, you pass -mexecute-only on the command line (or the GCC-compatible alias -mpure-code), which places the code in an ELF section marked with the SHF_ARM_PURECODE attribute.
For LLVM 18, we implemented execute-only support for Armv6-M. For Armv6-M, branching via an immediate address is more involved, as the architecture lacks instructions like MOVT, MOVW, and TBB. The size of immediate values that Armv6-M instructions can materialize is limited to a byte, so quite a bit of immediate loading and shifting is needed to construct an address.
A typical sequence would be something like this:
movs    r3, #:upper8_15:#.LC0
lsls    r3, #8
adds    r3, #:upper0_7:#.LC0
lsls    r3, #8
adds    r3, #:lower8_15:#.LC0
lsls    r3, #8
adds    r3, #:lower0_7:#.LC0
ldr     r3, [r3]
The :upperX_Y:#<label> syntax extracts bits X through Y of the address denoted by <label>. For this to work we also needed to implement the relocations that correspond to these designators in LLVM and LLD:

R_ARM_THM_ALU_ABS_G3 (:upper8_15:)
R_ARM_THM_ALU_ABS_G2_NC (:upper0_7:)
R_ARM_THM_ALU_ABS_G1_NC (:lower8_15:)
R_ARM_THM_ALU_ABS_G0_NC (:lower0_7:)
See the Arm 32-bit ELF ABI extension document for more details: https://github.com/ARM-software/abi-aa/blob/main/aaelf32/aaelf32.rst.
Besides this, some work was needed in LLVM and the LLD linker to stop emitting data into code sections, for example in switch tables or thunks.
Just as with Armv7-M and Armv8-M, to emit XOM-compliant code for Armv6-M pass -mexecute-only on the command-line.
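For example (the file name and target triple are illustrative):

clang --target=armv6m-none-eabi -mexecute-only -c foo.c
llvm-readelf -S foo.o   # text sections should carry the SHF_ARM_PURECODE flag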
By Simon Tatham
In LLVM 17, the clang driver added support for run-time YAML-based configuration of a "multilib" setup, that is, more than one set of headers and libraries selected by the compiler flags. Previously, the set of libraries had to be hard-coded into the clang driver at compile time.
LLVM 18 extends this system by providing a means of marking library subdirectories as mutually exclusive, so that even if two directories match the user's compile flags, only one of them will be selected. This streamlines the process of writing a multilib.yaml file, by allowing you to specify the compile settings that each library is compatible with in principle, which is often simpler than listing precisely the settings that it would be optimal for. Then you can sort your libraries by desirability (for example, performance), and the result will be that if multiple libraries are compatible with the settings, the most desirable of those will be chosen.
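As a sketch of what this looks like (the directory layout, flags, and group name below are invented for illustration), a multilib.yaml fragment using an exclusive group might be:

MultilibVersion: 1.0

Groups:
- Name: stdlib
  Type: Exclusive            # select at most one variant from this group

Variants:
# Both variants below are compatible with the same flags, so both would
# normally match; the exclusive group ensures only one is used. Sort the
# variants by desirability, with the most desirable library last, as a
# later matching variant takes priority within the group.
- Dir: thumb/v6-m/size
  Flags: [--target=thumbv6m-unknown-none-eabi]
  Group: stdlib
- Dir: thumb/v6-m/speed
  Flags: [--target=thumbv6m-unknown-none-eabi]
  Group: stdlib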