LLVM 21.1.0 was released on August 26, 2025, including contributions from Arm engineers that improve performance across a wide range of Arm targets. This post provides the key highlights.
To find out more about the previous LLVM release, you can read What is new in LLVM 20?
By Ties Stuij
LLVM 21 adds support for the Cortex-A320, an ultra-efficient AArch64 CPU that implements Armv9.2-A. LLVM support for the processor includes a scheduling model based upon the Cortex-A320 Software Optimization Guide.
By Kiran Chandramohan
The scheduling model for Neoverse V2 and Neoverse N2 CPUs was updated with better modelling of issue width. The new values match the numbers in the Software Optimization Guide (SWOG) and improve performance.
Unrolling small two-block loops, such as those used for searching through arrays, can significantly improve performance. A common example is a loop generated by std::find when searching for a specific value in the standard C++ library. In older versions of the GNU libstdc++, std::find was often manually unrolled in the source code. However, modern implementations increasingly rely on the compiler to apply optimizations like vectorization or unrolling automatically. LLVM 21 enables this unrolling for small multi-exit loops on AArch64, allowing the compiler to optimize these patterns more effectively. Performance improvements from this change are notable: 8% in G4 and 6% in G3 for the 523.xalancbmk SPEC 2017 benchmark when using LLVM libc++.
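As a sketch of the pattern (an illustration, not code taken from the libraries), the loop below is a small two-block search with two exits: one when the end of the range is reached, and one when the value is found. LLVM 21 can now unroll this kind of multi-exit loop on AArch64.

// A small two-block, multi-exit search loop, similar to what std::find reduces to.
const int *find_value(const int *first, const int *last, int value) {
    for (; first != last; ++first)   // exit 1: end of the range reached
        if (*first == value)         // exit 2: value found (early exit)
            return first;
    return last;
}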
By Sander De Smalen
The LoopVectorizer is now able to estimate the register pressure more accurately by evaluating each vectorization plan separately. A vectorization plan specifies a particular way to vectorize the loop. Each plan is costed separately for register pressure and operation costs, for different vectorization factors (number of lanes handled per vector iteration) and interleave counts. After analyzing all plans, the vectorizer chooses the best one.
Doing the register usage estimation per plan particularly benefits loops that can use DOT instructions, where the operation internally zero-extends its inputs to a wider vector before doing the reduction. The AdvSIMD and SVE DOT instructions implement a partial reduction from a vector with K lanes into a vector with K/4 lanes; for example, zero-extend 16 x i8 -> 16 x i32, multiply, and accumulate into a 4 x i32 vector. These instructions require a higher vectorization factor to benefit from their use.
Without a DOT instruction, zero-extending 16 x i8 elements to 16 x i32 elements requires a 512-bit vector, which the compiler needs to legalize into 4 x 128-bit vectors, thus increasing the register usage in the loop.
With a DOT instruction, the extends, and therefore also the wider vectors, never materialize. They are handled as part of the instruction and should not be included in the register usage estimation.
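The following loop is a sketch of the kind of reduction that maps onto UDOT (an illustration, not code from the release notes): unsigned 8-bit inputs are multiplied and accumulated into a 32-bit sum, so the widening happens inside the DOT instruction rather than through explicit wide vectors.

#include <cstddef>
#include <cstdint>

// Reduction over u8 inputs accumulated into a u32 sum. On AArch64 at -O3 this
// kind of loop can vectorize with UDOT, avoiding explicit 16 x i32 vectors.
uint32_t dot_u8(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]);
    return sum;
}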
We have seen improvements of up to 2.5x on certain ML workloads from the increased use of DOT instructions.
By Simon Tatham
LLVM 21 improves code generation when the BTI security feature is enabled by eliminating two classes of unnecessary BTI instructions. This improves code size and performance (each BTI removed saves space and time), and strengthens security: every piece of code with a BTI is a potential gadget for an attacker hijacking indirect branches (JOP-style attacks), so the fewer of them, the better.
BTI is no longer generated at the start of a static function if the compiler can prove it is never indirectly called. Additionally, BTI is no longer generated at labels referred to by the asm goto language extension. That extension is heavily used in the Linux kernel to allow kernel tracing with minimal performance cost when the trace is turned off. Removing the unnecessary BTIs improves performance of the kernel when built with Clang.
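A minimal sketch of both cases (hypothetical function names; in practice a function this small would simply be inlined): a static function whose address is never taken, and an asm goto label of the kind used by kernel tracepoints.

// Static function that is only ever called directly: Clang 21 can prove it is
// never indirectly called and omits the BTI at its entry.
static int add_one(int x) { return x + 1; }

int caller(int x) {
    // The label below is only reachable by direct branches, so Clang 21 no
    // longer emits a BTI at it. In the kernel's static-key use case the nop
    // is patched into a branch at runtime.
    asm goto("nop" : : : : slow_path);
    return add_one(x);
slow_path:
    return add_one(x) + 100;
}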
LLVM 21 enables auto-vectorization of simple loops with uncountable (data-dependent) early exits, that is, loops that may exit based on loaded data rather than after a known trip count. In practice, simple scans such as find_first_zero now vectorize with Clang 21. Clang 20 reported that auto-vectorization of such loops was not enabled:
<source>:6:3: remark: loop not vectorized: Auto-vectorization of loops with uncountable early exit is not enabled [-Rpass-analysis=loop-vectorize]
Clang 21 vectorizes the same loop and reports:
<source>:6:3: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]
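A find_first_zero loop of the kind referred to above might look like the following sketch:

#include <cstddef>

// The exit condition depends on loaded data, so the trip count is uncountable.
// Clang 21 can now auto-vectorize this scan.
size_t find_first_zero(const int *data, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (data[i] == 0)   // data-dependent early exit
            return i;
    return n;
}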
By Alexandros Lamprineas
Function Multi-versioning for AArch64 now supports FEAT_CSSC. Also, the target_clones attribute no longer generates a default version unless it is explicitly specified.
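A sketch of both changes, assuming "cssc" is the function multi-versioning feature string for FEAT_CSSC: a CSSC version of the function is generated alongside a default version, and the default is emitted only because it is listed explicitly.

// A CSSC variant (which can use the scalar CNT instruction for population
// count) plus an explicitly requested default version.
__attribute__((target_clones("cssc", "default")))
int popcount64(unsigned long long x) {
    return __builtin_popcountll(x);
}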
By Luke Hutton
Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor operations commonly employed by Deep Neural Networks. The goal is to enable a variety of implementations running on a diverse range of processors, while ensuring consistent results at the TOSA level. Applications or frameworks which target TOSA can be deployed on a wide range of different processors. These include SIMD CPUs, GPUs and custom hardware such as NPUs/TPUs, with defined accuracy and compatibility constraints.
Arm recently released the TOSA 1.0 specification and supporting software, including an MLIR dialect, a serialization format, and a reference implementation. This release marks an important milestone: it is the first release to enforce backwards compatibility guarantees between minor versions of the specification and software. This means that TOSA artifacts produced with this version of the software will continue to work across later minor versions of TOSA.
TOSA software overview:
LLVM 21 is the first release that supports the TOSA MLIR dialect aligned with version 1.0 of the TOSA specification.
Some notable improvements in the dialect since LLVM 20 include:
The argmax, max_pool2d, clamp, maximum, minimum, reduce_max and reduce_min operations are updated to indicate whether NaN values should be propagated to the output or ignored. For example:
%0 = "tosa.const"() <{values = dense<[2.0, 0.0, 0x7fc00000]> : tensor<3xf32>}> : () -> tensor<3xf32> // where 0x7fc00000 is the bit-pattern for NaN tosa.clamp %0 {min_val = 0.0 : f32, max_val = 1.0: f32, nan_mode = "PROPAGATE"} : (tensor<13x21x3xf32>) -> tensor<13x21x3xf32> >>> [1.0, 0.0, nan] tosa.clamp %0 {min_val = 0.0 : f32, max_val = 1.0: f32, nan_mode = "IGNORE"} : (tensor<13x21x3xf32>) -> tensor<13x21x3xf32> >>> [1.0, 0.0, 0]
The dialect implements two levels of checks: verifiers and validation. Verifiers run before and after every pass and focus on identifying operations that are clearly invalid. Validation checks operations against the specification requirements. The dialect acts as a superset of the specification, allowing lowering pipelines to shape IR towards specification compliance.
$ mlir-opt test.mlir --verify-each
$ mlir-opt test.mlir --tosa-validate="profile=pro_int extension=bf16,fft level=8k strict-op-spec-alignment"
Zero points for operations such as conv2d are now represented as operands. This change allows their values to be variable when the EXT-DYNAMIC extension is supported, which is useful when zero point values need to be configurable at runtime.
%input_zp = "tosa.const"() <{values = dense<0.0> : tensor<1xf8E4M3FN>}> : () -> tensor<1xf8E4M3FN> %weight_zp = "tosa.const"() <{values = dense<0.0> : tensor<1xf8E4M3FN>}> : () -> tensor<1xf8E4M3FN> %0 = tosa.conv2d %arg0, %arg1, %arg2, %input_zp, %weight_zp {acc_type = f16, dilation = array<i64: 1, 1>, pad = array<i64: 0, 0, 0, 0>, stride = array<i64: 1, 1>, local_bound = true} : (tensor<1x4x4x4xf8E4M3FN>, tensor<8x1x1x4xf8E4M3FN>, tensor<8xf16>, tensor<1xf8E4M3FN>, tensor<1xf8E4M3FN>) -> tensor<1x4x4x8xf16>
By Pavel Iliin, Paschalis Mpeis
BOLT can now optimize binaries using profiles with SPE branch data on systems with Linux Perf v6.14 or later, as explained in the updated guidance. Users may need to experiment with the sampling period. It may be possible to achieve performance close to instrumentation-based profiles, with lower profile collection overhead than cycle sampling.
BOLT NFC testing mode now runs post-merge tests only on commits that modify the llvm-bolt binary or relevant sources. Our new Arm-managed Buildbot for AArch64 already uses this to help improve code quality.
We collaborated with the Flang community to stabilize the OpenMP implementation by fixing bugs and improving error reporting. This work was coordinated through GitHub issue #110008. Flang now supports OpenMP 3.1 by default.
OpenMP standards allow tasks to be deferred for execution. Deferred execution requires careful handling to ensure that values referenced inside a task region remain valid when the task eventually runs. Flang now includes full support for deferred task execution.
Additional OpenMP 4.0 features have been implemented, including the cancel and cancellation point constructs, SIMD reductions and array expressions in task dependencies.
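Flang implements these constructs for Fortran; as a language-neutral illustration only, the cancel and cancellation point constructs behave like the following C++ OpenMP sketch (not Flang source; cancellation only takes effect when OMP_CANCELLATION is enabled at runtime).

#include <cstddef>

// Parallel search that requests cancellation of the remaining loop iterations
// once a match is found; other threads observe it at the cancellation point.
long find_index(const int *data, std::size_t n, int value) {
    long found = -1;
    #pragma omp parallel for shared(found)
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i] == value) {
            #pragma omp atomic write
            found = static_cast<long>(i);
            #pragma omp cancel for
        }
        #pragma omp cancellation point for
    }
    return found;
}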
The -f[no]openmp-simd flag was added to only process the OpenMP SIMD construct.
Finally, support was added for the -mmacos-version-min flag in Flang, ensuring that code compiled with Flang is compatible with the specified minimum macOS version.
Performance improvements were obtained for the Thornado and E3SM Atmosphere applications by inlining the copying of contiguous arrays of simple types and by enabling SLP vectorization in the Flang pipeline.