What is new in LLVM 21?

Volodymyr Turanskyy
October 24, 2025
8 minute read time.

LLVM 21.1.0 was released on August 26, 2025, including contributions from Arm engineers that improve performance across a wide range of Arm targets. This post provides the key highlights.

To find out more about the previous LLVM release, you can read What is new in LLVM 20?

New architecture and CPU support

By Ties Stuij

LLVM 21 adds support for the Cortex-A320, an ultra-efficient AArch64 CPU that implements Armv9.2-A. LLVM support for the processor includes a scheduling model based upon the Cortex-A320 Software Optimization Guide.

Performance improvements

Improved scheduling model for Neoverse

By Kiran Chandramohan

The scheduling model for Neoverse V2 and Neoverse N2 CPUs was updated with better modelling of issue width. The new values match the numbers in the Software Optimization Guide (SWOG) and improve performance.

Unrolling search loops

By Kiran Chandramohan

Unrolling small two-block loops, such as those used for searching through arrays, can significantly improve performance. A common example is a loop generated by std::find when searching for a specific value in the standard C++ library. In older versions of the GNU libstdc++, std::find was often manually unrolled in the source code. However, modern implementations increasingly rely on the compiler to apply optimizations like vectorization or unrolling automatically. LLVM 21 enables this unrolling for small multi-exit loops on AArch64, allowing the compiler to optimize these patterns more effectively. Performance improvements from this change are notable: 8% in G4 and 6% in G3 for the 523.xalancbmk SPEC 2017 benchmark when using LLVM libc++.
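
For illustration, the pattern in question looks roughly like the following C++ sketch (a hypothetical function, not taken from the benchmark). It is a small two-block loop with a data-dependent exit of the kind LLVM 21 can now unroll on AArch64:

// Hypothetical sketch of the two-block search loop that std::find boils down
// to: one block tests the current element, the other advances the iterator.
const int* find_value(const int* first, const int* last, int value) {
    for (const int* it = first; it != last; ++it) {
        if (*it == value)   // early exit when the value is found
            return it;
    }
    return last;
}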

Improved register pressure estimation in LLVM's loop vectorizer

By Sander De Smalen

The LoopVectorizer is now able to estimate the register pressure more accurately by evaluating each vectorization plan separately. A vectorization plan specifies a particular way to vectorize the loop. Each plan is costed separately for register pressure and operation costs, for different vectorization factors (number of lanes handled per vector iteration) and interleave counts. After analyzing all plans, the vectorizer chooses the best one.

Doing the register usage estimation per plan particularly benefits loops that can use DOT instructions where the operation (internally) zero-extends its inputs to a wider vector before doing the reduction. The AdvSIMD and SVE DOT instructions implement a partial reduction from a vector with K lanes into a vector with K/4 lanes. For example: zero-extend 16 x i8 to 16 x i32, multiply, and accumulate into a 4 x i32 vector. These instructions require a higher vectorization factor to benefit from their use.

Without a DOT instruction, zero-extending 16 x i8 elements to 16 x i32 elements requires a 512-bit vector, which the compiler needs to legalize into 4 x 128-bit vectors, thus increasing the register usage in the loop.

With a DOT instruction, the extends, and therefore also the wider vectors, never materialize. They are handled as part of the instruction and should not be included in the register usage estimation.

We have seen improvements of up to 2.5x on certain ML workloads from the increased use of DOT instructions.
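
As a rough illustration, a reduction of the following shape (a hypothetical sketch, not one of the ML workloads mentioned above) is the kind of loop that maps onto the DOT instructions:

#include <cstdint>
#include <cstddef>

// Hypothetical sketch: 8-bit inputs are widened, multiplied, and accumulated
// into a 32-bit sum. With a DOT instruction, the widened vectors never
// materialize, so the vectorizer can pick a higher vectorization factor
// without the register pressure of legalizing wide vectors.
uint32_t dot_u8(const uint8_t* a, const uint8_t* b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]);
    return sum;
}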

Code generation improvements

Unnecessary BTI instructions eliminated

By Simon Tatham

LLVM 21 improves code generation when the BTI security feature is enabled, by eliminating two classes of unnecessary BTI instructions. This improves code size and performance (each BTI removed saves space and time), and strengthens security: every piece of code with a BTI is a potential gadget for a jump-oriented programming (JOP) attacker, so the fewer of them, the better.

BTI is no longer generated at the start of a static function if the compiler can prove it is never indirectly called. Additionally, BTI is no longer generated at labels referred to by the asm goto language extension. That extension is heavily used in the Linux kernel to allow kernel tracing with minimal performance cost when the trace is turned off. Removing the unnecessary BTIs improves performance of the kernel when built with Clang.
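
Both cases can be illustrated with a hypothetical sketch (compiled for AArch64 with branch protection enabled, for example -mbranch-protection=standard):

// Hypothetical sketch. helper() is static and never has its address taken,
// so it can only be called directly; Clang 21 can omit its BTI landing pad.
static int helper(int x) {
    return x * 2;
}

int traced(int x) {
    // Labels used only as asm goto targets no longer receive a BTI either;
    // this mirrors the Linux kernel's tracing pattern in simplified form.
    asm goto("nop" : : : : slow_path);
    return helper(x);
slow_path:
    return helper(x) + 1;
}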

Early exit loop vectorization

By Kiran Chandramohan

LLVM 21 enables auto-vectorization of simple loops with uncountable (data-dependent) early exits, that is, loops that may break out based on loaded data rather than a known trip count. In practice, simple scans such as find_first_zero now vectorize under Clang 21. The previous version, Clang 20, reported that auto-vectorization of uncountable early exits was not enabled.
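
A minimal sketch of such a loop (hypothetical, but producing remarks of the kind shown below) is:

// Hypothetical find_first_zero: the trip count is unknown at compile time
// because the loop may exit early depending on the data that is loaded.
unsigned find_first_zero(const int* data, unsigned n) {
    for (unsigned i = 0; i < n; ++i) {
        if (data[i] == 0)   // uncountable (data-dependent) early exit
            return i;
    }
    return n;
}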

Example remarks output:
  • Clang 20:

<source>:6:3: remark: loop not vectorized: Auto-vectorization of loops with uncountable early exit is not enabled [-Rpass-analysis=loop-vectorize]

  • Clang 21:

<source>:6:3: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]

Function Multi-versioning improvements

By Alexandros Lamprineas

Function Multi-versioning for AArch64 now supports FEAT_CSSC. Also, the target_clones attribute no longer generates a default version unless it is explicitly specified.
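
As a hypothetical example (the feature names shown are illustrative), a target_clones declaration with an explicit default version looks like this:

// Hypothetical sketch: clones are generated for the listed features, and the
// "default" version is emitted only because it is explicitly requested.
__attribute__((target_clones("cssc", "sve2", "default")))
int byte_sum(const unsigned char* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}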

Tensor Operator Set Architecture

By Luke Hutton

Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor operations commonly employed by Deep Neural Networks. The goal is to enable a variety of implementations running on a diverse range of processors, while ensuring consistent results at the TOSA level. Applications or frameworks which target TOSA can be deployed on a wide range of different processors. These include SIMD CPUs, GPUs and custom hardware such as NPUs/TPUs, with defined accuracy and compatibility constraints.

Arm recently released the TOSA 1.0 specification and supporting software, including an MLIR dialect, a serialization format, and a reference implementation. This release marks an important milestone: it is the first release to enforce backwards-compatibility guarantees between minor versions of the specification and software. This means TOSA artifacts produced with this version of the software will continue to work across later minor versions of TOSA.

TOSA software overview:

[Figure: TOSA software architecture]

LLVM 21 is the first release that supports the TOSA MLIR dialect aligned with version 1.0 of the TOSA specification.

Some notable improvements in the dialect since LLVM 20 include:

  • User-defined NaN propagation policy.

          Updates the argmax, max_pool2d, clamp, maximum, minimum, reduce_max and reduce_min operations to indicate whether NaN values should be propagated to the output or ignored. For example:

%0 = "tosa.const"() <{values = dense<[2.0, 0.0, 0x7fc00000]> : tensor<3xf32>}> : () -> tensor<3xf32>  // where 0x7fc00000 is the bit-pattern for NaN
%1 = tosa.clamp %0 {min_val = 0.0 : f32, max_val = 1.0 : f32, nan_mode = "PROPAGATE"} : (tensor<3xf32>) -> tensor<3xf32>
>>> [1.0,  0.0,  nan]
%2 = tosa.clamp %0 {min_val = 0.0 : f32, max_val = 1.0 : f32, nan_mode = "IGNORE"} : (tensor<3xf32>) -> tensor<3xf32>
>>> [1.0,  0.0,  0]

  • Improved verification of supported / unsupported operations.

          The dialect implements two levels of checks: verifiers and validation. Verifiers run before and after every pass and focus on identifying operations that are clearly invalid. Validation checks operations against the specification requirements. The dialect acts as a superset of the specification, allowing lowering pipelines to shape IR towards specification compliance.

$ mlir-opt test.mlir --verify-each
$ mlir-opt test.mlir --tosa-validate="profile=pro_int extension=bf16,fft level=8k strict-op-spec-alignment"

  • Zero-points are now inputs, rather than attributes.

          This change allows their values to be variable when the extension EXT-DYNAMIC is supported. This functionality is useful when zero point values need to be configurable at runtime.

%input_zp = "tosa.const"() <{values = dense<0.0> : tensor<1xf8E4M3FN>}> : () -> tensor<1xf8E4M3FN>
%weight_zp = "tosa.const"() <{values = dense<0.0> : tensor<1xf8E4M3FN>}> : () -> tensor<1xf8E4M3FN>
%0 = tosa.conv2d %arg0, %arg1, %arg2, %input_zp, %weight_zp {acc_type = f16, dilation = array<i64: 1, 1>, pad = array<i64: 0, 0, 0, 0>, stride = array<i64: 1, 1>, local_bound = true} : (tensor<1x4x4x4xf8E4M3FN>, tensor<8x1x1x4xf8E4M3FN>, tensor<8xf16>, tensor<1xf8E4M3FN>, tensor<1xf8E4M3FN>) -> tensor<1x4x4x8xf16>

Tools improvements

BOLT improvements

By Pavel Iliin, Paschalis Mpeis

BOLT can now optimize binaries using profiles with SPE branch data on systems with Linux Perf v6.14 or later, as explained in the updated guidance. Users may need to experiment with the sampling period. It may be possible to achieve performance close to instrumentation-based profiles, with lower profile collection overhead than cycle sampling.

BOLT NFC testing mode now runs post-merge tests only on commits that modify the llvm-bolt binary or relevant sources. Our new Arm-managed Buildbot for AArch64 already uses this to help improve code quality.

Flang improvements

By Kiran Chandramohan

We collaborated with the Flang community to stabilize the OpenMP implementation by fixing bugs and improving error reporting. This work was coordinated through GitHub issue #110008. Flang now supports OpenMP 3.1 by default.

OpenMP standards allow tasks to be deferred for execution. Deferred execution requires careful handling to ensure that values referenced inside a task region remain valid when the task eventually runs. Flang now includes full support for deferred task execution.

Additional OpenMP 4.0 features have been implemented, including the cancel and cancellation point constructs, SIMD reductions and array expressions in task dependencies.

The -f[no-]openmp-simd flag was added to process only the OpenMP simd constructs.

Finally, support was added for the -mmacos-version-min flag in Flang, ensuring that code compiled with Flang is compatible with the specified minimum macOS version.

Performance improvements were obtained for the Thornado and E3SM Atmosphere applications by inlining the copying of contiguous arrays of simple types and by enabling SLP vectorization in the Flang pipeline.
