This post is published on behalf of Aditya Tewari, Nikhil Gupta and Arm’s PyTorch Engineering Team
As part of the new PyTorch 2.9 release, Arm contributed key enhancements to ensure seamless performance and stability on Arm platforms. These include updated library support through oneDNN and OpenBLAS optimizations, as well as AArch64 reliability fixes. The work aligns with the release's broader focus on expanding hardware compatibility and strengthening deployment reliability across the ecosystem.
The PyTorch 2.9 release brings significant performance and reliability improvements across the Arm CPU backend, strengthening end-to-end performance for both training and inference.
Arm’s engineering teams focused on four key pillars of improvement:
Together, these updates ensure that models running on Arm CPUs benefit from optimized math libraries, improved kernel performance, and consistent compiler behavior across both training and inference.
A key focus for this release was expanding and optimizing the set of PyTorch operators available on AArch64. These included:
These efforts collectively improve PyTorch’s native performance footprint on Arm CPUs.
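As a quick sanity check of what a given build picks up, the hedged sketch below queries torch.backends.cpu.get_cpu_capability() (available in recent PyTorch releases) and exercises a few common CPU operators. The operators shown are illustrative placeholders, not the official list of operators optimized in this release.

```python
import torch

# Report which CPU vector extension this PyTorch build detected.
# On AArch64 this may be e.g. "DEFAULT" or an SVE variant,
# depending on the build and the hardware it runs on.
print(torch.backends.cpu.get_cpu_capability())

# Exercise a few common CPU operators; the exact set of operators
# optimized in 2.9 is illustrative here, not an official list.
x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

print((x @ y).shape)                      # GEMM
print(torch.nn.functional.gelu(x).shape)  # activation
print(torch.softmax(x, dim=-1).shape)     # softmax
```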
Arm’s SVE and SVE2 extensions enable flexible vector lengths (up to 2048 bits), allowing code to scale efficiently across different CPUs. PyTorch 2.9 introduces optimizations that better leverage this hardware capability:
The result is improved throughput and efficiency across transformer, CNN, and mixed-precision workloads.
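To illustrate one such mixed-precision path, here is a minimal sketch using CPU autocast with bfloat16. The model shape is a placeholder, and which ops actually run in bfloat16 depends on the autocast policy of your PyTorch build.

```python
import torch

# A toy transformer-style feed-forward block; sizes are placeholders.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)
x = torch.randn(32, 768)

# CPU autocast runs eligible ops (e.g. Linear/GEMM) in bfloat16,
# one of the mixed-precision paths that benefits from wider vectors.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```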
Compiler and runtime enhancements for Arm
PyTorch 2.9 includes multiple compiler and runtime updates that deliver more efficient execution on Arm CPUs:
These updates make PyTorch’s compiler stack more robust for Arm targets and allow its optimizations to apply more generally. Transformer-based models such as BERT and Llama, which spend over 40% of their runtime in GEMM operations (e.g., torch.nn.Linear), see speedups of up to 2.5×.
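As a concrete illustration, the sketch below compiles a toy GEMM-heavy block with torch.compile using the default Inductor CPU backend. The module and sizes are placeholders, not a benchmark from this release.

```python
import torch

# A toy Linear-heavy block standing in for the GEMM-dominated layers
# of transformer models; names and sizes here are illustrative.
class MLP(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, 4 * dim)
        self.fc2 = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP()
compiled = torch.compile(model)  # Inductor CPU backend on AArch64

x = torch.randn(64, 1024)
with torch.no_grad():
    out = compiled(x)  # first call triggers compilation
print(out.shape)
```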
Long-term reliability and maintainability are ensured through stronger integration and validation in PyTorch’s CI and testing infrastructure.
The integration of OpenBLAS 0.3.30 brings measurable end-to-end improvements across key benchmark suites. The table below shows geometric mean speedups observed across the TorchBench, HuggingFace, and TIMM model families on Arm CPUs:
| Inductor Config | TorchBench | HuggingFace | TIMM Models |
|-----------------|---------------|---------------|---------------|
| aot_inductor | 1.03× → 1.18× | 0.53× → 0.99× | |
| cpp_wrapper | 0.91× → 1.13× | 0.49× → 1.04× | 0.99× → 1.01× |
| default | 0.92× → 1.12× | 0.49× → 1.03× | 1.06× → 1.05× |
The results demonstrate solid performance improvements — particularly for TorchBench and TIMM workloads where matrix multiplication and convolution operations dominate. Even memory-bound HuggingFace transformer models show measurable efficiency gains, reflecting the benefits of the updated OpenBLAS kernels and runtime tuning.
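To make the configuration labels in the table concrete, the hedged sketch below shows one way to exercise each Inductor mode from user code. The model is a placeholder, and the AOTInductor entry points shown (torch.export.export, torch._inductor.aoti_compile_and_package) reflect the APIs exposed in recent PyTorch releases.

```python
import torch
import torch._inductor.config as inductor_config

model = torch.nn.Linear(512, 512).eval()  # placeholder model
x = torch.randn(8, 512)

# "default": plain torch.compile with the Inductor CPU backend.
out = torch.compile(model)(x)

# "cpp_wrapper": have Inductor emit a C++ wrapper around the generated
# kernels (also reachable via the TORCHINDUCTOR_CPP_WRAPPER=1 env var).
inductor_config.cpp_wrapper = True
out = torch.compile(model)(x)

# "aot_inductor": ahead-of-time compile an exported program into a
# package that can later be loaded without recompilation.
ep = torch.export.export(model, (x,))
pkg_path = torch._inductor.aoti_compile_and_package(ep)
loaded = torch._inductor.aoti_load_package(pkg_path)
print(loaded(x).shape)
```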
With PyTorch 2.9, Arm and the PyTorch community continue to demonstrate how to deliver high-performance AI on Arm CPUs.
Future work will focus on:
As the PyTorch ecosystem continues to evolve, Arm remains committed to enabling open, efficient, and scalable AI performance across the global developer community.