This post is published on behalf of Aditya Tewari, Nikhil Gupta and Arm’s PyTorch Engineering Team
As part of the new PyTorch 2.9 release, Arm contributed key enhancements to ensure seamless performance and stability on Arm platforms. These include updated library support through oneDNN and OpenBLAS optimizations, as well as AArch64 reliability fixes. The work aligns with the release's broader focus on expanding hardware compatibility and strengthening deployment reliability across the ecosystem.
The PyTorch 2.9 release brings significant performance and reliability improvements across the Arm CPU backend, strengthening end-to-end performance for both training and inference.
Arm’s engineering teams focused on four key pillars of improvement:
Together, these updates ensure that models running on Arm CPUs benefit from optimized math libraries, improved kernel performance, and consistent compiler behavior across both training and inference.
A key focus for this release was expanding and optimizing the set of PyTorch operators available on AArch64. These included:
These efforts collectively improve PyTorch’s native performance footprint on Arm CPUs.
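As a quick sanity check of what a given build picks up, the hedged sketch below queries torch.backends.cpu.get_cpu_capability() (available in recent PyTorch releases) and exercises a few common CPU operators. The operators shown are illustrative placeholders, not the official list of operators optimized in this release.

```python
import torch

# Report which CPU vector extension this PyTorch build detected.
# On AArch64 this may be e.g. "DEFAULT" or an SVE variant,
# depending on the build and the hardware it runs on.
print(torch.backends.cpu.get_cpu_capability())

# Exercise a few common CPU operators; the exact set of operators
# optimized in 2.9 is illustrative here, not an official list.
x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

print((x @ y).shape)                      # GEMM
print(torch.nn.functional.gelu(x).shape)  # activation
print(torch.softmax(x, dim=-1).shape)     # softmax
```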
Arm’s SVE and SVE2 extensions enable flexible vector lengths (up to 2048 bits), allowing code to scale efficiently across different CPUs. PyTorch 2.9 introduces optimizations that better leverage this hardware capability:
The result is improved throughput and efficiency across transformer, CNN, and mixed-precision workloads.
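To illustrate one such mixed-precision path, here is a minimal sketch using CPU autocast with bfloat16. The model shape is a placeholder, and which ops actually run in bfloat16 depends on the autocast policy of your PyTorch build.

```python
import torch

# A toy transformer-style feed-forward block; sizes are placeholders.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)
x = torch.randn(32, 768)

# CPU autocast runs eligible ops (e.g. Linear/GEMM) in bfloat16,
# one of the mixed-precision paths that benefits from wider vectors.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```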
Compiler and runtime enhancements for Arm
PyTorch 2.9 includes multiple compiler and runtime updates that deliver more efficient execution on Arm CPUs:
These updates make PyTorch’s compiler stack more robust for Arm targets and allow its optimizations to apply more generally. Transformer-based models such as BERT and Llama, which spend over 40% of their runtime in GEMM operations (e.g., torch.nn.Linear), see speedups of up to 2.5×.
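As a concrete illustration, the sketch below compiles a toy GEMM-heavy block with torch.compile using the default Inductor CPU backend. The module and sizes are placeholders, not a benchmark from this release.

```python
import torch

# A toy Linear-heavy block standing in for the GEMM-dominated layers
# of transformer models; names and sizes here are illustrative.
class MLP(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, 4 * dim)
        self.fc2 = torch.nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP()
compiled = torch.compile(model)  # Inductor CPU backend on AArch64

x = torch.randn(64, 1024)
with torch.no_grad():
    out = compiled(x)  # first call triggers compilation
print(out.shape)
```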
Long-term reliability and maintainability are ensured through stronger integration and validation in PyTorch’s CI and testing infrastructure.
The integration of OpenBLAS 0.3.30 brings measurable end-to-end improvements across key benchmark suites. The table below shows geometric mean speedups observed across the TorchBench, HuggingFace, and TIMM model families on Arm CPUs:
| Inductor Config | TorchBench | HuggingFace | TIMM Models |
|-----------------|---------------|---------------|---------------|
| aot_inductor | 1.03× → 1.18× | 0.53× → 0.99× | |
| cpp_wrapper | 0.91× → 1.13× | 0.49× → 1.04× | 0.99× → 1.01× |
| default | 0.92× → 1.12× | 0.49× → 1.03× | 1.06× → 1.05× |
The results demonstrate solid performance improvements — particularly for TorchBench and TIMM workloads where matrix multiplication and convolution operations dominate. Even memory-bound HuggingFace transformer models show measurable efficiency gains, reflecting the benefits of the updated OpenBLAS kernels and runtime tuning.
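To make the configuration labels in the table concrete, the hedged sketch below shows one way to exercise each Inductor mode from user code. The model is a placeholder, and the AOTInductor entry points shown (torch.export.export, torch._inductor.aoti_compile_and_package) reflect the APIs exposed in recent PyTorch releases.

```python
import torch
import torch._inductor.config as inductor_config

model = torch.nn.Linear(512, 512).eval()  # placeholder model
x = torch.randn(8, 512)

# "default": plain torch.compile with the Inductor CPU backend.
out = torch.compile(model)(x)

# "cpp_wrapper": have Inductor emit a C++ wrapper around the generated
# kernels (also reachable via the TORCHINDUCTOR_CPP_WRAPPER=1 env var).
inductor_config.cpp_wrapper = True
out = torch.compile(model)(x)

# "aot_inductor": ahead-of-time compile an exported program into a
# package that can later be loaded without recompilation.
ep = torch.export.export(model, (x,))
pkg_path = torch._inductor.aoti_compile_and_package(ep)
loaded = torch._inductor.aoti_load_package(pkg_path)
print(loaded(x).shape)
```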
With PyTorch 2.9, Arm and the PyTorch community continue to demonstrate how to deliver high-performance AI on Arm CPUs.
Future work will focus on:
As the PyTorch ecosystem continues to evolve, Arm remains committed to enabling open, efficient, and scalable AI performance across the global developer community.