Today marks an exciting milestone with the official general availability (GA) release of ExecuTorch 1.0, a lightweight, production-ready runtime from the PyTorch ecosystem. This GA release empowers developers to deploy AI models efficiently across a wide variety of devices, unlocking new possibilities in performance and portability, especially for those building on Arm CPUs now and in the future.
ExecuTorch is a framework designed to export and execute PyTorch models on resource-constrained devices. With PyTorch’s standard tooling, developers can export their models and run them natively on edge AI and IoT devices, smartphones, PCs and laptops, or even cloud CPUs using the same model.
One of the most exciting additions for developers targeting Arm CPUs is the inclusion of optimized AI routines for Scalable Matrix Extension 2 (SME2) through XNNPACK via the Arm KleidiAI integration.
SME2 is a technology introduced in the Armv9-A CPU architecture. It builds on SVE2 (Scalable Vector Extension 2), extending its versatility with advanced features that enhance performance in areas such as generative AI, computer vision (CV), and linear algebra, all while preserving the programmability and flexibility of existing Arm technologies like Neon and SVE/SVE2.
At the heart of SME2 lies the MOPA (Matrix Outer Product Accumulate) instruction, which accelerates matrix operations by efficiently performing outer products.
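To see why outer products matter, note that a matrix multiply decomposes into a sum of outer products, one per inner-dimension index; that accumulation step is the pattern MOPA performs in hardware. A minimal pure-Python illustration of the decomposition (not SME2 code):

```python
# Illustration only: C = A @ B computed as K outer-product accumulations,
# the pattern that SME2's MOPA instruction accelerates in hardware.
def matmul_outer_product(A, B):
    """Multiply an M x K matrix A by a K x N matrix B via outer products."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):  # one MOPA-style step per inner index k
        for i in range(M):
            for j in range(N):
                # Accumulate the outer product of column k of A and row k of B.
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer_product(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```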
For more insights into SME2 and KleidiAI’s integration with XNNPACK, see this detailed blog post on Arm Community.
Thanks to this integration, ExecuTorch automatically leverages SME2-optimized kernels in XNNPACK whenever SME2 is available at runtime. This optimization accelerates key operators on Arm CPUs.
The full potential of ExecuTorch 1.0 is being showcased at the 2025 PyTorch Conference (22-23 October), where the Stable Audio Open Small text-to-audio model is running entirely on SME2-enabled Arm CPUs via KleidiAI optimizations, delivering remarkable performance.
This performance highlights how ExecuTorch, combined with Arm’s scalable hardware architecture, achieves exceptional results without requiring code changes or hardware-specific tuning.
The story is not just about SME2. It is about a fundamental advantage of the Arm ecosystem: “Optimize once, deploy everywhere.”
The same model and code run efficiently across the cloud, mobile, and edge, from Arm Neoverse CPUs in data centers to smartphones and embedded systems powered by the range of Arm CPUs.
For developers, no modifications are needed, with performance scaling seamlessly with the underlying Arm architecture.
The tables below show the Neon-only performance results for Stable Audio Open Small with ExecuTorch and KleidiAI: on mobile devices using 1, 2, and 4 cores, and on the Arm CPU of a Graviton 4 system using 1, 2, 4, 8, and 16 cores.
Mobile: 1x Arm Cortex-X4 CPU @3.25 GHz and 3x Arm Cortex-X4 CPU @2.85 GHz

| 1 core (s) | 2 cores (s) | 4 cores (s) |
|------------|-------------|-------------|
| 16.6       | 11.6        | 8.4         |
Cloud (Graviton 4): Arm Neoverse V2 CPU @2.8 GHz

| 1 core (s) | 2 cores (s) | 4 cores (s) | 8 cores (s) | 16 cores (s) |
|------------|-------------|-------------|-------------|--------------|
| 17.4       | 9.2         | 5.1         | 3.2         | 2.2          |
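The Graviton 4 numbers above also show how the workload scales with core count. A short script (using the times reported in the table; "efficiency" here simply means speedup divided by core count) makes the scaling explicit:

```python
# Parallel scaling implied by the Graviton 4 results above
# (generation time in seconds at each core count).
times = {1: 17.4, 2: 9.2, 4: 5.1, 8: 3.2, 16: 2.2}

def speedup(times):
    """Speedup vs. the single-core time, and parallel efficiency, per core count."""
    base = times[1]
    return {cores: (round(base / t, 2), round(base / t / cores, 2))
            for cores, t in times.items()}

for cores, (s, eff) in speedup(times).items():
    print(f"{cores:>2} cores: {s:.2f}x speedup, {eff:.0%} efficiency")
```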
This makes AI development and deployment simpler and faster across all Arm-based devices.
The release of ExecuTorch 1.0 represents a significant leap forward in enabling efficient, scalable on-device AI for everyone. With SME2 support via KleidiAI integrations, optimized operators, and proven real-world results, it gives developers the power to deploy state-of-the-art AI models across the entire Arm ecosystem.