Today marks an exciting milestone with the official general availability (GA) release of ExecuTorch 1.0, a lightweight, production-ready runtime from the PyTorch ecosystem. This GA release empowers developers to deploy AI models efficiently across a wide variety of devices, unlocking new possibilities in performance and portability, especially for those building on Arm CPUs now and in the future.
ExecuTorch is a framework designed to export and execute PyTorch models on resource-constrained devices. With PyTorch’s standard tooling, developers can export their models and run them natively on edge AI and IoT devices, smartphones, PCs and laptops, or even cloud CPUs using the same model.
One of the most exciting additions for developers targeting Arm CPUs is the inclusion of optimized AI routines for Scalable Matrix Extension 2 (SME2) through XNNPACK via the Arm KleidiAI integration.
SME2 is a technology introduced in the Armv9-A CPU architecture. It builds on SVE2 (Scalable Vector Extension 2), extending its versatility with advanced features that enhance performance in areas such as generative AI, computer vision (CV), and linear algebra, all while preserving the programmability and flexibility of existing Arm technologies like Neon and SVE/SVE2.
At the heart of SME2 lies the MOPA (Matrix Outer Product Accumulate) instruction, which accelerates matrix operations by efficiently performing outer products.
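To see why outer products matter, note that a matrix multiply decomposes into a sum of outer products, one per inner-dimension index; that accumulation step is the pattern MOPA performs in hardware. A minimal pure-Python illustration of the decomposition (not SME2 code):

```python
# Illustration only: C = A @ B computed as K outer-product accumulations,
# the pattern that SME2's MOPA instruction accelerates in hardware.
def matmul_outer_product(A, B):
    """Multiply an M x K matrix A by a K x N matrix B via outer products."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):  # one MOPA-style step per inner index k
        for i in range(M):
            for j in range(N):
                # Accumulate the outer product of column k of A and row k of B.
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer_product(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```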
For more insights into SME2 and KleidiAI’s integration with XNNPACK, see this detailed blog post on Arm Community.
Thanks to this integration, ExecuTorch automatically leverages SME2-optimized kernels in XNNPACK whenever SME2 is available at runtime. This optimization accelerates key operators on Arm CPUs.
The full potential of ExecuTorch 1.0 is being showcased at the 2025 PyTorch Conference (22-23 October), where the Stable Audio Open Small text-to-audio model is running entirely on SME2-enabled Arm CPUs via KleidiAI optimizations, delivering remarkable performance.
This performance highlights how ExecuTorch, combined with Arm’s scalable hardware architecture, achieves exceptional results without requiring code changes or hardware-specific tuning.
The story is not just about SME2. It is about a fundamental advantage of the Arm ecosystem: “Optimize once, deploy everywhere.”
The same model and code run efficiently across the cloud, mobile, and edge, from Arm Neoverse CPUs in data centers to smartphones and embedded systems powered by the range of Arm CPUs.
For developers, no modifications are needed, with performance scaling seamlessly with the underlying Arm architecture.
The tables below show the Neon-only performance results for Stable Audio Open Small with ExecuTorch and KleidiAI: on mobile devices using 1, 2, and 4 cores, and on the Arm CPU of a Graviton 4 system using 1, 2, 4, 8, and 16 cores.
Mobile: 1x Arm Cortex-X4 CPU @3.25 GHz and 3x Arm Cortex-X4 CPU @2.85 GHz

| 1 core (s) | 2 cores (s) | 4 cores (s) |
|------------|-------------|-------------|
| 16.6       | 11.6        | 8.4         |
Cloud (Graviton 4): Arm Neoverse V2 CPU @2.8 GHz

| 1 core (s) | 2 cores (s) | 4 cores (s) | 8 cores (s) | 16 cores (s) |
|------------|-------------|-------------|-------------|--------------|
| 17.4       | 9.2         | 5.1         | 3.2         | 2.2          |
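The Graviton 4 numbers above also show how the workload scales with core count. A short script (using the times reported in the table; "efficiency" here simply means speedup divided by core count) makes the scaling explicit:

```python
# Parallel scaling implied by the Graviton 4 results above
# (generation time in seconds at each core count).
times = {1: 17.4, 2: 9.2, 4: 5.1, 8: 3.2, 16: 2.2}

def speedup(times):
    """Speedup vs. the single-core time, and parallel efficiency, per core count."""
    base = times[1]
    return {cores: (round(base / t, 2), round(base / t / cores, 2))
            for cores, t in times.items()}

for cores, (s, eff) in speedup(times).items():
    print(f"{cores:>2} cores: {s:.2f}x speedup, {eff:.0%} efficiency")
```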
This makes AI development and deployment simpler and faster across all Arm-based devices.
The release of ExecuTorch 1.0 represents a significant leap forward in enabling efficient, scalable on-device AI for everyone. With SME2 support via KleidiAI integrations, optimized operators, and proven real-world results, it gives developers the power to deploy state-of-the-art AI models across the entire Arm ecosystem.