Arm Community blogs
ExecuTorch 1.0 is here and with SME2 optimizations through KleidiAI

Gian Marco Iodice
October 22, 2025

Today marks an exciting milestone with the official general availability (GA) release of ExecuTorch 1.0, a lightweight, production-ready runtime from the PyTorch ecosystem. This GA release lets developers deploy AI models efficiently across a wide variety of devices, unlocking new possibilities in performance and portability, especially for those building on Arm CPUs now and in the future.

What is ExecuTorch?

ExecuTorch is a framework designed to export and execute PyTorch models on resource-constrained devices. With PyTorch’s standard tooling, developers can export their models and run them natively on edge AI and IoT devices, smartphones, PCs and laptops, or even cloud CPUs using the same model.

Optimized AI on Arm: SME2 and KleidiAI

One of the most exciting additions for developers targeting Arm CPUs is the inclusion of optimized AI routines for Scalable Matrix Extension 2 (SME2) through XNNPACK via the Arm KleidiAI integration.

SME2 is a technology introduced in the Armv9-A CPU architecture. It builds on SVE2 (Scalable Vector Extension 2), extending its versatility with advanced features that enhance performance in areas such as generative AI, computer vision (CV), and linear algebra, while preserving the programmability and flexibility of existing Arm technologies like Neon and SVE/SVE2.

At the heart of SME2 lies the MOPA (Matrix Outer Product Accumulate) instruction, which accelerates matrix operations by efficiently performing outer products.
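To make the outer-product idea concrete, here is a minimal pure-Python sketch. It is not SME2 code and says nothing about tile sizes or data types in the real instruction. It only shows the arithmetic pattern: a matrix product built up as a sum of outer products, each added into an accumulator. The function name `matmul_by_outer_products` is hypothetical and chosen for illustration.

```python
def matmul_by_outer_products(a, b):
    """Compute a @ b by accumulating one outer product per step along K.

    a is an M x K matrix and b is a K x N matrix, both as lists of rows.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    n = len(b[0])

    # Accumulator "tile", zero-initialised once and updated in place.
    acc = [[0.0] * n for _ in range(m)]

    for p in range(k):
        col = [a[i][p] for i in range(m)]  # p-th column of a
        row = b[p]                         # p-th row of b
        # Outer product of col and row, accumulated into the tile.
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = matmul_by_outer_products(a, b)
print(result)
```

Each pass of the outer loop touches every element of the accumulator exactly once, which is why this formulation maps well onto hardware that updates a whole tile per instruction.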

For more insights into SME2 and KleidiAI’s integration with XNNPACK, see this detailed blog post on Arm Community.

Thanks to this integration, ExecuTorch automatically leverages SME2-optimized kernels in XNNPACK whenever SME2 is available at runtime. This optimization enhances key operators on Arm CPUs, including:

  • MatMul f32
  • MatMul f16
  • MatMul int8
  • MatMul int8 (dynamic quantization)
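ExecuTorch and XNNPACK handle this detection internally, so no user code is required. Purely as an illustration of what a runtime feature check involves, the sketch below assumes a Linux arm64 system where the kernel lists CPU features on the `Features` line of `/proc/cpuinfo` and reports SME2 with the flag `sme2`. This is an assumption about the platform, not a description of how XNNPACK implements its check, and the helper names are hypothetical.

```python
def cpu_features(cpuinfo_text):
    """Collect the feature flags listed on 'Features' lines of cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def has_sme2(cpuinfo_text):
    """Return True if the (assumed) 'sme2' flag is present."""
    return "sme2" in cpu_features(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo, returning an empty string where the file is unavailable."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        return ""


print("SME2 reported by the kernel:", has_sme2(read_cpuinfo()))
```

On systems without `/proc/cpuinfo`, such as macOS, this sketch simply reports False, so it should not be read as a portable check.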

Demonstrating the power: Stable Audio Open Small on Arm

The full potential of ExecuTorch 1.0 is being showcased at the 2025 PyTorch Conference (22-23 October), where the Stable Audio Open Small text-to-audio model runs entirely on Arm CPUs via KleidiAI optimizations. Results include:

  • 11 seconds of audio generated in just 7 to 8 seconds on a broad range of Arm-based CPUs.
  • Generation time dropped to under 4 seconds on SME2-enabled devices, like the Mac Mini and MacBook Pro.

This performance highlights how ExecuTorch, combined with Arm’s scalable hardware architecture, achieves these results without requiring code changes or hardware-specific tuning.
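As a quick check of what these figures mean, the sketch below converts them to a real-time factor: generation time divided by audio duration, where a value below 1.0 means the audio is produced faster than it plays back. The helper name `real_time_factor` is hypothetical. The inputs are the numbers quoted above, with 4 seconds used as the upper bound for the SME2 case.

```python
def real_time_factor(generation_s, audio_s):
    """Generation time divided by audio duration (lower is better)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


AUDIO_SECONDS = 11.0

# Broad range of Arm-based CPUs: 7 to 8 seconds of generation time.
rtf_broad = [real_time_factor(t, AUDIO_SECONDS) for t in (7.0, 8.0)]

# SME2-enabled devices: under 4 seconds, so this is an upper bound.
rtf_sme2_upper = real_time_factor(4.0, AUDIO_SECONDS)

print(f"Broad range: {rtf_broad[0]:.2f} to {rtf_broad[1]:.2f}")
print(f"SME2 upper bound: {rtf_sme2_upper:.2f}")
```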

Optimize once, deploy everywhere

The story is not just about SME2. It is about a fundamental advantage of the Arm ecosystem: “Optimize once, deploy everywhere.”

The same model and code run efficiently across the cloud, mobile, and edge, from Arm Neoverse CPUs in data centers to smartphones and embedded systems powered by the range of Arm CPUs.

Developers do not need to modify anything: performance scales with the underlying Arm architecture.

The tables below show the Neon-only performance results for Stable Audio Open Small with ExecuTorch and KleidiAI: on mobile devices using 1, 2, and 4 cores, and on the Arm CPU of a Graviton 4 system using 1, 2, 4, 8, and 16 cores.

Mobile: 1x @3.25 GHz Arm Cortex-X4 & 3x @2.85 GHz Arm Cortex-X4 CPU.

Cores       1       2       4
Time (s)    16.6    11.6    8.4

Cloud (Graviton 4): Arm Neoverse V2 CPU @2.8GHz

Cores       1       2       4       8       16
Time (s)    17.4    9.2     5.1     3.2     2.2

This makes AI development and deployment simpler and faster across all Arm-based devices.
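To read the scaling in the tables above more directly, the following sketch derives speedup over the single-core time and parallel efficiency (speedup divided by core count) from the published numbers. The function name `scaling` is hypothetical. Note that the mobile cores run at two different clock speeds, so the mobile efficiency figures are only a rough guide.

```python
# Times in seconds, copied from the tables above.
MOBILE = {1: 16.6, 2: 11.6, 4: 8.4}
GRAVITON4 = {1: 17.4, 2: 9.2, 4: 5.1, 8: 3.2, 16: 2.2}


def scaling(times_by_cores):
    """Return (cores, speedup, efficiency) rows relative to the 1-core time."""
    base = times_by_cores[1]
    rows = []
    for cores in sorted(times_by_cores):
        speedup = base / times_by_cores[cores]
        rows.append((cores, speedup, speedup / cores))
    return rows


for name, table in (("Mobile", MOBILE), ("Graviton 4", GRAVITON4)):
    for cores, speedup, efficiency in scaling(table):
        print(f"{name}: {cores:2d} cores, "
              f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```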

Final thoughts

The release of ExecuTorch 1.0 represents a significant leap forward in enabling efficient, scalable on-device AI for everyone. With SME2 support via KleidiAI integrations, optimized operators, and proven real-world results, it gives developers the power to deploy state-of-the-art AI models across the entire Arm ecosystem.

Resources

  • ExecuTorch Landing Page on PyTorch site.
  • ExecuTorch 1.0 Getting started docs.
  • ExecuTorch 1.0 download page.
  • ExecuTorch on Arm learning paths.
  • Arm Code Along and Expert Q&A: Build an Android Chat App with Llama, Arm KleidiAI, ExecuTorch, and XNNPACK.