
Bringing Generative AI to the masses with ExecuTorch and KleidiAI

Gian Marco Iodice
August 13, 2025
5 minute read time.

This blog post was written by Gian Marco Iodice (GenAI Engineering Lead, Arm), Mary Bennion (Director of Ecosystem, Arm), and Digant Desai (Software Engineer, Meta).

Key takeaways

• ExecuTorch 0.7 now enables KleidiAI by default, delivering automatic acceleration on Arm CPUs with zero integration effort.
• GenAI is now performant on millions of existing devices, including 3-to-5-year-old smartphones and the Raspberry Pi 5, thanks to Arm CPU features like SDOT and I8MM.
• On-device use cases like private voice assistants, message summarization, and local coding and GenAI copilots are now possible, without the cloud and without the battery drain.

With Arm’s recent SME2 announcement, the role of Arm KleidiAI is increasingly clear as Arm’s AI accelerator layer powering the next wave of AI. By embedding into widely used Edge AI frameworks like XNNPack, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI has delivered substantial performance improvements with no code changes required by developers. That foundation leads directly to the upcoming ExecuTorch 0.7 beta, where KleidiAI will be enabled by default, bringing automatic acceleration to devices built on the latest Arm CPU architecture, as well as to the vast base of existing phones built on earlier generations.

Android and cross-platform developers, whether first- or third-party, gain instant access to KleidiAI performance optimizations via ExecuTorch and XNNPack. The result? Faster model startup, lower latency, a leaner memory footprint, and no integration hurdles. What previously required custom tuning is now turn-key performance, ready out of the box. This efficiency unlocks new possibilities, not just for the latest high-end devices but for a much broader range of hardware.

When we consider running Generative AI (GenAI) on mobile devices, it is easy to envision the latest flagship smartphones equipped with powerful CPUs, GPUs, and NPUs. But what if we told you that GenAI experiences, like running large language models (LLMs), can also be brought to devices that are 3, 4, or even 5 years old? Or even to the Raspberry Pi 5?

Well, this is no longer just a vision but a practical reality, thanks to the Arm SDOT CPU feature, which has been available in Arm CPUs since 2015.

What is SDOT?

The SDOT (Signed Dot Product) instruction, introduced in the Armv8.2 architecture and later CPUs, enables efficient dot product operations on vectors of 8-bit signed integers. The following image illustrates the behavior of one such SDOT instruction available on Arm CPUs:

Figure: behavior of one SDOT instruction available on Arm CPUs.

As shown above, the instruction produces four 32-bit integer outputs, each resulting from the dot product of corresponding groups of four int8 elements from the left-hand side (LHS) and right-hand side (RHS) vector registers.
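The per-lane behavior described above can be sketched in plain Python. This is an illustration of the instruction's semantics only, not Arm code; the `sdot` helper name is ours:

```python
# Simulating SDOT semantics in plain Python (illustration only, not Arm code).
# SDOT takes two 16-byte vectors of int8 values plus a four-lane int32
# accumulator; each lane accumulates the dot product of its group of four int8s.
def sdot(acc, lhs, rhs):
    assert len(lhs) == len(rhs) == 16 and len(acc) == 4
    out = list(acc)
    for lane in range(4):
        group = range(4 * lane, 4 * lane + 4)
        out[lane] += sum(lhs[i] * rhs[i] for i in group)
    return out

lhs = [1, 2, 3, 4] * 4
rhs = [1, -1, 1, -1] * 4
print(sdot([0, 0, 0, 0], lhs, rhs))  # each lane: 1 - 2 + 3 - 4 = -2
```

On real hardware all four lanes are computed in a single instruction, which is where the speedup over scalar int8 multiply-accumulate comes from.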

This instruction can be utilized to accelerate matrix multiplication routines, the core computational workload behind every LLM, when using Int8 or lower-bit precision formats such as Int4. These operations typically involve numerous dot products between individual rows of the left-hand side matrix and corresponding columns of the right-hand side matrix.
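To make the connection concrete, here is a minimal sketch of an int8 matrix multiply whose inner loop works in groups of four elements, mirroring how an SDOT-based kernel tiles the reduction dimension. The function and matrix names are illustrative, not taken from any real kernel:

```python
# A minimal int8 matrix multiply built from 4-element dot products,
# mirroring how SDOT-based kernels tile the inner (reduction) loop.
def matmul_int8(lhs, rhs_t):
    # lhs: M x K rows of int8; rhs_t: N x K rows (RHS stored transposed).
    K = len(lhs[0])
    assert K % 4 == 0  # inner dimension processed in groups of four, as with SDOT
    out = [[0] * len(rhs_t) for _ in lhs]
    for m, row in enumerate(lhs):
        for n, col in enumerate(rhs_t):
            acc = 0  # a 32-bit accumulator lane in a real kernel
            for k in range(0, K, 4):
                acc += sum(row[k + i] * col[k + i] for i in range(4))
            out[m][n] = acc
    return out

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B_t = [[1, 1, 1, 1], [2, 0, -2, 0]]
print(matmul_int8(A, B_t))  # [[10, -4], [26, -4]]
```

Storing the RHS transposed keeps each dot product over contiguous data, the same access pattern that lets a vectorized kernel feed SDOT directly.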

The SDOT instruction is already widely supported across a diverse range of devices, opening the door for GenAI use cases to reach a significantly larger smartphone audience. As of today, Arm CPUs in approximately 3 billion Arm-based devices include this capability, enabling powerful on-device GenAI experiences for the majority of users. In fact, 72% of all devices now support this instruction.

Thanks to ExecuTorch, we are now enabling models like Llama 3.2 to run efficiently on the majority of Android devices as well as edge devices like the Raspberry Pi 5.

KleidiAI + ExecuTorch: Bringing it all together

For the quantized Llama 3.2 1B announcement last year, the ExecuTorch and KleidiAI teams collaborated to deliver optimizations for Int4 matrix multiplication on Arm CPUs, leveraging the I8MM feature available from the Armv8.6 architecture onwards. As highlighted in a previous blog post, ExecuTorch with KleidiAI achieves over 20% higher prefill performance on the Galaxy S24+ compared to non-KleidiAI kernels.

This translates to more than 350 tokens per second during the prefill phase and over 40 tokens per second during the decode phase. This level of performance is sufficient to enable on-device tasks, such as summarizing unread messages, with a smooth user experience using only Arm CPUs. For context, summarizing around 50 unread messages typically involves processing approximately 600 tokens.
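A quick back-of-the-envelope check of that scenario, using the rates above; the 100-token summary length is our assumption for illustration:

```python
# Estimated end-to-end latency for the message-summarization example,
# using the prefill/decode rates reported above.
PREFILL_TOKENS_PER_S = 350  # prefill rate from the post
DECODE_TOKENS_PER_S = 40    # decode rate from the post

prompt_tokens = 600   # ~50 unread messages, per the post
summary_tokens = 100  # assumed output length (illustrative)

latency_s = (prompt_tokens / PREFILL_TOKENS_PER_S
             + summary_tokens / DECODE_TOKENS_PER_S)
print(f"{latency_s:.1f} s")  # prints: 4.2 s
```

Around four seconds to digest 50 messages and produce a short summary, on CPU alone, is well within a smooth interactive experience.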

This year, the ExecuTorch and KleidiAI teams have focused on optimizing Int4 matrix multiplication performance by leveraging the SDOT instruction, aiming to broaden adoption.
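As a rough idea of what handling Int4 weights involves, the sketch below packs two signed 4-bit values per byte and unpacks them to int8 ahead of the dot product. The low-nibble-first layout shown here is one common convention, not necessarily the exact layout used by the XNNPack or KleidiAI kernels:

```python
# Sketch of Int4 weight storage: two signed 4-bit values per byte,
# unpacked to int8 before an SDOT-style dot product. Low-nibble-first
# ordering is an assumption for illustration; real kernels vary.
def pack_int4(values):
    # values: signed 4-bit ints in [-8, 7]; pairs packed low nibble first
    assert len(values) % 2 == 0
    return bytes(((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4))
                 for i in range(0, len(values), 2))

def unpack_int4(packed):
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # sign-extend 4 bits
    return out

w = [-8, -1, 0, 7]
assert unpack_int4(pack_int4(w)) == w  # round-trips exactly
```

Halving the bytes per weight is what lets a 1B-parameter model fit comfortably in the memory of an older phone or a Raspberry Pi 5, at the cost of an unpack step before each dot product.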

See the XNNPack PR

While LLM performance on Arm CPUs with only the SDOT extension may not match the latest flagship smartphones, it still enables impressive capabilities for on-device generative AI. In fact, in many scenarios, the decode phase is faster than the average human reading speed, highlighting that even older Arm CPUs can support practical and meaningful GenAI use cases.

For example, when combined with speech-to-text and text-to-speech models, a local LLM of this kind enables a fully private smart assistant that operates entirely offline, eliminating data-privacy concerns while still offering rich voice-based interactions. Such an assistant could seamlessly interact with your connected devices while keeping your data on the device, giving users peace of mind.

Another compelling use case for running Llama 3.2 1B is context-aware text completion in local text editors. As you type, the model provides intelligent, real-time suggestions to streamline writing or coding workflows, all without requiring an internet connection.

These are just a few examples, and they only scratch the surface of what is possible with on-device GenAI.

Conclusion: GenAI for everyone

With the combined power of SDOT, KleidiAI, and ExecuTorch, we are pushing the boundaries of what is possible, bringing Generative AI beyond high-end flagship devices and making it accessible on the billions of Arm-based devices already in use.

Now it is your turn—we are excited to see what you will create. To help you get started, check out Arm’s learning path, designed to guide you through developing your own applications with LLMs using ExecuTorch and KleidiAI.

Arm Learning Path
