One year of Arm KleidiAI in XNNPack: Seamless and transparent AI performance

Gian Marco Iodice
July 10, 2025

Arm KleidiAI + XNNPack: A Year of Seamless ML Acceleration 

It’s hard to believe, but it's already been a full year since Arm KleidiAI, the highly optimized library used to accelerate AI inference on Arm CPUs, was first integrated into XNNPack. In that time, we’ve made substantial performance improvements in XNNPack, starting with our announcement of Int4 matmul optimizations to boost Gemma 2 and continuing with many more enhancements under the hood.

The best part? Developers didn’t need to change a thing. 

All of these improvements are completely transparent, requiring no code changes and no additional dependencies. Just build and run your application with XNNPack as usual, and you’ll benefit automatically from the latest low-level optimizations we’ve introduced through KleidiAI. 
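To make this concrete, here is a minimal sketch in C (error handling and operator setup abbreviated) of what typical XNNPack usage looks like. Notice that nothing in it mentions KleidiAI: the KleidiAI micro-kernels are selected internally at runtime based on the CPU features XNNPack detects.

#include <stdio.h>
#include <xnnpack.h>

int main(void) {
  // Standard XNNPack initialization: no KleidiAI-specific flags or calls.
  // On Arm CPUs, XNNPack transparently dispatches to KleidiAI micro-kernels
  // when the hardware supports them (for example, SDOT, I8MM, or SME2).
  if (xnn_initialize(/*allocator=*/NULL) != xnn_status_success) {
    fprintf(stderr, "failed to initialize XNNPack\n");
    return 1;
  }

  // ... create, set up, and run operators exactly as before ...

  xnn_deinitialize();
  return 0;
}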

Let’s take a closer look at the latest wave of enhancements. 

New KleidiAI Optimizations in XNNPack

Matmul F32 x Int8 for SDOT and I8MM 

Building on our previous Int4 work, these optimizations accelerate Int8 matrix multiplication with dynamic quantization, broadening the scope of our performance improvements to a wide spectrum of AI models. From convolutional neural networks to cutting-edge generative AI models, like Stable Audio Open Small unveiled in May 2025, this optimization delivers tangible gains. In fact, it boosted the performance of the diffusion module by more than 30%!
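As a rough mental model of what Int8 matmul with dynamic quantization means, here is a hypothetical scalar sketch in C: each row of the float32 activation matrix is quantized to int8 on the fly using its own scale, products are accumulated in int32 against pre-quantized int8 weights, and the result is dequantized back to float32. A single per-tensor weight scale is a simplifying assumption here; the production kernels are considerably more sophisticated.

#include <math.h>
#include <stdint.h>
#include <stdlib.h>

// Reference model of dynamically quantized matmul: C[M x N] = A[M x K] * B[K x N].
// A is float32 and quantized per row at runtime; B is already int8 with one scale.
void dynqmatmul(int M, int N, int K, const float *a, const int8_t *b,
                float b_scale, float *c) {
  int8_t *a_q = malloc((size_t)K * sizeof(int8_t));
  for (int m = 0; m < M; ++m) {
    // Dynamic quantization: derive this row's scale from its actual range.
    float amax = 0.0f;
    for (int k = 0; k < K; ++k) {
      float v = fabsf(a[m * K + k]);
      if (v > amax) amax = v;
    }
    float a_scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int k = 0; k < K; ++k)
      a_q[k] = (int8_t)lrintf(a[m * K + k] / a_scale);

    // Int8 x Int8 -> Int32 accumulation (the part SDOT/I8MM accelerate),
    // then dequantize back to float32.
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k)
        acc += (int32_t)a_q[k] * (int32_t)b[k * N + n];
      c[m * N + n] = (float)acc * a_scale * b_scale;
    }
  }
  free(a_q);
}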

These Int8 optimizations, like the earlier Int4 enhancements, use SDOT and Int8 Matrix Multiply (I8MM) instructions to improve dynamic quantization performance across a range of CPUs. 
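For a feel of what these instructions do, the sketch below wraps the NEON dot-product intrinsic: a single SDOT multiplies sixteen pairs of int8 values and folds them, four at a time, into four int32 accumulator lanes. It assumes a dotprod-capable compiler target such as -march=armv8.2-a+dotprod; the I8MM path builds on the analogous vmmlaq_s32 matrix-multiply intrinsic.

#include <arm_neon.h>

// One SDOT step: for i = 0..3, acc[i] += dot(a[4i..4i+3], b[4i..4i+3]).
// Sixteen int8 multiply-accumulates collapse into a single instruction.
int32x4_t sdot_step(int32x4_t acc, int8x16_t a, int8x16_t b) {
  return vdotq_s32(acc, a, b);
}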

SME2 Optimizations for Matmul F32, F16, and Int8 

One of the most exciting recent developments is our support for SME2 (Scalable Matrix Extension 2) on Armv9 architecture. This enables a significant leap in performance for F32 (float32), F16 (float16), and Int8 matrix multiplications, opening the door to an entirely new class of high-performance applications. As a result, both current and future AI workloads can be seamlessly accelerated from day one, with zero additional effort required. 

What is SME2?

SME2 is a new Arm technology introduced in the Armv9-A CPU architecture. 

SME2 is built on the SVE2 (Scalable Vector Extension 2) technology and expands its utility with features that benefit a wide range of domains, including AI, computer vision, linear algebra, and more. 

A standout feature of SME2 is the MOPA (Matrix Outer Product Accumulate) instruction, which enables efficient outer product operations. 

As illustrated in the following figure, the outer product differs from the dot product: while the dot product yields a scalar result, the outer product produces a matrix from two input vectors: 

Figure: the outer product of two input vectors produces a matrix.

To see this in action, let’s examine the following matrix multiplication example:

Figure: an example matrix multiplication.

This matrix multiplication can be decomposed into a series of outer products, as visually represented below: 

Figure: the matrix multiplication decomposed into a series of outer products.
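In plain C (assuming row-major storage), the decomposition amounts to moving the k loop outermost: each k iteration takes column k of A and row k of B and adds their outer product, a rank-1 update, to the whole output tile. This is exactly the access pattern that MOPA-style instructions accelerate.

// C[M x N] += A[M x K] * B[K x N], decomposed into K outer products.
// Each k iteration accumulates the outer product of column k of A
// and row k of B into the full output matrix.
void matmul_outer_product(int M, int N, int K,
                          const float *a, const float *b, float *c) {
  for (int k = 0; k < K; ++k)
    for (int m = 0; m < M; ++m)
      for (int n = 0; n < N; ++n)
        c[m * N + n] += a[m * K + k] * b[k * N + n];
}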

Now that the concept is clear, we can explore the SME2 assembly instruction that forms the core of optimized matrix multiplication routines: 

FMOPA za0.s, p0/m, p1/m, z1.s, z3.s 

Here’s what each operand represents: 

  • FMOPA: The Floating-point Matrix Outer Product Accumulate instruction. 
  • za0.s: The ZA register tile, which stores and accumulates the outer product result. 
  • p0/m and p1/m: Predicate registers, which define valid lanes of computation (masking). 
  • z1.s and z3.s: The input vectors involved in the outer product. 

This instruction is available for a variety of data types—from floating-point formats like float32 and float16 to integer types such as int8. And thanks to the use of SVE, it is vector-length agnostic, meaning it automatically scales with different hardware vector sizes—no code changes required. 
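As a scalar reference model of what the FMOPA instruction above computes (a simplified illustration, not how the hardware executes it): with VL float32 lanes per vector, it accumulates the outer product of the two Z registers into a VL x VL ZA tile, with the two predicates masking rows and columns respectively and merging, that is, leaving masked elements unchanged.

#include <stdbool.h>

// Scalar model of: FMOPA za0.s, p0/m, p1/m, z1.s, z3.s
// With a vector length of vl float32 lanes, za is a vl x vl tile:
// za[i][j] += z1[i] * z3[j] wherever both predicate lanes are active;
// merging predication leaves masked elements unchanged.
void fmopa_model(int vl, float *za, const bool *p0, const bool *p1,
                 const float *z1, const float *z3) {
  for (int i = 0; i < vl; ++i)
    for (int j = 0; j < vl; ++j)
      if (p0[i] && p1[j])
        za[i * vl + j] += z1[i] * z3[j];
}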

To illustrate the performance potential of SME2, consider its impact on accelerating Int4 matrix multiplications in the Gemma 3 model using Int8 outer product instructions. When deployed on hardware with SME2 support, Google’s Gemma 3 model achieves up to a 6× speedup in AI response times for chatbot use cases—compared to running on the same device without SME2 enabled. 

Moreover, with SME2 acceleration on a single CPU core, Gemma 3 can begin text summarization of a four-paragraph page in under one second, demonstrating both the responsiveness and efficiency unlocked by this architecture. 

Why this matters

With these updates, XNNPack becomes the first AI inference library to support SME2, opening the door to unprecedented performance on Arm CPUs. 

Whether you're working on generative AI or convolutional neural networks, you will see measurable improvements in your application without changing a single line of code. 

Looking ahead with Arm KleidiAI

The past year has proven that transparent acceleration is not only possible—it’s practical. With Arm KleidiAI continuing to push the boundaries of what’s possible in XNNPack, developers can stay focused on building great AI experiences while the runtime just keeps getting faster. 

Stay tuned—this is only the beginning. 
