One year of Arm KleidiAI in XNNPack: Seamless and transparent AI performance

Gian Marco Iodice
July 10, 2025
5 minute read time.

Arm KleidiAI + XNNPack: A Year of Seamless ML Acceleration 

It’s hard to believe, but it’s already been a full year since Arm KleidiAI, the highly optimized library used to accelerate AI inference on Arm CPUs, was first integrated into XNNPack. In that time, we’ve made incredible performance improvements in XNNPack—starting with our announcement of Int4 matmul optimizations to boost Gemma 2—and continuing with many more enhancements under the hood.

The best part? Developers didn’t need to change a thing. 

All of these improvements are completely transparent, requiring no code changes and no additional dependencies. Just build and run your application with XNNPack as usual, and you’ll benefit automatically from the latest low-level optimizations we’ve introduced through KleidiAI. 
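
For example, a typical TensorFlow Lite inference script needs no modification at all. In recent TensorFlow releases, XNNPack is the default CPU backend for TFLite, so a minimal Python sketch like the one below (where "model.tflite" is a placeholder path) automatically picks up the KleidiAI-backed kernels:

import numpy as np
import tensorflow as tf

# Load any .tflite model ("model.tflite" is a placeholder path). XNNPack
# runs as TFLite's default CPU backend, so no delegate configuration is
# needed to benefit from the KleidiAI micro-kernels.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference exactly as usual with a dummy input of the right shape.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])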

Let’s take a closer look at the latest wave of enhancements. 

New KleidiAI Optimizations in XNNPack

Matmul F32 x Int8 for SDOT and I8MM 

Building on our previous Int4 work, these optimizations accelerate Int8 matrix multiplication with dynamic quantization—broadening the scope of our performance improvements to support a wide spectrum of AI models. From convolutional neural networks to cutting-edge generative AI models, like Stable Audio Open Small unveiled in May 2025, this optimization delivers tangible gains. In fact, it boosted the performance of the diffusion module by more than 30%!

These Int8 optimizations, like the earlier Int4 enhancements, use SDOT and Int8 Matrix Multiply (I8MM) instructions to improve dynamic quantization performance across a range of CPUs. 
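
To make the idea concrete, here is a minimal NumPy sketch of an F32 x Int8 matmul with dynamic quantization. The function and variable names are illustrative, not the XNNPack API: activations are quantized to Int8 on the fly with per-row scales, the inner product runs in integer arithmetic (the part that SDOT and I8MM accelerate in hardware), and the result is rescaled back to float32:

import numpy as np

def dynamic_quant_matmul(A_f32, B_i8, b_scales):
    # A_f32:    (M, K) float32 activations, quantized on the fly
    # B_i8:     (K, N) int8 weights, quantized offline
    # b_scales: (N,)   per-column float scales for the weights

    # Dynamic quantization: derive a per-row scale from the runtime
    # range of the activations, then round to int8.
    a_scales = np.abs(A_f32).max(axis=1, keepdims=True) / 127.0
    A_i8 = np.clip(np.round(A_f32 / a_scales), -127, 127).astype(np.int8)

    # Integer matmul with int32 accumulators: this is the inner loop
    # that SDOT and I8MM accelerate on Arm CPUs.
    C_i32 = A_i8.astype(np.int32) @ B_i8.astype(np.int32)

    # Dequantize: apply activation and weight scales to recover float32.
    return C_i32.astype(np.float32) * a_scales * b_scales

# Quantize random weights offline, then run the dynamic path.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32)).astype(np.float32)
b_scales = np.abs(W).max(axis=0) / 127.0
B_i8 = np.clip(np.round(W / b_scales), -127, 127).astype(np.int8)
A = rng.standard_normal((4, 64)).astype(np.float32)
C = dynamic_quant_matmul(A, B_i8, b_scales)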

SME2 Optimizations for Matmul F32, F16, and Int8 

One of the most exciting recent developments is our support for SME2 (Scalable Matrix Extension 2) on Armv9 architecture. This enables a significant leap in performance for F32 (float32), F16 (float16), and Int8 matrix multiplications, opening the door to an entirely new class of high-performance applications. As a result, both current and future AI workloads can be seamlessly accelerated from day one, with zero additional effort required. 

What is SME2?

SME2 is a new Arm technology introduced in the Armv9-A CPU architecture. 

SME2 is built on the SVE2 (Scalable Vector Extension 2) technology and expands its utility with features that benefit a wide range of domains, including AI, computer vision, linear algebra, and more. 

A standout feature of SME2 is the MOPA (Matrix Outer Product Accumulate) instruction, which enables efficient outer product operations. 

As illustrated in the following figure, the outer product differs from the dot product: while the dot product yields a scalar result, the outer product produces a matrix from two input vectors: 

Figure: the outer product of two input vectors produces a matrix.
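
The difference is easy to check in a few lines of NumPy:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])

# Dot product: two equal-length vectors reduce to a single scalar.
print(np.dot(a, a))      # 14.0

# Outer product: every pair (a[i], b[j]) forms one element of a matrix.
print(np.outer(a, b))    # [[ 4.  5.]
                         #  [ 8. 10.]
                         #  [12. 15.]]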

To see this in action, let’s examine the following matrix multiplication example:

Figure: an example matrix multiplication.

This matrix multiplication can be decomposed into a series of outer products, as visually represented below: 

Figure: the matrix multiplication decomposed into a series of outer products.
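
In NumPy terms, the decomposition looks like this: the k-th column of A paired with the k-th row of B contributes one rank-1 (outer product) update, and accumulating the K updates reproduces the full product. This is exactly the access pattern that outer-product hardware can exploit:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # (M, K)
B = rng.standard_normal((3, 5))   # (K, N)

# Accumulate K rank-1 updates: one outer product per k.
C = np.zeros((4, 5))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)      # matches the standard matmul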

Now that the concept is clear, we can explore the SME2 assembly instruction that forms the core of optimized matrix multiplication routines: 

FMOPA za0.s, p0/m, p1/m, z1.s, z3.s 

Here’s what each operand represents: 

  • FMOPA: The Floating-point Matrix Outer Product Accumulate instruction. 
  • za0.s: The ZA register tile, which stores and accumulates the outer product result. 
  • p0/m and p1/m: Predicate registers, which define valid lanes of computation (masking). 
  • z1.s and z3.s: The input vectors involved in the outer product. 

This instruction is available for a variety of data types—from floating-point formats like float32 and float16 to integer types such as int8. And thanks to the use of SVE, it is vector-length agnostic, meaning it automatically scales with different hardware vector sizes—no code changes required. 
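
As a mental model only (not how the production kernels are written), the semantics of this predicated outer-product accumulate can be sketched in NumPy, with the vector length VL kept as a parameter to mirror the vector-length-agnostic behavior:

import numpy as np

def fmopa(za, z1, z3, p0, p1):
    # Model of FMOPA za, p0/m, p1/m, z1, z3:
    #   za:     (VL, VL) accumulator tile (the ZA tile)
    #   z1, z3: (VL,)    input vectors
    #   p0, p1: (VL,)    boolean predicates masking rows/columns
    # Only lanes where both predicates are active contribute; inactive
    # elements leave the accumulator untouched (merging predication).
    mask = np.outer(p0, p1)
    za += np.where(mask, np.outer(z1, z3), 0.0)
    return za

VL = 4  # the real vector length is hardware-defined; 4 lanes for demo
za = np.zeros((VL, VL), dtype=np.float32)
z1 = np.arange(1.0, VL + 1, dtype=np.float32)
z3 = np.full(VL, 2.0, dtype=np.float32)
p0 = np.array([True, True, True, False])  # mask off the last row
p1 = np.ones(VL, dtype=bool)
fmopa(za, z1, z3, p0, p1)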

To illustrate the performance potential of SME2, consider its impact on accelerating Int4 matrix multiplications in the Gemma 3 model using Int8 outer product instructions. When deployed on hardware with SME2 support, Google’s Gemma 3 model achieves up to a 6× speedup in AI response times for chatbot use cases—compared to running on the same device without SME2 enabled. 

Moreover, with SME2 acceleration on a single CPU core, Gemma 3 can begin text summarization of a four-paragraph page in under one second, demonstrating both the responsiveness and efficiency unlocked by this architecture. 

Why this matters

With these updates, XNNPack becomes the first AI inference library to support SME2, opening the door to unprecedented performance on Arm CPUs. 

Whether you're working on generative AI or convolutional neural networks, you will see measurable improvements in your application without changing a single line of code. 

Looking ahead with Arm KleidiAI

The past year has proven that transparent acceleration is not only possible—it’s practical. With Arm KleidiAI continuing to push the boundaries of what’s possible in XNNPack, developers can stay focused on building great AI experiences while the runtime just keeps getting faster. 

Stay tuned—this is only the beginning. 
