Arm KleidiAI + XNNPack: A Year of Seamless ML Acceleration
It’s hard to believe, but it’s already been a full year since Arm KleidiAI, the highly optimized library used to accelerate AI inference on Arm CPUs, was first integrated into XNNPack. In that time, we’ve made incredible performance improvements in XNNPack—starting with our announcement of Int4 matmul optimizations to boost Gemma 2—and continuing with many more enhancements under the hood.
The best part? Developers didn’t need to change a thing.
All of these improvements are completely transparent, requiring no code changes and no additional dependencies. Just build and run your application with XNNPack as usual, and you’ll benefit automatically from the latest low-level optimizations we’ve introduced through KleidiAI.
Let’s take a closer look at the latest wave of enhancements.
Matmul F32 x Int8 for SDOT and I8MM
Building on our previous Int4 work, these optimizations accelerate Int8 matrix multiplication with dynamic quantization—broadening the scope of our performance improvements to support a wide spectrum of AI models. From convolutional neural networks to cutting-edge generative AI models such as Stable Audio Open Small, unveiled in May 2025, these optimizations deliver tangible gains. In fact, they boosted the performance of the diffusion module by more than 30%!
These Int8 optimizations, like the earlier Int4 enhancements, use the SDOT (signed dot product) and Int8 Matrix Multiply (I8MM) instructions to improve dynamic quantization performance across a range of CPUs.
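To make the idea concrete, here is a minimal C sketch of the arithmetic behind dynamically quantized Int8 matmul. It is purely illustrative—not XNNPack's or KleidiAI's actual kernel code: activations are quantized to Int8 on the fly, the dot product is accumulated in Int32 (the work SDOT and I8MM perform in hardware), and the result is rescaled back to F32.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: dynamically quantize an F32 activation vector
 * to Int8, accumulate an Int8 x Int8 dot product in Int32, and scale the
 * result back to F32. */
float dynamic_q8_dot(const float* x, const int8_t* w, float w_scale, size_t n) {
  /* 1. Find the activation range and derive a per-tensor scale. */
  float absmax = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    const float a = fabsf(x[i]);
    if (a > absmax) absmax = a;
  }
  const float x_scale = (absmax > 0.0f) ? absmax / 127.0f : 1.0f;

  /* 2. Quantize activations to Int8 on the fly (dynamic quantization)
   * and 3. accumulate the Int8 dot product in Int32, as SDOT/I8MM would. */
  int32_t acc = 0;
  for (size_t i = 0; i < n; ++i) {
    int32_t q = (int32_t)lrintf(x[i] / x_scale);
    if (q > 127) q = 127;
    if (q < -127) q = -127;
    acc += q * (int32_t)w[i];
  }

  /* 4. Rescale the Int32 accumulator back to F32. */
  return (float)acc * x_scale * w_scale;
}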
SME2 Optimizations for Matmul F32, F16, and Int8
One of the most exciting recent developments is our support for SME2 (Scalable Matrix Extension 2) on Armv9 architecture. This enables a significant leap in performance for F32 (float32), F16 (float16), and Int8 matrix multiplications, opening the door to an entirely new class of high-performance applications. As a result, both current and future AI workloads can be seamlessly accelerated from day one, with zero additional effort required.
SME2 is a new Arm technology introduced in the Armv9-A CPU architecture. It builds on SVE2 (Scalable Vector Extension 2) and expands its utility with features that benefit a wide range of domains, including AI, computer vision, linear algebra, and more.
A standout feature of SME2 is the MOPA (Matrix Outer Product Accumulate) instruction, which enables efficient outer product operations.
As illustrated in the following figure, the outer product differs from the dot product: while the dot product of two input vectors yields a single scalar, their outer product produces a matrix.
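To make the distinction concrete, here is a small, purely illustrative C sketch of both operations (scalar reference code, not the optimized routines):

#include <stddef.h>

/* Dot product: reduces two length-n vectors to a single scalar. */
float dot(const float* a, const float* b, size_t n) {
  float s = 0.0f;
  for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
  return s;
}

/* Outer product: a length-m vector and a length-n vector produce
 * an m x n matrix, where c[i][j] = a[i] * b[j]. */
void outer(const float* a, size_t m, const float* b, size_t n,
           float* c /* m x n, row-major */) {
  for (size_t i = 0; i < m; ++i)
    for (size_t j = 0; j < n; ++j)
      c[i * n + j] = a[i] * b[j];
}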
To see this in action, let’s examine the following matrix multiplication example:
This matrix multiplication can be decomposed into a series of outer products, as visually represented below:
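Expressed in plain C, the same decomposition looks roughly like this (again an illustrative scalar sketch, not production code): the K dimension becomes the outer loop, and each iteration accumulates one outer product into the result matrix.

#include <stddef.h>

/* Illustrative sketch: C = A * B computed as a sum of K outer products,
 * one per column of A and matching row of B. This is the decomposition
 * that MOPA-style instructions accelerate in hardware. */
void matmul_outer_product(const float* a, /* M x K, row-major */
                          const float* b, /* K x N, row-major */
                          float* c,       /* M x N, row-major, zero-initialized */
                          size_t m, size_t k, size_t n) {
  for (size_t p = 0; p < k; ++p) {           /* one outer product per step */
    for (size_t i = 0; i < m; ++i) {
      const float a_ip = a[i * k + p];       /* element of column p of A */
      for (size_t j = 0; j < n; ++j) {
        c[i * n + j] += a_ip * b[p * n + j]; /* accumulate into C */
      }
    }
  }
}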
Now that the concept is clear, we can explore the SME2 assembly instruction that forms the core of optimized matrix multiplication routines:
FMOPA za0.s, p0/m, p1/m, z1.s, z3.s
Here’s what each operand represents: za0.s is the destination ZA tile of 32-bit elements into which the result is accumulated; p0/m and p1/m are the governing (merging) predicates for the two source vectors; and z1.s and z3.s are the source SVE vector registers whose outer product is computed and accumulated into the tile.
This instruction is available for a variety of data types—from floating-point formats like float32 and float16 to integer types such as int8. And thanks to the use of SVE, it is vector-length agnostic, meaning it automatically scales with different hardware vector sizes—no code changes required.
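As a rough mental model, the following plain-C sketch approximates what that FMOPA computes. It is not real SME2 intrinsics or the architectural pseudocode; the vector length vl is left as a runtime parameter to mirror the vector-length-agnostic behavior, and the predicates are modeled as boolean arrays.

#include <stdbool.h>
#include <stddef.h>

/* Conceptual model of FMOPA za0.s, p0/m, p1/m, z1.s, z3.s:
 * for every element pair whose predicates are both active, the product
 * z1[i] * z3[j] is accumulated into tile element za[i][j]; elements with
 * an inactive predicate are left unchanged (merging behavior). */
void fmopa_model(float* za,       /* vl x vl accumulator tile, row-major */
                 const bool* p0,  /* predicate governing z1 lanes (rows) */
                 const bool* p1,  /* predicate governing z3 lanes (columns) */
                 const float* z1, /* first source vector */
                 const float* z3, /* second source vector */
                 size_t vl) {     /* number of 32-bit lanes per vector */
  for (size_t i = 0; i < vl; ++i) {
    if (!p0[i]) continue;         /* inactive rows are left unchanged */
    for (size_t j = 0; j < vl; ++j) {
      if (!p1[j]) continue;       /* inactive columns are left unchanged */
      za[i * vl + j] += z1[i] * z3[j];
    }
  }
}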
To illustrate the performance potential of SME2, consider its impact on accelerating Int4 matrix multiplications in the Gemma 3 model using Int8 outer product instructions. When deployed on hardware with SME2 support, Google’s Gemma 3 model achieves up to a 6× speedup in AI response times for chatbot use cases—compared to running on the same device without SME2 enabled.
Moreover, with SME2 acceleration on a single CPU core, Gemma 3 can begin text summarization of a four-paragraph page in under one second, demonstrating both the responsiveness and efficiency unlocked by this architecture.
With these updates, XNNPack becomes the first AI inference library to support SME2, opening the door to unprecedented performance on Arm CPUs.
Whether you're working on generative AI or convolutional neural networks, you will see measurable improvements in your application without changing a single line of code.
The past year has proven that transparent acceleration is not only possible—it’s practical. With Arm KleidiAI continuing to push the boundaries of what’s possible in XNNPack, developers can stay focused on building great AI experiences while the runtime just keeps getting faster.
Stay tuned—this is only the beginning.