Optimize Llama.cpp with Arm I8MM instruction

Yibo Cai
June 27, 2025
3 minute read time.

This blog post describes how we optimized Llama.cpp Q6_K and Q4_K quantized model inference with the Arm I8MM feature, specifically the signed 8-bit integer matrix multiply-accumulate instruction, smmla.

Understanding Llama.cpp quantization 

Llama.cpp is an open-source C++ library for running large language models, optimized for fast CPU inference. It leverages techniques like quantization (for example, 8-bit or 4-bit integer formats) to reduce memory usage and accelerate computations, enabling efficient model deployment on consumer and server-grade hardware. 

Llama.cpp supports several kinds of quantization, each balancing model accuracy against performance: smaller data sizes improve inference speed but can reduce accuracy by increasing perplexity.

For example, Q8_0 uses an 8-bit integer to represent one datapoint, while Q6_K reduces the data size to 6 bits.

Quantization is done in blocks. Data points in a block share a single scale factor. 

For example, Q8_0 is processed in blocks of 32 datapoints, as follows:

  • Retrieve 32 floating-point values from the original data, denoted f[0:32]
  • Calculate the maximum of the absolute values, that is, mf = max(abs(f[0:32]))
  • Calculate scale_factor = mf / max(int8) = mf / 127
  • Quantization: q[i] = round(f[i] / scale_factor)
  • Dequantization: v[i] = q[i] * scale_factor
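
To make the steps concrete, here is a minimal sketch in plain C. The names (quantize_q8_0_block, BLOCK_SIZE) are ours for illustration; the actual llama.cpp implementation packs each block, scale included, into a block_q8_0 struct with an fp16 scale.

```c
#include <math.h>
#include <stdint.h>

#define BLOCK_SIZE 32  /* datapoints per Q8_0 block */

/* Quantize one block of 32 floats to 8-bit integers plus one scale. */
static void quantize_q8_0_block(const float f[BLOCK_SIZE],
                                int8_t q[BLOCK_SIZE],
                                float *scale_factor) {
    /* mf = max(abs(f[0:32])) */
    float mf = 0.0f;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        float a = fabsf(f[i]);
        if (a > mf) mf = a;
    }

    /* scale_factor = mf / max(int8) = mf / 127 */
    *scale_factor = mf / 127.0f;

    /* q[i] = round(f[i] / scale_factor); guard against an all-zero block */
    for (int i = 0; i < BLOCK_SIZE; i++) {
        q[i] = (int8_t)(*scale_factor > 0.0f ? roundf(f[i] / *scale_factor)
                                             : 0.0f);
    }
}

/* Dequantization: v[i] = q[i] * scale_factor */
static inline float dequantize_q8_0(int8_t q, float scale_factor) {
    return (float)q * scale_factor;
}
```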

Q6_K is more complicated. As shown in the figure below, datapoints are organized in two levels: 

  • One super block holds 256 datapoints and one super-block scale factor in floating-point format. 
  • Each super block consists of 16 blocks of 16 datapoints each; the datapoints in a block share one block-level scale factor in integer format. 

Figure 1: Llama.cpp Q6_K quantization
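
For reference, the two-level layout can be sketched as a C struct. This mirrors the shape of llama.cpp's block_q6_K, which packs each 6-bit quant as 4 low bits plus 2 high bits and stores the super-block scale as fp16; the field names and the plain float scale below are simplifications.

```c
#include <stdint.h>

#define SUPER_BLOCK 256  /* datapoints per super block */
#define SUB_BLOCKS   16  /* blocks per super block, 16 datapoints each */

typedef struct {
    uint8_t ql[SUPER_BLOCK / 2];  /* low 4 bits of each 6-bit quant  */
    uint8_t qh[SUPER_BLOCK / 4];  /* high 2 bits of each 6-bit quant */
    int8_t  scales[SUB_BLOCKS];   /* block-level scale factors (integer) */
    float   d;                    /* super-block scale (fp16 in llama.cpp) */
} q6k_super_block;

/* Dequantize datapoint i, given its reconstructed signed 6-bit value q:
 * v = d * scales[i / 16] * q */
static inline float q6k_dequantize(const q6k_super_block *b, int i, int q) {
    return b->d * (float)b->scales[i / (SUPER_BLOCK / SUB_BLOCKS)] * (float)q;
}
```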

Optimize Llama.cpp with Arm I8MM instruction

Like most AI workloads, LLM inference spends most of its CPU cycles in matrix multiplication. Arm I8MM, specifically the smmla instruction, accelerates 8-bit integer matrix multiplication.

To explain what smmla does and why it is efficient, assume we want to multiply the two matrices in the following figure.

Figure 2: matrix multiplication

Following the textbook approach, we can calculate the four scalars in the output matrix one by one. That is, the first output scalar is the inner product of the first row of matrix x with the first column of matrix y. Four inner product operations are required.

There is a more efficient way to do it with outer products. As shown in the following figure, we can instead multiply the first column of matrix x by the first row of matrix y to get the four partial output scalars in one shot. Summing the two partial outputs yields the result, so only two outer product operations are necessary.


Figure 3: outer product
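
The difference is easy to see in scalar code. Below is a sketch for the 2x2 case in the figures (function names are ours):

```c
#include <stdint.h>

enum { N = 2 };  /* the figures multiply two 2x2 matrices */

/* Textbook form: each of the four output scalars is one inner
 * product, so four inner products are required. */
void matmul_inner(const int32_t x[N][N], const int32_t y[N][N],
                  int32_t z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            z[i][j] = 0;
            for (int k = 0; k < N; k++)
                z[i][j] += x[i][k] * y[k][j];  /* row i of x . column j of y */
        }
}

/* Outer-product form: column k of x times row k of y updates all four
 * outputs at once; summing the two rank-1 partial products gives the
 * same result with only two outer products. */
void matmul_outer(const int32_t x[N][N], const int32_t y[N][N],
                  int32_t z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            z[i][j] = 0;
    for (int k = 0; k < N; k++)  /* one outer product per k */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                z[i][j] += x[i][k] * y[k][j];
}
```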

smmla implements a vector-level outer product, as the figure below shows. Note that vmmlaq_s32 is the compiler intrinsic that implements the smmla instruction.

  • Each input vector (int8x16) is divided into two int8x8 vectors
  • Calculate the inner products of the four int8x8 vector pairs
  • Store the results in the four lanes of the output vector (int32x4); a code sketch follows Figure 4

Figure 4: smmla instruction
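
The intrinsic's semantics can be checked with a tiny self-contained program. A minimal sketch, assuming a compiler with I8MM support (build with something like -march=armv8.6-a+i8mm); the data values are arbitrary:

```c
#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    /* First operand: a 2x8 int8 matrix, rows packed back to back. */
    const int8_t a_rows[16] = { 1, 2, 3, 4, 5, 6, 7, 8,    /* row 0 */
                                1, 1, 1, 1, 1, 1, 1, 1 };  /* row 1 */
    /* Second operand: the two 8-element columns of the right-hand
     * matrix, also packed back to back. */
    const int8_t b_cols[16] = { 1, 0, 0, 0, 0, 0, 0, 0,    /* column 0 */
                                0, 1, 0, 0, 0, 0, 0, 0 };  /* column 1 */

    int8x16_t va  = vld1q_s8(a_rows);
    int8x16_t vb  = vld1q_s8(b_cols);
    int32x4_t acc = vdupq_n_s32(0);

    /* Lane 2*i+j of acc accumulates the inner product of int8x8
     * half i of va with int8x8 half j of vb. */
    acc = vmmlaq_s32(acc, va, vb);

    int32_t out[4];
    vst1q_s32(out, acc);
    printf("%d %d\n%d %d\n", out[0], out[1], out[2], out[3]);
    /* prints:
       1 2
       1 1 */
    return 0;
}
```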

Armed with the smmla instruction, we can accelerate matrix multiplication by processing two rows and two columns at a time. As explained in the following figure, the computation steps are:

  • Load two rows (int8x16) from matrix x into vx0 and vx1, and two columns from matrix y into vy0 and vy1.
  • Zip vx0 and vx1 so that the bottom halves of the two vectors combine into one vector and the upper halves into another; this arrangement is required for smmla to work correctly. Do the same for vy0 and vy1.
  • Calculate four temporary scalar results with two smmla instructions.
  • Advance to the next blocks and accumulate the temporary results until all data is consumed (see the sketch after Figure 5).

Figure 5: matrix multiplication with smmla
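
Putting the steps together, here is a sketch of the resulting inner loop. This is not the llama.cpp kernel itself, which also handles scale factors and the quantized block layouts; the function name and the pre-packed column layout are assumptions for illustration.

```c
#include <arm_neon.h>
#include <stdint.h>

/* Accumulate a 2x2 int32 tile: two rows of x times two columns of y,
 * k elements each (k a multiple of 16), columns pre-packed contiguously. */
void gemm_2x2_i8mm(const int8_t *x_row0, const int8_t *x_row1,
                   const int8_t *y_col0, const int8_t *y_col1,
                   int k, int32_t out[4]) {
    int32x4_t acc = vdupq_n_s32(0);  /* lanes: [z00, z01, z10, z11] */

    for (int i = 0; i < k; i += 16) {
        int8x16_t vx0 = vld1q_s8(x_row0 + i);  /* 16 values of row 0    */
        int8x16_t vx1 = vld1q_s8(x_row1 + i);  /* 16 values of row 1    */
        int8x16_t vy0 = vld1q_s8(y_col0 + i);  /* 16 values of column 0 */
        int8x16_t vy1 = vld1q_s8(y_col1 + i);  /* 16 values of column 1 */

        /* Zip: bottom halves of vx0/vx1 form one smmla operand,
         * upper halves the other; likewise for vy0/vy1. */
        int8x16_t vx_lo = vcombine_s8(vget_low_s8(vx0),  vget_low_s8(vx1));
        int8x16_t vx_hi = vcombine_s8(vget_high_s8(vx0), vget_high_s8(vx1));
        int8x16_t vy_lo = vcombine_s8(vget_low_s8(vy0),  vget_low_s8(vy1));
        int8x16_t vy_hi = vcombine_s8(vget_high_s8(vy0), vget_high_s8(vy1));

        /* Two smmla instructions produce and accumulate the four
         * temporary scalar results. */
        acc = vmmlaq_s32(acc, vx_lo, vy_lo);
        acc = vmmlaq_s32(acc, vx_hi, vy_hi);
    }
    vst1q_s32(out, acc);
}
```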

We optimized the Llama.cpp Q6_K and Q4_K matrix multiplication kernels with smmla and observed a significant performance uplift. The figure below compares Llama.cpp performance before and after the Q6_K optimization:

  • S_TG: speed of token generation; higher is better
  • S_PP: speed of prompt prefill; higher is better

The test platform is an Arm Neoverse N2.

Figure 6: Arm I8MM improves Llama.cpp Q6_K model performance

Upstream patches for reference:

  • Optimize Llama.cpp Q6_K kernel with Arm I8MM
  • Optimize Llama.cpp Q4_K kernel with Arm I8MM