Tools, Software and IDEs blog
Running KleidiAI MatMul kernels in a bare-metal Arm environment

Paul Black
April 17, 2025
11 minute read time.

If you’ve not encountered Arm® KleidiAI yet, it’s a groundbreaking software library designed to elevate Artificial Intelligence (AI) performance on Arm CPUs. There’s an overview in this blog post, including a link to this guide, which provides step-by-step instructions for getting some Arm KleidiAI matrix multiplication (matmul) micro-kernels running in a Linux environment. That guide is great, and super-easy to follow, but I wanted to look at getting Arm KleidiAI kernels running in a bare-metal environment. I also wanted to experiment with a selection of C/C++ compilers, to see which one generated faster code.

This blog post outlines a process for getting some Arm KleidiAI micro-kernels running in a bare-metal environment and performing some basic benchmarking for different compilers at different optimization levels. The blog post uses components of Arm® Development Studio, including a Fixed Virtual Platform (FVP) and, of course, a license for Arm® Compiler for Embedded (AC6). There’s also some information at the end of the blog about reviewing the optimizations that a compiler has (or hasn’t) exploited.

Setting up the bare-metal projects 

The three compilers I wanted to assess are: 

  • Arm® Compiler for Embedded, better known as AC6 
  • The Arm GNU Toolchain, GCC 
  • The next-generation Arm embedded compiler, Arm® Toolchain for Embedded (ATfE). At the time of writing this is at beta quality 

To get KleidiAI kernels working in a bare-metal project I followed the instructions in the Kleidi guide. As a base, I used a C++ example project from Arm® Development Studio: startup_Armv8-Ax1_AC6_CPP is the AC6 version, startup_Armv8-Ax1_GCC_CPP is the GCC version, and there’s a ported version for ATfE included in the ATfE beta download. The project is functionally the same for all three compilers but includes necessary changes to the Makefile and linker script. This blog post looks at the changes necessary to port the example project from GCC to ATfE. 

Fixes and changes for each toolchain 

After pasting in the code given in the Kleidi guide, you must make some simple changes to all three projects to get them working: 

  • Include the float.h header to define FLT_MAX 
  • Add include paths for KleidiAI headers 
  • Change the architecture to armv8.2-a+dotprod+i8mm 
  • To run the code, we need an Arm core with the I8MM extension. This was optional from Armv8.2-A to Armv8.5-A and mandatory in later cores providing Advanced SIMD instructions, so Neoverse-V1 makes a good choice. There’s a Neoverse-V1 Fixed Virtual Platform (FVP) provided as part of Arm® Development Studio, and I used the options -C cluster0.NUM_CORES=1 -C bp.secure_memory=false -C cache_state_modelled=0 
  • In the startup code there’s a read-modify-write sequence to set SMPEN, but for the Neoverse-V1 FVP that caused problems. When re-using Cortex-A boot code for a Neoverse core it’s expected that a few changes would be needed, and in this case removing the sequence clears the problem. Ideally, I would review the boot code against the requirements of Neoverse cores, but for this investigation just getting the code running is good enough 
  • I added some code to fill the matrices with random data. This might not be necessary as the memory was already filled with a repeating non-zero pattern 
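The last bullet can be sketched in a few lines of C. This is a minimal illustration, not the code from the example projects: the function name and the seeding scheme are my own, and a simple `rand()`-based fill is enough since the data only needs to be non-trivial, not statistically strong.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative helper (not from the KleidiAI examples): fill a float
 * matrix with pseudo-random values in roughly [-1.0, 1.0]. */
static void fill_matrix_random(float* data, size_t rows, size_t cols, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < rows * cols; ++i) {
        data[i] = ((float)rand() / (float)RAND_MAX) * 2.0f - 1.0f;
    }
}
```

Seeding with a fixed value keeps runs repeatable, which matters when comparing cycle counts between compilers.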

There were also a few changes I needed to make to individual projects. This is not so surprising, the example projects focus on core bring-up and weren’t intended to run significant post-boot loads: 

  • In the ATfE project the RAM size is given as 0x80000, and that is small enough that the heap and stack collide. This is easy to fix: even the default configuration of the FVP provides significantly more RAM than that, so we can give a bigger RAM size in the linker script 
  • In the GCC project the .init_array section is given an address of 0x80100000, which is low enough to create a clash with the .eh_frame section. Removing the address fixes the problem 
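For the first of those fixes, the change is the kind of memory-region edit shown below. This is a sketch in GNU ld syntax; the region names and addresses are illustrative, so check your FVP's memory map rather than copying them:

```ld
/* Illustrative only -- region names and addresses are not taken from
   the example projects. The point is simply to grow the RAM LENGTH
   so the heap and stack no longer collide. */
MEMORY
{
    ROM (rx)  : ORIGIN = 0x80000000, LENGTH = 0x00080000
    RAM (rwx) : ORIGIN = 0x80080000, LENGTH = 0x04000000  /* was 0x80000 */
}
```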

And that’s it! That gets some KleidiAI kernels working bare-metal with three different toolchains. It’s time to do some performance testing. 

Benchmarking method and results 

As a measure of performance, I used the FVP cycle counter. It’s not a perfect measure, but the workload is the same for all three compilers, so any inaccuracies will be about the same and in the same places. As an indicative measure of performance, the FVP cycle count is good enough for this investigation. I measured the cycle count for all three compilers at optimization levels -O0, -O1, -O2, and -O3 to boot the core, set up the matrices, and execute the KleidiAI kernels: 

Compiler  -O0  -O1  -O2  -O3 
AC6  176,819  99,299  99,383  98,425 
GCC  202,282  122,331  117,884  117,209 
ATfE  156,741  80,503  80,466  79,532 
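The numbers above came from the FVP's own cycle counter. For completeness, here is a sketch of how code running on the target could read a counter itself; this is my own illustration, not how the measurements above were taken. On an Armv8-A target it reads the generic-timer counter CNTVCT_EL0, and elsewhere it falls back to the C library clock() so the sketch stays portable:

```c
#include <stdint.h>
#include <time.h>

/* Illustrative tick reader (not the measurement method used above).
 * On AArch64, reads the virtual counter CNTVCT_EL0; on other hosts,
 * falls back to clock() so the sketch remains portable. */
static uint64_t read_ticks(void) {
#if defined(__aarch64__)
    uint64_t val;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(val));
    return val;
#else
    return (uint64_t)clock();
#endif
}
```

Typical use: take one reading before the code of interest, one after, and subtract.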

There are two interesting things here. Firstly, the bulk of the optimization happens at -O1. There are small gains at -O2 and -O3, more so for GCC, but nothing like the gain at -O1. This is not so surprising: the KleidiAI kernels are already optimized, with a significant weight of hand-coded assembly instructions, and the code I’ve added around the kernels is short and simple. I’ll look deeper at the optimizations used later in this blog post. 

Secondly, it appears that ATfE is significantly faster than either AC6 or GCC. It is of course great that the next-generation Arm embedded compiler seems to stack up so well against AC6, but the gain is big enough to make me want to look deeper.  

The assembler, compiler, and C++ library components of both AC6 and ATfE are based on LLVM; the major difference between the toolchains is in the linker and C library (proprietary for AC6, open source for ATfE). So, I’m curious about a ~20% performance gap between the two. I need to make sure that any performance and benchmarking information is applicable to real-world projects, so I need more information about where the ATfE speed-up comes from. 

Looking deeper 

I simultaneously made my performance testing simpler and more complex. I simplified by looking only at -O1, as that’s where the bulk of the optimization happens. But I increased the granularity by splitting the code into three sections: 

  • Boot: All startup code, to the entry to main() 
  • Prep: Allocating memory for the matrices, filling the matrices with random data 
  • Execute: Running the Kleidi kernels 

Cycle counts are as follows: 

Compiler Boot Prep Execute
AC6 71,147 3,465 24,687
GCC 89,778 7,211 25,974
ATfE 52,078 3,962 24,428

In terms of time taken to execute the KleidiAI kernels, the three compilers are tightly grouped, although ATfE is slightly ahead of AC6 (by about 1%) and GCC trails a little. I re-ran the tests at -O2 and -O3: GCC pulls slightly ahead by -O3, as part of the gain at higher optimization levels that I pointed out earlier. 

In the Prep section, ATfE and AC6 are again quite close, and GCC lags. Again, I re-ran at -O2 and -O3: at those optimization levels GCC closed the gap a little. It looks like different compilers run some optimization passes at different optimization levels. 

The big speedup though, and the source of the quicker end-to-end time for ATfE, is coming from the boot section. I’m suspecting that the C library setup is lighter weight for Picolibc (used in ATfE) than for ArmCLib (used in AC6) or newlib (used in GCC). The location of the big speedup for ATfE means that the initial performance comparison is skewed because the test project doesn’t contain much code: if I enlarged the workload the boot code would not represent such a big percentage of the overall run time. 

Analyzing compiler optimizations 

To take a look at the optimization passes that ATfE has (or hasn’t) used, we can use the -Rpass (or -Rpass-missed) compiler options. Both of these take either =.* for all optimization passes, or =<optimization> for individual passes. For example, we might use -Rpass=inline to look at which calls have been inlined or -Rpass-missed=inline to see which calls haven’t. -Rpass-missed can provide valuable information about how C/C++ code could be tweaked to make it easier for the compiler to optimize. 

More out of curiosity than anything, I took a quick look at what the ATfE optimization passes were doing at -O0, -O1, -O2 and -O3. Here’s what I found: 

Even at -O0 the compiler inlined some Arm C Language Extensions (ACLE) intrinsic calls, for example vaddq_s16 (vector add). This makes sense, as the call is a single instruction, so there’s no trade-off between performance (from removed function-call overhead) and increased size (from replicated code). 

At -O1 the compiler did a lot of function inlining, particularly of small functions (such as the random number generator implementation). It also “hoisted” instructions and expressions out of loops when there was no reason for them to be re-evaluated on each loop pass. 

At -O2 there was some loop vectorization, although some was postponed until -O3. The compiler uses heuristics to balance the benefit and cost of each optimization. As with inlining, it’s interesting to see different vectorization choices made for different loops at the same optimization level. At -O3, the compiler also unrolled a few loops. 

The hoisting is interesting enough to take a closer look. Given this significantly shortened chunk of code from one of the KleidiAI source files: 

 

    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = dst_row_idx * nr + nr_idx;


The compiler notices that the multiplication part of the n0_idx calculation doesn’t need to be in the inner loop, because both dst_row_idx and nr are constant in the inner loop: 

src/kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0.c:96:47: remark: hoisting mul [-Rpass=licm] 
   96 |             const size_t n0_idx = dst_row_idx * nr + nr_idx; 
      |                                               ^ 


The compiler “hoists” the multiplication from the inner loop to the outer loop, something like this: 

    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        const size_t hoist_temp = dst_row_idx * nr;
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = hoist_temp + nr_idx;


The developer could have done that, but it risks making the code less concise, less clear, and less easy to follow and maintain. The compiler thinks about these things so that the developer can focus on the function, clarity, and maintainability of the code. 

There’s a lot of information in the output of the ATfE -Rpass options, both for optimization passes that were taken and passes that were not. This information can be a great help to a developer in understanding how the compiler has optimized code, and in looking at code tweaks that help the compiler optimize better. It’s a large subject, and I’ll leave a deep dive until another blog post. 

Conclusion 

Arm® Development Studio provides a suite of items that are useful for experimenting with KleidiAI kernels in a bare-metal setting, including example projects for a quick start, Fixed Virtual Platforms (FVPs) for testing, and a license for Arm® Compiler for Embedded (AC6) (and soon, Arm® Toolchain for Embedded Professional (ATfEP)). As with all software development, care is needed to capture all relevant data when assessing things like compiler performance: in this instance it would have been easy to assume that projects built with ATfE would run around 20% faster than those built with AC6. ATfE makes heuristics-based optimization decisions, weighing the cost and benefit of each potential optimization, and provides useful options for reviewing which optimizations have and haven’t been used. Information from these options can be used to tweak code to enable the compiler to exploit additional optimizations. 
