If you’ve not encountered Arm® KleidiAI yet, it’s a groundbreaking software library designed to elevate Artificial Intelligence (AI) performance on Arm CPUs. There’s an overview in this blog post, including a link to this guide, which provides step-by-step instructions for getting some Arm KleidiAI matrix multiplication (matmul) micro-kernels running in a Linux environment. That guide is great, and super-easy to follow, but I wanted to look at getting Arm KleidiAI kernels running in a bare-metal environment. I also wanted to experiment with a selection of C/C++ compilers, to see which one generated faster code.
This blog post outlines a process for getting some Arm KleidiAI micro-kernels running in a bare-metal environment and performing some basic benchmarking for different compilers at different optimization levels. The blog post uses components of Arm® Development Studio including a Fixed Virtual Platform (FVP) and of course the license for Arm® Compiler for Embedded (AC6). There’s also some information at the end of the blog about reviewing the optimizations that a compiler has (or hasn’t) exploited.
The three compilers I wanted to assess are:

- Arm® Compiler for Embedded (AC6)
- GNU Compiler Collection (GCC)
- Arm® Toolchain for Embedded (ATfE)
To get KleidiAI kernels working in a bare-metal project I followed the instructions in the Kleidi guide. As a base, I used a C++ example project from Arm® Development Studio: startup_Armv8-Ax1_AC6_CPP is the AC6 version, startup_Armv8-Ax1_GCC_CPP is the GCC version, and there’s a ported version for ATfE included in the ATfE beta download. The project is functionally the same for all three compilers but includes necessary changes to the Makefile and linker script. This blog post looks at the changes necessary to port the example project from GCC to ATfE.
After pasting in the code given in the Kleidi guide, you must make some simple changes to all three projects to get them working:
- Change the target architecture to armv8.2-a+dotprod+i8mm, so that the DotProd and I8MM instructions used by the kernels are available.
- Launch the FVP with the parameters -C cluster0.NUM_CORES=1 -C bp.secure_memory=false -C cache_state_modelled=0.
There were also a few changes I needed to make to individual projects. This is not so surprising: the example projects focus on core bring-up and weren’t intended to run significant post-boot workloads:
And that’s it! That gets some KleidiAI kernels working bare-metal with three different toolchains. It’s time to do some performance testing.
As a measure of performance, I used the FVP cycle counter. It’s not a perfect performance measure, but it’s the same workload for all three compilers, so any inaccuracies will be about the same size and in the same places; as an indicative measure, it’s good enough for this investigation. I measured the cycle count for all three compilers at optimization levels -O0, -O1, -O2, and -O3 to boot the core, set up the matrices, and execute the KleidiAI kernels:
There are two interesting things here. Firstly, the bulk of the optimization happens at -O1. There are small gains at -O2 and -O3, more so for GCC, but nothing like the gain at -O1. This is not so surprising: the KleidiAI kernels are already optimized with a significant weight of hand-coded assembly instructions, and the code I’ve added around the kernels is short and simple. I’ll look deeper at the optimizations used later in this blog post.
Secondly, it appears that ATfE is significantly faster than either AC6 or GCC. It is of course great that the next-generation Arm embedded compiler seems to stack up so well against AC6, but the gain is big enough to make me want to look deeper.
The assembler, compiler, and C++ library components of both AC6 and ATfE are based on LLVM; the major difference between the toolchains is in the linker and C library (proprietary for AC6, open source for ATfE). So I’m curious about a ~20% performance gap between the two. I need to make sure that any performance and benchmarking information is applicable to real-world projects, so I need more information about where the ATfE speed-up comes from.
I simultaneously made my performance testing simpler and more complex. I simplified by looking only at -O1, as that’s where the bulk of the optimization happens. But I increased the granularity by splitting the code into three sections:
Cycle counts are as follows:
In terms of time taken to execute the KleidiAI kernels, the three compilers are tightly grouped, although ATfE is slightly ahead of AC6 (by about 1%) and GCC is trailing a little. I re-ran the tests at -O2 and -O3: GCC pulls slightly ahead by -O3, as part of the higher-optimization-level gain I pointed out earlier.
In the Prep section ATfE and AC6 are again quite close, and GCC lags. Again, I re-ran at -O2 and -O3: at those optimization levels GCC closed the gap a little. It looks like different compilers include some optimization passes in different optimization levels.
The big speedup though, and the source of the quicker end-to-end time for ATfE, is coming from the boot section. I’m suspecting that the C library setup is lighter weight for Picolibc (used in ATfE) than for ArmCLib (used in AC6) or newlib (used in GCC). The location of the big speedup for ATfE means that the initial performance comparison is skewed because the test project doesn’t contain much code: if I enlarged the workload the boot code would not represent such a big percentage of the overall run time.
To take a look at the optimization passes that ATfE has (or hasn’t) used, we can use the -Rpass (or -Rpass-missed) compiler options. Both of these take either =.* for all optimization passes, or =<optimization> for individual passes. For example, we might use -Rpass=inline to look at which calls have been inlined or -Rpass-missed=inline to see which calls haven’t. -Rpass-missed can provide valuable information about how C/C++ code could be tweaked to make it easier for the compiler to optimize.
More out of curiosity than anything, I took a quick look at what the ATfE optimization passes were doing at -O0, -O1, -O2 and -O3. Here’s what I found:
Even at -O0 the compiler inlined some Arm C Language Extensions (ACLE) intrinsics calls, for example vaddq_s16 (vector add). This makes sense as the call is a single instruction, so there’s no trade-off between performance (from removed function call overhead) and increased size (from replicated code).
At -O1 the compiler did a lot of function inlining, particularly of small functions (such as the random number generator implementation). It also “Hoisted” instructions and expressions to take them outside loops, if there was no reason for them to be re-evaluated on each loop pass.
At -O2 there was some loop vectorization, although some was postponed until -O3. The compiler uses heuristics to balance the benefit and cost of each optimization. As with inlining, it’s interesting to see different vectorization choices made for different loops at the same optimization level. At -O3, the compiler also unrolled a few loops.
The hoisting is interesting enough to take a closer look at. Given this significantly shortened chunk of code from one of the KleidiAI source files:
    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = dst_row_idx * nr + nr_idx;
The compiler notices that the multiplication part of the n0_idx calculation doesn’t need to be in the inner loop, because both dst_row_idx and nr are constant in the inner loop:
    src/kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0.c:96:47: remark: hoisting mul [-Rpass=licm]
       96 |             const size_t n0_idx = dst_row_idx * nr + nr_idx;
          |                                               ^
The compiler “hoists” the multiplication from the inner loop to the outer loop, something like this:
    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        const size_t hoist_temp = dst_row_idx * nr;
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = hoist_temp + nr_idx;
The developer could have done that, but it risks making the code less concise, less clear, and less easy to follow and maintain. The compiler thinks about these things so that the developer can focus on the function, clarity, and maintainability of the code.
There’s a lot of information in the output of the ATfE -Rpass options, both for optimization passes that were taken and passes that were not. This information can be a great help to a developer in understanding how the compiler has optimized code, and in looking at code tweaks that help the compiler optimize better. It’s a large subject, and I’ll leave a deep dive until another blog post.
Arm® Development Studio provides a suite of components that are useful for experimentation with KleidiAI kernels in a bare-metal setting, including example projects for a quick start, Fixed Virtual Platforms (FVPs) for testing, and a license for Arm® Compiler for Embedded (AC6) (and soon, for Arm® Toolchain for Embedded Professional (ATfEP)). As with all software development, care is needed to capture all relevant data when assessing things like compiler performance: in this instance it would have been easy to assume that projects built with ATfE would run around 20% faster than when built with AC6. ATfE makes heuristics-based optimization decisions using the cost and benefit of each potential optimization, and provides useful options for reviewing which optimizations have and haven’t been used. Information from these options might be used to tweak code to enable the compiler to exploit additional optimizations.