If you’ve not encountered Arm® KleidiAI yet, it’s a groundbreaking software library designed to elevate Artificial Intelligence (AI) performance on Arm CPUs. There’s an overview in this blog post, including a link to this guide, which provides step-by-step instructions for getting some Arm KleidiAI matrix multiplication (matmul) micro-kernels running in a Linux environment. That guide is great, and super-easy to follow, but I wanted to look at getting Arm KleidiAI kernels running in a bare-metal environment. I also wanted to experiment with a selection of C/C++ compilers, to see which one generated faster code.
This blog post outlines a process for getting some Arm KleidiAI micro-kernels running in a bare-metal environment and performing some basic benchmarking for different compilers at different optimization levels. The blog post uses components of Arm® Development Studio including a Fixed Virtual Platform (FVP) and of course the license for Arm® Compiler for Embedded (AC6). There’s also some information at the end of the blog about reviewing the optimizations that a compiler has (or hasn’t) exploited.
The three compilers I wanted to assess are:

- Arm® Compiler for Embedded (AC6)
- GNU Compiler Collection (GCC)
- Arm® Toolchain for Embedded (ATfE)
To get KleidiAI kernels working in a bare-metal project I followed the instructions in the Kleidi guide. As a base, I used a C++ example project from Arm® Development Studio: startup_Armv8-Ax1_AC6_CPP is the AC6 version, startup_Armv8-Ax1_GCC_CPP is the GCC version, and there’s a ported version for ATfE included in the ATfE beta download. The project is functionally the same for all three compilers but includes necessary changes to the Makefile and linker script. This blog post looks at the changes necessary to port the example project from GCC to ATfE.
After pasting in the code given in the Kleidi guide, you must make some simple changes to all three projects to get them working:
- Change the target architecture to armv8.2-a+dotprod+i8mm, so that the DotProd and I8MM instructions used by the kernels are available.
- Launch the FVP with the parameters -C cluster0.NUM_CORES=1 -C bp.secure_memory=false -C cache_state_modelled=0.
There were also a few changes I needed to make to individual projects. This is not so surprising: the example projects focus on core bring-up and weren’t intended to run significant post-boot workloads:
And that’s it! That gets some KleidiAI kernels working bare-metal with three different toolchains. It’s time to do some performance testing.
As a measure of performance, I used the FVP cycle counter. It’s not a perfect performance measure, but it’s the same workload for all three compilers, so any inaccuracies will be about the same size and in the same places; as an indicative measure, it’s good enough for this investigation. I measured the cycle count for all three compilers at optimization levels -O0, -O1, -O2, and -O3 to boot the core, set up the matrices, and execute the KleidiAI kernels:
There are two interesting things here. Firstly, the bulk of the optimization happens at -O1. There are small gains at -O2 and -O3, more so for GCC, but nothing like the gain at -O1. This is not so surprising: the KleidiAI kernels are already optimized with a significant weight of hand-coded assembly instructions, and the code I’ve added around the kernels is short and simple. I’ll look deeper at the optimizations used later in this blog post.
Secondly, it appears that ATfE is significantly faster than either AC6 or GCC. It is of course great that the next-generation Arm embedded compiler seems to stack up so well against AC6, but the gain is big enough to make me want to look deeper.
The assembler, compiler, and C++ library components of both AC6 and ATfE are based on LLVM; the major difference between the toolchains is in the linker and C library (proprietary for AC6, open source for ATfE). So I’m curious about a ~20% performance gap between the two. I need to make sure that any performance and benchmarking information is applicable to real-world projects, so I need more information about where the ATfE speed-up comes from.
I simultaneously made my performance testing simpler and more complex. I simplified by looking only at -O1, as that’s where the bulk of the optimization happens. But I increased the granularity by splitting the code into three sections:
Cycle counts are as follows:
In terms of time taken to execute the KleidiAI kernels, the three compilers are tightly grouped, although ATfE is slightly ahead of AC6 (by about 1%) and GCC is trailing a little. I re-ran the tests at -O2 and -O3: GCC pulls slightly ahead by -O3, as part of the higher-optimization-level gain I pointed out earlier.
In the Prep section ATfE and AC6 are again quite close, and GCC lags. Again, I re-ran at -O2 and -O3: at those optimization levels GCC closed the gap a little. It looks like different compilers include some optimization passes in different optimization levels.
The big speedup though, and the source of the quicker end-to-end time for ATfE, is coming from the boot section. I’m suspecting that the C library setup is lighter weight for Picolibc (used in ATfE) than for ArmCLib (used in AC6) or newlib (used in GCC). The location of the big speedup for ATfE means that the initial performance comparison is skewed because the test project doesn’t contain much code: if I enlarged the workload the boot code would not represent such a big percentage of the overall run time.
To take a look at the optimization passes that ATfE has (or hasn’t) used, we can use the -Rpass (or -Rpass-missed) compiler options. Both of these take either =.* for all optimization passes, or =<optimization> for individual passes. For example, we might use -Rpass=inline to look at which calls have been inlined or -Rpass-missed=inline to see which calls haven’t. -Rpass-missed can provide valuable information about how C/C++ code could be tweaked to make it easier for the compiler to optimize.
More out of curiosity than anything, I took a quick look at what the ATfE optimization passes were doing at -O0, -O1, -O2 and -O3. Here’s what I found:
Even at -O0 the compiler inlined some Arm C Language Extensions (ACLE) intrinsics calls, for example vaddq_s16 (vector add). This makes sense as the call is a single instruction, so there’s no trade-off between performance (from removed function call overhead) and increased size (from replicated code).
At -O1 the compiler did a lot of function inlining, particularly of small functions (such as the random number generator implementation). It also “Hoisted” instructions and expressions to take them outside loops, if there was no reason for them to be re-evaluated on each loop pass.
At -O2 there was some loop vectorization, although some was postponed until -O3. The compiler uses heuristics to balance the benefit and cost of each optimization. As with inlining, it’s interesting to see different vectorization choices made for different loops at the same optimization level. At -O3, the compiler also unrolled a few loops.
The hoisting is interesting enough to take a closer look at. Given this significantly shortened chunk of code from one of the KleidiAI source files:
    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = dst_row_idx * nr + nr_idx;
The compiler notices that the multiplication part of the n0_idx calculation doesn’t need to be in the inner loop, because both dst_row_idx and nr are constant in the inner loop:
    src/kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0.c:96:47: remark: hoisting mul [-Rpass=licm]
       96 |             const size_t n0_idx = dst_row_idx * nr + nr_idx;
          |                                               ^
The compiler “hoists” the multiplication from the inner loop to the outer loop, something like this:
    for (size_t dst_row_idx = 0; dst_row_idx < dst_num_rows; ++dst_row_idx) {
        const size_t hoist_temp = dst_row_idx * nr;
        for (size_t dst_byte_idx = 0; dst_byte_idx < dst_num_bytes_per_row; ++dst_byte_idx) {
            const size_t block_idx = dst_byte_idx / block_length_in_bytes;
            const size_t nr_idx = block_idx % nr;
            const size_t n0_idx = hoist_temp + nr_idx;
The developer could have done that, but it risks making the code less concise, less clear, and less easy to follow and maintain. The compiler thinks about these things so that the developer can focus on the function, clarity, and maintainability of the code.
There’s a lot of information in the output of the ATfE -Rpass options, both for optimization passes that were taken and passes that were not. This information can be a great help to a developer in understanding how the compiler has optimized code, and in looking at code tweaks that help the compiler optimize better. It’s a large subject, and I’ll leave a deep dive until another blog post.
Arm® Development Studio provides a suite of components that are useful for experimentation with KleidiAI kernels in a bare-metal setting, including example projects for a quick start, Fixed Virtual Platforms (FVPs) for testing, and a license for Arm® Compiler for Embedded (AC6) (and soon, for Arm® Toolchain for Embedded Professional (ATfEP)). As with all software development, care is needed to capture all relevant data when assessing things like compiler performance: in this instance it would have been easy to assume that projects built with ATfE would run around 20% faster than when built with AC6. ATfE makes heuristics-based optimization decisions using the cost and benefit of each potential optimization, and provides useful options for reviewing which optimizations have and haven’t been used. Information from these options might be used to tweak code to enable the compiler to exploit additional optimizations.