Link Time Optimization (LTO) is a form of interprocedural optimization which, as the name suggests, is performed at the time of linking a program. This is particularly useful when building an image from multiple source files that are compiled separately. The compiler does not have complete visibility across all compilation units while compiling an individual source file, so it misses many optimization opportunities that it would have had if the entire code had been part of a single file.
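To make the visibility problem concrete, here is a minimal sketch in C. The names scale, answer and check_answer are hypothetical and are not part of the example used later in this article. The sketch only illustrates the idea; it does not claim that any particular compiler performs exactly this transformation.

```c
#include <assert.h>

/* Hypothetical illustration: none of these names come from the article. */

/* Imagine this definition living in a separate file. While compiling the
   caller's file, the compiler would see only the prototype
   `int scale(int);` and would have to emit a real function call. */
int scale(int x)
{
    return x * 2;
}

/* With the definition visible (as it is here, or as it becomes at link
   time under LTO), an optimizer is free to inline scale() and fold the
   whole expression into a constant. */
int answer(void)
{
    return scale(21);
}

/* Small self-check: the observable behaviour is the same either way;
   only the generated code differs. */
void check_answer(void)
{
    assert(answer() == 42);
}
```

The observable result is identical with or without the optimization. What changes is whether a call instruction and a separate function body remain in the final image.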
Other interprocedural optimizations, such as multi-file compilation and Whole Program Optimization, help address the lack of visibility across compilation units: they enable the compiler to perform cross-file inlining and removal of unused functions. LLVM, however, is capable of performing idle-time and run-time optimizations along with whole-program analysis and aggressive restructuring transformations. LTO leverages this infrastructure to achieve higher levels of optimization, and we will see how this is done in the implementation section of this article.
Before we dive into the details of LTO in ARM Compiler 6, please keep in mind that this feature is still under development and is currently at [ALPHA] quality (refer to the "Support level definitions" section in the ARM Compiler 6 documentation). It will be a fully supported feature in a future release of ARM Compiler 6.
The key to implementing LTO is the generation of bitcode (also known as bytecode) files, which describe an intermediate representation of the source code. Bitcode contains more information about the source file(s) than an ELF object, which enables the linker to generate a more optimized image.
When armclang is invoked with the -flto option, it generates bitcode files for each of the source files being compiled with this option and passes them to the linker. The linker then processes the bitcode files to emit an optimized ELF object which can be linked with the library objects.
When LTO is enabled, the compiler and linker perform the following steps:
Figure 1: This block diagram is a visual representation of the steps involved in Link Time Optimization
This example is derived from the example code available on the LLVM website. Other relevant information about the build process is given below:
Consider the following C source files:

int fn1(void);
void fn2(void);
int fn3(void);

#define VINT volatile int
VINT *msg_buffer = (VINT*)0x32000000;

void fn4(void)
{
    *(msg_buffer++) = 0x000C0DE4;
}

int lto()
{
    return fn1();
}

void fn4(void);

static signed int i = 0;

void fn2(void)
{
    i = -1;
}

static int fn3(void)
{
    fn4();
    return 10;
}

int fn1(void)
{
    int ret_val = 0;

    if (i < 0)
        ret_val = fn3();
    ret_val = ret_val + 64;
    return (ret_val);
}
The source code above can be represented using the following diagram:
Figure 2: Expected Program Flow
By analysing the example code we can make the following observations:
Keeping this in mind we will use the example to compare code generated with and without LTO in the following ways:
This will help us better understand the implementation and benefits of LTO in ARM Compiler 6.
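One way to build an intuition for what a whole-program optimizer can deduce is to merge both source files into a single translation unit and check the observable behaviour. The sketch below does that under a few stated assumptions: a small host array (host_buffer) stands in for the memory-mapped buffer at 0x32000000 so the code can run on a desktop machine, the static fn3 definition is used in place of the extern prototype, the file attributions in the comments are inferred from the build commands, and check_merged_example is a hypothetical helper added for illustration.

```c
#include <assert.h>

#define VINT volatile int

/* Assumption: a host array replaces the hardware address 0x32000000
   so that the sketch is safe to run outside the target. */
static VINT host_buffer[4];
VINT *msg_buffer = host_buffer;

/* Assumed to come from lto.c (the file that defines the lto entry point). */
void fn4(void)
{
    *(msg_buffer++) = 0x000C0DE4;
}

/* Assumed to come from foo.c. */
static signed int i = 0;

void fn2(void)              /* the only writer of i, never called */
{
    i = -1;
}

static int fn3(void)
{
    fn4();
    return 10;
}

int fn1(void)
{
    int ret_val = 0;

    if (i < 0)              /* i is still 0, so this branch is not taken */
        ret_val = fn3();
    ret_val = ret_val + 64;
    return ret_val;
}

/* Assumed to come from lto.c. */
int lto(void)
{
    return fn1();
}

/* Hypothetical self-check, not part of the original example. */
void check_merged_example(void)
{
    assert(lto() == 64);                /* 0 + 64, fn3() is never reached */
    assert(msg_buffer == host_buffer);  /* fn4() never ran */
}
```

Because nothing calls fn2, the variable i never becomes negative, so fn3 and fn4 are unreachable from lto. An optimizer can only prove this when it sees every reference to fn2, which is exactly the visibility that separate compilation takes away.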
Before we move ahead, it would be beneficial to acquaint yourself with some commonly used optimization techniques and terminology by reading the knowledge article.
In this example LTO is not enabled for any of the source files. This means that no bitcode files are generated by the compiler and no link time optimizations are performed. The compiler directly generates object files that are linked by armlink to generate an executable image. It’s important to note that both source files in this case have been compiled with -O2 to keep the comparison as close as possible to the compilation with LTO enabled. When LTO is enabled, the default optimization level selected is -O2.
armclang --target=arm-arm-none-eabi -c foo.c -o foo.o -O2 -march=armv8-m.main
armclang --target=arm-arm-none-eabi -c lto.c -o lto.o -O2 -march=armv8-m.main
armlink foo.o lto.o -o lto.axf --entry=lto --cpu=8-M.Main
fromelf -cd lto.axf -o nolto_ac6.s
At -O2 the compiler performs the following optimizations:
In this example one of the two files (foo.c) is compiled with LTO enabled. This means that a bitcode file is generated only for foo.c, allowing llvm-lto to apply its optimizations to only that part of the source code.
armclang --target=arm-arm-none-eabi -flto -c foo.c -o foo.bc -march=armv8-m.main
armlink --lto foo.bc lto.o -o lto.axf --entry=lto --cpu=8-M.Main
fromelf -cd lto.axf -o lto_sel_ac6.s
Generated Assembly code
Besides the optimizations enabled by compiling at optimization level -O2, enabling LTO in only foo.c leads to the following additional optimizations:
In this example all the input source files are compiled with LTO enabled.
armclang --target=arm-arm-none-eabi -flto -c lto.c -o lto.bc -march=armv8-m.main
armlink --lto foo.bc lto.bc -o lto.axf --entry=lto --cpu=8-M.Main
fromelf -cd lto.axf -o lto_full_ac6.s
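With both files available as bitcode, the optimizer sees the complete call graph starting at the lto entry point. As a hand-reduced sketch of what that visibility permits (not the actual output of ARM Compiler 6), the whole example can in principle collapse to a single function returning a constant. The names lto_reduced and check_reduced are hypothetical.

```c
#include <assert.h>

/* Hand-reduced equivalent of the two-file example, written by following
   this chain of reasoning (an illustration, not compiler output):
     1. fn2() is never referenced, so it can be removed.
     2. With fn2() gone nothing writes to i, so (i < 0) is always false.
     3. The call to fn3() is therefore dead, and so is fn4().
     4. fn1() reduces to "return 0 + 64", which can be inlined into lto(). */
int lto_reduced(void)
{
    return 64;
}

/* Hypothetical self-check for the sketch. */
void check_reduced(void)
{
    assert(lto_reduced() == 64);
}
```

The disassembly in lto_full_ac6.s remains the authoritative record of what the tools actually generated; the sketch only shows the upper bound that the source code allows.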
Along with the optimizations mentioned earlier (in the selective link time optimization section), ARM Compiler 6 is able to perform additional interprocedural optimizations when LTO is enabled for all source files:
At this point it’s worth comparing the interprocedural optimizations in ARM Compiler 6 with those in ARM Compiler 5.
The example below shows the code generated using all the interprocedural optimizations available in ARM Compiler 5.
armcc -c -O3 -Ospace --split_sections --multifile --whole_program --feedback fb.txt --cpu=Cortex-M7 foo.c lto.c -o lto_mf.o
armlink lto_mf.o --list fbout.txt --feedback fb.txt -o lto_mf.axf --cpu=Cortex-M7 --entry=lto
armlink lto_mf.o --list fbout2.txt -o lto_mf.axf --cpu=Cortex-M7 --entry=lto
fromelf -cdv lto_mf.axf -o lto_AC5.s
[Note: The reason for specifying the ARMv7-M based Cortex-M7 as the cpu option is that ARM Compiler 5 does not support ARMv8-M targets.]
The commands listed above need to be run twice: once to generate the feedback file that contains function usage information, and a second time to use the generated feedback file to remove the unused functions/sections identified by the first compile.
In this compilation the compiler has been able to perform only the following two optimizations:
As mentioned earlier, LTO is currently at [ALPHA] quality and there are some limitations and restrictions on its usage at the moment. The armclang compiler in ARM Compiler 6 uses armlink for the linking process, because LLVM Clang doesn’t have its own integrated linker (LLVM provides separate tools: llvm-link for bitcode files and lld for standard object files). Using armlink as the linker makes it easier to link objects built with ARM Compiler 5 and ARM Compiler 6 while still leveraging all the benefits that armclang brings. The current limitations on how LTO can be used will be overcome as the tool chain matures.
Link Time Optimization is a very promising optimization technique that is achieved by having tighter integration between the ARM compiler and linker. It currently has a few limitations which will be overcome in the future, but even in its present state it is extremely powerful and can generate code that’s highly optimized for size, which can also improve performance. This example shows the maximum benefits in code size and performance that can be achieved with LTO. Keep in mind that the mileage you get with LTO may vary based on the nature of the source code it is applied to.
I will try to publish a more comprehensive code size and performance comparison of LTO using industry standard benchmarks in the future. In the meantime I strongly encourage you to experiment with it yourself and, if possible, provide feedback based on your results.