Link Time Optimization in ARM Compiler 6

December 3, 2015


Introduction

Link Time Optimization (LTO) is a form of interprocedural optimization which, as the name suggests, is performed at the time of linking a program. This is particularly useful when building an image from multiple source files that are compiled separately. The compiler does not have complete visibility across all compilation units while compiling an individual source file, so it misses many optimization opportunities that it would have had if all the code had been part of a single file.

Other interprocedural optimizations, such as multi-file compilation and Whole Program Optimization, help address the lack of visibility across compilation units: they enable the compiler to perform cross-file inlining and removal of unused functions. However, LLVM is capable of performing idle and runtime optimizations along with whole program analysis and aggressive restructuring transformations. LTO leverages this infrastructure to achieve higher levels of optimization, as described in the Design and Implementation section of this article.

Before we dive into the details of LTO in ARM Compiler 6, please keep in mind that this feature is still under development and is currently at [ALPHA] quality (refer to the "Support level definitions" section in the ARM Compiler 6 documentation). It will be a fully supported feature in a future release of ARM Compiler 6.

Design and Implementation

The key to implementing LTO is the generation of bitcode (also known as bytecode) files, which describe an intermediate representation of the source code. Bitcode contains more information about the source file(s) than an ELF object, which enables the linker to generate a more optimized image.

When armclang is invoked with the -flto option, it generates bitcode files for each of the source files being compiled with this option and passes them to the linker. The linker then processes the bitcode files to emit an optimized ELF object which can be linked with the library objects.

When LTO is enabled, the compiler and linker perform the following steps:

The compiler translates source code into an intermediate representation called bitcode. This also contains module dependency information.
The linker processes these bitcode files along with other ELF object files and extracts the module dependency information from them before passing them to the link time optimizer (llvm-lto) utility.
The dependency information of the modules allows the link time optimizer to retain all the necessary modules and remove the rest, therefore creating a highly optimized ELF object file.
The link time optimized object file is linked with other ELF object files and pre-compiled libraries to generate the final executable image.

Figure 1: This block diagram is a visual representation of the steps involved in Link Time Optimization

Example

This example is derived from the example code available on the LLVM website. Other relevant information about the build process is given below:

The compilation toolchain used to build this example is ARM Compiler 6.3.
It was built on a 64-bit Windows platform (the results are platform independent).
The examples are targeted at the ARMv8-M architecture.

Consider the following C source files:

/* ------------ lto.c ------------ */

int fn1(void);
void fn2(void);
int fn3(void);
#define VINT volatile int

VINT *msg_buffer = (VINT*)0x32000000;

void fn4(void) {
  *(msg_buffer++) = 0x000C0DE4;
}

int lto() {
  return fn1();
}

/* ------------ foo.c ------------ */

void fn4(void);
static signed int i = 0;

void fn2(void) {
  i = -1;
}

static int fn3(void) {  
  fn4();
  return 10;
}

int fn1(void) {
  int ret_val = 0;

  if (i < 0)
    ret_val = fn3();

  ret_val = ret_val + 64;
  return (ret_val);
}

The source code above can be represented using the following diagram:

Figure 2: Expected Program Flow

By analysing the example code we can make the following observations:

Function fn2() is not referenced by any function in the source code.
Function fn3() calls fn4().
Function fn1() conditionally calls fn3().
Function lto() calls fn1().
fn3() is only called by fn1() if the value of i is less than 0.
Calling fn2() would be the only way to make the value of i less than 0.
Variable i is defined as static, so it can only be modified by functions within the same translation unit.
Because fn2() is never called, the condition under which fn3() is executed is never satisfied.
- This means fn3() will never be called in fn1().
- This implies fn4() will never be called as it is called by fn3().
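
The chain of reasoning above amounts to a reachability walk over the call graph. The following is a minimal sketch in C, assuming a toy adjacency-matrix model. The enum, mark_reachable and count_live_functions are hypothetical names chosen for illustration and are not part of LLVM or ARM Compiler 6. Once the call from fn1() to fn3() is known to be dead, only lto() and fn1() remain reachable from the entry point.

```c
#include <assert.h>
#include <stdbool.h>

/* Functions from the example, used as indices into a toy call graph. */
enum { LTO, FN1, FN2, FN3, FN4, NUM_FUNCS };

/* Depth-first walk: mark every function reachable from `entry`.
 * calls[a][b] is true when function a contains a live call to function b. */
void mark_reachable(bool calls[NUM_FUNCS][NUM_FUNCS], int entry,
                    bool reachable[NUM_FUNCS])
{
    assert(entry >= 0 && entry < NUM_FUNCS);
    if (reachable[entry])
        return;
    reachable[entry] = true;
    for (int callee = 0; callee < NUM_FUNCS; callee++)
        if (calls[entry][callee])
            mark_reachable(calls, callee, reachable);
}

/* Count the functions that must be kept when lto() is the entry point.
 * fn3_call_is_live models whether the condition (i < 0) can ever hold. */
int count_live_functions(bool fn3_call_is_live)
{
    bool calls[NUM_FUNCS][NUM_FUNCS] = {{false}};
    bool reachable[NUM_FUNCS] = {false};
    int live = 0;

    calls[LTO][FN1] = true;              /* lto() calls fn1() */
    calls[FN1][FN3] = fn3_call_is_live;  /* fn1() conditionally calls fn3() */
    calls[FN3][FN4] = true;              /* fn3() calls fn4() */
    /* fn2() has no callers, so no edge leads to it. */

    mark_reachable(calls, LTO, reachable);
    for (int f = 0; f < NUM_FUNCS; f++)
        live += reachable[f] ? 1 : 0;
    return live;
}
```

With the conditional call treated as live, count_live_functions(true) keeps four functions (lto, fn1, fn3 and fn4). With the call proven dead, count_live_functions(false) keeps only two (lto and fn1).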

Keeping this in mind we will use the example to compare code generated with and without LTO in the following ways:

Without Link Time Optimization (using ARM Compiler 6.3)
With selective Link Time Optimization (using ARM Compiler 6.3)
With full Link Time Optimization (using ARM Compiler 6.3)
With all available Inter-procedural optimizations in ARM Compiler 5

This will help us better understand the implementation and benefits of LTO in ARM Compiler 6.

Before we move ahead, it would be beneficial to acquaint yourself with some commonly used optimization techniques and terminology by reading the knowledge article.

Compiling without Link Time Optimization

In this example LTO is not enabled for any of the source files. This means that the compiler does not generate bitcode files and no link time optimizations are performed. The compiler directly generates object files that are linked by armlink to generate an executable image. Note that both source files in this case have been compiled with -O2 to keep the comparison as close as possible to the compilation with LTO enabled, because the default optimization level selected when LTO is enabled is -O2.

Build Commands:

armclang --target=arm-arm-none-eabi -c foo.c -o foo.o -O2 -march=armv8-m.main

armclang --target=arm-arm-none-eabi -c lto.c -o lto.o -O2 -march=armv8-m.main

armlink --lto foo.o lto.o -o lto.axf --entry=lto --cpu=8-M.main

fromelf -cd lto.axf -o nolto_ac6.s

Generated Assembly code

Optimizations

At -O2 the compiler performs the following optimizations:

Function fn3() has been inlined into its caller, fn1().
A tail-call optimization has been applied to lto() for the call to fn1().
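
Expressed back in C, the result of these two optimizations looks roughly like the sketch below. This is a hand-written source-level approximation, not output from the compiler. As an assumption made so that it can run on a host machine, msg_buffer points at a small array (host_buffer, a hypothetical name) instead of the address 0x32000000.

```c
#include <assert.h>

/* Assumption: host-side stand-in for the buffer at 0x32000000. */
#define HOST_BUFFER_WORDS 4
volatile int host_buffer[HOST_BUFFER_WORDS];
volatile int *msg_buffer = host_buffer;

static int i = 0;

void fn4(void) {
  assert(msg_buffer < host_buffer + HOST_BUFFER_WORDS); /* host-only bounds check */
  *(msg_buffer++) = 0x000C0DE4;
}

/* Kept at this stage: fn2() has external linkage, so the compiler must emit it. */
void fn2(void) {
  i = -1;
}

/* fn3() no longer exists as a separate function; its body sits inside fn1(). */
int fn1(void) {
  int ret_val = 0;

  if (i < 0) {
    fn4();        /* inlined body of fn3() */
    ret_val = 10;
  }

  return ret_val + 64;
}

/* In the generated code this call becomes a plain branch to fn1() (tail call). */
int lto(void) {
  return fn1();
}
```

The condition on i is still evaluated at run time here, because nothing visible to the compiler proves that fn2() is never called.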

Compiling with selective Link Time Optimization

In this example one of the two files (foo.c) is compiled with LTO enabled. This means that a bitcode file is generated only for foo.c, so llvm-lto can apply its optimizations to only part of the source code.

Build Commands:

armclang --target=arm-arm-none-eabi -flto -c foo.c -o foo.bc -march=armv8-m.main

armclang --target=arm-arm-none-eabi -c lto.c -o lto.o -O2 -march=armv8-m.main

armlink --lto foo.bc lto.o -o lto.axf --entry=lto --cpu=8-M.Main

fromelf -cd lto.axf -o lto_sel_ac6.s

Generated Assembly code

Optimizations

Besides the optimizations enabled by compiling at optimization level -O2, enabling LTO for only foo.c leads to the following additional optimizations:

The compiler removes fn2() as it is not called by any of the other functions in the source files.
llvm-lto can determine that the value of i in fn1() is never less than 0, and removes the call to fn3().
This means that the value of ret_val is never modified by fn3(), so fn1() can be reduced to simply return the fixed value 0x40 (64).
The compiler removes fn3() but misses the opportunity to remove fn4(), even though fn4() was only called by the now-removed fn3(). This is because lto.c, which defines fn4(), was not compiled with LTO enabled.
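
A hedged source-level picture of this result, written by hand from the list above and not taken from the tools' output: foo.c collapses to a constant-returning fn1(), while lto.c, compiled without LTO, is unchanged and still carries fn4(). host_buffer is a hypothetical stand-in for the address 0x32000000 so that the sketch can run on a host machine.

```c
/* ---- foo.c after LTO: fn2() and fn3() removed, fn1() folded to a constant ---- */
int fn1(void) {
  return 64;  /* 0x40 */
}

/* ---- lto.c, compiled without LTO: unchanged ---- */
/* Assumption: host-side stand-in for the buffer at 0x32000000. */
volatile int host_buffer[4];
volatile int *msg_buffer = host_buffer;

/* No caller remains, but fn4() is still present in the image. */
void fn4(void) {
  *(msg_buffer++) = 0x000C0DE4;
}

int lto(void) {
  return fn1();
}
```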

Compiling with full Link Time Optimization

In this example all the input source files are compiled with LTO enabled.

Build Commands:

armclang --target=arm-arm-none-eabi -flto -c foo.c -o foo.bc -march=armv8-m.main

armclang --target=arm-arm-none-eabi -flto -c lto.c -o lto.bc -march=armv8-m.main

armlink --lto foo.bc lto.bc -o lto.axf --entry=lto --cpu=8-M.Main

fromelf -cd lto.axf -o lto_full_ac6.s

Generated Assembly Code

Optimizations

Along with optimizations mentioned earlier (in the selective link time optimization section), ARM Compiler 6 is able to perform additional interprocedural optimizations when LTO is enabled for all source files:

The function fn1() is inlined into lto() even though fn1() is defined in a different compilation unit.
Similarly, the compiler can determine that since fn3() will not be called by fn1(), it can remove the definition of fn4() (this was not possible earlier as fn3() and fn4() are defined in different files).
This means the compiler can now reduce the entire source code to a single lto() function, resulting in extremely small and efficient code as shown above.
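
To make the end result concrete, the sketch below places the original logic in a single translation unit, which is roughly the view the link time optimizer has once every file is supplied as bitcode, next to the single function that the list above describes. It is hand-written for illustration and is not tool output. lto_merged, lto_reduced and host_buffer are hypothetical names, and host_buffer replaces the address 0x32000000 so that the code can run on a host machine.

```c
/* Assumption: host-side stand-in for the buffer at 0x32000000. */
volatile int host_buffer[4];
volatile int *msg_buffer = host_buffer;

static int i = 0;  /* no remaining function ever writes to i */

static void fn4(void) {
  *(msg_buffer++) = 0x000C0DE4;
}

static int fn3(void) {
  fn4();
  return 10;
}

static int fn1(void) {
  int ret_val = 0;

  if (i < 0)
    ret_val = fn3();

  return ret_val + 64;
}

/* Merged view: the example's logic with fn2() already discarded as unreferenced. */
int lto_merged(void) {
  return fn1();
}

/* Reduced view: what remains once the dead branch is removed and fn1() is inlined. */
int lto_reduced(void) {
  return 64;
}
```

Both functions return the same value for every call, which is why the reduction is safe: with fn2() gone, the branch guarded by i < 0 can never be taken.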

Interprocedural optimizations using ARM Compiler 5

At this point it is worth looking at how interprocedural optimization in ARM Compiler 6 has improved on ARM Compiler 5.

The example below shows the code generated using all the interprocedural optimizations available in ARM Compiler 5.

Build Commands:

armcc -c -O3 -OSpace --split_sections --multifile --whole_program --feedback fb.txt --cpu=Cortex-M7 foo.c lto.c -o lto_mf.o

armlink lto_mf.o --list fbout.txt --feedback fb.txt -o lto_mf.axf --cpu=Cortex-M7 --entry=lto

armcc -c -O3 -OSpace --split_sections --multifile --whole_program --feedback fb.txt --cpu=Cortex-M7 foo.c lto.c -o lto_mf.o

armlink lto_mf.o --list fbout2.txt -o lto_mf.axf --cpu=Cortex-M7 --entry=lto

fromelf -cdv lto_mf.axf -o lto_AC5.s

[Note: The reason for specifying the ARMv7-M based Cortex-M7 as the cpu option is that ARM Compiler 5 does not support ARMv8-M targets.]

The compile and link steps need to be run twice: the first pass generates the feedback file, which contains function usage information, and the second pass uses that feedback file to remove the unused functions and sections identified by the first pass.

Generated Assembly Code

Optimizations:

In this compilation the compiler has been able to perform only the following three optimizations:

Removing the unused function fn2().
Inlining fn3() into fn1().
Tail call optimization of call to function fn1().

LTO Current Restrictions and Limitations in ARM Compiler 6

As mentioned earlier, LTO is currently at [ALPHA] quality and there are some limitations and restrictions on its usage at the moment. The armclang compiler in ARM Compiler 6 uses armlink for the linking process, because LLVM Clang doesn't have its own integrated linker (LLVM has separate tools: llvm-link for bitcode files and lld for standard object files). Using armlink as the linker makes it easier to link objects built with ARM Compiler 5 and ARM Compiler 6 together, while still leveraging all the benefits that armclang brings. The current limitations on how LTO can be used, listed below, will be overcome as the toolchain matures.

LTO cannot be performed on static libraries, as neither armar nor armclang can generate bitcode files for libraries.
Partial linking is not supported with LTO, as it only works with ELF objects and not with bitcode files.
You might get linking errors if your library code calls a function that was defined in the source code but removed by the link time optimizer.
Scatter-loading of LTO objects is supported, but it is recommended only for code and data that do not have a strict placement requirement.
Bitcode objects are not guaranteed to be compatible across compiler versions. This means that you should ensure all your bitcode files are built using the same version of the compiler when linking with LTO.
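
For the case where library code needs a function that the link time optimizer would otherwise discard, one possible mitigation is to mark the function with the used attribute, which Clang-based compilers accept as a request to keep a definition even when no reference to it is visible. This is a sketch under that assumption. It has not been checked against the [ALPHA] LTO flow described here, and board_log_hook is a hypothetical function name.

```c
/* Hypothetical callback that only pre-built library code calls.
 * The used attribute asks the compiler to keep the definition even
 * though no caller is visible in the bitcode. */
__attribute__((used))
int board_log_hook(int level) {
  return level > 0 ? level : 0;  /* clamp negative levels to zero */
}
```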

Conclusion

Link Time Optimization is a very promising optimization technique that is achieved through tighter integration between the ARM compiler and linker. It currently has a few limitations, which will be overcome in the future, but even in its present state it is extremely powerful and can generate code that is highly optimized for size, which can also improve performance. The example in this article shows the maximum benefits in code size and performance that can be achieved with LTO. Keep in mind that your mileage with LTO may vary depending on the nature of the source code it is applied to.

I will try to publish a more comprehensive code size and performance comparison of LTO using industry standard benchmarks in the future. In the meantime, I strongly encourage you to experiment with LTO yourself and, if possible, provide feedback based on your results.
