Achieving the best performance and code size using Link Time Optimization in Arm Compiler 6

As Product Manager for the Arm Compilers, I’m often asked “Paul - this LTO stuff – what’s it all about?!” And I reply “Ah – I’m glad you asked me that! Please, grab yourself a cup of coffee and pull up a seat, let’s have a chat.”

Compiler optimization

All compilers optimize the images they create, and optimization is often an important consideration in compiler choice. If you’re working in the application space creating code targeted for multi-core Cortex-A implementations, then the chances are that you’re deeply interested in code performance. You’ll be interested in an image that runs as fast as possible but is still a faithful representation of your source code. Performance optimizations from the compiler will make your applications run faster and achieve more, and will improve responsiveness and the overall user experience. Or, you might be able to decrease the duty cycle of your cores, reducing power consumption and increasing battery life. You might even be able to transition to less powerful, more efficient cores, or to use fewer cores – decreasing both power consumption and hardware costs.

If you’re working on embedded code targeted at Cortex-M cores, then code density may be a higher priority for you than outright performance. You might be constrained by a small memory footprint, and therefore need high code density to fit as much functionality into that restricted footprint as possible. If you’re at the hardware design stage it might of course be possible to increase the memory size, but that will have an impact on hardware cost that could be significant. If you’re creating an update for devices that are already in the field, a hardware re-design is of course not an option – and having to drop features because of memory restrictions will be deeply distressing! You may also find a performance benefit from creating an image that’s small enough to fit inside your Tightly Coupled Memory, or critical loops tight enough to reduce cache reloads.

Whether you’re focussed on performance or code density, or you need the right blend of the two, using a compiler with the right optimization behaviour is important. At Arm we understand this, and we devote significant time and resources to developing compilers with superior optimization functionality for all types of user code.

Multi-file optimization

Traditional compiler optimization is carried out at the compilation unit level. Optimization is performed as each C/C++ source file is compiled:

[Figure: compiler and linker]

This works reasonably well and gives good results, but it’s not so good at picking up optimizations that can be made between source files. Arm Compiler 5 went a step further and offered multi-file compilation. In essence, Compiler 5 would condense multiple source files together to produce somewhat optimized source code, and then compile the aggregated source code as a single unit:

[Figure: preprocessor, compiler, and linker]

Aggregating source files is a fairly simple strategy, and relatively easy to implement in a compiler. The key advantage here is the ability to inline functions between C/C++ source files, and depending on the code there can be significant performance benefits for doing this. The downside of this technique is that it’s quite resource-hungry. For a small number of source files it’s fine, but after that the time and memory requirements of the compiler tend to grow rapidly. By the time you get to a few tens of source files, and definitely by the time you reach hundreds or even thousands of source files, you need to find a better way.

Link Time Optimization

Arm Compiler 6 approaches this challenge via a slightly different route called Link Time Optimization (LTO). Individual C/C++ source files are compiled as usual, but not to object files. Instead the output of the compiler is an intermediate representation called bytecode, which contains additional information when compared to the more usual object file. Additional functionality in the linker then consumes the bytecode and produces the final image file:

[Figure: enhanced compiler and enhanced linker]

This strategy is more complex to implement, and involves enhancements to the linker as well as to the compiler. However it gives far greater scope to the linker for making optimizations across source files, without imposing excessive memory and time requirements at the link stage.

The benefit you’ll see from LTO depends very much on the source code being compiled and linked. As with multi-file compilation the key performance gain is from cross-module function inlining. Although the benefit can vary widely for different workloads and styles of code, our tests here across a wide range of benchmarks point to an average performance boost of around 10% - definitely worth having!

LTO has another trick up its sleeve too. The wide visibility the linker has across all the source files in the project means that it can be very effective at identifying code that’s never called or which doesn’t impact the program’s output. This code can then be optimized away, giving a benefit in smaller code size. Of course, inlining functions across source files will tend to introduce duplication and increase the image size. So LTO produces opposing pressures on code size and the end result will depend on your code: some projects will see a significant decrease in code size, some projects won’t benefit quite so much.

Link Time Optimization in Arm Compiler 6 is enabled using the -flto option with the compiler and --lto with the linker, and is automatically enabled when using -Omax.
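To make this a little more concrete, here is a minimal sketch of the kind of code that tends to benefit. It shows two hypothetical source files in a single listing so that it compiles as it stands. The file names, function names, and the build lines in the comment are assumptions for illustration only; your own target options will differ.

```c
/*
 * Sketch only: two hypothetical source files shown in one listing.
 *
 * One possible build (replace <target options> with your usual ones):
 *   armclang <target options> -flto -c scale.c  -o scale.o
 *   armclang <target options> -flto -c filter.c -o filter.o
 *   armlink --lto scale.o filter.o -o image.axf
 */
#include <stddef.h>
#include <stdint.h>

/* ---- scale.c (hypothetical file) ---- */

/* Small helper. Without LTO, a caller in another source file has to make
 * a real function call, because the compiler cannot see this body. */
int32_t scale_sample(int32_t sample, int32_t gain)
{
    return (sample * gain) / 256;
}

/* Never called anywhere in this sketch. With whole-program visibility,
 * this is the sort of function that may be identified as unused. */
int32_t scale_sample_debug(int32_t sample)
{
    return sample * 2;
}

/* ---- filter.c (hypothetical file) ---- */

/* Calls the helper once per element. With LTO, the helper can be inlined
 * into this loop even though it lives in a different source file. */
int32_t sum_scaled(const int32_t *samples, size_t count, int32_t gain)
{
    int32_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += scale_sample(samples[i], gain);
    }
    return total;
}
```

Whether the helper really is inlined, and whether the unused function really is removed, depends on your code and optimization settings, so treat this as an illustration of the opportunity rather than a guarantee.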

This sounds too good to be true?!

So far this all sounds fantastic – a significant uplift in performance, plus a potential increase in code density, all without catastrophic increases in time and memory requirements at the link stage? Surely this is what we’ve all been waiting for in a compiler?!

Of course, when using LTO on a real project, there are a few things we need to be aware of. The first thing we need to remember is that the LTO functionality results from a collaboration between the Arm Compiler and the Arm linker: the compiler creates additional information in the bytecode files which can then be consumed and acted upon by the linker. This means that when linking in a pre-built library, the linker doesn’t have access to the extra information for the library functions. LTO will still be effective on your source code of course: you just won’t see the benefit for functions imported from the library.

The second potential challenge is slightly more interesting. LTO is proper “gloves off” compiler functionality that unleashes truly aggressive optimizations to get you the best code performance possible. As such, LTO will subject your code to optimizations that it hasn’t faced before and sometimes, just once in a while, you might find that this shakes a latent bug out of your code. This is good news of course: you’ve faced up to a latent bug, and your code is now higher quality for it. But at the point it happens, it can be troublesome and it can also be slightly unnerving.

We hit this very challenge last year, when we started migrating our MDK projects from Compiler 5 to Compiler 6. One particular project just stopped working when we turned LTO on: the code compiled, built, and ran, but it just sat spinning in a loop instead of doing anything useful. Since this project had been working fine for several years we concluded that the project itself must be blameless, and the fault must lie with the LTO implementation in the compiler.

In fact, we traced the fault to a bug in the project code. An interrupt routine was changing the value of a variable based on input from a keyboard, and the value of this variable was then being tested by a loop in the main program code. However we had omitted to mark the variable as volatile, and so following cross-module inlining the linker saw that the program loop could be optimized away: it was checking a variable that could not be changing. This resulted in large portions of the code being discarded. As soon as we corrected the variable definition, the code sprang back into life and worked perfectly.
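A minimal sketch of this kind of bug, and its fix, might look like the following. This is not the actual MDK project code: the names are hypothetical, and the handler takes the key code as a parameter purely to keep the sketch self-contained, whereas a real handler would read it from a hardware register.

```c
#include <stdint.h>

/* Shared between the interrupt handler and the main code.
 * The buggy version would omit 'volatile' here, allowing the optimizer
 * to assume that nothing changes the value while the loop below is
 * running. */
static volatile uint8_t last_key = 0;

/* Hypothetical interrupt handler: records the most recent key press. */
void keyboard_irq_handler(uint8_t key_code)
{
    last_key = key_code;
}

/* Main-program side: spin until the handler has recorded a key.
 * Because last_key is volatile, it is re-read on every iteration. */
uint8_t wait_for_key(void)
{
    while (last_key == 0) {
        /* busy-wait for the interrupt */
    }
    uint8_t key = last_key;
    last_key = 0; /* mark the key as consumed */
    return key;
}
```

Without the volatile qualifier this code can appear to work for years at lower optimization levels, which is exactly what makes this kind of bug so easy to miss.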

One further challenge we encountered migrating the MDK projects to Compiler 6 centred on a routine that wrote data to a flash memory. The routine called functions from different source files and after LTO performed its cross-module inlining the routine ran significantly faster, as expected. However the routine ultimately ran just a little bit too fast: so fast, in fact, that it stepped outside the timing characteristics of the flash memory. This left us with a little re-writing to do, and a sobering thought that the routine had worked well for years – but only by accident!
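One common way to avoid depending on accidental timing is to wait explicitly for the device to report that it is ready, rather than relying on how long the surrounding code happens to take. The sketch below illustrates that idea only; it is not the rewrite we actually made. The register names and the busy bit are hypothetical, and plain variables stand in for memory-mapped registers so that the code runs anywhere.

```c
#include <stdint.h>

/* Hypothetical stand-ins for memory-mapped flash controller registers. */
static volatile uint32_t flash_status = 0; /* bit 0 set while busy */
static volatile uint32_t flash_data = 0;   /* word to be programmed */

#define FLASH_BUSY 0x1u

/* Wait until the (hypothetical) controller reports it is ready.
 * The timing now comes from the device, not from call overhead. */
static void flash_wait_ready(void)
{
    while ((flash_status & FLASH_BUSY) != 0u) {
        /* poll the status register */
    }
}

/* Write 'count' words, checking readiness before each one.
 * Returns the number of words written. */
uint32_t flash_write_words(const uint32_t *src, uint32_t count)
{
    uint32_t written = 0;
    for (uint32_t i = 0; i < count; i++) {
        flash_wait_ready();
        flash_data = src[i];
        written++;
    }
    return written;
}
```

Written this way, the routine behaves the same however aggressively the code around it is inlined or optimized.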

LTO has been available in Arm Compiler 6 for around a year now, and it’s been used extensively by customers and by our own internal development teams. None of that use has uncovered any defects in the functionality, and indeed LTO is inside the boundary of the Arm Compiler 6 qualification for Functional Safety. For Functional Safety use we can only qualify the most stable, high quality, and well tested compiler functionality, so qualification is a very visible mark of quality and confidence for LTO. If you do encounter any problems using LTO, the cause is almost certainly a latent bug shaken out of your code rather than a defect in the Arm Compiler.

Summary

So that’s it! Now you know everything there is to know about Link Time Optimization in Arm Compiler 6. We’ve talked about what it does and looked at the benefits it can bring, and we’ve also explored some of the things you might need to be aware of. So now it’s down to you to make the best use of LTO: make your customers happy and your boss proud!

Learn more about Compiler 6
