GNU toolchain performance in 2018

What is the GNU toolchain?

In this blog we focus on two components of the GNU toolchain: the GNU Compiler Collection (GCC) and the GNU C Library (glibc). A full toolchain contains several other vital components, such as assemblers, linkers and debuggers, but we do not cover them here.

How important is it?

Very! GCC is the platform compiler for major Linux distributions like Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Ubuntu Linux and many more. That means it is used to compile the Linux kernel, all the supporting system components, and the software packages that constitute a modern Linux distribution. It is also the default compiler for the developers using these distributions for software engineering. Correspondingly, glibc is the default library in these systems, providing the backbone for the extraordinary diversity of functionality, performance and security required by modern software.

Given the above, we are hard at work making sure the GNU toolchain is the best it can be on Arm platforms. While some of the work presented here is by Arm engineers, we must emphasize that all of this is only possible because of our collaboration with the strong GNU toolchain community. Check out the various blogs throughout the community to get a feel for the breadth of work that is being done!

Toolchain performance

One of the areas we focus on is improving the performance of applications built with the GNU toolchain. There are many ways to do this and in this blog we present the highlights from our work in GCC and glibc as these are the two toolchain components that affect performance the most.

Improvements in GCC

The GNU Tools team in Arm has been hard at work doing our share to make this release the best version of GCC for Arm platforms to date. The project follows an annual release cadence, and the 2018 release, GCC 8, has too many improvements to list in this blog! I would, however, like to highlight some of the many optimisation improvements that GCC gained over the last development cycle:

  • GCC gains a new loop interchange pass. This pass transforms loop nests to improve use of the data cache and makes memory access patterns friendlier to crucial subsequent optimisations like auto-vectorisation. It is a well-studied transform that has been missing a good implementation in GCC. Until now! It is now enabled by default at high optimisation levels and has already shown its utility by accelerating multiple benchmarks, most notably 503.bwaves from the popular SPEC CPU 2017 benchmark suite, which improves by more than 10%. This is a phenomenal performance improvement, reproducible across all Arm processors and provided as part of the default toolchain for all users of GCC 8. Consider the loop:

for (int j = 0; j < N; j++)
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
The loop interchange pass can transform this into:
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)  // i, j, k interchanged
    for (int k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
 

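We can see that the memory access pattern for c[i][j] has changed to a more cache-friendly one. In the original loop nest the innermost loop steps through i, so consecutive iterations touch a different row of c each time. C stores arrays row by row, so elements of the same row sit next to each other in memory, often in the same cache line, and the interchanged order makes much better use of that data locality.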
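To convince ourselves that the interchange is safe, here is a minimal, self-contained sketch that runs both loop orders and checks that they produce the same matrix. The size N, the fill pattern and all function names are our own choices for illustration. Nothing here is taken from GCC itself, which performs the transformation internally rather than at the source level.

```c
#include <stdbool.h>

#define N 64 /* small, hypothetical problem size for this sketch */

static double a[N][N], b[N][N];
static double c_orig[N][N], c_swap[N][N];

/* Fill the inputs with small integers so every product and sum is exact
   in double precision and the two results can be compared with ==.  */
static void
init (void)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      {
        a[i][j] = (double) ((i + 2 * j) % 7);
        b[i][j] = (double) ((3 * i + j) % 5);
        c_orig[i][j] = 0.0;
        c_swap[i][j] = 0.0;
      }
}

/* Original order: the innermost loop steps through i, so each iteration
   lands in a different row of c and a.  */
static void
matmul_jki (double c[N][N])
{
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      for (int i = 0; i < N; i++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged order: the row index i is now outermost, so c and a are
   walked along their rows.  */
static void
matmul_ijk (double c[N][N])
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Returns true when both loop orders compute the same matrix.  */
bool
interchange_preserves_result (void)
{
  init ();
  matmul_jki (c_orig);
  matmul_ijk (c_swap);
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (c_orig[i][j] != c_swap[i][j])
        return false;
  return true;
}
```

Calling interchange_preserves_result () from a small main, and timing the two variants with a much larger N, is one rough way to observe the locality effect. Bear in mind that a compiler with the interchange pass enabled may already have transformed the first variant.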

  • The loop distribution pass in GCC is extended to handle more complex situations present in real code. Complex loops that mix vectorisable sequences with non-vectorisable ones (for example because of loop-carried dependencies or complex data aliasing layouts) can be split so that each sequence gets its own loop. The vectorisable parts can then be vectorised independently of the rest of the code, giving the expected performance uplift. Again, this is not an academic, prototype implementation but production-ready functionality that is enabled by default in the compiler at high optimisation levels, giving an improvement of over 25% on the 456.hmmer benchmark from the SPEC CPU 2006 benchmark suite. This pass is a very powerful tool. The analysis it does can be used for many exciting optimisations in the compiler. For example, the code below:

#define M (256)
#define N (512)

struct st
{
  int a[M][N];
  int c[M];
  int b[M][N];
};

void
foo (struct st *p)
{
  for (unsigned i = 0; i < M; ++i)
    {
      p->c[i] = 0;
      for (unsigned j = N; j > 0; --j)
        {
          p->a[i][j - 1] = 0;
          p->b[i][j - 1] = 0;
        }
    }
}
is now optimised into a single call to the standard memset function instead of initialising each field of the struct separately:
foo:
        mov     x2, 1024
        movk    x2, 0x10, lsl 16 // size of memory to initialise is size of whole 'st' struct in bytes
        mov     w1, 0 // initialise memory with zero
        b       memset
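The idea behind loop distribution can also be sketched by hand. In the hypothetical loop below, one statement is independent across iterations while the other carries a dependence from one iteration to the next. Splitting them into two loops leaves the first free to be vectorised. The function names and the restrict qualifiers are assumptions made for this illustration (restrict stands in for the aliasing analysis the compiler has to do itself), and this is not code from GCC or from 456.hmmer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mixed loop: S1 is independent across iterations, S2 depends on the
   value produced by the previous iteration.  */
void
mixed_loop (size_t n, const int *restrict x, int *restrict y,
            int *restrict acc)
{
  for (size_t i = 1; i < n; i++)
    {
      y[i] = 2 * x[i] + 1;          /* S1: vectorisable */
      acc[i] = acc[i - 1] + x[i];   /* S2: loop-carried dependence */
    }
}

/* The same work after distribution: S1 and S2 do not depend on each
   other, so each can live in its own loop.  */
void
distributed_loops (size_t n, const int *restrict x, int *restrict y,
                   int *restrict acc)
{
  for (size_t i = 1; i < n; i++)
    y[i] = 2 * x[i] + 1;            /* can now be vectorised on its own */

  for (size_t i = 1; i < n; i++)
    acc[i] = acc[i - 1] + x[i];     /* stays a scalar running sum */
}

/* Returns true when both versions produce identical arrays.  */
bool
distribution_preserves_result (void)
{
  enum { LEN = 1000 };
  /* Static arrays start zeroed, which covers element 0: the loops
     above never write it.  */
  static int x[LEN], y_mixed[LEN], y_split[LEN];
  static int acc_mixed[LEN], acc_split[LEN];

  for (size_t i = 0; i < LEN; i++)
    x[i] = (int) (i % 17);

  mixed_loop (LEN, x, y_mixed, acc_mixed);
  distributed_loops (LEN, x, y_split, acc_split);

  for (size_t i = 0; i < LEN; i++)
    if (y_mixed[i] != y_split[i] || acc_mixed[i] != acc_split[i])
      return false;
  return true;
}
```

The split is only legal because neither statement reads what the other writes. Proving that for real code is the hard part that the pass takes care of.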

We take our role in the GNU developer community very seriously. All such impactful improvements are presented to the community, co-designed when possible and iterated through cycles of feedback until we have a solution that works not only for our convenience but is maintainable, scalable and usable by as many consumers of the toolchain as possible. We encourage strong participation at developer conferences and present on all kinds of topics, from Bin Cheng presenting the above loop optimisation work to our performance tracking methodology by James Greenhalgh.

Improvements in glibc

The glibc project has been pretty active as well. Many real-world applications spend large portions of their execution time in the library. Arm collaborated with the excellent glibc community to deliver some truly exciting improvements for the 2.27 release in February 2018 and the preceding 2.26 release:

  • The most frequently used single-precision floating-point math routines expf, powf, logf and their derivatives are rewritten from the ground up. The new approach uses double-precision hardware to accelerate single-precision arithmetic operations, along with other improvements to the approximation algorithm, to achieve massive improvements in latency and throughput of the order of 200% and 300% over the previous implementations. On top of that, the new implementations achieve better precision and are written in completely portable standard C, replacing existing hard-to-maintain assembly implementations on some targets and improving the maintainability of the codebase as well. Szabolcs Nagy provided the new implementations and collaborated with the community to integrate this awesome work into the upstream glibc release. Thanks to these new routines, using glibc 2.27 gives a whopping 60% improvement on the 521.wrf benchmark from the SPEC CPU 2017 suite! That by itself lifts the entire aggregate SPEC fprate 2017 score by 3%.

Math routine throughput against glibc 2.26 baseline

  • In response to a customer observation about inconsistent performance of the standard input/output function getchar, we investigated and improved the locking sequence to give upwards of 400% improvement in single-threaded code that uses that common function heavily.

  • Wilco Dijkstra added an optimised implementation of the memcmp function, improving its performance by 25% on aligned memory arguments and by more than 500% on unaligned arguments.

  • Unnecessary synchronisation was removed when accessing Thread Local Storage (TLS) variables from a shared library. This roughly halves the access time to these variables on AArch64 platforms.

  • Memory allocation and deallocation is one of the core functions of a C library and is tricky to get right because so many workloads need to do it. Finding the right balance between memory use and execution speed, measured in both single-threaded and multi-threaded environments across the whole gamut of supported architectures, is not a task for the faint-hearted! The glibc community (and a call-out here to our friends at Red Hat) put a lot of effort into improving the algorithms used for memory allocation, and everyone benefits. From the malloc improvements in glibc 2.26 we see gains of 3% and above in benchmarks like 523.xalancbmk from SPEC CPU 2017 and other malloc-heavy workloads.
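To give a flavour of what "using double-precision hardware to accelerate single-precision arithmetic" can mean for the math routines above, here is a deliberately simplified sketch of an exponential function that takes a float, does all of its internal work in double and rounds once at the end. It is not the glibc implementation: the name sketch_expf, the plain Taylor polynomial and the range cut-offs are all our own assumptions for illustration. The point is only that the extra precision of double makes the intermediate steps cheap to get right.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper for illustration only; NOT the glibc algorithm.  */
float
sketch_expf (float x)
{
  const double ln2 = 0.6931471805599453;
  const double inv_ln2 = 1.4426950408889634;
  /* Taylor coefficients 1/8!, 1/7!, ..., 1/1!, 1/0! for exp(r).  */
  static const double coeff[] = {
    1.0 / 40320.0, 1.0 / 5040.0, 1.0 / 720.0, 1.0 / 120.0,
    1.0 / 24.0, 1.0 / 6.0, 1.0 / 2.0, 1.0, 1.0
  };
  double xd = (double) x;

  /* Special cases; the cut-offs lie well outside the range where a
     float result is finite and non-zero.  */
  if (isnan (x))
    return x;
  if (xd > 150.0)
    return HUGE_VALF;
  if (xd < -150.0)
    return 0.0f;

  /* Range reduction in double: x = k * ln2 + r with |r| <= ln2 / 2.  */
  int k = (int) (xd * inv_ln2 + (xd < 0.0 ? -0.5 : 0.5));
  double r = xd - (double) k * ln2;

  /* Evaluate the polynomial in double using Horner's scheme.  */
  double p = coeff[0];
  for (int i = 1; i < 9; i++)
    p = p * r + coeff[i];

  /* Scale by 2^k, then round to float exactly once.  */
  double result = ldexp (p, k);
  if (result > (double) FLT_MAX)
    return HUGE_VALF;
  return (float) result;
}
```

Because every intermediate value carries far more precision than a float needs, the sketch gets away without the careful error compensation a float-only implementation would require. The real routines are considerably more refined than this.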

Putting it all together

Users of Linux distributions that come out with these newer versions of GCC and glibc can get these and many more improvements as part of their out-of-the-box experience. Our performance tracking metrics show that using the 2018 state-of-the-art components of the GNU toolchain against the equivalent early 2017 releases gives an uplift of at least 1.5% on the aggregate SPEC intrate score of the SPEC CPU 2017 suite and around 8% improvement on the SPEC fprate aggregate score. A pretty good uplift from just upgrading the software stack. The SPEC CPU benchmarks are derived from real-world software packages that have been optimisation targets for decades in some cases. And remember, these are just the aggregate scores in one benchmark suite. Individual applications, depending on their execution profile, may achieve much more.

This post focuses on performance improvements but the GNU toolchain is about so much more. Check out the long list of new features and improvements in GCC 8 on the main project page. It includes support for bleeding-edge language standards, novel architecture features like the Arm Scalable Vector Extension (SVE) and the Armv8.4-A architecture, the latest processors spanning from the smallest embedded applications to the largest HPC behemoths, and much more.

What's next?

The wheels of progress never stop turning. The GNU toolchain community and our team here in Arm are already hard at work improving the toolchain for the 2019 releases. We've got some very exciting projects in flight that we hope to share with you throughout the year.

We will be providing more visibility into the work we do to improve the GNU software ecosystem as well as ways you can get involved and provide us with feedback and areas you'd like to see improved.

Thank you for reading and watch this space.

This will be an exciting year for the GNU toolchain on Arm.

Anonymous