This blog focuses on two components of the GNU toolchain: the GNU Compiler Collection (GCC) and the GNU C library (glibc). A full toolchain contains several other vital components, such as assemblers, linkers and debuggers, but here we concentrate on the compiler and the C library.
The GNU toolchain is very important to the Linux ecosystem. GCC is the platform compiler for major Linux distributions like Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Ubuntu Linux and many more. That means it is used to compile the Linux kernel, all the supporting system components, and the software packages that constitute a modern Linux distribution. It is also the default compiler for developers who use these distributions for software engineering. Correspondingly, glibc is the default C library on these systems, providing the backbone for the extraordinary diversity of functionality, performance and security required by modern software.
Given the above, we are hard at work making sure the GNU toolchain is the best it can be on Arm platforms. While some of the work presented here is by Arm engineers, we must emphasize that all of it is only possible because of our collaboration with the strong GNU toolchain community. Check out the various blogs throughout the community to get a feel for the breadth of work that is being done!
One of the areas we focus on is improving the performance of applications built with the GNU toolchain. There are many ways to do this, and in this blog we present the highlights from our work in GCC and glibc, as these are the two toolchain components that affect performance the most.
The GCC project follows an annual release cadence, and the 2018 release, GCC 8, has too many improvements to list in this blog! The GNU Tools team at Arm has been hard at work doing our share to make it the best version of GCC for Arm platforms to date. I would, however, like to highlight some of the many optimisation improvements that GCC gained over the last development cycle:
503.bwaves
for (int j = 0; j < N; j++)
  for (int k = 0; k < N; k++)
    for (int i = 0; i < N; i++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)  // i, j, k interchanged
    for (int k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
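We can see that the memory access pattern for c[i][j] has changed to a more cache-friendly iteration. In the original nest the innermost loop steps through i, so successive iterations touch a different row of c each time. After the interchange i is the outermost index, so the elements of each row of c, which sit next to each other in memory and often share a cache line, are accessed together. The interchanged access pattern therefore makes much better use of data locality.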
456.hmmer
#define M (256)
#define N (512)

struct st
{
  int a[M][N];
  int c[M];
  int b[M][N];
};

void foo (struct st *p)
{
  for (unsigned i = 0; i < M; ++i)
    {
      p->c[i] = 0;
      for (unsigned j = N; j > 0; --j)
        {
          p->a[i][j - 1] = 0;
          p->b[i][j - 1] = 0;
        }
    }
}
memset
foo:
        mov     x2, 1024
        movk    x2, 0x10, lsl 16   // size of memory to initialise is size of whole 'st' struct in bytes
        mov     w1, 0              // initialise memory with zero
        b       memset
We take our role in the GNU developer community very seriously. All such impactful improvements are presented to the community, co-designed when possible, and iterated through cycles of feedback until we have a solution that works not only for our convenience but is also maintainable, scalable and usable by as many consumers of the toolchain as possible. We encourage strong participation at developer conferences and present on all kinds of topics, from Bin Cheng's presentation of the loop optimisation work above to James Greenhalgh's on our performance tracking methodology.
The glibc project has been pretty active as well. Many real-world applications spend large portions of their execution time in the library. Arm collaborated with the excellent glibc community to deliver some truly exciting improvements for the 2.27 release in February 2018 and the preceding 2.26 release:
getchar
memcmp
malloc
glibc 2.26
523.xalancbmk
Users of Linux distributions that ship these newer versions of GCC and glibc get these and many more improvements as part of their out-of-the-box experience. Our performance tracking metrics show that using the 2018 state-of-the-art components of the GNU toolchain instead of the equivalent early 2017 releases gives an uplift of at least 1.5% on the aggregate SPEC intrate score of the SPEC CPU 2017 suite and an improvement of around 8% on the aggregate SPEC fprate score. That is a pretty good uplift from just upgrading the software stack, especially as the SPEC CPU benchmarks are derived from real-world software packages that have, in some cases, been optimisation targets for decades. And remember, these are just the aggregate scores in one benchmark suite. Individual applications, depending on their execution profile, may achieve much more.
This post focuses on performance improvements, but the GNU toolchain is about so much more. Check out the long list of new features and improvements in GCC 8 on the main project page. These include support for bleeding-edge language standards, novel architectures such as the Arm Scalable Vector Extension (SVE) and Armv8.4-A, the latest processors, spanning from the smallest embedded applications to the largest HPC behemoths, and much more.
The wheels of progress never stop turning. The GNU toolchain community and our team here at Arm are already hard at work improving the toolchain for the 2019 releases. We've got some very exciting projects in flight that we hope to share with you throughout the year.
We will be providing more visibility into the work we do to improve the GNU software ecosystem, as well as ways you can get involved, give us feedback and tell us which areas you'd like to see improved.
Thank you for reading, and watch this space: this will be an exciting year for the GNU toolchain on Arm.