A question we are often asked is: which compiler should I use for HPC on Arm?
Arm’s engineers contribute to the open-source GNU and LLVM compilers, alongside other engineers in the Arm ecosystem and beyond. Both compilers are important and highly valued.
Arm takes LLVM further by producing the Arm Compiler for Linux (ACfL). ACfL is packaged and product-grade tested by the team at Arm and is based on the latest LLVM release – but with some “extras” that I will talk about in this blog. ACfL is freely available from Arm for 64-bit Linux servers (https://developer.arm.com/downloads/-/arm-compiler-for-linux).
So, back to the original question – which compiler should I use?
In truth, there is no universal perfect compiler – and both compiler families are moving targets that are getting better performance with every release.
We show examples and report averages, but ultimately only benchmarking your own workload will identify the best choice for you. The choice of optimization flags is usually a significant factor in performance too (-Ofast where permissible, or -O3 where not), along with targeting the compiler at the platform you intend to use: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
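By way of illustration only, typical invocations might look like the lines below. The -mcpu values are examples rather than recommendations, and -Ofast relaxes floating-point semantics, so only use it where the results remain acceptable.

    gcc -Ofast -mcpu=native -o app app.c              # GCC, tuned for the build machine
    armclang -Ofast -mcpu=native -o app app.c         # ACfL accepts the same style of flags
    gfortran -O3 -mcpu=neoverse-v1 -o app app.f90     # stricter FP semantics, explicit target CPU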
For HPC users, real workloads occupying full machines define system performance. Single-core performance does not expose system performance and the true impact of, for example, memory bandwidth. Furthermore, small test cases or benchmarks may not reflect the effects of cache size that would be seen on more typical larger cases.
To explore the performance, (mostly) real workloads have been run across a range of HPC segments:
For our benchmarking, the configurations were:
Arm Performance Libraries (ArmPL) – our high-performance math library – is linked in for the codes identified by (*) in the list above, as these make significant use of BLAS calls or FFTs; in some cases this depends on the test cases selected. If you are not yet using ArmPL, you can read more about the latest release in our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-23-04).
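To give a flavor of what linking ArmPL looks like in practice, here is a minimal sketch of calling BLAS dgemm through its CBLAS interface. The armpl.h header and the -armpl convenience flag are as described in the ArmPL documentation, but check the release notes for your compiler version; the program itself is purely illustrative.

    /* Minimal sketch: C = alpha*A*B + beta*C through ArmPL's optimized BLAS.   */
    /* Illustrative build line: armclang -Ofast -armpl -o dgemm_demo dgemm_demo.c */
    #include <stdio.h>
    #include <armpl.h>   /* ArmPL header; a plain CBLAS header is an alternative */

    int main(void) {
        const int n = 4;
        double A[16], B[16], C[16];
        for (int i = 0; i < 16; i++) { A[i] = i; B[i] = 1.0; C[i] = 0.0; }

        /* Row-major n x n matrix multiply handled by ArmPL */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0] = %f\n", C[0]);
        return 0;
    }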
60% of the measured cases were faster with ACfL than with GCC, and of these, 35% were faster with ACfL by 5% or more over GCC.
Across the applications, there is a geomean improvement of 8.2% from ACfL 23.04 over GCC 12.2.0, but this is skewed by one outlier from Computational Finance. If that outlier is excluded, there is a 3.0% improvement in favor of ACfL.
In domains such as Computational Fluid Dynamics (CFD), problems are usually memory-bandwidth dominated: cores can only operate on data as quickly as it can be fed from memory. This reduces the scope for massive compiler impact, but even memory-bound codes often have parts that are compute bound and worth a few percent from the choice of compiler. We can still find performance variation in memory-bound codes, for example OpenFOAM (+9% for ACfL), HPCG (0% difference), and WRF (+5% for ACfL).
Even within a single application, different test cases may yield contradictory results. In LAMMPS, for example, different “features” can be exercised by different test cases, with ACfL performance ranging from 3% worse to 8.5% better across the test cases. GROMACS BenchMEM (3% better with GCC) models a very small system (82k atoms), whereas BenchPEP and BenchRIB (> 5% better with ACfL) model much larger systems (12M and 2M atoms respectively).
There are several other examples where a solid gain can be had from ACfL, in particular floating-point compute-dominated workloads such as molecular dynamics (GROMACS, LAMMPS), materials science (QMCPACK), and geophysics (SW4lite).
The Finance (Black-Scholes) and Genomics examples are worth a specific mention.
The Black-Scholes method in Computational Finance (see https://www.nobelprize.org/prizes/economic-sciences/1997/press-release/) is used to value European call/put options.
ACfL yields an almost 3x faster runtime than GCC 12.2.0. This can be attributed to the SIMD-math “extra” in ACfL, which is in progress but not yet in upstream open-source LLVM, and which is not yet implemented as of GCC 13 on Arm.
The main loop of Black-Scholes is amenable to vectorization. There is, in essence, a stream of options that can be valued independently in parallel across vector units (such as NEON or SVE units on Arm architecture).
However, within the loop body each iteration requires calculation of exp, exp2, sqrt and log, alongside easily vectorizable simple arithmetic operations.
SIMD math enables a vector of four doubles {d1,d2,d3,d4} (say) to use a vector-input version of exp(), so the math call itself is vectorized and the loop can be vectorized as a whole. You can read more about this on our blog: https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/using-vector-math-functions-on-arm.
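As a simplified sketch (not the benchmark code itself, and with illustrative function and parameter names), the kind of loop involved looks like the one below. Each iteration is independent, and once a vector math library is available the compiler can replace the scalar log, sqrt, exp and erf calls with vector equivalents and vectorize the loop end to end.

    #include <math.h>

    /* Standard normal CDF via erf(); a sketch, not the benchmark's exact kernel */
    static inline double norm_cdf(double x) {
        return 0.5 * (1.0 + erf(x / sqrt(2.0)));
    }

    /* Price n independent European call options: a candidate loop for vectorization */
    void black_scholes_calls(int n, const double *S, const double *K,
                             double r, double sigma, double T, double *call) {
        for (int i = 0; i < n; i++) {
            double d1 = (log(S[i] / K[i]) + (r + 0.5 * sigma * sigma) * T)
                        / (sigma * sqrt(T));
            double d2 = d1 - sigma * sqrt(T);
            /* log, sqrt, exp (and erf inside norm_cdf) are the calls that need
               vector math versions for the whole loop to vectorize */
            call[i] = S[i] * norm_cdf(d1) - K[i] * exp(-r * T) * norm_cdf(d2);
        }
    }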
SIMD-math also provides some benefit to the WRF workload – which, whilst largely memory bound, is 5% faster with ACfL.
Intrinsics can be thought of as being “close to the metal”, only one step above assembler, but there is still scope for the compiler to optimize performance through the placement of symbolic variables in registers and the movement of data between registers in regions of intrinsics. As of the time of writing, we see better performance from ACfL in this area in the handful of codes we have benchmarked.
The GATK and BWA-MEM2 examples from Genomics involve significant SIMD intrinsics. Their x86_64 intrinsics have been ported to Arm NEON intrinsics automatically using SIMDe or sse2neon (https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/porting-sse-to-neon-are-libraries-the-way-forward).
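For readers less familiar with intrinsics, the sketch below shows the flavor of NEON code that results on the Arm side (the function is illustrative, not taken from GATK or BWA-MEM2). Even with intrinsics fixing which vector operations are performed, the compiler still chooses register allocation and instruction scheduling around them, which is where the compiler-to-compiler differences we measured come from.

    #include <arm_neon.h>

    /* Illustrative NEON dot product: the intrinsics fix *what* is computed,
       but the compiler still allocates registers and schedules the instructions */
    float dot_neon(const float *a, const float *b, int n) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            /* fused multiply-add over four lanes at a time */
            acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
        }
        float sum = vaddvq_f32(acc);   /* horizontal add across the vector lanes */
        for (; i < n; i++)
            sum += a[i] * b[i];        /* scalar tail */
        return sum;
    }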
Within the GATK, the PHMM routine in particular saw a 4x speed-up from using ACfL, although its impact on overall runtime depended on the Java overheads and the chosen thread count.
The data demonstrates that Arm Compiler for Linux is an important compiler in the developer’s toolbox for Arm: an easy route to performance improvements for many codes, and a complement to the GCC compiler.
It is free to use and can be installed by downloading from https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux
Alternatively, if you are using Spack as a package manager, ACfL and ArmPL are available via Spack. Our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-and-arm-pl-now-available-in-spack) demonstrates how to install and use the compiler.
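As a rough illustration (package names and steps can change between Spack releases, so the linked blog remains the authoritative guide), the install typically boils down to something like:

    spack install acfl          # Arm Compiler for Linux (armclang/armflang plus ArmPL)
    spack load acfl
    spack compiler find         # register the newly installed compiler with Spack
    spack install armpl-gcc     # ArmPL packaged for use with GCC, if preferred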
I have struggled with finding a good performing Fortran compiler for ARM. GFortran delivers OK-ish performance for my use case while any Flang based/related compiler lags behind significantly (I have tested Classic Flang, armflang, nvfortran, as well as AMD Flang on x64). Flang doesn't even work on Apple Silicon. The new LLVM Flang compiler isn't usable - it doesn't support all features in my code base. The lack of a fully functional Fortran compiler for Windows ARM is a real showstopper.
Please include the MUMPS solver in your test/benchmark suite for Fortran compilers. See https://en.wikipedia.org/wiki/MUMPS_(software) The source code is available from any Linux distribution.
A guide on building HPC applications for Apple silicon would be very welcome. What compilers to use, with what flags? Xcode doesn't support OpenMP, the Homebrew version of GCC has performance issues. Apparently, the HPC for Mac OS X website offers much better performing versions of GCC for M1.