Choosing Compilers for HPC on Arm

June 14, 2023

A question we are often asked is: which compiler should I use for HPC on Arm?

Arm’s engineers contribute to the open-source GNU and LLVM compilers, alongside other engineers in the Arm ecosystem and beyond. Both compilers are important and highly valued.

Arm takes LLVM further by producing the Arm Compiler for Linux (ACfL). ACfL is packaged and product-grade tested by the team at Arm and is based on the latest LLVM release – but with some “extras” that I will talk about in this blog. ACfL is freely available from Arm for 64-bit Linux servers (https://developer.arm.com/downloads/-/arm-compiler-for-linux).

So, back to the original question – which compiler should I use?

In truth, there is no universal perfect compiler – and both compiler families are moving targets that are getting better performance with every release.

We show examples and demonstrate averages, but ultimately only benchmarking your workload will identify the best choice for you. Choice of optimization flags is usually a significant factor in performance too (-Ofast where permissible, or -O3 where not) along with targeting compilers at the platform you intend to use: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
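
As a rough illustration of what targeting the platform changes, the C sketch below reports which SIMD extension the compiler was asked to generate code for. It relies on the __ARM_NEON and __ARM_FEATURE_SVE macros from the Arm C Language Extensions (ACLE) and on the __FAST_MATH__ macro that GCC and Clang-based compilers define under fast-math modes such as -Ofast. The function names simd_target and print_build_info are hypothetical, chosen only for this example.

```c
#include <stdio.h>

/* Label for the widest Arm SIMD extension enabled at compile time.
   These ACLE macros reflect the -mcpu / -march options given to the
   compiler, not the CPU the binary later runs on. */
const char *simd_target(void)
{
#if defined(__ARM_FEATURE_SVE)
    return "SVE";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "none (not built for an Arm SIMD target)";
#endif
}

/* Hypothetical helper: prints the build settings visible to the code. */
void print_build_info(void)
{
    printf("SIMD target: %s\n", simd_target());
#if defined(__FAST_MATH__)
    puts("fast-math: enabled (for example via -Ofast)");
#else
    puts("fast-math: disabled");
#endif
}
```

Calling print_build_info() from a small main() and building the file once with -O3 alone and once with -O3 -mcpu=native (with gcc, or with armclang assuming ACfL is installed) is a quick way to check what each set of flags enables on a given machine.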

Observations from a handful of HPC benchmarks

For HPC users, real workloads occupying full machines define system performance. Single-core performance does not expose system performance and the true impact of, for example, memory bandwidth. Furthermore, small test cases or benchmarks may not reflect the effects of cache size that would be seen on more typical larger cases.

To explore the performance, (mostly) real workloads have been run across a range of HPC segments:

Industry Benchmark: the HPCG reference code
Genomics: BWA MEM2, GATK and Minimap2
Finance: Black-Scholes Option pricing, Binomial Options
Fluid Dynamics: OpenFOAM
Hydrodynamics: Lulesh
Molecular Dynamics: GROMACS and LAMMPS*
Materials: QMCPack*, QuantumEspresso*
Geosciences: SW4Lite
Weather: WRF

Our benchmarking configuration was:

Compilers: Arm Compiler for Linux 23.04 and GNU 12.2.0, each with the -mcpu=native flag to generate code tuned for the platform, and with -Ofast where permitted or -O3 otherwise.
O/S and Platform: Amazon Linux 2023 on a c7gn.16large AWS Graviton3E

Arm Performance Libraries (ArmPL), our high-performance math library, is linked into the codes identified by (*) in the list above, as these make significant use of BLAS calls or FFTs; in some cases this depends on the test cases selected. If you are not yet using ArmPL, you can read more about the latest release in our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-23-04).

Measured Performance

ACfL vs. GCC 12.2 performance

60% of the measured cases were faster with ACfL than with GCC. Of these, 35% were faster with ACfL by at least 5%.

Across the applications, there is a geomean improvement of 8.2% from ACfL 23.04 over GCC 12.2.0, but this is skewed by one outlier from Computational Finance. If that outlier is excluded, the geomean improvement is 3.0% in favor of ACfL.

In domains such as Computational Fluid Dynamics (CFD), problems are usually memory-bandwidth dominated. Cores can only execute on data as quickly as they can be fed from memory. This reduces scope for massive compiler impact, but even memory-bound codes often have parts that are compute bound and worth a few percent from choice of compiler. We can still find performance variation in memory-bound codes: for example, OpenFOAM (+9% for ACfL), WRF (+5% for ACfL) and HPCG (0% difference).

Codes with one binary but two winners

Even within a single application, two test cases may point to different winners. In LAMMPS, for example, different “features” can be exercised by different test cases – with ACfL performance ranging from 3% worse to 8.5% better across the test cases. GROMACS BenchMEM (3% better with GCC) models a very small molecule (82k atoms), whereas BenchPEP and BenchRIB (> 5% better with ACfL) model very large molecules (12M and 2M atoms respectively).

Codes with larger differences

There are several other examples where a solid gain can be had from ACfL, in particular floating-point compute-dominated workloads such as molecular dynamics (GROMACS, LAMMPS), materials (QMCPack) and geosciences (SW4Lite).

The Finance (Black-Scholes) and Genomics examples are worth a specific mention.

Black-Scholes and SIMD-math

The Black-Scholes method in Computational Finance (see https://www.nobelprize.org/prizes/economic-sciences/1997/press-release/) is used to value European call/put options.

ACfL yields an almost 3x faster runtime than GCC 12.2.0. This can be attributed to the SIMD-math “extra” in ACfL, which is in progress but not yet available in upstream open-source LLVM, and is not implemented on Arm as of GCC 13.

The main loop of Black-Scholes is amenable to vectorization. There is, in essence, a stream of options that can be valued independently in parallel across vector units (such as NEON or SVE units on Arm architecture).

However – within the loop body, each iteration requires calculation of exp, exp2, sqrt and log – alongside easily vectorizable simple arithmetical operations.

SIMD math enables a vector of four doubles {d1,d2,d3,d4} (say) to use a vector-input version of exp() – thus being vectorized throughout and enabling more loop-vectorization as a whole. You can read more about this on our blog: https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/using-vector-math-functions-on-arm.

SIMD-math also provides some benefit to the WRF workload – which, whilst largely memory bound, is 5% faster with ACfL.

SIMD Intrinsics

Intrinsics can be thought of as being “close to the metal” – only one step above assembler – but there is still scope for the compiler to improve performance through where it places symbolic variables and how it moves data between registers in regions of intrinsics. At the time of writing, we see better performance from ACfL in this area in the handful of codes we have benchmarked.

The GATK and BWA-MEM2 examples from Genomics involve significant SIMD intrinsics. Their x86_64 intrinsics have been ported to Arm NEON intrinsics automatically using SIMDe or sse2neon (https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/porting-sse-to-neon-are-libraries-the-way-forward).
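
For readers unfamiliar with intrinsics, the sketch below shows the kind of code involved, using a deliberately trivial example rather than anything taken from GATK or BWA-MEM2. The function name add_i32 is hypothetical. On Arm it uses NEON intrinsics from arm_neon.h; the comments name the SSE2 intrinsics that a translation layer such as sse2neon would map onto them. On other targets the code falls back to a plain scalar loop so that it still compiles and runs.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Element-wise sum of two int32 arrays; inputs are assumed not to
   overflow when added. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Four 32-bit lanes per 128-bit NEON register. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);        /* SSE2: _mm_loadu_si128  */
        int32x4_t vb = vld1q_s32(b + i);
        int32x4_t vs = vaddq_s32(va, vb);       /* SSE2: _mm_add_epi32    */
        vst1q_s32(out + i, vs);                 /* SSE2: _mm_storeu_si128 */
    }
#endif
    /* Scalar tail, and the whole loop on non-NEON targets. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Even in a loop this small, the compiler still decides register allocation, instruction scheduling and unrolling around the intrinsics, which is where compilers can differ.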

Within the GATK, the PHMM routine in particular saw a 4x speed-up by using ACfL, although its impact on overall runtime depended on the Java overheads and the chosen thread count.

Summary

The data demonstrates that Arm Compiler for Linux is an important compiler in the developer’s toolbox for Arm: it offers an easy route to performance improvements for many codes and complements the GCC compiler.

It is free to use, and can be installed by downloading from https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux

Alternatively, if you are using Spack as a package manager, ACfL and ArmPL are available via Spack. Our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-and-arm-pl-now-available-in-spack) demonstrates how to install and use the compiler.
