Choosing Compilers for HPC on Arm

David Lecomber
June 14, 2023
5 minute read time.

A question that we are often asked is, which compiler should I use for HPC on Arm? 

Arm’s engineers contribute to the open-source GNU and LLVM compilers, alongside other engineers in the Arm ecosystem and beyond. Both compilers are important and highly valued.

Arm takes LLVM further by producing the Arm Compiler for Linux (ACfL). ACfL is packaged and product-grade tested by the team at Arm and is based on the latest LLVM release – but with some “extras” that I will talk about in this blog. ACfL is freely available from Arm for 64-bit Linux servers (https://developer.arm.com/downloads/-/arm-compiler-for-linux). 

So, back to the original question – which compiler should I use?

In truth, there is no universal perfect compiler – and both compiler families are moving targets that are getting better performance with every release.

We show examples and averages below, but ultimately only benchmarking your own workload will identify the best choice for you. The choice of optimization flags is usually a significant factor in performance too (-Ofast where permissible, or -O3 where not), as is targeting the compiler at the platform you intend to use: https://community.arm.com/arm-community-blogs/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
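Before benchmarking, it can help to confirm that the flags actually took effect. The following is a minimal sketch in C, not part of the original benchmarking setup. The function name `describe_build` and the `BUILD_*` macros are made up for illustration. It relies on `__ARM_NEON` and `__ARM_FEATURE_SVE` from the Arm C Language Extensions, and on `__FAST_MATH__` and `__OPTIMIZE__`, which GCC and Clang-based compilers predefine when fast-math or optimization is enabled.

```c
#include <stdio.h>

/* Fold compiler-predefined macros into 0/1 constants so they can be printed.
   The BUILD_* names are hypothetical helpers for this sketch. */
#if defined(__aarch64__)
#define BUILD_AARCH64 1
#else
#define BUILD_AARCH64 0
#endif

#if defined(__ARM_NEON)
#define BUILD_NEON 1
#else
#define BUILD_NEON 0
#endif

#if defined(__ARM_FEATURE_SVE)
#define BUILD_SVE 1
#else
#define BUILD_SVE 0
#endif

#if defined(__FAST_MATH__)
#define BUILD_FAST_MATH 1
#else
#define BUILD_FAST_MATH 0
#endif

#if defined(__OPTIMIZE__)
#define BUILD_OPTIMIZED 1
#else
#define BUILD_OPTIMIZED 0
#endif

/* Writes a one-line summary of the build settings into buf.
   Returns the snprintf result: the number of characters the full
   summary needs, excluding the terminating NUL. */
int describe_build(char *buf, size_t len)
{
    return snprintf(buf, len,
                    "aarch64=%d neon=%d sve=%d fast_math=%d optimized=%d",
                    BUILD_AARCH64, BUILD_NEON, BUILD_SVE,
                    BUILD_FAST_MATH, BUILD_OPTIMIZED);
}
```

Building the same file once with -O3 and once with -Ofast, each with -mcpu=native, and printing the summary from each binary is a quick way to check what the compiler was really asked to generate.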

Observations from a handful of HPC benchmarks

For HPC users, real workloads occupying full machines define system performance. Single-core performance does not expose system performance and the true impact of, for example, memory bandwidth. Furthermore, small test cases or benchmarks may not reflect the effects of cache size that would be seen on more typical larger cases.

To explore the performance, (mostly) real workloads have been run across a range of HPC segments:

  • Industry Benchmark: the HPCG reference code
  • Genomics: BWA MEM2, GATK and Minimap2
  • Finance: Black-Scholes Option pricing, Binomial Options
  • Fluid Dynamics: OpenFOAM
  • Hydrodynamics: Lulesh
  • Molecular Dynamics: GROMACS and LAMMPS*
  • Materials: QMCPack*, QuantumEspresso*
  • Geosciences: SW4Lite
  • Weather: WRF

Our benchmarking configuration was:

  • Compilers: Arm Compiler for Linux 23.04 and GNU 12.2.0, each with the -mcpu=native flag to generate code tuned for the platform, and -O3, or -Ofast where permitted.
  • O/S and Platform: Amazon Linux 2023 on a c7gn.16xlarge AWS Graviton3E instance

Arm Performance Libraries (ArmPL), our high-performance math library, is linked into the codes marked with (*) in the list above, because these make significant use of BLAS calls or FFTs. In some cases this depends on the test cases selected. If you are not yet using ArmPL, you can read more about the latest release in our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-23-04).

Measured Performance

ACfL vs. GCC 12.2 performance

60% of the measured cases were faster with ACfL than with GCC. Of these, 35% were faster with ACfL by 5% over GCC.

Across the applications, the geometric mean (geomean) improvement from ACfL 23.04 over GCC 12.2.0 is 8.2%, but this is skewed by one outlier from Computational Finance. If that outlier is excluded, the geomean is 3.0% in favor of ACfL.

In domains such as Computational Fluid Dynamics (CFD), problems are usually memory-bandwidth dominated. Cores can only execute on data as quickly as they can be fed from memory. This reduces scope for massive compiler impact, but even memory-bound codes often have compute-bound parts where the choice of compiler is worth a few percent. We still find performance variation in memory-bound codes: for example, OpenFOAM (+9% for ACfL), WRF (+5% for ACfL), and HPCG (0% difference).

Codes with one binary but two winners

Even within a single application, two test cases can point to different winners. In LAMMPS, for example, different “features” can be exercised by different test cases, with ACfL performance ranging from 3% worse to 8.5% better across the test cases. GROMACS BenchMEM (3% better with GCC) models a very small molecule (82k atoms), whereas BenchPEP and BenchRIB (> 5% better with ACfL) model very large molecules (12M and 2M atoms respectively).

Codes with larger differences

There are several other examples where a solid gain can be had from ACfL, in particular floating-point compute-dominated workloads such as molecular dynamics (GROMACS, LAMMPS), materials (QMCPack), and geophysics (SW4Lite).

The Finance (Black-Scholes) and Genomics examples are worth a specific mention.

Black-Scholes and SIMD-math

The Black-Scholes method in Computational Finance (see https://www.nobelprize.org/prizes/economic-sciences/1997/press-release/) is used to value European call/put options.

ACfL yields an almost 3x faster runtime than GCC 12.2.0. This can be attributed to the SIMD-math “extra” in ACfL, which is in progress but not yet available in upstream open-source LLVM, and is not yet implemented as of GCC 13 on Arm.

The main loop of Black-Scholes is amenable to vectorization. There is, in essence, a stream of options that can be valued independently in parallel across vector units (such as NEON or SVE units on Arm architecture). 

However – within the loop body, each iteration requires calculation of exp, exp2, sqrt and log – alongside easily vectorizable simple arithmetical operations. 

SIMD math enables a vector of four doubles {d1,d2,d3,d4} (say) to use a vector-input version of exp() – thus being vectorized throughout and enabling more loop-vectorization as a whole. You can read more about this on our blog: https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/using-vector-math-functions-on-arm.

SIMD-math also provides some benefit to the WRF workload – which, whilst largely memory bound, is 5% faster with ACfL.

SIMD Intrinsics  

Intrinsics can be thought of as being “close to the metal”, only one step above assembler, but there is still scope for the compiler to improve performance through register allocation: where it places the symbolic variables and how it moves data between registers in regions of intrinsics. At the time of writing, we see better performance from ACfL in this area in the handful of codes we have benchmarked.
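As a small illustration of what is left to the compiler, consider this sketch of a dot product written with NEON intrinsics. It is not taken from any of the benchmarked codes, and the function name `dot` is made up. The intrinsics fix which instructions are used, but the variables `va`, `vb` and `acc` are still symbolic: the compiler decides which registers hold them and how to schedule the loads. The scalar fallback is only there so that the sketch also builds on non-Arm machines.

```c
#include <stddef.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Dot product of two float arrays of length n. */
float dot(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);      /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);    /* load 4 floats from b */
        acc = vfmaq_f32(acc, va, vb);         /* acc += va * vb per lane */
    }
    sum = vaddvq_f32(acc);                    /* add the four lanes together */
#endif
    for (; i < n; ++i)                        /* tail, or whole loop off Arm */
        sum += a[i] * b[i];
    return sum;
}
```

In longer regions of intrinsics, with many live vector variables, these register and scheduling decisions can make a measurable difference between compilers even though the source is identical.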

The GATK and BWA-MEM2 examples from Genomics involve significant SIMD intrinsics. Their x86_64 intrinsics have been ported to Arm NEON intrinsics automatically using SIMDe or sse2neon (https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/porting-sse-to-neon-are-libraries-the-way-forward).

Within GATK, the PHMM routine in particular saw a 4x speed-up with ACfL, although its impact on overall runtime depended on the Java overheads and the chosen thread count.

Summary

The data demonstrates that Arm Compiler for Linux is an important compiler in the developer’s toolbox for Arm: it offers an easy route to performance improvements for many codes and complements the GCC compiler.

It is free to use, and can be installed by downloading from https://developer.arm.com/Tools%20and%20Software/Arm%20Compiler%20for%20Linux

Alternatively, if you are using Spack as a package manager, ACfL and ArmPL are available via Spack. Our blog (https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compiler-for-linux-and-arm-pl-now-available-in-spack) demonstrates how to install and use the compiler.
