In my previous blog, Running HPC Applications on Arm, I talked about the work that has been going on to get a wide range of HPC applications running on today’s Arm systems. Today, I want to focus on some of the key optimizations you can easily make when running your codes to get the best performance on existing Arm partner platforms.
Your compiler choice does make a difference. Regardless of your choice of system, you will always be able to get both GCC and Arm Compiler. Certain vendors will also have their own compiler suite, for instance the Cray Compilation Environment.
Since Arm Compiler is based on LLVM, it uses entirely different processes to generate optimized code than GCC does. Arm actually has internal teams working on both toolchains, so you shouldn’t assume one will always be better than the other. In fact, we have healthy competition internally between the two compiler teams in trying to increase the performance of different applications across Arm platforms.
The advantage of the commercial Arm Allinea Studio, of which Arm Compiler is a part, is that it is fully supported. We also ship a build of GCC through this product to ensure users are always able to get the best performing GCC version, and to give subscribers a route for having any issues in the GCC build fixed, in addition to the usual support for the rest of the Arm licensed tools.
No matter what systems you have run HPC codes on, you’ll know that changing a compiler flag does make a difference to your application performance. The default set of flags that goes into “-O3”, for instance, has been chosen to give high performance across a wide variety of cases. It may not, however, precisely match your application, and we’d recommend investigating the man pages to see which options you can experiment with.
Common options for Arm Compiler that are worth exploring are:
-ffp-contract=fast
Controls when the compiler is permitted to form fused floating-point operations (such as FMAs). Note that this contraction is not always enabled by default.
-Ofast
Similar to -O3, but this optimization level also performs additional aggressive optimizations that might violate strict compliance with language standards.
-ffast-math
Allows aggressive, lossy floating-point optimizations.
-mcpu={thunderx2t99, cortex-a72}
Setting your target microarchitecture can add extra performance to your software, since an appropriate ‘cost model’ will then be used by the compiler’s instruction scheduler to produce better code for your system. Choosing ‘native’ selects the same microarchitecture as the machine on which you are compiling, whilst ‘generic’ builds for a reference Armv8.0 implementation. An example compile line combining these options follows below.
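To make this concrete, here is a minimal sketch of how these flags might be combined on a simple kernel. The file name, function, and exact compile line are illustrative assumptions rather than something taken from a particular build recipe.

/*
 * saxpy.c - a small kernel with an obvious fused multiply-add opportunity.
 * An illustrative compile line (adjust -mcpu to your own hardware):
 *
 *   armclang -Ofast -ffp-contract=fast -mcpu=native -c saxpy.c
 *
 * With -ffp-contract=fast the compiler is free to turn a[i] * x + y[i]
 * into a single FMA instruction, and -mcpu=native tunes instruction
 * scheduling for the machine you are compiling on.
 */
void saxpy(int n, float x, const float *a, const float *y, float *out)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * x + y[i];   /* candidate for FMA contraction */
}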
You can find lots of documents on our Arm Compiler microsite, including common Fortran options and guides for developers familiar with other compiler toolchains (GCC, ICC, PGI).
Most HPC software is not a monolithic git checkout, but instead requires linking against a host of library dependencies.
Arm and its HPC community partners have been populating an HPC Wiki in order to capture the latest status and build information on many of the most prevalent HPC applications, libraries, and other common dependency packages. Check it out and feel free to contribute back anything that we’re missing.
In Arm we have been looking at the performance of many key ‘math.h’ functions. Whilst we are primarily interested in getting the highest performing implementations for all users by default, there are significant delays between a patch being accepted by the glibc community and that version being made available by Linux distributions. To help get performance to our customers as soon as possible we have provided a separate library, called libamath, as part of the Arm Performance Libraries. This includes precompiled versions of our highest performing implementations of certain key functions. For the 19.0 release libamath includes versions of exp, pow and log in both single and double precision and sin, cos, sincos and tan in single precision.
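As a rough sketch of how a code can pick up libamath without any source changes, the short example below calls several of the covered math.h routines. The link line is an assumption based on the Arm Performance Libraries documentation (linking -lamath ahead of -lm so the optimized implementations are found first); do check it against your installed release.

/*
 * libamath_demo.c - calls math.h routines that libamath provides optimized
 * implementations of (exp, pow and log shown here).
 *
 * Assumed compile and link line (verify the -lamath flag and its placement
 * before -lm against the Arm Performance Libraries documentation):
 *
 *   armclang -Ofast libamath_demo.c -lamath -lm -o libamath_demo
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 1; i <= 1000000; ++i)
        sum += exp(1.0 / i) + pow((double)i, 0.5) + log((double)i);

    /* Print a checksum so the loop is not optimized away. */
    printf("checksum: %f\n", sum);
    return 0;
}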
In future releases we aim to increase the range of functions covered, and to keep enhancing the performance of those functions already provided. To aid us in this endeavour, we are inviting input from users about which functions they care most about, and where they see performance deficits. We are also happy to work with members of the community to ensure the availability of higher performing versions of these or other common functions, both in our products and in the wider market.
There are lots of HPC software applications out there that users are interested in. Mini-apps are great for exercising particular aspects of compilers, for example, but they tend to be very self-contained, with few optimization opportunities outside of the compiler. To demonstrate the effect of the aspects described above, let me focus on one example: WRF, the Weather Research and Forecasting Model. This is a very large and widely used meteorological prediction application, used to run massive weather simulations.
Getting WRF up and running on Arm requires only a small number of modifications to the vanilla code, all of which are detailed in our build recipe. These instructions also list all the package dependencies we’ve tested our builds against.
For WRF, there is a small but significant advantage from using Arm Compiler, which is almost 10% faster in the CONUS 12km test case described in our instructions. With the additional use of libamath, there is a further saving of 30% of the original GCC time.
Obviously, not all codes make significant use of the routines for which we have already provided more optimized versions. In the accompanying chart we highlight these differences for WRF, along with Cloverleaf and Branson. Cloverleaf (input test case clover_bm128.in) clearly does not benefit noticeably from the optimized library, as it makes only a relatively small number of calls to “pow()”. Branson, a transport mini-app (input test case proxy_small.xml, with grey IMC), does however show that almost 20% of the total execution time can be saved by using libamath.
Overall, these example applications are just a small snapshot of the performance optimization work being undertaken by Arm and our partners. The ‘hot off the press’ news from SC18 shows some interesting results from production-scale Arm HPC deployments, from both Astra at Sandia and Isambard for the GW4 Alliance in the UK. Alongside this active community, Arm is committed to continually increasing application performance through our tools as we move forward in support of our partners’ HPC deployments.
Note: Arm has just launched Arm Allinea Studio 19.0 with major enhancements to the compiler and libraries.