A common misconception about Arm's entry into the HPC and server worlds is that the software ecosystem is very immature. The truth is actually that most software out there 'just works' or can be made to do so with very little angst.
As with all things in the Arm ecosystem, the momentum for application porting isn't just the achievement of Arm, but a collaboration with all of our partners and users. This includes silicon partners, such as Cavium, Fujitsu and Qualcomm, systems providers such as Cray and HPE, end users such as national laboratories across the world, and software companies such as NAG and PGI. In addition, we all work with an independent organization called Linaro whose aim is to increase the adoption of Arm-based technology through open source software.
For HPC, the biggest area of interest for end users is the performance of their real scientific applications. This has two main components: does it run correctly; and does it run fast? Achieving a positive answer to both of those questions will be the main focus of this blog.
First let’s talk about our strategy to prioritize enabling end-users.
With HPC machines having been dominated by a single architecture for a generation, it is no surprise that many codes are no longer used to being run across multiple architectures. For the most part we have found that the big, widely deployed scientific codes are well engineered to accept new architectures into their configuration files. A few, however, fell through the cracks by assuming that if the architecture wasn't one they recognized, then they must be running on Windows!
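A minimal sketch of what a well-behaved architecture check looks like (the flag choices here are illustrative, not taken from any particular code's build system):

```shell
# Pick architecture-specific flags from uname rather than assuming that
# "not an architecture I recognize" means "must be Windows".
arch=$(uname -m)
case "$arch" in
  x86_64)  ARCH_CFLAGS="-march=native" ;;
  aarch64) ARCH_CFLAGS="-mcpu=native"  ;;  # on Arm, -mcpu sets both arch and tuning
  *)       ARCH_CFLAGS=""              ;;  # unknown: fall back to portable defaults
esac
echo "Building for $arch with ${ARCH_CFLAGS:-default flags}"
```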
Selecting the specific applications on which to focus our porting efforts has not been easy. Each supercomputer center or community has its own set of 'must do' applications, and hence we have adopted a strategy of 'going wide' to get as many applications compiling, working and passing tests on Arm as possible. As part of this strategy we have adopted the latest list of applications published by Intersect360 as representative of the most important codes run on HPC systems today. Whilst that list includes some codes which are not representative of real scientific HPC workloads (e.g. SAP), some commercial software (e.g. STAR-CCM+ and ANSYS Fluent) and even some closed-source HPC codes, many are fully open source. We have succeeded in getting all of the open-source codes, alongside some of the closed-source ones, working on Arm.
A selection of the popular open source codes we have already ported, with validated recipes published online, is shown in the accompanying image. Other closed source applications known to work include VASP, AMBER, CASTEP, and ONETEP. We have also published recipes for many common benchmarks including HPL, HPCG, STREAM, CloverLeaf, BookLeaf, TeaLeaf, HACC, Nalu, Nekbone, Kripke, PENNANT, RAJA, LULESH, SNAP, miniAMR, miniFE, and miniGhost.
For end users, there is little point in us having gone through the exercise of porting, just to say "it works". We have therefore adopted a two-pronged strategy of posting our build recipes publicly, and starting the process of upstreaming any changes needed to the communities that own the software. The upstreaming work takes time and effort, often involving getting the development communities access to Arm boxes in order to run their own testing, and preferably build Arm into their own development processes. Posting recipes in a public place is less onerous, and we have a shared 'Packages Wiki' which allows not only ourselves, but also our partners and early adopters, to contribute as well.
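To give a flavour, a Packages Wiki recipe is typically a short, reproducible shell sequence of this shape (the package name, URL and options below are purely illustrative stand-ins, not a real recipe):

```shell
# Hypothetical recipe skeleton for an autotools-based package on AArch64.
wget https://example.org/downloads/example-1.0.tar.gz
tar xf example-1.0.tar.gz
cd example-1.0

# Most modern configure scripts recognize aarch64 out of the box.
./configure --prefix=$HOME/arm/example CC=gcc FC=gfortran
make -j$(nproc)
make check     # run the package's own test suite before trusting the build
make install
```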
Second, let’s pause for a moment and examine how far we’ve got in this endeavour.
The porting work we have been doing has focused on both of the commonly available toolchains, namely GCC and the Arm Compiler. GCC has proved to be the easiest compiler to get supported in applications, mainly through its widespread use as a baseline compiler on other architectures. The Arm Compiler is LLVM-based, which has typically not had much general use in HPC before. For the most part these ports have been problem-free, although more work has been needed to get build configurations set up correctly.
One area that has seen significant improvement from porting HPC applications has been Fortran support. Unsurprisingly, Fortran support had never previously been needed while Arm was addressing the mobile and embedded markets. However, the move into HPC, with its large number of Fortran-based codes, has necessitated developing a commercial solution to complement the open source gfortran. Flang is based on the NNSA-funded open sourcing of the PGI compiler's Fortran front end for use with LLVM. Our applications work has exercised very large Fortran codes developed across many revisions of the Fortran standards, and it is unsurprising that this has explored many corner cases that have thrown up Flang issues. We continue to fix these in our commercial compiler, and simultaneously work with the upstream community to ensure that all Flang users, regardless of architecture, get these improvements.
In addition to enabling the Arm Compiler, the recipes we are posting on the Packages Wiki and on arm.com/hpc also include the necessary changes to support the use of Arm Performance Libraries. These libraries are the vendor math libraries providing full BLAS, LAPACK and FFT functionality. They are given for both Arm Compiler and GCC toolchain variants to allow users choice of compilers, and still enable them to get the best performing implementations. Recent performance work is covered in our Arm Allinea Studio 18.3 release blog.
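For example, the link lines in our recipes take roughly the following shape (the exact flags and library names depend on the installed version, so treat this as an illustrative sketch rather than a definitive invocation):

```shell
# With the Arm Compiler, -armpl selects the Arm Performance Libraries build
# matching the compiler and its flags:
armclang -O3 -armpl app.c -o app

# With GCC, the GCC-built variant of the libraries is linked explicitly
# (library name shown for the LP64 interface; paths are assumed set up):
gcc -O3 app.c -larmpl_lp64 -lm -o app
```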
Now let’s take a moment to share a few hints and tips that are useful for any porting effort.
Anyone who’s ported an application between operating systems has done far more work than will ever be required to port the same application to Arm. The basic system environment is the same on Arm as on other architectures, i.e. Linux and all its friends are already there. The typical HPC dependencies, such as HDF5, NetCDF, MPI and many math libraries, have also already been ported to Arm. Just follow the recipes on the Packages Wiki. You won’t have to track down equivalent libraries for Arm; the libraries available on Arm are exactly the same ones. Your typical HPC application probably also has the additional benefit of being monolithic with relatively few dependencies, so you likely won’t wind up in a maze of twisty dependencies.
Most of the porting work involves tweaking the application’s build system to use Arm compilers and/or system libraries optimized for Arm. If your application builds with GNU compilers then start with those before moving to the Arm compilers so you can easily find any missing dependencies or quirks in the build system. Once you’re building with gcc and/or gfortran, update the build system to incorporate Arm’s armclang and armflang compilers. The Arm Fortran compiler has a legacy in PGI, so if your application has been well tested with PGI compilers you’re already ahead of the game.
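The two-stage approach described above might look like this for an autotools-based code (armclang, armclang++ and armflang are the real Arm Compiler drivers; everything else here is illustrative):

```shell
# Stage 1: establish a known-good baseline with the GNU toolchain,
# flushing out any missing dependencies or build-system quirks.
./configure CC=gcc CXX=g++ FC=gfortran
make -j$(nproc) && make check

# Stage 2: rebuild with the Arm Compiler once the build system behaves.
make distclean
./configure CC=armclang CXX=armclang++ FC=armflang
make -j$(nproc) && make check
```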
Watch out for non-standard compiler extensions and language features that are specific to vendor compilers. The Arm compiler team is actively implementing many of these non-standard extensions to ease porting efforts, but it’s best to adhere to the language standard whenever possible. Vendor-specific vector intrinsics will need to be updated to equivalent NEON intrinsics. The open source sse2neon tool is one starting point.
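As a sketch of the mechanical part of that work: sse2neon is a single header that maps SSE intrinsics onto NEON equivalents, so the first step can be as simple as swapping an include (the source file below is a stand-in for a real kernel, not an actual port):

```shell
# Create a stand-in source file that uses an SSE intrinsics header.
cat > kernel.c <<'EOF'
#include <xmmintrin.h>
/* ... SSE intrinsics kernel ... */
EOF

# Point it at sse2neon instead; on AArch64 the SSE calls then compile to NEON.
sed -i 's|#include <xmmintrin.h>|#include "sse2neon.h"|' kernel.c
grep sse2neon kernel.c
```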
Compiling successfully on Arm is just the start. Now we need to verify that we are getting the right results. Regression tests, where available, are a porter’s best friend, but be aware that many automated test suites make fundamental assumptions about the answers that are not necessarily true across architectures. You may need an expert to check your results and verify that they’re correct.
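One lightweight way to relax such assumptions is to compare results against a reference within a tolerance rather than byte-for-byte, since bitwise-identical output across architectures is not a realistic expectation. A minimal sketch, with made-up data standing in for real output files:

```shell
# Create a reference file and a "new" result with a tiny cross-architecture
# difference (values are illustrative).
printf '1.0\n2.0\n3.0\n' > reference.dat
printf '1.0\n2.0000000001\n3.0\n' > new.dat

# Compare line by line within an absolute tolerance instead of exactly.
awk 'NR==FNR { ref[FNR]=$1; next }
     { d = $1 - ref[FNR]; if (d < 0) d = -d
       if (d > 1e-6) { print "FAIL at line " FNR; bad = 1 } }
     END { exit bad }' reference.dat new.dat && echo "results match"
```

Real codes usually need per-quantity tolerances chosen by a domain expert, but the shape of the check is the same.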
Once you’re satisfied with your compilation process and test suite, it’s time to optimize for performance. Start by adding more compiler optimization flags. With the GCC and Arm compilers, start with the -O3 or -Ofast compiler flags and verify the accuracy of the results. Additionally, if using the Arm Fortran compiler (armflang), add the -fstack-arrays flag for further improved performance. Note that this flag will become part of -Ofast in a future release of the Arm compilers. And remember to set your stack size to “unlimited” (e.g. `ulimit -s unlimited`). If your code crashes with -fstack-arrays, use `-Ofast -fno-stack-arrays` to force automatic arrays to be heap allocated.
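Put together, the progression above can be sketched as follows (the source file name is illustrative, and the commands assume armflang is installed and on your path):

```shell
# Illustrative Fortran build with progressively more aggressive flags.
# Verify numerical results after each step before going further.
armflang -O3 solver.f90 -o solver
armflang -Ofast -fstack-arrays solver.f90 -o solver  # automatic arrays on the stack

ulimit -s unlimited   # -fstack-arrays needs a large stack
./solver

# If the code crashes with stack-allocated arrays, force heap allocation:
armflang -Ofast -fno-stack-arrays solver.f90 -o solver
```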
One common performance area to be aware of is the standard math libraries which are bundled with the operating system. These are designed for stability and maximum compatibility on a range of systems, rather than tuned for performance on particular systems. Transcendental functions like exp() and log() can be orders of magnitude faster when linking against math libraries optimized for Arm. Arm are working with the distros to get our upstream enhancements to libc and libm available to all users as soon as possible. In the meantime we are providing an extra library inside Arm Performance Libraries, called libamath, to get more consistent, high performing implementations into users’ hands today.
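Picking up libamath is a matter of link order; a sketch of the link line (assuming Arm Performance Libraries are installed and the source file name is illustrative):

```shell
# libamath must appear before libm so that its optimized exp(), log(), pow()
# and friends are resolved in preference to the system versions.
armclang -O3 compute.c -lamath -lm -o compute
```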
So you want your application to run at full speed on Arm? Understanding how fast that could be takes only a little extra work.
At Arm we are not in the business of pitting one partner against another, or even comparing which applications work best on which architecture. Choice is an important part of our ecosystem, with different hardware implementations that will perform better on different applications. For example, applications that benefit from significant memory bandwidth will work well on some partners' implementations, whereas compute-bound highly vectorizable codes will look best on a future SVE system.
With that in mind, we will simply highlight results published by others who are in a position to provide more unbiased comparisons.
For example, the GW4 consortium in the UK has purchased a significant Cavium ThunderX2 system from Cray, which they are using to compare real application performance against other architectures. Their benchmark runs to date have shown that the ThunderX2 nodes are very competitive with the other systems they tested, performing even better in many cases, especially where codes are memory bound.
Now that we have reached the point where a wide range of HPC software packages have been shown to work on Arm systems, and the performance results on single nodes are good, the focus going forward will move towards parallel performance. For this, the optimization of many codes will be best handled by the individual development teams. As part of this, though, we will be helping evaluate the performance we are seeing ourselves, especially using the parallel profiling capabilities of Arm Forge.
So, what are you waiting for? Let’s get to it! Come and find out more at ISC 2018 so we can collaborate further on this crucial process of rolling high-performance applications onto Arm platforms.