Update from Genci: A journey of optimizing applications on Arm platforms

June 24, 2020

Co-authored with Christelle Piechurski (GENCI), Laurent Nguyen (CEA) and Cyril Mazauric (Atos).

About GENCI

GENCI (Grand Equipement National du Calcul Intensif) is the French national High-Performance Computing (HPC) agency, in charge of promoting and acquiring HPC and Artificial Intelligence (AI) capabilities as well as massive data storage resources. These resources are dedicated to open research by French and European academic and industrial organizations, to foster collaboration and address large-scale, complex scientific challenges in domains such as astrophysics, meteorology, molecular chemistry, energy, fusion, and health. GENCI funds three national supercomputers, one of which, Joliot-Curie (TGCC/CEA), is a Tier-0 machine. Each supercomputer is hosted and operated by one of GENCI's partners: Joliot-Curie (TGCC/CEA), Jean Zay (IDRIS/CNRS) and Occigen (CINES/CPU).

GENCI has five partners: the Ministry of Higher Education, Research and Innovation (MESRI - Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation), the French Nuclear Agency (CEA - Commissariat à l'Énergie Atomique), the French National Centre for Scientific Research (CNRS - Centre National de la Recherche Scientifique), the universities (CPU - Conférence des Présidents d'Université) and finally INRIA. GENCI's role in Europe is two-fold. First, as one of the 28 PRACE members, GENCI supports the development of the European HPC ecosystem. Second, it supports the independence and sovereignty of the European Union by facilitating the adoption of European technologies (for example, the European Processor Initiative) in the European Exascale systems planned for 2023-2024.

GENCI's Technology Watch Unit

The Technology Watch Unit sponsors and supports the French scientific community in learning new technologies and their software ecosystems. It helps scientists conduct software performance engineering on their applications so that they run efficiently at extreme scale and on new technologies once these reach production. This is done in several steps: learning about the technology itself; assessing the maturity of the hardware, middleware, and software ecosystem through code portability; and comparing how applications scale against existing technologies. The unit is also the right place for technology suppliers and chip makers to work cooperatively on improving a technology before it becomes widespread in the market. Together, they help determine the most suitable HW/SW implementations to adopt in a design, identify which scientific communities benefit most from the micro-architecture, and support efforts to encourage adoption by new communities. To that end, GENCI acquires prototype platforms and gives early access to scientific experts from CEA, CNRS, CPU, and INRIA, depending on the scientific domains the institutes want to focus on. When the technology and ecosystem are mature enough, the prototype platforms are opened to every user through preparatory access on www.edari.fr (DARI stands for "Demande d'Attribution de Ressources Informatiques", a request for access to computing resources).

Arm-based Partition

2019 was a key year in the pursuit of the work on Arm technology at GENCI, over ten years after its initial involvement in the Mont Blanc project. That year saw the deployment at TGCC/CEA of a new prototype called INTI, based on BullSequana X1000 with Marvell ThunderX2 technology, and its opening to the entire scientific community in Q4 2019 (www.edari.fr), with the ambition of supporting the French scientific community on the road to Exascale. The INTI platform has enabled three hackathons, allowing application owners to understand the effort required to port their codes to Arm technology and to get the most out of future generations of Arm processors, including the in-development European processor candidate for Exascale.

Figure 1: BullSequana X1000

INTI is equipped with 30 dual-socket Marvell ThunderX2 nodes, each with 64 custom Armv8 cores running at 2.2 GHz and 256 GB of DDR4 memory over 8 channels, interconnected through a 4X EDR Mellanox InfiniBand network. The software ecosystem deployed on this platform relies on the OCEAN (CEA) software stack, based on CentOS 7.5. OpenMPI was chosen as the default MPI library, and a large amount of build and testing work went into improving its performance for users. Figure 2 shows the resulting improvement for the MPI broadcast collective on up to 1,024 processes.

Figure 2: Improvement of MPI collective communications.

Leading Applications

To carry out this analysis and build an understanding of Arm technology, we chose to work with production scientific applications used by the European community to advance scientific research. The developers of these applications know how their codes behave on different technologies and constantly adapt them to take advantage of all the hardware resources and the associated software ecosystem at their disposal. The purpose of these workshops is to help them better understand the Arm architecture and how their codes can benefit from it.

These applications cover most scientific fields (CFD/combustion, oceanography, Earth simulation, chemistry, and so on) and use a wide range of languages and paradigms (C, C++, Fortran, MPI, OpenMP, I/O, and so on). Some applications are compute bound, while others are limited by memory bandwidth, inter-process communication, or system-level I/O.

AVBP represents one of the most advanced CFD tools worldwide for the numerical simulation of unsteady turbulence for reacting flows. AVBP is widely used both for basic research and applied research of industrial interest.
Yales2 aims to solve two-phase combustion problems, from primary atomization to pollutant prediction, on massive complex meshes. It efficiently handles unstructured meshes with several billion elements, thus enabling the Direct Numerical Simulation of laboratory and semi-industrial configurations.
SpecFEM3D simulates acoustic (fluid), elastic (solid), coupled acoustic/elastic, poroelastic, or seismic wave propagation in any type of conforming mesh of hexahedra (structured or not). It can, for instance, model seismic waves propagating in sedimentary basins or any other regional geological model following earthquakes.
NEMO, standing for Nucleus for European Modeling of the Ocean, is a state-of-the-art modeling framework for research activities and forecasting services in ocean and climate sciences, developed in a sustainable way by a European consortium (CNRS, Met Office, NERC, CMCC, Mercator Ocean).
HYDRO is a simplified version of the RAMSES code and solves the compressible Euler equations of hydrodynamics. RAMSES was developed by CEA Saclay as a leading application to study large-scale structure and galaxy formation.
PATMOS is a mini-app developed by CEA for Monte Carlo neutron transport. The code is complex enough to be representative of a real simulation code and at the same time conceived to be easy to change and adapt.
BigDFT implements density functional theory (DFT) by solving the Kohn–Sham equations describing the electrons in a material. The code is developed by CEA and has been used in production for eight years, mainly in the domain of structure prediction calculations.
DYNAMICO is a new dynamical core developed at IPSL (French Institute Pierre-Simon Laplace). It solves the Navier-Stokes equations on a rotating sphere (Coriolis) with shallow approximation (primitive equations). It is the compute-intensive part (the dynamical core) of a global atmospheric circulation model used to model climate for various scientific studies with strong societal implications (for example, global warming).

A Journey with Arm Technology for HPC

Application Porting - First Hackathon (September 2018)

Just a few weeks after Atos delivered four dual-socket Marvell ThunderX2 compute nodes, a two-day hackathon took place at la Maison de la Simulation in Saclay (France). The main objective was to evaluate the effort required to port a set of representative applications already running on the x86_64 architecture. Most of these codes were successfully ported within one day and jobs were executed on the Arm nodes, though no optimization was done at that point. Comparison numbers against GENCI's leading HPC system (the CEA-TGCC Joliot-Curie/Irene multi-petascale system) were generated. These experiments were key to assessing the porting effort on Arm and to preparing the next steps, which target application optimization and scalability. This was also an excellent opportunity for the Technology Watch Group to engage with the wider Arm community during the Arm HPC User Group at SC'18. Results from this hackathon are summarized in Figure 3. It shows the application speedup between a run on the top-bin Intel x86 64-bit SKU 8168 (24c/2.7 GHz) and a run on the Marvell ThunderX2 (32c/2.2 GHz), with the Intel results as the baseline.

Figure 3: Sept 2018 - Comparative performances on a representative set of applications.
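
As a rough illustration of how such a comparison can be computed, the sketch below assumes that speedup is the ratio of wall-clock times, with the Intel run as the baseline, so that values above 1 mean the ThunderX2 run was faster. The function name, application names, and timings are hypothetical placeholders, not results from the hackathon.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of the candidate run over the baseline run.

    A value above 1 means the candidate was faster than the baseline.
    """
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical wall-clock times in seconds (placeholders, not measured data).
runtimes = {
    "app_a": {"intel": 100.0, "thunderx2": 125.0},
    "app_b": {"intel": 80.0, "thunderx2": 64.0},
}

# Intel is the baseline, ThunderX2 is the candidate.
speedups = {
    name: speedup(t["intel"], t["thunderx2"]) for name, t in runtimes.items()
}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x relative to the Intel baseline")
```

With these placeholder timings, app_a comes out below 1 (slower on ThunderX2) and app_b above 1 (faster), which is how bars on either side of the baseline in such a graph would be read.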

Application Scaling - Second Hackathon (June 2019)

To further improve the overall performance of these applications, a second event brought all the application owners together at the Teratec Campus in Bruyères-le-Châtel (France). Scaling studies were the main topic of this second hackathon, held in June 2019. Experiments with OpenMPI v2.0.4 underlined that fine-tuning the MPI software stack is of major importance for scaling a set of complex applications efficiently. These first results were presented at an application-oriented workshop during the Arm Research Summit (Austin, Sept 2019). As of early 2020, the numbers have improved significantly with the update of the MPI software stack.

Figure 4: June 2019 - First study of the scaling of the various applications up to 1,024 computing cores (OpenMPI 2.0.4).

Figure 5: Feb 2020 - Scaling improvement after fine-tuning of the MPI software stack (OpenMPI 4.0.2).
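
A scaling study of this kind is commonly summarized by speedup and parallel efficiency relative to the smallest run. The sketch below assumes a strong-scaling setup (fixed problem size, increasing core count). The function name and the timings are hypothetical and only illustrate the arithmetic; they are not the hackathon measurements.

```python
from typing import Dict, List, Tuple


def scaling_table(timings: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (cores, speedup, efficiency) rows for a strong-scaling study.

    Speedup is measured against the smallest core count in `timings`;
    efficiency is that speedup divided by the ideal (linear) speedup.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        measured = base_time / timings[cores]
        ideal = cores / base_cores
        rows.append((cores, measured, measured / ideal))
    return rows


# Hypothetical wall-clock times in seconds per core count (placeholders).
timings = {64: 800.0, 128: 420.0, 256: 230.0, 512: 140.0, 1024: 100.0}

rows = scaling_table(timings)
for cores, measured, efficiency in rows:
    print(f"{cores:5d} cores: speedup {measured:5.2f}, efficiency {efficiency:6.1%}")
```

Comparing such tables before and after a change to the MPI stack is one simple way to quantify a scaling improvement; efficiency that falls quickly with core count usually points at communication overhead.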

Looking Forward - Third Hackathon (February 2020)

The third Arm hackathon, conducted in late February 2020, focused on the SVE characterization of full-fledged representative applications. To capture low-level information such as memory traces or the breakdown of instructions, we relied on the ability of the Arm Instruction Emulator (ArmIE) to target regions of interest (ROI), which lets us focus on key loops and limit the amount of runtime data collected. To indicate an ROI, you simply add the ROI markers (start and stop macros) to the source code (more details can be found in the ArmIE documentation). In our case, we recorded dynamic instruction execution traces at various SVE vector lengths. Figure 6 shows ratios of SVE instruction counts with respect to each vector length. This metric expresses the capacity to benefit from larger vector sizes at the application level. As expected, optimized codes from the climate community (for example, NEMO or DYNAMICO) are in the best position to benefit from larger SIMD units. Applications such as AVBP require more work to take advantage of large vector units, since the current version of the AVBP code is more sensitive to core frequency or core count than to vector length. Additional metrics, such as SIMD vector lane utilization and memory operation breakdown, are also available and will be helpful input to future co-design activities.

Figure 6: SVE analysis of the impact of the vector length on a set of representative applications.
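
One possible way to derive such a metric from per-vector-length instruction counts is sketched below. It assumes the ratio is taken against the count at the shortest vector length (128 bits), which is our reading and may differ from the exact normalization used for the figure. The function name and the counts are hypothetical placeholders, not ArmIE output.

```python
from typing import Dict


def relative_instruction_counts(counts: Dict[int, int]) -> Dict[int, float]:
    """Normalize dynamic instruction counts to the shortest vector length.

    `counts` maps an SVE vector length in bits to the number of dynamic
    instructions recorded at that length. A ratio that drops quickly as the
    vector length grows suggests the code benefits from wider vectors.
    """
    base = counts[min(counts)]
    return {vl: counts[vl] / base for vl in sorted(counts)}


# Hypothetical instruction counts per SVE vector length in bits (placeholders).
counts = {
    128: 4_000_000,
    256: 2_200_000,
    512: 1_400_000,
    1024: 1_000_000,
    2048: 800_000,
}

ratios = relative_instruction_counts(counts)
for vector_length, ratio in ratios.items():
    # Perfect vector scaling would give 128 / vector_length.
    ideal = 128 / vector_length
    print(f"{vector_length:5d} bits: ratio {ratio:.2f} (ideal {ideal:.3f})")
```

A code whose ratio stays close to 1 across vector lengths gains little from wider SIMD units, which matches the behavior described above for applications that are more sensitive to frequency or core count.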

Summary

We have presented a recap of the last 18 months of effort in porting and optimizing leading applications from various scientific communities, such as Weather Forecast, Materials, Earth Sciences, and CFD, on an Arm-based HPC platform. Preliminary results are encouraging and demonstrate the viability of Arm architectures for key applications from the French HPC community. The performance results from SVE simulation are also promising and form an excellent starting point for future application performance studies and co-design activities. The next step will be to assess performance on SVE-enabled hardware, which will be a major step forward. Stay tuned.

Acknowledgment

We acknowledge all participants in the workshops: Emeric Brun (CEA); Davide Mancusi (CEA); Ghislain Lartigue (CORIA); Dimitri Lecas (IDRIS); Isabelle D'ast (Cerfacs); Gabriel Staffelbach (Cerfacs); Yann Meurdesoif (CEA); France Boillod-Cerneux (CEA); Marc Joos (CEA); Ansar Callo (CEA); Vincent Moureau (CORIA); Pierre-François Lavallee (IDRIS); Juan Escobar (CNRS); Matthieu Haefele (CNRS); Julien Derouillat (CEA); Pierre Kestener (CEA); Rémi Lacroix (IDRIS); Abel Marin-Lafleche (Maison de la Simulation); Vineet Soni (CEA); Mathieu Lobet (CEA); Nicolas Benoit (Atos); David Guibert (Atos); Bruno Froge (CEA); Gabriel Hautreux (CINES); Joel Wanza-Weloli (Atos); Olly Perks (Arm); Conrad Hillairet (Arm); Craig Prunty (Marvell); Laurent Nguyen (CEA); Cyril Mazauric (Atos); Fabrice Dupros (Arm); Christelle Piechurski (GENCI).

We are also pleased to acknowledge the European Centres of Excellence for HPC applications, in particular EXCELLERAT for CFD, ChEESE for Earth Sciences, ESiWACE for Weather and Climate, and MaX for Materials Science.
