Co-authored with Christelle Piechurski (GENCI), Laurent Nguyen (CEA), and Cyril Mazauric (Atos).
GENCI (Grand Équipement National de Calcul Intensif) is the French national High-Performance Computing (HPC) agency, in charge of promoting and acquiring HPC and Artificial Intelligence (AI) capabilities as well as massive data storage resources. These resources are dedicated to open research by French and European academic and industrial organizations, to foster collaboration and address large-scale, complex scientific challenges in domains such as astrophysics, meteorology, molecular chemistry, energy, fusion, and health. GENCI funds three national supercomputers, one of which, Joliot-Curie (TGCC/CEA), is a European Tier-0 machine. Each supercomputer is hosted and operated by one of its partners: Joliot-Curie (TGCC/CEA), Jean Zay (IDRIS/CNRS), and Occigen (CINES/CPU).
GENCI has five partners: the Ministry of Higher Education, Research and Innovation (MESRI - Ministère de l'Enseignement Supérieur, de la Recherche et de l'Innovation), the French Alternative Energies and Atomic Energy Commission (CEA - Commissariat à l'énergie atomique et aux énergies alternatives), the French National Centre for Scientific Research (CNRS - Centre National de la Recherche Scientifique), the universities (CPU - Conférence des Présidents d'Université), and finally INRIA. GENCI's role in Europe is two-fold. First, as one of the 28 PRACE members, GENCI supports the development of the European HPC ecosystem. Second, it supports the independence and sovereignty of the European Union, notably by promoting the adoption of European technologies (for example, the European Processor Initiative) in the European Exascale systems planned for 2023-2024.
The Technological Watch Unit is a vehicle for sponsorship and support of the French scientific community in learning new technologies and their software ecosystems. It helps scientists carry out software performance engineering on their applications so that these can run efficiently at extreme scale and on new technologies once they reach production. This is done in several steps: learning about the technology itself, assessing the maturity of the hardware, middleware, and software ecosystem through code portability, and evaluating how applications scale in comparison with existing technologies. It is also the right place for technology suppliers and chip makers to work cooperatively on improving a technology before it becomes widespread in the market. Together, they help determine the most suitable hardware/software implementations to adopt in a design, identify which scientific communities benefit most from the micro-architecture, and support efforts to encourage adoption by new communities. In that respect, GENCI acquires prototype platforms and gives early access to scientific experts from CEA, CNRS, CPU, and INRIA, depending on the scientific domains the institutes want to focus on. When the technology and its ecosystem are mature enough, the prototype platforms are opened to every user through the preparatory access mode on www.edari.fr (DARI stands for "Demande d'Attribution de Ressources Informatiques", a request for access to computing resources).
2019 was a key year in the pursuit of the work on Arm technology started at GENCI over ten years ago with its initial involvement in the Mont-Blanc project. It saw the deployment at TGCC/CEA of a new prototype called INTI, based on BullSequana X1000 nodes with Marvell ThunderX2 technology, and its opening to the entire scientific community in Q4 2019 (www.edari.fr), with the ambition to support and accompany the French scientific community on the road to Exascale. The INTI platform has enabled three hackathons, allowing application owners to understand the effort needed to port their codes to Arm technology and to get the most out of future generations of Arm processors, including the in-development European processor candidate for Exascale.
Figure 1: BullSequana X1000
INTI is equipped with 30 dual-socket Marvell ThunderX2 nodes, each with 64 custom ARMv8 cores running at 2.2 GHz and 256 GB of eight-channel DDR4 memory, interconnected through a 4X EDR Mellanox InfiniBand network. The software environment deployed on this platform relies on the OCEAN (CEA) software stack based on CentOS 7.5. OpenMPI was chosen as the default MPI library, and a large amount of compilation and testing work was done to improve OpenMPI performance for users. Figure 2 shows the improvement obtained for the MPI collective broadcast at up to 1,024 processes.
Figure 2: Improvement of MPI collective communications.
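As an aside for readers who want to run similar measurements themselves, the short C sketch below times MPI_Bcast over a fixed payload and reports the average per-call latency seen by the slowest rank. It is only a minimal illustration of the methodology; the payload size, iteration count, and structure are assumptions and do not reproduce the actual benchmark behind Figure 2.

```c
/* Minimal MPI_Bcast timing sketch (illustrative only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 1 << 20;   /* 1 MiB payload, arbitrary choice */
    const int iters = 100;           /* repetitions to smooth out noise */
    int rank, size;
    char *buf;
    double t0, t1, local, max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(msg_bytes);         /* contents are irrelevant for timing */

    /* Warm-up broadcast so connection setup is not timed. */
    MPI_Bcast(buf, msg_bytes, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(buf, msg_bytes, MPI_CHAR, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    /* Report the slowest rank, i.e. the completion time of the collective. */
    local = (t1 - t0) / iters;
    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks: avg MPI_Bcast of %d bytes = %.3f us\n",
               size, msg_bytes, max * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched at increasing scales (for example, mpirun -n 64 up to -n 1024), it produces one data point per process count, which is the kind of curve shown in Figure 2.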
To carry out this work of analyzing and understanding Arm technology, we chose to work with production scientific applications used by the European community to advance scientific research. The developers of these applications know how their codes behave on different technologies, and are constantly adapting them to take advantage of all the hardware resources and the associated software ecosystem at their disposal. The purpose of these workshops is to allow them to better understand the Arm architecture and how their codes can benefit from it.
These applications cover most scientific fields (CFD/combustion, oceanography, Earth simulation, chemistry, and so on) and use a whole range of languages and paradigms (C, C++, Fortran, MPI, OpenMP, I/O, and so on). Some applications are compute bound, while others are bounded by memory bandwidth, inter-process communication, or system-level I/O.
Just a few weeks after the delivery by Atos of four dual-socket Marvell ThunderX2 compute nodes, a two-day hackathon took place at the Maison de la Simulation in Saclay (France). The main objective was to evaluate the effort required to port a set of representative applications already running on the x86_64 architecture. Most of these codes were successfully ported within one day and jobs were executed on the Arm nodes, though no optimization was done at that point. Comparison numbers against GENCI's leading HPC system (the CEA-TGCC Joliot-Curie/Irene multi-petascale system) were generated. These experiments were key to assessing the porting effort on Arm and to preparing the next steps, which target application optimization and scalability. This was also an excellent opportunity for the Technology Watch Group to engage with the wider Arm community during the Arm HPC User Group at SC'18. Results from this hackathon are summarized in the following graph, which shows the application speedup between runs on the Intel Xeon Platinum 8168 top-bin SKU (24 cores at 2.7 GHz) and the Marvell ThunderX2 (32 cores at 2.2 GHz), taking the Intel results as the baseline.
Figure 3: Sept 2018 - Comparative performances on a representative set of applications.
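To give a concrete flavor of what this porting work typically involves, the sketch below shows a common pattern when moving an x86_64 code to ARMv8: architecture-specific SIMD intrinsics are isolated behind preprocessor guards, with a NEON path added next to the SSE one and a portable scalar fallback. The axpy kernel and all names are purely illustrative assumptions and are not taken from any of the applications mentioned above.

```c
/* Illustrative porting pattern: per-architecture SIMD paths behind guards. */
#include <stddef.h>

#if defined(__x86_64__)
#include <immintrin.h>   /* SSE intrinsics */
#elif defined(__aarch64__)
#include <arm_neon.h>    /* NEON intrinsics */
#endif

/* y[i] += a * x[i] : a simple axpy kernel used purely as an example. */
void axpy(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
#if defined(__x86_64__)
    __m128 va = _mm_set1_ps(a);
    for (; i + 4 <= n; i += 4) {
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vx = _mm_loadu_ps(x + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
#elif defined(__aarch64__)
    float32x4_t va = vdupq_n_f32(a);
    for (; i + 4 <= n; i += 4) {
        float32x4_t vy = vld1q_f32(y + i);
        float32x4_t vx = vld1q_f32(x + i);
        vst1q_f32(y + i, vmlaq_f32(vy, va, vx));  /* vy + va * vx */
    }
#endif
    /* Portable scalar tail (and fallback for other architectures). */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

In many cases the compiler's auto-vectorizer makes even this unnecessary, and the port reduces to rebuilding with an ARMv8 target such as GCC's -mcpu=thunderx2t99.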
To further improve the overall performance of these applications, a second event brought together the application owners at the Teratec Campus in Bruyères-le-Châtel (France). Scaling was the main topic of this second hackathon, run in June 2019. Experiments with OpenMPI v2.0.4 underlined that fine-tuning the MPI software stack is of major importance to scale a set of complex applications efficiently. These first results were presented during an application-oriented workshop at the Arm Research Summit (Austin, September 2019). As of early 2020, these numbers have been significantly improved by updating the MPI software stack (Figures 4 and 5).
Figure 4: June 2019 - First study of the scaling of the various applications up to 1,024 computing cores (OpenMPI 2.0.4).
Figure 5: Feb 2020 - Scaling improvement after fine-tuning of the MPI software stack (OpenMPI 4.0.2).
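As a hedged illustration of what fine-tuning the MPI software stack can mean in practice, the sketch below uses the standard MPI_T tools interface to list the control variables exposed by the MPI library; with Open MPI these correspond to its MCA parameters, including the tuned-collectives algorithm selection. The sketch only shows how to discover what is tunable; the specific parameter values applied on INTI are not reproduced here.

```c
/* List the MPI library's tunable control variables via MPI_T. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num_cvars);
    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        printf("%s\n", name);   /* e.g. coll_tuned_* variables in Open MPI */
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```

Run with a single rank (for example, mpirun -n 1 ./cvars), it prints one variable name per line; the same MPI_T interface can also read and, where allowed, set these variables programmatically.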
The third Arm hackathon, conducted in late February 2020, focused on SVE characterization of full-fledged representative applications. To capture low-level information such as memory traces or the breakdown of instructions, we relied on the Arm Instruction Emulator's (ArmIE) ability to target regions of interest (ROI), focusing on key loops and limiting the amount of runtime data collected. To mark a ROI, you simply add the ROI markers (start and stop macros) in the source code (more details can be found in the ArmIE documentation). In our case, we recorded dynamic instruction execution traces at various SVE vector lengths. Figure 6 shows the ratios of SVE instruction counts with respect to each vector length. This metric expresses the capacity to benefit from larger vector sizes at the application level. As expected, optimized codes from the climate community (for example, NEMO or DYNAMICO) are in the best position to benefit from larger SIMD units. Applications such as AVBP require more work to take advantage of large vector units, since the current version of the AVBP code is more sensitive to core frequency or high core count than to vector length. Additional metrics such as SIMD vector-lane utilization and memory-operation breakdown are also available and will be helpful input to future co-design activities.
Figure 6: SVE analysis of the impact of the vector length on a set of representative applications.
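The ROI mechanism mentioned above amounts to bracketing the hot loop with ArmIE's start and stop markers. The toy stencil below sketches the idea; ARMIE_ROI_BEGIN and ARMIE_ROI_END are hypothetical placeholder names standing in for the actual macros shipped with the ArmIE clients, and they are stubbed out here so the file still compiles and runs without ArmIE.

```c
/* Sketch of region-of-interest (ROI) instrumentation for trace collection. */
#include <stdio.h>

#ifndef ARMIE_ROI_BEGIN          /* hypothetical placeholders for the real */
#define ARMIE_ROI_BEGIN()        /* ArmIE ROI start/stop macros            */
#define ARMIE_ROI_END()
#endif

#define N 1024

/* Key loop whose dynamic instruction mix we want to capture at several
 * SVE vector lengths. */
static void stencil(const double *in, double *out, int n)
{
    ARMIE_ROI_BEGIN();           /* start tracing just before the hot loop */
    for (int i = 1; i < n - 1; i++)
        out[i] = 0.25 * in[i - 1] + 0.5 * in[i] + 0.25 * in[i + 1];
    ARMIE_ROI_END();             /* stop tracing right after it */
}

int main(void)
{
    static double a[N], b[N];
    for (int i = 0; i < N; i++)
        a[i] = (double)i;
    stencil(a, b, N);
    printf("%f\n", b[N / 2]);    /* keep the result live */
    return 0;
}
```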
We have presented a recap of the last 18 months of effort in porting and optimizing leading applications from various scientific communities, such as weather forecasting, materials, Earth sciences, and CFD, on an Arm-based HPC platform. Preliminary results are encouraging and demonstrate the viability of Arm architectures for key applications from the French HPC community. The performance results from SVE simulation are also promising and form an excellent starting point for future application performance studies and co-design activities. The next step will be to assess performance on SVE-enabled hardware, which will be a major step forward. Stay tuned.
We acknowledge all the participants in these workshops: Emeric Brun (CEA); Davide Mancusi (CEA); Ghislain Lartigue (CORIA); Dimitri Lecas (IDRIS); Isabelle D'ast (Cerfacs); Gabriel Staffelbach (Cerfacs); Yann Meurdesoif (CEA); France Boillod-Cerneux (CEA); Marc Joos (CEA); Ansar Callo (CEA); Vincent Moureau (CORIA); Pierre-François Lavallee (IDRIS); Juan Escobar (CNRS); Matthieu Haefele (CNRS); Julien Derouillat (CEA); Pierre Kestener (CEA); Rémi Lacroix (IDRIS); Abel Marin-Lafleche (Maison de la Simulation); Vineet Soni (CEA); Mathieu Lobet (CEA); Nicolas Benoit (Atos); David Guibert (Atos); Bruno Froge (CEA); Gabriel Hautreux (CINES); Joel Wanza-Weloli (Atos); Olly Perks (Arm); Conrad Hillairet (Arm); Craig Prunty (Marvell); Laurent Nguyen (CEA); Cyril Mazauric (Atos); Fabrice Dupros (Arm); Christelle Piechurski (GENCI).
We are also pleased to acknowledge the European Centres of Excellence for HPC applications, in particular EXCELLERAT for CFD, ChEESE for Earth sciences, ESiWACE for weather and climate, and MaX for materials science.