Co-authored with Conrad Hillairet (Arm), Gabriel Staffelbach (Cerfacs) and Isabelle D'Ast (Cerfacs)
Reliable simulation of turbulent reacting flows remains a scientific challenge, and leading research institutions dedicate significant effort to this topic. The complexity of the physical mechanisms, including the modeling of combustion and the underlying three-dimensional geometry, makes this one of the hardest problems in computational science. High-performance computing techniques are therefore key to tackling the associated large-scale industrial test cases, which arise in several strategic areas such as aerospace, gas turbine engines, and oil and gas. The AVBP suite is at the forefront of supercomputer usage and has led to several scientific breakthroughs, including the first 360-degree high-fidelity simulation (two billion cells) of a part of the real DGEN380 engine.
In this blog, we discuss performance results of AVBP for two realistic benchmarks. We focus on the Arm Neoverse N1 design and detail the performance numbers from a price-performance perspective, and we also report the time-to-solution metric.
Figure 1. First 360-degree Large-Eddy Simulation of a full engine, with around two billion cells. Joint collaboration between CERFACS, SAFRAN and AKIRA technologies; Dr. C. Pérez Arroyo (post-doctoral fellow at CERFACS).
Cerfacs is a basic and applied research center specialized in modeling and numerical simulation. Through its facilities and expertise in High Performance Computing, Cerfacs deals with major scientific and technical research problems of public and industrial interest. Cerfacs hosts interdisciplinary researchers such as physicists, applied mathematicians, numerical analysts, and software engineers. These groups design and develop innovative methods and software solutions to meet the needs of the aeronautics, space, climate, energy, and environmental fields. Cerfacs is involved in major national and international projects and interacts closely with its seven shareholders: Airbus Group, Cnes, EDF, Météo France, Onera, Safran and Total. It is also associated with partners like CNRS (Associated Research Unit), Irit (common laboratory), CEA, and Inria (cooperation agreements).
AVBP (v7.6) is a state-of-the-art numerical tool, used in both industry and academia, that solves the Navier-Stokes equations for compressible reactive flows. Originally built to study aerodynamics, AVBP is now used for the design of combustion chambers, turbomachinery, and piston engines, and even for rocket propulsion, prediction of pollutant production and dispersion, and safety applications.
Turbulence models are based on the Large Eddy Simulation approach. Coupled with a Taylor-Galerkin numerical scheme that is third order in space and time, it offers high-fidelity results on unstructured multi-element grids.
We consider AWS EC2 compute-optimized instances for most of our experiments. In addition to the C5, C5a and C6g instances, we also include performance numbers from the newly introduced M6i instance, which is based on Intel Ice Lake.
More detailed information about the instance types can be found here: https://aws.amazon.com/ec2/instance-types/
All these instances are powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor. Amazon Linux 2 is used as the operating system.
For multi-instance runs, we use the c6gn and c5n instances, which support the Elastic Fabric Adapter (EFA) network interface (up to 100 Gbps). The c5n.18xlarge instance is based on the Intel Skylake processor, which features a sustained all-core Turbo frequency of up to 3.4 GHz.
In addition to these cloud-based resources, we also include results from the Ampere Altra processor (https://amperecomputing.com/altra/) for the single-node experiments. This 80-core chip is also based on the Neoverse N1 design, with a clock speed of 3.0 GHz. It supports 128 lanes of high-speed PCIe Gen4 and eight channels of DDR4-3200. The on-premises configuration used in this blog post is dual-socket, with up to 160 Arm-based N1 cores.
For these performance tests, we used a state-of-the-art simulation of an explosion on two grids: one with 20M tetrahedra (Expl20M) and one with 60M tetrahedra (Expl60M).
Figure 1a. Mesh view (top) and rendering of the iso-surface of temperature from a simulation snapshot (bottom).
This blog post discusses AVBP v7.6, which provides frictionless support for the Arm architecture. No source code modifications are required to obtain a working binary on Arm Neoverse N1 platforms. AVBP is compiled using GCC 10.2.0 and OpenMPI 4.1.1. Regarding dependencies, we rely on Metis v5.10, Parmetis 4.0.3 and HDF5 v1.10.5.
On the C5, C5a and M6i platforms, we compared performance results using the GNU compiler (v10.2.0) and the Intel OneAPI compiler (v2021.3.0) with standard tuning flags such as “-O3 -mcpu=native”. In our case, the two compilers perform within 5% of each other, so we also rely on GNU v10.2.0 for the experiments on x86 platforms.
In this section, we study the Expl20M benchmark on single-node configurations. Figure 2 provides an overview of the overall efficiency of AVBP on various hardware configurations. We measure excellent parallel efficiency on all cloud-based platforms (above 75%). On platforms with a large core count, such as the dual-socket Ampere Altra node, AVBP demonstrates very good efficiency when using all 160 cores (71%).
Figure 2. Parallel efficiency at the node-level.
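The post does not spell out how parallel efficiency is computed, so the following is a minimal sketch assuming the usual strong-scaling definition: the elapsed time of a reference run multiplied by its core count, divided by the same product for the run under study. The function name and the timings are hypothetical placeholders, not measurements from this study.

```python
def parallel_efficiency(t_ref, cores_ref, t_run, cores_run):
    """Strong-scaling efficiency of a run relative to a reference run.

    A value of 1.0 means ideal scaling: the elapsed time shrinks exactly
    in proportion to the increase in core count.
    """
    if t_ref <= 0 or t_run <= 0:
        raise ValueError("elapsed times must be positive")
    if cores_ref <= 0 or cores_run <= 0:
        raise ValueError("core counts must be positive")
    return (t_ref * cores_ref) / (t_run * cores_run)


# Placeholder timings, invented for illustration only.
efficiency = parallel_efficiency(t_ref=1000.0, cores_ref=1, t_run=20.0, cores_run=64)
print(f"parallel efficiency on 64 cores: {efficiency:.1%}")
```

The reference run does not have to be a single core. For the multi-instance experiments, a run on one full instance could serve as the baseline with the same formula.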
Figure 3 compares the relative elapsed time for single-node executions. Results on the c6g AWS Graviton2 instances are used as a reference. EC2 instances based on single-socket AMD Rome, dual-socket Intel Cascade Lake, and AWS Graviton2 show comparable performance. Only the recently introduced Intel Ice Lake platform delivers superior performance (a 36% speedup). In that case, the ratios in both peak FP64 (5x) and memory bandwidth (2x) can probably explain these results. Overall, the dual-socket Ampere Altra node is the most effective configuration for AVBP, as the code can fully exploit the 160 computing cores. This configuration outperforms the Intel Ice Lake instance by about 35%.
Figure 3. Normalized timing (AWS Graviton2 results as a reference) at the node-level.
The last graph of this section recaps the price-to-performance ratios for cloud-based resources. The AWS Graviton2 instance is the best option, by a significant margin, in comparison to the other compute-optimized instances. The gain is beyond 50% on average, with a maximum of 85% in comparison with the Intel Ice Lake M6i instance.
Figure 4. Normalized cost (AWS Graviton2 results as a reference) for a single instance running the Expl20M AVBP benchmark. Prices are for the North Virginia region. Lower means better.
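The exact cost formula is not given in the post. A minimal sketch, assuming the cost of a run is the on-demand hourly price multiplied by the elapsed time and the number of instances, and that results are then divided by the cost of the reference instance, could look like this. The instance labels, prices and timings are hypothetical placeholders, not AWS prices or measured results.

```python
def run_cost(hourly_price_usd, elapsed_seconds, instances=1):
    """Cost of one run, assuming billing proportional to elapsed time."""
    return hourly_price_usd * instances * elapsed_seconds / 3600.0


def normalize(values, reference):
    """Divide every entry by the entry of the reference key."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}


# Placeholder inputs, invented for illustration: (price in USD/hour, elapsed seconds).
runs = {
    "instance_a": (1.0, 3600.0),  # plays the role of the reference instance
    "instance_b": (2.0, 2700.0),  # faster, but more expensive per hour
}
costs = {name: run_cost(price, seconds) for name, (price, seconds) in runs.items()}
normalized = normalize(costs, reference="instance_a")
print(normalized)
```

In this made-up example, the second instance finishes sooner but still costs more per run, which is why time-to-solution and cost are reported as separate metrics.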
In this section, we evaluate the behavior of AVBP on large-scale configurations with up to 2,048 cores and 32 compute-optimized instances using the Expl60M benchmark.
The first plot shows the parallel efficiency of the code. As expected, excellent results are observed up to the maximum number of instances used. AVBP is known for excellent parallel efficiency on petascale systems, and these performance results confirm that cloud-based resources with high-performance interconnects such as the Elastic Fabric Adapter (EFA) can support complex workloads. Arm-based and x86 instances exhibit similar results, which highlights the maturity of the software stacks for exploiting EFA on both platforms.
Figure 5. Parallel efficiency comparison up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Higher is better.
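The core counts quoted in the caption can be checked with a short calculation. This sketch assumes that the counts refer to physical cores: a c6gn.16xlarge instance exposes 64 vCPUs with one thread per core, while a c5n.18xlarge instance exposes 72 vCPUs with two hardware threads per core. The dictionary and function names are illustrative only.

```python
# vCPUs exposed by each instance type and hardware threads per physical core.
VCPUS = {"c6gn.16xlarge": 64, "c5n.18xlarge": 72}
THREADS_PER_CORE = {"c6gn.16xlarge": 1, "c5n.18xlarge": 2}


def physical_cores(instance_type, instances):
    """Total physical cores across a number of identical instances."""
    per_instance = VCPUS[instance_type] // THREADS_PER_CORE[instance_type]
    return instances * per_instance


for instance_type in VCPUS:
    print(instance_type, physical_cores(instance_type, 32))
```

Under that assumption, 32 instances give 2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge, which matches the figures in the caption.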
Figure 6 compares the relative elapsed time. In that case, the AWS Graviton2 instances outperform the C5n instances by an average margin of 10%, benefiting from the larger number of computing cores available at the node level.
Figure 6. Relative elapsed time (AWS Graviton2 as a reference) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge).
A similar trend is observed in Figure 7, which describes the performance-to-cost ratio. C6gn instances are the most cost-effective solution by a significant margin (more than 50%), whatever the number of instances involved.
Figure 7. Normalized cost (AWS Graviton2 results as a reference) for a multi-instance simulation of AVBP (Expl60M benchmark) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Prices are for the Ireland region. Lower means better.
In this blog post, we discussed performance results of the AVBP CFD suite for two representative benchmarks. We analyzed both the time-to-solution and the cost metrics for single-instance and multi-instance runs on Amazon EC2. The Arm-based AWS Graviton2 instances provide significant cost savings while delivering better performance in most cases. The C6g/C6gn Graviton2 instances, in both single-instance and multi-instance runs, save up to 56% against the best-performing x86 compute-optimized instances (C5 and C5a). A maximum of 85% is achieved in comparison with the newly introduced M6i instance based on Intel Ice Lake. The study also shows the readiness of the high-speed EFA interconnect on both Arm-based and x86 instances.
In addition to those conclusions for cloud-based resources, we also included results from the on-premises Ampere Altra dual-socket system. This configuration, which features up to 160 Neoverse N1 cores, demonstrates excellent results in terms of both scalability and time-to-solution.
Part of this work has been supported by the EXCELLERAT project (the European Centre of Excellence for Engineering Applications) which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 823691.
Colin, O., Rudgyard, M.: Development of high-order Taylor-Galerkin schemes for LES. J. of Computational Physics, 162:338–371 (2000)
Baum, M., Poinsot, T., Thevenin, D.: Accurate boundary conditions for multicomponent reacting flows. J. of Computational Physics, 116 (1994)
Pérez Arroyo, C., Dombard, J., Duchaine, F., Gicquel, L., Martin, B., Odier, N., Staffelbach, G.: Towards the Large-Eddy Simulation of a full engine: Integration of a 360 azimuthal degrees fan, compressor, and combustion chamber. Part II: Comparison against stand-alone simulations. J. of the Global Power and Propulsion Society (2021)