Running AVBP Industrial code on Arm Neoverse N1

October 14, 2021

Co-authored with Conrad Hillairet (Arm), Gabriel Staffelbach (Cerfacs) and Isabelle D'Ast (Cerfacs)

Reliable simulation of turbulent reacting flows remains a scientific challenge, and leading scientific research institutions dedicate significant effort to this topic. The complexity of the physical mechanisms, including the modeling of combustion and the underlying three-dimensional geometry, makes this one of the hardest problems in computational science. High-performance computing techniques are therefore key to tackling the associated large-scale industrial test cases, which arise in several strategic areas such as aerospace, gas turbine engines, or oil and gas. The AVBP suite is at the forefront of supercomputer usage, which has led to several scientific breakthroughs. These include the first 360-degree high-fidelity simulation (two billion cells) of a part of the real DGEN380 engine.

In this blog, we discuss performance results of AVBP for two realistic benchmarks. We focus on the Arm Neoverse N1 design and detail the performance numbers from a price-performance perspective, and we also report on the time-to-solution metric.

Large-Eddy Simulation of a full engine

Figure 1. First 360-degree Large-Eddy Simulation of a full engine, around two billion cells. Joint collaboration between CERFACS, SAFRAN and AKIRA technologies, Dr. C. Pérez Arroyo (post-doctoral fellow at CERFACS).

About CERFACS

Cerfacs is a basic and applied research center specialized in modeling and numerical simulation. Through its facilities and expertise in High Performance Computing, Cerfacs deals with major scientific and technical research problems of public and industrial interest. Cerfacs hosts interdisciplinary researchers such as physicists, applied mathematicians, numerical analysts, and software engineers. These groups design and develop innovative methods and software solutions to meet the needs of the aeronautics, space, climate, energy, and environmental fields. Cerfacs is involved in major national and international projects and interacts closely with its seven shareholders: Airbus Group, Cnes, EDF, Météo France, Onera, Safran and Total. It is also associated with partners like CNRS (Associated Research Unit), Irit (common laboratory), CEA, and Inria (cooperation agreements).

About AVBP

AVBP (v7.6) is a state-of-the-art numerical tool, used both in industry and academia, that solves the Navier-Stokes equations for compressible reactive flows. Originally built to study aerodynamics, AVBP now produces simulations used for the design of combustion chambers, turbomachinery, and piston engines, and even for rocket propulsion, pollutant production and dispersion prediction, and safety applications.

Turbulence models are based on the Large Eddy Simulation approach. Coupled with a Taylor-Galerkin numerical scheme that is third order in space and time, it offers high-fidelity results on unstructured multi-element grids.

Setup

We consider AWS EC2 compute-optimized instances for most of our experiments. In addition to the C5, C5a and C6g instances, we also include performance numbers from the newly introduced Intel-based M6i (Ice Lake) instance.

The C6g.16xlarge instances rely on AWS Graviton2 processors, custom-built 64-bit Arm chips based on Neoverse N1 cores. Every vCPU is a physical core (that is, no simultaneous multithreading), and the instances are single socket, with 64 cores in the case of the c6g.16xlarge.
The C5.24xlarge instances feature second-generation Intel Xeon Scalable Processors (Cascade Lake) with a sustained all-core turbo frequency of 3.6GHz in a dual-socket configuration (48 cores).
The C5a.24xlarge instances feature second-generation AMD EPYC processors (7002 series, Rome) clocked at 3.3GHz in a single-socket configuration (48 cores).
The M6i.32xlarge instance uses Intel Ice Lake processors in a dual-socket configuration (64 physical cores) with an all-core turbo clock speed of 3.5GHz and eight memory channels.

More detailed information about the instance types can be found here: https://aws.amazon.com/ec2/instance-types/

All these instances are powered by the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor. Amazon Linux 2 is used as the Operating System.

For multi-instance runs, we use the c6gn and c5n instances, which support the Elastic Fabric Adapter (EFA) network interface (up to 100Gbps). In that case, the C5n.18xlarge instance is based on the Intel Skylake processor, which features a sustained all-core turbo frequency of up to 3.4GHz.

In addition to these cloud-based resources, we also include results from the Ampere Altra processor (https://amperecomputing.com/altra/) for single-node experiments. This 80-core chip is also based on the Neoverse N1 design, with a clock speed of 3.0GHz. It supports 128 lanes of high-speed PCIe Gen4 and eight channels of DDR4-3200. The on-premises configuration used in this blog post is dual-socket, with up to 160 Arm-based N1 cores.

Test-Cases

For these performance tests, we used a state-of-the-art simulation of an explosion on two grids, one with 20M tetrahedra (Expl20M) and one with 60M tetrahedra (Expl60M).

Figure 1a. Mesh view (top) and rendering of the iso-surface of temperature from a simulation snapshot (bottom).

Building AVBP on Arm Neoverse N1

This blog post discusses AVBP v7.6, which supports the Arm architecture out of the box. No source code modifications are required to obtain a working binary on Arm Neoverse N1 platforms. AVBP is compiled using GCC 10.2.0 and OpenMPI 4.1.1. Regarding dependencies, we rely on Metis v5.10, Parmetis 4.0.3 and HDF5 v1.10.5.

On C5, C5a and M6i platforms, we compared performance results using the GNU compiler (v10.2.0) and the Intel oneAPI compiler (v2021.3.0) with standard tuning flags such as “-O3 -mcpu=native”. In our case, the results are in the same ballpark (less than 5% difference), so we also rely on GNU v10.2.0 for the experiments on x86 platforms.

Single-node results

In this section, we study the Expl20M benchmark on single-node configurations. Figure 2 provides an overview of the parallel efficiency of AVBP on various hardware configurations. We measure excellent efficiency on all cloud-based platforms (above 75%). On platforms with a large core count, such as the dual-socket Ampere Altra node, AVBP demonstrates very good efficiency when using all 160 cores (71%).

Figure 2. Parallel efficiency at the node-level.
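
The blog does not spell out the baseline used for the efficiency numbers, so the following is only a minimal sketch of the usual strong-scaling definition: measured speedup divided by ideal speedup. The function name and the timings are hypothetical and are not measurements from this study.

```python
def parallel_efficiency(t_ref, cores_ref, t_n, cores_n):
    """Strong-scaling efficiency of a run on cores_n cores,
    relative to a reference run on cores_ref cores (same problem size)."""
    if min(t_ref, t_n) <= 0 or min(cores_ref, cores_n) <= 0:
        raise ValueError("timings and core counts must be positive")
    speedup = t_ref / t_n                # measured speedup over the reference run
    ideal_speedup = cores_n / cores_ref  # speedup under perfect scaling
    return speedup / ideal_speedup


# Hypothetical timings in seconds, chosen for illustration only.
efficiency = parallel_efficiency(t_ref=6400.0, cores_ref=1, t_n=125.0, cores_n=64)
print(f"parallel efficiency: {efficiency:.0%}")
```

An efficiency of 100% would mean that doubling the core count halves the elapsed time.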

Figure 3 compares the relative elapsed time for single-node executions, using the results on the c6g AWS Graviton2 instances as the reference. EC2 instances based on single-socket AMD Rome, dual-socket Intel Cascade Lake, and AWS Graviton2 show comparable performance. Only the recently introduced Intel Ice Lake platform delivers superior performance (a speedup of 36%). In that case, the ratios in terms of both peak FP64 (5x) and memory bandwidth (2x) can probably explain these results. Overall, the dual-socket Ampere Altra node is the most effective configuration for AVBP, as the code can fully exploit the 160 computing cores. This configuration outperforms the Intel Ice Lake instance by about 35%.

Figure 3. Normalized timing (AWS Graviton2 results as a reference) at the node-level.
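
To relate the quoted percentages to the normalized timings, here is a rough sketch. It assumes that "a speedup of X%" means the reference elapsed time divided by the platform elapsed time equals 1 + X/100, which the blog does not state explicitly. The helper names are ours.

```python
def relative_time(speedup_percent):
    """Elapsed time relative to the reference, for a platform that is
    speedup_percent faster than that reference (assumed definition)."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


def chain(*relative_times):
    """Combine relative times measured against successive references."""
    result = 1.0
    for r in relative_times:
        result *= r
    return result


ice_lake_vs_graviton2 = relative_time(36.0)  # 36% speedup quoted in the text
altra_vs_ice_lake = relative_time(35.0)      # "about 35%" quoted in the text
altra_vs_graviton2 = chain(ice_lake_vs_graviton2, altra_vs_ice_lake)

print(f"Ice Lake vs Graviton2: {ice_lake_vs_graviton2:.2f}")
print(f"Altra vs Graviton2:    {altra_vs_graviton2:.2f}")
```

Under that assumption, the Ice Lake instance would sit at roughly 0.74 of the Graviton2 elapsed time and the dual-socket Altra node at roughly 0.54. These are derived estimates, not values read from the figure.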

The last graph of this section recaps the price-to-performance ratios for cloud-based resources. The AWS Graviton2 instance is the best option by a significant margin in comparison to the other compute-optimized instances. The gain is beyond 50% on average, with a maximum of 85% in comparison with the Intel Ice Lake M6i instance.

Figure 4. Normalized cost (AWS Graviton2 results as a reference) for a single instance running the Expl20M AVBP benchmark. Prices are for the North Virginia region. Lower means better.
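
As a hedged illustration of how a normalized cost can be derived, the sketch below multiplies an hourly price by the elapsed time and divides by the cost of the reference run. We assume on-demand hourly billing. The prices and timings are hypothetical placeholders, not the values behind Figure 4.

```python
def run_cost(price_per_hour, elapsed_hours, instances=1):
    """Cost of one simulation: hourly price x elapsed time x instance count."""
    return price_per_hour * elapsed_hours * instances


def normalized_cost(cost, reference_cost):
    """Cost relative to the reference platform (1.0 means same cost)."""
    return cost / reference_cost


# Hypothetical prices (USD per hour) and timings (hours), for illustration only.
reference = run_cost(price_per_hour=2.0, elapsed_hours=1.0)
candidate = run_cost(price_per_hour=4.0, elapsed_hours=0.75)

print(f"normalized cost: {normalized_cost(candidate, reference):.2f}")
```

In this made-up example, the candidate instance finishes sooner but still costs more per simulation, which is why time-to-solution and cost are reported separately.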

Running at scale on AWS EC2 compute optimized instances

In this section, we evaluate the behavior of AVBP on large-scale configurations with up to 2,048 cores and 32 compute-optimized instances using the Expl60M benchmark.

The first plot shows the parallel efficiency of the code. As expected, excellent results are observed up to the maximum number of instances used. AVBP is known for excellent parallel efficiency on petascale systems, and these performance results confirm that cloud-based resources with high-performance interconnects such as the Elastic Fabric Adapter (EFA) can support complex workloads. Arm-based and x86 instances exhibit similar results, which highlights the maturity of the software stacks for the exploitation of EFA on both platforms.

Figure 5. Parallel efficiency comparison up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Higher is better.

Figure 6 compares the relative elapsed time. In that case, the AWS Graviton2 instances outperform the C5n instances by an average margin of 10%. We benefit from the higher number of computing cores available at the node level.

Figure 6. Relative elapsed time (AWS Graviton2 as a reference) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge).
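
The per-node core advantage mentioned above can be recovered from the totals quoted in the captions. This small sketch assumes the cores are spread evenly over the 32 instances. The function name is ours.

```python
def cores_per_instance(total_cores, instances):
    """Cores per instance, assuming an even split across instances."""
    if instances <= 0 or total_cores % instances != 0:
        raise ValueError("total_cores must be a positive multiple of instances")
    return total_cores // instances


c6gn_cores = cores_per_instance(2048, 32)  # totals quoted in the captions
c5n_cores = cores_per_instance(1152, 32)
ratio = c6gn_cores / c5n_cores

print(f"c6gn.16xlarge: {c6gn_cores} cores, c5n.18xlarge: {c5n_cores} cores")
print(f"core-count ratio per instance: {ratio:.2f}")
```

With the figures quoted above, this gives 64 cores per c6gn.16xlarge instance and 36 cores per c5n.18xlarge instance.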

A similar trend is observed in Figure 7, which describes the performance-to-cost ratio. C6gn instances are the most cost-effective solution by a significant margin (more than 50%), regardless of the number of instances involved.

Figure 7. Normalized cost (AWS Graviton2 results as a reference) for a multi-instance simulation of AVBP (Expl60M benchmark) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Prices are for the Ireland region. Lower means better.

Summary

In this blog post, we discussed performance results of the AVBP CFD suite for two representative benchmarks. We analyzed both the time-to-solution and the cost metrics for single and multi-instance runs on Amazon EC2. The Arm-based AWS Graviton2 instances provide significant cost savings while delivering better performance in most cases. The C6g/C6gn Graviton2 instances, in both single and multi-instance configurations, provide savings of up to 56% against the best-performing x86 compute-optimized instances (C5 and C5a). A maximum of 85% is achieved in comparison with the newly introduced M6i instance based on Intel Ice Lake. The study also shows the readiness of the high-speed EFA interconnect on both Arm-based and x86 instances.

In addition to those conclusions for cloud-based resources, we also included results from the on-premises Ampere Altra dual-socket system. This configuration, which features up to 160 Neoverse N1 cores, demonstrates excellent results in terms of both scalability and time-to-solution.

Acknowledgment

Part of this work has been supported by the EXCELLERAT project (the European Centre of Excellence for Engineering Applications) which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 823691.

References

[1] Colin, O., Rudgyard, M. Development of high-order Taylor-Galerkin schemes for LES. J. of Computational Physics, 162:338–371 (2000).

[2] Baum, M., Poinsot, T., Thevenin, D. Accurate boundary conditions for multicomponent reacting flows. J. of Computational Physics, Vol. 116 (1994).

[3] Pérez Arroyo, C., Dombard, J., Duchaine, F., Gicquel, L., Martin, B., Odier, N., Staffelbach, G. Towards the Large-Eddy Simulation of a full engine: Integration of a 360 azimuthal degrees fan, compressor, and combustion chamber. Part II: Comparison against stand-alone simulations. J. of the Global Power and Propulsion Society (2021).
