Running AVBP Industrial code on Arm Neoverse N1

Fabrice Dupros
October 14, 2021
9 minute read time.

Co-authored with Conrad Hillairet (Arm), Gabriel Staffelbach (Cerfacs) and Isabelle D'Ast (Cerfacs)

Reliable simulations of turbulent reacting flows remain a scientific challenge, and leading scientific research institutions dedicate significant effort to this topic. Indeed, the complexity of the physical mechanisms, including the modeling of combustion and the underlying three-dimensional geometry, makes this one of the hardest problems in computational science. High-performance computing techniques are therefore key to tackling the associated large-scale industrial test cases arising from several strategic areas such as aerospace, gas turbine engines, or oil and gas. The AVBP suite is at the forefront of supercomputer usage, which has led to several scientific breakthroughs, including the first 360-degree high-fidelity simulation (two billion cells) of part of the real DGEN380 engine.

In this blog, we discuss performance results of AVBP for two realistic benchmarks. We focus on the Arm Neoverse N1 design and detail the performance numbers from a price-performance perspective, but we also report on the time-to-solution metric.


Figure 1. First 360-degree Large-Eddy Simulation of a full engine – around two billion cells. Joint collaboration between CERFACS, SAFRAN, and AKIRA technologies; Dr. C. Pérez Arroyo (post-doctoral fellow at CERFACS).

About CERFACS

Cerfacs is a basic and applied research center, specialized in modeling and numerical simulation. Through its facilities and expertise in High Performance Computing, Cerfacs deals with major scientific and technical research problems of public and industrial interest. Cerfacs hosts interdisciplinary researchers such as physicists, applied mathematicians, numerical analysts, and software engineers. These groups design and develop innovative methods and software solutions to meet the needs of the aeronautics, space, climate, energy, and environmental fields. Cerfacs is involved in major national and international projects and interacts strongly with its seven shareholders: Airbus Group, Cnes, EDF, Météo France, Onera, Safran, and Total. It is also associated with partners such as CNRS (Associated Research Unit), Irit (common laboratory), CEA, and Inria (cooperation agreements).

About AVBP

AVBP (v7.6) is a state-of-the-art numerical tool for solving the Navier-Stokes equations for compressible reactive flows, used both in industry and academia. Originally built to study aerodynamics, AVBP simulations are used for the design of combustion chambers, turbomachinery, and piston engines, and even for rocket propulsion, pollutant production and dispersion prediction, and safety applications.

Turbulence models are based on the Large-Eddy Simulation approach. Coupled with a Taylor-Galerkin numerical scheme that is third order in space and time [1], it offers high-fidelity results on unstructured multi-element grids.

Setup 

We consider AWS EC2 compute-optimized instances for most of our experiments. In addition to the C5, C5a, and C6g instances, we also include performance numbers from the newly introduced Intel Ice Lake-based M6i instance.

  • The C6g.16xlarge instances rely on AWS Graviton2 processors, a custom-built 64-bit Arm chip based on Neoverse cores. Every vCPU is a physical core (that is, no simultaneous multithreading) and the instances are single-socket, with 64 cores in the case of the c6g.16xlarge.
  • The C5.24xlarge instances feature second-generation Intel Xeon Scalable processors (Cascade Lake) with a sustained all-core turbo frequency of 3.6GHz in a dual-socket configuration (48 cores).
  • The C5a.24xlarge instances feature second-generation AMD EPYC 7002 (Rome) series processors clocked at 3.3GHz in a single-socket configuration (48 cores).
  • The M6i.32xlarge instances feature the Intel Ice Lake processor in a dual-socket configuration (64 physical cores) with an all-core turbo clock speed of 3.5GHz and eight memory channels.

More detailed information about the instance types can be found here: https://aws.amazon.com/ec2/instance-types/ 

All these instances are powered by the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor. Amazon Linux 2 is used as the operating system.

For multi-instance runs, we use the c6gn and c5n instances, which support the Elastic Fabric Adapter (EFA) network interface (up to 100Gbps). In that case, the c5n.18xlarge instance is based on the Intel Skylake processor, which features a sustained all-core turbo frequency of up to 3.4GHz.

In addition to these cloud-based resources, we also include results from the Ampere Altra processor (https://amperecomputing.com/altra/) for single-node experiments. This 80-core chip is also based on the Neoverse N1 design, with a clock speed of 3.0GHz. It supports 128 lanes of high-speed PCIe Gen4 and eight channels of DDR4-3200. The on-premises configuration used in this blog post is dual-socket, with up to 160 Arm-based N1 cores.

Test-Cases

For these performance tests, a state-of-the-art simulation of an explosion was used with two grids: one with 20M tetrahedra (Expl20M) and one with 60M tetrahedra (Expl60M).

Figure 1a. Mesh view (top) and rendering of the iso-surface of temperature (simulation snapshot, bottom).

Building AVBP on Arm Neoverse N1

The version of AVBP discussed in this blog post (v7.6) provides frictionless support for the Arm architecture: no source code modifications are required to obtain a working binary on Arm Neoverse N1 platforms. AVBP is compiled using GCC 10.2.0 and OpenMPI 4.1.1. Regarding dependencies, we rely on Metis v5.1.0, ParMetis 4.0.3, and HDF5 v1.10.5.

On the C5, C5a, and M6i platforms, we compare performance results using the GNU compiler (v10.2.0) and the Intel oneAPI compiler (v2021.3.0) with standard tuning flags such as “-O3 -mcpu=native”. In our case, performance is in the same ballpark (<5% difference), so we also rely on GNU v10.2.0 for the experiments on x86 platforms.
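Before compiling, it can help to confirm that the toolchain on a freshly launched instance matches the versions listed above. The short Python sketch below is ours and purely illustrative: it only invokes the standard gcc and mpirun commands and prints the machine architecture, so it can be adapted to whatever environment you use.

    # Illustrative sketch only, not part of AVBP: sanity-check the build environment.
    # Expected versions are the ones quoted in this post (GCC 10.2.0, OpenMPI 4.1.1).
    import platform
    import subprocess

    print("Architecture:", platform.machine())  # 'aarch64' on Graviton2 / Ampere Altra

    for cmd in (["gcc", "--version"], ["mpirun", "--version"]):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            print(result.stdout.splitlines()[0])  # e.g. 'gcc (GCC) 10.2.0'
        except (FileNotFoundError, subprocess.CalledProcessError):
            print(" ".join(cmd), "not found or failed")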

Single-node results

In this section, we study the Expl20M benchmark on single-node configurations. Figure 2 provides an overview of the overall efficiency of AVBP on various hardware configurations. We measure excellent efficiency of the code on all cloud-based platforms (above 75%). On platforms with a large core count, such as the dual-socket Ampere Altra node, AVBP demonstrates very good efficiency when using all 160 cores (71%). A short sketch of how such an efficiency figure is typically computed follows Figure 2.


Figure 2. Parallel efficiency at the node-level.
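For reference, parallel efficiency at a given core count is commonly defined as the speedup over a baseline run divided by the ratio of core counts. The Python sketch below illustrates that computation; the core counts and timings in it are invented placeholders, not AVBP measurements, and the exact baseline used for Figure 2 may differ.

    # Illustrative only: parallel efficiency relative to a baseline core count.
    # The core counts and elapsed times below are invented placeholders.
    def parallel_efficiency(timings):
        """timings: dict mapping core count -> elapsed time in seconds.
        Returns a dict mapping core count -> efficiency vs the smallest run."""
        base_cores = min(timings)
        base_time = timings[base_cores]
        return {
            cores: (base_time / elapsed) / (cores / base_cores)
            for cores, elapsed in sorted(timings.items())
        }

    fake_timings = {8: 800.0, 16: 410.0, 32: 215.0, 64: 118.0}  # seconds, made up
    for cores, eff in parallel_efficiency(fake_timings).items():
        print(f"{cores:3d} cores: efficiency = {eff:.0%}")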

Figure 3 compares the relative elapsed time for single-node executions. Results on the c6g AWS Graviton2 instances are used as a reference. EC2 instances based on single-socket AMD Rome, dual-socket Intel Cascade Lake, and AWS Graviton2 show comparable performance. Only the recently introduced Intel Ice Lake platform delivers superior performance (a 36% speedup). In that case, the advantage in both peak FP64 throughput (5x) and memory bandwidth (2x) probably explains the result. Overall, the dual-socket Ampere Altra node is the most effective configuration for AVBP, as the code can fully exploit its 160 computing cores. This configuration outperforms the Intel Ice Lake instance by about 35%.


Figure 3. Normalized timing (AWS Graviton2 results as a reference) at the node-level.

The last graph of this section recaps the price-to-performance ratios for the cloud-based resources. The AWS Graviton2 instance is the best option by a significant margin in comparison with the other compute-optimized instances. The gain exceeds 50% on average, with a maximum of 85% in comparison with the Intel Ice Lake M6i instance. A sketch of how such a normalized cost can be derived follows Figure 4.

Figure 4. Normalized cost (AWS Graviton2 results as a reference) for a single instance running the Expl20M AVBP benchmark. Prices are for the North Virginia region. Lower means better.
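The normalized-cost metric in Figure 4 (and Figure 7 later) can be reproduced along the lines of the sketch below: the on-demand hourly price of each instance is multiplied by its elapsed time, and the result is divided by the same product for the reference c6g Graviton2 instance. The prices and runtimes in the snippet are placeholders of our own, not the values behind the figure; current on-demand prices are listed on the AWS EC2 pricing pages.

    # Illustrative only: normalized cost = hourly price * elapsed time,
    # divided by the same quantity for the reference (c6g / Graviton2) instance.
    # All prices (USD/hour) and elapsed times (hours) below are placeholders.
    PRICE_PER_HOUR = {
        "c6g.16xlarge": 2.2,
        "c5.24xlarge": 4.1,
        "c5a.24xlarge": 3.7,
        "m6i.32xlarge": 6.1,
    }
    ELAPSED_HOURS = {
        "c6g.16xlarge": 1.00,
        "c5.24xlarge": 1.05,
        "c5a.24xlarge": 1.02,
        "m6i.32xlarge": 0.73,
    }

    def normalized_cost(reference="c6g.16xlarge"):
        ref_cost = PRICE_PER_HOUR[reference] * ELAPSED_HOURS[reference]
        return {name: PRICE_PER_HOUR[name] * ELAPSED_HOURS[name] / ref_cost
                for name in PRICE_PER_HOUR}

    for name, cost in sorted(normalized_cost().items(), key=lambda kv: kv[1]):
        print(f"{name:14s} normalized cost = {cost:.2f}  (lower is better)")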

Running at scale on AWS EC2 compute-optimized instances

In this section, we evaluate the behavior of AVBP on large-scale configurations with up to 2,048 cores and 32 compute-optimized instances using the Expl60M benchmark.

The first plot shows the parallel efficiency of the code. As expected, excellent results are observed up to the maximum number of instances used. AVBP is known for excellent parallel efficiency on petascale systems, and these performance results confirm that cloud-based resources with high-performance interconnects such as the Elastic Fabric Adapter (EFA) can support complex workloads. Arm-based and x86 instances exhibit similar results, which highlights the maturity of the software stacks for exploiting EFA on both platforms.


Figure 5. Parallel efficiency comparison up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Higher is better.

Figure 6 compares the relative elapsed time. In that case, the AWS Graviton2 instances outperform the c5n instances by an average margin of 10%. We benefit from the larger number of computing cores available at the node level.


Figure 6. Relative elapsed time (AWS Graviton2 as a reference) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge).

A similar trend is observed in Figure 7, which describes the performance-to-cost ratio. C6gn instances are the most cost-effective solution by a significant margin (more than 50%), whatever the number of instances involved.

Figure 7. Normalized cost (AWS Graviton2 results as a reference) for a multi-instance simulation of AVBP (Expl60M benchmark) up to 32 AWS EC2 instances (2,048 cores on c6gn.16xlarge and 1,152 cores on c5n.18xlarge). Prices are for the Ireland region. Lower means better.

Summary

In this blog post, we discussed performance results of the AVBP CFD suite for two representative benchmarks. We analyzed both the time-to-solution and the cost metrics for single-instance and multi-instance Amazon EC2 configurations. The Arm-based AWS Graviton2 instances provide significant cost savings while delivering better performance in most cases. The C6g/C6gn Graviton2 single-instance and multi-instance runs provide a significant saving: up to 56% against the best-performing x86 compute-optimized instances (C5 and C5a). A maximum of 85% is achieved in comparison with the newly introduced M6i instance based on Intel Ice Lake. The study also shows the readiness of the EFA high-speed interconnect on both Arm-based and x86 instances.

In addition to these conclusions for cloud-based resources, we also included results from the on-premises Ampere Altra dual-socket system. This configuration, which features up to 160 Neoverse N1 cores, demonstrates excellent results in terms of both scalability and time-to-solution.

Acknowledgment

Part of this work has been supported by the EXCELLERAT project (the European Centre of Excellence for Engineering Applications), which has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement no. 823691.

References

[1] Colin, O., Rudgyard, M.: Development of high-order Taylor-Galerkin schemes for LES. Journal of Computational Physics, 162:338–371 (2000).

[2] Baum, M., Poinsot, T., Thévenin, D.: Accurate boundary conditions for multicomponent reacting flows. Journal of Computational Physics, 116 (1994).

[3] Pérez Arroyo, C., Dombard, J., Duchaine, F., Gicquel, L., Martin, B., Odier, N., Staffelbach, G.: Towards the Large-Eddy Simulation of a full engine: Integration of a 360 azimuthal degrees fan, compressor, and combustion chamber. Part II: Comparison against stand-alone simulations. Journal of the Global Power and Propulsion Society (2021).
