Computational Fluid Dynamics (CFD) simulations are of great importance both for the physical processes they model with predictive goals and for the technology research they support, numerically and experimentally. The low-Mach-number Nalu CFD code from Sandia National Laboratories (SNL) [1] exercises cutting-edge numerical methods for CFD systems that are also used in SNL mission-critical simulation codes. Nalu exercises many software packages, such as HDF5, PNetCDF, Trilinos, and Kokkos, which are integral components of other mission-critical codes. Understanding the performance of Nalu and its software stack on emerging CPU technology is beneficial to members of the scientific HPC community.
The SIERRA Low Mach Module: Nalu (henceforth referred to as Nalu), developed at Sandia National Laboratories, represents a generalized unstructured, massively parallel, variable-density turbulent flow capability designed for energy applications. The code base began as an effort to prototype Sierra Toolkit [2] usage along with direct parallel matrix assembly into the Trilinos [3] Epetra and Tpetra data structures. However, the simulation tool has evolved to support a variety of research projects germane to the energy sector, including wind aerodynamics prediction and traditional gas-phase combustion applications.
The build process follows the directions listed in the Nalu online documentation. The Arm-based AWS Graviton2 system can take advantage of the Arm Performance Libraries (ArmPL) and the Arm Compiler for Linux (ACFL). The following build directions can help guide building HPC applications on current and future Arm-based systems. ACFL 20.1 and OpenMPI 4.0.3 with UCX 1.8.0 were used for this exercise, following the directions detailed at https://gitlab.com/arm-hpc/packages/-/wikis/packages/nalucfd. The software dependencies include packages such as HDF5, PNetCDF, Trilinos, and Kokkos.
It is helpful to set commonly used variables in a sourced file. For example:
$ cat /home/student003/srivad01/nalu/source.this
module load openmpi/acfl-20.1/4.0.3
module load Neoverse-N1/RHEL/7/arm-linux-compiler-20.1/armpl/20.1.0
export nalu_build_dir=/home/student002/srivad01/nalu/build_dir/arm20.1
export nalu_install_dir=/home/student002/srivad01/nalu/install_dir/arm20.1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/bin/../../gcc-9.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64
export PATH=/home/student002/srivad01/nalu/install_dir/arm20.1/cmake/3.14.7/bin:${PATH}
These build directions can be modified to accommodate more packages in the Sierra Toolkit.
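To make the workflow concrete, the sketch below shows how the sourced variables above might be consumed when configuring one of the dependencies, using Trilinos as the example. This is only a sketch: the source-tree location under ${nalu_build_dir} is a placeholder and the option list is trimmed; the complete per-package configure options are documented on the gitlab wiki page referenced earlier.

# Hypothetical configure step for Trilinos using the OpenMPI compiler
# wrappers around ACFL; "-mcpu=native" supplies the architecture tuning
# mentioned in the build directions.
source /home/student003/srivad01/nalu/source.this
cmake \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_Fortran_COMPILER=mpifort \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=native" \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_INSTALL_PREFIX=${nalu_install_dir}/trilinos \
  -DTPL_ENABLE_MPI=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  ${nalu_build_dir}/packages/Trilinos   # placeholder source location
make -j $(nproc) && make install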
The test (milestone) problem is a mixture-fraction-based turbulent open jet with a Reynolds number of ~6000. The jet emanates from the base of the domain. Mesh sizes range from 2.73E5 to 8.9E9 elements; this experiment uses a mesh of ~2E6 elements. Approximately half of the total number of unknowns are in the momentum solve. The elliptic Poisson pressure (continuity), mixture-fraction, and turbulent kinetic energy systems comprise nearly equal shares of the remaining unknowns. Trilinos is the primary package for solving these systems, with Kokkos handling data and memory management across the memory hierarchy of the compute nodes.
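As a rough check on that breakdown (assuming the unknowns per mesh node are the three momentum components plus one each for pressure, mixture fraction, and turbulent kinetic energy), momentum accounts for 3/(3+1+1+1) = 50% of the unknowns, and the other three systems split the remainder evenly.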
The figure on the right shows simulation visualization results from the Trinity acceptance campaign [5,6].
This exercise on the AWS M6g instances uses the R1 mesh size with 2E5 elements for the single-node exploration. The mesh_R2c-nt.g mesh, with 2E6 elements, HDF5 compression, and no initial time-step information in the mesh file, was used for the multi-node exploration. The intent was to have enough unknowns to require a large computational effort after decomposition across parallel processing elements.
Full MPI rank saturation was assumed for the 1- to 8-node scaling study. Only compiler optimization via SIMD and architecture flags was used (as noted in the build directions).
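For reference, a launch along the following lines saturates all 64 cores on each of the 8 M6g nodes (512 ranks total). This is only a sketch: the hostfile and input-deck names are placeholders, and the Nalu driver is assumed to be invoked as naluX with a -i input file.

# Hypothetical 8-node, fully saturated strong-scaling launch
mpirun -np 512 --hostfile hosts.8node \
       --map-by core --bind-to core \
       ./naluX -i milestone_R2.i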
Figure 1: Speedup (strong scaling) from 32 cores on a single socket to 64 cores on each of 8 nodes.
Noticeable compute improvement is observed at larger node counts. The minimal speedup from 4 to 8 nodes suggests repeating the exploration with more unknowns via a larger mesh.
Arm Forge can be exercised on AWS systems for performance analysis and debugging. In this exercise, the timings for four and eight nodes are quite similar.
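As a sketch (reusing the hypothetical launch line from above), a MAP profile can be collected non-interactively and opened later in the Forge GUI or remote client:

# Collects a .map profile of the run without launching the GUI
map --profile mpirun -np 512 ./naluX -i milestone_R2.i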
Figure 2: Nalu milestone case on 4 and 8 Graviton2 M6g nodes.
This initial performance comparison shows an increase in the MPI activity time ratio past four nodes, suggesting the workload does not saturate the larger compute core count. Given that the MAP profiles show the application using only ~228 MB/rank (about 14.5 GB/node), it makes financial sense to run this problem on a C6g instance (128 GB/node) rather than the M6g (256 GB/node).
Table 1: Nalu 2E6-element simulation on a C6g Graviton2 instance, showing nearly 10% cost savings.
Exploration of HPC application deployment on AWS Graviton2 resources has shown efficacy in both performance and price. This exercise has demonstrated that complicated scientific applications requiring many dependent packages can be successfully built and exercised on Arm-based distributed-memory systems.
[CTAToken URL = "https://community.arm.com/b/hpc" target="_blank" text="Explore HPC on Arm" class ="green"]
[1] Domino, S. "Sierra Low Mach Module: Nalu Theory Manual 1.0", SAND2015-3107W, Sandia National Laboratories Unclassified Unlimited Release (UUR), 2015. https://github.com/NaluCFD/NaluDoc.
[2] H. Edwards, A. Williams, G. Sjaardema, D. Baur, and W. Cochran. Sierra Toolkit Computational Mesh Computational Model. Technical Report SAND2010-1192, Sandia National Laboratories, Albuquerque, NM, 2010.
[3] M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, J. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An Overview of Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, Albuquerque, NM, 2003.
[4] S. Tieszen, S. Domino, and A. Black. Validation of a Simple Turbulence Model Suitable for Closure of Temporally-Filtered Navier-Stokes Equations Using a Helium Plume. Technical Report SAND2005-3210, Sandia National Laboratories, Albuquerque, NM, June 2005.
[5] A. M. Agelastos and P. T. Lin. Simulation Information Regarding Sandia National Laboratories' Trinity Capability Improvement Metric. Technical Report SAND2013-8748, Sandia National Laboratories, Albuquerque, NM, October 17, 2013.
[6] P. T. Lin, M. T. Bettencourt, S. Domino, T. Fisher, M. Hoemmen, J. J. Hu, E. T. Phipps, A. Prokopenko, S. Rajamanickam, C. M. Siefert, and S. Kennon. Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos. Parallel Processing Letters, vol. 24, 2014.