Demonstration of low mach-number CFD modeling with Nalu on AWS Graviton2 M6g instances

Srinath Vadlamani, PhD
June 25, 2020

Computational fluid dynamics (CFD) simulations matter both for the physical processes they model and predict and for the numerical and experimental research they drive. The low Mach number Nalu CFD code from Sandia National Laboratories (SNL) [1] exercises cutting-edge numerical methods that are also used in SNL mission-critical simulation codes. Nalu relies on many software packages, such as HDF5, PNetCDF, Trilinos, and Kokkos, which are integral components of those mission-critical codes. Understanding the performance of Nalu and its software stack on emerging CPU technology therefore benefits the scientific HPC community.

About Nalu 

The SIERRA Low Mach Module: Nalu (henceforth referred to as Nalu), developed at Sandia National Labs, is a generalized unstructured, massively parallel, variable density turbulent flow capability designed for energy applications. The code base began as an effort to prototype Sierra Toolkit [2] usage along with direct parallel matrix assembly to the Trilinos [3] Epetra and Tpetra data structures. The simulation tool has since evolved to support a variety of research projects germane to the energy sector, including wind aerodynamic prediction and traditional gas-phase combustion applications.

Building Nalu on AWS

The build process follows the directions in the Nalu online documentation, as detailed at https://gitlab.com/arm-hpc/packages/-/wikis/packages/nalucfd. On the Arm-based AWS Graviton2 system, the build can take advantage of the Arm Performance Library (armpl) and Arm Compiler for Linux (ACFL). This exercise used ACFL 20.1 and OpenMPI 4.0.3 with UCX 1.8.0. The same directions can help guide builds of HPC applications on current and future Arm-based systems. The software dependencies are:

  • SuperLU
  • libXML2
  • Boost
  • Yaml-cpp
  • zlib
  • HDF5
  • Parallel-NetCDF
  • NetCDF
  • Trilinos

It is helpful to set commonly used variables in a sourced file. For example:

$ cat /home/student003/srivad01/nalu/source.this
module load openmpi/acfl-20.1/4.0.3
module load Neoverse-N1/RHEL/7/arm-linux-compiler-20.1/armpl/20.1.0

export nalu_build_dir=/home/student002/srivad01/nalu/build_dir/arm20.1
export nalu_install_dir=/home/student002/srivad01/nalu/install_dir/arm20.1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/bin/../../gcc-9.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64

export PATH=/home/student002/srivad01/nalu/install_dir/arm20.1/cmake/3.14.7/bin:${PATH}

These build directions can be modified to accommodate more packages in the Sierra Toolkit.
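
As a minimal sketch of how the sourced file might be used, the helper below reports whether the two directory variables exported by source.this are set before a build starts. The function name check_nalu_env is hypothetical; it is not part of Nalu or of the original build directions.

```shell
# Hypothetical helper (an assumption, not from the Nalu build notes):
# print one line per unset or empty variable and return non-zero if any is missing.
check_nalu_env() {
    rc=0
    [ -n "${nalu_build_dir:-}" ]   || { echo "missing: nalu_build_dir"; rc=1; }
    [ -n "${nalu_install_dir:-}" ] || { echo "missing: nalu_install_dir"; rc=1; }
    return "$rc"
}
```

After sourcing source.this, a call such as `check_nalu_env && mkdir -p "${nalu_build_dir}" "${nalu_install_dir}"` would create the directories only when both variables are defined.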

Simulation Description

The test (milestone) problem is a mixture-fraction based turbulent open jet with a Reynolds number of ~6000. The jet emanates from the base of the domain. Mesh sizes range from 2.73E5 to 8.9E9 elements. This experiment uses a mesh of ~2E6 elements. Approximately half of the total number of unknown variables are in the momentum solve. The elliptic Poisson pressure (continuity) system, the mixture-fraction system, and the kinetic energy system account for nearly equal shares of the remaining unknowns. Trilinos is the primary package for solving the systems, and Kokkos handles data and memory management across the hierarchy of memory systems on the compute nodes.

The accompanying figure shows simulation visualization results from the Trinity acceptance campaign [5,6].

AWS M6g Timing Results

This exercise on the AWS M6g instances uses the R1 mesh, with 2E5 elements, for the single-node exploration. The multi-node exploration uses mesh_R2c-nt.g, which has 2E6 elements, HDF5 compression, and no initial time step information in the mesh file. The intent was to have enough unknown variables to require substantial computational effort after decomposition across the parallel processing elements.
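
To give a feel for how much work remains per process after decomposition, the sketch below divides the approximate 2E6-element mesh evenly across MPI ranks. It assumes 64 ranks per node (one per vCPU of a 16xlarge instance) and a perfectly balanced decomposition, which real partitions will only approximate. The helper name elements_per_rank is hypothetical.

```shell
# elements_per_rank ELEMENTS NODES RANKS_PER_NODE
# Integer estimate of mesh elements per MPI rank, assuming an even split.
elements_per_rank() {
    echo $(( $1 / ($2 * $3) ))
}

# ~2E6 elements is the approximate mesh size quoted in the text.
for nodes in 1 2 4 8; do
    echo "${nodes} node(s): $(elements_per_rank 2000000 "${nodes}" 64) elements/rank"
done
# 1 node(s): 31250 elements/rank
# 2 node(s): 15625 elements/rank
# 4 node(s): 7812 elements/rank
# 8 node(s): 3906 elements/rank
```

Under these assumptions, each rank holds only a few thousand elements at 8 nodes, so communication cost can become comparable to the compute cost per rank.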

MPI scaling

The 1 to 8 node scaling study assumed full MPI rank saturation, with every core on each node running an MPI rank. The only optimization applied was at compile time, via the simd and arch flags noted in the build directions.

[Figure: Nalu strong-scaling graph on Arm-based AWS instances]

Figure 1: Speedup (strong scaling) from 32 cores of a single socket to 64 cores on each of 8 nodes.

A noticeable compute improvement is observed at larger node counts. However, the minimal speedup from 4 to 8 nodes suggests repeating the exploration with more unknown variables, using a larger mesh.
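
The strong-scaling quantities discussed here can be computed from wall-clock timings as in the sketch below. Speedup is taken relative to a baseline run, and parallel efficiency divides that speedup by the growth in core count. The function names are hypothetical, and the timings in the usage lines are placeholders chosen for easy arithmetic, not measurements from this study.

```shell
# speedup T_BASE T_N  ->  T_BASE / T_N
speedup() {
    awk -v tb="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", tb / tn }'
}

# efficiency T_BASE CORES_BASE T_N CORES_N  ->  speedup / (CORES_N / CORES_BASE)
efficiency() {
    awk -v tb="$1" -v cb="$2" -v tn="$3" -v cn="$4" \
        'BEGIN { printf "%.2f\n", (tb / tn) / (cn / cb) }'
}

# Placeholder timings (NOT measured values): 100 s on 32 cores, 25 s on 256 cores.
speedup 100 25            # 4.00
efficiency 100 32 25 256  # 0.50
```

An efficiency that drops sharply between two node counts, as the minimal 4 to 8 node speedup implies, usually indicates that the problem is too small for the added cores.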

Performance Exploration

Arm Forge can be used on AWS systems for performance analysis and debugging. In this exercise, the timings for four and eight nodes are quite similar.

[Figure: Arm Forge screenshots]

Figure 2: Nalu milestone case on 4 and 8 Graviton2 M6g nodes.

This initial performance comparison shows that the share of time spent in MPI activity increases past four nodes, suggesting that the workload does not saturate the larger core count. The MAP profiles show that the application uses only ~228 MB per rank, or about 14.5 GB per node, so it makes financial sense to run this problem on a C6g (128 GB/node) rather than an M6g (256 GB/node).
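
The per-node footprint quoted above follows from simple arithmetic, sketched below. The sketch assumes 64 ranks per node and decimal units (1 GB = 1000 MB); node_memory_use is a hypothetical helper name. The result, roughly 14.6 GB, agrees with the ~14.5 GB figure to within the rounding of the per-rank estimate.

```shell
# node_memory_use MB_PER_RANK RANKS_PER_NODE NODE_CAPACITY_GB
# Prints the estimated footprint per node and its share of the node's memory.
node_memory_use() {
    awk -v mb="$1" -v r="$2" -v cap="$3" 'BEGIN {
        used = mb * r / 1000    # decimal GB per node (assumption: 1 GB = 1000 MB)
        printf "%.1f GB of %d GB (%.1f%%)\n", used, cap, 100 * used / cap
    }'
}

node_memory_use 228 64 128   # C6g: 14.6 GB of 128 GB (11.4%)
node_memory_use 228 64 256   # M6g: 14.6 GB of 256 GB (5.7%)
```

Either instance type leaves most of its memory unused for this problem, which is why the smaller-memory C6g is the natural choice.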

AWS Graviton2 instance | Time [secs] | Price per instance [US dollars] | Sim. price [US dollars]
m6g.16xlarge [64 vCPU] | 477.12 | $2.46 | $19.59
c6g.16xlarge [64 vCPU] | 490.91 | $2.18 | $17.80

Table 1: Nalu 2E6-element simulation. Running on the C6g Graviton2 instance gives nearly 10% cost savings.
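
The relative differences in the table above can be checked with a few lines of shell. This sketch uses only the values listed in the table; percent_change is a hypothetical helper name.

```shell
# percent_change OLD NEW  ->  signed percentage change from OLD to NEW
percent_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * (b - a) / a }'
}

percent_change 19.59 17.80     # simulation price, M6g -> C6g: -9.1
percent_change 477.12 490.91   # run time, M6g -> C6g: 2.9
percent_change 2.46 2.18       # instance price, M6g -> C6g: -11.4
```

The C6g run takes about 2.9% longer but costs about 9.1% less, which is the "nearly 10%" saving noted in the caption.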

Summary

Exploring HPC application deployment on AWS Graviton2 resources has shown efficacy in terms of both performance and price. This exercise has demonstrated that complicated scientific applications with many dependent packages can be successfully built and run on Arm-based distributed memory systems.


Citations

[1] Domino, S. "Sierra Low Mach Module: Nalu Theory Manual 1.0", SAND2015-3107W, Sandia National Laboratories Unclassified Unlimited Release (UUR), 2015. https://github.com/NaluCFD/NaluDoc.

[2] H. Edwards, A. Williams, G. Sjaardema, D. Baur, and W. Cochran. Sierra toolkit computational mesh computational model. Technical Report SAND-20101192, Sandia National Laboratories, Albuquerque, NM, 2010.

[3] M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, J. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An overview of Trilinos. Technical Report SAND-20032927, Sandia National Laboratories, Albuquerque, NM, 2003.

[4] S. Tieszen, S. Domino, and A. Black. Validation of a simple turbulence model suitable for closure of temporally-filtered Navier-Stokes equations using a helium plume. Technical Report SAND-20053210, Sandia National Laboratories, Albuquerque, NM, June 2005.

[5] A. M. Agelastos, P. T. Lin. Simulation Information Regarding Sandia National Laboratories’ Trinity Capability Improvement Metric. Sandia Report SAND2013-8748, Sandia National Laboratories, Albuquerque, NM, October 17, 2013.

[6] P. T. Lin, M. T. Bettencourt, S. Domino, T. Fisher, M. Hoemmen, J. J. Hu, E. T. Phipps, A. Prokopenko, S. Rajamanickam, C. M. Siefert and Stephen Kennon, Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos, Parallel Process. Lett., vol. 24, 2014.
