Arm Community
High Performance Computing
HPC blog
Demonstration of low mach-number CFD modeling with Nalu on AWS Graviton2 M6g instances

Srinath Vadlamani, PhD
June 25, 2020

Computational Fluid Dynamics (CFD) simulations are important both for the physical processes they model with predictive goals and for the numerical and experimental technology research they drive. The low Mach-number Nalu CFD code from Sandia National Labs (SNL) [1] exercises cutting-edge numerical methods for CFD systems that are also used in SNL mission-critical simulation codes. Nalu exercises many software packages, such as HDF5, PNetCDF, Trilinos, and Kokkos, which are integral components of those mission-critical codes. Understanding the performance of Nalu and its software stack on emerging CPU technology is therefore beneficial to members of the scientific HPC community.

About Nalu 

The SIERRA Low Mach Module: Nalu (henceforth referred to as Nalu), developed at Sandia National Labs, represents a generalized unstructured, massively parallel, variable-density turbulent flow capability designed for energy applications. This code base began as an effort to prototype Sierra Toolkit [2] usage along with direct parallel matrix assembly into the Trilinos [3] Epetra and Tpetra data structures. However, the simulation tool has evolved to support a variety of research projects germane to the energy sector, including wind aerodynamic prediction and traditional gas-phase combustion applications.

Building Nalu on AWS

The build process follows the directions listed in the Nalu online documentation and detailed at https://gitlab.com/arm-hpc/packages/-/wikis/packages/nalucfd. The Arm-based AWS Graviton2 system can take advantage of the Arm Performance Libraries (ArmPL) and Arm Compiler for Linux (ACFL). The following build directions can help guide building HPC applications on current and future Arm-based systems. ACFL 20.1 and OpenMPI 4.0.3 with UCX 1.8.0 were used for this exercise. The software dependencies are:

  • SuperLU
  • libXML2
  • Boost
  • Yaml-cpp
  • zlib
  • HDF5
  • Parallel-NetCDF
  • NetCDF
  • Trilinos

It is helpful to set commonly used variables in a sourced file. For example:

$ cat /home/student003/srivad01/nalu/source.this
module load openmpi/acfl-20.1/4.0.3
module load Neoverse-N1/RHEL/7/arm-linux-compiler-20.1/armpl/20.1.0

export nalu_build_dir=/home/student002/srivad01/nalu/build_dir/arm20.1
export nalu_install_dir=/home/student002/srivad01/nalu/install_dir/arm20.1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/bin/../../gcc-9.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64

export PATH=/home/student002/srivad01/nalu/install_dir/arm20.1/cmake/3.14.7/bin:${PATH}

These build directions can be modified to accommodate more packages in the Sierra Toolkit.

Simulation Description

The test (milestone) problem is a mixture-fraction-based turbulent open jet with a Reynolds number of ~6000. The jet emanates from the base of the domain. Mesh sizes range from 2.73E5 to 8.9E9 elements; this experiment uses a mesh of ~2E6 elements. Approximately half of the total unknown variables are in the momentum solve. The elliptic Poisson pressure (continuity), mixture-fraction, and turbulent kinetic energy systems make up the remaining unknowns in nearly equal shares. Trilinos is the primary package for solving the systems, using Kokkos for data and memory management across the hierarchy of memory systems on the compute nodes.
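The split of unknowns described above can be sanity-checked with a quick back-of-the-envelope count. The one-scalar-unknown-per-element-per-field accounting below is an illustrative assumption, not taken from the Nalu documentation:

```python
# Back-of-the-envelope unknown count for the ~2E6-element milestone problem.
# Assumption (for illustration only): one scalar unknown per element per field.
elements = 2e6

unknowns = {
    "momentum": 3 * elements,          # velocity components u, v, w
    "pressure": 1 * elements,          # elliptic Poisson (continuity) system
    "mixture_fraction": 1 * elements,
    "turbulent_ke": 1 * elements,
}

total = sum(unknowns.values())
for field, n in unknowns.items():
    print(f"{field:>16}: {n / total:.0%} of the unknowns")
```

Under this accounting, momentum carries half the unknowns and the three scalar systems split the rest nearly equally, consistent with the description above.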

The accompanying figure shows simulation visualization results from the Trinity acceptance campaign [5,6].

AWS M6g Timing Results

This exercise on the AWS M6g instances uses the R1 mesh with 2E5 elements for the single-node exploration. The mesh_R2c-nt.g mesh, with 2E6 elements, HDF5 compression, and no initial time-step information in the mesh file, was used for the multi-node exploration. The intent was to have enough unknown variables to require substantial computational effort after decomposition across parallel processing elements.

MPI scaling

Full MPI rank saturation was assumed for the 1- to 8-node scaling study. Only compiler optimization via SIMD and architecture flags was used (as noted in the build directions).


Figure 1: Speedup (strong scaling) from 32 cores on a single socket up to 64 cores on each of 8 nodes.

Noticeable compute improvement is observed at larger node counts. The minimal speedup from 4 to 8 nodes suggests repeating the exploration with more unknown variables via a larger mesh.
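The speedup and parallel efficiency behind a study like Figure 1 follow directly from measured wall times. A minimal helper, using hypothetical placeholder timings rather than the measured Figure 1 data, might look like:

```python
# Strong-scaling metrics relative to a fixed baseline run.
def strong_scaling(baseline_time, baseline_cores, runs):
    """runs: list of (cores, wall_time) pairs; returns (cores, speedup, efficiency) tuples."""
    results = []
    for cores, t in runs:
        speedup = baseline_time / t
        efficiency = speedup / (cores / baseline_cores)
        results.append((cores, speedup, efficiency))
        print(f"{cores:5d} cores: speedup {speedup:5.2f}, efficiency {efficiency:.0%}")
    return results

# Hypothetical wall times in seconds -- NOT the blog's measurements.
strong_scaling(1000.0, 32, [(64, 520.0), (256, 150.0), (512, 120.0)])
```

An efficiency well below 100% at the largest core count, as in the 4-to-8-node step above, is the usual signature of a workload that is too small to saturate the added cores.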

Performance Exploration

Arm Forge can be exercised on AWS systems for performance analysis and debugging. In this exercise, the timings for four and eight nodes are quite similar.

Figure 2: Nalu milestone case on 4 and 8 Graviton2 M6g nodes.

This initial performance comparison shows an increase in the MPI activity time ratio past four nodes, suggesting that the workload does not saturate the larger compute core count. Given that the MAP profiles show the application using only ~228 MB/rank (~14.5 GB/node), it makes financial sense to run this problem on a C6g instance (128 GB/node) rather than the M6g (256 GB/node).
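The per-node memory estimate above is simple arithmetic worth making explicit (64 ranks per 16xlarge instance, decimal GB; the result matches the ~14.5 GB/node figure to within rounding):

```python
# Per-node memory footprint from the MAP profile figure of ~228 MB per rank.
mb_per_rank = 228
ranks_per_node = 64          # one rank per vCPU on a 16xlarge instance

gb_per_node = mb_per_rank * ranks_per_node / 1000   # decimal GB
print(f"~{gb_per_node:.1f} GB per node")

# Well under the 128 GB of a c6g.16xlarge, so the m6g's 256 GB is unused headroom.
assert gb_per_node < 128
```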

AWS Graviton2 instance    Time [s]    Price per instance-hour [USD]    Sim. price [USD]
m6g.16xlarge [64 vCPU]    477.12      $2.46                            $19.59
c6g.16xlarge [64 vCPU]    490.91      $2.18                            $17.80

Table 1: Nalu 2E6-element simulation; the C6g Graviton2 instance delivers nearly 10% cost savings.
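The cost claim can be verified directly from the table's own numbers:

```python
# Cost savings and runtime penalty of c6g vs. m6g, from Table 1.
m6g_time, m6g_sim_price = 477.12, 19.59   # seconds, USD
c6g_time, c6g_sim_price = 490.91, 17.80

savings = 1 - c6g_sim_price / m6g_sim_price
penalty = c6g_time / m6g_time - 1
print(f"cost savings: {savings:.1%}, runtime penalty: {penalty:.1%}")
# -> cost savings: 9.1%, runtime penalty: 2.9%
```

A ~9% lower bill for a ~3% longer run is the trade the text describes: for this memory-light workload, the cheaper instance wins.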

Summary

Exploration of HPC application deployment on AWS's Graviton2 resources has shown efficacy in both performance and price. This exercise has demonstrated that complicated scientific applications requiring many dependent packages can be successfully built and exercised on Arm-based distributed-memory systems.

Explore HPC on Arm

Citations

[1] S. Domino. "Sierra Low Mach Module: Nalu Theory Manual 1.0". SAND2015-3107W, Sandia National Laboratories Unclassified Unlimited Release (UUR), 2015. https://github.com/NaluCFD/NaluDoc.

[2] H. Edwards, A. Williams, G. Sjaardema, D. Baur, and W. Cochran. Sierra Toolkit computational mesh computational model. Technical Report SAND-20101192, Sandia National Laboratories, Albuquerque, NM, 2010.

[3] M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, J. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An overview of Trilinos. Technical Report SAND-20032927, Sandia National Laboratories, Albuquerque, NM, 2003.

[4] S. Tieszen, S. Domino, and A. Black. Validation of a simple turbulence model suitable for closure of temporally-filtered Navier-Stokes equations using a helium plume. Technical Report SAND-20053210, Sandia National Laboratories, Albuquerque, NM, June 2005.

[5] A. M. Agelastos and P. T. Lin. Simulation Information Regarding Sandia National Laboratories' Trinity Capability Improvement Metric. Sandia Report SAND2013-8748, Sandia National Laboratories, Albuquerque, NM, October 17, 2013.

[6] P. T. Lin, M. T. Bettencourt, S. Domino, T. Fisher, M. Hoemmen, J. J. Hu, E. T. Phipps, A. Prokopenko, S. Rajamanickam, C. M. Siefert, and S. Kennon. Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos. Parallel Process. Lett., vol. 24, 2014.
