Computational Fluid Dynamics (CFD) simulations are of great importance both for the physical processes they model with predictive goals and for the technology research they support, numerically and experimentally. The low-Mach-number Nalu CFD code from Sandia National Laboratories (SNL) [1] exercises cutting-edge numerical methods for CFD systems that are also used in SNL mission-critical simulation codes. Nalu exercises many software packages, such as HDF5, PNetCDF, Trilinos, and Kokkos, which are integral components of other mission-critical codes. Understanding the performance of Nalu and its software stack on emerging CPU technology is beneficial to members of the scientific HPC community.
The SIERRA Low Mach Module: Nalu (henceforth referred to as Nalu), developed at Sandia National Laboratories, represents a generalized unstructured, massively parallel, variable-density turbulent flow capability designed for energy applications. The code base began as an effort to prototype Sierra Toolkit [2] usage along with direct parallel matrix assembly into the Trilinos [3] Epetra and Tpetra data structures. However, the simulation tool has evolved to support a variety of research projects germane to the energy sector, including wind aerodynamics prediction and traditional gas-phase combustion applications.
The build process follows the directions listed in the Nalu online documentation. The Arm-based AWS Graviton2 system can take advantage of the Arm Performance Libraries (ArmPL) and the Arm Compiler for Linux (ACFL). The following build directions can help guide building HPC applications on current and future Arm-based systems. ACFL 20.1 and OpenMPI 4.0.3 with UCX 1.8.0 were used for this exercise, following the directions detailed at https://gitlab.com/arm-hpc/packages/-/wikis/packages/nalucfd. The software dependencies include packages such as HDF5, PNetCDF, Trilinos, and Kokkos.
It is helpful to set commonly used variables in a sourced file. For example:
$ cat /home/student003/srivad01/nalu/source.this
module load openmpi/acfl-20.1/4.0.3
module load Neoverse-N1/RHEL/7/arm-linux-compiler-20.1/armpl/20.1.0
export nalu_build_dir=/home/student002/srivad01/nalu/build_dir/arm20.1
export nalu_install_dir=/home/student002/srivad01/nalu/install_dir/arm20.1
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/arm/arm-linux-compiler-20.1_Generic-AArch64_RHEL-7_aarch64-linux/bin/../../gcc-9.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64
export PATH=/home/student002/srivad01/nalu/install_dir/arm20.1/cmake/3.14.7/bin:${PATH}
These build directions can be modified to accommodate more packages in the Sierra Toolkit.
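To make the workflow concrete, the sketch below shows how the sourced variables above might be consumed when configuring one of the dependencies, using Trilinos as the example. This is only a sketch: the source-tree location under ${nalu_build_dir} is a placeholder and the option list is trimmed; the complete per-package configure options are documented on the gitlab wiki page referenced earlier.

# Hypothetical configure step for Trilinos using the OpenMPI compiler
# wrappers around ACFL; "-mcpu=native" supplies the architecture tuning
# mentioned in the build directions.
source /home/student003/srivad01/nalu/source.this
cmake \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_Fortran_COMPILER=mpifort \
  -DCMAKE_CXX_FLAGS="-O3 -mcpu=native" \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_INSTALL_PREFIX=${nalu_install_dir}/trilinos \
  -DTPL_ENABLE_MPI=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  ${nalu_build_dir}/packages/Trilinos   # placeholder source location
make -j $(nproc) && make install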
The test (milestone) problem is a mixture-fraction-based turbulent open jet with a Reynolds number of ~6000. The jet emanates from the base of the domain. Mesh sizes range from 2.73E5 to 8.9E9 elements; this experiment uses a mesh of ~2E6 elements. Approximately half of the total number of unknowns are in the momentum solve. The elliptic Poisson pressure (continuity), mixture-fraction, and turbulent kinetic energy systems comprise nearly equal shares of the remaining unknowns. Trilinos is the primary package for solving these systems, with Kokkos handling data and memory management across the memory hierarchy of the compute nodes.
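As a rough check on that breakdown (assuming the unknowns per mesh node are the three momentum components plus one each for pressure, mixture fraction, and turbulent kinetic energy), momentum accounts for 3/(3+1+1+1) = 50% of the unknowns, and the other three systems split the remainder evenly.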
The figure on the right shows simulation visualization results from the Trinity acceptance campaign [5,6].
This exercise on the AWS M6g instances uses the R1 mesh size with 2E5 elements for the single-node exploration. The mesh_R2c-nt.g mesh, with 2E6 elements, HDF5 compression, and no initial time-step information in the mesh file, was used for the multi-node exploration. The intent was to have enough unknowns to require a large computational effort after decomposition across parallel processing elements.
Full MPI rank saturation was assumed for the 1- to 8-node scaling study. Only compiler optimization via SIMD and architecture flags was used (as noted in the build directions).
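For reference, a launch along the following lines saturates all 64 cores on each of the 8 M6g nodes (512 ranks total). This is only a sketch: the hostfile and input-deck names are placeholders, and the Nalu driver is assumed to be invoked as naluX with a -i input file.

# Hypothetical 8-node, fully saturated strong-scaling launch
mpirun -np 512 --hostfile hosts.8node \
       --map-by core --bind-to core \
       ./naluX -i milestone_R2.i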
Figure 1: Speedup (strong scaling) from 32 cores on a single socket to 64 cores on each of 8 nodes.
Noticeable compute improvement is observed at larger node counts. The minimal speedup from 4 to 8 nodes suggests repeating the exploration with more unknowns via a larger mesh.
Arm Forge can be exercised on AWS systems for performance analysis and debugging. In this exercise, the timings for four and eight nodes are quite similar.
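As a sketch (reusing the hypothetical launch line from above), a MAP profile can be collected non-interactively and opened later in the Forge GUI or remote client:

# Collects a .map profile of the run without launching the GUI
map --profile mpirun -np 512 ./naluX -i milestone_R2.i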
Figure 2: Nalu milestone case on 4 and 8 Graviton2 M6g nodes.
This initial performance comparison shows an increase in the MPI activity time ratio past four nodes, suggesting the workload does not saturate the larger compute core count. Given that the MAP profiles show the application using only ~228 MB/rank (about 14.5 GB/node), it makes financial sense to run this problem on a C6g instance (128 GB/node) rather than the M6g (256 GB/node).
Table 1: Nalu 2E6-element simulation on a C6g Graviton2 instance, showing nearly 10% cost savings.
Exploration of HPC application deployment on AWS Graviton2 resources has shown efficacy in both performance and price. This exercise has demonstrated that complicated scientific applications requiring many dependent packages can be successfully built and exercised on Arm-based distributed-memory systems.
[CTAToken URL = "https://community.arm.com/b/hpc" target="_blank" text="Explore HPC on Arm" class ="green"]
[1] Domino, S. "Sierra Low Mach Module: Nalu Theory Manual 1.0", SAND2015-3107W, Sandia National Laboratories Unclassified Unlimited Release (UUR), 2015. https://github.com/NaluCFD/NaluDoc.
[2] H. Edwards, A. Williams, G. Sjaardema, D. Baur, and W. Cochran. Sierra Toolkit Computational Mesh Computational Model. Technical Report SAND2010-1192, Sandia National Laboratories, Albuquerque, NM, 2010.
[3] M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, J. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams. An Overview of Trilinos. Technical Report SAND2003-2927, Sandia National Laboratories, Albuquerque, NM, 2003.
[4] S. Tieszen, S. Domino, and A. Black. Validation of a Simple Turbulence Model Suitable for Closure of Temporally-Filtered Navier-Stokes Equations Using a Helium Plume. Technical Report SAND2005-3210, Sandia National Laboratories, Albuquerque, NM, June 2005.
[5] A. M. Agelastos and P. T. Lin. Simulation Information Regarding Sandia National Laboratories' Trinity Capability Improvement Metric. Technical Report SAND2013-8748, Sandia National Laboratories, Albuquerque, NM, October 17, 2013.
[6] P. T. Lin, M. T. Bettencourt, S. Domino, T. Fisher, M. Hoemmen, J. J. Hu, E. T. Phipps, A. Prokopenko, S. Rajamanickam, C. M. Siefert, and S. Kennon. Towards Extreme-Scale Simulations for Low Mach Fluids with Second-Generation Trilinos. Parallel Processing Letters, vol. 24, 2014.