The accuracy of weather and climate predictions is becoming increasingly important. The ability to predict extreme events in advance helps us prepare and thus minimize their impact. Growing HPC capability, together with advances in weather and climate modeling, is key to improving these predictions. One well-known EU and UK ocean modeling and operational forecasting application is NEMO.
Traditionally, models like NEMO are run on large on-premise HPC systems with thousands of compute cores, but weather researchers are increasingly looking at ways to move their applications to the cloud [1]. This blog discusses NEMO’s performance on the AWS Graviton2 processor, which is based on Arm’s Neoverse N1 cores.
NEMO (Nucleus for European Modelling of the Ocean) provides an extensive framework for oceanographic research, operational oceanography, seasonal forecasting and climate studies. NEMO has been developed since 2008 by a consortium of five European institutes.
NEMO’s ocean model consists of three major components: the ocean dynamics and thermodynamics (NEMO-OCE), the sea-ice dynamics and thermodynamics (NEMO-SI3), and the passive tracers and biogeochemistry (NEMO-TOP, which includes PISCES).
This flexible framework enables NEMO to be used as a tool for studying the ocean and its interactions with the other components of the earth climate system (atmosphere, sea-ice, biogeochemical tracers) over a wide range of space and time scales.
NEMO version 4.0.1 is available via SVN
svn co http://forge.ipsl.jussieu.fr/nemo/svn/NEMO/releases/release-4.0.1
The code is implemented in Fortran with MPI for parallelization. NEMO also requires the following external libraries: an MPI distribution (e.g. Open MPI, MPICH, etc.), HDF5, NetCDF-C and NetCDF-Fortran.
For a fair comparison of code generation and performance across the different CPU architectures, we use the GNU compiler toolchain (GCC). In the following, we only describe how to build the NEMO source; we assume installations of Open MPI 4.0.3, HDF5 1.10.5, NetCDF-C 4.7.0 and NetCDF-Fortran 4.4.5 are already available.
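For completeness, the sketch below shows one way this I/O stack could be built with GCC and Open MPI. The install prefix, tarball names and configure options are assumptions for illustration, not the exact recipe used for the results in this blog.

# Hedged sketch: build HDF5 and NetCDF with the MPI compiler wrappers.
export PREFIX=/opt/nemo-deps                 # hypothetical install prefix
export CC=mpicc FC=mpif90
export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH

# HDF5 with parallel I/O and Fortran support
tar xf hdf5-1.10.5.tar.gz && cd hdf5-1.10.5
./configure --prefix=$PREFIX --enable-fortran --enable-parallel
make -j && make install && cd ..

# NetCDF-C on top of HDF5
tar xf netcdf-c-4.7.0.tar.gz && cd netcdf-c-4.7.0
CPPFLAGS=-I$PREFIX/include LDFLAGS=-L$PREFIX/lib ./configure --prefix=$PREFIX
make -j && make install && cd ..

# NetCDF-Fortran against NetCDF-C
tar xf netcdf-fortran-4.4.5.tar.gz && cd netcdf-fortran-4.4.5
CPPFLAGS=-I$PREFIX/include LDFLAGS=-L$PREFIX/lib ./configure --prefix=$PREFIX
make -j && make install && cd ..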
Instructions on how to build NEMO are given on the Arm community GitLab pages. NEMO uses the FCM build system developed by the UK Met Office. Therefore, before starting a build, we only need to create a configuration file in the directory
release-4.0.1/arch/
An example GCC configuration file (arch-aarch64_gnu.fcm) is described here. For AWS Graviton2, set:
%FCFLAGS -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g
Note that for the non-Arm-based processors we use -march=native in place of -mcpu=neoverse-n1.
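As a rough orientation, a minimal sketch of what arch-aarch64_gnu.fcm might contain is given below, created with a shell here-document from the release-4.0.1/ directory. The library paths are placeholders and the variable set follows the standard NEMO arch-file layout; the full example on the Arm community GitLab pages should be preferred.

# Hedged sketch of a GCC arch file for AWS Graviton2; /opt/nemo-deps is a
# placeholder and should match your HDF5/NetCDF installation.
cat > arch/arch-aarch64_gnu.fcm << 'EOF'
%HDF5_HOME           /opt/nemo-deps
%NCDF_HOME           /opt/nemo-deps
%NCDF_INC            -I%NCDF_HOME/include
%NCDF_LIB            -L%NCDF_HOME/lib -lnetcdff -lnetcdf -L%HDF5_HOME/lib -lhdf5_hl -lhdf5
%CPP                 cpp
%FC                  mpif90 -c -cpp
%FCFLAGS             -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g
%FFLAGS              %FCFLAGS
%LD                  mpif90
%LDFLAGS
%FPPFLAGS            -P -C -traditional
%AR                  ar
%ARFLAGS             rs
%MK                  make
%USER_INC            %NCDF_INC
%USER_LIB            %NCDF_LIB
EOF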
The two NEMO benchmarks used for this testing are both distributed with the NEMO source. The first benchmark is a 1° resolution setup of the BENCH test and will be referred to as BENCH-1. This test is described in a technical report [2] and has 9,013,800 grid points. It was designed to be robust in terms of numerical stability and is therefore well suited to benchmarking and performance analysis.
The second benchmark is a 1/25° resolution setup of the GYRE_PISCES configuration and will be referred to as GYRE_PISCES_25. This is an idealized configuration representing double gyres in the northern hemisphere. A beta-plane is used with a regular horizontal grid spacing of 1/25° and 101 vertical levels. The model is forced with analytical heat, freshwater and wind-stress fields. The configuration is coupled to the PISCES biogeochemical model [3] and has 37,875,000 grid points.
The benchmarks can be built from the directory release-4.0.1/
For BENCH-1:
./makenemo -m aarch64_gnu -a BENCH -n 'MY_BENCH_gnu' del_key key_iomput add_key key_nosignedzero
and for GYRE_PISCES_25:
./makenemo -m aarch64_gnu -r GYRE_PISCES -n 'MY_GYRE_PISCES_gnu' del_key key_iomput add_key key_nosignedzero
After a successful build, for BENCH-1, there should be a new directory:
release-4.0.1/tests/MY_BENCH_gnu/EXP00
And for GYRE_PISCES_25:
release-4.0.1/cfgs/MY_GYRE_PISCES_gnu/EXP00
Within those directories there is a symbolic link to the executable nemo.
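A hedged sketch of a single-node run follows; these commands are illustrative rather than the exact scripts used for the results below. For GYRE_PISCES_25, the resolution factor and number of vertical levels are controlled by the &namusr_def block of namelist_cfg in EXP00 (nn_GYRE and jpkglo in the NEMO 4.0 reference namelist), so check and adjust that file before launching.

# Hedged sketch: single-node run with one MPI rank per core on m6g.16xlarge.
cd tests/MY_BENCH_gnu/EXP00        # or cfgs/MY_GYRE_PISCES_gnu/EXP00
mpirun -np 64 ./nemo
tail ocean.output                  # run log; time.step tracks the current time step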
For the performance comparison, we ran BENCH-1 and GYRE_PISCES_25 across several AWS instance types (m6g.16xlarge, m5a.16xlarge, m5.16xlarge, m5n.16xlarge and m4.16xlarge). Each instance type has 64 vCPUs (a vCPU corresponds to a physical core on m6g and to a hardware thread on the x86-based instances), 256 GiB of memory, AWS Enhanced Networking and Elastic Block Store (EBS) storage.
AWS Instance   | Processor                     | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps)
m6g.16xlarge   | AWS Graviton2                 | 25                       | 19
m5a.16xlarge   | AMD EPYC 7571                 | 12                       | 9.5
m5.16xlarge    | Intel® Xeon® Platinum 8175M   | 20                       | 13.6
m5n.16xlarge   | Intel® Xeon® Platinum 8259CL  | 75                       | 13.6
m4.16xlarge    | Intel® Xeon® E5-2686 v4       | 25                       | 10
For this comparison, both BENCH-1 and GYRE_PISCES_25 were run on a single node; neither is sensitive to I/O effects. We used GCC 9.3 to build NEMO on all instances except m6g.16xlarge, where we used GCC 9.2. Figure 1 shows the time to solution for the BENCH-1 test case; lower is better.
The AWS Graviton2 shows the best performance on this metric. The larger L1 data cache of the Graviton2 (64 KB versus 32 KB per core) helps, since data access in BENCH-1 relies on efficient cache reuse rather than vectorization.
Figure 1: Time to Solution for the BENCH-1 Test
With GYRE_PISCES_25, as shown in Figure 2, the AWS Graviton2 comes in third place behind the two Intel® Xeon® Platinum based instances. GYRE_PISCES_25 is heavily memory-bandwidth bound, and those instances span two sockets with 12 memory channels of DDR4-2666/2933 in total, compared with the single-socket Graviton2’s 8 memory channels of DDR4-3200. On a per-socket (vCPUs per socket) basis, the AWS Graviton2 would therefore be the best choice. In terms of cost per simulation (combining cost in $ and turnaround time), the AWS Graviton2 leads, as shown in Figure 3.
Figure 2: Time to Solution for the GYRE_PISCES_25 Configuration Test
Figure 3: Cost per Simulation for the GYRE_PISCES_25 Configuration Test
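For reference, the cost-per-simulation metric can be thought of as the on-demand hourly price multiplied by the time to solution. The sketch below illustrates the arithmetic with example prices and an example runtime; these are placeholders, not the measured values behind Figure 3.

# Hedged sketch: cost per simulation = hourly on-demand price x time to solution.
# Prices and runtime below are illustrative; actual prices vary by region and over time.
runtime_hours=0.50
awk -v t="$runtime_hours" 'BEGIN { printf "m6g.16xlarge: %.2f USD\n", 2.464 * t }'
awk -v t="$runtime_hours" 'BEGIN { printf "m5.16xlarge:  %.2f USD\n", 3.072 * t }'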
GYRE_PISCES_25 is suitable for running across several nodes and therefore for testing the performance of AWS Enhanced Networking. Figure 4 shows results for this benchmark on up to 4 nodes of AWS Graviton2 instances, using 64 MPI tasks per node. Since the overall number of grid points is fixed, the number of grid points per MPI task decreases as more nodes are added. Given the memory-bound nature of this test (as shown in Figure 2), scalability depends on inter-node network performance. Here, the performance of AWS Enhanced Networking enables excellent scalability.
Figure 4: Running the GYRE_PISCES_25 Configuration Test with Several Nodes
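A hedged sketch of how such a multi-node launch might look with Open MPI is shown below; the hostfile entries are placeholder private IP addresses, and a shared filesystem for the EXP00 directory is assumed.

# Hedged sketch: 4 nodes x 64 MPI ranks per node = 256 MPI tasks.
cat > hostfile << 'EOF'
10.0.0.11 slots=64
10.0.0.12 slots=64
10.0.0.13 slots=64
10.0.0.14 slots=64
EOF
mpirun -np 256 --hostfile hostfile ./nemo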
We have covered the steps taken to run a leading EU/UK ocean modeling application, NEMO, on the AWS Graviton2, which is based on Arm Neoverse N1 cores. Building NEMO and its external packages was straightforward using the GCC toolchain, and excellent performance was easy to achieve.
Comparing NEMO performance across AWS instances, the AWS Graviton2 is a strong platform for achieving great performance, thanks to its per-socket vCPU count and its memory bandwidth for bandwidth-bound applications. At the time of writing, M6g instances are also available at a 20% lower price than equivalently configured M5 instances; for NEMO, this means at least 10% lower cost per simulation than on the M5 instances.
[CTAToken URL = "https://www.arm.com/solutions/infrastructure" target="_blank" text="See Arm Infrastructure solutions for HPC" class ="green"]
[1] Siuta, D., West, G., Modzelewski, H., Schigas, R. and Stull, R., 2016: Viability of Cloud Computing for Real-Time Numerical Weather Prediction, Weather and Forecasting, 31(6), 1985–1996.
[2] Maisonnave, E. and Masson, S., 2019: NEMO 4.0 performance: how to identify and reduce unnecessary communications, Technical Report, TR/CMGC/19/19, CECI, UMR CERFACS/CNRS No5318, France.
[3] Aumont, O., Ethé, C., Tagliabue, A., Bopp, L. and Gehlen, M., 2015: PISCES-v2: an ocean biogeochemical model for carbon and ecosystem studies, Geosci. Model Dev., 8, 2465–2513, 2015.