Evaluation of the NEMO Ocean Model on Arm Neoverse-based AWS Graviton2

Phil Ridley
June 22, 2020
8 minute read time.

The accuracy of weather and climate predictions is becoming increasingly important. The ability to predict extreme events in advance helps us prepare and thus minimize their impact, and increasing the capability of both HPC systems and the models themselves is key to improving these predictions. One well-known EU and UK ocean modeling and operational forecasting application is NEMO.

Traditionally, models like NEMO are run on large on-premise HPC systems with thousands of computing cores, but weather researchers are increasingly looking at ways to move their applications to the cloud [1]. This blog discusses NEMO's performance on the AWS Graviton2 processor, which is based on Arm Neoverse N1 cores.

Background to NEMO

NEMO (Nucleus for European Modeling of the Ocean) provides an extensive framework for oceanographic research, operational oceanography, seasonal forecasting, and climate studies. NEMO has been developed since 2008 by a consortium of five European institutes:

  • The Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC)
  • The Centre National de la Recherche Scientifique (CNRS)
  • Mercator Ocean
  • The UK Met Office
  • The UK Natural Environment Research Council National Oceanography Centre (NERC-NOC)

NEMO’s ocean model consists of three major components:

  • An ocean dynamics and thermodynamics solver based on the primitive equations (NEMO-OPA).
  • A sea-ice dynamics and thermodynamics simulator, aware of subgrid-scale variations (NEMO-SI3).
  • An online/offline oceanic tracer transport and biogeochemical processes solver (NEMO-TOP/PISCES).

This flexible framework enables NEMO to be used as a tool for studying the ocean and its interactions with the other components of the Earth climate system (atmosphere, sea ice, biogeochemical tracers) over a wide range of space and time scales.

The NEMO Source Code

NEMO version 4.0.1 is available via SVN:

svn co http://forge.ipsl.jussieu.fr/nemo/svn/NEMO/releases/release-4.0.1 

The code is implemented in Fortran with MPI for parallelization. NEMO also requires the following external libraries: an MPI distribution (for example, Open MPI or MPICH), HDF5, NetCDF-C, and NetCDF-Fortran.
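
If these libraries are not already available, they can be built from source with the same toolchain. The following is a minimal sketch, assuming the source trees have been unpacked and using /opt install prefixes as placeholders; adapt the paths to your system.

# Hedged sketch: building the I/O stack with GCC and Open MPI (prefixes are placeholders)
export CC=mpicc FC=mpif90
(cd hdf5-1.10.5 && ./configure --prefix=/opt/hdf5 --enable-fortran --enable-parallel && make -j && make install)
(cd netcdf-c-4.7.0 && CPPFLAGS=-I/opt/hdf5/include LDFLAGS=-L/opt/hdf5/lib ./configure --prefix=/opt/netcdf && make -j && make install)
(cd netcdf-fortran-4.4.5 && CPPFLAGS=-I/opt/netcdf/include LDFLAGS="-L/opt/netcdf/lib -L/opt/hdf5/lib" ./configure --prefix=/opt/netcdf && make -j && make install)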

Building NEMO

For a fair comparison of code generation and performance across the different CPU architectures, we use the GNU compiler toolchain (GCC). In the following, we describe only how to build the NEMO source, and we assume installations of Open MPI 4.0.3, HDF5 1.10.5, NetCDF-C 4.7.0, and NetCDF-Fortran 4.4.5 are already available.

Instructions on how to build NEMO are given on the Arm Community GitLab pages. NEMO uses the FCM build system developed by the UK Met Office, so before starting a build we only need to create a configuration file in the directory:

release-4.0.1/arch/

An example GCC configuration file (arch-aarch64_gnu.fcm) is described here. For the AWS Graviton2, set:

%FCFLAGS      -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g              

Note that for the non-Arm-based processors we need -march=native in place of -mcpu=neoverse-n1.
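
For context, a complete arch file also points FCM at the compiler wrappers and library paths. The excerpt below is a hypothetical sketch rather than the exact file used in this study: the /opt paths and the mpif90 wrapper are placeholder assumptions.

# Hypothetical arch-aarch64_gnu.fcm sketch (library paths are placeholders)
%NCDF_HOME    /opt/netcdf
%HDF5_HOME    /opt/hdf5
%FC           mpif90 -c -cpp
%FCFLAGS      -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g
%LD           mpif90
%LDFLAGS
%AR           ar
%ARFLAGS      rs
%MK           make
%USER_INC     -I%NCDF_HOME/include -I%HDF5_HOME/include
%USER_LIB     -L%NCDF_HOME/lib -lnetcdff -lnetcdf -L%HDF5_HOME/lib -lhdf5_hl -lhdf5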

NEMO Benchmarks

The two NEMO benchmarks used for this testing are both distributed with the NEMO source. The first is a 1° resolution setup of the BENCH test, referred to here as BENCH-1. This test is described in a technical report [2] and has 9,013,800 grid points. It was designed to be robust in terms of numerical stability and is therefore well suited to benchmarking and performance analysis.

The second benchmark is a setup of the GYRE_PISCES configuration at a horizontal resolution factor of 25, referred to here as GYRE_PISCES_25. This is an idealized configuration representing double gyres in the northern hemisphere, using a beta-plane with a regular grid spacing and 101 vertical levels. The model is forced with analytical heat, freshwater, and wind-stress fields. The configuration is coupled to the PISCES biogeochemical model [3] and has 37,875,000 grid points.

The benchmarks can be built from the release-4.0.1/ directory:

BENCH-1

./makenemo -m aarch64_gnu -a BENCH -n 'MY_BENCH_gnu' del_key key_iomput add_key key_nosignedzero

GYRE_PISCES_25 

./makenemo -m aarch64_gnu -r GYRE_PISCES -n 'MY_GYRE_PISCES_gnu' del_key key_iomput add_key key_nosignedzero

After a successful build, for BENCH-1, there should be a new directory:

release-4.0.1/tests/MY_BENCH_gnu/EXP00 

And for GYRE_PISCES_25:

release-4.0.1/cfgs/MY_GYRE_PISCES_gnu/EXP00 

Within those directories there is a symbolic link to the executable, nemo.
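
Launching a run then amounts to invoking this executable under MPI from the EXP00 directory. A minimal sketch for a full single node (64 ranks, matching the instances below) might be:

cd release-4.0.1/tests/MY_BENCH_gnu/EXP00
mpirun -np 64 ./nemo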

For the performance comparison, we ran BENCH-1 and GYRE_PISCES_25 across several AWS instance types (m6g.16xlarge, m5a.16xlarge, m5.16xlarge, m5n.16xlarge, and m4.16xlarge). Each instance type has 64 vCPUs (with a vCPU corresponding to 1 CPU core), 256 GiB of memory, AWS Enhanced Networking, and Elastic Block Store (EBS) storage.

AWS Instance | Processor                    | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps)
m6g.16xlarge | AWS Graviton2                | 25                       | 19
m5a.16xlarge | AMD EPYC 7571                | 12                       | 9.5
m5.16xlarge  | Intel® Xeon® Platinum 8175M  | 20                       | 13.6
m5n.16xlarge | Intel® Xeon® Platinum 8259CL | 75                       | 13.6
m4.16xlarge  | Intel® Xeon® E5-2686 v4      | 25                       | 10
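
For reference, instances of each type can be launched with the AWS CLI; the AMI ID and key-pair name in this sketch are placeholders, not values from the original setup.

# Hedged sketch: launching a Graviton2 instance (AMI and key name are placeholders)
aws ec2 run-instances --instance-type m6g.16xlarge --image-id ami-0123456789abcdef0 --key-name my-key --count 1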

Single Node Performance

Both BENCH-1 and GYRE_PISCES_25 run on a single node and are not sensitive to I/O effects. We used GCC 9.3 to build NEMO on all instances apart from m6g.16xlarge, where we used GCC 9.2. Figure 1 shows the time to solution for the BENCH-1 test case; for this metric, lower is better.

The AWS Graviton2 shows the best performance for this metric. The larger per-core L1 data cache of the Graviton2 (64 KB versus 32 KB) helps, since most data access in BENCH-1 relies on efficient cache reuse rather than vectorization.

Figure 1: Time to Solution for the BENCH-1 Test 
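
As an aside, the per-core L1 data cache size is easy to confirm on a running instance; assuming a standard Linux image, getconf should report 65536 bytes on the m6g.16xlarge:

getconf LEVEL1_DCACHE_SIZE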

With GYRE_PISCES_25, as shown in Figure 2, the AWS Graviton2 comes in third place behind the two Intel® Xeon® Platinum based instances. GYRE_PISCES_25 is heavily memory-bandwidth bound, and here we are comparing 12 memory channels of DDR4-2666/2933 (across the two sockets of the Intel instances) with the 8 memory channels of DDR4-3200 on the single-socket Graviton2. On a vCPUs-per-socket basis, therefore, the AWS Graviton2 would be the best choice. In terms of cost per simulation (lowest cost in $ for a given turnaround time), the AWS Graviton2 leads, as shown in Figure 3.

Figure 2: Time to Solution for the GYRE_PISCES_25 Configuration Test 

Figure 3: Cost per Simulation for the GYRE_PISCES_25 Configuration Test 

Multi Node Performance

GYRE_PISCES_25 is suitable for running across several nodes, which also tests the performance of AWS Enhanced Networking. Figure 4 shows results for this benchmark on up to four nodes of AWS Graviton2 instances, using 64 MPI tasks per node. Since the overall number of grid points is fixed, the number of grid points per MPI task decreases as more nodes are used. Given the memory-bound nature of this test (as shown in Figure 2), scalability depends on inter-node network performance; here, the AWS Enhanced Network enables excellent scalability. A sketch of such a launch follows Figure 4.

Figure 4: Running the GYRE_PISCES_25 Configuration Test with Several Nodes
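
As a hedged sketch (the hostfile contents and Open MPI mapping options are assumptions, not the exact launch line used here), a four-node run with 64 ranks per node could look like:

# hosts lists one line per node, e.g. "node1 slots=64"
mpirun -np 256 --hostfile hosts --map-by ppr:64:node ./nemo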

Summary

We have covered the steps taken to run a leading EU/UK ocean modeling application, NEMO, on the AWS Graviton2, which is based on Arm Neoverse N1 cores. Building NEMO and its external packages was straightforward with the GCC toolchain, and excellent performance was easy to achieve.

Comparing NEMO performance across AWS instances, the AWS Graviton2 is a strong platform, thanks to its vCPUs-per-socket advantage and its capability on memory-bandwidth bound applications. At the time of writing, M6g instances are also available at a 20% lower price than equivalently configured M5 instances; for NEMO, this means at least 10% lower cost per simulation than the M5 instances.

See Arm Infrastructure solutions for HPC

References

[1] Siuta, D., West, G., Modzelewski, H., Schigas, R., and Stull, R., 2016: Viability of Cloud Computing for Real-Time Numerical Weather Prediction. Weather and Forecasting, 31(6), 1985–1996.

[2] Maisonnave, E. and Masson, S., 2019: NEMO 4.0 performance: how to identify and reduce unnecessary communications, Technical Report, TR/CMGC/19/19, CECI, UMR CERFACS/CNRS No5318, France.

[3] Aumont, O., Ethé, C., Tagliabue, A., Bopp, L., and Gehlen, M., 2015: PISCES-v2: an ocean biogeochemical model for carbon and ecosystem studies. Geosci. Model Dev., 8, 2465–2513.

