Evaluation of the NEMO Ocean Model on Arm Neoverse-based AWS Graviton2

Phil Ridley
June 22, 2020
8 minute read time.

The accuracy of weather and climate predictions is becoming increasingly important. The ability to predict extreme events in advance helps us prepare and thus minimize their impact, and increasing the capability of both HPC systems and the models themselves is key to improving these predictions. One well-known EU and UK ocean modeling and operational forecasting application is NEMO.

Traditionally, models like NEMO are run on large on-premise HPC systems with thousands of computing cores, but weather researchers are increasingly looking at ways to move their applications to the cloud [1]. This blog discusses NEMO's performance on the AWS Graviton2 processor, which is based on Arm Neoverse N1 cores.

Background to NEMO

NEMO (Nucleus for European Modeling of the Ocean) provides an extensive framework for oceanographic research, operational oceanography, seasonal forecasting, and climate studies. NEMO has been developed since 2008 by a consortium of five European institutes:

  • The Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC)
  • The Centre National de la Recherche Scientifique (CNRS)
  • Mercator Ocean
  • The UK Met Office
  • The UK Natural Environment Research Council National Oceanography Centre (NERC-NOC)

NEMO’s ocean model consists of three major components:

  • An ocean dynamics and thermodynamics solver based on the primitive equations (NEMO-OPA).
  • A sea-ice dynamics and thermodynamics simulator, aware of subgrid-scale variations (NEMO-SI3).
  • An online/offline oceanic tracer transport and biogeochemical processes solver (NEMO-TOP/PISCES).

This flexible framework enables NEMO to be used as a tool for studying the ocean and its interactions with the other components of the Earth climate system (atmosphere, sea ice, biogeochemical tracers) over a wide range of space and time scales.

The NEMO Source Code

NEMO version 4.0.1 is available via SVN:

svn co http://forge.ipsl.jussieu.fr/nemo/svn/NEMO/releases/release-4.0.1 

The code is implemented in Fortran with MPI for parallelization. NEMO also requires the following external libraries: an MPI distribution (for example, Open MPI or MPICH), HDF5, NetCDF-C, and NetCDF-Fortran.
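
If these libraries are not already available, they can be built from source with the same toolchain. The following is a minimal sketch, assuming the source trees have been unpacked and using /opt install prefixes as placeholders; adapt the paths to your system.

# Hedged sketch: building the I/O stack with GCC and Open MPI (prefixes are placeholders)
export CC=mpicc FC=mpif90
(cd hdf5-1.10.5 && ./configure --prefix=/opt/hdf5 --enable-fortran --enable-parallel && make -j && make install)
(cd netcdf-c-4.7.0 && CPPFLAGS=-I/opt/hdf5/include LDFLAGS=-L/opt/hdf5/lib ./configure --prefix=/opt/netcdf && make -j && make install)
(cd netcdf-fortran-4.4.5 && CPPFLAGS=-I/opt/netcdf/include LDFLAGS="-L/opt/netcdf/lib -L/opt/hdf5/lib" ./configure --prefix=/opt/netcdf && make -j && make install)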

Building NEMO

For a fair comparison of code generation and performance across the different CPU architectures, we use the GNU compiler toolchain (GCC). In the following, we describe only how to build the NEMO source, and we assume installations of Open MPI 4.0.3, HDF5 1.10.5, NetCDF-C 4.7.0, and NetCDF-Fortran 4.4.5 are already available.

Instructions on how to build NEMO are given on the Arm Community GitLab pages. NEMO uses the FCM build system developed by the UK Met Office, so before starting a build we only need to create a configuration file in the directory:

release-4.0.1/arch/

An example GCC configuration file (arch-aarch64_gnu.fcm) is described here. For the AWS Graviton2, set:

%FCFLAGS      -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g              

Note that for the non-Arm-based processors we need -march=native in place of -mcpu=neoverse-n1.
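
For context, a complete arch file also points FCM at the compiler wrappers and library paths. The excerpt below is a hypothetical sketch rather than the exact file used in this study: the /opt paths and the mpif90 wrapper are placeholder assumptions.

# Hypothetical arch-aarch64_gnu.fcm sketch (library paths are placeholders)
%NCDF_HOME    /opt/netcdf
%HDF5_HOME    /opt/hdf5
%FC           mpif90 -c -cpp
%FCFLAGS      -mcpu=neoverse-n1 -fdefault-real-8 -fdefault-double-8 -Ofast -funroll-all-loops -fcray-pointer -ffree-line-length-none -g
%LD           mpif90
%LDFLAGS
%AR           ar
%ARFLAGS      rs
%MK           make
%USER_INC     -I%NCDF_HOME/include -I%HDF5_HOME/include
%USER_LIB     -L%NCDF_HOME/lib -lnetcdff -lnetcdf -L%HDF5_HOME/lib -lhdf5_hl -lhdf5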

NEMO Benchmarks

The two NEMO benchmarks used for this testing are both distributed with the NEMO source. The first is a 1° resolution setup of the BENCH test, referred to here as BENCH-1. This test is described in a technical report [2] and has 9,013,800 grid points. It was designed to be robust in terms of numerical stability and is therefore well suited to benchmarking and performance analysis.

The second benchmark is a setup of the GYRE_PISCES configuration at a horizontal resolution factor of 25, referred to here as GYRE_PISCES_25. This is an idealized configuration representing double gyres in the northern hemisphere, using a beta-plane with a regular grid spacing and 101 vertical levels. The model is forced with analytical heat, freshwater, and wind-stress fields. The configuration is coupled to the PISCES biogeochemical model [3] and has 37,875,000 grid points.

The benchmarks can be built from the release-4.0.1/ directory:

BENCH-1

./makenemo -m aarch64_gnu -a BENCH -n 'MY_BENCH_gnu' del_key key_iomput add_key key_nosignedzero

GYRE_PISCES_25 

./makenemo -m aarch64_gnu -r GYRE_PISCES -n 'MY_GYRE_PISCES_gnu' del_key key_iomput add_key key_nosignedzero

After a successful build, for BENCH-1, there should be a new directory:

release-4.0.1/tests/MY_BENCH_gnu/EXP00 

And for GYRE_PISCES_25:

release-4.0.1/cfgs/MY_GYRE_PISCES_gnu/EXP00 

Within those directories there is a symbolic link to the executable, nemo.
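
Launching a run then amounts to invoking this executable under MPI from the EXP00 directory. A minimal sketch for a full single node (64 ranks, matching the instances below) might be:

cd release-4.0.1/tests/MY_BENCH_gnu/EXP00
mpirun -np 64 ./nemo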

For the performance comparison, we ran BENCH-1 and GYRE_PISCES_25 across several AWS instance types (m6g.16xlarge, m5a.16xlarge, m5.16xlarge, m5n.16xlarge, and m4.16xlarge). Each instance type has 64 vCPUs (with a vCPU corresponding to 1 CPU core), 256 GiB of memory, AWS Enhanced Networking, and Elastic Block Store (EBS) storage.

AWS Instance | Processor                    | Network Bandwidth (Gbps) | EBS Bandwidth (Gbps)
m6g.16xlarge | AWS Graviton2                | 25                       | 19
m5a.16xlarge | AMD EPYC 7571                | 12                       | 9.5
m5.16xlarge  | Intel® Xeon® Platinum 8175M  | 20                       | 13.6
m5n.16xlarge | Intel® Xeon® Platinum 8259CL | 75                       | 13.6
m4.16xlarge  | Intel® Xeon® E5-2686 v4      | 25                       | 10
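
For reference, instances of each type can be launched with the AWS CLI; the AMI ID and key-pair name in this sketch are placeholders, not values from the original setup.

# Hedged sketch: launching a Graviton2 instance (AMI and key name are placeholders)
aws ec2 run-instances --instance-type m6g.16xlarge --image-id ami-0123456789abcdef0 --key-name my-key --count 1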

Single Node Performance

Both BENCH-1 and GYRE_PISCES_25 run on a single node and are not sensitive to I/O effects. We used GCC 9.3 to build NEMO on all instances apart from m6g.16xlarge, where we used GCC 9.2. Figure 1 shows the time to solution for the BENCH-1 test case; for this metric, lower is better.

The AWS Graviton2 shows the best performance for this metric. The larger per-core L1 data cache of the Graviton2 (64 KB versus 32 KB) helps, since most data access in BENCH-1 relies on efficient cache reuse rather than vectorization.

Figure 1: Time to Solution for the BENCH-1 Test 
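
As an aside, the per-core L1 data cache size is easy to confirm on a running instance; assuming a standard Linux image, getconf should report 65536 bytes on the m6g.16xlarge:

getconf LEVEL1_DCACHE_SIZE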

With GYRE_PISCES_25, as shown in Figure 2, the AWS Graviton2 comes in third place behind the two Intel® Xeon® Platinum based instances. GYRE_PISCES_25 is heavily memory-bandwidth bound, and here we are comparing 12 memory channels of DDR4-2666/2933 (across the two sockets of the Intel instances) with the 8 memory channels of DDR4-3200 on the single-socket Graviton2. On a vCPUs-per-socket basis, therefore, the AWS Graviton2 would be the best choice. In terms of cost per simulation (lowest cost in $ for a given turnaround time), the AWS Graviton2 leads, as shown in Figure 3.

Figure 2: Time to Solution for the GYRE_PISCES_25 Configuration Test 

Figure 3: Cost per Simulation for the GYRE_PISCES_25 Configuration Test 

Multi Node Performance

GYRE_PISCES_25 is suitable for running across several nodes, which also tests the performance of AWS Enhanced Networking. Figure 4 shows results for this benchmark on up to four nodes of AWS Graviton2 instances, using 64 MPI tasks per node. Since the overall number of grid points is fixed, the number of grid points per MPI task decreases as more nodes are used. Given the memory-bound nature of this test (as shown in Figure 2), scalability depends on inter-node network performance; here, the AWS Enhanced Network enables excellent scalability. A sketch of such a launch follows Figure 4.

Figure 4: Running the GYRE_PISCES_25 Configuration Test with Several Nodes
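
As a hedged sketch (the hostfile contents and Open MPI mapping options are assumptions, not the exact launch line used here), a four-node run with 64 ranks per node could look like:

# hosts lists one line per node, e.g. "node1 slots=64"
mpirun -np 256 --hostfile hosts --map-by ppr:64:node ./nemo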

Summary

We have covered the steps taken to run a leading EU/UK ocean modeling application, NEMO, on the AWS Graviton2, which is based on Arm Neoverse N1 cores. Building NEMO and its external packages was straightforward with the GCC toolchain, and excellent performance was easy to achieve.

Comparing NEMO performance across AWS instances, the AWS Graviton2 is a strong platform, thanks to its vCPUs-per-socket advantage and its capability on memory-bandwidth bound applications. At the time of writing, M6g instances are also available at a 20% lower price than equivalently configured M5 instances; for NEMO, this means at least 10% lower cost per simulation than the M5 instances.

See Arm Infrastructure solutions for HPC

References

[1] Siuta, D., West, G., Modzelewski, H., Schigas, R., and Stull, R., 2016: Viability of Cloud Computing for Real-Time Numerical Weather Prediction. Weather and Forecasting, 31(6), 1985–1996.

[2] Maisonnave, E. and Masson, S., 2019: NEMO 4.0 performance: how to identify and reduce unnecessary communications, Technical Report, TR/CMGC/19/19, CECI, UMR CERFACS/CNRS No5318, France.

[3] Aumont, O., Ethé, C., Tagliabue, A., Bopp, L., and Gehlen, M., 2015: PISCES-v2: an ocean biogeochemical model for carbon and ecosystem studies. Geosci. Model Dev., 8, 2465–2513.

