
Bringing WRF up to speed with Arm Neoverse

Phil Ridley
October 19, 2022
5 minute read time.

Background

A better understanding of climate change and more reliable weather forecasting require sophisticated numerical weather prediction models, which consume large amounts of High Performance Computing (HPC) resources. Interest in cloud-based HPC for such models continues to grow [1].

One such widely used application is the WRF (Weather Research and Forecasting) model. This blog discusses running a simple WRF model on the AWS Graviton2 and AWS Graviton3 processors, both of which are based on Arm Neoverse core designs. Interestingly, the AWS Graviton3 is the first cloud-based processor with the Arm Scalable Vector Extension (SVE) and DDR5 memory. We look at the performance of WRF on both instance types and review a few easy steps to maximize performance.

Instances

The AWS Graviton2 (c6g) was introduced in 2019 and is based on the Arm Neoverse N1 core design. Each N1 core can issue 2x128-bit (Neon) floating-point operations per cycle. The AWS Graviton3 (c7g) was introduced in late 2021 and is based on the Arm Neoverse V1 core design. Neoverse V1 also supports SVE, which enables wider vectors than Neon. The AWS Graviton3 has a vector width of 256 bits, meaning each core can issue either 2x256-bit (SVE) or 4x128-bit (Neon) floating-point operations per cycle. So, in theory, a 2x increase in floating-point performance is possible.
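If you want to confirm what the hardware reports, a minimal check (a sketch, assuming a Linux aarch64 instance) is to look for the SVE feature flag and the default vector length exposed by the kernel:

grep -m1 -o sve /proc/cpuinfo || echo "no SVE reported"    # prints 'sve' on c7g, nothing on c6g
cat /proc/sys/abi/sve_default_vector_length 2>/dev/null    # vector length in bytes on SVE-capable systems (32 bytes = 256 bits)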

 

| Instance | Amazon Linux 2 kernel | CPU | Memory |
|---|---|---|---|
| c6g.16xlarge | 4.14.294-220.533.amzn2.aarch64 | 64x AWS Graviton2 cores, single socket, running at 2.5 GHz | 8 memory channels of DDR4-3200; single NUMA region |
| c7g.16xlarge | 5.10.144-127.601.amzn2.aarch64 | 64x AWS Graviton3 cores, single socket, running at 2.6 GHz | 8 memory channels of DDR5-4800; single NUMA region |

In terms of multi-node capability, the c6g(n) is AWS 100 Gb/s EFA-ready, whereas the c7g is currently only available with a 30 Gb/s network. For WRF, there are two well-known test cases: Conus12km, which can be run on a single node, and the larger Conus2.5km, which is better suited to multi-node runs. Here, we keep to single-node Conus12km runs, to focus the discussion on features common to both instances. In practice, the impact of interconnect speed on scalability depends on the size of the WRF case of interest and how many instances are used. For some cases, this might be around 16+ instances [2].

Build details

We use WRF 4.4, which is the current release, along with its dependencies: OpenMPI-4.1.3, HDF5-1.13.1, NetCDF-C-4.8.1, and NetCDF-Fortran-4.5.4. These dependencies are built with GCC 12.2.0 across all AWS instances. It is equally possible to use other toolchains to build WRF, such as the Arm Compiler for Linux or the NVIDIA HPC Software Development Kit (SDK) [3]; GCC currently has a small edge in out-of-the-box performance for this particular test. The flags used are shown in the following table.

| Instance | GCC 12.2.0 flags |
|---|---|
| c6g | -march=armv8.2-a -mtune=neoverse-n1 |
| c7g | -march=armv8.4-a -mtune=neoverse-512tvb |

The dependencies build without any modification across all selected instances; however, WRF 4.4 itself does need a few minor modifications. These modifications can be applied at the configuration step, as mentioned here. If for any reason a build fails, it is worth checking the configure.wrf file to confirm that the correct flags are set.
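For reference, a hypothetical build outline is shown below (the install paths and configure choice are assumptions for illustration; adjust them to your installation):

export HDF5=/opt/hdf5                      # assumed install prefix for HDF5
export NETCDF=/opt/netcdf                  # assumed combined prefix for NetCDF-C and NetCDF-Fortran
cd WRF-4.4
./configure                                # select the GCC (dm+sm) option when prompted
grep -E 'march|mtune' configure.wrf        # confirm the flags from the table above are set
./compile em_real 2>&1 | tee compile.log   # builds wrf.exe and real.exe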

Performance comparison

Performing runs on each instance gives the following results. We found that running with 8 MPI tasks and 8 OpenMP threads per task gave the best overall results. Here, s/ts denotes seconds per time step (or Mean Time per Step) and is taken as the average of the 'Timing for main' values from the resulting rsl.error.0000 file.

| Instance | s/ts | Launch line (OMP_NUM_THREADS=8 OMP_PLACES=cores) |
|---|---|---|
| c6g.16xlarge | 1.6383 | mpirun -n 8 --report-bindings --map-by socket:PE=8 ./wrf.exe |
| c7g.16xlarge | 1.13068 | mpirun -n 8 --report-bindings --map-by socket:PE=8 ./wrf.exe |
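The Mean Time per Step can be extracted from the run output with a one-liner such as the following (a sketch; the elapsed time is the value immediately before 'elapsed seconds' on each 'Timing for main' line):

grep 'Timing for main' rsl.error.0000 | awk '{ sum += $(NF-2); n++ } END { if (n) printf "%.5f s/ts over %d steps\n", sum/n, n }'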

Let us consider c6g vs c7g performance, as there is a significant difference. What is helping the most on the c7g? Is it having double the vector width, SVE instructions, or the faster DDR5 memory? Checking whether it is the SVE instructions is easy: just run the c6g-built (non-SVE) executable on the c7g. In fact, we see that performance is almost the same, telling us that SVE is not what is helping here. So, the uplift is mostly from the faster memory bandwidth with DDR5 and the 4x128-bit (Neon) floating-point units.

It is worth checking SVE versus Neon performance for your application on any SVE-enabled processor: whether an application can use SVE instructions optimally really depends on the application itself.
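One rough way to check which instruction set a binary actually contains (a sketch; the mnemonics listed are illustrative, not exhaustive) is to look for SVE-specific instructions in the disassembly:

objdump -d wrf.exe | grep -cE '\bptrue\b|\bwhilelo\b'   # zero matches suggests a Neon-only build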

Better performance without effort

Taking a closer look at the performance with Arm Forge, we can see that the c6g shows 0.98 cycles per instruction (CPI), whereas the c7g is much lower at 0.66 (the same for Neon or SVE). This reduction also contributes to the performance uplift going from c6g to c7g and is due to the improved Neoverse V1 CPU pipeline. It is also worth noting that the overall wall-clock times for c6g and c7g are 896s and 614s, respectively. Can we do anything to lower these times?

From the list of functions in Figure 1, we see that powf_finite, which comes from the standard C library of basic mathematical functions (libm-2.26), takes 6.1% of the overall runtime. There are also other similar functions taking around 1% of this time, along with memset and memcpy from the standard C library (libc-2.26).

Figure 1: List of top functions

We can improve on this result very easily by replacing the implementation of these functions with ones from the Arm Performance Libraries. This library not only contains highly optimized routines for BLAS, LAPACK, and FFTW but also libamath, which provides the widely used functions from libm. The library also contains optimized memset and memcpy implementations in libastring. We can use these libraries by simply relinking the main wrf.exe with libamath and libastring (taking care to place these links to the left of the one for libm), for example:

mpif90 -o wrf.exe … -L/opt/arm/armpl-22.1.0_AArch64_RHEL-7_gcc_aarch64-linux/lib -lastring -L/opt/arm/armpl-22.1.0_AArch64_RHEL-7_gcc_aarch64-linux/lib -lamath -lm -lz
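After relinking, a quick sanity check (a sketch, assuming the ArmPL install path above is available at run time) is to confirm that the new libraries appear in the executable's dynamic dependencies:

ldd wrf.exe | grep -E 'amath|astring|libm'   # libamath and libastring should be listed ahead of libm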

In terms of overall wall-clock time, we measure 821s for c6g and 569s for c7g. This is a good improvement for simply switching to another library. Similarly, as recommended here, some applications can benefit from zlib-cloudflare, which has been optimized for faster compression. Trying again, this time with the optimized libz, the wall-clock times reduce to 776s and 539s for c6g and c7g, respectively. The corresponding Mean Time per Step values, shown in Figure 2, agree with the overall wall-clock times.


Figure 2: Comparison of Mean Time per Step (s) for c6g and c7g
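One way to try the optimized zlib without relinking (a sketch, assuming the cloudflare/zlib fork and an install prefix of /opt/zlib-cloudflare; the exact build steps may differ) is to build it and preload it for the run:

git clone https://github.com/cloudflare/zlib.git
cd zlib && ./configure --prefix=/opt/zlib-cloudflare && make -j && make install
export LD_PRELOAD=/opt/zlib-cloudflare/lib/libz.so   # picked up by wrf.exe at run time
mpirun -n 8 --report-bindings --map-by socket:PE=8 ./wrf.exe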

Summary

In this blog, we have taken a widely used numerical weather prediction model, WRF 4.4, and compared its performance on two AWS EC2 instance types: AWS Graviton2 (c6g) and AWS Graviton3 (c7g). For a standard GCC 12.2.0 build of WRF 4.4, we have seen that the c7g gives 30% better performance than the c6g. We have also seen how easy it is to achieve a further 13% performance improvement by using the Arm Performance Libraries and an optimized implementation of zlib.

References

[1] https://www.metoffice.gov.uk/about-us/press-office/news/corporate/2021/met-office-and-microsoft-announce-supercomputer-project

[2] https://aws.amazon.com/blogs/hpc/numerical-weather-prediction-on-aws-graviton2/

[3] https://github.com/arm-hpc-devkit/nvidia-arm-hpc-devkit-users-guide/blob/main/examples/wrf.md

Comments

Phil Ridley (5 months ago, in reply to Honnappa Nagarahalli):

    Either approach (-mcpu, or -march plus -mtune) is fine - it's mostly personal preference. For example, with GCC

    on G3

    -mcpu=native => -march=armv8.4-a -mtune=zeus

    and on G2


    -mcpu=native => -march=armv8.2-a -mtune=ares


    Alternatively, it's possible to specify directly, e.g.

    -mcpu=neoverse-n1 => -march=armv8.2-a -mtune=neoverse-n1
    -mcpu=neoverse-v1 => -march=armv8.4-a -mtune=neoverse-v1
    -mcpu=neoverse-512tvb => -march=armv8.4-a -mtune=neoverse-512tvb
    -march=native => -mtune=generic

    Please bear in mind that if -mcpu=native is used to build on G3 then the resulting executable may not necessarily be able to run on G2, but an executable built on G2 with -mcpu=native should be able to run on G3.
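    A quick way to check what -mcpu=native resolves to on a given host (a sketch; the exact output format varies with the GCC version) is to ask the compiler driver to show the flags it passes down:

    gcc -mcpu=native -E -v - </dev/null 2>&1 | grep cc1   # the cc1 line shows how 'native' is expanded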

Honnappa Nagarahalli (5 months ago):

    Is it better to use -mcpu=native for compiler flags? Developers do not have to worry about the details of the underlying CPU (other than G2 or G3), and there is no need for cross-compilation anymore.
