Assessing AWS Graviton2 for running WRF

Phil Ridley
August 17, 2020
5 minute read time.

Operational weather forecasting is key to informing many important decisions that are made in agriculture, marine, air transport, environment, military, financial, and other sectors. The WRF (Weather Research and Forecasting) model is a widely used US model that can simulate forecasts based on actual atmospheric conditions (observations). Real-time forecasting applications can then use these results to make informed decisions.

Over the last few years, weather researchers have been investigating the use of cloud resources to run their models, with WRF [1] being a popular example. Access to such flexible resources is expected to give more researchers the ability to perform advanced WRF simulations [2]. In this blog, we discuss running WRF on the AWS Graviton2 processor, which is based on Arm's Neoverse N1 cores.

Background to WRF

The WRF model is a numerical weather prediction system designed for atmospheric research and operational forecasting applications. WRF has been developed since the late 1990s through a collaborative partnership between:

  • National Center for Atmospheric Research (NCAR)
  • The National Oceanic and Atmospheric Administration (NOAA)
  • The U.S. Air Force (USAF)
  • The Naval Research Laboratory (NRL)
  • The University of Oklahoma (OU)
  • The Federal Aviation Administration (FAA)

WRF features two dynamical cores:

  • The ARW (Advanced Research WRF)
  • The NMM (Nonhydrostatic Mesoscale Model)

This enables WRF to support a wide range of meteorological applications across scales from 0.001 to 1000 km. There are over 30,000 registered WRF users worldwide.

The WRF Source Code

Releases of the WRF source code are available to download here. This blog discusses version 3.9.1.1 so that we can compare performance with existing results and use readily available data sets for reproducibility. The latter is not possible from WRF version 4 onward.

WRF uses both MPI and OpenMP for parallelization, and also requires the following external libraries: an MPI distribution (for example, Open MPI or MPICH), HDF5, NetCDF-C, and NetCDF-Fortran.

Building WRF

Instructions on how to build WRF are given on the Arm community GitLab pages. This is a straightforward process. The environment variables HDFDIR and NETCDF need to be set to the locations of the HDF5 and NetCDF installations, and the build stanza needs to be set in the file:

WRFV3-3.9.1.1/arch/configure_new.defaults

Then execute the configure script (in dm+sm mode, that is, distributed memory plus shared memory), followed by the build script. To facilitate a fair comparison, we use the GNU compiler toolchain (GCC) version 9.3.
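As a minimal sketch of the environment check implied above, the following Python snippet reports which of the two variables named in this blog (HDFDIR and NETCDF) are unset or do not point at a directory. The function name and the idea of running such a check before configure are assumptions for illustration; they are not part of the WRF build system.

```python
import os
from pathlib import Path

# Variable names taken from the build description above.
REQUIRED_VARS = ("HDFDIR", "NETCDF")


def missing_build_vars(env=None):
    """Return the required variables that are unset or not a directory.

    `env` is a mapping of variable names to values; it defaults to the
    current process environment. This helper is hypothetical.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or not Path(value).is_dir():
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_build_vars()
    if problems:
        print("Set these before running configure:", ", ".join(problems))
    else:
        print("HDFDIR and NETCDF both point at existing directories")
```

The check only confirms that the directories exist; it does not verify that the libraries inside them were built with a compatible compiler.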

WRF Performance Benchmarks

The two WRF benchmarks we use are the Continental U.S. (CONUS) cases at 2.5km and 12km resolution: CONUS 2.5km and CONUS 12km. CONUS 12km is a 48-hour simulation for October 24, 2001, with a time step of 72 seconds. The benchmark period is for hours 25-27, starting from a restart file at the end of hour 24. CONUS 2.5km is a 9-hour simulation for June 4, 2005, with a time step of 15 seconds. The benchmark period is for hours 6-9, starting from a restart file at the end of hour 6. Since CONUS 2.5km is at a much higher resolution than CONUS 12km, it is the larger case, which makes it much better for demonstrating processor scalability.
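To make the size of each benchmark period concrete, here is a short sketch that counts the model time steps in it. It assumes each benchmark period covers three simulated hours (from the restart point to the end of the period), which is how the description above reads; the function name is illustrative only.

```python
def steps_in_period(hours, time_step_seconds):
    """Number of model time steps needed to cover `hours` of simulated time."""
    total_seconds = hours * 3600
    if total_seconds % time_step_seconds != 0:
        raise ValueError("period is not a whole number of time steps")
    return total_seconds // time_step_seconds


# Assumed 3-hour benchmark periods, with the time steps quoted in the text.
conus_12km_steps = steps_in_period(3, 72)   # 150
conus_2p5km_steps = steps_in_period(3, 15)  # 720

print("CONUS 12km steps:", conus_12km_steps)
print("CONUS 2.5km steps:", conus_2p5km_steps)
```

Under that assumption, CONUS 2.5km takes almost five times as many time steps as CONUS 12km, on top of having many more grid points per step.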

For performance testing, we ran CONUS 2.5km and CONUS 12km on up to 12 AWS Graviton2 HPC instances (c6g.16xlarge). Each instance has 64 vCPUs (with a vCPU corresponding to 1 CPU core), 128 GiB of memory, AWS Enhanced Network running at 25 Gbps, and an Elastic Block Store (EBS) filesystem running at 19 Gbps.

Single Node Performance

To optimize single-node performance, we first determined the best combination of MPI tasks and OMP threads to use for each test. For CONUS 12km, Figure 1 shows that the lowest time per simulation time step (0.33 elapsed seconds) was achieved by using 8 MPI tasks with 8 OMP threads each.

Figure 1: Mean Time per Step for CONUS 12km

Similarly, for CONUS 2.5km, Figure 2 shows that the lowest time per simulation time step (4.97 elapsed seconds) was achieved by using 32 MPI tasks with 2 OMP threads each. This can be attributed to vectorization of the key loops helping more in this larger case, whereas the (smaller) CONUS 12km test can benefit from memory bandwidth alone.

Figure 2: Mean Time per Step for CONUS 2.5km
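The search space for this kind of tuning can be sketched as follows. The snippet lists every MPI task and OMP thread combination that exactly fills the 64 cores of one instance. Whether all of these layouts were actually tested is not stated in this blog, so treat the list as an assumption about what a sweep could cover.

```python
def hybrid_layouts(cores):
    """Return (mpi_tasks, omp_threads) pairs with mpi_tasks * omp_threads == cores."""
    return [
        (tasks, cores // tasks)
        for tasks in range(1, cores + 1)
        if cores % tasks == 0
    ]


# One c6g.16xlarge instance has 64 cores.
layouts = hybrid_layouts(64)
for tasks, threads in layouts:
    print(f"{tasks:2d} MPI tasks x {threads:2d} OMP threads")
```

Both best-performing layouts reported above, 8 tasks with 8 threads for CONUS 12km and 32 tasks with 2 threads for CONUS 2.5km, appear in this list.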

Multi Node Performance

CONUS 2.5km is known to scale well and is therefore suitable for testing the performance of the AWS Enhanced Network across several nodes. Figure 3 shows results from running this benchmark on up to 12 nodes with AWS Graviton2 instances. On each node, we use 32 MPI tasks with 2 OMP threads each. As CONUS 2.5km is of fixed size, the number of grid points per MPI task decreases as we use more nodes. The performance of the AWS Enhanced Network enables excellent scalability, and this agrees with the almost linear scalability reported in [3].

Figure 3: Strong Scaling for CONUS 2.5km
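For readers who want to quantify strong scaling from their own runs, the sketch below defines speedup and parallel efficiency, and prints the ideal (perfectly linear) time per step derived from the single-node figure of 4.97 seconds. The ideal values are a reference line only, not measurements from this blog, and the function names are illustrative.

```python
TASKS_PER_NODE = 32    # MPI tasks per node, as used for CONUS 2.5km
THREADS_PER_TASK = 2   # OMP threads per MPI task
SINGLE_NODE_TIME = 4.97  # mean seconds per step on one node (Figure 2)


def speedup(t_one_node, t_n_nodes):
    """Strong-scaling speedup relative to the single-node time."""
    return t_one_node / t_n_nodes


def parallel_efficiency(t_one_node, t_n_nodes, nodes):
    """Speedup divided by node count; 1.0 means perfectly linear scaling."""
    return speedup(t_one_node, t_n_nodes) / nodes


def ideal_time(t_one_node, nodes):
    """Time per step if scaling were perfectly linear (reference only)."""
    return t_one_node / nodes


for nodes in (1, 2, 4, 8, 12):
    tasks = nodes * TASKS_PER_NODE
    cores = tasks * THREADS_PER_TASK
    print(f"{nodes:2d} nodes: {tasks:3d} MPI tasks, {cores:3d} cores, "
          f"ideal {ideal_time(SINGLE_NODE_TIME, nodes):.3f} s/step")
```

Passing a measured time per step to `parallel_efficiency` gives a value at or below 1.0 in most cases; the closer it stays to 1.0 as nodes are added, the closer the run is to linear scaling.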

Performance Comparison

For a brief performance comparison with other AWS instances, we also tested WRF on c5a.16xlarge (AMD EPYC 7571), c5.18xlarge (Intel® Xeon® Platinum 8175M) and c5n.18xlarge (Intel® Xeon® Platinum 8259CL). Again, we used the same GNU-built software stack, with the following flags for WRF:

-Ofast -march=native -fopenmp -frecursive -funroll-loops  

Results for CONUS 12km (Figure 4) and CONUS 2.5km (Figure 5) show that the best performance is achieved with the AWS Graviton2. This is due to both CPU capability and the maturity of the GNU compiler for Arm architectures.

Figure 4: Comparison of CONUS 12km Across AWS Instances

Figure 5: Comparison of CONUS 2.5km Across AWS Instances

Summary

We have demonstrated how easy it is to run a widely used weather forecasting application, WRF, on AWS Graviton2, which is based on Arm Neoverse N1 cores. The AWS platform as a whole enables excellent scalability. Compared with a selection of other AWS instances, the AWS Graviton2 shows the fastest overall time to solution for WRF.

At the time of writing, c6g instances are available at a 20% lower price than equivalently configured c5 instances. In terms of cost per simulation, for WRF this means at least 45% lower cost than the c5 instances.
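The relationship between the two percentages above can be sketched with a simple cost model. It assumes the cost of a simulation is the hourly instance price multiplied by the run time, with the same number of instances on both sides; this model and the function names are assumptions for illustration, not the method used to produce the figures in this blog.

```python
def relative_cost(price_ratio, runtime_ratio):
    """Cost of a run on instance A relative to instance B.

    price_ratio   = hourly price of A / hourly price of B
    runtime_ratio = run time on A / run time on B
    """
    return price_ratio * runtime_ratio


def max_runtime_ratio(price_discount, cost_saving):
    """Largest runtime ratio that still yields at least `cost_saving`."""
    return (1.0 - cost_saving) / (1.0 - price_discount)


# Figures quoted in the text: 20% lower price, at least 45% lower cost.
ratio = max_runtime_ratio(price_discount=0.20, cost_saving=0.45)
print(f"c6g run time must be at most {ratio:.4f} of the c5 run time")
print(f"relative cost at that ratio: {relative_cost(0.80, ratio):.2f}")
```

Under this model, a saving of at least 45% at a 20% lower price corresponds to the c6g run taking no more than about 69% of the c5 run time.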

See Arm Infrastructure solutions for HPC

References

[1] Goga, K., Parodi, A., Ruiu, P., Terzo, O., Performance Analysis of WRF Simulations in a Public Cloud and HPC environment, Volume 611, p. 384-396, 2017.

[2] Powers, J. G., and others, The Weather Research and Forecasting Model: Overview, System Efforts, and Future Directions, Bull. Amer. Meteor. Soc., Volume 98(8), p. 1717–1737, 2017.

[3] https://www.amd.com/system/files/documents/wrf-and-amd-epyc-the-right-combination-for-weather-modeling.pdf

  • mayujsw, over 2 years ago

    Hi, I tried this CONUS 12km case on AWS c6g.16xlarge following the Arm GitLab setup: https://gitlab.com/arm-hpc/packages/-/wikis/packages/wrf-modeler. It runs into a segmentation fault with this config with GCC, but after adding -O3 to FCBASEOPTS it works. However, the score is 0.362, not 0.33. Is there any special config needed to get this good score?

  • Phil Ridley, over 2 years ago, in reply to mayujsw

    Hi, can you please try with FCOPTIM=-mcpu=neoverse-n1 -Ofast -fopenmp -frecursive, with similar flags for FCNOOPT (but with -O0 in place of -Ofast) and FCBASEOPTS?

    (From GCC 10 onward, also add -fallow-argument-mismatch -fallow-invalid-boz to these flags and to the flags for FORMAT_FIXED and FORMAT_FREE.)

    Thanks

  • mayujsw, over 2 years ago, in reply to Phil Ridley

    Thanks a lot for the reply. I've tried following your advice, and the best score with these flags is a stable 0.35, not as good as 0.33.

    The GCC version is 9.3.0, the same as the one on the Arm GitLab. Is there any other advice or config I should use to get this good score?

  • Phil Ridley, over 2 years ago, in reply to mayujsw

    Hi, thanks for the update. Can you please try with GCC 10.2?

    The only other thing to try would be the free Arm Performance Libraries (https://developer.arm.com/tools-and-software/server-and-hpc/downloads/arm-performance-libraries). The libamath library contains better optimized versions of log, exp and pow, and sometimes these can help. So LDFLAGS_LOCAL and FCOPTIM would need the extra flags -L<path to Arm Performance Libraries installation> -lamath

  • mayujsw, over 2 years ago, in reply to Phil Ridley

    Thanks Phil. I suppose that all these scores are based on GCC 9.3, as mentioned in this blog. If that is the case, I'm wondering whether there is any other config I'm missing to replicate them.

  • Phil Ridley, over 2 years ago, in reply to mayujsw

    The config should be okay; that's as far as it can go. This test case does rely on I/O during the time steps, and that is possibly giving different performance than on the system I used.

  • mayujsw, over 2 years ago, in reply to Phil Ridley

    OK, thanks Phil. I'm using Ubuntu 20.04. May I know which system you are using?

  • Phil Ridley, over 2 years ago, in reply to mayujsw

    That's interesting, I've been using RHEL 8 based OSs. Thanks, Phil.