The AWS Graviton4 CPU, built on the Arm Neoverse V2 core and the Coherent Mesh Network (CMN-700), is now generally available. Graviton4 first appears in the Amazon EC2 R8g instance family, which targets memory-intensive workloads such as large in-memory databases and big data analytics. Arm is evaluating R8g for our in-house EDA workloads and has found that its improved performance and memory capabilities further reinforce the benefits of migrating our high-performance engineering workloads to the Arm architecture.
Graviton4 marks a significant leap in capability over its HPC-targeted predecessor, Graviton3E, and we want to see how this plays out in practice.
Each AWS Graviton vCPU is a full core.
Much of HPC application performance is a memory-bandwidth story, and as the specification table shows, Graviton4 moves the dial significantly: 16.7% more main-memory bandwidth per vCPU and double the L2 cache per vCPU, enabled by the Arm Neoverse V2 core.
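As a rough sanity check on that per-vCPU figure, the sketch below works through the arithmetic. The memory configurations it uses (8 channels of DDR5-4800 for Graviton3E, 12 channels of DDR5-5600 for Graviton4) are our reading of the public specifications and are assumed here rather than measured.

```python
# Back-of-the-envelope peak memory bandwidth per vCPU.
# Assumed configurations (our reading of the public specs, not measured values):
#   Graviton3E: 8 channels of DDR5-4800 shared by 64 vCPUs
#   Graviton4: 12 channels of DDR5-5600 shared by 96 vCPUs
BYTES_PER_TRANSFER = 8  # one 64-bit DDR5 channel moves 8 bytes per transfer

def peak_gb_s(channels: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s for a given channel count and transfer rate (MT/s)."""
    return channels * mt_per_s * BYTES_PER_TRANSFER / 1000

g3e_per_vcpu = peak_gb_s(8, 4800) / 64   # ~4.8 GB/s per vCPU
g4_per_vcpu = peak_gb_s(12, 5600) / 96   # ~5.6 GB/s per vCPU

print(f"Graviton3E: {g3e_per_vcpu:.2f} GB/s per vCPU")
print(f"Graviton4:  {g4_per_vcpu:.2f} GB/s per vCPU")
print(f"Uplift:     {100 * (g4_per_vcpu / g3e_per_vcpu - 1):.1f}%")  # ~16.7%
```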
In this first blog on the platform, we will examine workloads from a number of HPC domains:
Using AWS Graviton3E (hpc7g.16xlarge) and AWS Graviton4 (r8g.24xlarge) single-socket instances, we compare the improvement between the two CPU generations.
While demonstrating the improvement in per-core performance, we caveat that the difference in core (vCPU) count between AWS Graviton3E (64 cores) and AWS Graviton4 (96 cores) prevents a perfect comparison. Higher core counts can introduce scaling inefficiency: each test case is split into smaller chunks per core, which can carry more parallel overhead than a run on fewer cores.
Nonetheless, our first chart shows the benchmarks as performance per core (total performance divided by the number of cores), since this accounts for the different core counts of the two instances.
AWS Graviton3E runs Rocky Linux 9 and AWS Graviton4 runs Ubuntu 22.04. Performance is shown as the best result across GCC 13.2 and Arm Compiler for Linux 24.04, across the range of MxN decompositions (M MPI tasks x N OpenMP threads), and across Open MPI 5.0 and MPICH 4.1, all built with Spack (https://github.com/spack/spack, develop branch, 2024-07).
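To be explicit about what "best-of" means: for each benchmark and instance we keep the fastest run over every compiler, MPI library, and MxN combination tested. A minimal sketch of that selection, with placeholder timings rather than our measured data:

```python
# Select the fastest (lowest wall-time) configuration per benchmark.
# The timings are placeholders for illustration, not measured results.
results = {
    # (compiler, MPI, "MxN") -> wall time in seconds
    ("gcc-13.2",   "openmpi-5.0", "96x1"): 412.0,
    ("gcc-13.2",   "mpich-4.1",   "48x2"): 430.5,
    ("acfl-24.04", "openmpi-5.0", "96x1"): 401.3,
    ("acfl-24.04", "mpich-4.1",   "24x4"): 455.9,
}

best_config, best_time = min(results.items(), key=lambda kv: kv[1])
print(f"best-of: {best_config} at {best_time:.1f} s")
```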
The geometric mean across this set of workloads gives AWS Graviton4 a 24% per-vCPU performance advantage over AWS Graviton3E.
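For reference, this is how the per-vCPU normalization and geometric mean are formed; the scores below are illustrative placeholders, not the measured data behind the chart.

```python
from math import prod

# Hypothetical total benchmark scores (higher is better) -- placeholders only.
g3e_scores = {"workload_a": 100.0, "workload_b": 96.0, "workload_c": 88.0}   # 64 vCPUs
g4_scores  = {"workload_a": 176.0, "workload_b": 186.0, "workload_c": 174.0} # 96 vCPUs

# Normalize each score to performance per vCPU, then take the Graviton4 / Graviton3E ratio.
ratios = [(g4_scores[w] / 96) / (g3e_scores[w] / 64) for w in g3e_scores]

# Geometric mean of the per-workload ratios.
geomean = prod(ratios) ** (1 / len(ratios))
print(f"per-vCPU geomean uplift: {geomean:.2f}x")  # ~1.26x with these placeholder values
```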
Some of these performance improvements could be anticipated: OpenFOAM, as a CFD application, is widely known to be memory-bound and benefits directly from ~17% more memory bandwidth per vCPU.
Where our high expectations were exceeded, however, is on the LAMMPS and Relion tests, both compute-bound applications: LAMMPS improves by 32% and Relion by 41%. This suggests that, in addition to the 7.5% faster clock speed and memory bandwidth improvements of AWS Graviton4, the micro-architecture enhancements made to our Neoverse V2 core have outsized benefits for some workloads.
WRF includes both memory-bound and compute-bound regions, and shows a healthy 24% boost per vCPU from AWS Graviton4.
As both 4th-Gen AMD EPYC (“Genoa”) and AWS Graviton4 are available as 192-vCPU instances in AWS, we can make a direct, like-for-like comparison between the two.
We use c7a.48xlarge, a dual-socket 192-vCPU AMD EPYC instance, and r8g.48xlarge, a dual-socket 192-vCPU AWS Graviton4 instance. Worth noting: unlike previous generations of AMD EPYC instances (e.g. c6a), every vCPU on a c7a instance is a physical CPU core rather than an SMT thread.
4th-Gen AMD EPYC-based c7a instances have 7.5% less memory bandwidth per core than Graviton4-based r8g instances (DDR5-5200 vs. DDR5-5600) and a smaller L2 cache, but benefit from a larger L3 cache.
As before, we used the best results across AMD’s AOCC, the Intel oneAPI compilers, and GCC, with Spack as the build system.
Across the majority of our HPC workloads, on a like-for-like 192-vCPU basis, AWS Graviton4 delivers significantly better performance than the latest 4th-Gen AMD EPYC (“Genoa”)-based instances, with the geometric mean giving AWS Graviton4 a 15.2% advantage.
We are very excited about the general availability of AWS Graviton4 CPUs and would like to congratulate the teams at Amazon and Annapurna Labs on this achievement. As an early adopter of Arm Neoverse cores in AWS Graviton CPUs, AWS has systematically improved the capabilities of Graviton – releasing an incredible four new generations over five years, each with greater performance and capabilities to address the full spectrum of cloud workloads. In addition, Graviton4 is the most powerful and power-efficient chip AWS has ever built. We look forward to the continued pace of innovation of AWS Graviton and the positive impact each generation brings to performance and sustainable efficiency. While AI promises to bring transformative benefits across industries, for this to happen customers everywhere need to improve the efficiency of their core compute.
More Neoverse Blogs