The AWS Graviton4 CPU, built on the Arm Neoverse V2 core and the Coherent Mesh Network (CMN-700), is now generally available. Graviton4 first appears in the Amazon EC2 R8g instance family, which targets memory-intensive workloads such as large in-memory databases and big data analytics. Arm is evaluating R8g for our in-house EDA workloads and has found that its improved performance and memory capabilities further reinforce the benefits of migrating our high-performance engineering workloads to the Arm architecture.
Graviton4 marks a significant leap in capability over its HPC-targeted predecessor, Graviton3E, and we want to see how this plays out in practice.
Each AWS Graviton vCPU is a full core.
Much of HPC application performance is a memory-bandwidth story, and as the specification table shows, Graviton4 moves the dial significantly: 16.7% more main-memory bandwidth per vCPU and double the L2 cache per vCPU, enabled by the Arm Neoverse V2 core.
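As a rough sanity check on that per-vCPU figure, the sketch below works through the arithmetic. The memory configurations it uses (8 channels of DDR5-4800 for Graviton3E, 12 channels of DDR5-5600 for Graviton4) are our reading of the public specifications and are assumed here rather than measured.

```python
# Back-of-the-envelope peak memory bandwidth per vCPU.
# Assumed configurations (our reading of the public specs, not measured values):
#   Graviton3E: 8 channels of DDR5-4800 shared by 64 vCPUs
#   Graviton4: 12 channels of DDR5-5600 shared by 96 vCPUs
BYTES_PER_TRANSFER = 8  # one 64-bit DDR5 channel moves 8 bytes per transfer

def peak_gb_s(channels: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s for a given channel count and transfer rate (MT/s)."""
    return channels * mt_per_s * BYTES_PER_TRANSFER / 1000

g3e_per_vcpu = peak_gb_s(8, 4800) / 64   # ~4.8 GB/s per vCPU
g4_per_vcpu = peak_gb_s(12, 5600) / 96   # ~5.6 GB/s per vCPU

print(f"Graviton3E: {g3e_per_vcpu:.2f} GB/s per vCPU")
print(f"Graviton4:  {g4_per_vcpu:.2f} GB/s per vCPU")
print(f"Uplift:     {100 * (g4_per_vcpu / g3e_per_vcpu - 1):.1f}%")  # ~16.7%
```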
In this first blog on the platform, we will examine workloads from a number of HPC domains:
Using AWS Graviton3E (hpc7g.16xlarge) and AWS Graviton4 (r8g.24xlarge) single-socket instances, we compare the improvement between the two CPU generations.
While demonstrating the improvement in per-core performance, we caveat that the difference in core (vCPU) count between AWS Graviton3E (64 cores) and AWS Graviton4 (96 cores) prevents a perfect comparison. Higher core counts can introduce scaling inefficiency: each test case is split into smaller chunks per core, which can carry more parallel overhead than a run on fewer cores.
Nonetheless, our first chart shows the benchmarks as performance per core (total performance divided by the number of cores), since this accounts for the different core counts of the two instances.
AWS Graviton3E runs Rocky Linux 9 and AWS Graviton4 runs Ubuntu 22.04. Performance is shown as the best result across GCC 13.2 and Arm Compiler for Linux 24.04, across the range of MxN decompositions (M MPI tasks x N OpenMP threads), and across Open MPI 5.0 and MPICH 4.1, all built with Spack (https://github.com/spack/spack, develop branch, 2024-07).
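To be explicit about what "best-of" means: for each benchmark and instance we keep the fastest run over every compiler, MPI library, and MxN combination tested. A minimal sketch of that selection, with placeholder timings rather than our measured data:

```python
# Select the fastest (lowest wall-time) configuration per benchmark.
# The timings are placeholders for illustration, not measured results.
results = {
    # (compiler, MPI, "MxN") -> wall time in seconds
    ("gcc-13.2",   "openmpi-5.0", "96x1"): 412.0,
    ("gcc-13.2",   "mpich-4.1",   "48x2"): 430.5,
    ("acfl-24.04", "openmpi-5.0", "96x1"): 401.3,
    ("acfl-24.04", "mpich-4.1",   "24x4"): 455.9,
}

best_config, best_time = min(results.items(), key=lambda kv: kv[1])
print(f"best-of: {best_config} at {best_time:.1f} s")
```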
The geometric mean across this set of workloads gives AWS Graviton4 a 24% per-vCPU performance advantage over AWS Graviton3E.
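For reference, this is how the per-vCPU normalization and geometric mean are formed; the scores below are illustrative placeholders, not the measured data behind the chart.

```python
from math import prod

# Hypothetical total benchmark scores (higher is better) -- placeholders only.
g3e_scores = {"workload_a": 100.0, "workload_b": 96.0, "workload_c": 88.0}   # 64 vCPUs
g4_scores  = {"workload_a": 176.0, "workload_b": 186.0, "workload_c": 174.0} # 96 vCPUs

# Normalize each score to performance per vCPU, then take the Graviton4 / Graviton3E ratio.
ratios = [(g4_scores[w] / 96) / (g3e_scores[w] / 64) for w in g3e_scores]

# Geometric mean of the per-workload ratios.
geomean = prod(ratios) ** (1 / len(ratios))
print(f"per-vCPU geomean uplift: {geomean:.2f}x")  # ~1.26x with these placeholder values
```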
Some of these performance improvements could be anticipated: OpenFOAM, as a CFD application, is widely known to be memory-bound and benefits directly from ~17% more memory bandwidth per vCPU.
Where our high expectations were exceeded, however, is on the LAMMPS and Relion tests, both compute-bound applications: LAMMPS improves by 32% and Relion by 41%. This suggests that, in addition to the 7.5% faster clock speed and memory bandwidth improvements of AWS Graviton4, the micro-architecture enhancements made to our Neoverse V2 core have outsized benefits for some workloads.
WRF includes both memory-bound and compute-bound regions, and shows a healthy 24% boost per vCPU from AWS Graviton4.
As both 4th-Gen AMD EPYC (“Genoa”) and AWS Graviton4 are available as 192-vCPU instances in AWS, we can make a direct, like-for-like comparison between the two.
We use c7a.48xlarge, a dual-socket 192-vCPU AMD EPYC instance, and r8g.48xlarge, a dual-socket 192-vCPU AWS Graviton4 instance. Worth noting: unlike previous generations of AMD EPYC instances (e.g. c6a), every vCPU on a c7a instance is a physical CPU core rather than an SMT thread.
4th-Gen AMD EPYC-based c7a instances have 7.5% less memory bandwidth per core than Graviton4-based r8g instances (DDR5-5200 vs. DDR5-5600) and a smaller L2 cache, but benefit from a larger L3 cache.
As before, we used the best results across AMD’s AOCC, the Intel oneAPI compilers, and GCC, with Spack as the build system.
Across the majority of our HPC workloads, on a like-for-like 192-vCPU basis, AWS Graviton4 delivers significantly better performance than the latest 4th-Gen AMD EPYC (“Genoa”)-based instances, with the geometric mean giving AWS Graviton4 a 15.2% advantage.
We are very excited about the general availability of AWS Graviton4 CPUs and would like to congratulate the teams at Amazon and Annapurna Labs on this achievement. As an early adopter of Arm Neoverse cores in AWS Graviton CPUs, AWS has systematically improved the capabilities of Graviton – releasing an incredible four new generations over five years, each with greater performance and capabilities to address the full spectrum of cloud workloads. In addition, Graviton4 is the most powerful and power-efficient chip AWS has ever built. We look forward to the continued pace of innovation of AWS Graviton and the positive impact each generation brings to performance and sustainable efficiency. While AI promises to bring transformative benefits across industries, for this to happen customers everywhere need to improve the efficiency of their core compute.
More Neoverse Blogs