Co-authored by Darren Cepulis, HPC Segment Manager, Arm, and Simon McIntosh-Smith, Professor of HPC at the University of Bristol.
At the CUG’19 event in Montreal, Canada this week, Simon McIntosh-Smith provided an at-scale performance update on the Isambard supercomputer deployed by the GW4 Alliance and the Met Office.
Isambard is a 168-node HPC cluster based on the Cray XC50 system design, built around Arm-based Marvell ThunderX2 CPUs and Cray's Aries high-speed interconnect.
The CUG presentation and associated paper provide an early in-depth view of performance data for at-scale applications running on the Arm-based Marvell ThunderX2. Here we provide a recap of some of the data and salient points presented.
In a nutshell, the good news is that ThunderX2 B2 silicon is scaling similarly to Skylake when both are using Cray’s Aries interconnect. This is what we expected, but the Isambard project has now provided the evidence to confirm the hypothesis.
In Isambard, the new B2 silicon ThunderX2s are running at their turbo clock speed of 2.5 GHz all the time, even when running HPL. This is in contrast to most experiences with modern x86 CPUs, which tend to downclock when running intensive codes. With Isambard's CPUs running at their turbo speed all the time, we know that they're drawing less than 175 W and their temperatures are staying below 94°C, no matter what we've run on them so far.
The base clock speed of Isambard's CPUs is 2.1 GHz, so for compute-bound codes, running at the turbo speed gains around 10-15% performance over what we had before.
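For readers who want to observe this kind of sustained-clock behaviour on their own Arm Linux systems, one simple approach is to sample the kernel's cpufreq counters while a workload runs. The sketch below is purely illustrative and not part of the Isambard tooling; it assumes the platform's cpufreq driver exposes scaling_cur_freq, which varies by kernel and driver.

```python
# Minimal sketch: sample per-core clock frequency via the Linux cpufreq sysfs interface.
# Illustrative only; assumes scaling_cur_freq is exposed on the target platform.
import glob
import time

def sample_freqs_mhz():
    freqs = {}
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq")):
        cpu = path.split("/")[5]                  # e.g. "cpu0"
        with open(path) as f:
            freqs[cpu] = int(f.read()) / 1000     # kHz -> MHz
    return freqs

if __name__ == "__main__":
    while True:                                   # run alongside HPL or another workload
        freqs = sample_freqs_mhz()
        if freqs:
            print(f"min {min(freqs.values()):.0f} MHz, max {max(freqs.values()):.0f} MHz")
        time.sleep(5)
```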
At scale, with high node and core counts, most codes become more network-bound. As a result, many of our results level out: Skylake catches up with ThunderX2 on the bandwidth-bound codes, and ThunderX2 catches up with Skylake on the compute-bound codes. GROMACS is striking in this regard: it was the most extreme result on a single node, with a dual-socket node of 28-core Skylake being twice as fast as a dual-socket node of 32-core ThunderX2.
However, at realistic scale, ThunderX2 and Skylake deliver almost identical performance. It's worth bearing in mind that ThunderX2 CPUs are generally available at a fraction of the price of comparable top-bin Skylake CPUs, giving Arm-based ThunderX2 a significant performance-per-dollar advantage, even for compute-bound codes such as GROMACS.
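The levelling-out at scale can be pictured with a simple strong-scaling model (our own illustration, not taken from the CUG paper), where the time on p nodes splits into a compute term and a communication term:

```latex
% T_comp = single-node compute time, T_comm(p) = communication time on p nodes
T(p) \approx \frac{T_{\mathrm{comp}}}{p} + T_{\mathrm{comm}}(p)
```

As p grows, the compute term shrinks while the communication term stays flat or grows, particularly for collective operations, so per-node differences in compute throughput or memory bandwidth matter progressively less.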
Bristol found a couple of minor scaling issues when testing MPI performance over the Aries interconnect for ThunderX2, which appear to be mostly related to collective operations. This wasn’t a surprise, given that Isambard is one of the first Arm-based Cray systems to be deployed at scale.
Cray is working with the Isambard team to identify and fix these MPI performance issues, and we anticipate that the few examples which don't scale quite as we'd expect on ThunderX2 should be resolved soon.
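To give a flavour of the kind of micro-benchmark used to probe collective performance, here is a minimal timing loop for MPI_Allreduce written with mpi4py. This is only a sketch under our own assumptions, not the harness the Isambard team used; it assumes mpi4py and NumPy are available and would be launched with the system's usual MPI launcher.

```python
# Minimal MPI_Allreduce timing sketch (illustrative; not the Isambard benchmark suite).
# Run with, e.g., `mpirun -n <ranks> python allreduce_bench.py`.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for count in (1 << 10, 1 << 15, 1 << 20):         # message sizes in doubles
    send = np.ones(count, dtype=np.float64)
    recv = np.empty(count, dtype=np.float64)

    comm.Barrier()                                 # synchronise before timing
    t0 = MPI.Wtime()
    for _ in range(100):                           # repeat to average out noise
        comm.Allreduce(send, recv, op=MPI.SUM)
    elapsed = (MPI.Wtime() - t0) / 100

    # Report the slowest rank's time, since that is what limits the application
    worst = comm.reduce(elapsed, op=MPI.MAX, root=0)
    if rank == 0:
        print(f"Allreduce {count:>8} doubles: {worst * 1e6:.1f} us")
```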
While the Isambard system has focused on some of the key HPC applications for the UK and EU, as well as on Cray's interconnect, further work is ongoing at Bristol with a 64-node HPE Apollo 70 cluster. The ongoing Catalyst UK project also draws in teams from EPCC in Edinburgh and the University of Leicester, each with their own similarly configured clusters. Working with Arm and partners such as HPE, SUSE, Marvell and Mellanox, the three university sites are each focusing on scientific applications in their chosen fields of interest, with the work driven by their scientists.
Besides further investigation into MPI and interconnect performance, many more applications are being ported and analyzed to see how well they run on these Arm-based ThunderX2 platforms.
Comparing the Isambard results with those in the Catalyst UK whitepaper that the University of Edinburgh recently submitted to the PASC19 conference shows some bridgeable gaps in performance. In similar experiments at similar scale, the Isambard results outshine those from Catalyst, with the likely causes being variations in the interconnect stacks, adapters, and CPU stepping. Performance tuning on the Catalyst systems is now moving to the forefront, and we expect the playing field to level further as that work proceeds.
The Arm HPC User Group (AHUG) is benefiting greatly from all the work being done by Arm-based supercomputer users and partners worldwide.
Arm will be hosting an AHUG workshop next month at ISC19. We hope to see you there.
Please see our dedicated event page for further information on Arm's ISC19 presence.
To learn about the Arm HPC Ecosystem, please visit our Developer page.