Co-authored by Darren Cepulis, HPC Segment Manager, Arm, and Simon McIntosh-Smith, Professor of HPC at the University of Bristol.
At the CUG’19 event in Montreal, Canada this week, Simon McIntosh-Smith provided an at-scale performance update on the Isambard supercomputer deployed by the GW4 Alliance and the Met Office.
Isambard is a 168-node HPC cluster based on the Cray XC50 system design, built around Arm-based Marvell ThunderX2 CPUs and Cray's Aries high-speed interconnect.
The CUG presentation and associated paper provide an early in-depth view of performance data for at-scale applications running on the Arm-based Marvell ThunderX2. Here we provide a recap of some of the data and salient points presented.
In a nutshell, the good news is that ThunderX2 B2 silicon is scaling similarly to Skylake when both are using Cray’s Aries interconnect. This is what we expected, but the Isambard project has now provided the evidence to confirm the hypothesis.
In Isambard, the new B2 silicon ThunderX2s are running at their turbo clock speed of 2.5 GHz all the time, even when running HPL. This is in contrast to most experiences with modern x86 CPUs, which tend to downclock when running intensive codes. With Isambard's CPUs running at their turbo speed all the time, we know that they're drawing less than 175 W and their temperatures are staying below 94°C, no matter what we've run on them so far.
The base clock speed of Isambard's CPUs is 2.1 GHz, so for compute-bound codes, running at the turbo speed gains around 10-15% performance over what we had before.
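For readers who want to observe this kind of sustained-clock behaviour on their own Arm Linux systems, one simple approach is to sample the kernel's cpufreq counters while a workload runs. The sketch below is purely illustrative and not part of the Isambard tooling; it assumes the platform's cpufreq driver exposes scaling_cur_freq, which varies by kernel and driver.

```python
# Minimal sketch: sample per-core clock frequency via the Linux cpufreq sysfs interface.
# Illustrative only; assumes scaling_cur_freq is exposed on the target platform.
import glob
import time

def sample_freqs_mhz():
    freqs = {}
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq")):
        cpu = path.split("/")[5]                  # e.g. "cpu0"
        with open(path) as f:
            freqs[cpu] = int(f.read()) / 1000     # kHz -> MHz
    return freqs

if __name__ == "__main__":
    while True:                                   # run alongside HPL or another workload
        freqs = sample_freqs_mhz()
        if freqs:
            print(f"min {min(freqs.values()):.0f} MHz, max {max(freqs.values()):.0f} MHz")
        time.sleep(5)
```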
At scale, with high node and core counts, most codes become more network-bound. As a result, many of our results level out: Skylake catches up with ThunderX2 on the bandwidth-bound codes, and ThunderX2 catches up with Skylake on the compute-bound codes. GROMACS is striking in this regard: it was the most extreme result on a single node, with a dual-socket node of 28-core Skylake being twice as fast as a dual-socket node of 32-core ThunderX2.
However, at realistic scale, ThunderX2 and Skylake deliver almost identical performance. It's worth bearing in mind that ThunderX2 CPUs are generally available at a fraction of the price of comparable top-bin Skylake CPUs, giving Arm-based ThunderX2 a significant performance-per-dollar advantage, even for compute-bound codes such as GROMACS.
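The levelling-out at scale can be pictured with a simple strong-scaling model (our own illustration, not taken from the CUG paper), where the time on p nodes splits into a compute term and a communication term:

```latex
% T_comp = single-node compute time, T_comm(p) = communication time on p nodes
T(p) \approx \frac{T_{\mathrm{comp}}}{p} + T_{\mathrm{comm}}(p)
```

As p grows, the compute term shrinks while the communication term stays flat or grows, particularly for collective operations, so per-node differences in compute throughput or memory bandwidth matter progressively less.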
Bristol found a couple of minor scaling issues when testing MPI performance over the Aries interconnect for ThunderX2, which appear to be mostly related to collective operations. This wasn’t a surprise, given that Isambard is one of the first Arm-based Cray systems to be deployed at scale.
Cray is working with the Isambard team to identify and fix these MPI performance issues, and we anticipate that the few examples which don't scale quite as we'd expect on ThunderX2 should be resolved soon.
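To give a flavour of the kind of micro-benchmark used to probe collective performance, here is a minimal timing loop for MPI_Allreduce written with mpi4py. This is only a sketch under our own assumptions, not the harness the Isambard team used; it assumes mpi4py and NumPy are available and would be launched with the system's usual MPI launcher.

```python
# Minimal MPI_Allreduce timing sketch (illustrative; not the Isambard benchmark suite).
# Run with, e.g., `mpirun -n <ranks> python allreduce_bench.py`.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for count in (1 << 10, 1 << 15, 1 << 20):         # message sizes in doubles
    send = np.ones(count, dtype=np.float64)
    recv = np.empty(count, dtype=np.float64)

    comm.Barrier()                                 # synchronise before timing
    t0 = MPI.Wtime()
    for _ in range(100):                           # repeat to average out noise
        comm.Allreduce(send, recv, op=MPI.SUM)
    elapsed = (MPI.Wtime() - t0) / 100

    # Report the slowest rank's time, since that is what limits the application
    worst = comm.reduce(elapsed, op=MPI.MAX, root=0)
    if rank == 0:
        print(f"Allreduce {count:>8} doubles: {worst * 1e6:.1f} us")
```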
While the Isambard system has focused on some of the key HPC applications for the UK and EU, as well as on Cray's interconnect, further work is ongoing at Bristol with a 64-node HPE Apollo 70 cluster. The ongoing Catalyst UK project also draws in teams from EPCC in Edinburgh and the University of Leicester, each with their own similarly configured clusters. Working with Arm and partners such as HPE, SUSE, Marvell and Mellanox, the three university sites are each focusing on scientific applications in their chosen fields of interest, with the work driven by their scientists.
Besides further investigation into MPI and interconnect performance, many more applications are being ported and analyzed to see how well they run on these Arm-based ThunderX2 platforms.
Comparing the Isambard results with those in the Catalyst UK whitepaper that the University of Edinburgh recently submitted to the PASC19 conference shows some bridgeable gaps in performance. In similar experiments at similar scale, the Isambard results outshine those from Catalyst, with the likely causes being variations in the interconnect stacks, adapters, and CPU stepping. Performance tuning on the Catalyst systems is now moving to the forefront, and we expect the playing field to level further as that work proceeds.
The Arm HPC User Group (AHUG) is benefiting greatly from all the work being done by Arm-based supercomputer users and partners worldwide.
Arm will be hosting an AHUG workshop next month at ISC19. We hope to see you there.
Please see our dedicated event page for further information on Arm's ISC19 presence.
To learn about the Arm HPC Ecosystem, please visit our Developer page.