Moving High Performance Computing (HPC) workloads to the cloud has been a trend for some time, but progress has been slow, held back by high costs and by performance concerns stemming from the lack of parallel file systems and low-latency networking. I co-authored one of the first papers to investigate the overhead of virtualization on HPC applications, and we concluded that virtualized environments imposed unacceptable overhead on performance-critical applications and systems.
But the cloud landscape, and the capabilities it offers, have changed significantly. The allure of “virtually unlimited” resources and zero-wait batch queues is clear, and despite some shortcomings, HPC workloads are beginning to make their way into the cloud. Industry analysts disagree on the size and momentum of this new business model. How fast will cloud adoption in HPC happen, and will this transition cannibalize traditional HPC data centers? At the time of the recent sale of Cray to HPE, the CEO of Cray stated that the impact of cloud had put future business in doubt.
Cloud providers have steadily improved their HPC offerings, which are beginning to experience rapid growth. Microsoft Azure stood up dedicated Cray systems with tightly coupled, low-latency interconnects. AWS's position was that better networking would be deployed across its data centers, keeping the "sea of compute" homogeneous and so avoiding the complexity of workload placement, resource fragmentation, and isolation. AWS acquired Annapurna Labs and quickly made progress on network offload and acceleration with the AWS Nitro System SmartNIC implementation, which frees up expensive compute resources and improves workload performance.
On the parallel storage front, AWS began deploying Lustre images from Whamcloud in the mid-2010s and has done quite well with that offering. Recently, AWS unveiled Amazon FSx for Lustre, a fully managed Lustre offering that uses S3 to store data at rest.
These technological advances are setting the stage to welcome HPC applications to the AWS cloud. Every step up in compute, networking, and I/O performance extends the platform’s applicability to a broader set of HPC workloads, attracting more business.
The remaining obstacle to broad adoption is cost, with the prevailing opinion being that a fully utilized HPC data center must be more cost-effective than outsourcing to the cloud. The breakthrough here is the arrival of the Arm Neoverse-based AWS Graviton2 processor and the Amazon EC2 M6g, C6g, and R6g instance families. With the promise of up to 40% better price-performance than comparable x86-based instances, AWS is tackling the HPC cost-of-cloud concern head-on. Independent experiments are validating those claims on benchmarks as well as on real-world workloads. The numbers do not lie: Arm-based technology is both faster and less expensive than competing x86 systems.
The combination of advances in networking, storage, and compute via Graviton2 makes AWS a desirable platform for HPC applications. At Arm, our HPC team is working with open source and ISV application vendors to study the reality of running HPC in the cloud. We are looking closely and cannot wait to report back on our findings.