We’ve written previously about Arm’s desire to run our business on Arm-based infrastructure, discussing our experience with cloud platforms such as the AWS Graviton family and with on-premises servers such as the HPE RL300, powered by the Ampere® Altra® and Altra Max processors. We are driven to do this not just for strategic reasons but also to take advantage of the price/performance and sustainability benefits that Arm-based infrastructure offers.
Indeed, one other rationale for us to use Arm-based servers is so we can be our own worst customer. We want to ensure that end users have the best experience possible and so we act as early adopters, providing detailed feedback to our product teams and ecosystem partners. But we also act as a rational customer. Where it hasn’t made sense to use Arm architecture designs, we’ve not artificially done so.
In the Electronic Design Automation (EDA) domain there is a broad range of tools and classes of workload, each with differing characteristics and performance demands. RTL simulation makes up a large proportion of our computation; it is relatively undemanding for a CPU to execute and consists of tens or hundreds of thousands of independent simulation jobs. Using Arm Neoverse N1-based CPUs (such as the Ampere Altra and AWS Graviton2), and subsequent Arm Neoverse offerings, Arm has achieved leading performance and perf/$ on these simulation workloads.
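To make the shape of that workload concrete, here is a minimal sketch in C. The compiled simulation executable ./simv and the +TESTNAME arguments are placeholders rather than any real tool; the point is that each job is a self-contained process, so regression throughput scales almost linearly with the number of cores available.

/*
 * Minimal sketch: RTL regression runs are embarrassingly parallel.
 * "./simv" and the +TESTNAME arguments are placeholders, not a real tool.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_TESTS 8   /* in practice: tens or hundreds of thousands of jobs */

int main(void) {
    for (int i = 0; i < NUM_TESTS; i++) {
        pid_t pid = fork();
        if (pid == 0) {                       /* child: run one simulation */
            char test[64];
            snprintf(test, sizeof(test), "+TESTNAME=smoke_%d", i);
            execl("./simv", "simv", test, (char *)NULL);
            _exit(127);                       /* exec failed */
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
    }
    while (wait(NULL) > 0)                    /* reap all completed jobs */
        ;
    return 0;
}

In production the fan-out is handled by a batch scheduler across an entire compute farm rather than by a single host, but it is the independence of the jobs that lets throughput track core count so directly.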
In the early generations of Arm servers, other platforms offered higher per-core performance, but Arm-based platforms were often still the optimal choice because we could run more cores for less: less money, fewer racks and lower power consumption. By leveraging those benefits, we enabled our engineers to achieve faster turnaround of simulation jobs while reducing their need for compute resources.
That performance gap has closed with each generation of Arm core, to the point where the latest Arm cores match their competitors even when running at a much lower clock frequency (and therefore lower power consumption). As an end user ourselves, Arm has been excited to see recent launches from cloud partners of CPUs such as the AWS Graviton4 and Google Axion, both based on the Arm Neoverse V2 core. As well as accelerating the workloads we had already moved to Arm-based platforms, the improved per-core performance of these CPUs and their ability to execute floating-point computation rapidly provide an environment well suited to running the remaining workload types in our mix of EDA tools.
Arm has been working with EDA software vendors for some time to help port these more demanding tools, such as those needed for silicon implementation. We hope to see our partners release the fruits of this project in the coming months, allowing us to run the full chip design process in the cloud.
However, Arm runs a hybrid compute estate, with significant core counts hosted in our own datacentres alongside cloud-based compute. We have tens of thousands of Arm cores in our on-premises High Performance Compute (HPC) estate, allowing us to complete more RTL simulations per night than we could if we had deployed non-Arm servers.
Those on-prem cores, however, are primarily based on the Neoverse N1. While they can functionally execute the newer, more demanding tools, they are not best suited to doing so. With servers based on the NVIDIA Grace CPU Superchip, we can now deploy Arm compute with class-leading performance into our own datacentres, giving us access on-prem to the same Neoverse V2 cores we use in AWS or Google Cloud.
Implementation workloads have massive amounts of data associated with them, such as descriptions of the physical characteristics of a particular foundry process and the timing of electrical signals around the design. Once a project has started in a given location, it can be hard to move, so it often makes sense to “bring compute to the data” rather than the other way around. With the ability to deploy high-performance Arm compute to every location, on-prem projects are no longer stranded on legacy compute architectures. Similarly, projects already running on Arm in the cloud can be repatriated to our on-prem datacentres if the situation requires it.
Because all three platforms share the same Neoverse V2 core, software vendors don’t need to optimise their code for multiple targets: optimising once for Neoverse V2 delivers the benefits across providers.
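As a rough illustration (the kernel below is a stand-in, not code from any EDA tool, and the build line simply uses the Neoverse V2 target flag available in recent GCC and Clang releases), a single build can serve Graviton4, Axion and Grace alike:

/*
 * Build once, run on any Neoverse V2 platform, e.g.:
 *   gcc -O3 -mcpu=neoverse-v2 -c saxpy.c
 * The same flag is understood by recent Clang releases; -mcpu=native gives
 * the equivalent result when compiling on the target machine itself.
 */
#include <stddef.h>

/* With -mcpu=neoverse-v2 the compiler can auto-vectorise this loop for the
   core's NEON/SVE2 units; no per-cloud-provider variants are required. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}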
With these advantages in mind, Arm is starting to integrate NVIDIA Grace CPU-based compute into our internal systems.
Although the NVIDIA GH200 Grace Hopper Superchip will be an outstanding platform for accelerated workloads, few EDA tools today are designed to take advantage of a GPU. Because of this, Arm has chosen to deploy servers using the NVIDIA Grace CPU Superchip, a single module consisting of two NVIDIA Grace CPUs and no GPU. A single Grace CPU Superchip delivers 144 Arm Neoverse V2 cores connected by the high-bandwidth NVIDIA Scalable Coherency Fabric, with up to 960GB of LPDDR5X memory, in a compact module requiring only 500W of total power (compared with 900W or more for an equivalent x86 server). Deployed in the Supermicro ARS-121L-DNR, which houses a pair of Grace CPU Superchips, this gives us a density of 288 cores per rack unit in an air-cooled system.
The NVIDIA Grace CPU Superchip module contains the system memory as well as the processors, and its use of LPDDR5X RAM keeps power consumption to a minimum. EDA workloads are notoriously memory-hungry, so Arm has chosen to deploy the high-memory variant, which provides 960GB of capacity per Superchip and 768 GB/s of memory bandwidth while using only around 16W per chip, roughly 20% of the power required to achieve similar bandwidth from traditional DDR5-based memory.
This deployment is on order, but we have been testing an early-access system with good results. We ran an RTL simulation workload used for testing one of our core designs on the NVIDIA Grace server and on the existing compute servers installed in our HPC datacentres. While the x86 silicon is not the most recent generation, it is what our internal engineering teams have access to for their simulation work, so it represents the real user experience at Arm. The throughput of the NVIDIA Grace system is more than 50% higher than that of our most recent x86 platform (AMD EPYC “Milan-X” 7773X) and four times that of our most recent Intel Xeon “Skylake”-based servers.
Combined with the expected power savings, this will allow us to increase the density of servers per rack, leading to a significant uplift in our datacentre productivity.
When our deployment is in production next year and our EDA software partners have released the next set of tooling, we hope to report again on our progress and findings.