Breaking down Arm Neoverse performance leadership

April 27, 2021


Co-authors: Andrea Pellegrini (Distinguished Engineer, Arm Infrastructure Line of Business) and Steve Demski (Marketing Manager, Hyperscale and HPC, Arm Infrastructure Line of Business).

One of the fundamental measures of a computer system's quality is performance, often defined as the ratio between the amount of work the machine completes and the time needed to complete it. For server and networking systems this is usually a measure of throughput, for example, “How many e-commerce transactions per second can a system handle?” This definition is often too simple for real-world scenarios, because systems are typically required to complete work within a certain latency. A more complete performance evaluation measures how much throughput a system achieves within the bounds of a service level agreement (SLA). For the sake of simplicity, here we ignore latency constraints and focus only on throughput-based performance metrics, such as those produced by benchmarks like SPEC CPU2017 “rate”. For more details on Arm's use of SPEC CPU, see our companion blog, The how and why of SPEC CPU estimates for Arm Neoverse cores and reference designs.
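Although the rest of this post sets latency aside, the two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a benchmark harness: the function names, the `(throughput, latency)` sample format, and the numbers in the load sweep are all hypothetical.

```python
from typing import Iterable, Optional, Tuple

# One sample per load level: (throughput in transactions/s, latency in ms).
# The sample format is an assumption made for this sketch.
Sample = Tuple[float, float]


def throughput(work_units: float, seconds: float) -> float:
    """Performance as work completed divided by the time taken."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return work_units / seconds


def max_throughput_within_sla(
    samples: Iterable[Sample], sla_latency_ms: float
) -> Optional[float]:
    """Highest measured throughput whose latency stays within the SLA.

    Returns None when no load level meets the latency bound.
    """
    eligible = [tput for tput, latency in samples if latency <= sla_latency_ms]
    return max(eligible) if eligible else None


# Hypothetical load sweep; the numbers are illustrative, not measurements.
sweep = [(1000.0, 5.0), (2000.0, 9.0), (3000.0, 40.0)]
print(max_throughput_within_sla(sweep, sla_latency_ms=10.0))
```

With a 10 ms bound, the third load level is excluded even though it has the highest raw throughput, which is the gap between the simple and the SLA-aware definition.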

Three metrics are paramount when we evaluate performance of a computer system in a datacenter:

- Performance per socket: this is a measurement of throughput on a system under test. Although traditional server systems often adopt dual-socket configurations, it is common to normalize estimates to a single socket. Cloud vendors and OEMs use this metric to estimate how much performance can be delivered on a server rack or a node, and it is often a primary input to Total Cost of Ownership (TCO) models.
- Performance per thread: this metric represents how much performance each hardware (HW) thread contributes to the overall score. It is computed as the ratio of the performance per socket to the number of active HW threads, and it measures how much performance customers can achieve on a single thread while the system is fully loaded. This is of paramount importance to cloud computing users because it determines how much performance they can extract per unit of price.
- Performance variability per thread: reducing performance variability is important for all users, and even more so for cloud customers who lease compute resources on machines that might be shared between multiple tenants. The lower the performance variability, the more consistent and predictable the performance a customer can achieve on a given workload. This, in turn, allows the customer to provision and budget cloud resources more easily.
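As a rough sketch, the three metrics can be computed as follows. The per-socket and per-thread ratios follow the definitions above. The post does not specify how variability is quantified, so the coefficient of variation used here is an assumption of this example, as are the function names.

```python
from statistics import mean, pstdev
from typing import Sequence


def perf_per_socket(system_score: float, sockets: int) -> float:
    """Normalize a whole-system throughput score to a single socket."""
    return system_score / sockets


def perf_per_thread(socket_score: float, active_hw_threads: int) -> float:
    """Per-socket score divided by the number of active HW threads."""
    return socket_score / active_hw_threads


def variability_per_thread(thread_scores: Sequence[float]) -> float:
    """Coefficient of variation (population std dev / mean) across threads.

    Assumption: this is one reasonable way to express variability;
    the post itself does not define a formula.
    """
    return pstdev(thread_scores) / mean(thread_scores)


# Hypothetical dual-socket system with 64 active HW threads per socket.
socket_score = perf_per_socket(system_score=400.0, sockets=2)
print(socket_score, perf_per_thread(socket_score, active_hw_threads=64))
print(variability_per_thread([8.0, 12.0]))
```

A value of zero from `variability_per_thread` means every thread delivered the same score; larger values indicate less predictable per-thread performance.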

With all flavors of cloud computing (public, private, hybrid) becoming the standard for IT services delivery, let us look at each of these three metrics in the context of available, modern CPUs.

With some exceptions, the most common type of cloud CPU uses a high core count in combination with Simultaneous Multi-Threading (SMT) and often some amount of “Turbo” capability. Depending on the legacy vendor, these CPUs can score well on per-socket performance or on per-thread performance, but they rarely perform well on both simultaneously. Additionally, per-thread performance can vary widely, depending on “noisy neighbors”, simultaneous threads competing for core resources, and the inherent unpredictability of “Turbo” modes.

With the Arm Neoverse family of CPU platforms we believe we offer a better cloud computing solution – both to the cloud operators and to their customers.

Delivering superior performance per-socket:
- Simply stated, because of Arm’s power efficiency, CPU designers can pack more full Arm cores into a given TDP than traditional architectures can pack threads. With Arm cores offering more performance than traditional threads, we expect Arm Neoverse to offer clearly superior performance per-socket.
Delivering superior performance per-thread:
- Because Arm Neoverse CPUs feature high-IPC designs and large private caches and do not use SMT, a full Arm core will outperform a traditional SMT thread on most workloads. With the launch of our Arm Neoverse V1 and N2 cores, we expect this core-to-thread performance delta to become even more significant.
Limiting or eliminating per thread performance variability:
- Unlike traditional CPU architectures, Arm Neoverse does not use SMT for CPUs targeting cloud computing use cases. This is possible because our industry-leading power efficiency lets designers replace traditional SMT threads with fully fledged Neoverse cores.
- Unlike traditional CPU architectures, Arm Neoverse-based CPUs do not rely on extreme Turbo core frequencies to sporadically deliver much higher performance per thread for specific workloads under the right circumstances. Some Arm partners develop Turbo capabilities to extract more performance when possible, but Arm Neoverse platforms are engineered to reach high per-thread performance through high-performance micro-architectures and by giving each thread full access to a core and its L2 cache resources.

These advantages are illustrated graphically in Figure 1.
A plot of per-socket throughput (x-axis) and per-thread performance (y-axis)

Figure 1: Performance projection of Arm Neoverse vs. traditional CPU architectures based on an industry-standard integer benchmark. Performance per socket is plotted along the X-axis, and performance per thread is plotted along the Y-axis. Designs that achieve the best scores on these two metrics land in the top-right portion of the graph. For the sake of simplicity, we do not show the third metric, performance variability per thread, on this graph.

By plotting performance per socket and performance per thread on an X-Y graph, we can compare our designs with competing parts within comparable silicon area and Thermal Design Power (TDP) envelopes.
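One way to make "top right" precise is to ask which designs are not beaten on both axes by any other design. The sketch below is an assumption of this example rather than the method used to produce Figure 1, and the design names and coordinates are hypothetical placeholders, not projections for any real product.

```python
from typing import Dict, List, Tuple

# (performance per socket, performance per thread), in arbitrary units.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def top_right_designs(designs: Dict[str, Point]) -> List[str]:
    """Names of designs that no other design dominates, sorted by name."""
    return sorted(
        name
        for name, point in designs.items()
        if not any(dominates(other, point) for other in designs.values())
    )


# Hypothetical placeholder values, for illustration only.
designs = {
    "design_a": (1.0, 1.0),
    "design_b": (1.5, 0.8),
    "design_c": (1.4, 1.2),
    "design_d": (0.9, 0.9),
}
print(top_right_designs(designs))
```

Here `design_b` and `design_c` both remain: one leads per socket and the other per thread, so neither is strictly better than the other. A design that leads on both axes at once would be the only name returned.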

When we look at the evolution of Arm Neoverse platforms in this light, it becomes clear that the Neoverse N1 platform is still a leader in performance per thread on typical cloud instances. Here we list our projections based on a simulated 64-core Neoverse N1 system, but higher-core-count Neoverse N1 systems are available on the market and can push aggregate performance per socket further. The two new products we launch today, Neoverse V1 and Neoverse N2, provide two different ways to improve on both of these metrics, enabling Arm partners to extend their market lead on them.

Why is achieving both high performance per socket and high performance per thread important? For cloud operators, a higher core count means they can host more customers per system and amortize cost over more users. That is a dual benefit: more revenue, less cost. The cloud customer benefits as well, both from predictable, scalable performance (getting exactly what they pay for) and from the lower-cost economics of Arm Neoverse.

Today we are launching the Arm Neoverse V1 and Neoverse N2 platforms, and we expect to see Arm partner silicon in the market by the end of this year. We are excited to see how Arm’s partners turn this innovation and performance into solutions built for HPC, cloud, networking, edge, and 5G markets.

Learn more about Neoverse V1 and N2
