Breaking down Arm Neoverse performance leadership

Andrea Pellegrini
April 27, 2021
5 minute read time.


Co-authors: Andrea Pellegrini, Distinguished Engineer, Arm Infrastructure Line of Business; Steve Demski, Marketing Manager, Hyperscale and HPC, Arm Infrastructure Line of Business.

One of the fundamental measures of goodness for a computer system is performance. This is often defined as the ratio between the amount of work a machine completes and the time needed to complete it. For server and networking systems, this is usually a measure of throughput: for example, how many e-commerce transactions per second can a system handle? This definition is often too simple for real-world scenarios, since systems are typically required to complete work within a certain latency, and a more complete performance evaluation measures how much throughput a system achieves within the bounds of a service level agreement (SLA). For the sake of simplicity, here we ignore latency constraints and focus only on throughput-based performance metrics, such as those produced by benchmarks like SPEC CPU2017 “rate”. For more details on Arm's use of SPEC CPU, see our companion blog, The how and why of SPEC CPU estimates for Arm Neoverse cores and reference designs.
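
As a minimal illustration of the distinction above, the short Python sketch below computes raw throughput and SLA-bounded throughput for a hypothetical batch of transactions. All numbers are made up for illustration and do not come from any measured system.

    # Raw throughput vs. throughput counted only for work finished within a
    # latency bound (SLA). Latencies and elapsed time are illustrative values.
    completion_times_ms = [12, 15, 9, 48, 11, 14, 102, 10]  # per-transaction latencies
    elapsed_s = 0.221                                        # wall-clock time for the batch

    raw_throughput = len(completion_times_ms) / elapsed_s

    sla_ms = 50
    within_sla = sum(1 for t in completion_times_ms if t <= sla_ms)
    sla_throughput = within_sla / elapsed_s

    print(f"raw throughput:              {raw_throughput:.1f} transactions/s")
    print(f"throughput within {sla_ms} ms SLA: {sla_throughput:.1f} transactions/s")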

Three metrics are paramount when we evaluate the performance of a computer system in a datacenter (a short sketch following the list shows how they can be computed from per-thread measurements):

  • Performance per socket: this is a measure of throughput on a system under test. Although traditional server systems often adopt dual-socket configurations, it is common to normalize estimates to a single socket. This metric helps cloud vendors and OEMs estimate how much performance can be delivered by a server rack or a node, and it is often a primary input to Total Cost of Ownership (TCO) models.
  • Performance per thread: this metric represents how much performance each hardware (HW) thread contributes to the overall score and is computed as the ratio of the performance per socket to the number of active HW threads. It measures how much performance customers can achieve on a single thread while the system is fully loaded. This is of paramount importance to cloud computing users, since it determines how much performance they can extract per unit of price.
  • Performance variability per thread: reducing performance variability is important for all users, and even more so for cloud customers who lease compute resources on machines that may be shared between multiple tenants. The lower the performance variability, the more consistent and predictable the performance a customer can achieve on a given workload. This, in turn, makes it easier for the customer to provision and budget cloud resources.
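
To make these three definitions concrete, here is a minimal Python sketch that derives them from a set of hypothetical per-thread benchmark scores on a fully loaded single-socket system. The scores, and the use of the coefficient of variation as the variability measure, are illustrative assumptions, not Arm data.

    # Derive the three datacenter metrics from hypothetical per-thread scores.
    from statistics import mean, pstdev

    per_thread_scores = [9.8, 10.1, 9.9, 10.0, 9.7, 10.2, 9.9, 10.1]  # one score per HW thread

    perf_per_socket = sum(per_thread_scores)                    # aggregate socket throughput
    perf_per_thread = perf_per_socket / len(per_thread_scores)  # per-socket score / active threads

    # Variability expressed as the coefficient of variation: std-dev relative to the mean.
    variability = pstdev(per_thread_scores) / mean(per_thread_scores)

    print(f"performance per socket: {perf_per_socket:.1f}")
    print(f"performance per thread: {perf_per_thread:.2f}")
    print(f"per-thread variability: {variability:.1%}")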

With all flavors of cloud computing (public, private, hybrid) becoming the standard for IT services delivery, let us look at each of these three metrics in the context of available, modern CPUs. 

With some exceptions, the most common type of cloud CPU combines a high core count with Simultaneous Multi-Threading (SMT) and, often, some amount of “Turbo” capability. Depending on the legacy vendor, these CPUs can score well on per-socket performance or on per-thread performance, but they rarely perform well on both simultaneously. Additionally, performance variability per thread can vary widely, depending on “noisy neighbors”, simultaneous threads competing for core resources, and the inherent unpredictability of “Turbo” modes.

With the Arm Neoverse family of CPU platforms we believe we offer a better cloud computing solution – both to the cloud operators and to their customers.

  1. Delivering superior performance per-socket:
    • Simply stated, because of Arm’s power efficiency, CPU designers can pack more full Arm cores into a given TDP than traditional architectures can pack threads. With each Arm core offering more performance than a traditional thread, we expect Arm Neoverse to offer clearly superior performance per socket.
  2. Delivering superior performance per-thread:
    • Because Arm Neoverse CPUs feature high-IPC designs and large private caches, and do not use SMT, a full Arm core outperforms a traditional SMT thread for most workloads. And with the launch of the Arm Neoverse V1 and N2 cores, we expect this core-to-thread performance delta to become even more significant.
  3. Limiting or eliminating per thread performance variability:
    • Unlike traditional CPU architectures, Arm Neoverse does not use SMT for CPUs targeting cloud computing use cases. Arm Neoverse can achieve this because its industry-leading power efficiency allows traditional SMT threads to be replaced with fully fledged Neoverse cores.
    • Unlike traditional CPU architectures, Arm Neoverse-based CPUs do not rely on extreme Turbo core frequencies to sporadically deliver much higher per-thread performance for specific workloads under the right circumstances. Some Arm partners develop Turbo capabilities to extract more performance when possible, but Arm Neoverse platforms are engineered to reach high per-thread performance through high-performance micro-architectures and by giving each thread full access to a core's resources and L2 cache.

These advantages are illustrated graphically in Figure 1.
[Figure 1: plot of per-socket throughput (x-axis) versus per-thread performance (y-axis)]

Figure 1: Performance projection of Arm Neoverse vs. traditional CPU architectures, based on an industry-standard integer benchmark. Performance per socket is plotted along the X-axis, and performance per thread along the Y-axis. Designs that achieve the best scores on these two metrics land in the top-right portion of the graph. For the sake of simplicity, the third metric, performance variability per thread, is not shown on this graph.

By plotting performance per socket and performance per thread on an X-Y graph, we can compare our designs against competing parts within comparable silicon area and Thermal Design Power (TDP) envelopes.
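
As a sketch of this comparison method, the following Python snippet (using matplotlib) places a few hypothetical designs on such an X-Y graph. The design names and coordinates are placeholders chosen for illustration, not the Arm projections shown in Figure 1.

    # Place hypothetical designs on a per-socket vs. per-thread performance graph.
    import matplotlib.pyplot as plt

    designs = {
        "Design A (SMT, high core count)": (300, 3.0),
        "Design B (SMT, high frequency)":  (220, 4.2),
        "Hypothetical non-SMT design":     (320, 5.0),
    }

    for name, (per_socket, per_thread) in designs.items():
        plt.scatter(per_socket, per_thread, label=name)

    plt.xlabel("Performance per socket (aggregate throughput)")
    plt.ylabel("Performance per thread")
    plt.title("Per-socket vs. per-thread performance (illustrative)")
    plt.legend()
    plt.show()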

When we look at the evolution of Arm Neoverse platforms in this light, it becomes clear that the Neoverse N1 platform is still a leader in performance per thread on typical cloud instances. The projections listed here are based on a simulated 64-core Neoverse N1 system, but higher core-count Neoverse N1 systems are available on the market and can push aggregate performance per socket even further. The two new products we are launching today, Neoverse V1 and Neoverse N2, provide two different ways to improve on both of these metrics, enabling Arm partners to extend their lead in the market.

Why is achieving both high performance per socket and high performance per thread important? For cloud operators, a higher core count means they can host more customers per system and amortize cost over more users. That is a dual benefit: more revenue, less cost. The same is true for cloud customers: they benefit from predictable, scalable performance, getting exactly what they pay for, and from the lower-cost economics of Arm Neoverse.

Today we are launching the Arm Neoverse V1 and Neoverse N2 platforms, and we expect to see Arm partner silicon in the market by the end of this year. We are excited to see how Arm's partners turn this innovation and performance into solutions built for the HPC, cloud, networking, edge, and 5G markets.

Learn more about Neoverse V1 and N2
