Today Arm is happy to announce the launch of our Neoverse V1 and N2 platforms. These platforms represent the continued execution of an Arm Neoverse roadmap first unveiled back in 2018. They also represent the achievement of the performance goals we communicated at that unveiling.
In 2018, we communicated a performance uplift target of +30% for each of the first few generations of Arm Neoverse platforms over their predecessor. With the Neoverse V1 platform, we have achieved a +50% Instructions-Per-Cycle (IPC) performance uplift over Neoverse N1. And with the Neoverse N2 platform, we have achieved a +40% IPC performance uplift over Neoverse N1. So, we have met and exceeded our performance targets.
But which benchmark do we use as a yardstick? Our estimated performance improvements are based on the integer component of the SPEC CPU® 2006 benchmark suite. This blog explains why we chose SPEC CPU® 2006 as the basis for our performance uplift comparison and why this retired suite remains relevant for measuring the performance of Arm-powered infrastructure systems. Additionally, we discuss the “how”: our reasoning and intention for selecting and focusing on specific configurations of the SPEC CPU® 2006 and 2017 suites.
SPEC CPU® is the most popular industry-standard benchmark suite designed to provide a comparative measure of compute-intensive performance across the widest range of hardware, using benchmarks developed from real user applications. It comprises CPU-intensive benchmarks that stress the system’s cores, memory subsystems, and compiler toolchains. It is maintained by the Standard Performance Evaluation Corporation (SPEC) Open Systems Group (OSG) and includes benchmarks to measure integer (SPECint®) and floating-point (SPECfp®) performance.
The SPEC CPU benchmark suites are more than 30 years old, with the first SPEC CPU benchmark suite being announced in 1989. To keep pace with the increasing compute capabilities, silicon technology advancements, software improvements, and emerging applications, the suite is now in its sixth generation. SPEC introduced new versions of SPEC CPU in 1992, 1995, 2000, 2006, and 2017. The latest iteration of this popular benchmark suite is SPEC CPU 2017 and, like previous versions, consists of integer and floating-point benchmarks, each with SPECspeed® and SPECrate® components.
SPEC CPU 2017 is an upgrade over SPEC CPU 2006: it uses a different reference machine and adds new benchmarks, so scores for CPU 2006 cannot be directly compared with CPU 2017 scores.
In order to be more representative of the larger and more complex workloads of the modern era, the applications in the CPU 2017 suite have increased in size and complexity. Compared to CPU 2006, the CPU 2017 suites increased the lines of C++, C, and Fortran code by 2×, 2.5×, and 1.5×, respectively. The dynamic instruction count for CPU 2017 is in the trillions of instructions. The Instructions-Per-Cycle (IPC) achieved by a single core on the workloads of the integer “rate” suite ranges from above 3 instructions per cycle down to 0.7. Many cloud and server applications tend to show IPC signatures on the lower end of the CPU 2017 spectrum.
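For context, IPC is simply retired instructions divided by elapsed core cycles. The snippet below is a minimal illustration of that arithmetic; the counter values are made-up placeholders, not measurements from any Neoverse system.

```python
# Minimal illustration of how IPC is derived from hardware counters.
# The counter values are hypothetical placeholders.
instructions_retired = 1_200_000_000   # e.g. an "instructions" event count
core_cycles = 1_500_000_000            # e.g. a "cycles" event count

ipc = instructions_retired / core_cycles
print(f"IPC = {ipc:.2f}")  # 0.80 -- toward the lower end of the CPU 2017 integer range
```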
Of the ten integer benchmarks in the CPU2017 suite, six have been carried over from CPU2006, albeit with updated algorithms and input sets: perlbench, gcc, mcf, sjeng, omnetpp, and xalancbmk. These benchmarks are still representative of a spectrum of real-world applications (covering programming languages, compilers, XML processing, optimization algorithms, games, and simulators) and remain relevant for performance analysis.
Despite being retired, CPU 2006 is still relevant for evaluating the performance of a loaded system. A first interesting benchmark from CPU 2006, present in all SPEC CPU suites so far, is 403.gcc. It exhibits similar characteristics across all the generations of SPEC CPU suites. 403.gcc has a large code footprint, and its control flow is representative of many real-world applications. It stresses one of the contested resources in the system: the shared system cache. As more cores execute more copies of this workload in the “rate” configuration, less system-level cache capacity is available to each core. This increases the runtime for each core to complete its work: the loss in cache capacity forces the cores to fetch data and instructions from the slower DDR memory subsystem.
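The contention mechanism can be shown with a deliberately simplified sketch; the cache size below is a hypothetical figure, not a Neoverse specification.

```python
# Deliberately simplified view of shared-cache contention under a "rate" run.
# The cache size is hypothetical; real behavior depends on working-set size,
# replacement policy, and the memory subsystem.
SYSTEM_CACHE_MB = 32  # hypothetical shared system-level cache

for copies in (1, 4, 16, 64):
    cache_per_copy = SYSTEM_CACHE_MB / copies
    print(f"{copies:>2} copies -> ~{cache_per_copy:.2f} MB of shared cache per copy")

# Once each copy's share drops below its working set, instructions and data
# spill to the DDR memory subsystem and per-copy runtime grows, which is the
# behavior described above for 403.gcc.
```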
An interesting benchmark that was not carried over from CPU2006 to CPU2017 is 462.libquantum. When compiled without exotic optimizations, this benchmark puts so much stress on the memory subsystem that, typically, a handful of cores executing it in parallel is enough to exhaust all the memory bandwidth available in a system. As a result, the “rate” score for this workload barely increases when more copies are executed once the available memory bandwidth is exhausted: there is simply not enough memory bandwidth for the additional cores to bring in data to process.
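A minimal sketch of that saturation effect is shown below; the bandwidth figures are hypothetical and serve only to show why the “rate” score plateaus.

```python
# Minimal sketch of memory-bandwidth saturation under a "rate" run.
# Bandwidth figures are hypothetical; they only illustrate the plateau.
SYSTEM_BW_GBPS = 150      # hypothetical usable memory bandwidth of the system
PER_COPY_BW_GBPS = 25     # hypothetical demand of one 462.libquantum copy

def relative_rate_score(copies: int) -> float:
    """Throughput relative to one copy, capped by available bandwidth."""
    demanded = copies * PER_COPY_BW_GBPS
    return copies * min(1.0, SYSTEM_BW_GBPS / demanded)

for n in (1, 2, 4, 6, 8, 16):
    print(f"{n:>2} copies -> relative score ~{relative_rate_score(n):.1f}")
# Beyond 6 copies the score stops improving: additional cores cannot bring in
# data any faster because the memory bandwidth is already exhausted.
```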
Since CPU2006 puts more stress on shared system resources, we record lower performance scaling for CPU2006 than for CPU2017 on the same system. For instance, when we compare the score achieved by a single core (“rate=1”) on a high-core-count Neoverse N1 system against the score achieved by one of the cores on the fully loaded system (“rate=n” / n), we see that the per-core score on a loaded system degrades much more quickly for CPU2006 than for CPU2017:
|         | rate=1 | rate=n / n |
|---------|--------|------------|
| CPU2006 | 100%   | 55%        |
| CPU2017 | 100%   | 70%        |
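The per-core scaling figures above are obtained by dividing the per-core score of a fully loaded run by the single-copy score. A small sketch of that arithmetic follows; the raw scores used are hypothetical placeholders chosen only to reproduce the percentages, not measured Neoverse data.

```python
# How the per-core scaling in the table above is computed.
#   rate_1: score of a single copy on an otherwise idle system ("rate=1")
#   rate_n: score of n copies on a fully loaded system ("rate=n")
# The raw scores below are hypothetical placeholders, not measured Neoverse data.
def per_core_scaling(rate_1: float, rate_n: float, n: int) -> float:
    """Per-core score on a loaded system, relative to the single-copy score."""
    return (rate_n / n) / rate_1

n_cores = 64
print(f"CPU2006: {per_core_scaling(rate_1=40.0, rate_n=1408.0, n=n_cores):.0%}")  # 55%
print(f"CPU2017: {per_core_scaling(rate_1=5.0, rate_n=224.0, n=n_cores):.0%}")    # 70%
```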
In summary, SPEC CPU2006 still provides significant value as a well-known, industry-standard benchmark for evaluating overall system performance and performance scaling. However, to keep comparisons against other platforms current, Arm is transitioning its performance estimates to the SPEC CPU2017 suite.
Since most cloud and server applications primarily perform integer computation, we mainly focus on the integer component of the SPEC CPU suite.
The SPEC CPU suite provides two metrics: “speed” and “rate”. When SPECint® is run on a system under test, it computes a “score”: a measure of how much faster a workload completed on the system under test compared with a reference machine (for “speed”), or how much more work was done by the system under test than by the reference machine (for “rate”). For “rate”, the user decides how many copies of the workload are executed, and the score is a measure of throughput computed as the ratio of the execution time on the reference machine to that on the machine under test, multiplied by the number of copies executed.
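As a concrete sketch of the “rate” arithmetic described above, the per-benchmark ratio scales the reference-to-measured time ratio by the number of copies; the overall suite score is then the geometric mean of those per-benchmark ratios. The reference and measured times below are made-up values, not official SPEC reference times.

```python
# Sketch of the per-benchmark SPECrate ratio: the number of copies multiplied
# by the ratio of reference time to measured time. Times are made-up values.
def specrate_ratio(copies: int, reference_time_s: float, measured_time_s: float) -> float:
    return copies * (reference_time_s / measured_time_s)

# One copy finishing in 500 s against a 1000 s reference time:
print(specrate_ratio(copies=1, reference_time_s=1000.0, measured_time_s=500.0))   # 2.0
# 64 copies, each run now taking 800 s on the loaded machine:
print(specrate_ratio(copies=64, reference_time_s=1000.0, measured_time_s=800.0))  # 80.0
```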
Since performance metrics for CPUs targeting server and cloud deployments typically focus on throughput capabilities, our work and our estimates on SPEC CPU focus on “rate”. Hence, even single-core estimates for Arm Neoverse cores are based on a “rate” configuration of SPEC CPU running one copy. Additionally, since “speed” allows auto-parallelization (a single workload in the suite may achieve a higher “speed” score because multiple cores can speed up its execution time), that metric is a better fit for evaluating the performance of a fully integrated SoC and less valuable for measuring the single-thread performance of a core IP, like the ones Arm develops and makes available to infrastructure partners.
As with real workloads, compiler optimizations can play a significant role in the scores achievable by a system under test.
SPEC CPU allows two types of compiler setting:
- “base”: a single, consistent set of optimization flags applied to all benchmarks of a given language in the suite.
- “peak”: individual benchmarks may be built with their own, more aggressive flag combinations and optimizations.
Although users may sometimes want to adopt aggressive compiler configurations to squeeze every ounce of performance from their systems on a particular workload (for example, HPC users and workloads), in our experience most server and cloud users gravitate towards more conservative compiler configurations that balance performance, stability, compatibility, and debuggability. We are fully aware that estimated scores can be higher with “peak”, but our goal for the Arm Neoverse platforms is to deliver the performance most users can benefit from directly; hence, we focus on “base” compiler configurations.
All of Arm’s SPEC CPU estimates are based on open-source, stock compilers. Although we are aware that higher performance is sometimes achievable by using specialized or proprietary compilers and toolchains, our choice is driven by our experience working on real deployments with real cloud customers. Most cloud and server customers use compilers and toolchains based on either gcc (GCC, the GNU Compiler Collection - GNU Project - Free Software Foundation (FSF)) or llvm (The LLVM Compiler Infrastructure Project), and our choices of compilers, toolchains, and compiler configurations match as closely as possible what we see users adopting in the field.
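For illustration only, the contrast between the two tunings might look like the following with GCC; these flag sets are hypothetical examples, not the configurations Arm or its partners actually submit.

```python
# Illustration of the "base" vs "peak" philosophy with GCC-style flags.
# These flag sets are hypothetical examples, not Arm's actual configurations.
base_flags = "-O3"  # one conservative flag set, shared by every benchmark

peak_flags = {      # per-benchmark tuning is permitted under "peak"
    "505.mcf_r": "-Ofast -flto -funroll-loops",
    "525.x264_r": "-Ofast -flto -march=native",
    # ...each benchmark may receive its own aggressive combination
}
```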
The SPEC CPU suites continue to deliver high value to the industry as one of the gold standards for evaluating and comparing performance. However, they represent only a sample of the spectrum of benchmarks and applications Arm uses for performance evaluation and for establishing product landing zones. Since SPEC CPU alone is not enough to capture many of the behaviors that define performance in real deployments, we relentlessly work on augmenting our scope and reach with more workloads that give us more insight into how we can improve our IP and enable our partners’ success. As Arm partners invest and grow in the infrastructure market (server and networking), the portfolio of Arm performance workloads continues to expand to incorporate more and more diverse applications, benchmarks, and use cases.
[CTAToken URL = "www.arm.com/products/silicon-ip-cpu/neoverse" target="_blank" text="Learn more about Neoverse V1 and N2" class ="green"]
References:
HPCA_SPEC17_ShuangSong.pdf (utexas.edu)
SPEC2017_ISPASS18.pdf (arizona.edu)
SPEC CPU® 2017
SPEC CPU® 2006