Processors based on Arm technology have come a very long way: over 180 billion chips have shipped, for applications ranging from embedded devices, smartphones, and laptops to cloud servers and now the world's fastest supercomputer. A notable success has been the deployment of the Neoverse N1 platform, consisting of the N1 CPU and the CMN-600 mesh interconnect. This was the first in a line of CPUs designed exclusively by Arm to meet the aggressive requirements of the infrastructure server and networking markets. The Neoverse N1 platform has performed extremely well in the market, delivering highly competitive single-thread performance at low cost, along with the scalability and power efficiency to enable best-in-class SoCs for deployments from the cloud to the edge.
Today, I have the pleasure of introducing a new performance tier – the Neoverse V1 platform, comprising the Neoverse V1 CPU, the CMN-700 mesh interconnect, and supporting system IP, designed to deliver the highest performance of any Arm-designed core. This is achieved by a radical redesign of the CPU micro-architecture using wider and deeper pipelines, and by supporting a 2x256-bit vector unit executing Scalable Vector Extension (SVE) instructions, with support for the new bfloat16 data type for AI/ML-assisted workloads. The impact of these enhancements is evident in improved benchmark scores but, more importantly, in speedups on real server and HPC workloads that demand very high per-core performance. The Neoverse V1 platform is also extremely flexible, enabling multi-chiplet and multi-socket solutions with best-in-class DDR5/HBM3 memory, PCIe5 IO, and CXL 2.0-attached memory or coherent accelerators.
While inheriting all the features of Armv8.2, on which the Neoverse N1 platform was based, Neoverse V1 implements several new capabilities from Armv8.3 through Armv8.6 to improve scalability, security, and performance. Some key scalability features from this list of architectural enhancements are explained below.
MPAM, or Memory System Resource Partitioning and Monitoring, is a mechanism to allocate, control, and monitor the utilization of shared resources (such as caches and memory bandwidth) using a CPU-assigned Partition ID attached to each request made into the system. MPAM controls access to each shared resource based on the capacity allocated to a Partition ID, while performance monitoring groups track how much each partition actually consumes. The design blocks owning the shared resources (e.g., the memory controller) provide the necessary feedback on resource utilization to software, making MPAM an excellent end-to-end feedback mechanism for achieving the fairness goals established by the system designer.
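The partition-and-monitor idea can be sketched in a few lines of Python. This is purely illustrative – the class and method names are invented for the sketch, and real MPAM controls live in hardware resource controllers, not software:

```python
# Illustrative sketch of MPAM-style capacity partitioning.
# All names here are hypothetical; real MPAM controls are implemented
# in hardware blocks such as cache slices and memory controllers.
class PartitionedCache:
    def __init__(self, total_ways):
        self.total_ways = total_ways
        self.allocation = {}   # Partition ID -> ways this partition may use
        self.usage = {}        # Partition ID -> ways currently occupied

    def set_allocation(self, part_id, ways):
        # Software (e.g., a hypervisor) assigns capacity per Partition ID.
        self.allocation[part_id] = ways

    def request_fill(self, part_id):
        # Every request carries its CPU-assigned Partition ID; a new fill
        # is granted only while the partition is within its allocation.
        used = self.usage.get(part_id, 0)
        if used < self.allocation.get(part_id, 0):
            self.usage[part_id] = used + 1
            return True
        return False  # over quota: the partition must evict its own lines

    def monitor(self, part_id):
        # Monitoring feedback to software: (current usage, allocation).
        return self.usage.get(part_id, 0), self.allocation.get(part_id, 0)
```

The key point the sketch captures is the end-to-end loop: software sets an allocation, hardware enforces it per request, and monitoring counters feed utilization back to software.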
The purpose of the CBusy feature is to regulate CPU traffic in a large system with shared resources. Without this mechanism, a CPU could potentially hog resources by issuing a stream of transactions – for example, an aggressive prefetch algorithm could flood memory with traffic, stealing bandwidth from normal load/store traffic. With CBusy, the CPU gathers feedback from the system and tunes or throttles back its rate of issuing transactions according to the 'busyness' of the system. The CBusy mechanism has been benchmarked to provide as much as a 15% improvement in overall system performance in our reference designs.
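A minimal model of this feedback loop is sketched below. The function name, thresholds, and step sizes are all invented for illustration – the real mechanism is a hardware field returned on interconnect responses – but the shape of the policy is the same: back off hard when the system reports congestion, ramp up when it is idle:

```python
# Hypothetical model of CBusy-style regulation: the interconnect returns
# a "busyness" level with each response, and the CPU scales the number of
# speculative (e.g., prefetch) transactions it keeps in flight.
def adjust_outstanding(current_limit, cbusy_level,
                       min_limit=4, max_limit=96):
    if cbusy_level >= 3:            # system heavily loaded: throttle back
        return max(min_limit, current_limit // 2)
    if cbusy_level == 0:            # system idle: ramp back up to the cap
        return min(max_limit, current_limit + 8)
    return current_limit            # moderate load: hold steady
```

Demand loads would typically be exempt from this throttling; it is the speculative traffic that gets reined in when the system signals congestion.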
Neoverse V1 also implements a series of performance features. Nested Virtualization support enables nested hypervisors – i.e., a guest hypervisor, with its own virtual machines, running under the control of a host hypervisor. This enables new deployment use cases such as on-premise or hybrid cloud environments.
The new Int8, bfloat16, and complex-number data types, supported with dot-product and matrix multiply-accumulate instructions, accelerate ML and scientific calculations. Data-gathering hints ensure that writes to IO devices are fully optimized by signaling to the end device that the gathering of stores is complete and shipping of the packet to the device can begin. This is a lightweight mechanism compared to a full-blown synchronization barrier.
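To see why bfloat16 is attractive for ML, note that it is simply the top 16 bits of an IEEE float32 – same sign and 8-bit exponent, but only 7 mantissa bits – so conversion is cheap and the dynamic range of float32 is preserved. The sketch below models a bfloat16 dot product with float32 accumulation, the pattern the dot-product instructions implement in hardware (this sketch truncates on conversion for simplicity; hardware conversion typically rounds):

```python
import struct

def to_bfloat16_bits(x):
    # bfloat16 keeps the sign, the full 8-bit exponent, and the top 7
    # mantissa bits of a float32 -- i.e., the upper 16 bits of its encoding.
    # (Simple truncation here; real conversion usually rounds.)
    return struct.unpack('<I', struct.pack('<f', x))[0] >> 16

def from_bfloat16_bits(b):
    # Widen back to float32 by zero-filling the low 16 mantissa bits.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

def bf16_dot(xs, ys):
    # bfloat16 inputs, products accumulated in float32 -- the same
    # structure as the hardware dot-product instructions.
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += (from_bfloat16_bits(to_bfloat16_bits(x)) *
                from_bfloat16_bits(to_bfloat16_bits(y)))
    return acc
```

Values like 1.5 or 2.0 survive the round trip exactly, while the low mantissa bits of less regular values are discarded – the precision trade that ML workloads tolerate well.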
There is also support for relaxed memory-consistency models – new instructions supporting the Release Consistency processor consistent (RCpc) model. RCpc support allows re-ordering of a Store-Release followed by a Load-Acquire to a non-conflicting address. Load and store atomicity is enhanced so that cacheable loads and stores within a 16B-aligned boundary are performed atomically. Finally, Neoverse V1 extends the non-volatile memory management capabilities of Neoverse N1 by enabling a deeper level of persistence, whereby multiple in-flight transactions to non-volatile memory are supported, with a feedback mechanism to the system when they are finally committed to memory.
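The 16B atomicity condition is easy to state precisely: an access is single-copy atomic when it does not cross a 16-byte-aligned boundary. A one-line check, written here as an illustrative Python predicate:

```python
# Sketch of the atomicity condition described above: a cacheable
# load/store is performed atomically when its first and last bytes
# fall inside the same 16-byte-aligned region.
def atomic_within_16b(addr, size_bytes):
    return (addr // 16) == ((addr + size_bytes - 1) // 16)
```

So a 16-byte access at address 0 qualifies, while the same access at address 8 straddles two 16B regions and does not.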
The Scalable Vector Extension (SVE) is the most significant enhancement in Neoverse V1 for HPC applications. It doubles the computation bandwidth of N1 to 2x256-bit while continuing to support legacy 4x128-bit NEON operation. SVE is the next-generation SIMD architecture and provides a vector-length-agnostic programming model that relies on predicated execution. The vector pipes themselves can be packed with a variety of different instructions operating on different data types, depending on the application.
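The vector-length-agnostic idea is worth making concrete. In SVE, the loop body is written once; a `whilelt`-style predicate masks off inactive lanes in the final iteration, so the same binary runs correctly on any hardware vector length and there is no scalar tail loop. The Python sketch below models that structure (real SVE code would use intrinsics or assembly; the function name here is invented for the sketch):

```python
# Vector-length-agnostic loop in the SVE style, modeled in Python:
# the loop body is written once, a whilelt-style predicate masks the
# tail lanes, and the result is identical for any vector length vl.
def vla_axpy(a, x, y, vl):
    out = list(y)
    n = len(x)
    for base in range(0, n, vl):
        # Predicate: lane i is active while (base + i) < n,
        # mirroring SVE's WHILELT instruction.
        pred = [base + i < n for i in range(vl)]
        for i in range(vl):
            if pred[i]:
                out[base + i] = a * x[base + i] + y[base + i]
    return out
```

Running the same function with `vl=4` or `vl=8` produces identical results – the property that lets one SVE binary scale across implementations with different vector widths.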
The Neoverse V1 CPU micro-architecture achieves its scalability and performance goals through several key micro-architectural improvements in each of the front-end, mid-core and back-end of the CPU pipeline, as described below.
Front-end – Branch prediction and stalls
The Neoverse V1 CPU front-end includes a branch predictor that is decoupled from the instruction-fetch pipeline and capable of prefetching instructions into the instruction cache (icache). The branch-prediction unit has been widened to 2x32 blocks per cycle, while the capacity of the branch target buffers (BTBs) has been increased significantly to capture more branches in larger instruction footprints and to reduce taken-branch latency for smaller, tighter kernel code.
The accuracy of conditional-branch prediction has been improved. Another significant change is the doubling of the number of concurrent code regions tracked in the front-end, which results in a large speedup for Java-type workloads with sparse code regions. The net result of these improvements is up to a 90% reduction in branch mispredicts (for BTB misses) and up to a 50% reduction in front-end stalls on common server/HPC workloads, compared to the Neoverse N1 CPU.
The Neoverse V1 CPU introduces a new L0 decoded micro-op (Mop) cache to optimize the performance of smaller kernels. Fetch-and-dispatch bandwidth from the Mop cache is 8 instructions/cycle, compared to 5 instructions/cycle out of the icache. Moreover, the fetch-and-dispatch pipeline through the Mop cache is one stage shorter than that of the icache, improving latency. The out-of-order (OoO) window size in Neoverse V1 has increased by more than 2x to exploit parallelism in both compute and memory accesses. An ALU has been added, for a total of four, along with a second branch-execution unit. Together, these features yield a 25% increase in integer performance. Combined with the doubling of raw vector and floating-point bandwidth (2x256-bit SVE or 4x128-bit NEON/FP), support for the new bfloat16 data type, and INT8 2x2 matrix-multiply instructions, we see a massive 4x improvement in ML performance, for a much-improved on-CPU ML execution experience.
For store data, bandwidth has been doubled to 32 bytes/cycle. Load and store buffer sizes have been increased correspondingly to match the growth of the out-of-order window. The number of outstanding transactions to external memory has been doubled to 96, allowing better latency tolerance across diverse partner implementations of the memory subsystem. MMU capacity has been increased by 67% to hold a larger number of translations for memory accesses. L2 cache latency has been reduced to 10 cycles (load-to-use), and an improved L2 replacement policy delivers a significant reduction in average load latency across several workloads.
The data-prefetch algorithms have been significantly improved and allow dynamic behavior modification to adapt to the aggressiveness of different system implementations while delivering excellent overall throughput. Finally, with more efficient usage of the System Level Cache (SLC) in CMN-700 as a victim cache, the Neoverse V1 CPU delivers up to a 15% reduction in L2 and SLC fills and up to a 50% reduction in L2-to-SLC traffic on critical SPEC and memory benchmarks.
Overall, the Neoverse V1 is a massively superscalar CPU with 15-wide issue across the branch, ALU, load/store address-generation, store-data, and vector/FP pipelines. Despite this, we have still been able to hold the line at an 11-stage pipeline from the front end to the branch-execute stage.
The micro-architecture enhancements described above result in a 50% IPC (instructions per cycle) increase over Neoverse N1, with a 70% increase in core area (iso-process) to accommodate the larger vector pipeline and performance-enhancement features – a reasonably good tradeoff, considering that Neoverse V1 can also be deployed in 5nm process technology, which provides headroom for the additional gates.
The Neoverse V1 CPU adds two new power-management capabilities – Maximum Power Mitigation Mechanism (MPMM) and Dispatch Throttling – enabling system software to make power-performance trade-offs that keep execution within a pre-defined power and thermal envelope set by the TDP limit. These are relatively low-latency, fully programmable tools that go above and beyond the standard Dynamic Voltage and Frequency Scaling (DVFS) already available in previous generations.
MPMM's power-cap mechanism allows partners to keep pushing operating frequency as a performance lever, knowing that some CPUs in their SoC can safely hit turbo frequencies without exceeding power or thermal budgets. It means the SoC designer does not have to provision the system for the worst-case power-consumption scenario, which would result in an extremely conservative and uncompetitive design. Neoverse V1 implements three MPMM gears, or capping points, ranging from throttling power viruses (workloads designed to consume maximum power in the SoC), to throttling vector workloads that consume significant bandwidth (and power), to throttling back vector/FP workloads completely.
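One way to picture the three gears is as progressively stronger caps engaged as sustained power approaches the TDP limit. The policy below is purely hypothetical – the thresholds and the idea of selecting gears from a power ratio are invented for the sketch; only the existence of three gears comes from the platform description:

```python
# Hypothetical gear-selection policy for an MPMM-style mechanism.
# Thresholds are invented for illustration; the real mechanism is a
# hardware feature configured by the SoC designer.
def select_mpmm_gear(power_watts, tdp_watts):
    headroom = power_watts / tdp_watts
    if headroom < 0.90:
        return None   # comfortable margin: no capping engaged
    if headroom < 0.97:
        return 0      # gear 0: throttle power-virus patterns only
    if headroom < 1.00:
        return 1      # gear 1: throttle high-bandwidth vector workloads
    return 2          # gear 2: throttle back vector/FP issue completely
```

The point of the tiering is graceful degradation: ordinary code is untouched, and only the most power-hungry instruction mixes are reined in as the budget tightens.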
The other mechanism introduced in Neoverse V1 is Dispatch Throttling (DT), which superposes additional throttling during periods of high IPC activity. Our analyses of this feature across several workloads indicate a very small loss in performance for a commensurately larger saving in power.
Keeping in mind that the Neoverse V1 platform is licensed by our partners, who in turn configure, tune, and further optimize performance while adding third-party IP such as memory and PCIe subsystems, we showcase here some estimates of benchmark and workload performance collected from Arm's own simulation and emulation efforts. All comparisons are relative to Neoverse N1, using identical settings for CPU frequency, memory bandwidth, and so on.
In the left graph above, we see Neoverse V1 performance relative to N1 for a mix of SPEC CPU® 2017 components and selected server and networking workloads that can be run on an emulator in a reasonable amount of time. The performance uplifts are sorted from smallest to largest, with an aggregate score showing ~48% IPC improvement over Neoverse N1 (very close to the 50% estimated SPEC CPU 2006 score). The steep rise in Neoverse V1's relative performance occurs for workloads that benefit from the doubled vector datapath and the deployment of SVE (e.g., crypto, packet processing, and others that use SIMD). In the right graph, we rank SVE workload performance on Neoverse V1 against NEON performance on N1, showing an impressive 60% to 300% improvement due to the SVE vector datapath.
Next, we zoom further into SVE versus NEON performance for a common HPC workload – HACCmk. The data in the left graph clearly shows that SVE itself delivers an impressive ~3x performance boost over NEON, and that Neoverse V1 achieves a ~7x improvement over N1 – nearly twice that of Neoverse N2, as expected given V1's double vector width. The right graph shows additional HPC benchmark scores for Neoverse V1 over N1, with performance improvements ranging from 40% to 80%.
The performance and scalability of Neoverse V1 is complemented by CMN-700, a scalable mesh interconnect that links CPUs, coherent accelerators, DDR5/HBM2e/HBM3 memory controllers, IO controllers, CCIX 2.0 bridges to external chiplets or sockets, and CXL 2.0 bridges to external memory or accelerators. CMN-700 can support 128+ CPUs with 40 dual-channel memory controllers, 128+ PCIe5/CCIX/CXL lanes, and up to 40 general-purpose IO nodes for peripherals connected over Arm's non-coherent interconnect, NI-700. CMN-700 also supports up to 256MB of distributed System Level Cache (SLC).
The Neoverse V1 Reference Design (RD) integrates 32-128 Neoverse V1 CPUs, CMN-700, and supporting system IP (GIC-700, MMU-600/700), along with system and management control processors, to create a fully bootable compute subsystem that serves as an interoperability-testing (with third-party IP) and training platform for Neoverse V1-based systems. A comprehensive RD integration, configuration, and performance-analysis report is shared by Arm with its partners, enabling them to understand the engineering behind this fully optimized subsystem.
The Neoverse V1 CPU is all about pushing peak per-core performance in a constrained execution environment while unlocking new capabilities for efficiently managing power. The V1 micro-architecture ensures that there is plenty of front-end, mid-core, and back-end memory bandwidth, along with extra-wide vector units, to process instructions at the maximum possible rate. For more information about Neoverse V1 and the V1 RD, refer to the Neoverse Reference Design page and the Neoverse V1 developer page. If you are interested in learning more about Neoverse N2, please read our Neoverse N2 blog and the Neoverse N2 developer page. As always, your comments and feedback on this blog are most welcome.
Read Chris Bergey's Neoverse V1 and N2 launch blog