A transformation is underway in the public cloud and network architecture towards a more distributed internet infrastructure. We began speaking publicly about our views on this ongoing transformation at Arm TechCon 2018 with the unveiling of the Arm Neoverse brand and the vision and strategy of Arm's infrastructure business: enabling the cloud-to-edge infrastructure foundation for a world of one trillion intelligent devices.
We believe in the growth of public cloud as the dominant server compute model, coupled with the extension of the modern cloud architecture to include compute capability deployed at the network provider edge. Deployment of compute closer to the edge is needed to achieve the lower latencies specified by 5G. In addition, compute at the edge is absolutely required to cope with the volume of data generated by the IoT: intelligent, IP-connected devices that are increasingly part of a mesh of devices supporting users, rather than devices that users carry and personally maintain. The launch of the Neoverse N1 platform will prove to be a pivotal moment in this ongoing transformation.
The Neoverse N1 platform is the first compute platform from Arm capable of servicing a wider range of data center workloads at performance levels competitive with the legacy architectures used in the public cloud. In addition to data center-class performance, Arm's platform enablement approach provides a number of other advantages – diversity of business models, power efficiency enabling more performance at the edge and lower TCO in the cloud by maximizing compute per socket, higher-performance system architectures through on-chip and off-chip heterogeneity, and design freedom.
The Neoverse N1 is the first Arm platform specifically designed from the ground up for infrastructure, on a roadmap committed to delivering more than 30% higher performance per generation. The N1 substantially overdelivered on that commitment, with performance gains of 60% and higher over the Cortex-A72-based Cosmos platform at the same frequency.
The performance of the Neoverse N1 on real server workloads is substantially higher still, up to 2.5x on cloud-native workloads.
The Neoverse N1 is primarily optimized for high performance, but also designed for efficiency, achieving a 30% power-efficiency increase over the Cortex-A72 in the same process. This power efficiency enables full-frequency sustained performance in infrastructure use cases, compute density in the cloud, and performance at the edge. The Neoverse N1 is a 64-bit CPU with support for AArch32 at the user level, which benefits applications such as running Android workloads in the cloud.
Every facet of the N1 design has been optimized for no-compromise sustained performance. The pipeline is an 11-stage accordion pipeline, which shortens on branch misses and lengthens in normal operation. It uses a 4-wide front end with 8-wide dispatch/issue, three full 64-bit integer ALUs, and a dedicated branch unit. The Neon Advanced SIMD pipeline is substantially wider, with dual 128-bit data paths. The ability to feed the SIMD engine is also widened substantially, to dual 128-bit load/store pipelines with decoupled address/data, enabling sustained 2x128-bit performance.
The SIMD unit is designed to enable vector processing without the frequency throttling seen on legacy architectures in the data center. The combination of 2x128-bit SIMD with unthrottled performance and more than 2x the core count of the competition enables scale-out vector performance. The design achieves 95% compute efficiency on compute-dense workloads using Arm-optimized math libraries. It shows up to 6x gains on machine-learning workloads utilizing an 8-bit dot-product instruction. Hyperscalers are increasingly using "free" CPU ML capability during lower-loading cycles in non-peak times, and generally for inference workloads in the cloud.
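To make the 8-bit dot-product gains concrete, here is a minimal semantic sketch (in Python, purely illustrative and not the hardware interface) of what an Armv8.2 dot-product instruction such as SDOT computes: each 32-bit lane of the accumulator gains the dot product of four signed 8-bit pairs, so one 128-bit instruction performs 16 int8 multiply-accumulates, which is why quantized inference benefits so much.

```python
# Semantic sketch of a 128-bit SDOT: `acc` and the result are four int32
# lanes; `a` and `b` are sixteen signed 8-bit values. Each lane accumulates
# the dot product of its four int8 pairs (16 MACs per instruction).
def sdot(acc, a, b):
    out = []
    for lane in range(4):
        s = acc[lane]
        for k in range(4):
            s += a[4 * lane + k] * b[4 * lane + k]
        out.append(s)
    return out

a = [1, -2, 3, 4] * 4          # 16 int8 inputs
b = [5, 6, -7, 8] * 4
print(sdot([0, 0, 0, 0], a, b))  # [4, 4, 4, 4]
```

In a quantized inference kernel, this replaces widening multiplies plus pairwise adds with a single accumulate step per vector.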
The Neoverse N1 caching structures are sized for large, branch-heavy infrastructure workloads, starting with 64KB L1 instruction and data caches, through low-latency private 512KB/1MB L2 caches, and backed by up to a 64-bank, 128MB system-level cache.
Continuing through the front end, the branch predictor includes a 6K-entry Branch Target Buffer, 5K+8K direction predictors, and a high-capacity hybrid indirect-branch predictor. Page translation is sped up by 48-entry instruction and data L1 TLBs, backed by a 1280-entry L2 TLB. The Neoverse N1 is the first Arm CPU with a coherent I-cache, critical for performance on large-scale many-core systems. The N1 has been optimized for accelerating Type 1 and Type 2 hypervisors and minimizes the setup/teardown overhead of VM and container switching and migration.
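The TLB sizing can be put in perspective with a quick back-of-envelope reach calculation. The sketch below assumes 4KB base pages (larger page sizes scale reach proportionally); it simply multiplies entries by page size to show how much memory each TLB level covers before a page-table walk is needed.

```python
# TLB "reach": memory covered by a TLB before a walk is required.
# Assumes 4 KiB base pages; larger granules scale this linearly.
def tlb_reach_kib(entries, page_kib=4):
    return entries * page_kib

print(tlb_reach_kib(48))    # 192 KiB per 48-entry L1 TLB
print(tlb_reach_kib(1280))  # 5120 KiB (5 MiB) via the 1280-entry L2 TLB
```

A 5 MiB L2 TLB reach comfortably covers the hot working set of many server workloads, which is why the large L2 TLB matters for branch-heavy, pointer-chasing infrastructure code.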
The overall memory hierarchy is designed from top to bottom for low latency, high bandwidth, and scalability. This is evidenced by features like a predict-directed-fetch front end, 46 outstanding system transactions, 32 outstanding non-prefetch transactions, and 68 in-flight loads and 72 in-flight stores. The system cache delivers 22ns load-to-use latency in a typical system and 1TB/s of bandwidth, and supports DRAM-target prefetch to manage bandwidth utilization.
Networking performance is enhanced with PCIe Gen4 full bandwidth support and data preloading (“Cache stashing”) in core caches.
The Neoverse roadmap of products is being developed with software-driven hardware design, where performance analysis and feedback from cloud-native workloads is fed to the design teams to optimize performance. Hardware-software co-design is fundamental to understanding and extracting more performance from the full-system.
The Neoverse N1 has been optimized for cloud and network workloads like these, and it shows up in the performance numbers relative to the Cortex-A72, which has been successfully deployed in the cloud and in numerous network instantiations.
The Neoverse N1 has been designed with specific platform features for the cloud-to-edge infrastructure. It uses a flexible, high-performance coherent mesh interconnect with advanced routing, memory, and caching features, supporting seamless on-chip or off-chip fully coherent acceleration over CCIX. The platform is supported by an ecosystem of deployment-ready Armv8.2 software, with the same software across all configurations.
The design features a comprehensive RAS architecture, including write-once error-handler software and seamless kernel support with ServerReady compliance. ServerReady enables cloud operators to certify that their Arm servers will boot and run in a standard way in their environment, just like any other server – simplifying life for operators who want to lower TCO by moving to Arm. The RAS architecture supports consistent, architected error logging, and increases resilience and availability with full-system data poisoning at double-word (64-bit) granularity, carried throughout the system and cache hierarchy; no error is exposed until and unless poisoned data is actually consumed. The design also employs a consistent error-injection microarchitecture, enabling qualification of error handlers, with SECDED ECC protection on caches.
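The poison-on-consume semantics can be sketched in a few lines. The toy model below is an illustration of the general technique, not the N1's actual microarchitecture: a cache line tracks a poison bit per 64-bit double word, moving or overwriting poisoned data is harmless, and an error is raised only if poisoned data is actually read (consumed).

```python
# Toy model of double-word (64-bit) granularity data poisoning:
# errors surface only at consumption, never on transport or overwrite.
class CacheLine:
    def __init__(self, dwords=8):       # a 64-byte line = 8 x 64-bit words
        self.data = [0] * dwords
        self.poison = [False] * dwords

    def mark_poison(self, i):           # uncorrectable error detected here
        self.poison[i] = True

    def write(self, i, value):          # overwriting cures the poison
        self.data[i] = value
        self.poison[i] = False

    def read(self, i):                  # consumption is what raises the error
        if self.poison[i]:
            raise RuntimeError(f"consumed poisoned doubleword {i}")
        return self.data[i]

line = CacheLine()
line.mark_poison(3)
line.write(3, 42)       # cured without ever being consumed: no error raised
print(line.read(3))     # 42
```

Deferring the error to the point of consumption is what lets the system keep running when corrupted data is evicted, migrated, or simply never used.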
The N1 platform supports Armv8.2 Virtual Host Extensions (VHE), with scalability up to 64K VMs, to avoid exception level transitions for Type 2 hypervisors (e.g. KVM). The virtualization implementation in the N1 features latency-optimized hyper-traps, to minimize world-switch overhead. The MMU is optimized for virtualized nested paging, and the design is optimized for heavy OS/Hypervisor activity with low-latency switching, high-bandwidth context save/restore, and minimized context-synchronization serialization. This maps well to the VM-based and containerized workloads in cloud datacenters and modern virtualized network equipment.
The design features intelligent performance management, with a per-core activity monitoring unit (AMU) enabling fully informed DVFS/DFS thread management. The design also includes features for active mitigation of power-virus scenarios, where hardware throttling reduces virus conditions to ~TDP levels, with negligible performance impact on non-virus workloads. This enables designers to build the SoC power-delivery network for TDP conditions, not virus conditions.
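As a rough illustration of how per-core activity counters can inform DVFS decisions, here is a hypothetical governor sketch. The function name, counter names, thresholds, and frequency table are all assumptions for illustration, not the N1's AMU interface or any OS governor: utilization is estimated from active versus reference cycle counts, and the lowest operating point that covers the load with some headroom is chosen.

```python
# Hypothetical DVFS governor sketch driven by activity counters, as a
# per-core AMU could expose them. All names and thresholds are illustrative.
def choose_frequency(active_cycles, ref_cycles, freq_table_mhz):
    """Pick the lowest operating point covering observed utilization
    with ~20% headroom; saturate at the highest point."""
    util = active_cycles / ref_cycles if ref_cycles else 0.0
    target = util * max(freq_table_mhz) * 1.2
    for f in sorted(freq_table_mhz):
        if f >= target:
            return f
    return max(freq_table_mhz)

freqs = [1000, 1500, 2000, 2600]          # MHz, illustrative operating points
print(choose_frequency(400, 1000, freqs))  # 40% utilization -> 1500 MHz
print(choose_frequency(950, 1000, freqs))  # near saturation -> 2600 MHz
```

The point of hardware activity counters is that the governor reacts to what the core actually did, rather than guessing from OS-visible load averages.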
The Neoverse N1 has been designed to unleash performance in large-scale systems (64+ cores), while also scaling down to as few as 8 cores in edge designs.
The power efficiency results in TDP headroom that enables high-core-count, unthrottled performance for designs that scale from 64 to 128 N1 cores within a single coherent system. The system itself can scale beyond that; however, real systems will be architected around memory bandwidth and will likely come in at 64 to 96 cores with 8-channel DDR4 and 96 to 128 cores with DDR5. Designers can employ chiplets over CCIX links for cost-effective manufacturability. The Neoverse N1 is scalable, enabling the industry's highest core count at 128 cores for hyperscale down to 8 cores for the edge, with power ranging from <200W to <20W. Not only can you scale to high core counts, but the compact design, coherent mesh interconnect, and industry-leading power efficiency of the Neoverse N1 give our partners the flexibility to build diverse compute solutions by using the available silicon area and power headroom to add accelerators or other features with their own custom silicon.
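The quoted power ranges imply very tight per-core budgets, which is where the efficiency claims become tangible. The sketch below is back-of-envelope arithmetic only; the 30% uncore share (mesh, system cache, I/O) is an illustrative assumption, not an Arm figure.

```python
# Back-of-envelope per-core power budget from the ranges in the text
# (<200 W at 128 cores, <20 W at 8 cores). The uncore share is assumed.
def per_core_watts(soc_tdp_w, cores, uncore_fraction=0.3):
    """Split SoC TDP between an assumed uncore share and the cores."""
    return soc_tdp_w * (1 - uncore_fraction) / cores

print(round(per_core_watts(200, 128), 2))  # ~1.09 W/core at hyperscale scale
print(round(per_core_watts(20, 8), 2))     # ~1.75 W/core at the edge
```

Budgets on the order of one to two watts per core are what allow sustained full-frequency operation without the thermal throttling the text contrasts against.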
To sum up, the Neoverse N1 platform will accelerate the ongoing transformation of the public cloud and network infrastructure. It will be deployed in the public cloud as an alternative architecture for main compute nodes, enabling lower TCO for data center operators and edge installations of cloud compute while delivering greater design diversity. It will also find a home in more advanced network, storage, and security processing, often referred to as SmartNICs. Neoverse N1 will be deployed in edge compute installations deployed by network operators, with design points starting at 8-core edge nodes. Neoverse N1 can also be deployed in a control-plane-plus-data-plane combination with Neoverse E1. One thing is clear: the Neoverse N1 will unlock innovation in cloud data centers, edge compute, and the entire cloud-to-edge infrastructure. To learn more, visit the Neoverse N1 page on Arm Developer, and read about our recent product announcement.
This benchmark presentation made by Arm Ltd and its subsidiaries (Arm) contains forward-looking statements and information. The information contained herein is therefore provided by Arm on an "as-is" basis without warranty or liability of any kind. While Arm has made every attempt to ensure that the information contained in the benchmark presentation is accurate and reliable at the time of its publication, it cannot accept responsibility for any errors, omissions or inaccuracies or for the results obtained from the use of such information; the information should be used for guidance purposes only and is not intended to replace discussions with a duly appointed representative of Arm. Any results or comparisons shown are for general information purposes only, and any particular data or analysis should not be interpreted as demonstrating a cause and effect relationship. Comparable performance on any performance indicator does not guarantee comparable performance on any other performance indicator.
Any forward-looking statements involve known and unknown risks, uncertainties and other factors which may cause Arm’s stated results and performance to be materially different from any future results or performance expressed or implied by the forward-looking statements.
Arm does not undertake any obligation to revise or update any forward-looking statements to reflect any event or circumstance that may arise after the date of this benchmark presentation and Arm reserves the right to revise our product offerings at any time for any reason without notice.
Any third-party statements included in the presentation are not made by Arm, but instead by such third parties themselves and Arm does not have any responsibility in connection therewith.