Co-Authors: Ravi Malhotra and Yichen Jia
This blog describes work done by Arm to optimize NVMe over Fabrics (NVMe-oF) performance on Ampere® Altra® server platforms for typical storage applications in the data center and showcases the value-add that Arm brings to this space.
Storage disaggregation has recently gained popularity [1][2] as a way to reduce the significant computing burden that high-performance Non-Volatile Memory Express (NVMe) flash SSDs and I/O-intensive applications place on host CPUs. This burden can be reduced by separating the compute nodes from the storage nodes. NVMe over Fabrics (NVMe-oF) is particularly attractive for disaggregated storage: its RDMA transport offers ultra-low remote-access latency over fast interconnects, while its TCP transport is easy to deploy. It also enables independent and flexible infrastructure tuning, reduces resource underutilization, and lowers monetary cost. Typically, the system is split between multiple initiators and a target, as shown in Figure 1, connected by a storage transport network for which several options exist. For this work, we focused on comparing RDMA over Converged Ethernet (RoCE) and TCP/IP to highlight the key differences between the two protocols in terms of processing requirements.
Figure 1. NVMe over Fabrics Architecture
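To make the transport comparison concrete, here is a minimal sketch of how an SPDK-based initiator addresses the same remote subsystem over RDMA and over TCP. The IP address, port, and subsystem NQN are placeholders rather than our lab configuration, and error handling is kept to a minimum. The point is that only the transport type changes at the API level, while the CPU cost behind each transport differs significantly, as the results below show.

```c
/*
 * Illustrative sketch (not our exact lab setup): addressing the same NVMe-oF
 * subsystem over RDMA and over TCP with SPDK's user-space NVMe driver.
 * The IP address, port, and subsystem NQN are placeholders.
 * Build and link against SPDK (v21.01-era API) and its env library.
 */
#include <stdio.h>
#include <string.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

int main(void)
{
    struct spdk_env_opts env_opts;

    spdk_env_opts_init(&env_opts);
    env_opts.name = "nvmf_transport_demo";
    if (spdk_env_init(&env_opts) < 0) {
        return 1;
    }

    /* Same subsystem, two transports: only the trtype field differs. */
    const char *trid_strs[] = {
        "trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420 subnqn:nqn.2021-06.io.example:nvme1",
        "trtype:TCP adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420 subnqn:nqn.2021-06.io.example:nvme1",
    };

    for (size_t i = 0; i < 2; i++) {
        struct spdk_nvme_transport_id trid;
        struct spdk_nvme_ctrlr *ctrlr;

        memset(&trid, 0, sizeof(trid));
        if (spdk_nvme_transport_id_parse(&trid, trid_strs[i]) != 0) {
            fprintf(stderr, "bad transport id: %s\n", trid_strs[i]);
            continue;
        }

        /* Attach the user-space driver to the remote controller (fabric login). */
        ctrlr = spdk_nvme_connect(&trid, NULL, 0);
        if (ctrlr == NULL) {
            fprintf(stderr, "connect failed: %s\n", trid_strs[i]);
            continue;
        }
        printf("connected (%s), %u namespace(s)\n",
               trid_strs[i], spdk_nvme_ctrlr_get_num_ns(ctrlr));
        spdk_nvme_detach(ctrlr);
    }
    return 0;
}
```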
With the release of the Arm Neoverse architecture, Neoverse N1-based cloud servers have been widely deployed by major public cloud providers, such as Amazon Web Services [3] and Oracle OCI A1 [4]. They are designed for enhanced compute capability and low power consumption, and they offer multiple connectivity options for both the network (multiple 100GbE/200GbE NICs) and storage devices. These features make them a particularly suitable platform for traditional server applications. In this blog, we show our benchmarking results for NVMe-oF on the N1-based Ampere® Altra® to illustrate the efficiency and flexibility of Arm solutions in this space.
While native support for the NVMe over Fabrics TCP initiator and target has been included in the Linux kernel since version 5.0, we selected SPDK (Storage Performance Development Kit, https://spdk.io), a user-space storage stack that avoids the typical data-copy and context-switch overheads of the kernel path. SPDK provides a lockless, thread-per-core design. Arm has made significant contributions to SPDK to optimize performance on 64-bit Arm platforms.
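SPDK's lockless, thread-per-core model is easiest to see in code. The sketch below is a simplified illustration, not the SPDK application framework itself: each worker thread owns a private I/O queue pair, so the submission and completion paths never take a shared lock. The g_ctrlr and g_stop globals and the start_workers() helper are hypothetical names for this sketch; the controller handle is assumed to come from a connect step like the one above.

```c
/*
 * Simplified illustration of SPDK's thread-per-core model (not the SPDK app
 * framework itself): each worker owns a private I/O queue pair, so nothing on
 * the I/O path is shared or locked between cores.
 * g_ctrlr is assumed to come from an spdk_nvme_connect() call as shown earlier;
 * g_stop and start_workers() are hypothetical helpers for this sketch.
 */
#include <pthread.h>
#include <stdbool.h>

#include "spdk/nvme.h"

extern struct spdk_nvme_ctrlr *g_ctrlr;   /* attached remote controller */
extern volatile bool g_stop;              /* set elsewhere to stop the workers */

static void *worker(void *arg)
{
    (void)arg;

    /* Private queue pair: submissions and completions on it never contend
     * with other cores, so no locks are needed. */
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
    if (qpair == NULL) {
        return NULL;
    }

    while (!g_stop) {
        /* Busy-poll for completions instead of taking interrupts and context
         * switches; a real worker also submits I/O here (see the read loop below). */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_nvme_ctrlr_free_io_qpair(qpair);
    return NULL;
}

void start_workers(unsigned int num_workers)
{
    pthread_t tids[64];

    if (num_workers > 64) {
        num_workers = 64;
    }
    for (unsigned int i = 0; i < num_workers; i++) {
        /* Pinning each thread to its own core (pthread_setaffinity_np) omitted. */
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (unsigned int i = 0; i < num_workers; i++) {
        pthread_join(tids[i], NULL);
    }
}
```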
To compare the NVMe-oF performance of Arm servers and high-end x86 machines, we chose an Arm Neoverse N1-based dual-socket Ampere Altra server and a dual-socket Intel Xeon Platinum 8268 (Cascade Lake)-based Dell EMC PowerEdge R740xd. Both are equipped with Mellanox ConnectX-6 dual-port 200GbE network interface cards (configured as 100GbE) and two PCIe backplanes that can connect to 32 NVMe SSD drives. Both systems are tested in two different scenarios, each targeting a different leg of the storage stack processing.
User-level NVMe drivers from SPDK are used on both systems, along with the FIO tool for benchmarking performance. Both systems run the latest upstream version of SPDK (v21.01), and hyper-threading is enabled on the Xeon Platinum 8268-based system.
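For reference, the per-core inner loop that such a benchmark exercises looks roughly like the sketch below. This is a hedged approximation of what FIO with the SPDK plugin drives, not its actual code: run_read_loop(), the queue depth, the 4 KiB I/O size, and the single reused data buffer are illustrative choices, and the namespace and queue pair handles are assumed to come from the setup shown earlier.

```c
/*
 * Rough sketch of the polled, per-core read loop a benchmark such as FIO with
 * the SPDK plugin exercises: keep a fixed queue depth of 4 KiB reads in flight
 * and reap completions by polling, with no system calls or data copies on the
 * I/O path. The queue depth, I/O size, and single reused buffer are
 * illustrative; a real benchmark uses one buffer per outstanding I/O.
 */
#include <stdlib.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

#define QUEUE_DEPTH 32
#define IO_SIZE     4096

static uint64_t g_completed;   /* successfully completed reads */
static uint32_t g_inflight;    /* reads currently outstanding */

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    (void)arg;
    if (!spdk_nvme_cpl_is_error(cpl)) {
        g_completed++;
    }
    g_inflight--;
}

void run_read_loop(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                   uint64_t num_ios)
{
    uint32_t sector_size = spdk_nvme_ns_get_sector_size(ns);
    uint32_t lba_count = IO_SIZE / sector_size;
    uint64_t max_lba = spdk_nvme_ns_get_num_sectors(ns) - lba_count;
    /* DMA-able buffer from SPDK's hugepage-backed memory pool. */
    void *buf = spdk_zmalloc(IO_SIZE, 0x1000, NULL,
                             SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
    uint64_t submitted = 0;

    while (submitted < num_ios || g_inflight > 0) {
        /* Keep QUEUE_DEPTH reads outstanding. */
        while (g_inflight < QUEUE_DEPTH && submitted < num_ios) {
            uint64_t lba = ((uint64_t)rand() % max_lba / lba_count) * lba_count;
            if (spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count,
                                      read_done, NULL, 0) != 0) {
                break;  /* submission queue full: drain completions first */
            }
            g_inflight++;
            submitted++;
        }
        /* Reap completions by polling; 0 means "process everything available". */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }
    spdk_free(buf);
}
```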
Using RDMA as the transport, we find that it takes between 4 and 8 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic.
Figure 2. Bandwidth for NVMe/RDMA with Initiator Scaling
Switching to TCP as the transport, the higher processing cost of the TCP stack kicks in: it takes between 16 and 24 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic. Both systems showed good scaling across cores.
Figure 3. Bandwidth for NVMe/TCP with Initiator Scaling
However, when system utilization is taken into account, we find that only 15% of the overall CPU bandwidth on Ampere Altra is used for NVMe over TCP processing, leaving over 85% available for other compute tasks. In the case of Xeon 8268, only 50% of overall CPU bandwidth is available as headroom.
Figure 4. System CPU Utilization for NVMe over TCP Initiator
When used as an NVMe-oF target, the CPU cores are responsible for processing both networking and storage traffic. Using RDMA as the transport, it takes 6-8 cores on both systems to saturate the network link across a variety of read and write request mixes.
Figure 5. Bandwidth for NVMe/RDMA with Altra Storage Server Core Scaling
With TCP as the transport, the CPU cost is much higher: up to 32 cores are needed on both systems to saturate the network link.
Figure 6. Bandwidth for NVMe/TCP with Altra Storage Server Core Scaling
A storage server typically runs other tasks as well, such as compression, encryption, RAID/erasure coding, and higher-layer block/object/file-system stacks such as Ceph and OpenEBS, which take up a significant amount of processing bandwidth.
An Ampere Altra-based system provides up to 80% headroom for these other tasks, compared to 50% on the Intel Xeon 8268 system.
Figure 7. System CPU Utilization for NVMe over TCP Target
With the wide adoption of NVMe-oF as the storage protocol for accessing remote storage devices over the network, Arm-based solutions are efficient and appealing for I/O-intensive applications. Arm Neoverse N1-based Ampere Altra servers use only 15-20% of their overall CPU bandwidth for NVMe over TCP stack processing and leave the rest as headroom for other storage and compute tasks, whereas the competition has only about half of its CPU bandwidth available for other tasks. Ampere Altra is available today in SKUs of up to 80 cores, and Altra Max is now sampling with SKUs of up to 128 cores. The data provided above can help storage system designers select the right SKU for their performance and usage targets.
[CTAToken URL = "https://www.arm.com/solutions/infrastructure" target="_blank" text="Explore Arm Neoverse" class ="green"]
References:
[1] Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan, NVMe-over-Fabrics performance characterization and the path to low-overhead flash disaggregation. In Proceedings of the 10th ACM International Systems and Storage Conference, SYSTOR ’17, pages 16:1–16:9, New York, NY, USA, 2017. ACM.
[2] Yichen Jia, Eric Anger, and Feng Chen, When NVMe over Fabrics Meets Arm: Performance and Implications. In Proceedings of the 35th Symposium on Mass Storage Systems and Technologies (MSST), pages 134-140, 2019, doi: 10.1109/MSST.2019.000-9.
[3] Amazon AWS Graviton Processors. https://aws.amazon.com/ec2/graviton/
[4] Oracle OCI A1. https://www.oracle.com/cloud/compute/arm/