Co-Authors: Ravi Malhotra and Yichen Jia
This blog describes work done by Arm to optimize NVMe over Fabrics (NVMe-oF) performance on Ampere® Altra® server platforms for typical storage applications in the data center and showcases the value-add that Arm brings to this space.
Storage disaggregation has recently gained popularity [1][2] as a way to reduce the significant computing burden that high-performance Non-Volatile Memory Express (NVMe) flash SSDs and I/O-intensive applications place on host CPUs. This burden can be reduced by separating the compute nodes from the storage nodes. NVMe over Fabrics (NVMe-oF) is particularly attractive for disaggregated storage: its RDMA transport offers ultra-low remote-access latency over fast interconnects, while its TCP transport is easy to deploy. It also enables independent and flexible infrastructure tuning, reduces resource underutilization, and lowers monetary cost. Typically, the system is split between multiple initiators and a target, as shown in Figure 1, connected by a storage transport network for which several options exist. For this work, we focused on comparing RDMA over Converged Ethernet (RoCE) and TCP/IP to highlight the key differences between the two protocols in terms of processing requirements.
Figure 1. NVMe over Fabrics Architecture
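To make the transport comparison concrete, here is a minimal sketch of how an SPDK-based initiator addresses the same remote subsystem over RDMA and over TCP. The IP address, port, and subsystem NQN are placeholders rather than our lab configuration, and error handling is kept to a minimum. The point is that only the transport type changes at the API level, while the CPU cost behind each transport differs significantly, as the results below show.

```c
/*
 * Illustrative sketch (not our exact lab setup): addressing the same NVMe-oF
 * subsystem over RDMA and over TCP with SPDK's user-space NVMe driver.
 * The IP address, port, and subsystem NQN are placeholders.
 * Build and link against SPDK (v21.01-era API) and its env library.
 */
#include <stdio.h>
#include <string.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

int main(void)
{
    struct spdk_env_opts env_opts;

    spdk_env_opts_init(&env_opts);
    env_opts.name = "nvmf_transport_demo";
    if (spdk_env_init(&env_opts) < 0) {
        return 1;
    }

    /* Same subsystem, two transports: only the trtype field differs. */
    const char *trid_strs[] = {
        "trtype:RDMA adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420 subnqn:nqn.2021-06.io.example:nvme1",
        "trtype:TCP adrfam:IPv4 traddr:192.168.1.10 trsvcid:4420 subnqn:nqn.2021-06.io.example:nvme1",
    };

    for (size_t i = 0; i < 2; i++) {
        struct spdk_nvme_transport_id trid;
        struct spdk_nvme_ctrlr *ctrlr;

        memset(&trid, 0, sizeof(trid));
        if (spdk_nvme_transport_id_parse(&trid, trid_strs[i]) != 0) {
            fprintf(stderr, "bad transport id: %s\n", trid_strs[i]);
            continue;
        }

        /* Attach the user-space driver to the remote controller (fabric login). */
        ctrlr = spdk_nvme_connect(&trid, NULL, 0);
        if (ctrlr == NULL) {
            fprintf(stderr, "connect failed: %s\n", trid_strs[i]);
            continue;
        }
        printf("connected (%s), %u namespace(s)\n",
               trid_strs[i], spdk_nvme_ctrlr_get_num_ns(ctrlr));
        spdk_nvme_detach(ctrlr);
    }
    return 0;
}
```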
With the release of the Arm Neoverse architecture, Neoverse N1-based cloud servers have been widely deployed by major public cloud providers, such as Amazon Web Services [3] and Oracle OCI A1 [4]. They are designed for enhanced compute capability and low power consumption, and they offer multiple connectivity options for both the network (multiple 100GbE/200GbE NICs) and storage devices. These features make them a particularly suitable platform for traditional server applications. In this blog, we show our benchmarking results for NVMe-oF on the N1-based Ampere® Altra® to illustrate the efficiency and flexibility of Arm solutions in this space.
While native support for the NVMe over Fabrics TCP initiator and target has been included in the Linux kernel since version 5.0, we selected SPDK (Storage Performance Development Kit, https://spdk.io), a user-space storage stack that avoids the typical data-copy and context-switch overheads of the kernel path. SPDK provides a lockless, thread-per-core design. Arm has made significant contributions to SPDK to optimize performance on 64-bit Arm platforms.
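SPDK's lockless, thread-per-core model is easiest to see in code. The sketch below is a simplified illustration, not the SPDK application framework itself: each worker thread owns a private I/O queue pair, so the submission and completion paths never take a shared lock. The g_ctrlr and g_stop globals and the start_workers() helper are hypothetical names for this sketch; the controller handle is assumed to come from a connect step like the one above.

```c
/*
 * Simplified illustration of SPDK's thread-per-core model (not the SPDK app
 * framework itself): each worker owns a private I/O queue pair, so nothing on
 * the I/O path is shared or locked between cores.
 * g_ctrlr is assumed to come from an spdk_nvme_connect() call as shown earlier;
 * g_stop and start_workers() are hypothetical helpers for this sketch.
 */
#include <pthread.h>
#include <stdbool.h>

#include "spdk/nvme.h"

extern struct spdk_nvme_ctrlr *g_ctrlr;   /* attached remote controller */
extern volatile bool g_stop;              /* set elsewhere to stop the workers */

static void *worker(void *arg)
{
    (void)arg;

    /* Private queue pair: submissions and completions on it never contend
     * with other cores, so no locks are needed. */
    struct spdk_nvme_qpair *qpair = spdk_nvme_ctrlr_alloc_io_qpair(g_ctrlr, NULL, 0);
    if (qpair == NULL) {
        return NULL;
    }

    while (!g_stop) {
        /* Busy-poll for completions instead of taking interrupts and context
         * switches; a real worker also submits I/O here (see the read loop below). */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }

    spdk_nvme_ctrlr_free_io_qpair(qpair);
    return NULL;
}

void start_workers(unsigned int num_workers)
{
    pthread_t tids[64];

    if (num_workers > 64) {
        num_workers = 64;
    }
    for (unsigned int i = 0; i < num_workers; i++) {
        /* Pinning each thread to its own core (pthread_setaffinity_np) omitted. */
        pthread_create(&tids[i], NULL, worker, NULL);
    }
    for (unsigned int i = 0; i < num_workers; i++) {
        pthread_join(tids[i], NULL);
    }
}
```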
To compare the NVMe-oF performance of Arm servers and high-end x86 machines, we chose an Arm Neoverse N1-based dual-socket Ampere Altra server and a dual-socket Intel Xeon Platinum 8268 (Cascade Lake)-based Dell EMC PowerEdge R740xd. Both are equipped with Mellanox ConnectX-6 dual-port 200GbE network interface cards (configured as 100GbE) and two PCIe backplanes that can connect to 32 NVMe SSD drives. Both systems are tested in two different scenarios, each targeting a different leg of the storage stack processing.
User-level NVMe drivers from SPDK are used on both systems, along with the FIO tool for benchmarking performance. Both systems run the latest upstream version of SPDK (v21.01), and hyper-threading is enabled on the Xeon Platinum 8268-based system.
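For reference, the per-core inner loop that such a benchmark exercises looks roughly like the sketch below. This is a hedged approximation of what FIO with the SPDK plugin drives, not its actual code: run_read_loop(), the queue depth, the 4 KiB I/O size, and the single reused data buffer are illustrative choices, and the namespace and queue pair handles are assumed to come from the setup shown earlier.

```c
/*
 * Rough sketch of the polled, per-core read loop a benchmark such as FIO with
 * the SPDK plugin exercises: keep a fixed queue depth of 4 KiB reads in flight
 * and reap completions by polling, with no system calls or data copies on the
 * I/O path. The queue depth, I/O size, and single reused buffer are
 * illustrative; a real benchmark uses one buffer per outstanding I/O.
 */
#include <stdlib.h>

#include "spdk/env.h"
#include "spdk/nvme.h"

#define QUEUE_DEPTH 32
#define IO_SIZE     4096

static uint64_t g_completed;   /* successfully completed reads */
static uint32_t g_inflight;    /* reads currently outstanding */

static void read_done(void *arg, const struct spdk_nvme_cpl *cpl)
{
    (void)arg;
    if (!spdk_nvme_cpl_is_error(cpl)) {
        g_completed++;
    }
    g_inflight--;
}

void run_read_loop(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
                   uint64_t num_ios)
{
    uint32_t sector_size = spdk_nvme_ns_get_sector_size(ns);
    uint32_t lba_count = IO_SIZE / sector_size;
    uint64_t max_lba = spdk_nvme_ns_get_num_sectors(ns) - lba_count;
    /* DMA-able buffer from SPDK's hugepage-backed memory pool. */
    void *buf = spdk_zmalloc(IO_SIZE, 0x1000, NULL,
                             SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
    uint64_t submitted = 0;

    while (submitted < num_ios || g_inflight > 0) {
        /* Keep QUEUE_DEPTH reads outstanding. */
        while (g_inflight < QUEUE_DEPTH && submitted < num_ios) {
            uint64_t lba = ((uint64_t)rand() % max_lba / lba_count) * lba_count;
            if (spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count,
                                      read_done, NULL, 0) != 0) {
                break;  /* submission queue full: drain completions first */
            }
            g_inflight++;
            submitted++;
        }
        /* Reap completions by polling; 0 means "process everything available". */
        spdk_nvme_qpair_process_completions(qpair, 0);
    }
    spdk_free(buf);
}
```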
Using RDMA as the transport, we find that it takes between 4 and 8 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic.
Figure 2. Bandwidth for NVMe/RDMA with Initiator Scaling
Switching to TCP as the transport, the higher processing cost of the TCP stack kicks in: it takes between 16 and 24 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic. Both systems showed good scaling across cores.
Figure 3. Bandwidth for NVMe/TCP with Initiator Scaling
However, when system utilization is taken into account, we find that only 15% of the overall CPU bandwidth on Ampere Altra is used for NVMe over TCP processing, leaving over 85% available for other compute tasks. In the case of Xeon 8268, only 50% of overall CPU bandwidth is available as headroom.
Figure 4. System CPU Utilization for NVMe over TCP Initiator
When used as an NVMe-oF target, the CPU cores are responsible for processing both networking and storage traffic. Using RDMA as the transport, it takes 6-8 cores on both systems to saturate the network link across a variety of read and write request mixes.
Figure 5. Bandwidth for NVMe/RDMA with Altra Storage Server Core Scaling
With TCP as the transport, the CPU cost is much higher: up to 32 cores are needed on both systems to saturate the network link.
Figure 6. Bandwidth for NVMe/TCP with Altra Storage Server Core Scaling
A storage server typically runs other tasks as well, such as compression, encryption, RAID/erasure coding, and higher-layer block/object/file-system stacks such as Ceph and OpenEBS, which take up a significant amount of processing bandwidth.
An Ampere Altra-based system provides up to 80% headroom for these other tasks, compared to 50% on the Intel Xeon 8268 system.
Figure 7. System CPU Utilization for NVMe over TCP Target
With the wide adoption of NVMe-oF as the storage protocol for accessing remote storage devices over the network, Arm-based solutions are efficient and appealing for I/O-intensive applications. Arm Neoverse N1-based Ampere Altra servers use only 15-20% of their overall CPU bandwidth for NVMe over TCP stack processing and leave the rest as headroom for other storage and compute tasks, whereas the competition has only about half of its CPU bandwidth available for other tasks. Ampere Altra is available today in SKUs of up to 80 cores, and Altra Max is now sampling with SKUs of up to 128 cores. The data provided above can help storage system designers select the right SKU for their performance and usage targets.
[CTAToken URL = "https://www.arm.com/solutions/infrastructure" target="_blank" text="Explore Arm Neoverse" class ="green"]
References:
[1] Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan, NVMe-over-Fabrics performance characterization and the path to low-overhead flash disaggregation. In Proceedings of the 10th ACM International Systems and Storage Conference, SYSTOR ’17, pages 16:1–16:9, New York, NY, USA, 2017. ACM.
[2] Yichen Jia, Eric Anger, and Feng Chen, When NVMe over Fabrics Meets Arm: Performance and Implications. In Proceedings of the 35th Symposium on Mass Storage Systems and Technologies (MSST), pages 134-140, 2019, doi: 10.1109/MSST.2019.000-9.
[3] Amazon AWS Graviton Processors. https://aws.amazon.com/ec2/graviton/
[4] Oracle OCI A1. https://www.oracle.com/cloud/compute/arm/