NVMe over Fabrics performance for storage appliances with Ampere Altra servers

Ravi Malhotra
September 15, 2021
5 minute read time.


Co-Authors: Ravi Malhotra and Yichen Jia

This blog describes work done by Arm to optimize NVMe over Fabrics (NVMe-oF) performance on Ampere® Altra® server platforms for typical storage applications in the data center and showcases the value-add that Arm brings to this space.

Background

Storage disaggregation has recently gained popularity [1][2] as a way to relieve the significant computing burden that high-performance Non-Volatile Memory Express (NVMe) flash SSDs and I/O-intensive applications place on host CPUs. This burden can be reduced by separating the compute nodes from the storage nodes. NVMe over Fabrics (NVMe-oF) is particularly attractive for disaggregated storage: the RDMA transport offers ultra-low remote-access latency over fast interconnects, while the TCP transport is easy to deploy. It also enables independent and flexible infrastructure tuning, reduces resource underutilization, and lowers cost. Typically, the system is split between multiple initiators and a target, connected by a storage transport network, as shown in Figure 1. Several transport options exist; for this work, we focused on comparing RDMA over Converged Ethernet (RoCE) and TCP/IP to highlight the key differences in their processing requirements.

Figure 1. NVMe over Fabrics Architecture
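
To make the initiator/target split concrete, the sketch below shows how a Linux host could attach a remote NVMe-oF namespace over TCP using the in-kernel initiator and the nvme-cli tool. The target address, port, and subsystem NQN are hypothetical placeholders, not values from our test setup.

```python
# Minimal sketch: attach a remote NVMe-oF namespace over TCP with nvme-cli.
# The address, port, and NQN are hypothetical placeholders.
import subprocess

TARGET_ADDR = "192.168.1.100"                    # hypothetical target IP
TARGET_PORT = "4420"                             # conventional NVMe-oF service port
SUBSYS_NQN = "nqn.2021-09.io.example:storage1"   # hypothetical subsystem NQN

# Discover the subsystems exported by the target.
subprocess.run(
    ["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True)

# Connect; the remote namespaces then show up as local /dev/nvmeXnY block devices.
subprocess.run(
    ["nvme", "connect", "-t", "tcp", "-n", SUBSYS_NQN,
     "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True)
```

For RoCE, the transport argument changes to rdma; the rest of the flow is identical, which makes the two transports easy to compare on the same hardware.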

With the release of the Arm Neoverse architecture, Neoverse N1-based cloud servers have been widely deployed by major public cloud providers, such as Amazon Web Services [3] and Oracle OCI A1 [4]. They are designed for high compute capability and low power consumption, and they offer multiple connectivity options for both the network (multiple 100GbE/200GbE NICs) and storage devices. These features make them a particularly suitable platform for traditional server applications. In this blog, we show our benchmarking results for NVMe-oF on the N1-based Ampere® Altra® to illustrate the efficiency and flexibility of Arm solutions in this space.

While native support for the NVMe over Fabrics TCP initiator and target has been included in the Linux kernel since version 5.0, we selected SPDK (Storage Performance Development Kit, https://spdk.io), a user-space storage stack that avoids the typical data-copy and context-switch overheads of the kernel path. SPDK provides a lockless, thread-per-core design. Arm has made significant contributions to SPDK to optimize performance on 64-bit Arm platforms.
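
As an illustration of that user-space path, the sketch below drives SPDK's JSON-RPC interface (scripts/rpc.py) to stand up a simple NVMe-oF TCP target backed by a malloc RAM-disk bdev. The RPC names match recent SPDK releases, but the listen address, NQN, and backing device are assumptions for illustration; the benchmarks below export real NVMe drives rather than a RAM disk.

```python
# Sketch: configure an SPDK NVMe-oF TCP target through its JSON-RPC interface.
# Assumes an SPDK nvmf_tgt application is already running and that rpc.py
# (from the SPDK scripts/ directory) is on PATH. Address and NQN are placeholders.
import subprocess

NQN = "nqn.2021-09.io.example:cnode1"

def rpc(*args):
    subprocess.run(["rpc.py", *args], check=True)

rpc("nvmf_create_transport", "-t", "TCP")                   # enable the TCP transport
rpc("bdev_malloc_create", "-b", "Malloc0", "1024", "512")   # 1 GiB RAM disk, 512 B blocks
rpc("nvmf_create_subsystem", NQN, "-a", "-s", "SPDK0001")   # allow any host, set serial
rpc("nvmf_subsystem_add_ns", NQN, "Malloc0")                # expose the bdev as a namespace
rpc("nvmf_subsystem_add_listener", NQN,
    "-t", "tcp", "-a", "192.168.1.100", "-s", "4420")       # listen on a placeholder addr:port
```

The same flow applies to the RDMA transport by creating an RDMA transport and listener instead of TCP.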

Configuration used

To compare the NVMe-oF performance of Arm servers and high-end x86 machines, we chose an Arm Neoverse N1-based dual-socket Ampere Altra server and a dual-socket Intel Xeon Platinum 8268 (Cascade Lake) based Dell EMC PowerEdge R740xd. Both are equipped with Mellanox ConnectX-6 dual-port 200GbE network interface cards (configured as 100GbE) and two PCIe backplanes that can connect up to 32 NVMe SSD drives. Both systems are tested in two scenarios, each targeting a different leg of the storage stack processing.

  1. NVMe-oF initiator – both systems are used as machines initiating a mix of read and write requests to an NVMe-oF target. A single 100GbE NIC is used on the NVMe-oF initiator.
  2. NVMe-oF target – both systems are used as storage servers processing a mix of read and write requests received from multiple NVMe-oF initiators. Two 100GbE NICs are used on the NVMe-oF target.

User-level NVMe drivers from SPDK are used on both systems, along with the FIO tool for benchmarking performance. Both systems use the latest upstream version of SPDK (v21.01), and hyper-threading is enabled on the Xeon Platinum 8268-based system.
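
The measurements in the following sections are FIO-driven; the sketch below shows the general shape of such a run, launching a mixed random read/write job and pulling aggregate bandwidth out of FIO's JSON output. The device path, block size, queue depth, and read/write mix are illustrative placeholders rather than our exact job files, and the SPDK runs presumably route I/O through SPDK's FIO plugin rather than the libaio engine shown here.

```python
# Sketch: run an FIO random read/write mix and extract aggregate bandwidth
# from its JSON output. All job parameters are illustrative placeholders.
import json
import subprocess

result = subprocess.run(
    ["fio",
     "--name=nvmeof-mix",
     "--filename=/dev/nvme1n1",          # hypothetical NVMe-oF attached device
     "--ioengine=libaio", "--direct=1",
     "--rw=randrw", "--rwmixread=70",    # 70% reads / 30% writes
     "--bs=4k", "--iodepth=32", "--numjobs=8",
     "--time_based", "--runtime=60",
     "--group_reporting",
     "--output-format=json"],
    capture_output=True, text=True, check=True)

job = json.loads(result.stdout)["jobs"][0]
read_gbps = job["read"]["bw"] * 1024 * 8 / 1e9    # FIO reports "bw" in KiB/s
write_gbps = job["write"]["bw"] * 1024 * 8 / 1e9
print(f"read: {read_gbps:.1f} Gb/s, write: {write_gbps:.1f} Gb/s")
```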

NVMe-oF initiator performance

Using RDMA as the transport, we find that it takes between 4 and 8 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic.

Figure 2. Bandwidth for NVMe/RDMA with Initiator Scaling
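
Core scaling like this can also be reproduced with SPDK's bundled perf example application, which pins I/O submission to an explicit core mask. The sketch below is illustrative only: the binary path, target transport ID, and workload parameters are assumptions, and the published numbers above were measured with FIO rather than perf.

```python
# Sketch: sweep initiator core counts against an NVMe-oF RDMA target using
# SPDK's perf example application. Path, address, and parameters are
# illustrative placeholders.
import subprocess

PERF = "./build/examples/perf"                                       # typical location in an SPDK build tree
TRID = "trtype:RDMA adrfam:IPv4 traddr:192.168.1.100 trsvcid:4420"   # hypothetical remote target

for cores in (1, 2, 4, 8):
    core_mask = hex((1 << cores) - 1)     # e.g. 4 cores -> 0xf
    subprocess.run(
        [PERF,
         "-r", TRID,                      # transport ID of the remote target
         "-q", "32",                      # queue depth
         "-o", "4096",                    # 4 KiB I/O size
         "-w", "randrw", "-M", "70",      # 70/30 random read/write mix
         "-t", "30",                      # run time in seconds
         "-c", core_mask],                # cores available to the initiator
        check=True)
```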

Switching to TCP as the transport, the higher processing cost of the TCP stack kicks in: it now takes between 16 and 24 cores on both the Xeon 8268 and the Ampere Altra to fully saturate the network link bandwidth across different mixes of read and write request traffic. Both systems showed good scaling across cores.

Figure 3. Bandwidth for NVMe/TCP with Initiator Scaling

However, when system utilization is taken into account, we find that only 15% of the overall CPU bandwidth on Ampere Altra is used for NVMe over TCP processing, leaving over 85% available for other compute tasks. On the Xeon 8268, only 50% of the overall CPU bandwidth is available as headroom.

Figure 4. System CPU Utilization for NVMe over TCP Initiator
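
The headroom numbers follow from simple arithmetic on core counts, and the sketch below shows one way figures of this magnitude arise. It assumes a dual-socket 80-core Altra (160 cores, one thread per core) and a dual-socket 24-core Xeon 8268 with hyper-threading (96 hardware threads), with roughly 24 cores busy on the TCP initiator; the utilization in Figure 4 was measured, not derived this way.

```python
# Back-of-the-envelope check on CPU headroom for the NVMe/TCP initiator.
# Core counts are assumptions: dual-socket 80-core Altra without SMT, and
# dual-socket 24-core Xeon 8268 with hyper-threading enabled.
def headroom(busy_threads, total_threads):
    """Fraction of hardware threads left over for other work."""
    return 1.0 - busy_threads / total_threads

altra_threads = 2 * 80        # 160 cores, one hardware thread each
xeon_threads = 2 * 24 * 2     # 96 hardware threads with hyper-threading

busy_cores = 24               # ~24 cores saturate the 100GbE link over TCP

print(f"Altra headroom: {headroom(busy_cores, altra_threads):.0%}")      # ~85%
print(f"Xeon headroom:  {headroom(busy_cores * 2, xeon_threads):.0%}")   # ~50% if both HT siblings on busy cores count
```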

NVMe-oF target performance

When used as an NVMe-oF target, the CPU cores are responsible for processing both networking and storage traffic. Using RDMA as the transport, it takes 6-8 cores on both systems to saturate the network link across a variety of read and write request mixes.

Figure 5. Bandwidth for NVMe/RDMA with Altra Storage Server Core Scaling 

With TCP as the transport, CPU utilization is much higher: up to 32 cores are needed on both systems to saturate the network link.

Figure 6. Bandwidth for NVMe/TCP with Altra Storage Server Core Scaling 

A storage server typically runs other tasks as well, such as compression, encryption, RAID/erasure coding, and higher-layer block/object/file-system stacks such as Ceph and OpenEBS, which take up a significant amount of processing bandwidth.

An Ampere Altra-based system provides up to 80% headroom for these other tasks, compared to 50% on the Intel Xeon 8268 system.

 Figure 7. System CPU Utilization for NVMe over TCP Target

Conclusion

With the wide adoption of NVMe-oF as the storage protocol for accessing remote storage devices over the network, Arm-based solutions are efficient and appealing for I/O-intensive applications. Arm Neoverse N1-based Ampere Altra servers use only 15-20% of their overall CPU bandwidth for NVMe over TCP stack processing and leave the rest as headroom for other storage and compute tasks, whereas the competition has only about half of its CPU capacity available for other tasks. Ampere Altra is available today in SKUs of up to 80 cores, and Altra Max is now sampling with SKUs of up to 128 cores. The data above can help storage system designers select the right SKU for their performance and usage targets.

Explore Arm Neoverse

References:

[1] Zvika Guz, Harry Li, Anahita Shayesteh, and Vijay Balakrishnan, NVMe-over-Fabrics performance characterization and the path to low-overhead flash disaggregation. In Proceedings of the 10th ACM International Systems and Storage Conference (SYSTOR '17), pages 16:1-16:9, New York, NY, USA, 2017. ACM.

[2] Yichen Jia, Eric Anger, and Feng Chen, When NVMe over Fabrics Meets Arm: Performance and Implications. In Proceedings of the 35th Symposium on Mass Storage Systems and Technologies (MSST), 2019, pp. 134-140, doi: 10.1109/MSST.2019.000-9.

[3] Amazon AWS Graviton Processors. https://aws.amazon.com/ec2/graviton/

[4] Oracle OCI A1. https://www.oracle.com/cloud/compute/arm/
