
SPDK NVMe over TCP Optimization on Arm

Rui Chang
February 5, 2024
15 minute read time.
Co-authors: Rui Chang and Richael Zhuang

This blog introduces Storage Performance Development Kit (SPDK) NVMe (Non-Volatile Memory Express) over TCP (Transmission Control Protocol) on Arm, and explains how to maximize its performance.


What is SPDK? 

As the IO performance of storage media continues to improve, storage software consumes a growing share of total transaction time, so improving the performance and efficiency of the storage software stack is critical. SPDK is an open-source software framework that provides a set of libraries and tools for writing high-performance, scalable, user-mode storage applications tailored to specific needs. SPDK unlocks the full potential of modern storage hardware, such as non-volatile memory (NVM) devices, solid-state drives (SSDs), and networked storage devices.

How does SPDK work? 

The traditional kernel I/O stack introduces overhead from context switches, data copies, interrupts, and resource synchronization. SPDK minimizes overhead during IO processing by:

  • Using user mode for storage applications rather than kernel mode. After devices are bound to the UIO or VFIO driver, SPDK operates them from user space, which eliminates costly context switches. Applications leveraging the SPDK library communicate with the devices directly through the user-space driver.
  • Running in polled mode instead of interrupt mode. SPDK creates a thread on each core during initialization, called a reactor (Figure 1). Users register pollers on the reactor to poll hardware for completions instead of waiting for interrupts, which reduces interrupt-handling overhead and latency (see the poller sketch after Figure 1).
  • Using a shared-nothing thread model. Each SPDK thread operates independently on its own set of data structures and resources, which avoids synchronization overhead. An event ring is created on each reactor for necessary thread communication.


Figure 1. SPDK thread model
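For readers new to the reactor and poller model, the following is a minimal C sketch of registering a poller on an SPDK reactor, assuming a recent SPDK release; the application name and poller body are illustrative only.

#include "spdk/event.h"
#include "spdk/thread.h"

static struct spdk_poller *g_poller;

/* Poller callback: return SPDK_POLLER_BUSY if work was done, SPDK_POLLER_IDLE otherwise. */
static int
my_poll_fn(void *arg)
{
    /* Poll hardware or a queue for completions here. */
    return SPDK_POLLER_IDLE;
}

/* Called on the main reactor once the SPDK framework is up. */
static void
app_start(void *arg)
{
    /* Register a poller on the current SPDK thread; period 0 means poll on every reactor iteration. */
    g_poller = SPDK_POLLER_REGISTER(my_poll_fn, NULL, 0);
}

int
main(int argc, char **argv)
{
    struct spdk_app_opts opts = {};
    int rc;

    spdk_app_opts_init(&opts, sizeof(opts));
    opts.name = "poller_example";

    /* Blocks until spdk_app_stop() is called. */
    rc = spdk_app_start(&opts, app_start, NULL);
    spdk_app_fini();
    return rc;
}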

SPDK framework 

SPDK includes several layers as shown in Figure 2. 


Figure 2. SPDK architecture

  • Hardware Drivers: The NVMe driver is the foundational component for SPDK. This is a C library used for direct, zerocopy data transfer to and from NVMe devices. The virtio driver allows communicating with virtio devices.
  • Block storage: SPDK provides rich back-end storage device support, including an NVMe block device backed by NVMe SSDs, Linux Asynchronous I/O (AIO) that allows SPDK to interact with kernel devices such as HDDs, and Ceph RBD that allows Ceph to serve as a backend device for SPDK.
  • Block storage services: SPDK block storage service layer provides flexible APIs for additional customer functionality including RAID and compression in the block layer.
  • Block storage protocols: Block storage protocols enable SPDK to expose its backend storage to remote clients, virtual machines, or other processes through different transmission protocols. The iSCSI target is an implementation of the established specification for block-level SCSI data over TCP/IP connections. The NVMe-oF target is a user-space implementation of the NVMe-oF specification which presents a block device over a fabric. The vhost target enables SPDK to provide backend storage for QEMU-based virtual machines or Kata containers. Vfio-user allows SPDK to expose emulated NVMe devices to virtual machines, which use existing NVMe drivers to communicate with the devices.
  • File Storage Services: SPDK also provides a file system called BlobFS on top of its block allocator, Blobstore. It works as the storage backend for MySQL and RocksDB, which keeps the whole IO path in user space.

What is NVMe over TCP? 

  • NVMe is a protocol designed for solid-state drives (SSDs) to maximize performance by leveraging the capabilities of the Peripheral Component Interconnect Express (PCIe) interface. NVMe over PCIe is the original use of the NVMe protocol, for local NVMe SSD access. It transfers data by mapping commands and responses to shared memory in the host over the PCIe interface protocol.
  • NVMe over Fabrics (NVMe-oF) enables remote sharing and access of NVMe storage devices over a network fabric, such as Ethernet or Fibre Channel. NVMe-oF is the extension of NVMe over PCIe. It uses a message-based model, or a combined model, for communication between a host and a target storage device. The supported transport protocols are Fibre Channel, RDMA (InfiniBand, RoCE, iWARP), and TCP (Figure 3).


Figure 3. NVMe over Fabrics model

SPDK supports the RDMA, TCP, and Fibre Channel transports. Its NVMe-oF implementation consists of an initiator framework and a target (Figure 4). If the initiator (host) and the NVMe SSDs are in the same server, the devices are accessed directly over PCIe. If not, the initiator must access the remote target devices through a fabric.

Among multiple fabric options, NVMe over TCP allows users to harness NVMe across a standard Ethernet network. This means lower deployment cost and design complexity thanks to the stability and portability of the mature TCP/IP stack.

We will focus on SPDK NVMe over TCP, which combines the advantages of NVMe over TCP with the SPDK working model.


Figure 4. SPDK NVMe over Fabrics framework

When using the TCP transport (Figure 5), each host-side NVMe queue pair has a corresponding controller-side queue pair mapped to its own TCP connection, which is assigned to a separate CPU core. Command capsules are encapsulated into TCP PDUs (Protocol Data Units) and sent over standard TCP/IP sockets by calling Linux syscalls such as sendmsg. The controller side reads the received data from the socket buffer and constructs the received CMD capsule, which contains the request information for further processing. After the request is processed, an RSP capsule is generated and sent through the socket. The response data arrives in the host-side socket buffer, where it is assembled into a received RSP capsule.


Figure 5. NVMe over TCP data path
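To make the capsule-to-PDU mapping concrete, here is an illustrative C sketch, not SPDK's actual implementation, of the 8-byte NVMe/TCP PDU common header described by the transport specification, and of sending one PDU over a connected TCP socket with a single sendmsg call.

#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Simplified NVMe/TCP PDU common header (8 bytes). Multi-byte fields are
 * little-endian on the wire; byte-order handling is omitted for brevity. */
struct nvme_tcp_common_hdr {
    uint8_t  pdu_type;  /* command capsule, response capsule, data, ... */
    uint8_t  flags;
    uint8_t  hlen;      /* header length */
    uint8_t  pdo;       /* PDU data offset */
    uint32_t plen;      /* total PDU length: header plus payload */
};

/* Illustrative only: send one PDU (header plus payload) with a single
 * sendmsg() call, similar in spirit to how the initiator and target
 * exchange capsules over the socket. */
static ssize_t
send_pdu(int sockfd, struct nvme_tcp_common_hdr *hdr, void *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = sizeof(*hdr) },
        { .iov_base = payload, .iov_len = payload_len  },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    hdr->plen = (uint32_t)(sizeof(*hdr) + payload_len);
    return sendmsg(sockfd, &msg, 0);
}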

Optimization work on Arm 

SPDK NVMe over TCP is a high-performance solution that exposes NVMe storage to remote clients over a TCP/IP network. Although SPDK is lock free and the NVMe driver runs in user space, the kernel-based TCP/IP stack is not lock free, so system calls and memory copies between kernel and user space are inevitable. To use the TCP/IP stack efficiently, SPDK has introduced several optimizations, including:

  • Batch write 
  • Pipe buffer 
  • Zerocopy 

Our optimization work builds on the existing implementation and aims to squeeze further performance out of SPDK NVMe over TCP:

  • Tune system configuration 
  • Improve data locality 
  • Balance zerocopy and non-zerocopy
  • Reduce power waste

Configuration optimization 

Appropriate system configuration is important for SPDK. The platform is configured according to the architecture and features, including: 

  • Linux kernel setting 
  • PCIe parameters 
  • NIC parameters

Linux kernel cmdline setting 

  • Hugepage: SPDK depends on the Data Plane Development Kit (DPDK) library to manage components including hugepage memory and buffer pools. DPDK supports 2MB and 1GB hugepages, which cover large memory areas with fewer TLB misses and therefore better performance.
  • Core isolation: isolate CPUs from the kernel scheduler to reduce context switches.
  • Iommu.passthrough: SPDK recommends using the vfio-pci driver if an IOMMU is available. Otherwise, use uio_pci_generic or igb_uio. To use the uio_pci_generic or igb_uio driver, the IOMMU should be disabled or set to passthrough mode. If iommu.passthrough is not set, the vfio-pci driver uses IO virtual addresses (IOVA) for DMA, which is more secure thanks to the IOMMU's translation. If "iommu.passthrough=1" is added to the GRUB cmdline, physical addresses are used for DMA, which provides better performance.

For example, the following GRUB cmdline parameters allocate four 1G hugepages, isolate CPU cores 0-7 for SPDK, and use physical addresses as IOVA.

hugepagesz=1G hugepages=4 isolcpus=0-7 iommu.passthrough=1

PCIe parameter tuning

  • PCIe Max Payload Size determines the maximum size of a PCIe packet. The manufacturer sets the maximum TLP payload size, and the value also depends on the connected device. Add "pci=pcie_bus_perf" to the kernel cmdline to ensure the maximum PCIe payload size is used.
  • PCIe Max Read Request determines the maximum PCIe read request allowed. The size of the PCIe max read request may affect the number of pending requests. Tune it according to your workload. 

Use the following commands to check and set the maximum read request size:

lspci -vvv -s 0000:04:00.0 | grep MaxReadReq
    MaxPayload 256 bytes, MaxReadReq 256 bytes

setpci -s 0000:04:00.0 68.w        # get the current configuration
    1963                           # the leading "1" means 256 bytes

setpci -s 0000:04:00.0 68.w=2963   # set Max Read Request to 512 bytes
# The first digit is the PCIe Max Read Request size selector.
# Accepted values: 0 - 128B, 1 - 256B, 2 - 512B, 3 - 1024B, 4 - 2048B, 5 - 4096B.

NIC parameter tuning

  • NIC queue number and queue depth: Normally, the NIC Rx/Tx queue number is set to the number of CPUs. A proper queue size is also required: a queue that is too small may result in packet loss, while a queue whose ring size exceeds the cache size may cause poor cache utilization. Tune these according to your system resources and workloads, for example:

ethtool -G ${nic} rx 1024 tx 1024
ethtool -L ${nic} combined 56

  • Hard interrupt affinity: IRQ affinity is a Linux feature that assigns specific IRQs to specific processors. A proper IRQ affinity setting helps the server work efficiently. In most cases, a NIC's IRQs should be bound to the same NUMA node the NIC is attached to. Irqbalance is a Linux daemon that balances the CPU load generated by interrupts across all CPUs. To set IRQ affinity manually, stop the irqbalance service first.

For example, use the following commands to serve IRQ 40 on the upper 32 cores of a 64-core system:

service irqbalance stop
echo 0xffffffff,00000000 > /proc/irq/40/smp_affinity

# or use the script from https://github.com/Mellanox/mlnx-tools.git:
./set_irq_affinity_cpulist.sh "8-16" ${nic}
# or
./set_irq_affinity_cpulist.sh "1,3,5,7" ${nic}

  • Hard interrupt coalescence: Interrupt coalescing is a way of controlling when a device raises an interrupt. The NIC collects incoming packets and waits until a threshold is reached before generating an interrupt. This reduces the overall number of interrupts the CPU must handle, which results in higher throughput and lower CPU usage at the cost of increased latency.

Use the following command to enable adaptive IRQ coalescing:

ethtool -C eth0 adaptive-rx on

Or use ethtool -C to set the coalescing parameters explicitly according to your own case, for example:

ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 64 rx-frames 128 tx-usecs 128 tx-frames 128

The parameters are: 

  • rx-usecs: Number of usecs delaying an RX interrupt after a packet arrives. 
  • rx-frames: Maximum number of data frames received before an RX interrupt. 
  • rx-usecs-irq: Number of usecs delaying an RX interrupt while an interrupt is being serviced by the host. 
  • rx-frames-irq: Maximum number of data frames received before an RX interrupt is generated, while the system is servicing an interrupt. 
  • Soft interrupt coalescence

The New API (NAPI) is a mechanism for reducing the number of IRQs generated by network devices on packet arrival. This registers a poll function that the NAPI subsystem calls to harvest data frames.

Set "net.core.netdev_budget" and "net.core.netdev_budget_usecs" to limit the number of packets polled in one NAPI polling cycle. netdev_budget is the maximum number of packets taken from all interfaces in one polling cycle, and a polling cycle may not exceed netdev_budget_usecs microseconds even if netdev_budget has not been exhausted. In addition, dev_weight is the maximum number of packets the kernel can handle on a NAPI interrupt; it is a per-CPU variable.

sysctl -w net.core.netdev_budget=300
sysctl -w net.core.netdev_budget_usecs=8000
sysctl -w net.core.dev_weight=64

Refer to the Linux network performance parameters documentation for more information.

  • TCP socket buffer: The TCP socket buffer size is calculated automatically based on system memory by default. A socket buffer that is too small may result in packet drops when receiving data and frequently blocked write operations when sending data. To tune the buffers, use the following settings:

# Set 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456

# Increase autotuning TCP buffer limits 128MB
# min, max and default settings
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

Data locality optimization 

In the kernel TCP stack, when data arrives at the NIC and the NIC DMAs packets into RAM, a receive ring is selected according to the RSS (Receive Side Scaling) hash function and a reference to the packet is enqueued onto that ring buffer. A hard IRQ is raised and processed by a CPU: the assigned CPU if IRQ affinity is set, otherwise one selected by the irqbalance service. By default, the soft IRQ is triggered on the same CPU core as the hard IRQ and schedules NAPI to poll data from the receive ring buffer. The packet is then processed on that CPU core until it is enqueued into the socket receive buffer.


Figure 6. TCP data receive flow

In SPDK NVMe over TCP, each connection from a client is assigned to a reactor (CPU core) when the connection is established, and the socket reads and writes for that connection are executed on this CPU core. So, a semantic gap exists between kernel space and user space with respect to CPU core affinity.

To guarantee that the CPU core processing a socket's data in kernel space is the same core that reads the socket in user space (SPDK), we introduced CPU-affinity-based socket placement in SPDK NVMe over TCP. This obtains the CPU affinity of a socket and decides which CPU core the socket should be assigned to during connection initiation. For example, when a new connection (socket A) is established (Figure 6), we obtain the CPU affinity of socket A: CPU core 1, which is responsible for the kernel-space processing of packets for this socket. In SPDK, socket A is then assigned to the poll group on core 1, and future reads and writes of socket A are executed on core 1, as sketched below.
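As a rough illustration of the underlying mechanism (SPDK's socket layer wraps this in its own helpers, so the function name here is not an SPDK API), the kernel-side CPU of a TCP socket can be queried on Linux with the SO_INCOMING_CPU socket option:

#include <stdio.h>
#include <sys/socket.h>

/* Query which CPU core last processed this socket in the kernel
 * (SO_INCOMING_CPU, available since Linux 3.19). The poll group running
 * on that core can then take ownership of the connection. */
static int
get_socket_cpu(int sockfd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(sockfd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0) {
        perror("getsockopt(SO_INCOMING_CPU)");
        return -1;
    }
    return cpu;
}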

For example, with 6 P4600 NVMe SSDs on the target, the target using 8 cores, NIC IRQs bound to these 8 cores, and the initiator side using 24 and 32 cores, this results in an 11%~17% randwrite performance boost.

Zerocopy optimization 

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. The feature is currently implemented for TCP sockets. However, copy avoidance is not a free lunch, as it adds page-accounting and completion-notification overhead during page pinning.

In SPDK NVMe over TCP, zerocopy can be enabled or disabled at initialization. When enabled, all data is sent by zerocopy regardless of its size, which hurts performance for small payloads such as request responses. So, it is important to balance memory copy overhead against page pinning overhead. Dynamic zerocopy was introduced to set a threshold: any payload larger than the threshold is sent by zerocopy, while smaller payloads are sent by a normal copied send, as sketched below.
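The idea can be sketched in C as follows. This is a minimal illustration, not SPDK's actual code: the threshold value is hypothetical and must be tuned, SO_ZEROCOPY is assumed to be enabled on the socket already, and zerocopy completions are assumed to be reaped from the socket error queue elsewhere.

#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical threshold: payloads at or above this size are sent with
 * MSG_ZEROCOPY, smaller ones are copied as usual. The best value is
 * workload dependent and should be measured. */
#define ZCOPY_THRESHOLD 8192

/* Send a buffer, choosing zerocopy or a regular copied send based on size. */
static ssize_t
send_dynamic(int sockfd, void *buf, size_t len)
{
    int flags = (len >= ZCOPY_THRESHOLD) ? MSG_ZEROCOPY : 0;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    return sendmsg(sockfd, &msg, flags);
}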

For example, with 16 P4610 NVMe SSDs and two initiators, where the target and initiator configurations match the SPDK performance report: for the posix socket with rw_percent=0 (randwrite), there is a 2.4%~8.3% performance boost, tested with 1~40 target CPU cores at a queue depth of 128, and no obvious influence when the read percentage is greater than 50%. For the uring socket with rw_percent=0 (randwrite), there is a 1.8%~7.9% performance boost tested with 1~40 target CPU cores at a queue depth of 128 (Figure 7), and a 1%~7% improvement when the read percentage is greater than 50%.


Figure 7. 4KB randwrite performance with qdepth=128

Power optimization 

Previously in SPDK, each thread on a CPU core worked in poll mode regardless of how much work there was to process. This may waste power when workloads vary greatly over time. To solve this problem, the dynamic scheduler framework was introduced for power saving and reduction of CPU utilization.

The scheduler framework collects data for each thread and reactor dynamically, and performs actions including moving a thread, switching reactor mode and setting CPU core frequency. For example, if the pollers in reactor1 to reactorN are idle, the corresponding SPDK threads will migrate to reactor0 (Figure 8). Reactor1 to reactorN are then switched to interrupt mode. The CPU frequency of reactor0 is adjusted according to how busy this reactor is. This is called CPU frequency scaling.

The Linux kernel supports CPU performance scaling through the CPUFreq (CPU Frequency scaling) subsystem. This consists of three modules: 

  • the core 
  • scaling governors 
  • scaling drivers 

The scaling drivers communicate with the hardware. The cppc_cpufreq driver works on most arm64 platforms. It uses CPPC methods as outlined in the ACPI 5.1 specification. Collaborative Processor Performance Control (CPPC) is based on an abstract, continuous scale of CPU performance values, which allows the remote power processor to optimize flexibly for power and performance.

To enable CPU frequency scaling on arm64, cppc_cpufreq driver support was added to the DPDK power library. SPDK leverages this to scale the CPU frequency as well as to obtain frequency information for scaling decisions; the available attributes include highest_perf, nominal_perf, scaling_max_freq, scaling_min_freq, and so on. The library provides APIs for users to set the CPU frequency and to enable or disable turbo boost. Refer to the DPDK power library documentation for more information on the APIs; a rough sketch follows.
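The sketch below shows how the DPDK power library can be driven for per-core frequency scaling. The function names are from the DPDK power API; SPDK wraps them in its own scheduler and environment layers, and in a real application initialization and teardown would happen once at startup and shutdown rather than per scaling decision.

#include <stdbool.h>
#include <stdio.h>
#include <rte_power.h>

/* Scale one lcore's frequency based on how busy its reactor is.
 * rte_power_init() auto-detects the cpufreq driver in use, which is
 * cppc_cpufreq on most arm64 platforms. Error handling is reduced
 * for brevity. */
static int
scale_core(unsigned int lcore_id, bool busy)
{
    if (rte_power_init(lcore_id) != 0) {
        fprintf(stderr, "power init failed on lcore %u\n", lcore_id);
        return -1;
    }

    if (busy) {
        rte_power_freq_max(lcore_id);   /* busy reactor: run at the highest frequency */
    } else {
        rte_power_freq_min(lcore_id);   /* mostly idle reactor: drop to the lowest frequency */
    }

    return rte_power_exit(lcore_id);
}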


Figure 8. SPDK dynamic scheduler solution

Conclusion 

This blog introduced SPDK, SPDK NVMe over TCP, and how to optimize it. This includes system configuration optimization, data locality optimization, memory copy avoidance optimization, and power optimization. These could be used to solve performance-critical storage problems. 
