This blog introduces Storage Performance Development Kit (SPDK) NVMe (Non-Volatile Memory Express) over TCP (Transmission Control Protocol) on Arm and describes how to maximize its performance.
As the IO performance of storage media continues to improve, the storage software consumes an increasingly large share of the total transaction time, so improving the performance and efficiency of the storage software stack is critical. SPDK is an open-source software framework that provides a set of libraries and tools for writing high performance, scalable, user-mode storage applications tailored to specific needs. SPDK unlocks the full potential of modern storage hardware, such as non-volatile memory (NVM) devices, solid-state drives (SSD), and networked storage devices.
The traditional kernel I/O stack brings overhead due to context switches, data copies, interrupts, and resource synchronization. SPDK minimizes overhead during IO processing by:
- Moving the necessary drivers into user space, which avoids system calls and enables zero-copy access from the application
- Polling hardware for completions instead of relying on interrupts
- Avoiding locks in the I/O path, relying instead on message passing between threads (Figure 1)
Figure 1. SPDK thread model
SPDK includes several layers as shown in Figure 2.
Figure 2. SPDK architecture
Figure 3. NVMe over Fabrics model
SPDK supports RDMA, TCP, and Fibre Channel transports for NVMe over Fabrics. The solution consists of an initiator framework and a target (Figure 4). If the initiator (host) and the NVMe SSDs are in the same server, the devices are accessed directly over PCIe. If not, the initiator must access the remote target devices through a fabric.
Among multiple fabric options, NVMe over TCP allows users to harness NVMe across a standard Ethernet network. This means lower deployment cost and design complexity thanks to the stability and portability of the mature TCP/IP stack.
We will focus on SPDK NVMe over TCP, which combines the advantages of NVMe over TCP with the SPDK working mechanism.
Figure 4. SPDK NVMe over Fabrics framework
When using the TCP transport (Figure 5), each host-side NVMe queue pair has a corresponding controller-side queue pair that is mapped to its own TCP connection, and each connection is assigned to a separate CPU core. Command capsules are encapsulated into TCP PDUs (Protocol Data Units) and sent over standard TCP/IP sockets by calling Linux syscalls such as sendmsg. The controller side reads the received data from its socket buffer and reconstructs the command capsule, which contains the request information for further processing. After the request is processed, a response capsule is generated and sent through the socket. The response data arrives in the host-side socket buffer, where it is parsed into a response capsule.
Figure 5. NVMe over TCP data path
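To make the send path concrete, here is a rough sketch (for illustration only, not the actual SPDK code) of how a command capsule can be prefixed with the 8-byte NVMe/TCP common PDU header and handed to the kernel in a single sendmsg call:

#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* NVMe/TCP common PDU header (8 bytes), as defined by the NVMe/TCP spec. */
struct nvme_tcp_common_hdr {
    uint8_t  pdu_type;   /* capsule command, capsule response, data, ... */
    uint8_t  flags;
    uint8_t  hlen;       /* header length */
    uint8_t  pdo;        /* PDU data offset */
    uint32_t plen;       /* total PDU length, header plus data */
};

/* Send one PDU (header plus command capsule) on the per-queue-pair TCP socket. */
static ssize_t send_pdu(int sock, struct nvme_tcp_common_hdr *hdr,
                        void *capsule, size_t capsule_len)
{
    struct iovec iov[2] = {
        { .iov_base = hdr,     .iov_len = sizeof(*hdr) },
        { .iov_base = capsule, .iov_len = capsule_len  },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    return sendmsg(sock, &msg, 0);
}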
SPDK NVMe over TCP is a high-performance solution exposing NVMe storage to remote clients over a TCP/IP network. Although SPDK is lock free and the NVMe driver runs in user space, the kernel-based TCP/IP stack is not lock free, so system calls and memory copies between kernel and user space are inevitable. To use the TCP/IP stack efficiently, SPDK has introduced several optimizations, such as zero-copy send (MSG_ZEROCOPY) and an io_uring-based socket implementation.
Our optimization work builds on the existing implementation and aims to further squeeze SPDK NVMe over TCP performance in the following areas: system configuration, data locality, memory copy avoidance, and power consumption.
Appropriate system configuration is important for SPDK. The platform is configured according to its architecture and features, including hugepages, CPU core isolation, IOMMU mode, PCIe maximum read request size, NIC ring buffers and queues, IRQ affinity and coalescing, NAPI budget, and TCP socket buffer sizes.
For example, for four 1G hugepages, with SPDK running on CPU cores 0-7 and the IOVA being the physical address, add the following parameters to the GRUB cmdline:
hugepagesz=1G hugepages=4 isolcpus=0-7 iommu.passthrough=1
Use the following commands to check and set the PCIe maximum read request size:
lspci -vvv -s 0000:04:00.0 | grep MaxReadReq
  MaxPayload 256 bytes, MaxReadReq 256 bytes
setpci -s 0000:04:00.0 68.w        # get current configuration
1963                               # "1" means 256 bytes
setpci -s 0000:04:00.0 68.w=2963
(The first digit is the PCIe Max Read Request size selector. The acceptable values are: 0 - 128B, 1 - 256B, 2 - 512B, 3 - 1024B, 4 - 2048B and 5 - 4096B.)
ethtool -G ${nic} rx 1024 tx 1024   # set RX/TX ring buffer sizes
ethtool -L ${nic} combined 56       # set the number of combined queues
For example, use the following commands to serve IRQ 40 on the upper 32 cores of a 64-core system:
service irqbalance stop
echo 0xffffffff,00000000 > /proc/irq/40/smp_affinity
# or with the script from https://github.com/Mellanox/mlnx-tools.git
./set_irq_affinity_cpulist.sh "8-16" ${nic}
# or
./set_irq_affinity_cpulist.sh "1,3,5,7" ${nic}
Use the following command to enable adaptive IRQ coalescing:
ethtool -C eth0 adaptive-rx on
Alternatively, use ethtool -C to set the IRQ coalescing parameters for your own case, for example:
ethtool -C ${nic} adaptive-rx off adaptive-tx off \
        rx-usecs 64 rx-frames 128 tx-usecs 128 tx-frames 128
The parameters are:
- adaptive-rx / adaptive-tx: enable or disable adaptive (dynamic) interrupt coalescing on the receive and transmit paths
- rx-usecs: how many microseconds to wait after at least one packet arrives before raising an RX interrupt
- rx-frames: the maximum number of packets to receive before raising an RX interrupt
- tx-usecs / tx-frames: the same limits for the transmit completion path
The New API (NAPI) is a mechanism for reducing the number of IRQs generated by network devices on packet arrival. The device driver registers a poll function that the NAPI subsystem calls to harvest data frames.
Set net.core.netdev_budget and net.core.netdev_budget_usecs to limit the number of packets polled in one NAPI polling cycle. netdev_budget is the maximum number of packets taken from all interfaces in one polling cycle. A polling cycle may not exceed netdev_budget_usecs microseconds, even if netdev_budget has not been exhausted. dev_weight is the maximum number of packets the kernel can handle on a NAPI interrupt; it is a per-CPU variable.
sysctl -w net.core.netdev_budget=300
sysctl -w net.core.netdev_budget_usecs=8000
sysctl -w net.core.dev_weight=64
Refer to linux-network-performance-parameters for more information.
# Set 256MB buffers
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
# Increase autotuning TCP buffer limits to 128MB
# min, default, and max settings
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
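These kernel-wide limits cap what an individual socket may request. As a generic sketch (not SPDK code), an application can ask for larger per-socket buffers with SO_RCVBUF/SO_SNDBUF; the kernel doubles the requested value for bookkeeping and caps it at rmem_max/wmem_max, so those sysctls must be raised first for large requests to take effect:

#include <sys/socket.h>

/* Request larger per-socket receive and send buffers.
 * Returns 0 on success, -1 on failure. */
static int set_socket_buffers(int fd, int rcv_bytes, int snd_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv_bytes, sizeof(rcv_bytes)) != 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd_bytes, sizeof(snd_bytes));
}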
In the kernel TCP receive path, when data arrives at the NIC, the NIC DMAs the packets into RAM and a receive ring is selected according to the RSS (Receive Side Scaling) hash function. A reference to the packet is enqueued to the ring buffer, and a hard IRQ is raised and processed by a CPU core: the assigned one if we set the IRQ affinity, or else one selected by the irqbalance service. By default, the soft IRQ is also triggered on the same CPU core as the hard IRQ, and it schedules NAPI to poll data from the receive ring buffer. The processing of this packet is carried out on that CPU core until the packet is enqueued to the socket receive buffer.
Figure 6. TCP data receive flow
In SPDK NVMe over TCP, each connection from a client is assigned to a reactor (CPU core) during connection initiation, and the socket reads/writes of this connection are completed on that CPU core. So, a semantic gap exists between kernel space and user space in relation to CPU core affinity.
To guarantee that the CPU core processing this socket's data in kernel space is the same as the core that reads the socket in user space (SPDK), we introduced CPU affinity based placement of sockets in SPDK NVMe over TCP. This obtains the CPU affinity of a socket and decides which CPU core the socket should be assigned to during connection initiation. For example, when a new connection (socket A) is launched (Figure 6), we obtain the CPU affinity of socket A. This is CPU core 1, which is responsible for the kernel space processing of the packets for this socket. In SPDK, socket A is assigned to the poll group on core 1, and future reads/writes of socket A are executed on core 1.
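As a minimal sketch of how the socket's CPU affinity can be queried (not the actual SPDK sock-layer code, and assuming Linux 3.19+ headers), the kernel reports the CPU core that last processed a socket in the receive path through the SO_INCOMING_CPU socket option:

#include <stdio.h>
#include <sys/socket.h>

/* Return the CPU core that most recently handled this socket in the
 * kernel receive path, or -1 on failure. The connection can then be
 * assigned to the poll group running on that core. */
static int get_socket_incoming_cpu(int fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) != 0) {
        perror("getsockopt(SO_INCOMING_CPU)");
        return -1;
    }
    return cpu;
}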
For example, with 6 P4600 NVMe SSDs on the target, the target using 8 cores, NIC IRQs bound to these 8 cores, and the initiator side using 24 and 32 cores, this results in an 11%~17% randwrite performance boost.
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. The feature is currently implemented for TCP sockets. However, copy avoidance is not a free lunch, as it causes extra page accounting and completion notification overhead during page pinning.
In SPDK NVMe over TCP, zerocopy can be enabled or disabled during initialization. When enabled, all data is sent by zerocopy no matter what size it is. This has a negative performance impact for small data such as request responses. So, it is important to balance memory copy overhead against page pinning overhead. Dynamic zerocopy was introduced to set a threshold that determines whether data is sent by zerocopy: any payload larger than the threshold is sent by zerocopy, while smaller payloads are not.
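The idea can be sketched as follows (a simplified illustration assuming Linux 4.14+ headers; the threshold name and value are hypothetical placeholders, and the real SPDK implementation differs in detail). The socket opts into zero-copy once with SO_ZEROCOPY, and each send sets MSG_ZEROCOPY only when the payload exceeds the threshold; completions are reported asynchronously on the socket error queue and must be reaped separately:

#include <sys/socket.h>

#define ZCOPY_THRESHOLD (8 * 1024)  /* hypothetical threshold, in bytes */

/* Opt the socket into zero-copy sends once, at connection setup. */
static int enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Send a message, using MSG_ZEROCOPY only for large payloads so that
 * small responses avoid the page-pinning and notification overhead. */
static ssize_t send_dynamic_zerocopy(int fd, struct msghdr *msg, size_t total_len)
{
    int flags = (total_len > ZCOPY_THRESHOLD) ? MSG_ZEROCOPY : 0;

    return sendmsg(fd, msg, flags);
}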
For example, with 16 P4610 NVMe SSDs and two initiators, where the target and initiator configurations are the same as in the SPDK performance report: for the posix socket with rw_percent=0 (randwrite), there is a 2.4%~8.3% performance boost tested with 1~40 target CPU cores and 128 queue depth, and no obvious influence when the read percentage is greater than 50%. For the uring socket with rw_percent=0 (randwrite), there is a 1.8%~7.9% performance boost tested with 1~40 target CPU cores and 128 queue depth (Figure 7), and a 1%~7% improvement when the read percentage is greater than 50%.
Figure 7. 4KB randwrite performance with qdepth=128
Previously in SPDK, each thread on a CPU core worked in poll mode regardless of how much work there was to process. However, this may waste power when the workload varies considerably over time. To solve this problem, the dynamic scheduler framework was introduced to save power and reduce CPU utilization.
The scheduler framework collects data for each thread and reactor dynamically, and performs actions including moving a thread, switching reactor mode and setting CPU core frequency. For example, if the pollers in reactor1 to reactorN are idle, the corresponding SPDK threads will migrate to reactor0 (Figure 8). Reactor1 to reactorN are then switched to interrupt mode. The CPU frequency of reactor0 is adjusted according to how busy this reactor is. This is called CPU frequency scaling.
The Linux kernel supports CPU performance scaling through the CPUFreq (CPU Frequency scaling) subsystem. This consists of three modules: the core, scaling governors, and scaling drivers.
The scaling drivers communicate with the hardware. The cppc_cpufreq driver works on most arm64 platforms. This driver uses the CPPC methods outlined in the ACPI 5.1 specification. Collaborative Processor Performance Control (CPPC) is based on an abstract, continuous scale of CPU performance values, which allows the platform (for example, a remote power processor) to optimize flexibly for power and performance.
To enable CPU frequency scaling on arm64, cppc_cpufreq driver support was added to the DPDK power library. SPDK leverages this library to scale the CPU frequency and to obtain frequency information for scaling decisions. The available attributes include highest_perf, nominal_perf, scaling_max_freq, scaling_min_freq, and so on. The library provides APIs for users to set the CPU frequency and to enable or disable turbo boost. Refer to the DPDK power library documentation for more information on the APIs.
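A minimal sketch of driving the DPDK power library (public rte_power.h APIs; error handling and the SPDK scheduler integration are omitted) looks like this: initialize the library per lcore, then push the core to its highest or lowest frequency depending on how busy the reactor is.

#include <rte_power.h>

/* Initialize frequency scaling for one lcore; with the cppc_cpufreq
 * driver the available performance levels are discovered via sysfs. */
static int power_init_core(unsigned int lcore_id)
{
    return rte_power_init(lcore_id);
}

/* Scale the core frequency according to how busy the reactor is. */
static void power_adjust_core(unsigned int lcore_id, int busy)
{
    if (busy)
        rte_power_freq_max(lcore_id);  /* run at the highest frequency */
    else
        rte_power_freq_min(lcore_id);  /* drop to the lowest frequency */
}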
Figure 8. SPDK dynamic scheduler solution
This blog introduced SPDK, SPDK NVMe over TCP, and how to optimize it. This includes system configuration optimization, data locality optimization, memory copy avoidance optimization, and power optimization. These could be used to solve performance-critical storage problems.