DPDK scalability analysis on Arm Neoverse V2

Doug Foster
September 23, 2025
20 minute read time.

Introduction

Ideally, DPDK performance scales in proportion to the number of processing cores used. In other words, doubling the number of cores should double the throughput. However, real-world results often deviate from these theoretical expectations because of practical constraints.

DPDK performance is closely related to the bandwidth-delay product (BDP) of the application and system. BDP, a key networking concept, is the product of a network's bandwidth and its round-trip time (RTT). It shows how much data can be in transit at once, and how much capacity is needed to use the available bandwidth fully.

In DPDK L3FWD, RTT is the time needed to consume packets from an Rx queue, process them, and place them into a Tx queue. At a constant data receive rate of 148 million packets per second (MPPS) and 64 bytes per packet, the processing delay inherent in the BDP RTT becomes a critical determinant of the buffer size needed to sustain peak throughput.

When processing delays increase, more data needs to be stored temporarily. This increases the required capacity. Each packet requires one Rx queue descriptor. More queue descriptors are required to prevent packet loss and maintain maximum throughput as the BDP expands. In an ideal environment without limited cache sizes, finite memory availability, or hardware limitations, the BDP-to-descriptor relationship can be modeled with a mathematical equation. Such modeling provides a theoretical foundation to determine the optimal number of Rx queue descriptors necessary for peak performance of an ideal machine.

This study compares real-world DPDK performance with theoretical predictions based on the ideal model. Through empirical testing and analysis, the study examines how processing delays, queue configurations, cache and memory limitations, and operational bottlenecks impact BDP and real-world performance.

The relationship between BDP and the required number of Rx queue descriptors can be described mathematically as follows:

N = R × T(n) / C

Where:

  • N is the total number of Rx queue descriptors required.
  • R represents packets received per nanosecond (0.148 for 148 MPPS).
  • T(n) denotes the time to process a burst of n packets.
  • C is the number of processing cores used.

The delay in the BDP is calculated by T(n)/C. In the DPDK L3FWD application, each core loops through its assigned queues, handling up to the configured burst size of packets per queue. These packets are consumed, processed, and placed into a transmit (Tx) queue.

The total execution time for this sequence, T(n), determines how long incoming packets, arriving at 148 MPPS, cannot be consumed. To achieve the best throughput, the Rx queue descriptors must be able to hold all packets that arrive during this interval.
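To make the model concrete, here is a minimal sketch of the same calculation in C. The example values come from the ideal-depth table later in this post, and the nearest-power-of-two rounding follows the note under that table; this is an illustration of the model, not code from the study itself.

/* Minimal sketch of the BDP descriptor model N = R * T(n) / C.
 * T(n) must be measured on the target system (see Methodology below);
 * the example values reproduce the 2-core row of the ideal-depth table. */
#include <stdio.h>

/* Round to the nearest power of two, as required for a valid DPDK
 * queue configuration. */
static unsigned int nearest_pow2(double x)
{
    unsigned int lo = 1;
    while ((double)(lo << 1) <= x)
        lo <<= 1;
    unsigned int hi = lo << 1;
    return (x - (double)lo < (double)hi - x) ? lo : hi;
}

int main(void)
{
    double R   = 0.148;    /* packets per nanosecond (148 MPPS)            */
    double T_n = 222236.0; /* measured processing delay for a burst, in ns */
    double C   = 2.0;      /* number of processing cores                   */

    double N = R * T_n / C;
    printf("Ideal total Rx descriptors: %.3f (rounded: %u)\n",
           N, nearest_pow2(N));   /* 16445.464 -> 16384 */
    return 0;
}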

Methodology

The table below shows all evaluations for T(n). Initial tests showed that T(n) scales approximately linearly with packet burst size.

For example, in DPDK L3FWD, processing times were 726 ns for T(32), 1455 ns for T(64), and 2619 ns for T(128). These times include execution of both the Rx and Tx paths (rte_eth_rx_burst() and l3fwd_lpm_send_packets()) when using one core with continuous incoming traffic. Prefetching reduces latency and avoids the delays seen when frequent main memory or system-level cache accesses are required.
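For reference, the per-burst time can be captured with DPDK's TSC helpers roughly as in the sketch below. This is a simplified illustration rather than the exact instrumentation used for the numbers above; process_and_send() is a placeholder standing in for the L3FWD lookup and Tx path (l3fwd_lpm_send_packets()).

/* Simplified sketch of timing one Rx-process-Tx iteration with the TSC. */
#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MAX_PKT_BURST 32

/* Placeholder for the L3FWD processing and Tx path
 * (l3fwd_lpm_send_packets() in the sample application); here the
 * packets are simply dropped so the sketch stays self-contained. */
static void process_and_send(struct rte_mbuf **pkts, uint16_t n)
{
    rte_pktmbuf_free_bulk(pkts, n);
}

static uint64_t time_one_iteration(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts_burst[MAX_PKT_BURST];

    uint64_t start = rte_rdtsc();

    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                      pkts_burst, MAX_PKT_BURST);
    if (nb_rx > 0)
        process_and_send(pkts_burst, nb_rx);

    uint64_t cycles = rte_rdtsc() - start;

    /* Convert TSC cycles to nanoseconds. */
    return cycles * 1000000000ULL / rte_get_tsc_hz();
}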

However, continuous incoming traffic conditions caused variation in the packet count per Rx burst. To address this, further testing was conducted using fixed, controlled bursts of incoming packets with predetermined packet counts, enabling precise assessment of processing delays. Under these controlled conditions, processing a single packet with a burst size of 32 averaged 512 ns, while processing a full burst of 32 packets averaged 4053.33 ns. Similar tests for 64 packets with a burst size of 64 and 128 packets with a burst size of 128 resulted in processing times of 6058.67 ns and 4896 ns, respectively.

The results show that continuous traffic scenarios benefit significantly from cache optimizations like prefetching, reducing per-packet processing time. Conversely, discrete packet bursts incur overhead and reduced efficiency due to less effective cache use, resulting in higher per-packet delays.

External measurements using Ixia IxExplorer show significantly higher delays for continuous incoming traffic: 222236 ns for T(32), 204224 ns for T(64), and 189424 ns for T(128). These delays include additional latencies from components beyond the application, such as the NIC. These measurements help estimate the number of queue descriptors needed to maximize throughput under real-world conditions. This underscores the importance of context when determining accurate processing delays.

Measurement Method | Packet Burst | Packets Received | Avg Time (ns)
TSC Cycles in DPDK L3FWD: continuous traffic | 32 | N/A | 726
TSC Cycles in DPDK L3FWD: continuous traffic | 64 | N/A | 1455
TSC Cycles in DPDK L3FWD: continuous traffic | 128 | N/A | 2619
TSC Cycles in DPDK L3FWD: fixed burst traffic | 32 | 1 | 512
TSC Cycles in DPDK L3FWD: fixed burst traffic | 32 | 4 | 768
TSC Cycles in DPDK L3FWD: fixed burst traffic | 32 | 8 | 1258.67
TSC Cycles in DPDK L3FWD: fixed burst traffic | 32 | 16 | 1738.67
TSC Cycles in DPDK L3FWD: fixed burst traffic | 32 | 32 | 4053.33
TSC Cycles in DPDK L3FWD: fixed burst traffic | 64 | 64 | 6058.67
TSC Cycles in DPDK L3FWD: fixed burst traffic | 128 | 128 | 4896
Ixia IxExplorer: forwarding delay | 32 | N/A | 222236
Ixia IxExplorer: forwarding delay | 64 | N/A | 204224
Ixia IxExplorer: forwarding delay | 128 | N/A | 189424

The measured values, especially those obtained with IxExplorer, enable precise calculation of the queue descriptor count needed to maximize throughput under known bandwidth conditions. To find the ideal configuration for the other components of the BDP model, an experiment was carried out using the test setup below.

Device Under Test (DUT): Nvidia Grace
Traffic Generator: Ixia with 2 x 100GbE ports
Topology: 2 Ports, 1 NIC

Two bidirectional traffic flows were used:

  • Flow 1: Traffic ingress from port 0 on the NIC and egress from port 1.
  • Flow 2: Traffic ingress from port 1 on the NIC and egress from port 0.

We used the DPDK L3FWD application to test different packet burst sizes and queue counts on 1, 2, 4, and 8 cores. The goal was to find the configuration with the highest throughput, measured in millions of packets per second (MPPS). To maximize performance, Write Allocate to SCF cache was enabled during the testing.

According to the DPDK Tuning Guide, write allocation to the system-level cache (SLC) allows packets to be stored directly in the SLC instead of memory. On the Nvidia Grace platform, the Scalable Coherency Fabric (SCF) cache is the system-level cache, so this feature is referred to as Write Allocate to SCF cache.

The process to find the best configuration involved systematically tuning a few parameters. Starting with a packet burst of 32, different numbers of queues per core (for example, 2, 4, and 8 queues per core) were tested with queue depths varying from 256 to 2048, for core counts ranging from 1 to 8. We then repeated the tests with larger burst sizes.

For each core count, the number of queues per core and queue depths were adjusted and tested until no significant improvement in throughput was observed. This testing helped identify the best configuration for each core count.
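In DPDK terms, the swept queue depth is the nb_rx_desc argument supplied when each Rx queue is set up. The sketch below shows how one configuration point could be applied; it assumes rte_eth_dev_configure() has already been called with the matching number of Rx queues, and it is not the configuration code used by L3FWD itself, which exposes these knobs through its command-line options.

/* Sketch: set up nb_queues Rx queues on one port, each queue_depth
 * descriptors deep. Error handling and Tx queue setup are omitted. */
#include <rte_ethdev.h>
#include <rte_mempool.h>

static int setup_rx_queues(uint16_t port_id, uint16_t nb_queues,
                           uint16_t queue_depth,
                           struct rte_mempool *pktmbuf_pool)
{
    for (uint16_t q = 0; q < nb_queues; q++) {
        int ret = rte_eth_rx_queue_setup(port_id, q, queue_depth,
                                         rte_eth_dev_socket_id(port_id),
                                         NULL /* default Rx config */,
                                         pktmbuf_pool);
        if (ret < 0)
            return ret;
    }
    return 0;
}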

Performance observations

For each top-performing configuration identified, the ideal total queue depth was calculated using the previously discussed Bandwidth-Delay Product (BDP) model. This model assumes ideal conditions and that T(n) scales linearly with burst size and core count.

The maximum throughput was measured for two scenarios: with Write Allocate to SCF cache enabled and with Write Allocate to SCF cache disabled. The total queue depth calculated for best performance and the measured throughputs can be seen below.

Except for the 4-core configuration, Write Allocate to SCF cache provides a performance uplift. Most notable is the additional 14.87 MPPS obtained by enabling Write Allocate to SCF cache for the 2-core configuration. In the 4-core configuration, however, Write Allocate to SCF cache underperforms slightly: 73.18 MPPS with it disabled compared to 71.38 MPPS with it enabled. This shows that Write Allocate to SCF cache usually helps performance, but not in every case.

R (packets received per ns) | C (number of cores) | n (packet burst size) | T(n) (processing delay, ns) | Calculated Ideal Total Depth | Rounded Ideal Total Depth
0.148 | 1 | 128 | 189424 | 28034.752 | 32768
0.148 | 2 | 32 | 222236 | 16445.464 | 16384
0.148 | 4 | 32 | 222236 | 8222.732 | 8192
0.148 | 8 | 64 | 204224 | 3778.144 | 4096

Note: Ideal total depth was calculated using the model equation, then rounded to the nearest power of 2 for a valid DPDK queue configuration.


Figure 1: Maximum throughput when using calculated ideal queue depth

Real systems are not ideal, and the actual total queue depth needed differs from the calculated value. The graph below compares the ideal and actual total queue depths needed for best performance. The maximum throughput achieved with the actual required queue depth is overlaid with the maximum throughput observed with the calculated ideal queue depth. The required depths differ significantly. For example, at the 1-core configuration, the ideal queue depth (32,768 descriptors) greatly exceeds the actual depth required (4,096 descriptors).

This shows the BDP model ignores real-world constraints, such as cache limitations. As a result, when comparing throughput performance for ideal and actual queue depths, the actual required queue depths resulted in higher throughput across all core configurations. For instance, at 1 core, throughput rises from 32.88 MPPS (ideal) to 58.87 MPPS (actual), and at 8 cores it climbs from 91.45 MPPS (ideal) to 115.74 MPPS (actual).

Even with higher throughput, scaling remains non-linear under both actual and ideal queue depths. The gain closest to linear occurs when moving from one to two cores with the ideal queue depth. For example, the ideal queue depth with Write Allocate to SCF cache enabled yields an 88.78% increase in throughput when scaling from one to two cores, from 32.88 MPPS at one core to 62.08 MPPS at two cores. However, adding more cores fails to deliver similar gains.


Figure 2: Maximum throughput when using calculated ideal queue depth and actual best queue depth

To better understand how large ideal queue depths affect throughput, we collected memory PMU counters for two configurations: the ideal depth and the actual depth, both with Write Allocate to SCF cache enabled. The PMU counters observed include:

  • cmem_rd_access: Counts the number of SCF Read accesses from CPU to DRAM
  • cmem_wr_total_bytes: Counts the total number of bytes transferred to local Grace DRAM by write-backs, write-unique and non-coherent writes from local or remote CPU
  • cmem_wb_access: Counts the number of SCF Write-Back accesses (clean/dirty) transferred from CPU to DRAM excluding Write-Unique/Non-Coherent writes
  • cmem_wr_access: Counts the total number of SCF write-unique and non-coherent write requests from local or remote CPU to local Grace DRAM

The per-packet DRAM write traffic (cmem_wr_total_bytes) is considerably higher for the ideal queue depths at every core count. This increase mostly comes from significantly more SCF write-back accesses. Because ideal queue depths are much larger, SCF cache evictions are more likely when storing new packets. This results in more DRAM read accesses when processing packets, which is confirmed by the cmem_rd_access per-packet count being significantly higher for ideal queue depths. Overall, this extra DRAM traffic increases processing delay and lowers throughput.

CMEM PMU Counter | Ideal 1 core | Actual 1 core | Ideal 2 cores | Actual 2 cores | Ideal 4 cores | Actual 4 cores | Ideal 8 cores | Actual 8 cores
cmem_rd_access | 0.751 | 0.517 | 1.158 | 0.635 | 1.263 | 0.505 | 1.431 | 0.862
cmem_wr_total_bytes | 8.288 | 2.332 | 11.657 | 2.606 | 12.186 | 4.488 | 8.800 | 3.327
cmem_wb_access | 0.103 | 0.024 | 0.168 | 0.031 | 0.179 | 0.061 | 0.128 | 0.046
cmem_wr_access | 0.027 | 0.013 | 0.014 | 0.010 | 0.012 | 0.009 | 0.009 | 0.006

(Both the ideal-depth and actual-depth configurations were run with Write Allocate to SCF cache enabled.)

Note: PMU count values have been normalized per packet for accurate comparison.

These results show that while the BDP model is a useful baseline, it does not fully account for the system-level constraints and latencies encountered in practical applications. So, tuning queue configurations beyond the ideal prediction is key for achieving optimal performance on real hardware like the Nvidia Grace platform.

DPDK scalability bottlenecks

SCF cache capacity

The first factor we examined was SCF cache capacity. Given the reliance on Write Allocate to SCF cache for optimal processing efficiency, understanding the constraints on SCF cache usage is essential. Several PMU counters were used to check for cache issues on the Grace platform:

  • scf_cache: Counts cache access events.

  • scf_cache_allocate: Counts cache line allocations.

  • scf_cache_refill: Counts cache refill events, indicating cache misses requiring data fetches from RAM.

  • scf_cache_wb: Counts the number of capacity evictions from the last-level cache.

  • cmem_rd_access: Counts the number of SCF Read accesses from CPU to DRAM.

  • cmem_wb_access: Counts the number of SCF Write-Back accesses (clean/dirty) transferred from CPU to DRAM excluding Write-Unique/Non-Coherent writes.

Ideally, if SCF cache evictions, cache misses, or memory accesses do not increase as the core count rises, it suggests efficient cache usage. Initial analysis involved capturing these counters for configurations yielding maximum throughput at each core count. However, the results did not suggest a significant cache-related problem.

To gain deeper insights and contextualize the PMU counter observations, a top-down analysis was conducted. This analysis showed that the L3FWD application spends about 69.5% of total CPU cycles on the Tx path, which involves processing packets and placing them into Tx queues. About 28% of cycles are spent in the Rx path, which handles incoming packets in bursts. This cycle split reflects where the application's processing time goes.

More cycles spent on the Tx path shows that more time is spent on application processing, resulting in a higher bandwidth-delay product (BDP). Reducing or removing processing steps within the Tx path would decrease the time spent processing, lowering the BDP.

To accurately decide whether an SCF cache capacity limitation existed, it was necessary to minimize the BDP by modifying the DPDK implementation. The flame graphs below show where cycles are spent in the application for different implementations. Each change reduced the BDP by reducing application processing overhead, increasing the proportion of cycles dedicated to consuming packets.

First, removing the Tx path and using rte_pktmbuf_free() to free mbufs increased the percentage of total cycles spent executing the Rx path from 28% to 45.89%. Further optimization, returning mbufs to the mempool in bulk with rte_mempool_put_bulk(), reduced the BDP even more, dedicating 82.46% of cycles to packet consumption.
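A hedged sketch of the modified receive loop described above, assuming a single queue and that all mbufs in a burst come from the same mempool (this is not the exact code used for these measurements):

/* Rx-only loop used to shrink the BDP: consume a burst, then return the
 * mbufs without executing the L3FWD Tx path. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define BURST_SIZE 32

static void rx_only_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *pkts[BURST_SIZE];

    for (;;) {
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                          pkts, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* Variant 1 (Figure 4): free each mbuf individually.          */
        /*     for (uint16_t i = 0; i < nb_rx; i++)                    */
        /*             rte_pktmbuf_free(pkts[i]);                      */

        /* Variant 2 (Figure 5): return the whole burst to its mempool
         * in one call, further reducing per-packet processing time.
         * Assumes single-segment mbufs from one pool with no extra
         * references held. */
        rte_mempool_put_bulk(pkts[0]->pool, (void **)pkts, nb_rx);
    }
}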

Each BDP reduction led to a clear rise in throughput. Comparing the maximum 1-core throughput achieved with 2 ports on 1 NIC, throughput increased from 58 MPPS with the default L3FWD implementation to 132 MPPS with the fully reduced BDP.


Figure 3: Default L3FWD app implementation, 1 Core Max Throughput of 58 MPPS


Figure 4: Modified L3FWD app, Rx + rte_pktmbuf_free(), 1 Core Max Throughput of 82 MPPS


Figure 5: Modified L3FWD app, Rx + rte_mempool_put_bulk(), 1 Core Max Throughput of 132 MPPS

With reduced BDP and most cycles spent on packet consumption, a more accurate assessment of SCF cache behavior became possible. In these conditions, we found that untouched packets were being evicted from the SCF cache to RAM. If an untouched packet is evicted to RAM, accessing that packet results in an SCF cache miss followed by a read access to RAM to get the packet data. This adds latency and hurts performance.

The graph below contains throughput and PMU data that were collected for several RX queue configurations under 1-core operation. To stress the throughput and the system, we used two ports from a different NIC.

The results showed that as queue depth increased, SCF cache evictions (scf_cache_wb) also increased substantially. As SCF cache evictions rose, SCF cache misses (scf_cache_refill) and read accesses to RAM (cmem_rd_access) increased in kind. As a result, throughput began to decline despite the higher queue depth once the average empty poll percentage reached 0%. The average empty poll percentage is the proportion of packet polls from Rx queues that yield zero consumed packets.

A 0% poll percentage shows that every poll successfully consumes packets, signifying full CPU use. So, increases in SCF cache evictions are more severe when the overall queue depth increases after the empty poll percentage reaches 0%. At that point, more packets build up in SCF, triggering more evictions as depth increases. Switching from 8 to 16 queues with 1024 descriptors caused a sharp increase (153.6%) in SCF cache evictions and a drop in throughput, suggesting an SCF cache limit had been crossed.

We can estimate SCF cache usage with this formula:

(Number of queues) * (Number of descriptors per queue) * (Packet size in bytes) * (Number of cache lines per packet).

Based on where that limit appears to be crossed, full cache usage occurs somewhere between 8 queues with a depth of 512 and 8 queues with a depth of 1024. With an estimated 3 cache lines per packet, this places the effective cache capacity between roughly 768 KB and 1536 KB. Even at 10 cache lines per packet, the upper bound would only reach approximately 5 MB, a relatively small portion of the total available SCF cache of 114 MB.
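A quick worked example of that estimate, using the 3-cache-lines-per-packet assumption from above:

/* Worked example of the SCF footprint estimate:
 * queues * descriptors per queue * packet size * cache lines per packet. */
#include <stdio.h>

int main(void)
{
    unsigned long queues        = 8;
    unsigned long depth         = 1024; /* descriptors per queue            */
    unsigned long pkt_size      = 64;   /* bytes per packet                 */
    unsigned long lines_per_pkt = 3;    /* estimated cache lines per packet */

    unsigned long bytes = queues * depth * pkt_size * lines_per_pkt;
    printf("Estimated SCF footprint: %lu KB\n", bytes / 1024); /* 1536 KB */
    return 0;
}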

This suggests that Nvidia Grace's write-allocate to SCF cache uses hints or employs an extremely limited number of pre-configured cache sets.


Figure 6: 1 Core performance statistics for varying Rx configurations

Note: PMU count values have been normalized per packet for accurate comparison.

This suggests that SCF cache is a bottleneck. Untouched data is evicted into RAM, increasing latency during packet processing due to SCF cache misses and subsequent RAM access. The throughput drop at high queue depth shows a practical upper bound, set by SCF cache limits rather than theoretical calculations alone.

In summary, initial PMU counters did not show cache inefficiencies, but reducing the BDP through application changes exposed the true limitation. SCF cache can evict unprocessed packets, leading to increased RAM access and lower performance. DPDK tuning must account for SCF cache behavior.

Further testing was done to find if more bottlenecks existed.

MMIO write latency

Earlier analysis showed that Tx path delays, SCF cache evictions, and RAM accesses reduce performance. We ran more tests to determine whether other such latencies existed. Latencies within the Rx path were evaluated, and it was discovered that another contributing factor to a larger BDP is the latency associated with an MMIO write.

An MMIO write occurs each time the PMD writes to doorbell registers to communicate buffer availability to the NIC. In one setup, writing to the CQ and RQ doorbells could take up to 12% of the total time spent in the Rx path. Note that the RQ doorbell is only updated when the number of buffers to be replenished exceeds the buffer replenishment threshold.

This investigation looks at the worst-case scenario, in which both the CQ and RQ doorbells are updated. Updating the doorbell registers in the Rx path consistently takes around 37 to 38 nanoseconds, regardless of the configuration. As a result, the doorbell updates account for a larger share of the Rx path in configurations with higher core counts, where the total time spent executing the Rx path is smaller.

The data in the graph below was collected using configurations with a fixed packet burst size of 32 and varying queue and core configurations. It illustrates how this latency can have a larger impact on the Rx path. For example, in one 4-core configuration, the 37.25 nanoseconds accounts for 50% of the time spent executing the Rx path for each queue where both doorbells are updated.

This latency increases BDP and raises the required Rx queue descriptor count for optimal throughput. With a data receive rate of 148 MPPS, where 0.148 of a packet is received in one nanosecond, around 5.6 packets are received during the 38 nanoseconds it takes to update both doorbell registers. Given that the delay occurs during the poll of each Rx queue, the overall delay would increase with the number of queues and would increase the actual required descriptor count by 6 per occurrence.


Figure 7: Rx path latency and MMIO write latency

  • Rx Burst Time: Average amount of time, in nanoseconds, spent executing Rx burst. This includes:
    • Replenishing buffers
    • Consuming received packets
    • Updating CQ doorbell and RQ doorbell
  • Doorbell Update Time: Average amount of time, in nanoseconds, spent updating the CQ and RQ doorbell registers

Upon further observation, the latency introduced by MMIO writes causes more degradation in performance than just a delay.

RX queue descriptor availability: Difference between NIC and PMD

MMIO write latency introduces a mismatch between the number of buffers that are available from the NIC and PMD perspectives.

The graph below shows the configurations giving the best throughput at each core count, along with other statistics. For the configurations that result in maximum throughput at 4 or 8 cores, 0 packets are available for about 90% of the Rx queue polls, which would indicate that the Rx queues are empty. However, for the same configurations, the NIC reports that no buffers are available. Typically, buffers would be unavailable only when the Rx queues are full. This discrepancy results in the NIC not storing packets that could otherwise have been stored in available buffers.


Figure 8: L3FWD performance statistics

  • Avg Empty Poll %: Average percentage of Rx queue polls that resulted in 0 packets being consumed from the Rx queue because no packets were available.
  • rx_out_of_buffer: NIC statistic that increments when no buffers are available when the NIC wants to store packets.

From the PMD's view, buffer availability never reaches zero. We collected a buffer availability histogram across all queues in an 8-core setup, which showed the largest NIC-PMD mismatch. The available buffer count, shown as the descriptor count in the histogram, was calculated by subtracting the producer index from the consumer index of the Rx queue.

In the PMD's Rx path, the consumer index increases when buffers are replenished, signaling increased availability. Conversely, when packets are consumed, the producer index increments, showing reduced availability. However, with 0% empty polls, where packets are received in every poll, the consumer and producer indices can increase at similar rates.

The difference between them therefore does not reflect worst-case availability. So, buffer availability calculations were performed using a configuration with approximately 90% empty polls. In this scenario, the producer index is not incremented during most polls, leading to a greater difference between the producer and consumer indices. The histogram shows at least 903 buffers available to the NIC, despite rx_out_of_buffer errors being reported.
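A minimal sketch of that availability metric is shown below; the index names are placeholders, as the actual field names depend on the PMD.

/* Availability metric used for the histogram: descriptors the NIC can
 * still use, approximated from the PMD's Rx queue indices. */
#include <stdint.h>

static inline uint16_t rx_descriptors_available(uint16_t consumer_idx,
                                                uint16_t producer_idx)
{
    /* The consumer index advances when buffers are replenished (more
     * descriptors available to the NIC); the producer index advances as
     * packets are consumed. Their difference approximates how many
     * posted buffers the NIC can still write packets into. */
    return (uint16_t)(consumer_idx - producer_idx);
}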


Figure 9: Rx queue descriptor availability

  • Descriptor Count: Number of descriptors available to the NIC during buffer replenishment, calculated as the RQ consumer index minus the producer index. Each descriptor points to one buffer.
  • Number of instances: The number of times a specific descriptor count occurred

The mismatch between the NIC and PMD perspectives is attributed to high MMIO sync latency. It prevents the NIC from fully utilizing available buffers to store incoming packets. As a result, this inefficiency contributes significantly to non-linear throughput scaling as core counts increase.

Conclusion

This analysis highlights key factors that affect the scalability and performance of DPDK applications. The primary observation is that real-world DPDK performance differs from theory. Real-world BDP involves round-trip time delays caused by SCF cache behavior, MMIO writes, and NIC-PMD buffer mismatches. These practical limitations have been identified as key bottlenecks causing performance degradation and nonlinear scaling with increased core count. Importantly, the results indicate that these bottlenecks arise primarily from NIC, PCIe, and cache interactions rather than from CPU microarchitecture limitations.

Recommendations for Improving DPDK Performance:

  1. Optimize Queue Depths:
    • A queue depth that can absorb the packets received during processing and other latencies is required. Empirically determine the optimal queue configurations to accommodate these latencies and processing delays.
    • Balance the memory required for the total queue depth with the system-level cache limit to prevent excessive cache evictions.
  2. Minimize Processing Delays:
    • Streamline the application processing path to reduce BDP and maximize cycles spent consuming packets.
    • Example: Utilize bulk operations to lower processing overhead.
  3. Mitigate MMIO Latencies:
    • Explore methods to reduce the frequency or impact of doorbell updates, such as optimizing the buffer replenishment threshold.
  4. Improve NIC-PMD Synchronization:
    • Enhance synchronization mechanisms or adjust polling strategies to reduce mismatches in perceived buffer availability, thus minimizing unnecessary latency.

System Configuration

Host Information
  • CPU: 2 x Grace A02; max frequency 3.1 GHz; 72 cores per socket; 2 sockets; online CPUs 0-143
  • Caches: per-CPU L1i/L1d/L2 64 KB/64 KB/1 MB; L3 228 MB (2 instances)
  • Memory: LPDDR5; 4237 MHz, 8532 MT/s; 2 x 240 GB modules
  • PCIe: version 5; 13 root ports; 256-byte max payload size per root port
  • BIOS: AMI Version 2.0, BIOS Revision 5.34
  • OS: Ubuntu 24.04 LTS

Host Configuration
  • Huge pages (hugepagesz, hugepages): 1G, 64
  • IOMMU passthrough setting: 0
  • CPU isolation settings: isolcpus=16-56,80-120 nohz_full=16-56,80-120 rcu_nocbs=16-56,80-120
  • SLC stashing: enabled
  • SPE: enabled
  • Software analyzed / GCC version: DPDK 24.11.0-rc0 / GCC 14.0.1

Device Information
  • Device type: NIC
  • Name: Mellanox ConnectX-6 Dx 2x100GbE
  • Kernel driver version: MLNX_OFED_LINUX-24.04-0.7.0.0
  • PCIe (version, lanes): 4.0, 16