A customer asked us to clarify the execution of the Eigen gemm benchmark: specifically, why the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR. In this case, the relationship between the L1 data cache and L2 data cache PMU values is confusing.
```
# perf stat -e L1D_CACHE_WR,L2D_CACHE_WR ./bench_multi_compilers.sh basicbench.cxxlist bench_gemm.cpp

 Performance counter stats for './bench_multi_compilers.sh basicbench.cxxlist bench_gemm.cpp':

     2,814,835,051      L1D_CACHE_WR
    11,305,875,577      L2D_CACHE_WR
```
Typically, we would expect writes to occur in the L1 data cache, and some write-backs from the L1 data cache then lead to L2 cache writes, so L1 data cache writes should generally be greater than L2 cache writes. For commonly used workloads like Redis and Nginx, monitoring the PMU confirms that L1 data cache writes exceed L2 cache writes. However, there are certain workloads where L1 data cache writes are lower than L2 cache writes. In this blog post, we analyze what L2D_CACHE_WR actually counts, and in which cases the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR.
This blog post uses the Neoverse N2 server. The following table shows the version information of the hardware and software on this server.
For most workloads, the PMU value of L1D_CACHE_WR is larger than that of L2D_CACHE_WR. For example, with Redis, one Redis process runs on core 1, and Memtier clients generate mixed read-write requests.
```
# perf stat -e L1D_CACHE_WR,L2D_CACHE_WR -C 1

 Performance counter stats for 'CPU(s) 1':

     3,598,861,918      L1D_CACHE_WR
       628,169,832      L2D_CACHE_WR
```
Below are the definitions of L1D_CACHE_WR and L2D_CACHE_WR in the N2 PMU guide [1].
L1D_CACHE_WR: This event counts any store operation which looks up in the L1 data cache. This event also counts accesses caused by a Data Cache Zero by Virtual Address (DC ZVA) instruction.

L2D_CACHE_WR: This event counts any memory write operation issued by the CPU which looks up in the unified L2 cache. This event counts whether the access hits or misses in the L2 cache. It also counts any write-back from the L1 data cache that allocates into the L2 cache. This event treats Data Cache Zero by Virtual Address (DC ZVA) operations as store instructions and counts those accesses. Snoops from outside the CPU are not counted.

From the N2 Technical Reference Manual [2], we know that the L1 cache and L2 cache are strictly inclusive: any cache line present in the L1 cache is also present in the L2 cache.
After investigation, we made the following assumptions:
The PMU value of L2D_CACHE_WR is approximately equal to the sum of the PMU values of "L1 data cache refill", "L1 instruction cache refill", and "L1 prefetch, refilled to L1".

Why do we start the analysis from these three events?
Because these three events cause evictions from the L1 data cache and L1 instruction cache, both clean and dirty. Recall the definition of L2D_CACHE_WR: it counts any write-back from the L1 data cache that allocates into the L2 cache. The dedicated L1 data cache write-back event, by contrast, only counts write-backs of dirty data from the L1 data cache to the L2 cache; its value is usually quite small and does not match the value of L2D_CACHE_WR, and there is no PMU counter specifically for clean evictions. We therefore had to approach the analysis from a different angle, aggregating all events that can lead to evictions (both clean and dirty), and that is how we found the pattern.
We verified this discovery in several typical scenarios. Most test cases do follow this pattern.
For the Redis case:
We use Memtier clients as the load generator to produce mixed read and write requests for the Redis process; the PMU values follow the pattern.
For the 'Telemetry: ustress: l1d_cache_workload' [3] case:
This benchmark only reads data and aims to stress the L1 data cache with misses; the PMU values follow the pattern.
For the 'Telemetry: ustress: l1i_cache_workload' [3] case:
This benchmark makes repeated calls to functions that are aligned to page boundaries, aiming to stress the CPU L1 instruction cache with misses; the PMU values follow the pattern.
For the Eigen gemm case:
This benchmark performs many read operations. The PMU value of L1D_CACHE_WR is small, but L1 cache prefetching drives up the PMU value of L2D_CACHE_WR. As a result, the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR. The PMU values still follow the pattern.
Streaming writes, however, are an exception. We use the 'Telemetry: ustress: memcpy_workload' [3] benchmark, which stresses the load-store pipeline with a memcpy that fits entirely within the L1D cache. The memcpy triggers streaming writes that skip the L1 cache and write directly to the L2 cache, so the PMU values do not follow the pattern.
Below are the descriptions of write streaming mode in N2 Technical Reference Manual [2].
The Neoverse N2 core supports write streaming mode, sometimes referred to as read allocate mode, for both the L1 and the L2 cache.

A cache line is allocated to the L1 and L2 cache on either a read miss or a write miss. However, writing large blocks of data can pollute the cache with unnecessary data. It can also waste power and performance when a linefill is performed only to discard the linefill data because the entire line was subsequently written by the memset(). In some situations, cache line allocation on writes is not required. For example, when executing the C standard library memset() function to clear a large block of memory to a known value.

To prevent unnecessary cache line allocation, the memory system can detect when the core has written a full cache line before the linefill completes. If this situation is detected on a configurable number of consecutive linefills, then it switches into write streaming mode.

When in write streaming mode, load operations behave as normal, and can still cause linefills. Writes still look up in the cache, but if they miss then they write out to the L2 or system rather than starting a linefill.
On the N2 server, L2D_CACHE_WR counts all cache evictions from the L1 cache, both clean and dirty, as well as streaming writes.
For workloads that read large amounts of data, we will see that the PMU value of L1D_CACHE_WR is lower than L2D_CACHE_WR.
[1] Arm® Neoverse N2 PMU Guide
[2] Arm® Neoverse N2 Core Technical Reference Manual
[3] Telemetry Solution