A customer asked us to clarify the execution of the Eigen gemm benchmark: specifically, why the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR. In this case, the relationship between the L1 data cache and L2 data cache PMU values is confusing.
```
# perf stat -e L1D_CACHE_WR,L2D_CACHE_WR ./bench_multi_compilers.sh basicbench.cxxlist bench_gemm.cpp

 Performance counter stats for './bench_multi_compilers.sh basicbench.cxxlist bench_gemm.cpp':

     2,814,835,051      L1D_CACHE_WR
    11,305,875,577      L2D_CACHE_WR
```
Typically, we would expect writes to occur in the L1 data cache, and some write-backs from the L1 data cache then lead to L2 cache writes, so L1 data cache writes should generally be greater than L2 cache writes. For commonly used workloads like Redis and Nginx, monitoring the PMU confirms that L1 data cache writes exceed L2 cache writes. However, there are certain workloads where L1 data cache writes are lower than L2 cache writes. In this blog post, we analyze what L2D_CACHE_WR actually counts, and in which cases the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR.
This blog post uses the Neoverse N2 server. The following table shows the version information of the hardware and software on this server.
For most workloads, the PMU value of L1D_CACHE_WR is larger than that of L2D_CACHE_WR. For example, with Redis, one Redis process runs on core 1, and Memtier clients generate mixed read-write requests.
```
# perf stat -e L1D_CACHE_WR,L2D_CACHE_WR -C 1

 Performance counter stats for 'CPU(s) 1':

     3,598,861,918      L1D_CACHE_WR
       628,169,832      L2D_CACHE_WR
```
Below are the definitions of L1D_CACHE_WR and L2D_CACHE_WR in the N2 PMU guide [1].
L1D_CACHE_WR: This event counts any store operation which looks up in the L1 data cache. This event also counts accesses caused by a Data Cache Zero by Virtual Address (DC ZVA) instruction.

L2D_CACHE_WR: This event counts any memory write operation issued by the CPU which looks up in the unified L2 cache. This event counts whether the access hits or misses in the L2 cache. It also counts any write-back from the L1 data cache that allocates into the L2 cache. This event treats Data Cache Zero by Virtual Address (DC ZVA) operations as store instructions and counts those accesses. Snoops from outside the CPU are not counted.

From the N2 Technical Reference Manual [2], we know that the L1 cache and L2 cache are strictly inclusive: any cache line present in the L1 cache is also present in the L2 cache.
After investigation, we made the following assumptions:
The PMU value of L2D_CACHE_WR is approximately equal to the sum of the PMU values of "L1 data cache refill", "L1 instruction cache refill", and "L1 prefetch, refilled to L1".

Why do we start the analysis from these three events?
Because these three events cause evictions from the L1 data cache and L1 instruction cache, both clean and dirty. Recall the definition of L2D_CACHE_WR: it counts any write-back from the L1 data cache that allocates into the L2 cache. The dedicated L1 data cache write-back event, by contrast, only counts write-backs of dirty data from the L1 data cache to the L2 cache; its value is usually quite small and does not match the value of L2D_CACHE_WR, and there is no PMU counter specifically for clean evictions. We therefore had to approach the analysis from a different angle, aggregating all events that can lead to evictions (both clean and dirty), and that is how we found the pattern.
We verified this discovery in several typical scenarios. Most test cases do follow this pattern.
For the Redis case:
We use Memtier clients as the load generator to produce mixed read and write requests for the Redis process; the PMU values follow the pattern.
For the 'Telemetry: ustress: l1d_cache_workload' [3] case:
This benchmark only reads data and aims to stress the L1 data cache with misses; the PMU values follow the pattern.
For the 'Telemetry: ustress: l1i_cache_workload' [3] case:
This benchmark makes repeated calls to functions that are aligned to page boundaries, aiming to stress the CPU L1 instruction cache with misses; the PMU values follow the pattern.
For the Eigen gemm case:
This benchmark performs many read operations. The PMU value of L1D_CACHE_WR is small, but L1 cache prefetching drives up the PMU value of L2D_CACHE_WR. As a result, the PMU value of L1D_CACHE_WR is lower than that of L2D_CACHE_WR. The PMU values still follow the pattern.
Streaming writes, however, are an exception. We use the 'Telemetry: ustress: memcpy_workload' [3] benchmark, which stresses the load-store pipeline with a memcpy that fits entirely within the L1D cache. The memcpy triggers streaming writes that skip the L1 cache and write directly to the L2 cache, so the PMU values do not follow the pattern.
Below are the descriptions of write streaming mode in N2 Technical Reference Manual [2].
The Neoverse N2 core supports write streaming mode, sometimes referred to as read allocate mode, for both the L1 and the L2 cache.

A cache line is allocated to the L1 and L2 cache on either a read miss or a write miss. However, writing large blocks of data can pollute the cache with unnecessary data. It can also waste power and performance when a linefill is performed only to discard the linefill data because the entire line was subsequently written by the memset(). In some situations, cache line allocation on writes is not required. For example, when executing the C standard library memset() function to clear a large block of memory to a known value.

To prevent unnecessary cache line allocation, the memory system can detect when the core has written a full cache line before the linefill completes. If this situation is detected on a configurable number of consecutive linefills, then it switches into write streaming mode.

When in write streaming mode, load operations behave as normal, and can still cause linefills. Writes still look up in the cache, but if they miss then they write out to the L2 or system rather than starting a linefill.
On the N2 server, L2D_CACHE_WR counts all cache evictions from the L1 cache, both clean and dirty, as well as streaming writes.
For workloads that read large amounts of data, we will see that the PMU value of L1D_CACHE_WR is lower than L2D_CACHE_WR.
[1] Arm® Neoverse N2 PMU Guide
[2] Arm® Neoverse N2 Core Technical Reference Manual
[3] Telemetry Solution