I am using the Raspberry Pi 5 to explore the impact of cache prefetching on performance. The Raspberry Pi 5 features a Cortex-A76 CPU with a total of 4 cores, each equipped with L1 and L2 caches (64KB L1 cache per core, 512KB L2 cache per core), and sharing a single L3 cache (2MB).
I am working on a simple case: a program that reads a matrix column by column, which I am trying to speed up by using the ARMv8 PRFM instruction to prefetch the target elements. By varying the prefetch distance, I observed some performance improvements. However, data collected with perf revealed an issue: while the L2 data cache miss count decreased, the number of L3 cache accesses increased. My understanding is that a load that hits in the L1 or L2 cache should not access the L3 cache, so this increase in L3 accesses is puzzling. A related issue is that although the L2 cache hit rate improved, overall program performance declined.
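For scale, here is a rough footprint estimate for this access pattern. It is only a sketch: it assumes 64-byte cache lines, assumes each 16-int block falls inside a single line (the rows come from malloc, so that alignment is not guaranteed), and uses two hypothetical helper names, sweep_footprint_bytes and matrix_bytes, that are not part of my program.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes (64 on Cortex-A76). */
#define LINE_BYTES 64

/* Bytes touched by one sweep down a 16-column block: one cache line per row.
   Assumption: each 16-int block sits inside a single cache line. */
size_t sweep_footprint_bytes(size_t rows)
{
    assert(rows > 0);
    return rows * LINE_BYTES;
}

/* Total payload of a rows x cols matrix of int, in bytes. */
size_t matrix_bytes(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    return rows * cols * sizeof(int);
}
```

With ROWS = 500 and COLS = 38000 (the values in the program at the end of this post), one sweep touches about 32,000 bytes, which is below the 64KB L1, while the whole matrix is about 76,000,000 bytes, far larger than the 2MB L3. If a block straddles two lines, a sweep can touch up to twice as many lines.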
The table below shows the cache metrics for different prefetch distances:
table1
table2
The above data represents cache metrics collected through perf stat. I find the following parts reasonable: the read and write counts for the L1D cache (l1d_cache_rd, l1d_cache_wr) in Table 1 are stable, indicating that the number of program accesses and read/write operations on the matrix is fixed. Additionally, the total number of instructions in the program remains stable (with an increase corresponding to the addition of prefetching code).
However, the point of confusion is this: according to Table 2, the number of L2D cache misses (l2d_cache_refill_rd) at a prefetch distance of 2 is lower than at a prefetch distance of 1, so one would expect distance 2 to perform better than distance 1. Yet, as shown in Table 1, the matrix read time is 4.8 seconds at distance 1 and 5.3 seconds at distance 2. The cycles consumed by the program show the same performance gap. What could be causing this counterintuitive result?
Additionally, I am curious why the read count for the L3 cache increases at a prefetch distance of 2 compared with a distance of 1, even though the number of L2D cache misses decreases.
The matrix column read program I am using is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ROWS 500
#define COLS 38000
#define ITERATIONS 30
#define DISTANCE 0

int main() {
    printf("Prefetch Distance : %d\n", DISTANCE);

    /* Allocate each row separately and fill it with pseudo-random values. */
    int **matrix = (int **)malloc(ROWS * sizeof(int *));
    for (int i = 0; i < ROWS; i++) {
        matrix[i] = (int *)malloc(COLS * sizeof(int));
        for (int j = 0; j < COLS; j++) {
            matrix[i][j] = rand() % 100;
        }
    }

    int sum = 0;
    clock_t start = clock();
    for (int iter = 0; iter < ITERATIONS; iter++) {
        /* Walk the matrix in blocks of 16 columns, visiting every row per block. */
        for (int j = 0; j + 16 < COLS; j += 16) {
            for (int i = 0; i < ROWS; i++) {
                /* Prefetch the same column block DISTANCE rows ahead. */
                if (i + DISTANCE < ROWS)
                    __asm__ volatile("prfm pldl2strm, [%0]" :: "r"(&matrix[i + DISTANCE][j]));
                /* Sum the 16 ints of the current row's block. */
                for (int t = j; t < j + 16; t++)
                    sum += matrix[i][t];
            }
        }
    }
    clock_t end = clock();

    printf("Sum: %d\n", sum);
    printf("Time taken: %lf seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    for (int i = 0; i < ROWS; i++) {
        free(matrix[i]);
    }
    free(matrix);
    return 0;
}
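To compare several distances from one binary, the kernel could be restructured to take the distance at run time. The following is only a sketch and not the code I measured. The function name column_block_sum is hypothetical. It uses the GCC/Clang builtin __builtin_prefetch in place of the inline prfm pldl2strm, so the compiler chooses the actual prefetch instruction and it may not be the same hint. It accumulates into a long long, and it uses j + BLOCK <= cols so the last block is included (the original condition j + 16 < COLS skips it when COLS is a multiple of 16).

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16 /* ints per column block, as in the program above */

/* Hypothetical helper: sum a rows x cols matrix in 16-column blocks,
   prefetching the same block `distance` rows ahead of the current row. */
long long column_block_sum(int **matrix, size_t rows, size_t cols, size_t distance)
{
    assert(matrix != NULL);
    long long sum = 0;

    for (size_t j = 0; j + BLOCK <= cols; j += BLOCK) {
        for (size_t i = 0; i < rows; i++) {
            if (i + distance < rows) {
                /* rw = 0 (read), locality = 0 (little temporal reuse).
                   Assumption: this is a stand-in for prfm pldl2strm; the
                   instruction actually emitted is up to the compiler. */
                __builtin_prefetch(&matrix[i + distance][j], 0, 0);
            }
            for (size_t t = j; t < j + BLOCK; t++)
                sum += matrix[i][t];
        }
    }
    return sum;
}
```

Because a prefetch is only a hint, the returned sum should be identical for every distance; only the timing and the cache counters should change. Calling column_block_sum(matrix, ROWS, COLS, d) inside the timed loop for each d of interest would give one measurement per distance.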