What Causes Program Performance to Decline Despite an Increase in Cache Hit Rate?

I am using the Raspberry Pi 5 to explore the impact of cache prefetching on performance. The Raspberry Pi 5 has a quad-core Cortex-A76 CPU. Each core has its own L1 and L2 caches (64KB L1 cache and 512KB L2 cache per core), and all four cores share a single 2MB L3 cache.

I am working with a simple test case: a program that reads a matrix column by column, which I am trying to speed up by prefetching the target elements with the ARMv8 PRFM instruction. Varying the prefetch distance gave some performance improvement. However, the data collected with perf shows something I cannot explain: while the L2 data cache miss count decreased, the number of L3 cache accesses increased. My understanding is that a load that hits in an upper-level cache should not access the L3 cache, so this increase in L3 accesses is puzzling. A related issue is that although the L2 cache hit rate improved, overall program performance declined.

The table below shows the cache metrics for different prefetch distances:

Table 1

| prefetch distance | l1d_cache_rd | l1d_cache_wr | l2d_cache_rd | l2d_cache_wr | l3d_cache_rd | cycles | instructions | time (s) |
|---|---|---|---|---|---|---|---|---|
| no prefetch | 5801742333 | 1421583157 | 72903665 | 72510120 | 109592615 | 16683311907 | 14390792510 | 6.437592 |
| 1 | 5978719208 | 1421205756 | 72937072 | 72518228 | 109916026 | 13064846101 | 14425690322 | 4.826689 |
| 2 | 5978712724 | 1421327571 | 72938445 | 72519305 | 115868873 | 13653861469 | 14424978109 | 5.255271 |
| 3 | 5978063623 | 1421179641 | 72940910 | 72525954 | 156868751 | 18976102863 | 14424265905 | 7.529704 |

Table 2

| prefetch distance | l1d_cache_refill_rd | l1d_cache_refill_wr | l2d_cache_refill_rd | l2d_cache_refill_wr | l3d_cache_refill |
|---|---|---|---|---|---|
| no prefetch | 71274958 | 1140772 | 38356619 | 0 | 102573169 |
| 1 | 71269281 | 1140391 | 34252836 | 0 | 102416664 |
| 2 | 71270464 | 1140287 | 30173944 | 0 | 103771682 |
| 3 | 71273830 | 1140920 | 56143836 | 0 | 128838982 |

The above data represents cache metrics collected through perf stat. I find the following parts reasonable: the read and write counts for the L1D cache (l1d_cache_rd, l1d_cache_wr) in Table 1 are stable, indicating that the number of program accesses and read/write operations on the matrix is fixed. Additionally, the total number of instructions in the program remains stable (with an increase corresponding to the addition of prefetching code).

However, what confuses me is that, according to Table 2, the number of L2D cache read misses (l2d_cache_refill_rd) is lower at a prefetch distance of 2 than at a prefetch distance of 1, so I would expect distance 2 to perform better than distance 1. Yet, as Table 1 shows, the matrix read takes 4.8 seconds at distance 1 and 5.3 seconds at distance 2. The cycle counts show the same performance gap. What could be causing this counterintuitive result?

Additionally, I am curious why the L3D cache read count (l3d_cache_rd) is higher at a prefetch distance of 2 than at a distance of 1, even though the number of L2D cache misses is lower.

The matrix column read program I am using is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

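#define ROWS 500          /* number of separately allocated rows */
#define COLS 38000        /* ints per row */
#define ITERATIONS 30     /* full passes over the matrix */
#define DISTANCE 0        /* prefetch distance, in rows ahead of the current row */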

int main() {
    printf("Prefetch Distance : %d\n", DISTANCE);
    int **matrix = (int **)malloc(ROWS * sizeof(int *));
    for (int i = 0; i < ROWS; i++) {
        matrix[i] = (int *)malloc(COLS * sizeof(int));
        for (int j = 0; j < COLS; j++) {
            matrix[i][j] = rand() % 100;
        }
    }

    int sum = 0;
    clock_t start = clock();
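    for (int iter = 0; iter < ITERATIONS; iter++) {
        /* Walk the matrix in blocks of 16 columns (16 ints = 64 bytes with a 4-byte int). */
        for (int j = 0; j + 16 < COLS; j+=16) {
            for (int i = 0; i < ROWS; i++) {
                /* Prefetch the same column block of the row DISTANCE rows ahead. */
                if(i + DISTANCE < ROWS)
                    __asm__ volatile("prfm pldl2strm, [%0]" :: "r"(&matrix[i+DISTANCE][j]));
                /* Sum the 16 ints of the current row in this column block. */
                for(int t = j; t < j + 16; t++)
                    sum += matrix[i][t];
            }
        }
    }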
    clock_t end = clock();

    printf("Sum: %d\n", sum);
    printf("Time taken: %lf seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    for (int i = 0; i < ROWS; i++) {
        free(matrix[i]);
    }
    free(matrix);

    return 0;
}