What Causes Program Performance to Decline Despite an Increase in Cache Hit Rate?

y say 9 months ago

I am using the Raspberry Pi 5 to explore the impact of cache prefetching on performance. The Raspberry Pi 5 features a Cortex-A76 CPU with a total of 4 cores, each equipped with L1 and L2 caches (64KB L1 cache per core, 512KB L2 cache per core), and sharing a single L3 cache (2MB).

I am working on a simple case: a program that reads a matrix column by column, which I am trying to speed up by prefetching the target elements with the ARMv8 PRFM instruction. By adjusting the prefetch distance, I observed some performance improvements. However, the data collected with perf revealed an issue: while the L2 data cache miss count decreased, the number of L3 cache accesses increased. My understanding is that a load that hits in the L1 or L2 cache should not access the L3 cache, so this increase in L3 accesses is puzzling. A related issue is that although the L2 cache hit rate improved, overall program performance declined.
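A minimal sketch of the prefetch-distance idea described above, assuming a row-pointer matrix and a 16-column strip. The helper name `strip_sum` and its parameters are hypothetical, and it uses the GCC/Clang builtin `__builtin_prefetch` in place of a hand-written PRFM instruction, so the exact prefetch instruction emitted depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

/* Sum `width` columns starting at `col` over all rows, prefetching the
 * same columns `distance` rows ahead. Hypothetical helper for
 * illustration; it is not the program that produced the measurements. */
long long strip_sum(int **matrix, size_t rows, size_t col, size_t width,
                    size_t distance)
{
    assert(matrix != NULL);
    long long sum = 0;
    for (size_t i = 0; i < rows; i++) {
        if (distance > 0 && i + distance < rows) {
            /* rw = 0 (read), locality = 0 (low temporal locality).
             * Which instruction this becomes is compiler-dependent. */
            __builtin_prefetch(&matrix[i + distance][col], 0, 0);
        }
        for (size_t t = col; t < col + width; t++)
            sum += matrix[i][t];
    }
    return sum;
}
```

A prefetch is only a hint, so the returned sum is the same for every `distance`; only the timing and the cache counters should change.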

The table below shows the cache metrics for different prefetch distances:

Table 1

prefetch distances	l1d_cache_rd	l1d_cache_wr	l2d_cache_rd	l2d_cache_wr	l3d_cache_rd	cycles	instructions	time(s)
no prefetch	5801742333	1421583157	72903665	72510120	109592615	16683311907	14390792510	6.437592
1	5978719208	1421205756	72937072	72518228	109916026	13064846101	14425690322	4.826689
2	5978712724	1421327571	72938445	72519305	115868873	13653861469	14424978109	5.255271
3	5978063623	1421179641	72940910	72525954	156868751	18976102863	14424265905	7.529704

Table 2

prefetch distances	l1d_cache_refill_rd	l1d_cache_refill_wr	l2d_cache_refill_rd	l3d_cache_refill
no prefetch	71274958	1140772	38356619	102573169
1	71269281	1140391	34252836	102416664
2	71270464	1140287	30173944	103771682
3	71273830	1140920	56143836	128838982

The data above were collected with perf stat. Some of it looks reasonable to me: the L1D cache read and write counts (l1d_cache_rd, l1d_cache_wr) in Table 1 are stable, which indicates that the number of read and write operations on the matrix is fixed. The total instruction count is also stable (the slight increase corresponds to the added prefetch code).

However, what confuses me is that, according to Table 2, the number of L2D cache misses (l2d_cache_refill_rd) is lower at a prefetch distance of 2 than at a prefetch distance of 1. One would therefore expect distance 2 to perform better than distance 1. Yet, as Table 1 shows, the matrix read takes 4.8 seconds at distance 1 and 5.3 seconds at distance 2, and the cycle counts show the same gap. What could be causing this counterintuitive result?

Additionally, I am curious why the L3D cache read count is higher at a prefetch distance of 2 than at distance 1, even though the number of L2D cache misses is lower.
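To make the comparison concrete, here is a small sketch that derives instructions per cycle and the L2 read refill ratio from the counters in Tables 1 and 2. The struct and function names are hypothetical; the numbers are copied from the tables. Treating l2d_cache_refill_rd / l2d_cache_rd as a miss rate is an assumption, since I am not certain how prefetches are reflected in these events.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One row of perf stat counters, copied from Tables 1 and 2.
 * Names are illustrative only. */
struct perf_row {
    const char *label;
    uint64_t l2d_cache_rd;
    uint64_t l2d_cache_refill_rd;
    uint64_t l3d_cache_rd;
    uint64_t cycles;
    uint64_t instructions;
};

const struct perf_row perf_rows[] = {
    {"no prefetch", 72903665ULL, 38356619ULL, 109592615ULL, 16683311907ULL, 14390792510ULL},
    {"1",           72937072ULL, 34252836ULL, 109916026ULL, 13064846101ULL, 14425690322ULL},
    {"2",           72938445ULL, 30173944ULL, 115868873ULL, 13653861469ULL, 14424978109ULL},
    {"3",           72940910ULL, 56143836ULL, 156868751ULL, 18976102863ULL, 14424265905ULL},
};
#define N_PERF_ROWS (sizeof perf_rows / sizeof perf_rows[0])

/* Instructions retired per cycle. */
double ipc(const struct perf_row *r)
{
    assert(r->cycles != 0);
    return (double)r->instructions / (double)r->cycles;
}

/* L2 read refills divided by L2 read accesses. */
double l2_read_refill_ratio(const struct perf_row *r)
{
    assert(r->l2d_cache_rd != 0);
    return (double)r->l2d_cache_refill_rd / (double)r->l2d_cache_rd;
}

/* Index of the row with the fewest cycles. */
size_t fewest_cycles(void)
{
    size_t best = 0;
    for (size_t i = 1; i < N_PERF_ROWS; i++)
        if (perf_rows[i].cycles < perf_rows[best].cycles)
            best = i;
    return best;
}

/* Index of the row with the fewest L2 read refills. */
size_t fewest_l2_refills(void)
{
    size_t best = 0;
    for (size_t i = 1; i < N_PERF_ROWS; i++)
        if (perf_rows[i].l2d_cache_refill_rd < perf_rows[best].l2d_cache_refill_rd)
            best = i;
    return best;
}

/* Print one line of derived metrics per prefetch distance. */
void print_summary(void)
{
    for (size_t i = 0; i < N_PERF_ROWS; i++)
        printf("%-12s IPC=%.3f  L2 refill ratio=%.3f  L3 reads=%llu\n",
               perf_rows[i].label, ipc(&perf_rows[i]),
               l2_read_refill_ratio(&perf_rows[i]),
               (unsigned long long)perf_rows[i].l3d_cache_rd);
}
```

Calling `print_summary()` from a `main` lists the derived values side by side. The two `fewest_*` helpers pick different rows (distance 1 for cycles, distance 2 for L2 read refills), which is exactly the mismatch I am asking about.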

The matrix column read program I am using is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ROWS 500        
#define COLS 38000        
#define ITERATIONS 30
#define DISTANCE 0

int main() {
    printf("Prefetch Distance : %d\n", DISTANCE);
    int **matrix = (int **)malloc(ROWS * sizeof(int *));
    for (int i = 0; i < ROWS; i++) {
        matrix[i] = (int *)malloc(COLS * sizeof(int));
        for (int j = 0; j < COLS; j++) {
            matrix[i][j] = rand() % 100;
        }
    }

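    /* 64-bit accumulator: the total exceeds INT_MAX and would overflow an int. */
    long long sum = 0;
    clock_t start = clock();
    for (int iter = 0; iter < ITERATIONS; iter++) {
        /* Walk the matrix in 16-column strips (16 ints = 64 bytes).
           Note: "j + 16 < COLS" skips the final strip when COLS is a multiple of 16. */
        for (int j = 0; j + 16 < COLS; j+=16) {
            for (int i = 0; i < ROWS; i++) {
                /* Prefetch the same strip DISTANCE rows ahead (L2, streaming hint). */
                if(i + DISTANCE < ROWS)
                    __asm__ volatile("prfm pldl2strm, [%0]" :: "r"(&matrix[i+DISTANCE][j]));
                for(int t = j; t < j + 16; t++)
                    sum += matrix[i][t];
            }
        }
    }
    clock_t end = clock();

    printf("Sum: %lld\n", sum);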
    printf("Time taken: %lf seconds\n", (double)(end - start) / CLOCKS_PER_SEC);

    for (int i = 0; i < ROWS; i++) {
        free(matrix[i]);
    }
    free(matrix);

    return 0;
}
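For scale, the loop bounds above can be turned into a few footprint numbers. This is a sketch under stated assumptions: the helper names are hypothetical, `int` is taken to be 4 bytes, and the per-strip figure counts payload bytes only (malloc does not guarantee that a 16-int chunk sits inside a single cache line).

```c
#include <assert.h>
#include <stddef.h>

/* Number of column strips the posted loop visits; mirrors
 * "for (j = 0; j + width < cols; j += width)". */
size_t count_strips(size_t cols, size_t width)
{
    assert(width > 0);
    size_t n = 0;
    for (size_t j = 0; j + width < cols; j += width)
        n++;
    return n;
}

/* Total payload bytes of the matrix. */
size_t matrix_bytes(size_t rows, size_t cols, size_t elem_size)
{
    return rows * cols * elem_size;
}

/* Payload bytes touched while walking one strip down all rows. */
size_t strip_bytes(size_t rows, size_t width, size_t elem_size)
{
    return rows * width * elem_size;
}

/* Prefetch instructions executed over the whole run, given the
 * "i + distance < rows" guard in the posted loop. */
unsigned long long prefetches_issued(size_t iterations, size_t rows,
                                     size_t cols, size_t width,
                                     size_t distance)
{
    assert(distance <= rows);
    return (unsigned long long)iterations
         * count_strips(cols, width)
         * (rows - distance);
}
```

With ROWS = 500 and COLS = 38000, `matrix_bytes(500, 38000, 4)` is 76,000,000 bytes, far larger than the 2MB L3, while `strip_bytes(500, 16, 4)` is 32,000 bytes. `count_strips(38000, 16)` gives 2374 rather than 2375, because the `j + 16 < COLS` condition leaves out the last 16 columns.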