Running into unexpected behavior with Cache Way Partitioning on the A76/DSU

Hello,

I am attempting to get cache way partitioning to work on the Raspberry Pi 5 using the process described here: https://developer.arm.com/documentation/100453/0401/L3-cache/L3-cache-partitioning?lang=en. I am using the following kernel driver and firmware patch to do this: https://github.com/ColeStrickler/Pi5-CacheWayPartition/blob/main/way-partition.patch .

I initially tested this by allocating 1/2 of the 2MB LLC to a single task pinned to core 0 and monitoring LLC miss rates with the perf utility. I then allocated a working set size (WSS) that is greater than the L2 cache size and less than the 1MB partition size. With this setup I get negligible LLC misses, which is expected. Then, to confirm that the task is indeed limited to 1MB of the LLC, I increased the WSS to greater than the 1MB partition and re-ran the experiment. The number of LLC misses increased dramatically. This tells me that the partition is working when I have a solo task.
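The sizing logic of this solo test can be written down as a small sketch. The 2MB LLC and the 1MB partition come from the description above; the 512 KiB L2 size and the two example WSS values are assumptions chosen only for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

LLC_BYTES = 2 * MIB               # total LLC size stated in the post
PARTITION_BYTES = LLC_BYTES // 2  # half of the LLC given to the victim task
L2_BYTES = 512 * KIB              # ASSUMPTION: private L2 size per core


def expected_llc_behaviour(wss_bytes):
    """Classify a working set size against the L2 and partition sizes."""
    if wss_bytes <= L2_BYTES:
        return "fits in L2: the LLC is barely exercised"
    if wss_bytes <= PARTITION_BYTES:
        return "fits in partition: few LLC misses expected"
    return "exceeds partition: many LLC misses expected"


# Hypothetical WSS values for the two solo runs (not the values actually used).
for wss in (768 * KIB, 1536 * KIB):
    print(f"{wss // KIB} KiB -> {expected_llc_behaviour(wss)}")
```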

I run into trouble when I add co-runners that use the other 1/2 of the cache. My setup is a victim task on core 0, allocated 1/2 of the cache with a WSS smaller than its partition, and 3 co-runners on cores 1, 2 and 3 sharing the other 1/2. The co-runners have a WSS that does not fit in their 1MB partition, so their accesses cause a lot of LLC evictions. Because of the partitioning, I expect that they will not evict the victim's cache lines, but the victim still sees a very high LLC miss rate (>90% of its total cache accesses) as measured with perf.
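To make the expectation concrete, here is a toy model of a set-associative cache in which each owner may only allocate into its own ways. It is a sketch under simplifying assumptions (strict LRU, one co-runner stream standing in for the three co-runners, made-up geometry of 64 sets and 16 ways) and does not model the real DSU replacement policy, prefetchers, or inclusivity with L2. All class and function names are hypothetical.

```python
class WayPartitionedCache:
    """Toy set-associative cache; allocation is restricted to the owner's ways."""

    def __init__(self, num_sets, num_ways, way_masks):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.way_masks = way_masks  # owner -> iterable of way indices
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.last_use = [[0] * num_ways for _ in range(num_sets)]
        self.clock = 0

    def access(self, owner, line_addr):
        """Return True on a hit. Hits may occur in any way."""
        self.clock += 1
        s = line_addr % self.num_sets
        for w in range(self.num_ways):
            if self.tags[s][w] == line_addr:
                self.last_use[s][w] = self.clock
                return True
        # Miss: evict the least recently used way among the owner's ways.
        # Empty ways have last_use == 0, so they are filled first.
        way = min(self.way_masks[owner], key=lambda w: self.last_use[s][w])
        self.tags[s][way] = line_addr
        self.last_use[s][way] = self.clock
        return False


def victim_miss_rate(way_masks, sets=64, ways=16, victim_lines=384, passes=4):
    """Victim loops over a small working set while a co-runner streams."""
    cache = WayPartitionedCache(sets, ways, way_masks)
    corunner_base, corunner_span, k = 1 << 20, 4096, 0
    misses = accesses = 0
    for p in range(passes + 1):  # pass 0 is warm-up and is not counted
        for line in range(victim_lines):
            hit = cache.access("victim", line)
            if p > 0:
                accesses += 1
                misses += not hit
            for _ in range(3):  # three co-runner accesses per victim access
                cache.access("corunner", corunner_base + k % corunner_span)
                k += 1
    return misses / accesses


partitioned = {"victim": range(0, 8), "corunner": range(8, 16)}
shared = {"victim": range(16), "corunner": range(16)}
print("partitioned:", victim_miss_rate(partitioned))
print("shared:     ", victim_miss_rate(shared))
```

In this model the victim's working set occupies 6 of its 8 ways per set, so with the partitioned masks it stops missing after the warm-up pass, while with shared masks the streaming co-runner keeps evicting it. The measured >90% miss rate looks like the shared case rather than the partitioned one.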

I also get the same result when I use software-based cache set partitioning with this utility: https://github.com/heechul/palloc/tree/master. This utility works fine on other versions of the Linux kernel and on the Pi 4.

I disabled all pre-fetching using the CPUECTLR (developer.arm.com/.../CPUECTLR-EL1--CPU-Extended-Control-Register--EL1-) register to see whether the pre-fetchers were, for some reason, not following the way partition directives. When I did this and re-ran the experiment, I got interesting results. When the co-runners were set to issue only read requests, disabling the pre-fetchers lowered the LLC miss rate by >30%, but did not reduce it to the expected rate of ~0%. When the co-runners were set to issue only writes, disabling the pre-fetchers had no effect on the LLC miss rate. This suggests to me that the pre-fetchers are indeed not following the way-partitioning scheme, since disabling them reliably causes a large change in the number of LLC misses.

I noticed that the DSU has performance counters for its pre-fetchers (https://developer.arm.com/documentation/100453/0401/Performance-Monitoring-Unit/PMU-events?lang=en), so I decided to run my experiments with the pre-fetchers turned on and off and observe these values to see if I could gain any further insight.

With pre-fetchers disabled, measuring reads on the victim:

SCU_PFTCH_CPU_ACCESS = 240
SCU_PFTCH_CPU_MISS = 2
SCU_PFTCH_CPU_HIT = 340,737,700
SCU_PFTCH_CPU_MATCH = 244,603,964
SCU_PFTCH_CPU_KILL = 0

With pre-fetchers enabled, measuring reads on the victim:

SCU_PFTCH_CPU_ACCESS = 150,482,444
SCU_PFTCH_CPU_MISS = 209,264,662
SCU_PFTCH_CPU_HIT = 201,659,744
SCU_PFTCH_CPU_MATCH = 8,210,196
SCU_PFTCH_CPU_KILL = 0

With pre-fetchers disabled, measuring writes on the victim:

SCU_PFTCH_CPU_ACCESS = 182
SCU_PFTCH_CPU_MISS = 2
SCU_PFTCH_CPU_HIT = 50,785,065
SCU_PFTCH_CPU_MATCH = 20,480,592
SCU_PFTCH_CPU_KILL = 0

With pre-fetchers enabled, measuring writes on the victim:

SCU_PFTCH_CPU_ACCESS = 3,031
SCU_PFTCH_CPU_MISS = 647
SCU_PFTCH_CPU_HIT = 317,860,936
SCU_PFTCH_CPU_MATCH = 552,832
SCU_PFTCH_CPU_KILL = 0
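For easier comparison, the four runs above can be put side by side with a short script. The counter values are copied from the listings above (with the SCU_PFTCH_CPU_ prefix dropped); the helper names are hypothetical, and the script only does arithmetic on the numbers without assuming anything about what each event means.

```python
COUNTERS = ("ACCESS", "MISS", "HIT", "MATCH", "KILL")

# (workload, pre-fetcher state) -> SCU_PFTCH_CPU_* values, in COUNTERS order.
runs = {
    ("reads", "disabled"): (240, 2, 340_737_700, 244_603_964, 0),
    ("reads", "enabled"): (150_482_444, 209_264_662, 201_659_744, 8_210_196, 0),
    ("writes", "disabled"): (182, 2, 50_785_065, 20_480_592, 0),
    ("writes", "enabled"): (3_031, 647, 317_860_936, 552_832, 0),
}


def counter(workload, state, name):
    """Look up one counter value for a given run."""
    return runs[(workload, state)][COUNTERS.index(name)]


def enabled_over_disabled(workload, name):
    """Ratio of the enabled value to the disabled value, or None if undefined."""
    off = counter(workload, "disabled", name)
    return counter(workload, "enabled", name) / off if off else None


for workload in ("reads", "writes"):
    for name in COUNTERS:
        ratio = enabled_over_disabled(workload, name)
        shown = "n/a" if ratio is None else f"{ratio:,.2f}x"
        print(f"{workload:6} {name:6} "
              f"off={counter(workload, 'disabled', name):>12,} "
              f"on={counter(workload, 'enabled', name):>12,} "
              f"on/off={shown}")
```

Laid out this way, ACCESS and MISS drop to almost nothing when the pre-fetchers are disabled, while HIT and MATCH remain in the tens or hundreds of millions in both states.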

I am not entirely sure how to interpret the pre-fetch performance counters, but it looks like pre-fetching activity is still going on after I disable the user-controllable pre-fetchers in CPUECTLR. Am I understanding this wrong?

Does pre-fetching not respect way partitioning, am I drawing the wrong conclusions from my experiments, or do I need a configuration change?

When I disabled the pre-fetchers, I observed a ~30% decline in LLC miss rate when using read-only co-runners. What could be causing the other ~60% of evictions?

Thank you for your help in advance.