Hello,
I am attempting to get cache way partitioning to work on the Raspberry Pi 5 using the process described here: https://developer.arm.com/documentation/100453/0401/L3-cache/L3-cache-partitioning?lang=en. I am using the following kernel driver and firmware patch to do this: https://github.com/ColeStrickler/Pi5-CacheWayPartition/blob/main/way-partition.patch
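For context, my mental model of the mechanism from the DSU TRM is: CLUSTERPARTCR_EL1 maps scheme IDs to groups of L3 ways, and CLUSTERTHREADSID_EL1 tags each core's allocations with a scheme ID (I assume the firmware patch is what makes these registers reachable below EL3, or programs them there). A rough sketch of that model, with placeholder encodings and masks rather than the exact contents of my patch:

```c
/*
 * Sketch only - NOT the contents of the linked patch. Register names are
 * the ones I understand the DSU TRM (document 100453) to use for L3 way
 * partitioning; the sys_reg() encodings and way masks below are
 * placeholders that must be checked against the TRM.
 */
#include <linux/module.h>
#include <linux/smp.h>
#include <asm/sysreg.h>
#include <asm/barrier.h>

/* FIXME: fill in the real op0/op1/CRn/CRm/op2 encodings from the DSU TRM. */
#define SYS_CLUSTERPARTCR_EL1     sys_reg(3, 0, 15, 4, 3)
#define SYS_CLUSTERTHREADSID_EL1  sys_reg(3, 0, 15, 4, 0)

static void partition_this_cpu(void *unused)
{
	u64 partcr = 0;

	/* Illustrative split only: scheme ID 0 gets the lower half of the
	 * way groups, scheme ID 1 the upper half. Real field positions and
	 * widths are defined in the DSU TRM. This is a cluster-wide
	 * register, so the same value is written from every core. */
	partcr |= 0x0FULL;
	partcr |= 0xF0ULL << 8;
	write_sysreg_s(partcr, SYS_CLUSTERPARTCR_EL1);

	/* Tag this core's allocations: core 0 (victim) uses scheme ID 0,
	 * cores 1-3 (co-runners) use scheme ID 1. */
	write_sysreg_s(smp_processor_id() == 0 ? 0 : 1,
		       SYS_CLUSTERTHREADSID_EL1);
	isb();
}

static int __init part_init(void)
{
	on_each_cpu(partition_this_cpu, NULL, 1);
	return 0;
}
module_init(part_init);
MODULE_LICENSE("GPL");
```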
I initially tested this by allocating 1/2 of the 2 MB LLC to a single task pinned to core 0 and monitoring LLC miss rates with the perf utility. I gave the task a working set size (WSS) greater than the L2 cache size but less than the 1 MB partition size. With this setup I get negligible LLC misses, which is expected. Then, to confirm the task really is limited to a 1 MB allocation, I increased the WSS beyond the 1 MB partition and re-ran the experiment. The number of LLC misses increases dramatically. This tells me that the partition is working when I have a solo task.
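For concreteness, the victim's access pattern is essentially a read loop over a WSS-sized buffer at cache-line stride, run under perf while pinned to core 0. A simplified stand-in for that loop (not my exact benchmark; sizes and iteration counts are illustrative):

```c
/* Simplified stand-in for my victim benchmark: walk a WSS-sized buffer at
 * cache-line stride with reads only. With WSS between the L2 size and the
 * 1 MB partition, repeated passes should stay resident in the victim's L3
 * partition. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64

int main(int argc, char **argv)
{
	/* WSS in bytes, e.g. 768*1024 (< 1 MB partition) or 1536*1024 (> it). */
	size_t wss = (argc > 1) ? (size_t)strtoull(argv[1], NULL, 0)
				: (size_t)(768 * 1024);
	volatile uint8_t *buf = malloc(wss);
	uint64_t sum = 0;

	if (!buf)
		return 1;

	for (long iter = 0; iter < 100000; iter++)
		for (size_t i = 0; i < wss; i += CACHE_LINE)
			sum += buf[i];	/* read-only traversal */

	printf("%llu\n", (unsigned long long)sum); /* keep the loop live */
	free((void *)buf);
	return 0;
}
```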
I run into trouble when I add co-runners that use the other 1/2 of the cache. I run these co-runners with WSSs that do not fit in their 1 MB partition, so their accesses cause a lot of LLC evictions; due to the partitioning, however, I expect them not to evict the victim's cache lines. My setup is a victim task on core 0 allocated 1/2 of the cache with a WSS smaller than the partition size, and 3 co-runners on cores 1, 2, and 3 sharing the other 1/2. Despite the partition, the victim sees a very high LLC miss rate (>90% of its total cache accesses) as measured with perf.
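Each co-runner is pinned to its core and streams over a buffer larger than its share of the partition. A simplified sketch of that shape (pinning via sched_setaffinity here; sizes and core numbers are illustrative, not my exact benchmark):

```c
/* Simplified stand-in for a co-runner: pin to the given core (1-3 in my
 * setup), then stream writes over a buffer larger than the co-runners'
 * 1 MB partition so it continuously misses and evicts within its own
 * partition. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64

int main(int argc, char **argv)
{
	int core = (argc > 1) ? atoi(argv[1]) : 1;
	size_t wss = 4 * 1024 * 1024;		/* > the 1 MB partition */
	volatile uint8_t *buf = malloc(wss);
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	if (!buf || sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("setup");
		return 1;
	}

	for (;;)				/* run until killed */
		for (size_t i = 0; i < wss; i += CACHE_LINE)
			buf[i] = (uint8_t)i;	/* write-only traffic */
}
```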
I also get the same result when I use software-based cache set partitioning with this utility: https://github.com/heechul/palloc/tree/master. This utility works fine on other Linux kernel versions and on the Pi 4.
I disabled all pre-fetching using the CPUECTLR_EL1 register (developer.arm.com/.../CPUECTLR-EL1--CPU-Extended-Control-Register--EL1-) to see if, for some reason, the pre-fetchers were not following the way-partition directives. When I did this and re-ran the experiment I got interesting results. When the co-runners were set to only issue read requests, disabling the pre-fetchers lowered the LLC miss rate by >30%, but did not reduce it to the expected rate of ~0%. When the co-runners were set to only issue writes, disabling the pre-fetchers had no effect on the LLC miss rate. This hints to me that the pre-fetchers are indeed not following the way-partitioning scheme, since disabling them reliably causes a large change in the number of LLC misses.
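For reference, the kind of sequence I mean is a read-modify-write of CPUECTLR_EL1 from kernel context on every core, roughly as below (not my exact code; the sys_reg() encoding is what I believe the Cortex-A76 TRM gives, and PF_DISABLE_MASK is a placeholder for the prefetch-control bits documented there):

```c
/* Sketch only. The encoding (op0=3, op1=0, CRn=15, CRm=1, op2=4) is what I
 * believe the Cortex-A76 TRM specifies for CPUECTLR_EL1; PF_DISABLE_MASK is
 * a placeholder for the prefetch-control bits defined in that TRM. */
#include <linux/module.h>
#include <linux/smp.h>
#include <asm/sysreg.h>
#include <asm/barrier.h>

#define SYS_CPUECTLR_EL1  sys_reg(3, 0, 15, 1, 4)

/* Placeholder: set to the prefetch-control field(s) from the A76 TRM. */
#define PF_DISABLE_MASK   0ULL

static void disable_pf(void *unused)
{
	u64 val = read_sysreg_s(SYS_CPUECTLR_EL1);

	pr_info("cpu%d CPUECTLR_EL1 before: %#llx\n",
		smp_processor_id(), val);
	write_sysreg_s(val | PF_DISABLE_MASK, SYS_CPUECTLR_EL1);
	isb();
}

static int __init pf_init(void)
{
	/* The register is per-core, so apply the change on every CPU. */
	on_each_cpu(disable_pf, NULL, 1);
	return 0;
}
module_init(pf_init);
MODULE_LICENSE("GPL");
```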
I noticed that the DSU has performance counters for its pre-fetchers (https://developer.arm.com/documentation/100453/0401/Performance-Monitoring-Unit/PMU-events?lang=en), so I decided to re-run my experiments with the pre-fetchers enabled and disabled and observe these counters to see if I could gain any further insight.
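For reference, the shape of reading one of these counters via perf_event_open(), assuming the cluster PMU is exposed to Linux as arm_dsu_0 under /sys/bus/event_source/devices (and with the event number left as a placeholder to be taken from the PMU events table linked above):

```c
/* Sketch of reading a single DSU PMU event around the measured section.
 * ASSUMPTIONS: the DSU/cluster PMU is exposed by Linux as arm_dsu_0, and
 * DSU_EVENT is a placeholder for one of the prefetch event numbers in the
 * PMU events table linked above. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DSU_EVENT 0x0	/* placeholder: prefetch event ID from the DSU TRM */

int main(void)
{
	struct perf_event_attr attr;
	unsigned int pmu_type;
	uint64_t count;
	FILE *f;
	int fd;

	/* The DSU PMU is a separate (uncore-style) PMU with its own
	 * dynamically assigned type ID. */
	f = fopen("/sys/bus/event_source/devices/arm_dsu_0/type", "r");
	if (!f || fscanf(f, "%u", &pmu_type) != 1) {
		fprintf(stderr, "arm_dsu_0 PMU not available\n");
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = pmu_type;
	attr.config = DSU_EVENT;

	/* Uncore-style PMUs count per CPU (cpu 0 here), not per task; this
	 * typically needs root or a relaxed perf_event_paranoid. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... run the victim access loop here ... */

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("DSU event %#x: %llu\n", DSU_EVENT,
		       (unsigned long long)count);
	close(fd);
	return 0;
}
```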
With pre-fetchers disabled, measuring reads on the victim:
With pre-fetchers enabled, measuring reads on the victim:
With pre-fetchers disabled, measuring writes on the victim:
With pre-fetchers enabled, measuring writes on the victim:
I am not entirely sure how to interpret the pre-fetch performance counters, but it looks like pre-fetching activity is still going on after I disable the user-controllable pre-fetchers in CPUECTLR_EL1. Am I understanding this wrong?
Does pre-fetching not respect way partitioning, or am I drawing the wrong conclusions from my experiments, or do I need a configuration change?
When I disabled the pre-fetchers, I observed a ~30% decline in the LLC miss rate with read-only co-runners. What could be causing the other ~60% of evictions?
Thank you for your help in advance.