Running into unexpected behavior using Cache Way-Partitioning on A76/DSU

Hello,

I am attempting to get cache way-partitioning to work on the Raspberry Pi 5 using the process defined here: https://developer.arm.com/documentation/100453/0401/L3-cache/L3-cache-partitioning?lang=en. I am using the following kernel driver and firmware patch to set it up: https://github.com/ColeStrickler/Pi5-CacheWayPartition/blob/main/way-partition.patch
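As a sanity check that the partition programming actually took effect, the DSU partition registers can be dumped from a small kernel module on every core to confirm that the scheme IDs and way allocation look right. Below is a minimal sketch; the s3_0_c15_c4_x encodings are my reading of the DSU TRM (document 100453) and should be double-checked, and EL1 access to these IMPLEMENTATION DEFINED registers is assumed to have been opened up by the firmware patch above:

    /*
     * partcheck.c -- dump the DSU partition registers on every core.
     * Minimal sketch, not a drop-in module: verify the register encodings
     * against the DSU TRM and make sure firmware permits EL1 access first.
     */
    #include <linux/module.h>
    #include <linux/smp.h>

    /* IMPLEMENTATION DEFINED DSU registers -- encodings assumed from the TRM */
    #define CLUSTERTHREADSID_EL1 "s3_0_c15_c4_0"  /* per-core scheme ID        */
    #define CLUSTERPARTCR_EL1    "s3_0_c15_c4_3"  /* per-cluster way allocation */

    static void dump_partition_regs(void *unused)
    {
        u64 sid, partcr;

        asm volatile("mrs %0, " CLUSTERTHREADSID_EL1 : "=r"(sid));
        asm volatile("mrs %0, " CLUSTERPARTCR_EL1    : "=r"(partcr));

        pr_info("cpu%d: CLUSTERTHREADSID_EL1=%#llx CLUSTERPARTCR_EL1=%#llx\n",
                smp_processor_id(), sid, partcr);
    }

    static int __init partcheck_init(void)
    {
        /* CLUSTERTHREADSID_EL1 is banked per core, so read it on each one. */
        on_each_cpu(dump_partition_regs, NULL, 1);
        return 0;
    }

    static void __exit partcheck_exit(void) { }

    module_init(partcheck_init);
    module_exit(partcheck_exit);
    MODULE_LICENSE("GPL");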

I initially tested this by allocating half of the 2MB LLC to a single task pinned to core 0 and monitoring LLC miss rates with the perf utility. I chose a working set size (WSS) larger than the L2 cache but smaller than the 1MB partition. With this setup I get negligible LLC misses, which is expected. To confirm the task really is limited to a 1MB allocation, I then increase the WSS beyond the 1MB partition and re-run the experiment, and the number of LLC misses increases dramatically. This tells me that the partition is working when I have a solo task.
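For context, the access pattern is conceptually just a repeated walk over a buffer of the chosen WSS while perf counts LLC misses. A minimal sketch of such a walk (not my exact benchmark) is below; the WSS comes from the command line, the 64-byte stride touches one byte per cache line, and the generic cache-misses/cache-references perf events stand in for whichever LLC events your perf setup exposes:

    /* wss_walk.c -- repeatedly walk a buffer of a given working-set size.
     * Minimal sketch, not the exact benchmark used above; build with -O2 and
     * run it under perf, e.g. perf stat -e cache-misses,cache-references.
     */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define LINE 64  /* cache-line size on Cortex-A76 */

    int main(int argc, char **argv)
    {
        /* WSS in bytes from the command line; default 768 KB (> 512 KB L2, < 1 MB). */
        size_t wss = (argc > 1) ? strtoull(argv[1], NULL, 0) : (768 * 1024);
        volatile uint8_t *buf = malloc(wss);
        uint64_t sink = 0;

        if (!buf)
            return 1;
        memset((void *)buf, 1, wss);  /* touch every page up front */

        /* volatile keeps the byte loads from being optimized away */
        for (int iter = 0; iter < 10000; iter++)
            for (size_t i = 0; i < wss; i += LINE)
                sink += buf[i];

        return (int)(sink & 1);
    }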

I run into trouble when I add co-runners that use the other half of the cache. The co-runners have a WSS that does not fit in their 1MB partition, so their accesses cause a lot of LLC evictions; because of the partitioning, though, they should not be able to evict the victim's cache lines. My setup is a victim task on core 0 allocated half of the cache, with a WSS smaller than the partition, and three co-runners on cores 1, 2, and 3 sharing the other half. I expect that the partition prevents them from evicting the cache lines of the victim task in the other partition, but I am seeing a lot of evictions (>90% of total cache accesses), as measured with perf.
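For completeness, the victim and each co-runner pin themselves to their core before starting the walk (equivalently, each can be launched with taskset -c <cpu> under perf stat). A minimal sketch of the pinning, assuming Linux's sched_setaffinity(2):

    /* pin.c -- pin the calling task to one core before starting its walk.
     * Sketch only; equivalent to launching with `taskset -c <cpu>`.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void pin_to_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* pid 0 = the calling process */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }
    }

    int main(int argc, char **argv)
    {
        pin_to_cpu(argc > 1 ? atoi(argv[1]) : 0);
        /* ...run the WSS walk from the earlier sketch here... */
        return 0;
    }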

I also get the same result when I do software cache-set partitioning with this utility: https://github.com/heechul/palloc/tree/master. This utility works fine on other versions of the Linux kernel and on the Pi 4.

I disabled all pre-fetching using the CPUECTLR register (developer.arm.com/.../CPUECTLR-EL1--CPU-Extended-Control-Register--EL1-) to see whether the pre-fetchers were, for some reason, not following the way-partition directives. Re-running the experiment with pre-fetching disabled gave interesting results. When the co-runners only issue read requests, disabling the pre-fetchers lowered the LLC miss rate by more than 30%, but did not reduce it to the expected rate of ~0%. When the co-runners only issue writes, disabling the pre-fetchers had no effect on the LLC miss rate. This hints to me that the pre-fetchers are indeed not following the way-partitioning scheme, since disabling them reliably causes a large change in the number of LLC misses.
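Conceptually the toggle is a per-core read-modify-write of CPUECTLR_EL1. A sketch of the per-core routine (meant to be run with on_each_cpu() from a module like the earlier one) is below; the s3_0_c15_c1_4 encoding is my reading of the Cortex-A76 TRM, the disable mask is deliberately left as a placeholder rather than guessing bit positions, and EL1 access to this IMPLEMENTATION DEFINED register must already be permitted by firmware:

    /* Per-core routine to set pre-fetcher disable bits in CPUECTLR_EL1.
     * Sketch only: fill PREFETCH_DISABLE_MASK in from the Cortex-A76 TRM and
     * run the routine on every core, e.g. on_each_cpu(cpuectlr_set, NULL, 1).
     */
    #include <linux/types.h>
    #include <linux/smp.h>
    #include <linux/printk.h>

    #define CPUECTLR_EL1          "s3_0_c15_c1_4"  /* Cortex-A76 encoding (verify) */
    #define PREFETCH_DISABLE_MASK 0ULL             /* placeholder: bits from the TRM */

    static void cpuectlr_set(void *unused)
    {
        u64 v;

        asm volatile("mrs %0, " CPUECTLR_EL1 : "=r"(v));
        v |= PREFETCH_DISABLE_MASK;
        asm volatile("msr " CPUECTLR_EL1 ", %0" :: "r"(v));
        asm volatile("isb");

        pr_info("cpu%d: CPUECTLR_EL1=%#llx\n", smp_processor_id(), v);
    }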

I noticed that the DSU has performance counters for its pre-fetchers (https://developer.arm.com/documentation/100453/0401/Performance-Monitoring-Unit/PMU-events?lang=en), so I re-ran my experiments with the pre-fetchers enabled and disabled and recorded these counters to see if I could gain any further insight.

With pre-fetchers disabled, measuring reads on the victim:

  • SCU_PFTCH_CPU_ACCESS = 240
  • SCU_PFTCH_CPU_MISS = 2
  • SCU_PFTCH_CPU_HIT = 340,737,700
  • SCU_PFTCH_CPU_MATCH = 244,603,964
  • SCU_PFTCH_CPU_KILL = 0

With pre-fetchers enabled, measuring reads on the victim:

  • SCU_PFTCH_CPU_ACCESS = 150,482,444
  • SCU_PFTCH_CPU_MISS = 209,264,662
  • SCU_PFTCH_CPU_HIT = 201,659,744
  • SCU_PFTCH_CPU_MATCH = 8,210,196
  • SCU_PFTCH_CPU_KILL = 0

With pre-fetchers disabled, measuring writes on the victim:

  • SCU_PFTCH_CPU_ACCESS = 182
  • SCU_PFTCH_CPU_MISS = 2
  • SCU_PFTCH_CPU_HIT = 50,785,065
  • SCU_PFTCH_CPU_MATCH = 20,480,592
  • SCU_PFTCH_CPU_KILL = 0

With pre-fetchers enabled, measuring writes on the victim:

  • SCU_PFTCH_CPU_ACCESS = 3,031
  • SCU_PFTCH_CPU_MISS = 647
  • SCU_PFTCH_CPU_HIT = 317,860,936
  • SCU_PFTCH_CPU_MATCH = 552,832
  • SCU_PFTCH_CPU_KILL = 0

I am not entirely sure how to interpret the pre-fetch performance counters, but it looks like pre-fetching activity is still going on after I disable the user-controllable pre-fetchers in CPUECTLR. Am I understanding this wrong?

Does pre-fetching not respect way-partitioning? Or am I drawing the wrong conclusions from my experiments, or missing a configuration change?

When I disabled the pre-fetchers, I observed a ~30% drop in the LLC miss rate with read-only co-runners. What could be causing the remaining ~60% of evictions?

Thank you in advance for your help.

Reply

Hi, you might have already made more progress, but here are a few thoughts.

Firstly, any CPU-issued reads (including prefetches) should respect the DSU L3 partition policy. If you are seeing more evictions than expected, it could be due to TLB pressure: with a large working set and 4K pages you may be experiencing frequent TLB misses and page walks, and depending on your access pattern your effective working set could be significantly larger than your program-visible working set. You could count CPU TLB misses to check, and using huge pages can reduce TLB misses.

Also, those SCU_PFTCH numbers look odd. As implemented, every ACCESS is either a HIT or a MISS, so those counts should add up. Could it be that you are reading the CPU PMU rather than the DSU PMU (which is separate)? Cortex-A76 ignores bit 10 in the event code, so you are actually counting events 0x100 to 0x104. To count DSU PMU events you need the arm_dsu PMU driver.
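If it helps, here is a minimal sketch of counting one arm_dsu event with perf_event_open(2). It assumes the driver exposes the cluster PMU as arm_dsu_0 in sysfs and takes the raw event code on the command line; the 0x500 in the usage comment is only an inference from the bit-10 remark above, so take the real codes from the sysfs events directory or the DSU TRM:

    /*
     * dsu_count.c -- count one DSU (arm_dsu) PMU event system-wide for N seconds.
     * Sketch only; usually needs root or relaxed perf_event_paranoid.
     *
     *   ./dsu_count 0x500 5    # hypothetical SCU prefetch-access code, 5 seconds
     */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(int argc, char **argv)
    {
        unsigned long long config = (argc > 1) ? strtoull(argv[1], NULL, 0) : 0;
        int secs = (argc > 2) ? atoi(argv[2]) : 5;
        struct perf_event_attr attr;
        unsigned int type;
        long long count;
        FILE *f;
        int fd;

        /* The uncore PMU's dynamic type id lives in sysfs. */
        f = fopen("/sys/bus/event_source/devices/arm_dsu_0/type", "r");
        if (!f || fscanf(f, "%u", &type) != 1) {
            perror("arm_dsu_0 type");
            return 1;
        }
        fclose(f);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;

        /* Uncore events are opened system-wide (pid -1) on one CPU of the cluster. */
        fd = perf_event_open(&attr, -1, 0, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        sleep(secs);
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("event %#llx: %lld\n", config, count);

        return 0;
    }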
