Hello. I apologise in advance if this is the wrong section of the forum. I was profiling my Pixel 5 phone (Snapdragon 765 SoC, whose CPU consists of 1 x Cortex-A76 at 2400 MHz, 1 x Cortex-A76 at 2200 MHz, and 6 x Cortex-A55 at 1800 MHz). While profiling different applications I noticed something that I find hard to explain. I am doing the profiling with the Android Simpleperf tool, which collects hardware counters from the CPU's PMU. I collected data accesses at 4 levels of the memory hierarchy: L1_cache, L2_cache, L3_cache and mem_access.
When I profiled the Android system_server process (while opening files/native apps), I got the following results:
When I profiled the Chrome web browser (while briefly browsing some web pages), I got the following:
When I profiled the YouTube process (while watching a video), I got the following:
1. The number of mem accesses is larger than the number of cache accesses (the Armv8 architecture documentation states that the mem_access hardware event counts accesses to L1 and L2), which suggests that some accesses did not even attempt to access the cache and went directly to memory. I know that GPUs do this when it is clear that the data will not be reused, so it makes no sense to put that data into the cache. Is this what is happening here? I thought that the reported mem_accesses were for the CPU only.
2. Also, if we look at L2 and L3 accesses, there are more L3 accesses than L2 accesses, which suggests that some accesses bypassed at least the L2 cache and went directly to L3. Does that mean there is cache bypassing, with some accesses skipping levels of the cache hierarchy and going directly to L3? Is there a way to count those accesses?
3. My ultimate goal is to calculate cache misses for some workloads (cache_miss hardware events are not implemented on the Pixel 5). I plan to do that by recording the number of accesses to each level of the memory hierarchy and subtracting the appropriate numbers. For example, I would treat L2 accesses as L1 misses, so that L1 accesses minus L2 accesses gives L1 hits. This would be a reasonable method under normal circumstances, but the results above make me suspect that there might be cache bypassing at different levels. If so, is there a way to tell how many accesses bypassed each cache level and where they went, so that I can still calculate cache misses?
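A rough sketch of this kind of estimate in Python, assuming a strictly hierarchical model in which every access seen at one level is a miss from the level before it and nothing bypasses a level. Under that assumption, the access count at the next level approximates the misses at the current level, and the difference approximates the hits. The function name `estimate_misses` and the counter values are hypothetical placeholders, not measurements from the Pixel 5. Pairs where the outer level reports more accesses than the inner one are flagged rather than turned into a negative number, since the model clearly does not hold for them.

```python
def estimate_misses(counts):
    """Estimate per-level hits and misses from access counts.

    counts: list of (level_name, accesses), ordered from the level
    closest to the core outward, e.g. L1, L2, L3.
    Assumption: each access at level N+1 is a miss at level N
    (no bypassing, no prefetch or write-back traffic).
    """
    results = []
    for (name, acc), (_, next_acc) in zip(counts, counts[1:]):
        # The model only makes sense if the outer level sees no more
        # accesses than the inner one.
        consistent = next_acc <= acc
        results.append({
            "level": name,
            "accesses": acc,
            "est_misses": next_acc if consistent else None,
            "est_hits": acc - next_acc if consistent else None,
            "consistent": consistent,
        })
    return results


# Hypothetical counter values, chosen so that L3 > L2 as in the question.
sample = [("L1_cache", 1000), ("L2_cache", 100), ("L3_cache", 150)]

for row in estimate_misses(sample):
    print(row)
```

With counts like the ones described above (more L3 accesses than L2 accesses), the L2 row is marked inconsistent, which is exactly the case the simple subtraction cannot handle.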
Best Regards,
Pavel.
AArch64 has load/store instructions that hint that caching makes no sense for the data they access, so that data may bypass the caches. AFAIK these are called non-temporal (LDNP/STNP).
I see. Thank you. Do you know if there is a way to use the CPU's PMU to detect how many of those were executed? Do they show up as non-temporal on a system trace? Do you know if they bypass all caches or only some levels?
If they bypass all caches, this still doesn't explain why L3 has more accesses than L2. Do you know what could be the reason for that?
I would assume that at least the 3rd-level cache is unified. As for the PMU, I suggest consulting the manual.