I developed a small test to understand the effects of L1 snoops on performance. The test was simple:
1. A writer thread on core 0 that repeatedly writes to a contiguous 2 KB range of RAM, looping over the range 40 times
2. A reader thread on core 1 that concurrently reads from the same 2 KB region of RAM, also looping over it 40 times
Synchronization between the reader and the writer is coarse-grained (they run at the same time, but accesses to each specific memory address are not synchronized).
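Roughly, the test looks like the sketch below (simplified; the real code also timestamps each thread and collects the PMU counters, and the names here are illustrative):

```c
/* Simplified sketch of the test: writer pinned to core 0, reader to core 1.
 * Byte-wide accesses are assumed here; timing and PMU setup are omitted. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

#define REGION_SIZE 2048          /* contiguous 2 KB shared region */
#define ITERATIONS  40

static volatile uint8_t region[REGION_SIZE];

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *writer(void *arg)
{
    (void)arg;
    pin_to_core(0);
    for (int i = 0; i < ITERATIONS; i++)
        for (int j = 0; j < REGION_SIZE; j++)
            region[j] = (uint8_t)i;       /* repeated stores to the shared range */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    pin_to_core(1);
    volatile uint8_t sink = 0;
    for (int i = 0; i < ITERATIONS; i++)
        for (int j = 0; j < REGION_SIZE; j++)
            sink = region[j];             /* concurrent loads from the same range */
    (void)sink;
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    /* The threads run at the same time; individual accesses are not synchronized. */
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```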
Observations:
1. As expected, we see many snoops (as indicated by event 200d) when the two threads run concurrently, whereas we see very few snoops if the threads run on different cores but not concurrently
2. A total of 80 KB of data is touched by each thread in the test (the 2 KB region × 40 iterations). We see about 10K snoops when the two threads are running concurrently (about 1 snoop per 8 bytes accessed)
3. In terms of execution time, the writer thread is barely impacted when the two threads run concurrently vs. sequentially (about 0.5 ns of extra time per snoop at 1 GHz, or about 0.5 clocks per snoop)
4. The reader thread is impacted by about 12 to 15 clocks per snoop (coincidentally about the cost of an L2 access)
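For reference, here is the rough arithmetic behind the per-snoop figures above (my own back-of-the-envelope numbers, assuming the 1 GHz clock already mentioned):

```c
/* Back-of-the-envelope arithmetic for the observations above.
 * Inputs are the approximate values quoted in the post; outputs are
 * estimates, not measurements. */
#include <stdio.h>

int main(void)
{
    const double bytes_per_thread = 2048.0 * 40;  /* 2 KB region x 40 iterations = 80 KB */
    const double snoops           = 10000.0;      /* ~10K snoops (event 200d) */
    const double ns_per_clock     = 1.0;          /* assumed 1 GHz core clock */

    printf("bytes per snoop   : ~%.0f\n", bytes_per_thread / snoops);              /* ~8 bytes */
    printf("writer extra time : ~%.0f us\n", 0.5 * snoops * ns_per_clock / 1e3);   /* ~5 us total */
    printf("reader extra time : ~%.0f to %.0f us\n",
           12.0 * snoops * ns_per_clock / 1e3,
           15.0 * snoops * ns_per_clock / 1e3);                                    /* ~120 to 150 us total */
    return 0;
}
```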
Questions:
1. We also see a large number of L2 cache accesses, as indicated by event 22d (about 10K extra L2 cache accesses, or one per snoop). This seems to indicate that the L2 is involved when cache lines are moved from one core's L1 to another's in response to a snoop. Could you confirm that the L2 is indeed always involved in such a transfer? I had expected that L1-to-L1 transfers could be completed without disturbing the L2.
2. This question may be moot if the answer to the question above is "yes". I've read in ARM documentation that, for data accesses, the L2 cache is mostly exclusive of the L1 data caches and is populated by L1 cast-outs. The behavior above could, however, be explained if the L2 is inclusive of the L1 caches. Is the L2 mostly exclusive of the L1 for data accesses, or inclusive?
I can provide code snippets if that is helpful, but I'm expecting you may know the answers without running the example.
Thanks!
Although the A53 TRM I have doesn't call it out in the L1/L2 description, the description of CPUACTLR_EL1 indicates that the Cortex-A53 implements a write-streaming optimization. After a certain number of write misses, the L1 (and eventually the L2) disables write allocation.
Between the L1 and L2 is a store-merge buffer, which probably explains why you see one snoop per 8 bytes written. That's assuming you're writing a byte at a time, which it sounds like you might be.
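To make that assumption concrete, here are the two writer variants I have in mind (hypothetical code, just to show the difference in store width; comparing the snoop counts of the two would be one way to check whether the per-8-byte pattern really comes from the merge buffer coalescing byte stores):

```c
/* Two hypothetical writer loops that differ only in store width. With byte
 * stores, the store-merge buffer can coalesce up to 8 consecutive stores into
 * one 8-byte transaction toward L2, which would line up with the observed
 * ~1 snoop per 8 bytes written. */
#include <stdint.h>
#include <stddef.h>

void write_bytes(volatile uint8_t *buf, size_t len, uint8_t v)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = v;                  /* 1-byte stores: candidates for merging */
}

void write_words(volatile uint64_t *buf, size_t nwords, uint64_t v)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = v;                  /* 8-byte stores: already merge-buffer sized */
}
```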
In all likelihood, the snoops triggered by the reader turned enough of your writer's writes into misses that the write-streaming optimization kicked in. That would turn your cache-line ping test into an L1/L2 affair rather than an L1/L1 affair.