I am using the Cortex-A53 processor (Xilinx Zynq Ultrascale+ SoC).
I see an unexpectedly high BUS_ACCESS_LD count in write-streaming (read-allocate) mode when I run a memset (a self-written memset in assembly). On the Xilinx chip I can also measure the write and read byte counts at the DDR memory controller ports, and these show that the actual number of bytes read is not that high.
Testcase 1: memset of 1.085.440 bytes, write-streaming disabled:
L2D_CACHE: 32776
BUS_ACCESS_LD: 65547
L1D_CACHE_REFILL: 16386
L1D_CACHE_WB: 16386
L2D_CACHE_REFILL: 16388
L2D_CACHE_WB: 16375
DDRC.S1 Write Byte Count: 524160
DDRC.S1 Read Byte Count: 524480
DDRC.S2 Write Byte Count: 523840
DDRC.S2 Read Byte Count: 524416
One cache line is 64 bytes. BUS_ACCESS counts beats and the data width of the bus is 16 bytes, so one cache line refill takes four beats. With that in mind, these values seem to make sense.
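As a sanity check, the arithmetic for testcase 1 can be sketched in Python. This is only a consistency check on the numbers posted above, assuming 64-byte cache lines and 16-byte bus beats as stated; the variable names are made up for illustration.

```python
# Values copied from testcase 1 (write-streaming disabled).
CACHE_LINE_BYTES = 64   # stated cache line size
BUS_BEAT_BYTES = 16     # stated bus data width

l2d_cache_refill = 16388
l2d_cache_wb = 16375
bus_access_ld = 65547
ddr_read_bytes = 524480 + 524416    # DDRC.S1 + DDRC.S2 read byte counts
ddr_write_bytes = 524160 + 523840   # DDRC.S1 + DDRC.S2 write byte counts

# One cache line needs this many beats on the bus.
beats_per_line = CACHE_LINE_BYTES // BUS_BEAT_BYTES

# If every L2 refill is one full line read, this is the expected beat count.
expected_beats = l2d_cache_refill * beats_per_line

# Number of full cache lines seen at the DDR controller ports.
lines_read_from_ddr = ddr_read_bytes // CACHE_LINE_BYTES
lines_written_to_ddr = ddr_write_bytes // CACHE_LINE_BYTES

print("beats per line:      ", beats_per_line)
print("expected LD beats:   ", expected_beats, "measured:", bus_access_ld)
print("lines read from DDR: ", lines_read_from_ddr, "L2 refills:", l2d_cache_refill)
print("lines written to DDR:", lines_written_to_ddr, "L2 write-backs:", l2d_cache_wb)
```

The expected beat count is within a few counts of the measured BUS_ACCESS_LD, and the DDR byte counts correspond to almost exactly one cache line per L2 refill and per L2 write-back.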
Testcase 2: memset of 1.085.440 bytes, write-streaming enabled:
L2D_CACHE: 16388
BUS_ACCESS_LD: 16419
L1D_CACHE_REFILL: 6
L1D_CACHE_WB: 0
L2D_CACHE_REFILL: 9
L2D_CACHE_WB: 16255
DDRC.S1 Write Byte Count: 520128
DDRC.S1 Read Byte Count: 384
DDRC.S2 Write Byte Count: 520192
DDRC.S2 Read Byte Count: 64
The counts for L2D cache accesses, L2D cache write-backs and BUS_ACCESS_LD are close together. It makes sense that the cache refill counts and the L1 write-back count are low. But I do not understand why BUS_ACCESS_LD is so large in this case, because at the DDR memory controller ports I can see that only a few bytes are read.
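To put a number on the mismatch, the same kind of check can be applied to testcase 2. Again this is only arithmetic on the values posted above, under the same assumptions (64-byte lines, 16-byte beats), with illustrative variable names; it does not explain where the extra counts come from.

```python
# Values copied from testcase 2 (write-streaming enabled).
CACHE_LINE_BYTES = 64
BUS_BEAT_BYTES = 16

bus_access_ld = 16419
l2d_cache_wb = 16255
ddr_read_bytes = 384 + 64           # DDRC.S1 + DDRC.S2 read byte counts
ddr_write_bytes = 520128 + 520192   # DDRC.S1 + DDRC.S2 write byte counts

beats_per_line = CACHE_LINE_BYTES // BUS_BEAT_BYTES

# Read beats that the DDR read byte count would account for.
lines_read_from_ddr = ddr_read_bytes // CACHE_LINE_BYTES
expected_ld_beats = lines_read_from_ddr * beats_per_line

# BUS_ACCESS_LD counts that are not backed by any DDR read traffic.
unexplained_ld_counts = bus_access_ld - expected_ld_beats

# Compare BUS_ACCESS_LD with the number of lines actually written.
lines_written_to_ddr = ddr_write_bytes // CACHE_LINE_BYTES
ld_counts_per_written_line = bus_access_ld / lines_written_to_ddr

print("lines read from DDR:       ", lines_read_from_ddr)
print("expected LD beats:         ", expected_ld_beats, "measured:", bus_access_ld)
print("unexplained LD counts:     ", unexplained_ld_counts)
print("lines written to DDR:      ", lines_written_to_ddr, "L2 write-backs:", l2d_cache_wb)
print("LD counts per written line:", round(ld_counts_per_written_line, 3))
```

The DDR write byte count matches L2D_CACHE_WB exactly, while BUS_ACCESS_LD comes out at roughly one count per written cache line rather than the four beats a real line read would produce.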
There is an errata notice for the Cortex-A53 regarding "PMU counter values might be inaccurate when monitoring certain events", but it only mentions BUS_ACCESS and BUS_ACCESS_ST. Is there also an error with BUS_ACCESS_LD when write-streaming is enabled?