I am currently working on a Xilinx Zynq US+ SoC, where the R5 (2 cores in lockstep), the A53 (4 cores), the PL, and the GPU are all on a single chip. So far we have used software-based cache coherency to communicate between the R5 and A53 worlds, performing explicit cache operations in software. We have now learned that our SoC has a built-in cache-coherent interconnect (CCI-400) through which the R5 can snoop the A53 caches, which would let us avoid the cache maintenance operations in software. It supports one-way I/O coherency, where the R5 can snoop the A53 caches, so we can skip the cache clean on the A53 and the cache invalidation on the R5. We enabled this feature and started measuring performance and bandwidth. We found that R5 performance decreased by 20% just from enabling coherency. Even with nothing else running in the system (A53, GPU, ...), the standalone RPU performance (write rate) became 3x slower. Is this possible? Is there a CCI setting I am missing? I am not sure.
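For readers following along, the software-managed scheme described above usually comes down to a clean on the producer side and an invalidate on the consumer side, each over a range rounded out to whole cache lines. Below is a minimal sketch of that rounding only. It assumes a 32-byte line on the R5 and a 64-byte line on the A53 (check the TRMs for your configuration). `cache_range_t` and `align_to_cache_lines` are hypothetical names, and the actual maintenance call from the vendor BSP is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line sizes; verify against the TRM for your device. */
#define R5_CACHE_LINE   32u
#define A53_CACHE_LINE  64u

/* Hypothetical helper type: a cache-line-aligned address range. */
typedef struct {
    uintptr_t start;  /* rounded down to a line boundary */
    size_t    len;    /* multiple of the line size */
} cache_range_t;

/* Round [addr, addr + len) out to whole cache lines.
 * `line` must be a power of two. */
cache_range_t align_to_cache_lines(uintptr_t addr, size_t len, size_t line)
{
    cache_range_t r;
    uintptr_t mask = (uintptr_t)line - 1u;
    uintptr_t end  = (addr + len + mask) & ~mask;  /* round end up */

    r.start = addr & ~mask;                        /* round start down */
    r.len   = (size_t)(end - r.start);
    return r;
}
```

With software coherency, the A53 would clean the range computed with `A53_CACHE_LINE` before signalling the R5, and the R5 would invalidate the range computed with `R5_CACHE_LINE` before reading. Those are the two steps the post expects to drop once the coherent path is enabled.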
Cache coherency eats bandwidth, as the cache controller must keep the other caches in sync. But 20% is too much IMO, so I also think some setting is wrong.
Are you sure the R5 can snoop the A53 caches? I just checked the TRM and saw no such information.
Yes, 20% is too much, and we cannot adopt this method because our system already has a CPU usage of 86%.
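To put rough numbers on why the headroom is not there, here is a back-of-the-envelope estimate. It assumes that the whole 86% load slows down uniformly and that "20%" means a 20% drop in throughput. Neither is stated in the thread, so treat both as assumptions. `projected_load_pct` is a hypothetical helper.

```c
/* Projected CPU load after a fractional throughput loss.
 * load_pct: current load in percent (86 in the post above).
 * loss:     fractional drop in throughput (0.20 for "20%").
 * Assumes the same work simply takes 1 / (1 - loss) times as long. */
double projected_load_pct(double load_pct, double loss)
{
    return load_pct / (1.0 - loss);
}
```

Under those assumptions, `projected_load_pct(86.0, 0.20)` comes out at about 107.5, i.e. more than the CPU can deliver.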
Yes. Please refer to the Zynq US+ SoC TRM for details. We do have one-way I/O cache coherency, where the R5 can snoop the A53 caches, but the other way is not possible!
I double-checked with the Xilinx team as well, and they confirmed that it is possible :)
Is there a chance that R5 performance is affected by the PL coherency settings? I do see that the PL is also part of the coherency mechanism; I mean, there is two-way cache coherency between the A53 and the PL.
Did you consider marking the shared memory as WRITE-THROUGH and invalidating the RPU cache? Still, 20% degradation seems way too much. But I have not worked with the RPU yet, so good to remember ...
The MPU region is configured as write-back, write-allocate, shared.
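To make the two options in this exchange concrete (write-back/write-allocate as currently configured, versus the suggested write-through), here is a small sketch of how the memory-attribute bits could be encoded. It assumes the Cortex-R5 MPU Region Access Control Register layout as I understand it from the TRM (B at bit 0, C at bit 1, S at bit 2, TEX at bits 5:3), so please verify before relying on it. `mpu_normal_attr` and the macro names are hypothetical, and the AP and XN fields are left out.

```c
#include <stdint.h>

/* Assumed bit positions in the MPU Region Access Control Register. */
#define MPU_B_BIT   (1u << 0)                       /* bufferable */
#define MPU_C_BIT   (1u << 1)                       /* cacheable  */
#define MPU_S_BIT   (1u << 2)                       /* shareable  */
#define MPU_TEX(x)  (((uint32_t)(x) & 0x7u) << 3)   /* type extension */

/* Hypothetical helper: attribute bits for Normal memory.
 * write_back != 0: write-back, write-allocate (TEX=001, C=1, B=1)
 * write_back == 0: write-through, no write-allocate (TEX=000, C=1, B=0)
 * shared     != 0: additionally set the S bit. */
uint32_t mpu_normal_attr(int write_back, int shared)
{
    uint32_t attr = write_back
        ? (MPU_TEX(1) | MPU_C_BIT | MPU_B_BIT)
        : MPU_C_BIT;

    if (shared) {
        attr |= MPU_S_BIT;
    }
    return attr;
}
```

With this encoding, the current configuration corresponds to `mpu_normal_attr(1, 1)` and the write-through suggestion to `mpu_normal_attr(0, 1)`. How the R5 actually treats shared regions with respect to caching is worth checking in the TRM as well.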