I am currently working on a Xilinx Zynq UltraScale+ SoC, where the R5 (2 cores in lock-step), the A53 (4 cores), the PL and the GPU are integrated on a single chip. So far we have used a software-based cache coherency mechanism to communicate between the R5 and A53 worlds, i.e. we perform explicit cache operations at the software level. We have now learned that our SoC has a built-in cache coherent interconnect (CCI-400), through which the R5 can snoop the A53 caches. Using it, we could avoid the cache maintenance operations at the software level. It supports one-way I/O coherency, where the R5 can snoop the A53 caches, so we can skip the cache clean on the A53 and the cache invalidation on the R5. We enabled this feature and started measuring performance and bandwidth. We found that R5 performance decreased by 20% just from enabling coherency. Even with nothing else running in the system (A53, GPU, ...), the standalone RPU write rate became 3x slower. Is this possible? Is there a CCI setting that I am missing? I am not sure.
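A minimal sketch of the software-managed scheme described in the question, for illustration only. The names `cache_clean_range`, `cache_invalidate_range`, `producer_send`, `consumer_receive` and `shared_buf` are hypothetical and not from the original post. On the host the two cache routines are stubs that only count calls; on a bare-metal Xilinx BSP they would presumably wrap the range-based routines in `xil_cache.h` (for example `Xil_DCacheFlushRange` and `Xil_DCacheInvalidateRange`). The `hw_coherent` flag models the case from the question where both maintenance steps are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the platform's cache maintenance calls.
 * Here they only count invocations; on the target they would wrap the
 * real range-based clean and invalidate routines. */
static unsigned clean_calls;
static unsigned invalidate_calls;

static void cache_clean_range(const void *addr, size_t len)
{
    (void)addr; (void)len;
    clean_calls++;
}

static void cache_invalidate_range(const void *addr, size_t len)
{
    (void)addr; (void)len;
    invalidate_calls++;
}

#define MSG_SIZE 64u

/* Stands in for the memory region shared between the A53 and the R5. */
static uint8_t shared_buf[MSG_SIZE];

/* Producer side (the A53 in the question): write the data, then clean
 * the cache so it reaches memory, unless hardware coherency covers it. */
static void producer_send(const uint8_t *msg, size_t len, int hw_coherent)
{
    assert(len <= MSG_SIZE);
    memcpy(shared_buf, msg, len);
    if (!hw_coherent)
        cache_clean_range(shared_buf, len);
}

/* Consumer side (the R5 in the question): invalidate stale lines before
 * reading, unless hardware coherency covers it. */
static void consumer_receive(uint8_t *out, size_t len, int hw_coherent)
{
    assert(len <= MSG_SIZE);
    if (!hw_coherent)
        cache_invalidate_range(shared_buf, len);
    memcpy(out, shared_buf, len);
}
```

With `hw_coherent` set to 0, every transfer costs one clean and one invalidate; with it set to 1, both calls disappear, which is the saving the question was hoping for.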
Yes, 20% is too much, and we cannot adopt this method because our system already runs at 86% CPU usage.
Did you consider marking the shared memory as WRITE-THROUGH and invalidating the RPU cache? Still, 20% degradation seems way too much. But I have not worked with the RPU yet, so this is good to remember ...
The MPU region is configured as write-back, write-allocate, shared.
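For reference, a small sketch of how the two memory types mentioned in this thread are encoded in the ARMv7-R MPU Region Access Control Register used by the Cortex-R5. The macro and function names (`RACR_*`, `attr_wb_wa`, `attr_wt_nwa`) are illustrative assumptions, not BSP identifiers, and the access-permission and execute-never fields are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-type fields of the ARMv7-R MPU Region Access Control Register.
 * Macro names are illustrative, not taken from a vendor header. */
#define RACR_B       (1u << 0)                        /* bufferable */
#define RACR_C       (1u << 1)                        /* cacheable  */
#define RACR_S       (1u << 2)                        /* shareable  */
#define RACR_TEX(x)  (((uint32_t)(x) & 0x7u) << 3)    /* type extension */

/* Normal memory, write-back, write-allocate: TEX=001, C=1, B=1. */
static uint32_t attr_wb_wa(int shared)
{
    return RACR_TEX(1) | RACR_C | RACR_B | (shared ? RACR_S : 0u);
}

/* Normal memory, write-through, no write-allocate: TEX=000, C=1, B=0. */
static uint32_t attr_wt_nwa(int shared)
{
    return RACR_TEX(0) | RACR_C | (shared ? RACR_S : 0u);
}

/* Sanity check: the shared variant differs only in the S bit. */
static void check_encodings(void)
{
    assert((attr_wb_wa(1) & ~RACR_S) == attr_wb_wa(0));
    assert((attr_wt_nwa(1) & ~RACR_S) == attr_wt_nwa(0));
}
```

Switching the shared region from the first encoding to the second is the change suggested in the earlier reply; whether it helps the measured write rate would still need to be checked on the target.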