ARM Cortex-R5 performance decreases by 20% after enabling cache coherency

I am currently working on a Xilinx Zynq UltraScale+ (US+) SoC, where an R5 (2 cores in lock-step), an A53 (4 cores), the PL, and a GPU are integrated on a single chip.

So far we have been using a software-based cache coherency mechanism to communicate between the R5 and A53 worlds, performing explicit cache maintenance operations in software. We have now learned that our SoC has a built-in cache coherent interconnect (CCI-400), through which the R5 can snoop the A53 caches. Using it would let us avoid the cache maintenance operations in software, at least in one direction: it supports one-way I/O coherency, where the R5 can snoop the A53 caches, so we can skip the cache clean on the A53 and the cache invalidation on the R5.
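For reference, here is a minimal sketch of the software-managed handoff we are trying to get rid of (A53 produces, R5 consumes). The names cache_clean_range, cache_invalidate_range, producer_publish and consumer_acquire are hypothetical placeholders, not our real code or a vendor API; on the target they would map to the platform's cache routines plus the required barriers, and here they are no-op stubs so the sketch compiles on a host. The 64-byte value assumes the A53 L1 line (64 bytes) is the larger of the two, the R5 line being 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: align shared buffers to the larger L1 line size of the two
 * clusters (A53: 64 bytes, R5: 32 bytes). */
#define SHARED_LINE 64u

/* Hypothetical hooks. On target, replace the bodies with the platform's
 * clean / invalidate-by-range routines. No-ops here so this builds anywhere. */
void cache_clean_range(uintptr_t addr, size_t len)      { (void)addr; (void)len; }
void cache_invalidate_range(uintptr_t addr, size_t len) { (void)addr; (void)len; }

/* Round an address down to the start of its cache line. */
uintptr_t line_floor(uintptr_t addr)
{
    return addr & ~(uintptr_t)(SHARED_LINE - 1u);
}

/* Number of bytes covered by all cache lines touched by [addr, addr + len). */
size_t line_span(uintptr_t addr, size_t len)
{
    uintptr_t start = line_floor(addr);
    uintptr_t end   = (addr + len + SHARED_LINE - 1u) & ~(uintptr_t)(SHARED_LINE - 1u);
    return (size_t)(end - start);
}

/* A53 side: after writing the buffer, clean it so the data reaches memory. */
void producer_publish(const void *buf, size_t len)
{
    uintptr_t a = (uintptr_t)buf;
    cache_clean_range(line_floor(a), line_span(a, len));
}

/* R5 side: before reading the buffer, invalidate stale lines in its own cache. */
void consumer_acquire(const void *buf, size_t len)
{
    uintptr_t a = (uintptr_t)buf;
    cache_invalidate_range(line_floor(a), line_span(a, len));
}
```

With the one-way coherency described above, these two calls are the steps we would expect to drop for the A53-to-R5 direction.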

We enabled this feature and started measuring performance and bandwidth. We found that R5 performance decreased by 20% as soon as coherency was enabled. Even with nothing else running in the system (A53, GPU, ...), the standalone RPU write rate became 3x slower.
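To make clear what "write rate" means here, this is a simplified, portable sketch of that kind of measurement, not our exact benchmark. The function names are made up for illustration, and it uses clock() from the C library as the timer; on the bare-metal R5 a hardware timer or cycle counter would take its place.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Fill the buffer with a pattern; volatile keeps the stores from being
 * optimized away. Returns the number of bytes written. */
size_t write_pattern(volatile uint32_t *buf, size_t words, uint32_t pattern)
{
    for (size_t i = 0; i < words; i++) {
        buf[i] = pattern;
    }
    return words * sizeof(uint32_t);
}

/* Convert a byte count and an elapsed time into bytes per second. */
double bytes_per_second(size_t bytes, double seconds)
{
    return (seconds > 0.0) ? (double)bytes / seconds : 0.0;
}

/* Write the whole buffer `passes` times and return the average write rate. */
double measure_write_rate(volatile uint32_t *buf, size_t words, unsigned passes)
{
    size_t total = 0;
    clock_t t0 = clock();
    for (unsigned p = 0; p < passes; p++) {
        total += write_pattern(buf, words, 0xA5A5A5A5u ^ p);
    }
    clock_t t1 = clock();
    return bytes_per_second(total, (double)(t1 - t0) / (double)CLOCKS_PER_SEC);
}
```

Running the same loop over the same buffer with coherency disabled and then enabled gives two rates whose ratio is the slowdown being discussed.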

Is this possible? Is there a setting in the CCI that I am missing? I am not sure.