Hey,
On our development board we use PCIe to exchange data between the two Tegras on an NVIDIA Drive PX2.
Basically, the data coming across the NT ports is written into system RAM as if by a DMA engine. We allocate the memory with an interface function from the API of the PCIe chip; internally that API function uses dma_alloc_coherent() from the Linux kernel. In our application we can then use the address of the allocated memory area and do our work. Memory barriers guarantee the right ordering between reads and writes.
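For context, stripped down to the essentials, our allocation and polling path looks roughly like this (sketch only; the real calls go through the vendor API of the PCIe chip, and the device pointer, buffer layout and names here are just placeholders):

```c
#include <linux/dma-mapping.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

#define BUF_SIZE 4096

static void *cpu_addr;       /* kernel virtual address the CPU polls    */
static dma_addr_t bus_addr;  /* address handed to the NT port / remote  */

static int setup_buffer(struct device *dev)
{
	/* The vendor API boils down to a dma_alloc_coherent() call. */
	cpu_addr = dma_alloc_coherent(dev, BUF_SIZE, &bus_addr, GFP_KERNEL);
	return cpu_addr ? 0 : -ENOMEM;
}

static u32 poll_new_data(void)
{
	u32 *buf = cpu_addr;

	/* Wait until the remote Tegra has written the "data ready" word... */
	while (READ_ONCE(buf[0]) == 0)
		; /* busy-wait; the real code backs off */

	/* ...and order the payload read after the flag read. */
	rmb();

	return READ_ONCE(buf[1]);
}
```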
We are now facing the problem that, by the looks of it, new data is not picked up from RAM when we poll, and we read stale data from the CPU's cache instead. Unfortunately the MMU is disabled, because we can't use the PCIe driver when it is activated.
I have come across this document about cache coherency, but I am not exactly sure whether it can help us. In addition, I am a complete newbie at programming ARM at such a low level.
Any help is appreciated, thanks in advance.
Jan
As I'm not familiar with your SoC design: do you have any cache-coherent network/interconnect built around the CPUs?
My point is: if you have a CCN/CCI and your "DMA engine" is an ACE-Lite type device attached through an RN-I (I/O Requesting Node), then it should be able to send coherent data. If you don't, then you should probably invalidate the memory ranges the "DMA engine" writes to, from L1 through L2 to the PoC, invalidate the L1 instruction cache to the PoC as well, and then let the CPU go.
As for cache maintenance operations (CMOs), make sure you use the ones that broadcast the operation to the other cores within the same shareability domain. Usually invalidate by VA is the one to go with; the TRM you linked gives the details.
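For example, invalidating the DMA-written range by VA before the CPU touches it could look roughly like this (bare-metal AArch64 sketch, run at EL1 or higher; it assumes a 64-byte line size, which you should really read from CTR_EL0, and the buffer pointer/length are whatever your NT window gives you):

```c
#include <stdint.h>
#include <stddef.h>

/* Assumption: 64-byte D-cache line; read CTR_EL0.DminLine for the real value. */
#define CACHE_LINE 64

/*
 * Invalidate the data cache by VA, to the PoC, over [addr, addr + size).
 * When the region is mapped Inner Shareable (MMU on), DC IVAC is broadcast
 * to the other cores in the same shareability domain.
 */
static void dcache_inval_range(void *addr, size_t size)
{
	uintptr_t line = (uintptr_t)addr & ~((uintptr_t)CACHE_LINE - 1);
	uintptr_t end  = (uintptr_t)addr + size;

	for (; line < end; line += CACHE_LINE)
		__asm__ volatile("dc ivac, %0" : : "r"(line) : "memory");

	__asm__ volatile("dsb sy" : : : "memory");
}
```

Call it on the buffer right before the CPU reads what the "DMA engine" wrote. If the CPU may have dirty lines in that range, clean them out (DC CVAC/CIVAC) before the DMA starts writing, otherwise a later eviction can overwrite what the DMA just put there.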
Off the record... I had a use case in which a GP DMA wrote into the L3 cache and the data needed to be used by the CPU. The DMA was connected to the CCN through an RN-I, so I didn't need to do anything to keep coherency between the GP DMA and the CPU (CPU0 in this case, as it was in U-Boot).