Hey,
on our development board we use PCIe to exchange data between the two Tegras on an NVIDIA Drive PX2.
Basically, the data coming across the NT ports acts like a DMA engine writing to system RAM. We allocate memory with an interface function from the PCIe chip's API; internally that function calls "dma_alloc_coherent" from the Linux kernel. In our application we then use the address of the allocated memory area and do our work. Memory barriers guarantee the correct ordering of reads and writes.
The problem we are facing is that, as far as we can tell, new data is not fetched from RAM and we read stale data from the CPU's cache instead. Unfortunately the MMU is disabled, as we can't use the PCIe driver when it is activated.
I have come across this document about cache coherency, but I am not sure whether it can help us. In addition, I am a complete newbie when it comes to programming ARM at such a low level.
Any help is appreciated, thanks in advance.
Jan
It doesn't seem possible for the data caches to be active unless both the MMU and the data cache are enabled. With the MMU off, only instructions can be cached, and only if the instruction cache is enabled. All the controls are in the SCTLR of each exception level except EL0.
I think I once saw a case where the MMU and data cache were disabled without a clean, invalidate, or clean & invalidate first. The cached data was retained, became active again when the MMU and data cache were re-enabled, and corrupted memory, but that was due to bad coding.
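To make the SCTLR point concrete, here is a minimal sketch of how the three relevant control bits can be decoded. The bit positions (M = bit 0, C = bit 2, I = bit 12) are the architectural ones for SCTLR_EL1; the struct and function names are made up for illustration, and the register read itself only works at EL1 or higher (for example in a kernel module), not from a user-space application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural SCTLR_ELx control bits (ARMv8-A). */
#define SCTLR_M (UINT64_C(1) << 0)   /* stage 1 MMU enable        */
#define SCTLR_C (UINT64_C(1) << 2)   /* data/unified cache enable */
#define SCTLR_I (UINT64_C(1) << 12)  /* instruction cache enable  */

/* Hypothetical helper type, not part of any existing API. */
struct sctlr_state {
    bool mmu_on;
    bool dcache_on;
    bool icache_on;
};

/* Decode a raw SCTLR value into the three enable flags. */
static inline struct sctlr_state sctlr_decode(uint64_t sctlr)
{
    struct sctlr_state s;
    s.mmu_on    = (sctlr & SCTLR_M) != 0;
    s.dcache_on = (sctlr & SCTLR_C) != 0;
    s.icache_on = (sctlr & SCTLR_I) != 0;
    return s;
}

#if defined(__aarch64__)
/* Must be executed at EL1 or higher; at EL0 this instruction traps. */
static inline uint64_t read_sctlr_el1(void)
{
    uint64_t v;
    asm volatile("mrs %0, sctlr_el1" : "=r"(v));
    return v;
}
#endif
```

Called from kernel context, sctlr_decode(read_sctlr_el1()) would show whether the MMU and data cache are both on, which is the precondition for data being cached at all.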
I am sorry, but I made a mistake: the MMU is enabled; it is the IOMMU (SMMU) that we have disabled.
As I'm not familiar with your SoC design: do you have any cache-coherent network/interconnect built around the CPUs?
My point is: if you have a CCN/CCI and your "DMA engine" is an ACE-Lite type device attached through an RN-I (I/O Requesting Node), then it should be able to send coherent data. If you don't, then you should probably invalidate the memory ranges the "DMA engine" writes to, L1 through L2, invalidate the L1 instruction cache to PoC and let the CPU go.
For CMOs (cache maintenance operations), make sure you use the ones that broadcast the operation to the other cores within the same shareable domain. Invalidate by VA is usually the one to go with. The TRM you linked gives the details.
Off the record... I had a use case in which a GP DMA wrote to the L3 cache and the CPU needed to use that data. The DMA was connected to the CCN through an RN-I, so I didn't need to do anything to keep the GP DMA and the CPU coherent (cpu0 in this case, as it was in U-Boot).
Check your TRM to see whether the PCIe is really part of the cache-coherent domain or whether it bypasses the caches. Anyway, an easy way to check is to invalidate the cache before reading.
Thank you all for your help!
With the following it works:
asm volatile("dc civac, %0" : : "r" (addr) : "memory");
Note: Only cleaning or only invalidating the cache didn't do the trick; we must do both.
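For a whole receive buffer, the single "dc civac" above has to be applied to every cache line the buffer touches. A rough sketch of that loop is below. The function names are hypothetical, the 64-byte fallback is only a placeholder so the code builds on a non-ARM host, and it assumes the OS allows "dc civac" and CTR_EL0 reads from the current exception level (Linux normally enables both for user space).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest data cache line size in bytes, from CTR_EL0.DminLine
 * (bits [19:16] hold log2 of the line size in 4-byte words). */
static inline size_t dcache_line_size(void)
{
#if defined(__aarch64__)
    uint64_t ctr;
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
    return (size_t)4 << ((ctr >> 16) & 0xf);
#else
    return 64; /* placeholder for non-ARM builds, not a real value */
#endif
}

/* Number of cache lines of size `line` touched by [start, start + len). */
static inline size_t dcache_lines_spanned(uintptr_t start, size_t len,
                                          size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0); /* power of two */
    if (len == 0)
        return 0;
    uintptr_t first = start & ~(uintptr_t)(line - 1);
    uintptr_t last  = (start + len - 1) & ~(uintptr_t)(line - 1);
    return (size_t)((last - first) / line) + 1;
}

/* Clean and invalidate every line of the buffer to the Point of
 * Coherency, then wait for completion. Call this before reading
 * data that the PCIe side has written. */
static inline void dcache_clean_inval_range(const volatile void *buf,
                                            size_t len)
{
    size_t line = dcache_line_size();
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(line - 1);
    size_t n = dcache_lines_spanned((uintptr_t)buf, len, line);

    for (size_t i = 0; i < n; i++, addr += line) {
#if defined(__aarch64__)
        asm volatile("dc civac, %0" : : "r"(addr) : "memory");
#endif
    }
#if defined(__aarch64__)
    asm volatile("dsb sy" : : : "memory"); /* maintenance has completed */
#endif
}
```

Because the clean half of "dc civac" writes dirty lines back to memory, the buffer should ideally be cache-line aligned and not share lines with data the CPU writes; otherwise the write-back could clobber freshly received data.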
That truly sounds weird to me, unless you run "dc civac" before getting data from the PCIe. If you had run it afterwards and the PCIe isn't part of the coherency domain, you might have overwritten what the PCIe sent over. The other possibility is that you also have outer caches such as an L3: then "dc civac" should push the data up to L3, and to clean and invalidate from there you should use the power-transition mechanism of the cache-coherent interconnect. On my CCN, for instance, I use a power transition from FAM (Fully Associative L3 Memory) to NOL3 (No L3).
Either way, good that "dc civac" worked for you.
It is indeed used before getting the data.
Like I said before, I am a newbie with low level programming on ARM. I should read some docs on caches and co...