Flushing all L1 & L2 caches under Linux (kernel space) - optimizing dma-mapping API

Hi,

In my system (Cyclone V, two Cortex-A9 cores) I need large DMA transfers, and currently I can't connect the DMA via the ACP, so cache coherency becomes a software problem. I know that the proper way of doing this under Linux is the DMA-mapping API, and indeed it did the job when I integrated it into my driver. However, since I am working with large buffers (32 MB, 64 MB, 128 MB), syncing the buffers for the CPU or the device takes too long and I'd like to optimize it. I also need these buffers to be cached, since I perform some processing on this memory area, so making them non-cached (coherent) hurts performance.

The approach I investigated is to replace the range-based flushing APIs with flushes of the entire caches. Since the caches themselves are considerably smaller than the buffers, this seems to make sense. I used the following APIs, which perform clean+invalidate, for this purpose:

flush_cache_all() --> for L1

outer_flush_all() --> for L2

The problem is that when I try to flush L2 I get a kernel panic. The documentation notes that this operation isn't atomic, so IRQs must be disabled and there must be no other active L2 masters. That, of course, is a problem in an MPCore system. BTW, when I disable the second core, this API completes successfully.

So, is there a proper way to flush the entire L2 cache?

Any other idea is highly appreciated.

thx.

+1 Matt Sealey over 4 years ago

Hi Eli,

The simple answer is that there is no 'proper' way to flush the entire L2 cache at runtime. Also, 'flush all of the cache because my buffers are bigger' is somewhat of a fallacy. Stepping over the buffer by VA in cache-line increments is indeed 'slower' than invalidate by way, or clean and/or invalidate by set/way (especially at L2, where there's an extra conversion from VA to PA), because of the number of operations. But since the by-VA operations run in the background, they only stall the L2 command interface, whereas some of the atomic maintenance ops block it.
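To put rough numbers on 'the number of operations', here is a small sketch in plain C. The 32-byte line size and the 512 KB L2 size are assumptions made for illustration (they are not stated in this thread), and `line_ops_for_range` and `lines_in_l2` are hypothetical helper names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only: 32-byte lines, 512 KB L2. */
#define CACHE_LINE_BYTES 32u
#define L2_BYTES (512u * 1024u)

/* Number of per-line maintenance operations needed to cover
 * [start, start + len) when walking the range one line at a time. */
static size_t line_ops_for_range(size_t start, size_t len)
{
    if (len == 0)
        return 0;

    size_t first = start / CACHE_LINE_BYTES;
    size_t last = (start + len - 1) / CACHE_LINE_BYTES;

    return last - first + 1;
}

/* Number of lines a whole-cache operation has to visit. */
static size_t lines_in_l2(void)
{
    assert(L2_BYTES % CACHE_LINE_BYTES == 0);
    return L2_BYTES / CACHE_LINE_BYTES;
}
```

Under these assumptions a 64 MB buffer needs about two million by-VA operations, while the L2 holds only 16384 lines. That gap is why a whole-cache flush looks tempting, but the count says nothing about the blocking and locking costs described in this reply.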

Unfortunately, 'optimizing' cache operations can be dangerous. The idea of the full cache flush being faster helps on x86 since WBINVD is a single blocking instruction, but we have nothing like that on Arm (outside the ARM926 and Cortex-A73). You create software race conditions: one around holding a lock on the L2 controller to prevent concurrent access (this is handled by using cache MO[E]SI state, so you can immediately cause a false negative on the next lock attempt), plus a few microarchitectural problems from indiscriminate cache access while still allowing L1 linefills, fighting the prefetchers and maintenance broadcasts through the SCU, etc.

Rather than fight the prefetchers and forwarding logic, it is best to simply maintain to PoC by VA, since the range-based APIs will handle this relatively well. Optimization around dealing with large buffers should come from other methods. Perhaps before you read DMA data back into the cache you can invalidate stale data piecemeal as you process it in smaller sections, and for placing DMA source data in memory you can execute smaller back-to-back DMA operations? It really depends on how efficient your DMA controller and memory system are.
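The piecemeal idea can be sketched roughly as below, in plain C so it runs outside the kernel. Everything named here is hypothetical: `process_in_chunks`, `struct chunk_stats`, and the two stub callbacks are illustrations only. In a real driver the sync callback would presumably wrap a range-based call such as `dma_sync_single_range_for_cpu()`, and the chunk size would need to be a multiple of the cache line size with a line-aligned buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping used by the stub callbacks below. */
struct chunk_stats {
    size_t synced_bytes;   /* bytes handed to the sync hook */
    uint32_t checksum;     /* stand-in for the real processing result */
};

typedef void (*sync_fn)(size_t offset, size_t len, void *ctx);
typedef void (*process_fn)(const uint8_t *data, size_t len, void *ctx);

/* Stub for the per-chunk sync; a driver would do range-based
 * cache maintenance for [offset, offset + len) here. */
static void stub_sync_for_cpu(size_t offset, size_t len, void *ctx)
{
    struct chunk_stats *st = ctx;
    (void)offset;
    st->synced_bytes += len;
}

/* Stub processing step: sums the bytes of the chunk. */
static void sum_bytes(const uint8_t *data, size_t len, void *ctx)
{
    struct chunk_stats *st = ctx;
    for (size_t i = 0; i < len; i++)
        st->checksum += data[i];
}

/* Sync and process the buffer one chunk at a time instead of
 * syncing the whole buffer up front. Returns the chunk count. */
static size_t process_in_chunks(const uint8_t *buf, size_t total, size_t chunk,
                                sync_fn sync_for_cpu, process_fn process,
                                void *ctx)
{
    size_t chunks = 0;

    assert(chunk > 0);
    for (size_t off = 0; off < total; off += chunk) {
        size_t len = (total - off < chunk) ? total - off : chunk;

        sync_for_cpu(off, len, ctx);   /* make this section CPU-visible */
        process(buf + off, len, ctx);  /* then work on it while cached */
        chunks++;
    }
    return chunks;
}
```

The total amount of maintenance is the same as one big sync. What changes is that each section is processed right after it is synced, so the cost is spread across the processing loop rather than paid in one long stall. Whether that helps depends on the DMA controller and memory system, as noted above.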

Ta,

Matt

0 eli.z over 4 years ago in reply to Matt Sealey

Hi Matt,

Thanks for pointing out the problem in my suggestion.

From an algorithmic POV, fetching data via DMA while processing it is also a challenge, so I'll have to explore other alternatives.

Eli.