Flushing all L1 & L2 caches under Linux (kernel space) - optimizing dma-mapping API


In my system (Cyclone V, two Cortex-A9 cores) I need large DMA transfers, and currently I can't connect the DMA through the ACP, so cache coherency becomes a software problem. I know that the proper way of handling this under Linux is the DMA-mapping API, and it did the job when I integrated it into my driver. However, since I am working with large buffers (32 MB, 64 MB, 128 MB), syncing the buffers for the CPU or for the device takes too long, and I'd like to optimize it. I also need these buffers to be cached, because I do some processing on this memory area, so making them non-cached (coherent) hurts performance.

The approach I investigated is to replace the range-based flushing APIs with flushing the entire caches. Since the caches themselves are considerably smaller than the buffers, this makes sense. I used the following APIs, which perform clean + invalidate, for this purpose:

flush_cache_all() --> for L1

outer_flush_all() --> for L2
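
To put rough numbers on the reasoning above, here is a small user-space sketch that compares how many cache lines a range-based operation has to step through with how many lines the caches actually hold. The sizes are assumptions I took for the Cyclone V HPS (32-byte lines, 32 KB L1 data cache per core, 512 KB shared L2) and should be checked against the device handbook; the helper names are made up for illustration. Line count is only a crude proxy for the real maintenance cost.

```c
#include <stddef.h>

/* Assumed cache geometry (verify against the device handbook). */
#define CACHE_LINE_BYTES 32UL
#define L1_DCACHE_BYTES  (32UL * 1024UL)   /* per core */
#define L2_CACHE_BYTES   (512UL * 1024UL)  /* shared   */

/* Lines a range-based clean/invalidate walks for a buffer of `bytes`
 * (rounded up to whole cache lines). */
static unsigned long lines_in_range(unsigned long bytes)
{
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Lines touched by a whole-cache flush of one core's L1 D-cache plus L2. */
static unsigned long whole_cache_lines(void)
{
    return lines_in_range(L1_DCACHE_BYTES) + lines_in_range(L2_CACHE_BYTES);
}

/* How many times more lines the range walk covers than a full flush. */
static unsigned long range_to_full_ratio(unsigned long buffer_bytes)
{
    return lines_in_range(buffer_bytes) / whole_cache_lines();
}
```

Under these assumptions a 32 MB buffer spans 1,048,576 lines, while one L1 data cache plus the L2 hold 17,408 lines, so the range walk covers roughly 60 times more lines than a full flush would (about 240 times for 128 MB).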

The problem is that when I try to flush L2 I get a kernel panic. The documentation notes that this operation isn't atomic, so IRQs must be disabled and there must be no other active L2 masters. This is of course a problem in an MPCore system. BTW, when I disable the second core, this API completes successfully.

So, is there a proper way to flush the entire L2 cache?

Any other ideas are highly appreciated.