ARMv7-A: Cache maintenance operation by VA, performance

Hi,

According to this talk, cache maintenance should always be performed by VA rather than by set/way, except during boot or shutdown. However, invalidating or cleaning a block of data by VA requires a loop over the entire memory block, in steps equal to the cache line size. If the memory block is large (say, a frame buffer that will subsequently be read by DMA on a system without hardware cache coherency), the loop counting alone adds significant overhead. If the memory block is much larger than the cache, most of the maintenance operations will be NOPs because the targeted addresses aren't cached, but the software can't know that. Maintenance by set/way, using a loop that iterates through all sets and ways (such as the example given on p. 8-20 of the Cortex-A Programmer's Guide), has a fixed runtime independent of the actual buffer size and will be faster for large buffers.
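
For reference, the by-VA loop I have in mind looks roughly like the sketch below. It is only an illustration, not tested on hardware: `dc_clean_range`, `dc_clean_line_by_va`, the `dc_ops_issued` counter and the `USE_CP15` build macro are names I made up, and the line size is passed in by the caller rather than read from the cache type registers. With `USE_CP15` undefined, the function only counts the operations it would issue, so the loop can be checked on a host machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter: number of per-line operations issued so far. */
static unsigned long dc_ops_issued;

/* Clean one data cache line by VA to the point of coherency (DCCMVAC).
 * The CP15 instruction is privileged, so it is only emitted when the
 * hypothetical USE_CP15 macro is defined for a bare-metal ARMv7-A build. */
static inline void dc_clean_line_by_va(uintptr_t va)
{
#if defined(USE_CP15) && defined(__arm__)
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(va) : "memory");
#else
    (void)va; /* host build: count only */
#endif
    dc_ops_issued++;
}

/* Clean every cache line that overlaps [start, start + size).
 * line_size must be a power of two (64 bytes is assumed in the usage below). */
static void dc_clean_range(const void *start, size_t size, size_t line_size)
{
    if (size == 0)
        return;

    /* Round the start address down to a line boundary. */
    uintptr_t addr = (uintptr_t)start & ~((uintptr_t)line_size - 1u);
    uintptr_t end  = (uintptr_t)start + size;

    for (; addr < end; addr += line_size)
        dc_clean_line_by_va(addr);

#if defined(USE_CP15) && defined(__arm__)
    /* Wait for the cleans to complete before the DMA is started. */
    __asm__ volatile("dsb" : : : "memory");
#endif
}
```

The loop body is a single coprocessor write per line, so the number of iterations is simply the buffer size divided by the line size, plus one if the buffer is not line-aligned.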

A frame buffer could probably just be marked as uncached, but that is not a general solution for every use case. So how should the cache be correctly invalidated/cleaned for large memory buffers? If I am using a single-core Cortex-A8 with no L3 cache, would set/way be correct?

Thanks,
Niklas

Parents
  • I am not sure whether you can find out which set/way a given VA is in. But even if you can, I guess the overhead would be much larger.

    You have to iterate through all VAs in steps of the line size anyway. In addition, you would need to compute the set/way _and_ keep track of whether that set/way has already been flushed.

    Since the L1 + L2 caches are usually larger than the buffer, I see no benefit at all.

    In the end, writing the data back to main memory takes much more time than the loop overhead and/or computing the set/way.
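
To put rough numbers on the loop-count argument in this thread, here is a small sketch that only counts maintenance operations; it says nothing about the write-back time itself. The helper names `ops_by_va` and `ops_by_set_way` are made up for illustration, and the sizes in the usage note (64-byte lines, 32 KB L1, 256 KB L2, an 800x480 32-bit frame buffer) are assumed example values, not taken from the original post.

```c
#include <stddef.h>

/* Number of by-VA operations needed to cover `size` bytes, assuming the
 * buffer starts on a line boundary (an unaligned start can add one more). */
static size_t ops_by_va(size_t size, size_t line_size)
{
    return (size + line_size - 1) / line_size;
}

/* Number of set/way operations needed for one whole cache level:
 * sets * ways, which equals cache_size / line_size. */
static size_t ops_by_set_way(size_t cache_size, size_t line_size)
{
    return cache_size / line_size;
}
```

With those assumed sizes, `ops_by_va(800 * 480 * 4, 64)` gives 24000 operations, while `ops_by_set_way(32 * 1024, 64) + ops_by_set_way(256 * 1024, 64)` gives 512 + 4096 = 4608. The loop counts are equal when the buffer is as large as the caches combined, so the by-VA loop issues fewer operations whenever the buffer is smaller than that.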
