Hi,
according to this talk, cache maintenance should always be performed by VA and not by set/way, except during boot or shutdown. However, invalidating/cleaning a block of data by VA requires a loop over the entire memory block (in steps of the cache line size). If the memory block is large (say, a frame buffer that is subsequently read by DMA and there is no hardware cache coherency), the loop overhead alone is significant. If the memory block is much larger than the cache, most of the maintenance operations are effectively NOPs because the targeted addresses are not cached, but the software cannot know that. Maintenance by set/way, with a loop that iterates through all ways and sets (such as the example on p. 8-20 of the Cortex-A Programmer's Guide), has a fixed runtime independent of the buffer size and will be faster for large buffers.
A framebuffer could probably simply be marked as uncached, but that is no general solution for every use case. So, how do I correctly invalidate/clean the cache for large memory buffers? Since I am using a single-core Cortex-A8 with no L3 cache, would set/way be correct?
Thanks,
Niklas
I am not sure whether you can find out which set/way a certain VA is in. But even if you can, I guess the overhead is much larger.
You have to iterate over all VAs in steps of the line size anyway. But additionally you need to compute the set/way _and_ keep track of whether that set/way has already been flushed or not.
Since the L1 + L2 caches are usually larger than the buffer, I see no benefit at all.
In the end, writing data back to main memory takes much more time than the loop overhead and/or the set/way computation.
Yes, you can tell the set, but not the way the address is in. To find out where the address actually sits, you have to go through all the ways.
However, that is not difficult, as caches are usually 2- to 16-way set-associative.
Thanks for the reply. I am working with a TI Sitara AM3358 Cortex-A8 processor. It has a 32 KiB L1 D-cache and a 256 KiB unified L2 cache. I am using an in-memory framebuffer for the LCD controller. For a resolution of 480x272 at 24-bit RGB color, I get a buffer size of 382 KiB, which is larger than both caches together. I am planning to increase the resolution to perhaps 1366x768, which will result in a buffer size of 3 MiB, much larger than the caches. I am using this routine for cleaning the buffer:
.type CleanDataCacheAreaPoC, %function
.align 6
CleanDataCacheAreaPoC:
    cmp r0, r1
    bxcs lr                     @ If r0 >= r1, return
    mrc p15, 0, r2, c0, c0, 1   @ Read CTR (Cache Type Register)
    ubfx r2, r2, #16, #4        @ Extract DminLine - size of the smallest cache line in Log2(words)
    mov r3, #4                  @ Word size
    mov r3, r3, lsl r2          @ Calculate cache line size in bytes
    sub r2, r3, #1              @ Convert to bit mask
    bic r0, r2                  @ Clear bits in start address, i.e. truncate to begin of cache line
1:  mcr p15, 0, r0, c7, c10, 1  @ DCCMVAC, clean data or unified cache line by address to PoC
    add r0, r3                  @ Increment by cache line size
    cmp r0, r1
    blo 1b                      @ Did not reach end? Jump back
    dsb
    bx lr
Even for the low resolution, this takes twice as long as the set/way method from the Cortex-A Programmer's Guide. For the increased buffer size it will be a lot slower still, while the set/way method stays as fast as before.
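For reference, the set/way routine I am comparing against looks roughly like this. It is my adaptation of the Programmer's Guide example (the name CleanDataCacheSetWay and the exact register usage are mine), so please treat it as an untested sketch rather than verified code:

.type CleanDataCacheSetWay, %function
.align 6
CleanDataCacheSetWay:           @ Name and register allocation are my own
    push {r4-r11}
    mrc p15, 1, r0, c0, c0, 1   @ Read CLIDR
    ands r3, r0, #0x07000000    @ Extract LoC (Level of Coherency)
    mov r3, r3, lsr #23         @ r3 = LoC * 2 (aligned with the level field)
    beq 5f                      @ No cache levels to clean? Return
    mov r10, #0                 @ Current cache level * 2
1:  add r2, r10, r10, lsr #1    @ r2 = 3 * cache level
    mov r1, r0, lsr r2          @ Bottom 3 bits = cache type of this level
    and r1, r1, #7
    cmp r1, #2
    blt 4f                      @ No data/unified cache at this level? Skip it
    mcr p15, 2, r10, c0, c0, 0  @ Write CSSELR, select this data/unified cache
    isb                         @ Sync the selection before reading CCSIDR
    mrc p15, 1, r1, c0, c0, 0   @ Read CCSIDR
    and r2, r1, #7              @ Extract line size field
    add r2, r2, #4              @ r2 = Log2(line length in bytes)
    movw r4, #0x3FF
    ands r4, r4, r1, lsr #3     @ r4 = maximum way number (right aligned)
    clz r5, r4                  @ r5 = bit position of the way field
    mov r9, r4                  @ r9 = working copy of the maximum way number
2:  movw r7, #0x7FFF
    ands r7, r7, r1, lsr #13    @ r7 = maximum set number (right aligned)
3:  orr r11, r10, r9, lsl r5    @ Combine level and way
    orr r11, r11, r7, lsl r2    @ Combine set
    mcr p15, 0, r11, c7, c10, 2 @ DCCSW - clean line by set/way
    subs r7, r7, #1             @ Next set
    bge 3b
    subs r9, r9, #1             @ Next way
    bge 2b
4:  add r10, r10, #2            @ Next cache level
    cmp r3, r10
    bgt 1b
5:  dsb
    pop {r4-r11}
    bx lr

It cleans every data/unified cache level up to the Level of Coherency, so it also writes back dirty lines that have nothing to do with the frame buffer; for buffers much larger than the caches that hardly matters.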
Ok, for one, I would not cache the screen buffer. I guess the LCD controller uses DMA to read the data.
Of course it depends on how you update the screen buffer, but you will likely get little or no benefit from caching it.
My personal feeling is that with a single CPU, an operation by set/way should be quicker. Set/way operations are local to the CPU, so other CPUs are not aware of them and coherency can be broken; but you need not worry about that, since you have a single CPU.
The code in U-Boot for CMOs (cache maintenance operations) always, or almost always (excluding cases I am not aware of), uses CMOs by set/way. The cost is that you have to perform the operation on each cache level, but it should indeed be quicker than the equivalent operation by VA.
The only thing is that if you want to perform an operation on a specific address, you have to compute the set from it and then run through all the ways at a given cache level, and then do it again at the next cache level. By doing that you may, e.g., flush more than you wanted.
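Just to illustrate, a per-address clean by set/way for a single cache level could look roughly like this (untested sketch, my own naming and register allocation; it assumes CSSELR already selects the data/unified cache level you want, that you repeat it for each level, and that the address you pass matches how the cache is indexed, e.g. a flat/identity mapping):

@ Untested sketch: clean the cache line(s) that could hold the address in r0,
@ by set/way, in the data/unified cache currently selected by CSSELR.
.type CleanAddrBySetWay, %function
CleanAddrBySetWay:
    push {r4-r7}
    mrc p15, 1, r1, c0, c0, 0   @ Read CCSIDR of the selected cache
    and r2, r1, #7              @ Extract line size field
    add r2, r2, #4              @ r2 = Log2(line length in bytes)
    ubfx r3, r1, #3, #10        @ r3 = associativity - 1 (maximum way number)
    ubfx r4, r1, #13, #15       @ r4 = number of sets - 1
    clz r5, r3                  @ r5 = bit position of the way field
    mov r6, r0, lsr r2          @ Line number of the address
    and r6, r6, r4              @ Set index = line number modulo number of sets
    mov r6, r6, lsl r2          @ Shift the set index into the set field
    mrc p15, 2, r7, c0, c0, 0   @ Read CSSELR
    and r7, r7, #0x0E           @ Keep the level bits [3:1]
    orr r6, r6, r7              @ Merge the level into the set/way word
1:  orr r7, r6, r3, lsl r5      @ Merge in the current way number
    mcr p15, 0, r7, c7, c10, 2  @ DCCSW - clean this set in this way
    subs r3, r3, #1             @ Next way
    bge 1b
    dsb
    pop {r4-r7}
    bx lr

As you can see, it cleans the chosen set in every way, so it may write back lines that have nothing to do with the address; that is the over-flushing I mentioned.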
That's right, but there might be other cases where a cache is practically required, e.g. when receiving a large data block via USB, Ethernet or MMC/SDIO (e.g. an SD card or an SDIO-based WiFi adapter), because multiple read accesses benefit from caching. Of course, for receiving I need to do an invalidate instead of a clean, but the issue stays the same. According to the mentioned talk, two invalidate loops are necessary (one before and one after the DMA transfer), which makes it even worse.
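For the receive case the routine would be essentially the same loop with DCIMVAC instead of DCCMVAC, roughly like this (untested sketch, name of my own choosing; it assumes the buffer is cache-line aligned, otherwise the partial first and last lines would need a clean-and-invalidate so that neighbouring dirty data is not thrown away):

.type InvalidateDataCacheAreaPoC, %function
.align 6
InvalidateDataCacheAreaPoC:
    cmp r0, r1
    bxcs lr                     @ If r0 >= r1, return
    mrc p15, 0, r2, c0, c0, 1   @ Read CTR (Cache Type Register)
    ubfx r2, r2, #16, #4        @ Extract DminLine
    mov r3, #4                  @ Word size
    mov r3, r3, lsl r2          @ Calculate cache line size in bytes
    sub r2, r3, #1              @ Convert to bit mask
    bic r0, r2                  @ Truncate start address to a line boundary
                                @ (assumes the buffer is line-aligned; partial first/last
                                @ lines would need DCCIMVAC so neighbouring dirty data
                                @ is not discarded)
1:  mcr p15, 0, r0, c7, c6, 1   @ DCIMVAC, invalidate data cache line by address to PoC
    add r0, r3                  @ Increment by cache line size
    cmp r0, r1
    blo 1b                      @ Did not reach end? Jump back
    dsb
    bx lr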
That's exactly what I thought. For such a large buffer all sets will be affected anyway, so I can just clean/invalidate them all. It is my understanding that there is no way to determine the way for a given MVA, so I'd have to clean/invalidate all of them. That might of course affect some other data, but usually this should not cause problems, just a (comparatively) small performance hit. Or could this accidentally invalidate, and thereby incorrectly discard, some dirty data? Hmm.
I think you meant all the ways of a given set. Again, the set is the index: an address may be placed in any one of the ways at that index, and all the ways making up one index are together referred to as a set.
If you are going to do an invalidate, you may invalidate more than you wanted. If the data being invalidated happens not to be in the other caches (the caches at a higher level) or in main memory, you may lose it. If you do a clean-and-invalidate, you are safe at all times.
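In a set/way loop like the one posted further up, that is just a different opcode on the maintenance instruction (fragment only; r11 stands for the combined level/way/set word):

    @ Invalidate only - a dirty line is simply discarded, its data may be lost:
    mcr p15, 0, r11, c7, c6, 2  @ DCISW - invalidate data/unified cache line by set/way
    @ Clean and invalidate - a dirty line is written back first, so nothing is lost:
    mcr p15, 0, r11, c7, c14, 2 @ DCCISW - clean and invalidate cache line by set/way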