You'll get write buffer stalls if the memory you are writing is not actually cacheable, so it is worth double-checking that the memory being written is marked cacheable in the MMU tables.
==> Thanks for the reminder. I checked the MMU and it is a cacheable region.

Otherwise, you may need to improve the spatial locality of your writes (search for "strip mining", "blocking" and/or "tiling"). How much data are you reading/writing? How big is your data cache?
==> The data cache is 16K, 4-way. What I want is to reduce the write misses in motion compensation (copy pixels from addrA, do some filter operations, then write to addrB; I am pretty sure addrB is not in the D-cache). I have tried loading addrB into the cache before writing it, but this also introduces extra cache read misses. Could you give some advice on this situation? Thanks.
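For reference, the "load addrB into the cache before writing" experiment described above could look roughly like the sketch below. It assumes a 32-byte (8-word) D-cache line and relies on the read-allocate policy: one read per line pulls the whole line in, so later stores hit the cache. `touch_lines` and `CACHE_LINE_BYTES` are names made up for this illustration. Note that each touched line still costs a line fill, which is most likely the extra read misses reported above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* assumption: 8-word D-cache line */

/*
 * Read one byte from every cache line overlapping [dst, dst + bytes)
 * so that a read-allocate cache fetches those lines before they are
 * written. Returns the number of lines touched.
 */
size_t touch_lines(const void *dst, size_t bytes)
{
    assert(dst != NULL || bytes == 0);
    if (bytes == 0)
        return 0;

    const volatile uint8_t *p = (const volatile uint8_t *)dst;
    const volatile uint8_t *end = p + bytes;
    size_t lines = 0;

    for (;;) {
        (void)*p;   /* volatile read: forces the load to be issued */
        lines++;

        /* Distance from p to the start of the next cache line. */
        size_t step = CACHE_LINE_BYTES
                    - (size_t)((uintptr_t)p & (CACHE_LINE_BYTES - 1u));
        if (step >= (size_t)(end - p))
            break;  /* the next line starts at or past the end */
        p += step;
    }
    return lines;
}
```

Whether this pays off depends on how many stores land in each line afterwards: a line that is fully overwritten has been read from memory for nothing.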
If you're only writing to the output (not reading and writing) then I'm not sure there's much you can do besides:

- write the output in at least 32-bit chunks (maybe even larger, e.g. STM); writing bytes will stall the write buffer sooner
==> Yes, writing 32-bit chunks is better than writing bytes. But how does STM help here? We know the ARM9's write buffer doesn't support write merging. Will STM make all stores write to the same write buffer entry?
- make sure you're only writing the output once
==> Yes, I am sure that in most cases the output is written only once.
- write the output in consecutive ascending addresses (actually, that probably only helps if the output is already in the cache, which I'm guessing is not happening here)
==> Does write order affect the performance if the data is in the cache or not in the cache?
- try to find out if the memory timing is set as fast as possible in whatever memory controller you're using
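As an illustration of the "at least 32-bit chunks" suggestion above, here is a minimal C sketch that filters four pixels, packs them into one word, and issues a single word store instead of four byte stores. The 2-tap average `avg2` is only a stand-in for the real motion compensation filter, and `mc_row_words` is a hypothetical name. The sketch assumes a 4-byte-aligned destination, a width that is a multiple of 4, `width + 1` readable source bytes, and little-endian pixel order within each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-tap rounding average, standing in for the real filter. */
static inline uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/*
 * Filter one row and store it one 32-bit word at a time.
 * Pixel x ends up in bits [8*(x%4) .. 8*(x%4)+7] of word x/4,
 * which matches byte order in memory on a little-endian system.
 */
void mc_row_words(uint32_t *dst, const uint8_t *src, size_t width)
{
    assert((width & 3u) == 0);           /* whole words only */
    assert(((uintptr_t)dst & 3u) == 0);  /* word-aligned output */

    for (size_t x = 0; x < width; x += 4) {
        uint32_t w = (uint32_t)avg2(src[x],     src[x + 1])
                   | (uint32_t)avg2(src[x + 1], src[x + 2]) << 8
                   | (uint32_t)avg2(src[x + 2], src[x + 3]) << 16
                   | (uint32_t)avg2(src[x + 3], src[x + 4]) << 24;
        dst[x / 4] = w;                  /* one word store per 4 pixels */
    }
}
```

Unrolling the loop so that four or eight words are computed before storing gives the compiler a chance to emit an STM, but that is up to the compiler; hand-written assembly would be needed to guarantee it.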
> Will STM make all stores write to the same write buffer entry?

The [url="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/I31031.html"]write buffer on the 926[/url] can queue up 16 data words at 4 addresses. An STR (or STRH or STRB) that misses the cache (or is uncacheable) will use one data word and one address. An STM of N registers will use N data words and one address.

Depending on your memory system, there may also be some benefit to using STM of 4 or 8 registers, since that will allow the 926 to use [url="http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/Cacjgjec.html"]bursts on the external AHB bus[/url].

> Does write order affect the performance if the data is in the cache or not in the cache?

I think I'm going to retract my "consecutive ascending addresses" comment. I was imagining a difference between consecutive ascending addresses and consecutive descending addresses, but I'm not sure it makes any difference, especially without write allocate (and maybe even with). For writes that miss the cache, except for the STM comments above, I don't think it will make any difference on the 926 (since it's not merging writes).
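To make the write buffer arithmetic above concrete, here is a toy C model of the figures quoted: 16 data words, 4 address slots, one address per store instruction. It deliberately ignores draining, so it only shows how many words fit into an empty buffer before the core would have to wait. `wb_words_queued` is a name invented for this sketch, not anything from the TRM.

```c
#include <assert.h>

enum {
    WB_DATA_WORDS = 16,  /* data words the write buffer can hold */
    WB_ADDR_SLOTS = 4    /* distinct addresses it can hold */
};

/*
 * Number of data words that fit into an empty write buffer when every
 * store instruction writes `regs_per_store` registers:
 * 1 models STR/STRH/STRB, 2..16 models an STM of that many registers.
 * Each store consumes one address slot and `regs_per_store` data words.
 */
unsigned wb_words_queued(unsigned regs_per_store)
{
    assert(regs_per_store >= 1 && regs_per_store <= 16);

    unsigned addrs = 0;
    unsigned words = 0;
    while (addrs < WB_ADDR_SLOTS &&
           words + regs_per_store <= WB_DATA_WORDS) {
        addrs++;
        words += regs_per_store;
    }
    return words;
}
```

Under this model, single STRs run out of address slots after 4 words, while STMs of 4 registers fill all 16 data words with the same 4 slots, which is why STM helps even without write merging.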
What should I do to load the cache line in advance with minimal cost?
The platform I am using is an ARM926EJ-S. The cache policy is write-back with read-allocate only. According to the profiling results, the program I want to optimize has too many write misses (write buffer refill). Can anyone give me some guidelines or tricks to improve my program? Thanks.