[ARM926EJS] improve write miss

Note: This was originally posted on 5th October 2010 at http://forums.arm.com

Hello experts,

    The platform I am using is the ARM926EJ-S. The cache policy is write-back, read-allocate only.
    Profiling shows that the program I want to optimize has too many write misses (write buffer refill).
    Can anyone give me some guidelines or tricks to improve my program? Thanks.

BR,
Stanley
  • Note: This was originally posted on 6th October 2010 at http://forums.arm.com

    You'll get write buffer stalls if the memory you were writing was not actually cacheable, so it might be worth double checking that the memory being written is marked cacheable in the MMU tables.
    ==>Thanks for the reminder. I checked the MMU tables, and the region is marked cacheable.

    Otherwise, you may need to improve the spatial locality of your writes (search for "strip mining", "blocking" and/or "tiling").

    How much data are you reading/writing?  How big is your data cache?
    ==>The data cache is 16 KB, 4-way set-associative.
    What I want is to reduce the write misses in motion compensation: copy pixels from addrA, apply some filter operations, then write to addrB (I am fairly sure addrB is not in the D-cache). I have tried loading addrB into the cache before writing it, but this introduces extra cache read misses. Could you give me some advice on this situation? Thanks.
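
    For readers looking up "blocking" or "tiling": the idea is to restructure a loop nest so that each burst of reads and writes stays inside a small window of memory. The following is only a minimal sketch, using an image transpose because its writes are strided; it is not the motion-compensation code discussed here. The name transpose_tiled and the tile size of 16 are assumptions made for illustration and are not tuned for the ARM926EJ-S.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 16 /* assumed tile edge in pixels; not tuned for any target */

/* Transpose a w-by-h 8-bit image one tile at a time, so that each group
 * of writes stays inside a small TILE x TILE window of the destination
 * instead of striding across the whole output. */
static void transpose_tiled(const uint8_t *src, uint8_t *dst,
                            size_t w, size_t h)
{
    assert(src != dst); /* in-place transpose is not supported here */

    for (size_t ty = 0; ty < h; ty += TILE) {
        for (size_t tx = 0; tx < w; tx += TILE) {
            /* Clamp the tile at the right and bottom image edges. */
            size_t y_end = (ty + TILE < h) ? ty + TILE : h;
            size_t x_end = (tx + TILE < w) ? tx + TILE : w;

            for (size_t y = ty; y < y_end; y++) {
                for (size_t x = tx; x < x_end; x++) {
                    dst[x * h + y] = src[y * w + x];
                }
            }
        }
    }
}
```

    Whether this helps on a read-allocate-only cache depends on whether the destination lines are already resident, so it would need to be measured on the real target.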
  • Note: This was originally posted on 11th October 2010 at http://forums.arm.com

    If you're only writing to the output (not reading and writing) then I'm not sure there's much you can do besides
      - write the output in at least 32-bit chunks (maybe even larger, e.g. STM) -- writing bytes will stall the write buffer sooner
    ==> Yes, writing 32-bit chunks is better than writing single bytes. But how does STM help here? As far as we know, the ARM9's write buffer doesn't support write merging. Will an STM make all of its stores go to the same write buffer entry?
      - make sure you're only writing the output once
    ==> Yes, I am sure that in most cases the output is written only once.
      - write the output in consecutive ascending addresses (actually, that probably only helps if the output is already in the cache, which I'm guessing is not happening here)
    ==> Does the write order affect performance, whether or not the data is in the cache?
      - try to find out if the memory timing is set as fast as possible in whatever memory controller you're using
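
    The first suggestion in the list above (write at least 32 bits at a time) can be sketched as follows. This is a minimal illustration, not the original motion-compensation code: avg2 is a hypothetical stand-in for the real filter, and filter_row_words assumes a 4-byte-aligned destination, a pixel count that is a multiple of 4, and a little-endian memory layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real filter: rounded average of two pixels. */
static inline uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Filter n pixels and emit the results four at a time, so that each group
 * of four output bytes costs one 32-bit store instead of four byte stores.
 * Assumes dst is 4-byte aligned and n is a multiple of 4. Pixel i lands in
 * the lowest byte of the word, which matches a little-endian layout. */
static void filter_row_words(const uint8_t *a, const uint8_t *b,
                             uint32_t *dst, size_t n)
{
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        uint32_t w = (uint32_t)avg2(a[i],     b[i])
                   | (uint32_t)avg2(a[i + 1], b[i + 1]) << 8
                   | (uint32_t)avg2(a[i + 2], b[i + 2]) << 16
                   | (uint32_t)avg2(a[i + 3], b[i + 3]) << 24;
        dst[i / 4] = w; /* one word store per four pixels */
    }
}
```

    Building up several such words in registers before storing them may also let the compiler emit an STM, but that depends on the compiler and is worth checking in the generated assembly.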
  • Note: This was originally posted on 8th November 2010 at http://forums.arm.com

    > Will an STM make all of its stores go to the same write buffer entry?
    The write buffer on the 926 (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/I31031.html) can queue up 16 data words at 4 addresses. An STR (or STRH or STRB) that misses the cache (or is uncacheable) uses one data word and one address. An STM of N registers uses N data words and one address.

    Depending on your memory system, there may also be some benefit to using STM of 4 or 8 registers, since that will allow the 926 to use bursts on the external AHB bus (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/Cacjgjec.html).

    > Does the write order affect performance, whether or not the data is in the cache?
    I think I'm going to retract my "consecutive ascending addresses" comment. I was imagining a difference between consecutive ascending addresses and consecutive descending addresses, but I'm not sure it makes any difference, especially without write allocate (and maybe even with). For writes that miss the cache, except for the STM comments above, I don't think it will make any difference on the 926 (since it's not merging writes).
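
    The write buffer figures quoted above (16 data words, 4 addresses) can be turned into a small back-of-the-envelope model. This is only a sketch of the arithmetic: the names stores_that_fit and words_that_fit are invented for illustration, and the model assumes an empty buffer that does not drain while the stores are issued, which real hardware does not guarantee.

```c
#include <assert.h>

/* Capacities as quoted in the post above for the ARM926 write buffer. */
enum { WB_ADDR_SLOTS = 4, WB_DATA_WORDS = 16 };

/* Number of back-to-back store instructions, each writing words_per_store
 * data words (1 for STR/STRH/STRB, N for an STM of N registers), that fit
 * in an empty buffer before either the address slots or the data words
 * run out. Simplified: ignores the buffer draining in the meantime. */
static int stores_that_fit(int words_per_store)
{
    assert(words_per_store >= 1 && words_per_store <= WB_DATA_WORDS);

    int by_data = WB_DATA_WORDS / words_per_store;
    return (by_data < WB_ADDR_SLOTS) ? by_data : WB_ADDR_SLOTS;
}

/* Total data words queued by that many stores. */
static int words_that_fit(int words_per_store)
{
    return stores_that_fit(words_per_store) * words_per_store;
}
```

    Under these assumptions, four single STRs fill the address slots after queuing only 4 words, while four STMs of 4 registers queue all 16 words. Since an STRB also takes one data word and one address, four byte stores would fill the address slots after only 4 bytes.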


    Thanks for your reply.
    One more question: is there any way to preload the cache line where a write miss is about to happen? As far as I know, the ARM9 does not implement preload. What should I do to load the cache line in advance with minimal cost?
  • Note: This was originally posted on 8th November 2010 at http://forums.arm.com

    > What should I do to load the cache line in advance with minimal cost?


    You can't on an ARM9. It's a fully in-order core, so if you issue a load to the memory to act as a preload it is still going to block waiting for that "preload" to fill the cache, so you are going to stall just as long, just earlier. For an ARM9 the best you can do is not cause that line to get evicted in the first place, and to minimize the number of lines you need to load.

    It's another case where a newer core would help: ARM11 and the Cortex-R and Cortex-A families decouple the load pipeline from ALU execution, and only interlock when data that is needed is not yet available. That said, this is mostly useful for hiding a few cycles of latency, not for hiding many tens of cycles of cache miss overhead; preload is still a better solution for that.
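
    On the advice above to minimize the number of lines that need to be loaded: one way to see the effect of buffer alignment is to count how many cache lines a block touches. This is a minimal sketch, assuming 32-byte cache lines (worth confirming against the TRM for your part); lines_touched is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u /* assumed cache line size in bytes */

/* Number of distinct cache lines covered by len bytes starting at addr. */
static size_t lines_touched(uintptr_t addr, size_t len)
{
    assert(len > 0);

    uintptr_t first = addr / LINE_BYTES;             /* line of first byte */
    uintptr_t last  = (addr + len - 1) / LINE_BYTES; /* line of last byte  */
    return (size_t)(last - first + 1);
}
```

    For example, a 16-byte pixel row starting on a line boundary touches one line, while the same row starting 24 bytes into a line touches two, so aligning rows can halve the number of line fills for small blocks.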
  • Note: This was originally posted on 11th October 2010 at http://forums.arm.com

    It's probably worth saying that these are mostly symptoms of the ARM926 being a little bit long in the tooth; its age, and the desire for a small area on the process technology of 10 years ago, meant that the target gate count didn't allow many of these more advanced features.

    However, there are plenty of newer ARM core designs which do implement write-allocate caches, write buffer merging, larger numbers of write-buffer slots, etc. If you have the option of switching to something like an ARM11 MPCore, Cortex-R4, or a Cortex-A*, then you can avoid most of these issues ...
  • Note: This was originally posted on 5th October 2010 at http://forums.arm.com

    > The platform I am using is the ARM926EJ-S. The cache policy is write-back, read-allocate only.
    > Profiling shows that the program I want to optimize has too many write misses (write buffer refill).
    > Can anyone give me some guidelines or tricks to improve my program? Thanks.


    You'll get write buffer stalls if the memory you were writing was not actually cacheable, so it might be worth double checking that the memory being written is marked cacheable in the MMU tables.

    Otherwise, you may need to improve the spatial locality of your writes (search for "strip mining", "blocking" and/or "tiling").

    How much data are you reading/writing?  How big is your data cache?
  • Note: This was originally posted on 7th October 2010 at http://forums.arm.com

    If you're only writing to the output (not reading and writing) then I'm not sure there's much you can do besides
      - write the output in at least 32-bit chunks (maybe even larger, e.g. STM) -- writing bytes will stall the write buffer sooner
      - make sure you're only writing the output once
      - write the output in consecutive ascending addresses (actually, that probably only helps if the output is already in the cache, which I'm guessing is not happening here)
      - try to find out if the memory timing is set as fast as possible in whatever memory controller you're using
  • Note: This was originally posted on 11th October 2010 at http://forums.arm.com

    > Will an STM make all of its stores go to the same write buffer entry?
    The write buffer on the 926 (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/I31031.html) can queue up 16 data words at 4 addresses. An STR (or STRH or STRB) that misses the cache (or is uncacheable) uses one data word and one address. An STM of N registers uses N data words and one address.

    Depending on your memory system, there may also be some benefit to using STM of 4 or 8 registers, since that will allow the 926 to use bursts on the external AHB bus (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0198e/Cacjgjec.html).

    > Does the write order affect performance, whether or not the data is in the cache?
    I think I'm going to retract my "consecutive ascending addresses" comment. I was imagining a difference between consecutive ascending addresses and consecutive descending addresses, but I'm not sure it makes any difference, especially without write allocate (and maybe even with). For writes that miss the cache, except for the STM comments above, I don't think it will make any difference on the 926 (since it's not merging writes).