[ARM926EJS] improve write miss

Note: This was originally posted on 5th October 2010 at http://forums.arm.com

Hello experts,

    The platform I am using is an ARM926EJ-S. The cache policy is write-back with read-allocate only.
    According to the profiling results, the program I want to optimize has too many write misses (write buffer refills).
    Can anyone give me some guidelines or tricks to improve my program? Thanks.

BR,
Stanley
  • Note: This was originally posted on 8th November 2010 at http://forums.arm.com

    What should I do to load the cache line in advance with minimal cost?


    You can't on an ARM9. It's a fully in-order core, so a load issued to act as a preload still blocks until that "preload" has filled the cache line: you stall for just as long, only earlier. On an ARM9 the best you can do is avoid evicting that line in the first place, and minimize the number of lines you need to load.
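
One way to read "minimize the number of lines you need to load" is as a data-layout question. The sketch below is only an illustration, not something from the original post: it assumes the 32-byte (8-word) cache line of the ARM926EJ-S, and `hot_state`, its fields, and the helper names are hypothetical. It packs frequently used fields into one line-aligned struct and counts how many lines a given address range spans.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte (8-word) cache lines, as on the ARM926EJ-S. */
#define CACHE_LINE_BYTES 32u

/* Number of cache lines spanned by the byte range [addr, addr + size). */
static size_t lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + size - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}

/* Hypothetical hot state: fields that are read and written together are
 * packed into exactly one cache line and aligned to a line boundary, so
 * touching any of them needs at most one line to be resident. */
struct hot_state {
    _Alignas(CACHE_LINE_BYTES) uint32_t head;
    uint32_t tail;
    uint32_t count;
    uint32_t flags;
    uint32_t stats[4];
};

/* Fails the build if a later edit pushes the struct past one line. */
static_assert(sizeof(struct hot_state) == CACHE_LINE_BYTES,
              "hot_state must fit in exactly one cache line");

/* Lines occupied by one hot_state object (1, given the alignment). */
static size_t hot_state_lines(const struct hot_state *s)
{
    return lines_spanned((uintptr_t)s, sizeof *s);
}
```

The same 32 bytes placed at an unaligned address would span two lines, which doubles the fills needed and the number of lines that must survive eviction. Whether this helps a particular program depends on its access pattern, so it is worth checking against the profile.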

    This is another case where a newer core would help - the ARM11 and the Cortex-R and Cortex-A families decouple the load pipeline from ALU execution, and only interlock when the data that is needed is not yet available. That said, this is mostly useful for hiding a few cycles of latency, not for hiding many tens of cycles of cache miss overhead - preload is still a better solution for that.
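
For readers on one of those newer cores, a software preload might look roughly like the sketch below. It is an illustration under assumptions, not code from the thread: it relies on the GCC/Clang builtin `__builtin_prefetch`, the 32-byte line size and the four-line prefetch distance are guesses that would need tuning per core, and `sum_with_prefetch` is a hypothetical function name. Per the reply above, this is not expected to help on an ARM9.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumptions for illustration only: line size and prefetch distance
 * differ between cores and should be tuned by measurement. */
#define CACHE_LINE_BYTES 32u
#define PREFETCH_AHEAD_LINES 4u

/* Sum an array while requesting lines a few iterations ahead, so the
 * fill can overlap with the additions on cores that support preload. */
uint32_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    const size_t words_per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    const size_t ahead = PREFETCH_AHEAD_LINES * words_per_line;
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        /* Issue one hint per line, and only for addresses inside the array.
         * Arguments: address, 0 = prefetch for read, 3 = high locality. */
        if (i % words_per_line == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the function returns the same result whether or not the target acts on it; the benefit, if any, shows up only in timing.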