
Non-Temporal Writes in SIMD Instruction set

Note: This was originally posted on 21st March 2011 at http://forums.arm.com

The x86 platform supports what it calls non-temporal writes: stores from registers to memory that bypass the cache instead of filling it. They are purported to run faster. Are there similar instructions for NEON that would let us speed up a simple memory copy by writing directly to memory and bypassing the cache?
  • Note: This was originally posted on 22nd March 2011 at http://forums.arm.com


    No, ARM does not have these types of instruction. The only direct programmer control of memory cacheability is via the page tables.

    However, do you have any numbers to suggest that writing to uncached buffered memory is any faster than writing to cached memory on ARM? Most recent ARM cores optimize caches for memcpy performance, as it is a pretty common use case.




    I have no evidence because, as you said, there are no instructions that would allow this to occur. I can tell you that on x86 platforms it makes a big difference. It seems to me, based on what you have said, that the only way ARM could have optimized for memcpy would be to write straight through to memory without using the cache. But I doubt that is the design. The steps we use on x86 are: 1. Tickle a cache line by issuing a small load from an aligned memory location. The cache will fill the line (or way? I can't say I understand the difference between a line and a way). 2. Fetch the full block of memory, which now comes from the cache. 3. Write the data back out to non-overlapping memory using instructions that bypass the cache, because the cache is already being "tickled" for the next batch of data. All of this happens on the SIMD unit, so we move a lot of data per iteration using multi-load instructions.
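The three x86 steps described above can be sketched in C with SSE2 intrinsics. This is only an illustration under stated assumptions, not the poster's actual code: the helper name `stream_copy` is hypothetical, a 64-byte cache line is assumed, a prefetch hint stands in for the "tickle" load, and both buffers are assumed to be 16-byte aligned with a length that is a multiple of 64. On targets without SSE2 the sketch falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence */
#include <emmintrin.h>  /* _mm_load_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper (not from the thread): copy `bytes` from src to dst.
 * Assumptions: src and dst are 16-byte aligned, do not overlap,
 * and bytes is a multiple of the assumed 64-byte cache line. */
void stream_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(bytes % 64 == 0);

#if defined(__SSE2__)
    const char *s = (const char *)src;
    char *d = (char *)dst;

    for (size_t i = 0; i < bytes; i += 64) {
        /* Step 1: "tickle" the next source line so it is in flight
         * while the current one is processed (a hint, never faults). */
        _mm_prefetch(s + i + 64, _MM_HINT_NTA);

        /* Step 2: load the current line, ideally now served from cache. */
        __m128i r0 = _mm_load_si128((const __m128i *)(s + i));
        __m128i r1 = _mm_load_si128((const __m128i *)(s + i + 16));
        __m128i r2 = _mm_load_si128((const __m128i *)(s + i + 32));
        __m128i r3 = _mm_load_si128((const __m128i *)(s + i + 48));

        /* Step 3: non-temporal stores that bypass the cache. */
        _mm_stream_si128((__m128i *)(d + i), r0);
        _mm_stream_si128((__m128i *)(d + i + 16), r1);
        _mm_stream_si128((__m128i *)(d + i + 32), r2);
        _mm_stream_si128((__m128i *)(d + i + 48), r3);
    }
    /* Make the streamed stores globally visible before returning. */
    _mm_sfence();
#else
    /* No SSE2 available: ordinary cached copy. */
    memcpy(dst, src, bytes);
#endif
}
```

A caller would pass two aligned, non-overlapping buffers, for example `stream_copy(dst, src, 4096)`. Whether this beats a plain `memcpy` depends on the buffer size and the machine, so it is worth measuring rather than assuming.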
  • Note: This was originally posted on 22nd March 2011 at http://forums.arm.com

    Also, there is this article, which I am told is irrelevant for the Cortex-A9, but I have not been told why it is irrelevant. What has changed from the Cortex-A8 to the Cortex-A9 that renders the numbers in this article irrelevant?
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
  • Note: This was originally posted on 23rd March 2011 at http://forums.arm.com


    but I have not been told why it is irrelevant.


    Totally different memory system implementation in the two cores ...

    For sequential reads from cache the Cortex-A9 implements an integrated preload engine which is transparent to the programmer. It should always be "one step ahead" of the memcpy without the need for the programmer to tickle the buffer being read from. http://infocenter.ar...f/CHDFEFAH.html

    I've seen benchmarks that show copies from cached to uncached-buffered memory (equivalent to your uncached write) are slower than cached-to-cached copies. However, the result probably depends to some degree on memory latency, bandwidth, and similar factors.
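To get numbers for a particular board, a plain cached-to-cached NEON copy can be timed directly. The following is a rough sketch, not a tuned memcpy: the names `neon_copy` and `copy_mb_per_s` are hypothetical, the loop issues no explicit preload on the assumption that the hardware preloads sequential reads as described above, and the length is assumed to be a multiple of 16 bytes. On non-NEON targets it falls back to `memcpy` so the harness still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: ordinary cached copy, 16 bytes per iteration.
 * Assumption: bytes is a multiple of 16 and the buffers do not overlap. */
void neon_copy(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    assert(bytes % 16 == 0);
#if HAVE_NEON
    for (size_t i = 0; i < bytes; i += 16) {
        uint8x16_t q = vld1q_u8(src + i);  /* 128-bit load  */
        vst1q_u8(dst + i, q);              /* 128-bit store */
    }
#else
    memcpy(dst, src, bytes);
#endif
}

/* Hypothetical timing helper: approximate throughput in MB/s over `reps`
 * copies, using CPU time from clock(). Returns 0.0 if the elapsed time
 * is below the clock resolution. */
double copy_mb_per_s(uint8_t *dst, const uint8_t *src, size_t bytes, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        neon_copy(dst, src, bytes);
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;
    return (double)bytes * (double)reps / secs / 1e6;
}
```

For meaningful figures the buffers should be much larger than the last-level cache and `reps` large enough to swamp the timer resolution. Comparing against the C library's `memcpy` on the same buffers gives a useful baseline.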