
Memory copy using ARM NEON is not much better than memcpy (only a small improvement)

I have re-implemented a cropped buffer copy using ARM NEON, but it does not seem to be significantly faster than memcpy.

https://godbolt.org/z/zv5aeTW1f

- I can see ld4 and st4 instructions in the generated code for the ARM NEON version.

The average elapsed time over 430 iterations (frames) is as follows:

cpu  : 2150.4 microseconds

neon : 2074.7 microseconds

The source buffer is 4608x1366 and the destination (cropped) buffer is 1120x1366.
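To make the operation concrete, here is a minimal sketch of a row-by-row cropped copy with memcpy. It is not the code from the godbolt link: the helper name `crop_copy`, its parameters, and the one-byte-per-element layout are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the linked code): copy a window of
 * crop_w elements by rows rows, starting at column x0 of the source,
 * into a tightly packed destination.
 * Assumption: 1 byte per element; src_stride is the source row length. */
static void crop_copy(uint8_t *dst, const uint8_t *src,
                      size_t src_stride, size_t x0,
                      size_t crop_w, size_t rows)
{
    for (size_t y = 0; y < rows; ++y) {
        /* One contiguous copy per row; the bytes outside the crop are skipped. */
        memcpy(dst + y * crop_w, src + y * src_stride + x0, crop_w);
    }
}
```

With the sizes above this would be called as `crop_copy(dst, src, 4608, x0, 1120, 1366)`, which moves 1120 x 1366 = 1,529,920 elements per frame (about 1.5 MB if each element really is one byte).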

 

I was expecting a speedup of more than 2x, but the measured gain is only about 3.5%. Is that the wrong expectation for an operation that only copies memory, like memcpy?

Thanks!