I have re-implemented a cropped buffer copy using ARM NEON, but it does not seem to be significantly faster than memcpy.
https://godbolt.org/z/zv5aeTW1f
- I can see ld4 and st4 instructions in the ARM NEON version
The average elapsed time over 430 iterations (frames) is as follows:
cpu : 2150.4 microsec
neon : 2074.7 microsec
The source buffer is 4608x1366 and the destination (cropped) buffer is 1120x1366.
I was expecting a speedup of more than 2x, but there was no such improvement. Is that a wrong expectation for an operation that only copies memory, like memcpy?
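
For reference, the memcpy baseline I am comparing against can be sketched roughly as below. This is a minimal sketch, not the exact code from the godbolt link: the names `crop_copy` and `copy_mb_per_s` are hypothetical, sizes are in bytes, and it assumes 1 byte per element and a tightly packed destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a crop_w x rows window out of a wider source
 * buffer, one row at a time. All sizes are in bytes (assumes 1 byte per
 * element, which may not match the real pixel format). */
static void crop_copy(uint8_t *dst, const uint8_t *src,
                      size_t src_stride, size_t crop_x,
                      size_t crop_w, size_t rows)
{
    /* The cropped window must fit inside one source row. */
    assert(crop_x + crop_w <= src_stride);

    for (size_t y = 0; y < rows; ++y) {
        /* The destination is tightly packed, so its stride equals crop_w. */
        memcpy(dst + y * crop_w, src + y * src_stride + crop_x, crop_w);
    }
}

/* Hypothetical helper: effective throughput, counting bytes written.
 * Bytes per microsecond is numerically equal to MB/s (10^6 bytes/s). */
static double copy_mb_per_s(size_t bytes, double microsec)
{
    return (double)bytes / microsec;
}
```

Under the 1-byte-per-element assumption, `copy_mb_per_s(1120 * 1366, 2074.7)` comes out at roughly 740 MB/s for the neon run, and the cpu run is only a few percent lower. Comparing that figure with the memory bandwidth of the platform is one way to check whether both versions are limited by memory rather than by the copy instructions.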
Thanks!