I have re-implemented a cropped buffer copy using ARM NEON, but it does not seem to be significantly faster than memcpy.
https://godbolt.org/z/zv5aeTW1f
- I can see ld4 and st4 instructions in the ARM NEON version
The average elapsed time over 430 iterations (frames) is as follows:
cpu : 2150.4 microsec
neon : 2074.7 microsec
The source buffer is 4608x1366 and the destination (cropped) buffer is 1120x1366.
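For reference, a row-by-row memcpy crop of this shape can be sketched as below. This is only a minimal illustration, not the exact code from the godbolt link: the function name `crop_copy`, the parameter names, the crop offset `x0`, and the assumption of 1 byte per element with tightly packed rows are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a crop_w x height window that starts at column x0 of a
 * src_w-wide source buffer into a tightly packed destination.
 * Assumption: 1 byte per element, rows stored contiguously. */
void crop_copy(uint8_t *dst, const uint8_t *src,
               size_t src_w, size_t height,
               size_t x0, size_t crop_w)
{
    /* The crop window must fit inside one source row. */
    assert(x0 + crop_w <= src_w);

    for (size_t y = 0; y < height; ++y) {
        /* One contiguous copy per row; only crop_w of the src_w
         * elements in each source row are read. */
        memcpy(dst + y * crop_w, src + y * src_w + x0, crop_w);
    }
}
```

With the sizes above this would be called as `crop_copy(dst, src, 4608, 1366, x0, 1120)`, that is, 1366 separate copies of 1120 elements each.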
I was expecting a speedup of more than 2x, but there was no such improvement. Is that an unrealistic expectation for an operation that only copies memory, like memcpy?
Thanks!