I have re-implemented a cropped buffer copy using ARM NEON, but it does not seem to be significantly faster than memcpy.
https://godbolt.org/z/zv5aeTW1f
- I can see ld4 and st4 instructions in the generated code for the ARM NEON version
The average elapsed time over 430 iterations (frames) is as follows:
cpu : 2150.4 microsec
neon : 2074.7 microsec
The source buffer is 4608x1366 and the destination (cropped) buffer is 1120x1366.
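For readers who do not want to open the link, the kind of copy being compared looks roughly like the sketch below. This is not the code from the Godbolt link. The function names, the `crop_x` offset, the one-byte-per-element layout and the tightly packed destination are all assumptions made for illustration, and the NEON path uses plain `vld1q_u8`/`vst1q_u8` rather than the ld4/st4 pattern mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Baseline (hypothetical name): copy a crop_w x rows window that starts at
 * byte column crop_x of each source row, using one memcpy per row.
 * Assumes 1 byte per element and a tightly packed destination. */
static void crop_copy_memcpy(uint8_t *dst, const uint8_t *src,
                             size_t src_stride, size_t crop_x,
                             size_t crop_w, size_t rows)
{
    assert(crop_x + crop_w <= src_stride);
    for (size_t y = 0; y < rows; ++y)
        memcpy(dst + y * crop_w, src + y * src_stride + crop_x, crop_w);
}

/* NEON variant (hypothetical name): same result, 16 bytes per load/store.
 * On targets without NEON only the scalar tail loop runs. */
static void crop_copy_neon(uint8_t *dst, const uint8_t *src,
                           size_t src_stride, size_t crop_x,
                           size_t crop_w, size_t rows)
{
    assert(crop_x + crop_w <= src_stride);
    for (size_t y = 0; y < rows; ++y) {
        const uint8_t *s = src + y * src_stride + crop_x;
        uint8_t *d = dst + y * crop_w;
        size_t x = 0;
#if defined(__ARM_NEON)
        for (; x + 16 <= crop_w; x += 16)
            vst1q_u8(d + x, vld1q_u8(s + x));   /* 128-bit load, then store */
#endif
        for (; x < crop_w; ++x)                 /* leftover bytes */
            d[x] = s[x];
    }
}
```

With the sizes in this post, and still assuming one byte per element, the call would be something like `crop_copy_neon(dst, src, 4608, crop_x, 1120, 1366)`. Both functions read and write exactly the same bytes and differ only in the instructions used to move them.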
I was expecting a speedup of more than 2x, but the measured gain is only about 3.6%. Is that an unrealistic expectation for an operation that only copies memory, like memcpy?
Thanks!