Memory copy using ARM NEON is not much faster than memcpy (only a small improvement)

I have re-implemented a cropped buffer copy using ARM NEON, but it does not seem to be significantly faster than memcpy.

https://godbolt.org/z/zv5aeTW1f

- I can see ld4 and st4 instructions in the generated code for the ARM NEON version.
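For readers who do not open the link, here is a minimal sketch of the kind of row copy that typically produces ld4/st4 on AArch64. It is not the code from the Godbolt link: the function name `copy_row` and the 64-byte step are assumptions made for illustration, and the non-NEON path simply falls back to memcpy so the sketch also builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the linked code): copy n bytes of one row.
 * On NEON targets the main loop moves 64 bytes per iteration with
 * vld4q_u8/vst4q_u8, which typically compile to ld4/st4 on AArch64. */
static void copy_row(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 64 <= n; i += 64) {
        uint8x16x4_t v = vld4q_u8(src + i); /* de-interleaving load of 64 bytes */
        vst4q_u8(dst + i, v);               /* re-interleaving store of 64 bytes */
    }
#endif
    /* Remaining tail bytes, or the whole row when NEON is unavailable. */
    memcpy(dst + i, src + i, n - i);
}
```

For a pure copy the de-interleaving done by ld4/st4 is not needed; plain `vld1q_u8`/`vst1q_u8` would move the same bytes.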

The average elapsed time over 430 iterations (frames) is as follows:

cpu  : 2150.4 microseconds

neon : 2074.7 microseconds

The source buffer is 4608x1366 and the destination (cropped) buffer is 1120x1366.
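For reference, the memcpy baseline for this kind of crop can be sketched as one memcpy per row. This is only an illustration of the technique, not my exact code: `crop_copy` and its parameters are hypothetical names, and `bpp` (bytes per pixel) is an assumption because the pixel format is not stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical baseline: copy a crop_w x crop_h window whose top-left corner
 * is (x0, y0) from a src_w x src_h image into a tightly packed destination.
 * bpp (bytes per pixel) is an assumed parameter. */
static void crop_copy(uint8_t *dst, const uint8_t *src,
                      size_t src_w, size_t src_h,
                      size_t x0, size_t y0,
                      size_t crop_w, size_t crop_h, size_t bpp)
{
    /* The crop window must lie inside the source image. */
    assert(x0 + crop_w <= src_w && y0 + crop_h <= src_h);

    const size_t src_stride = src_w * bpp;  /* bytes per source row */
    const size_t row_bytes  = crop_w * bpp; /* bytes per destination row */
    const uint8_t *s = src + y0 * src_stride + x0 * bpp;

    for (size_t y = 0; y < crop_h; ++y) {
        memcpy(dst, s, row_bytes); /* one contiguous copy per row */
        s   += src_stride;
        dst += row_bytes;
    }
}
```

With the sizes above and an assumed crop origin of (0, 0), the call would look like `crop_copy(dst, src, 4608, 1366, 0, 0, 1120, 1366, bpp)`.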

 

In fact, I was expecting a speedup of more than 2x, but there was no such improvement. Is that an unrealistic expectation for an operation that only copies memory, like memcpy?
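To put the numbers in perspective, the timings can be converted into a speedup ratio and an approximate write throughput. The helpers below are a small sketch with hypothetical names; `bpp` is again an assumption, so the throughput only becomes meaningful once the real bytes per pixel is plugged in.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of baseline time to candidate time; 1.0 means no change. */
static double speedup(double base_us, double cand_us)
{
    assert(cand_us > 0.0);
    return base_us / cand_us;
}

/* Destination bytes written per second, in MB/s (1 MB = 1e6 bytes).
 * Bytes divided by microseconds gives MB/s directly.
 * bpp (bytes per pixel) is an assumed parameter. */
static double write_mb_per_s(size_t w, size_t h, size_t bpp, double us)
{
    assert(us > 0.0);
    return (double)(w * h * bpp) / us;
}
```

With the measurements above, `speedup(2150.4, 2074.7)` is about 1.036, i.e. roughly 3.5% less time. Assuming 1 byte per pixel, `write_mb_per_s(1120, 1366, 1, 2150.4)` is about 711 MB/s, and the figure scales linearly with the actual `bpp`. If the real figure is close to what the platform's memory can sustain, the copy would be limited by memory bandwidth rather than by the instructions used, and both versions would be expected to perform similarly.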

Thanks!