Hello everyone,
I'm currently converting some simple scalar image processing functions to NEON to improve performance. However, when I tried a simple one, the output image of the NEON version was not the same as the output of the original function:
    static void foo_neon(unsigned char* dst, const unsigned char* src, int xs, int ys)
    {
        int i, j;
        uint8x16_t vectA, vectB, vectC;
        for (i = 0; i <= ys/2 - 1; i++)
        {
            for (j = 0; j <= xs/2 - 1; j += 16)
            {
                vectA = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs + (j*2)]), 1);
                vectB = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs + (j*2+1)]), 1);
                vectC = vhaddq_u8(vectA, vectB);
                vst1q_u8(&dst[i*(xs/2) + j], vectC);
            }
        }
    }

    static void foo(unsigned char* dst, const unsigned char* src, int xs, int ys)
    {
        int i, j;
        for (i = ys/2 - 1; i >= 0; i--)
        {
            for (j = xs/2 - 1; j >= 0; j--)
                dst[i*(xs/2) + j] = (unsigned char)((src[(i*2)*xs + (j*2)] +
                                                     src[(i*2)*xs + (j*2+1)]) / 2);
        }
    }
This code takes the input image buffer (RAW format), does some simple processing and writes the result to another image buffer. xs is the width and ys is the height of the image.
Is there something wrong with the conversion to NEON? From what I can see, nothing is wrong. The only thing I have doubts about is whether there is some saturation when I add and then halve the result in NEON. I would really appreciate your help in clarifying this.
Thank you.
Ah, problem solved. It turns out that the bug is in src[ ... + (2*j)]. In the NEON function, each vld1q_u8 loads 16 adjacent elements instead of doing "take one, ignore one" as the scalar code does.
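For anyone landing here later, below is a minimal sketch of one way to get the "take one, ignore one" access pattern, assuming the goal is to match foo exactly. It uses vld2q_u8, which loads 32 consecutive bytes and de-interleaves them into even-indexed and odd-indexed lanes. The names foo_ref and foo_neon_fixed are made up for this example, and the scalar tail loop plus the HAVE_NEON guard are additions so that the same file also builds on machines without NEON. Note also that vhaddq_u8 already computes (a + b) >> 1 without overflowing, so the extra vshrq_n_u8 in the posted code is not needed; with both, the result is roughly (a + b) / 4 rather than (a + b) / 2.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Scalar reference with the same arithmetic as foo in the question
 * (hypothetical name): average each horizontal pixel pair of every
 * second row. */
static void foo_ref(unsigned char *dst, const unsigned char *src, int xs, int ys)
{
    for (int i = 0; i < ys / 2; i++)
        for (int j = 0; j < xs / 2; j++)
            dst[i * (xs / 2) + j] =
                (unsigned char)((src[(i * 2) * xs + j * 2] +
                                 src[(i * 2) * xs + j * 2 + 1]) / 2);
}

/* Sketch of a corrected NEON version (hypothetical name). */
static void foo_neon_fixed(unsigned char *dst, const unsigned char *src, int xs, int ys)
{
    assert(dst != NULL && src != NULL && xs >= 0 && ys >= 0);

    const int half_w = xs / 2;
    for (int i = 0; i < ys / 2; i++) {
        const unsigned char *row = src + (size_t)(i * 2) * (size_t)xs;
        unsigned char *out = dst + (size_t)i * (size_t)half_w;
        int j = 0;
#if HAVE_NEON
        /* Each iteration reads 32 source bytes and writes 16 output bytes. */
        for (; j + 16 <= half_w; j += 16) {
            /* val[0] holds the even-indexed bytes, val[1] the odd-indexed ones. */
            uint8x16x2_t px = vld2q_u8(row + 2 * j);
            /* Halving add: (a + b) >> 1 per lane, no intermediate overflow. */
            vst1q_u8(out + j, vhaddq_u8(px.val[0], px.val[1]));
        }
#endif
        /* Scalar tail for widths that are not a multiple of 32 pixels
         * (and the whole row when NEON is unavailable). */
        for (; j < half_w; j++)
            out[j] = (unsigned char)((row[2 * j] + row[2 * j + 1]) / 2);
    }
}
```

Running both functions on the same input and comparing the two output buffers byte by byte is a quick way to check the conversion before trusting it on real images.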