Hello everyone,
I'm currently converting some simple image-processing functions to NEON to improve performance. However, when I try it with a simple one, the output image is not the same as the one produced by the original (non-NEON) function:
#include <arm_neon.h>

static void foo_neon(unsigned char* dst, const unsigned char* src, int xs, int ys)
{
    int i, j;
    uint8x16_t vectA, vectB, vectC;

    for (i = 0; i <= ys/2 - 1; i++) {
        for (j = 0; j <= xs/2 - 1; j += 16) {
            vectA = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs + (j*2)]), 1);
            vectB = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs + (j*2+1)]), 1);
            vectC = vhaddq_u8(vectA, vectB);
            vst1q_u8(&dst[i*(xs/2) + j], vectC);
        }
    }
}

static void foo(unsigned char* dst, const unsigned char* src, int xs, int ys)
{
    int i, j;

    for (i = ys/2 - 1; i >= 0; i--) {
        for (j = xs/2 - 1; j >= 0; j--)
            dst[i*(xs/2) + j] = (unsigned char)((src[(i*2)*xs + (j*2)] + src[(i*2)*xs + (j*2+1)]) / 2);
    }
}
This code takes the input image buffer (RAW format), does some simple processing, and writes the result to another image buffer. Here xs is the width and ys is the height of the image.
Is there something wrong with my conversion to NEON? As far as I can see, there is nothing wrong. The only thing I suspect is that there may be some saturation when I add and then halve the result in NEON. I would really appreciate your help in clarifying this.
Thank you.
Yes, I know. But why shift the input data right? In the simple algorithm you compute c = (a+b)/2, while in the NEON version you compute c = (a/2 + b/2)/2. For example, with a = 101 and b = 103, the first gives 102 but the second gives (50 + 51)/2 = 50.
Ah, sorry, it seems I didn't delete the shift part before posting the code here. But the result is still wrong even if I just load the data and do the halving add.
The non-NEON version computes in "int" because the compiler promotes the operands, but the NEON version works in real 8 bits. Try c = ((a + b) & 0xff)/2.
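To see the difference concretely, here is a small stand-alone plain-C demo (no NEON) of what that masking simulates; the values are just an example:

#include <stdio.h>

int main(void)
{
    unsigned char a = 200, b = 200;
    int promoted = (a + b) / 2;           /* operands promoted to int: 400 / 2 = 200 */
    int wrapped  = ((a + b) & 0xff) / 2;  /* forced 8-bit wrap: 400 -> 144, 144 / 2 = 72 */
    printf("promoted: %d, wrapped: %d\n", promoted, wrapped);
    return 0;
}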
The point is that the non-NEON version outputs the correct result and the NEON version doesn't. So it seems the "int" promotion is the difference. Is there any way to make the NEON version behave like the non-NEON one? Maybe by changing the data type?
I now load the data into NEON as uint8_t, then widen it from a uint8_t register to a uint16_t register (to avoid overflow). Then I perform vhaddq(). Finally, I narrow back from uint16_t to uint8_t and store with vst1_u8(). However, the result is still wrong, so it doesn't seem to be an overflow or alignment problem.
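Roughly, the body of the inner loop now looks something like this (a simplified sketch of the sequence just described, 8 pixels per iteration; the src indexing is the same as in the original code):

static void halve_widened(unsigned char* dst, const unsigned char* src, int xs, int i, int j)
{
    uint8x8_t  a8  = vld1_u8(&src[(i*2)*xs + (j*2)]);
    uint8x8_t  b8  = vld1_u8(&src[(i*2)*xs + (j*2+1)]);
    uint16x8_t a16 = vmovl_u8(a8);             /* widen to 16 bits */
    uint16x8_t b16 = vmovl_u8(b8);
    uint16x8_t avg16 = vhaddq_u16(a16, b16);   /* (a + b) >> 1, no overflow possible */
    uint8x8_t  avg8  = vmovn_u16(avg16);       /* narrow back to 8 bits */
    vst1_u8(&dst[i*(xs/2) + j], avg8);
}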
Ah, problem solved. It turns out that the bug is in src[ ... + (2*j)]: in the NEON function I load 8 adjacent elements instead of "take one, skip one" like the scalar code does.
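For completeness, one way to get the "take one, skip one" access pattern is a de-interleaving load with vld2q_u8. A sketch of that approach (not necessarily the exact code I ended up with; it assumes xs is a multiple of 32 so there is no scalar tail to handle):

#include <arm_neon.h>

static void foo_neon_fixed(unsigned char* dst, const unsigned char* src, int xs, int ys)
{
    int i, j;
    for (i = 0; i < ys/2; i++) {
        const unsigned char* row = &src[(i*2)*xs];
        for (j = 0; j < xs/2; j += 16) {
            /* vld2q_u8 loads 32 bytes and de-interleaves them:
               pair.val[0] = row[2*j], row[2*j+2], ...   (even indices)
               pair.val[1] = row[2*j+1], row[2*j+3], ... (odd indices)  */
            uint8x16x2_t pair = vld2q_u8(&row[j*2]);
            /* Halving add: (a + b) >> 1, computed without losing the carry. */
            uint8x16_t avg = vhaddq_u8(pair.val[0], pair.val[1]);
            vst1q_u8(&dst[i*(xs/2) + j], avg);
        }
    }
}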