Output image is wrong when using NEON intrinsics?

Hello everyone,

I'm currently converting some simple image processing functions to NEON intrinsics to improve performance. However, even with a simple one, the output image is not the same as the one produced by the original function:

static void foo_neon( unsigned char* dst, const unsigned char* src, int xs, int ys)
{
	int i,j;
	uint8x16_t vectA, vectB, vectC;
	for(i=0;i<=ys/2-1;i++)
	{
		for(j=0;j<=xs/2-1;j+=16)
		{
			vectA = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs+(j*2)]), 1);
			vectB = vshrq_n_u8(vld1q_u8(&src[(i*2)*xs+(j*2+1)]), 1);
			vectC = vhaddq_u8(vectA, vectB);
			vst1q_u8(&dst[i*(xs/2)+j], vectC);
		}
	}
}

static void foo( unsigned char* dst, const unsigned char* src, int xs, int ys)
{
	int i,j;

	for(i=ys/2-1;i>=0;i--)
	{
		for(j=xs/2-1;j>=0;j--)
			dst[i*(xs/2)+j] =(unsigned char)((src[(i*2)*xs+(j*2)]+ src[(i*2)*xs+(j*2+1)])/2);
	}
}

This code takes the input image buffer (RAW format), does some simple processing, and writes the result to another image buffer. xs is the width and ys is the height of the image.

Is there something wrong with the conversion to NEON? From what I can see, nothing is wrong. The only thing I suspect is that some saturation may occur when I add and then halve the result in NEON. I would really appreciate your help in clarifying this.

Thank you.


0 42Bastian Schick over 7 years ago

Why do you divide twice by two? In foo() you divide the result of the addition. In foo_neon() you also divide the original values by 2.

0 thanhvu94 over 7 years ago in reply to 42Bastian Schick

The vhaddq_u8() function already halves the result of the addition before storing it.

0 42Bastian Schick over 7 years ago in reply to thanhvu94

Yes, I know. But why shift the input data right? In the simple algorithm you compute c = (a+b)/2. In the NEON version you compute c = (a/2 + b/2)/2.
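
To make the difference concrete, here is a small scalar sketch in C that compares the two formulas. The helper names avg_once and avg_prescaled are made up for illustration and do not appear in the original code. Shifting both inputs first and then halving again yields roughly a quarter of the sum, so the result comes out about half as bright as intended.

```c
#include <stdint.h>

/* c = (a + b) / 2, as in foo(): the sum is computed in int, then halved once. */
uint8_t avg_once(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) / 2);
}

/* c = (a/2 + b/2) / 2, as in the posted foo_neon(): each input is shifted
 * right by one first, then the halving add divides by two a second time. */
uint8_t avg_prescaled(uint8_t a, uint8_t b)
{
    return (uint8_t)(((a >> 1) + (b >> 1)) >> 1);
}
```

For a = 200 and b = 100, avg_once gives 150 while avg_prescaled gives 75.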

0 thanhvu94 over 7 years ago in reply to 42Bastian Schick

Ah sorry, it seems I didn't delete the shift part before posting the code here. But the result is still wrong even if I just load the data and do the halving add.

0 42Bastian Schick over 7 years ago in reply to thanhvu94

The non-NEON version computes in "int" because the compiler promotes the operands of the expression. But the NEON version works in real 8 bits.
Try c = ((a + b) & 0xff)/2.
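
The integer-promotion point can be checked with a scalar sketch (the function names below are hypothetical). In C, a + b on two unsigned char values is evaluated as int, so the carry out of bit 7 survives; truncating the sum to 8 bits first loses it. Note that, as the Arm documentation describes it, the halving add behind vhaddq_u8() computes (a + b) >> 1 per lane without dropping that carry, so it should behave like the first function rather than the second.

```c
#include <stdint.h>

/* What foo() does: a and b are promoted to int, so the 9-bit sum is kept. */
uint8_t avg_promoted(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) / 2);
}

/* What a plain 8-bit add followed by a halving would do: the sum wraps
 * modulo 256 before it is divided. */
uint8_t avg_wrapped(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);   /* carry out of bit 7 is discarded */
    return (uint8_t)(sum / 2);
}
```

For a = 200 and b = 100, avg_promoted gives 150, while avg_wrapped gives 22 because 300 wraps to 44 before the division. The two only differ when a + b exceeds 255.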

0 thanhvu94 over 7 years ago in reply to 42Bastian Schick

The point is that the non-NEON version outputs the correct result and the NEON version doesn't. So it seems that the "int" is the problem. Is there any way to make the NEON version behave just like the non-NEON one? Maybe by changing the data type?

0 thanhvu94 over 7 years ago in reply to thanhvu94

I load the data into NEON registers as uint8_t, then widen it from uint8_t to uint16_t (to avoid overflow). Then I perform vhaddq(). Finally, I narrow the result back from uint16_t to uint8_t and store it with vst1_u8(). However, the result is still wrong. It doesn't seem to be a problem of overflow or wrong alignment.

+1 thanhvu94 over 7 years ago in reply to thanhvu94

Ah, problem solved. It turns out that the bug is in src[ ... + (2*j)]. In the NEON function, I load 8 adjacent elements instead of "take one, skip one".
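
The thread does not show the corrected code, so the following is only a sketch of one possible fix, not necessarily what was actually used. In the posted loop, lane k averages src[2*j+k] and src[2*j+k+1], whereas foo() pairs src[2*(j+k)] with src[2*(j+k)+1]. A de-interleaving load such as vld2q_u8() reads 32 consecutive bytes and separates the even-indexed from the odd-indexed ones, which is the "take one, skip one" pattern. The function name foo_deinterleaved is hypothetical, and the __ARM_NEON guard with a scalar fallback is an addition so that the sketch also builds on targets without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Averages each horizontal pixel pair on every second row, like foo(). */
void foo_deinterleaved(unsigned char *dst, const unsigned char *src, int xs, int ys)
{
    int out_w = xs / 2;                      /* output pixels per row */
    for (int i = 0; i < ys / 2; i++) {
        const unsigned char *row = &src[(i * 2) * xs];
        unsigned char *out = &dst[i * out_w];
        int j = 0;
#if defined(__ARM_NEON)
        /* vld2q_u8 reads 32 bytes and splits them: val[0] holds the even
         * pixels, val[1] the odd ones. Each iteration yields 16 outputs. */
        for (; j + 16 <= out_w; j += 16) {
            uint8x16x2_t px = vld2q_u8(&row[j * 2]);
            vst1q_u8(&out[j], vhaddq_u8(px.val[0], px.val[1]));
        }
#endif
        /* Scalar tail; this is the whole loop when NEON is unavailable. */
        for (; j < out_w; j++)
            out[j] = (unsigned char)((row[j * 2] + row[j * 2 + 1]) / 2);
    }
}
```

The scalar tail handles widths where xs/2 is not a multiple of 16, so the vector loop never reads or writes past the end of a row.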