NEON intrinsics mixed with native C

Hello

I tested some source code that uses NEON intrinsics.

I wanted to compare the performance of a NEON version against a native C version.

The code is simple and does nothing meaningful.

It loads an image from storage and shifts each pixel value right by 2.

It then produces the output image using a 256 x 256 LUT.

========================

for loop

// load 8 interleaved RGB pixels; vld3_u8 takes a pointer, so pass the address

uint8x8x3_t rgb = vld3_u8(&img_data[i * width + j]);

// shift each channel right by 2

rgb.val[0] = vshr_n_u8(rgb.val[0], 2);

rgb.val[1] = vshr_n_u8(rgb.val[1], 2);

rgb.val[2] = vshr_n_u8(rgb.val[2], 2);

// look up the output values in the LUT, one lane at a time

int p = i * width + j;

uint8_t r, g, b;

// lane 0

r = vget_lane_u8(rgb.val[0], 0);

g = vget_lane_u8(rgb.val[1], 0);

b = vget_lane_u8(rgb.val[2], 0);

img_data[p + 0] = LUT[r][g];

img_data[p + 1] = LUT[g][b];

img_data[p + 2] = LUT[b][r];

// lane 1

r = vget_lane_u8(rgb.val[0], 1);

g = vget_lane_u8(rgb.val[1], 1);

b = vget_lane_u8(rgb.val[2], 1);

img_data[p + 3] = LUT[r][g];

img_data[p + 4] = LUT[g][b];

img_data[p + 5] = LUT[b][r];

// lane 2

...

// lane 7 (a uint8x8_t has lanes 0 to 7)

========================
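
The native C version is not shown above. A minimal sketch of what the scalar equivalent might look like follows. It assumes the image is interleaved RGB with 3 bytes per pixel (which is what vld3_u8 expects) and is updated in place. The names shift_and_lookup_scalar, img, n_pixels and lut are hypothetical, and the 256 x 256 LUT is passed here as a flat row-major array rather than as LUT[r][g].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline (not the original poster's code).
 * img:      n_pixels interleaved RGB pixels, 3 bytes each, modified in place
 * lut:      256 * 256 entries, row-major, so lut[a * 256 + b] plays the
 *           role of LUT[a][b] in the NEON listing (assumed layout) */
static void shift_and_lookup_scalar(uint8_t *img, size_t n_pixels,
                                    const uint8_t *lut)
{
    assert(img != NULL && lut != NULL);

    for (size_t k = 0; k < n_pixels; ++k) {
        uint8_t *px = img + 3 * k;

        /* shift each channel right by 2, so every index is in 0..63 */
        uint8_t r = (uint8_t)(px[0] >> 2);
        uint8_t g = (uint8_t)(px[1] >> 2);
        uint8_t b = (uint8_t)(px[2] >> 2);

        /* same index pairs as the NEON listing: [r][g], [g][b], [b][r] */
        px[0] = lut[(size_t)r * 256 + g];
        px[1] = lut[(size_t)g * 256 + b];
        px[2] = lut[(size_t)b * 256 + r];
    }
}
```

In this form the shifted values are already in scalar registers when the table is indexed, whereas the NEON listing has to move each lane out of a vector register with vget_lane_u8 before every lookup.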

I then compared the run time of the native C version with that of the NEON intrinsics version.

The native C version is about 15% faster than the NEON version.

I would like to know why.
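
How the time was measured is not stated. One possible way to do it is sketched below, assuming a POSIX system with clock_gettime. The helper names now_monotonic and elapsed_ms are hypothetical and are not part of the code that was tested.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime on strict compilers */
#endif
#include <time.h>

/* Hypothetical helper: read the monotonic clock, which is not affected
 * by wall-clock adjustments while the benchmark runs. */
static struct timespec now_monotonic(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

/* Hypothetical helper: difference between two timestamps in milliseconds. */
static double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Each version would be wrapped between two now_monotonic() calls and run many times over the same image, so that a difference of around 15% is not hidden by measurement noise.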