Hello
I tested some source code that uses NEON intrinsics.
I wanted to compare the performance of NEON intrinsics against native C.
The code is simple and is not meant to do anything meaningful.
It reads an image from storage, shifts each pixel right by 2,
and produces an output image using a 256 x 256 LUT.
========================
for loop
uint8x8x3_t rgb = vld3_u8(&img_data[i * width + j]); // vld3_u8 takes a pointer
// shift to right by 2
rgb.val[0] = vshr_n_u8(rgb.val[0], 2);
rgb.val[1] = vshr_n_u8(rgb.val[1], 2);
rgb.val[2] = vshr_n_u8(rgb.val[2], 2);
// get the data from LUT
int index = 0;
int p = i * width + j;
uint8_t r, g, b;
// lane 0
r = vget_lane_u8(rgb.val[0], 0);
g = vget_lane_u8(rgb.val[1], 0);
b = vget_lane_u8(rgb.val[2], 0);
img_data[p + 0] = LUT[r][g];
img_data[p + 1] = LUT[g][b];
img_data[p + 2] = LUT[b][r];
// lane 1
r = vget_lane_u8(rgb.val[0], 1);
g = vget_lane_u8(rgb.val[1], 1);
b = vget_lane_u8(rgb.val[2], 1);
img_data[p + 3] = LUT[r][g];
img_data[p + 4] = LUT[g][b];
img_data[p + 5] = LUT[b][r];
// lane 2
...
// lane 7
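For comparison, here is a minimal sketch of what a plain C version of the same operation could look like. It is not the code that was actually measured: the function name lut_remap_scalar, the num_pixels parameter, and the assumption that img_data holds interleaved 8-bit RGB (3 bytes per pixel) are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline. The name, the signature and the
 * interleaved-RGB layout (3 bytes per pixel) are assumptions made
 * for illustration; they are not taken from the measured code. */
static void lut_remap_scalar(uint8_t *img_data, size_t num_pixels,
                             uint8_t LUT[256][256])
{
    for (size_t k = 0; k < num_pixels; ++k) {
        uint8_t *px = &img_data[3 * k];

        /* shift each channel right by 2 (values become 0..63) */
        uint8_t r = px[0] >> 2;
        uint8_t g = px[1] >> 2;
        uint8_t b = px[2] >> 2;

        /* get the output data from the LUT, written back in place */
        px[0] = LUT[r][g];
        px[1] = LUT[g][b];
        px[2] = LUT[b][r];
    }
}
```

Calling lut_remap_scalar(img_data, width * height, LUT) would process a whole image in place, assuming width and height are given in pixels.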
I then compared the running time of the native C version and the NEON intrinsics version.
Native C was about 15% faster than NEON.
I would like to know why.
That is a real help to me.
I'm going to check the assembly code and try out the advice you gave me!
Thank you for understanding my poor English and for answering so kindly.
Thanks, Yasuhiko Koumoto!