
NEON intrinsics mixed with native C

Hello

I tested some source code that uses NEON intrinsics.

I wanted to compare the performance of NEON against native C.

The code is simple and does nothing meaningful.

It reads an image from storage and shifts each pixel value right by 2.

It then produces the output image using a 256 x 256 LUT.

========================

// body of the for loop over i and j

// vld3_u8 takes a pointer, so pass the address of the first byte
uint8x8x3_t rgb = vld3_u8(&img_data[i * width + j]);

// shift right by 2
rgb.val[0] = vshr_n_u8(rgb.val[0], 2);
rgb.val[1] = vshr_n_u8(rgb.val[1], 2);
rgb.val[2] = vshr_n_u8(rgb.val[2], 2);

// get the data from the LUT
int index = 0;
int p = i * width + j;
uint8_t r, g, b;

// lane 0
r = vget_lane_u8(rgb.val[0], 0);
g = vget_lane_u8(rgb.val[1], 0);
b = vget_lane_u8(rgb.val[2], 0);
img_data[p + 0] = LUT[r][g];
img_data[p + 1] = LUT[g][b];
img_data[p + 2] = LUT[b][r];

// lane 1
r = vget_lane_u8(rgb.val[0], 1);
g = vget_lane_u8(rgb.val[1], 1);
b = vget_lane_u8(rgb.val[2], 1);
img_data[p + 3] = LUT[r][g];
img_data[p + 4] = LUT[g][b];
img_data[p + 5] = LUT[b][r];

// lane 2
...
// lane 7 (the last of the 8 lanes)

========================

I then compared the execution time of the native C version and the NEON intrinsics version.

Native C was about 15% faster than NEON.

I would like to know why.
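For comparison, the native C side of such a test might look roughly like the sketch below. It is not the poster's actual code: the interleaved 8-bit RGB layout, the in-place update, the global `LUT` and the function name `shift_and_lookup_scalar` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed global table, mirroring the LUT named in the post. */
uint8_t LUT[256][256];

/* Scalar ("native C") sketch of the loop described above.
 * Assumptions: img holds interleaved 8-bit RGB and is updated in place;
 * pixel_count is the number of pixels, not bytes. */
void shift_and_lookup_scalar(uint8_t *img, size_t pixel_count)
{
    for (size_t k = 0; k < pixel_count; ++k) {
        uint8_t *px = img + 3 * k;

        /* Shift each channel right by 2, so every index is in 0..63. */
        uint8_t r = px[0] >> 2;
        uint8_t g = px[1] >> 2;
        uint8_t b = px[2] >> 2;

        /* Same three lookups as in the intrinsics version. */
        px[0] = LUT[r][g];
        px[1] = LUT[g][b];
        px[2] = LUT[b][r];
    }
}
```

Note that in this form the only work per pixel is three shifts, three byte loads from the table and three byte stores, which is the part a vector unit cannot easily take over.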

  • Hello,

    If you use a C compiler, the -S option will generate assembly source code.

    If you are working in an IDE, the assembly listings will be generated in a certain folder.

    First, please look at the number of lines in the assembly source.

    If you can do neither, please disassemble the object code generated by the compiler with a disassembler (such as objdump from the GNU tools).

    Best regards,

    Yasuhiko Koumoto.
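As a rough illustration of the two routes mentioned above, the sketch below shows the commands as comments on top of a tiny stand-in function. The file name `kernel.c` and the function `shift_right_2` are hypothetical, and a cross toolchain will ship its own differently named `gcc` and `objdump` binaries.

```c
/* kernel.c: a hypothetical stand-in file, used only to show how to
 * obtain an assembly listing with the GNU tools.
 *
 * Generate assembly source instead of an object file:
 *     gcc -O2 -S kernel.c -o kernel.s
 *
 * Or compile to an object file and disassemble it:
 *     gcc -O2 -c kernel.c -o kernel.o
 *     objdump -d kernel.o
 *
 * When cross-compiling, use the gcc and objdump from your own
 * toolchain; their exact names depend on your setup.
 */
#include <stddef.h>
#include <stdint.h>

/* Shift every byte of the buffer right by 2, in place. */
void shift_right_2(uint8_t *data, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        data[i] >>= 2;
}
```

Comparing the listing of the NEON build with that of the plain C build, at the same optimization level, shows how many instructions each inner loop really contains.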

  • Thank you so much!!

    I am going to try this!

    So the most important factor for NEON performance is the number of lines of NEON and assembly code.

    That is how I understand your advice.

    Is that right?

    Anyway, thank you for your advice!!

  • Hello,

    You are right.

    Roughly speaking, you can assume that any instruction takes one cycle to execute.

    Of course, there are many exceptions.

    For example, you should pay attention to register dependencies.

    That is, cases where later instructions use the results of earlier instructions.

    Also, load and store instructions take more than one cycle to execute.

    If you could show us the assembly code, we can analyse it.

    Best regards,

    Yasuhiko Koumoto.
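The register dependency point can be illustrated in C with a minimal sketch. Both functions below are hypothetical examples, not code from this thread: the first forms one long chain in which every addition waits for the previous one, while the second keeps two independent chains. Whether the second is actually faster depends on the core and on what the compiler does with it, so it should be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each addition uses the result of the previous one,
 * so the additions form a single dependency chain. */
uint32_t sum_dependent(const uint8_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Two accumulators: the two additions in one iteration do not depend
 * on each other, so they can be scheduled independently. */
uint32_t sum_independent(const uint8_t *data, size_t n)
{
    uint32_t even = 0, odd = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        even += data[i];
        odd += data[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        even += data[i];

    return even + odd;
}
```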

  • Thank you.

    I understand how to estimate the performance now.

    If possible, I want to ask one more thing!

    This is a part of my code.

    ========================

    // body of the for loop over i and j

    // vld3_u8 takes a pointer, so pass the address of the first byte
    uint8x8x3_t rgb = vld3_u8(&img_data[i * width + j]);

    // shift right by 2
    rgb.val[0] = vshr_n_u8(rgb.val[0], 2);
    rgb.val[1] = vshr_n_u8(rgb.val[1], 2);
    rgb.val[2] = vshr_n_u8(rgb.val[2], 2);

    // get the data from the LUT
    int index = 0;
    int p = i * width + j;
    uint8_t r, g, b;

    // lane 0
    r = vget_lane_u8(rgb.val[0], 0);
    g = vget_lane_u8(rgb.val[1], 0);
    b = vget_lane_u8(rgb.val[2], 0);
    img_data[p + 0] = LUT[r][g];
    img_data[p + 1] = LUT[g][b];
    img_data[p + 2] = LUT[b][r];

    // lane 1
    r = vget_lane_u8(rgb.val[0], 1);
    g = vget_lane_u8(rgb.val[1], 1);
    b = vget_lane_u8(rgb.val[2], 1);
    img_data[p + 3] = LUT[r][g];
    img_data[p + 4] = LUT[g][b];
    img_data[p + 5] = LUT[b][r];

    // lane 2
    ...
    // lane 7 (the last of the 8 lanes)

    ========================

    Is it possible to make the part that fetches data through the LUT more efficient?

    I think the LUT is 256 x 256 bytes, so it cannot be held in registers.
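One direction that could be tried, sketched here under assumptions and not taken from the thread: keep the shift in NEON, write the shifted bytes to a small temporary buffer with `vst3_u8`, and let ordinary C read the indices from memory. This avoids extracting lanes one at a time with `vget_lane_u8`. The function name `shift_and_lookup_8px`, the global `LUT` and the scalar fallback (so the sketch also builds without NEON) are all illustrative, and whether this beats the plain C loop has to be measured.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

/* Assumed global table, mirroring the LUT named in the post. */
uint8_t LUT[256][256];

/* Process 8 interleaved RGB pixels (24 bytes) in place. */
void shift_and_lookup_8px(uint8_t *px)
{
    uint8_t idx[24];   /* shifted bytes, still interleaved as RGB */

#if SKETCH_HAVE_NEON
    /* Vector part: de-interleaving load, shift, re-interleaving store. */
    uint8x8x3_t rgb = vld3_u8(px);
    rgb.val[0] = vshr_n_u8(rgb.val[0], 2);
    rgb.val[1] = vshr_n_u8(rgb.val[1], 2);
    rgb.val[2] = vshr_n_u8(rgb.val[2], 2);
    vst3_u8(idx, rgb);
#else
    /* Portable fallback with the same result as the NEON path. */
    for (int k = 0; k < 24; ++k)
        idx[k] = px[k] >> 2;
#endif

    /* Scalar part: the lookups read their indices back from memory. */
    for (int k = 0; k < 8; ++k) {
        uint8_t r = idx[3 * k + 0];
        uint8_t g = idx[3 * k + 1];
        uint8_t b = idx[3 * k + 2];
        px[3 * k + 0] = LUT[r][g];
        px[3 * k + 1] = LUT[g][b];
        px[3 * k + 2] = LUT[b][r];
    }
}
```

The table lookups themselves stay scalar in this sketch, since each one depends on a different pair of byte values.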

  • Hello,


    I think the optimization should be left to the compiler.
    The table lookup would be compiled to something like the following.

    #  r0  r
    #  r1  g
    #  r2  b
    #  r5  p
    #
    # for lane 0
      movw r3,#:lower16:LUT
      movt r3,#:upper16:LUT
      add r4,r1,r0, lsl #8  @ index [r][g]
      add r1,r2,r1, lsl #8  @ index [g][b]
      add r0,r0,r2, lsl #8  @ index [b][r]
      ldrb r4,[r3,r4]      @ LUT[r][g]
      ldrb r1,[r3,r1]      @ LUT[g][b]
      ldrb r0,[r3,r0]      @ LUT[b][r]
      movw r2,#:lower16:img_data
      movt r2,#:upper16:img_data
      add  r2,r2,r5    @ addr img_data[p]
      strb r4,[r2,#0]      @ img_data[p+0]
      strb r1,[r2,#1]      @ img_data[p+1]
      strb r0,[r2,#2]      @ img_data[p+2]
    

    That is, the table is referenced through pointers.
    I wrote this code by hand, and it is not especially fast.
    To improve it, the instructions should be reordered to resolve the register dependencies.


    Best regards,
    Yasuhiko Koumoto.
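The same idea can be written in C as a minimal sketch: a base pointer plus `(row << 8) + column`, with the three offsets computed first, the three independent loads issued next, and the stores last. The function name `lookup_pixel_flat` and the global `LUT` are assumptions for illustration. The compiler remains free to reorder these statements, so the source order is only a hint and the generated assembly is what counts.

```c
#include <stdint.h>

/* Assumed global table, mirroring the LUT named in the thread. */
uint8_t LUT[256][256];

/* Look up one pixel through a flat byte pointer, following the shape
 * of the hand-written assembly: base + (row << 8) + column. */
void lookup_pixel_flat(uint8_t *px, uint8_t r, uint8_t g, uint8_t b)
{
    /* View the whole table as a run of 256 * 256 bytes. */
    const uint8_t *base = (const uint8_t *)LUT;

    /* Compute all three offsets first. */
    unsigned rg = ((unsigned)r << 8) + g;   /* offset of LUT[r][g] */
    unsigned gb = ((unsigned)g << 8) + b;   /* offset of LUT[g][b] */
    unsigned br = ((unsigned)b << 8) + r;   /* offset of LUT[b][r] */

    /* Three loads that do not depend on each other. */
    uint8_t v0 = base[rg];
    uint8_t v1 = base[gb];
    uint8_t v2 = base[br];

    /* Stores come last, once the loaded values are available. */
    px[0] = v0;
    px[1] = v1;
    px[2] = v2;
}
```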

  • This really helps me.

    I'm going to check the assembly code and try out the advice you gave!

    And thank you for understanding my poor English and answering me so kindly.

    Thanks, Yasuhiko Koumoto!