SIMD comparison result and how to use it

Hi,

When I compute the difference:

    // Compute diff
    int32x4_t diff1_4 = vabsq_s32(vsubq_s32(A, B));

I get 4 results, one for each pair: (A1,B1), (A2,B2), (A3,B3) and (A4,B4).

Then I have to do the comparison:

    uint32x4_t mask1_4 = vcltq_s32(diff1_4, X);

So "uint32x4_t mask1_4" holds the comparison result for each pair, which makes 4 results.

The answer in the post "SIMD help for exemple" was to use:

    if (vmaxvq_u32(mask1_4) > 0) { ... }

I think my wording was confusing in the previous post. I wrote:

   " if (mask1_4[0] > 0 && mask1_4[1] > 0)  and if (mask1_4[2] > 0 && mask1_4[3] > 0) "

but I did not mean a single test like if (mask1_4[0] > 0 && mask1_4[1] > 0 && mask1_4[2] > 0 && mask1_4[3] > 0).

I need to do 2 separate tests:

    if (mask1_4[0] > 0 && mask1_4[1] > 0) {
        // process data1
    }

    if (mask1_4[2] > 0 && mask1_4[3] > 0) {
        // process data2
    }

I think that vmaxvq_u32(mask1_4) will check all the comparisons together, like:

    if(  mask1_4[0] > 0 && mask1_4[1] > 0  &&   mask1_4[2] > 0 && mask1_4[3] > 0  )

PS: I think I should have posted this in the old thread.

  • I think I may have found what I need, but I am not sure.

    Here is how I see the code:

        // Compute diff
        int32x4_t diff1_4 = vabsq_s32(vsubq_s32(A, B));

        // Compare diff
        uint32x4_t mask1_4 = vcltq_s32(diff1_4, X);

        // extract [0] and [1] from mask1_4
        uint32x2_t mask1_4_01 = vget_high_u32(mask1_4); // not sure high is [0][1]

        // extract [2] and [3] from mask1_4
        uint32x2_t mask1_4_23 = vget_low_u32(mask1_4);

        if (vmaxvq_u32(mask1_4_01) > 0) { process data1 } // meant to act like if (mask1_4[0] > 0 && mask1_4[1] > 0)

        if (vmaxvq_u32(mask1_4_23) > 0) { process data2 } // meant to act like if (mask1_4[2] > 0 && mask1_4[3] > 0)

    Is this correct, or did I miss something? Is there a better way to do it?

  • I have spent 3 hours debugging NEON code so far.

    "vmaxvq_u32(mask1_4_01)" does not compile: "error: no matching function for call to 'vmaxvq_u32'".

    It looks like vmaxvq_u32 is for 128-bit vectors ("uint32x4_t") and not for 64-bit ones ("uint32x2_t"). I tried to find the correct function, but without success.

    I tried vmin_u32 and vmax_u32, but they appear to compare two uint32x2_t vectors with each other, which is not useful in my context.

    I need a vmaxvq_u32 or vminvq_u32 that works with uint32x2_t.

    But I can use "if (mask1_4[0] > 0 && mask1_4[1] > 0)" because "uint32x4_t mask1_4 = vcltq_s32(diff1_4, constseuil);" returns -1 when the comparison is true and 0 when it is not.

    Something is wrong with vmaxvq_u32(uint32x2_t).

    PS: concerning the extraction, high is not [0][1] but [2][3].

  • Why does if (mask1_4[0] > 0 && mask1_4[1] > 0) work with a -1 value? Does it look at the sign?

    Same thing if I use vcltq_u32.

    By the way, if ((mask1_4[0] + mask1_4[1]) > 1) does not work in all cases.

    In fact, I tested many solutions.

    Let's say I have:

    int32x4_t diff1_4  = vabsq_s32(vsubq_s32(xv, yv));

    uint32x4_t mask1_4 = vcltq_s32(diff1_4, constseuil);

    With mask1_4 = -1 -1 -1 -1, adding (mask1_4[0] + mask1_4[1]) gives -2,

    but if I check:

    if ((mask1_4[0] + mask1_4[1]) < 0)     it is false

    if ((mask1_4[0] + mask1_4[1]) == -2)   it is true

    if ((mask1_4[0] + mask1_4[1]) < -1)    it is true

    if ((mask1_4[0] + mask1_4[1]) > 0)     it is true

    There is a problem! < 0 should be true and > 0 false.

    Here is the complete code:

    __attribute__((noinline)) void neon_multi(){

        int A1[4] = {300,400,600,400};
        int B1[4] = {300,400,600,400};//= {,600,400,300,400};
        
        int32x4_t constseuil = vdupq_n_s32(4);


        int* x_base = (A1);
        int32x4_t xv = vld1q_s32(x_base);

        int* y_base = (B1);
        int32x4_t yv = vld1q_s32(y_base);

        // Compute diff
        int32x4_t diff1_4 = vabsq_s32(vsubq_s32(xv, yv));

        // Compute diff against B with its two halves swapped
        int32x4_t yv_swap = vextq_s32(yv, yv, 2);
        int32x4_t diff5_8 = vabsq_s32(vsubq_s32(xv, yv_swap));

        // Generate diff conditions
        uint32x4_t mask1_4 = vcltq_s32(diff1_4, constseuil);
        LOGE(" neon_multi mask1_4 %3d %3d %3d %3d \n",mask1_4[0],mask1_4[1],mask1_4[2],mask1_4[3]);
        LOGE(" neon_multi test %3d %3d \n",(mask1_4[0]+mask1_4[1]),(mask1_4[2]+mask1_4[3]));
        uint32x4_t mask5_8 = vcltq_s32(diff5_8, constseuil);
        LOGE(" neon_multi mask5_8 %3d %3d %3d %3d \n",mask5_8[0],mask5_8[1],mask5_8[2],mask5_8[3]);
        LOGE(" neon_multi test %3d %3d \n",(mask5_8[0]+mask5_8[1]),(mask5_8[2]+mask5_8[3]));
            
        if ( (mask1_4[0]+mask1_4[1])    > 0){LOGE(" neon_multi 1  %3d\n",(mask1_4[0]+mask1_4[1]));}

        if ( (mask1_4[0]+mask1_4[1])    < 0){LOGE(" neon_multi 2  %3d\n",(mask1_4[0]+mask1_4[1]));}

        if ( (mask1_4[0]+mask1_4[1]) == -2){LOGE(" neon_multi 3  %3d\n",(mask1_4[0]+mask1_4[1]));}

        if ( (mask1_4[0]+mask1_4[1])   < -1){LOGE(" neon_multi 4  %3d\n",(mask1_4[0]+mask1_4[1]));}

    }

    Let's see tomorrow ;))

  • It looks like vmaxvq_u32 is for 128-bit vectors ("uint32x4_t") and not for 64-bit ones ("uint32x2_t"). I tried to find the correct function, but without success.

    I need a vmaxvq_u32 or vminvq_u32 that works with uint32x2_t.

    NEON intrinsics are relatively regular in their naming convention: the "q" versions are 128-bit quad-word versions, and the non-q versions are 64-bit double-word versions.

    So vmaxvq_u32() is the 128-bit version, and vmaxv_u32() is the 64-bit version.

    Why does if (mask1_4[0] > 0 && mask1_4[1] > 0) work with a -1 value? Does it look at the sign?

    mask1_4 is an unsigned uint32 type, so the lane value is 4294967295, not -1. NEON compares just return a vector with all bits set if true, or all bits clear if false, which is handy for logical bitwise mask operations.
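
To make the all-bits-set convention concrete, here is a minimal scalar sketch in plain C, so it runs without NEON. The helpers `lane_lt_mask`, `lane_select` and `demo_mask_model` are hypothetical names made up for this illustration; they only model what a single 32-bit lane of a compare such as vcltq_s32 and a bitwise select would produce.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one lane of a NEON "less than" compare:
   all 32 bits set when a < b, all bits clear otherwise. */
uint32_t lane_lt_mask(int32_t a, int32_t b)
{
    return (a < b) ? UINT32_MAX : 0u;
}

/* Bitwise select driven by the mask: keeps if_true where the mask bits
   are set and if_false where they are clear. */
uint32_t lane_select(uint32_t mask, uint32_t if_true, uint32_t if_false)
{
    return (mask & if_true) | (~mask & if_false);
}

void demo_mask_model(void)
{
    uint32_t pass = lane_lt_mask(0, 4);   /* 0 < 4 holds    */
    uint32_t fail = lane_lt_mask(5, 4);   /* 5 < 4 does not */

    assert(pass == 4294967295u);          /* all ones, read as unsigned */
    assert(pass > 0);                     /* so "> 0" is true           */
    assert(fail == 0u);
    assert(lane_select(pass, 11u, 22u) == 11u);
    assert(lane_select(fail, 11u, 22u) == 22u);
}
```

Because a passing lane is simply a large unsigned number, a test written as "> 0" and a test written as "!= 0" give the same answer here.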

  • I will check it tomorrow.

    And thanks again. I spent 6 hours on NEON today. I have had my fill. ;))

  • Hi,

    vmaxv_u32() works fine. And I understand the problem with > 0 and < 0 now. It looks like it comes from printf mixing signed and unsigned integer types. That is why I always saw -1 for both signed and unsigned.

    But I still do not understand what happens when I use unsigned int:

    uint32x4_t mask1_4 = vcltq_u32(diff1_4, constseuil);

    The check is true for > 0 and, at the same time, for == -2 and < -1. It is quite confusing.

    How can an unsigned value satisfy == -2 and < -1?

    If I were using signed values I would understand, because a signed value can be true for < 0, == -2 and < -1, which is logical.

    Am I missing an include in my program that would stop unsigned int from answering true to == -2 and < -1? For == -2 I can, let's say, understand. But for < -1 I do not. It is not logical that a positive number can be less than -1. It is the first time I have seen something like that. ;))

  • It looks like it is a problem of printf using signed and unsigned integer type.

    You need to use "%d" for signed and "%u" for unsigned.

    uint32x4_t mask1_4 = vcltq_u32(diff1_4, constseuil);

    In your original code your diff1_4 is signed and you were using "vcltq_s32" which is the signed compare, so the compare should be correct for signed values.

    The result of all of the NEON compare functions is not really meaningful as a number - it's just a bitmask with all bits in a lane set to 1 if the compare, like vcltq_s32, passed and zero if it failed. Given it's not expected to be interpreted as a number, the result is always uint32_t no matter what the original compare data type was, and the only comparison that matters is whether that mask is zero or non-zero.
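
As a small illustration of the "%d" versus "%u" point, the sketch below formats the same all-ones lane both ways using standard C only. The function names are invented for this example, and the signed variant assumes the usual two's-complement behaviour of GCC and Clang when an out-of-range value is converted to int32_t.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Formats one mask lane as unsigned, which matches its real type. */
int format_lane_unsigned(char *buf, size_t size, uint32_t lane)
{
    return snprintf(buf, size, "%" PRIu32, lane);
}

/* Formats the same bits reinterpreted as signed. The conversion to int32_t
   is implementation-defined before C23; assumed here to wrap to -1. */
int format_lane_signed(char *buf, size_t size, uint32_t lane)
{
    return snprintf(buf, size, "%" PRId32, (int32_t)lane);
}

/* The only question worth asking of a compare result: zero or non-zero. */
int lane_passed(uint32_t lane)
{
    return lane != 0u;
}

void demo_lane_formatting(void)
{
    char buf[32];

    /* "4294967295" has 10 characters. */
    assert(format_lane_unsigned(buf, sizeof buf, UINT32_MAX) == 10);

    /* "-1" has 2 characters. */
    assert(format_lane_signed(buf, sizeof buf, UINT32_MAX) == 2);
    assert(buf[0] == '-' && buf[1] == '1');

    assert(lane_passed(UINT32_MAX) && !lane_passed(0u));
}
```

Either spelling describes the same 32 bits; only the unsigned one matches how the comparisons in the C code actually treat the lane.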

  • I only read your response while I was writing mine. ;))

    Sorry for the previous post. I understand the problem now. Using %u rather than %d gave me the answer.

    It looks like -1 is converted to unsigned int for the < -1 comparison,

    so -1 becomes 4294967295 and -2 becomes 4294967294, and 4294967294 is < 4294967295.

    But it is really confusing.

    Thanks very much for your answer. I was confused at the beginning of the day. Now I can go back to my program calmly. ;))

    Have a good day.
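
The conversion described in this post can be reproduced in plain C without any NEON. This is a sketch under the assumption that int is 32 bits wide, so that uint32_t is unsigned int and is not promoted to int; `lane_sum` and `demo_mixed_sign_compares` are hypothetical names used only here. Compilers may emit sign-compare warnings for it, which is exactly the behaviour being shown.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of two mask lanes, computed in uint32_t like mask1_4[0] + mask1_4[1]. */
uint32_t lane_sum(uint32_t lane0, uint32_t lane1)
{
    return lane0 + lane1;        /* wraps modulo 2^32 */
}

void demo_mixed_sign_compares(void)
{
    uint32_t sum = lane_sum(UINT32_MAX, UINT32_MAX);

    assert(sum == 4294967294u);  /* the sum wrapped around              */
    assert(sum == -2);           /* -2 is converted to 4294967294u      */
    assert(sum < -1);            /* -1 is converted to 4294967295u      */
    assert(sum > 0);             /* it is a large positive value        */
    assert(!(sum < 0));          /* an unsigned value is never below 0  */
}
```

In each mixed comparison the signed constant is converted to unsigned before the compare, so no missing include is involved.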

  • Sorry again, I forgot to ask: which is best in terms of performance?

    Extract and compare:

    uint32x2_t mask1_4_01 = vget_low_u32(mask1_4); then  if (vminv_u32(mask1_4_01) > 0)

    or

    if (mask1_4[0] > 0 && mask1_4[1] > 0) {

    I think that is all I need then.

    And thanks again.

  • I suspect they are going to end up exactly the same
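
For reference, the pairwise test discussed in this thread could be sketched as below. It is not code from the thread: `pair_flags`, `demo_pair_flags` and `HAVE_A64_NEON` are hypothetical names. The NEON path assumes an AArch64 target, since vminv_u32 is an across-vector reduction, and a scalar fallback with the same meaning is included so the sketch also builds elsewhere. It also assumes the differences do not overflow. Note that a minimum across lanes is non-zero only when every lane passed, whereas a maximum is non-zero as soon as any lane passed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_A64_NEON 1
#else
#define HAVE_A64_NEON 0
#endif

/* Bit 0 of the result is set when lanes 0 and 1 both satisfy
   |a - b| < limit, bit 1 when lanes 2 and 3 both do. */
unsigned pair_flags(const int32_t a[4], const int32_t b[4], int32_t limit)
{
#if HAVE_A64_NEON
    int32x4_t  diff = vabsq_s32(vsubq_s32(vld1q_s32(a), vld1q_s32(b)));
    uint32x4_t mask = vcltq_s32(diff, vdupq_n_s32(limit));

    /* The low half holds lanes 0 and 1, the high half lanes 2 and 3. */
    unsigned pair01 = vminv_u32(vget_low_u32(mask))  != 0u;
    unsigned pair23 = vminv_u32(vget_high_u32(mask)) != 0u;
#else
    /* Portable fallback, one lane at a time. */
    unsigned pass[4];
    for (int i = 0; i < 4; ++i) {
        pass[i] = abs(a[i] - b[i]) < limit;
    }
    unsigned pair01 = pass[0] && pass[1];
    unsigned pair23 = pass[2] && pass[3];
#endif
    return pair01 | (pair23 << 1);
}

void demo_pair_flags(void)
{
    const int32_t a[4] = {300, 400, 600, 400};
    const int32_t b[4] = {300, 401, 600, 500};

    /* Lanes 0, 1 and 2 differ by less than 4; lane 3 differs by 100,
       so only the first pair passes. */
    assert(pair_flags(a, b, 4) == 1u);
}
```

Whether the reduction or plain lane indexing is faster is not something this sketch settles; comparing the generated assembly for the target compiler would be the way to check.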