SIMD help with examples

Hi,

I decided to have a look at SIMD intrinsics. There is a lot of documentation, but I cannot find examples.

So I have decided, once again, to ask a question about how to use SIMD, with examples.

I only need two examples. Then I think I should be able to mix practice and knowledge.

The first example is the case where (*in1) is an array of structs of int. The processing happens inside a loop and computes (*in1)[x] - (*in1)[y]. The intrinsics should be VSUB and VABS, if I read correctly, but I need the actual syntax.

           ONE:

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max                  - (*in1)[y].max);
            int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
            int diff6 = std::abs((*in1)[x].min                  - (*in1)[y].max);
            int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
            int diff8 = std::abs((*in1)[x].max                  - (*in1)[y].min);

and

           TWO :

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max                  - (*in1)[y].max);

and

           FOUR:

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);

and how to do

           if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) || (diff5 < 5 && diff6 < 5 && diff7 < 5 && diff8 < 5) ){

and

          if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) ){

and

         if ( (diff1 < 9 || diff2 < 9)  &&  (diff3 < 9 || diff4 < 9) ){

I think that would be enough. Then I should be able to find my way, or I will come back to you. ;))

Thanks a lot in advance.

PS: I work with a MediaTek 9200+ and a Mali-G715-Immortalis MC11 r1p2.

Parents
  • But I did not have a look at the assembler produced by Clang (android-ndk-r27c). I have not done that for the last 35 years. Maybe I should. ;))

    I would always check this - you might find Clang has done a half-decent job vectorizing your code anyway.

    Testing a build of the C implementation with auto-vectorization on macOS:

        clang++ test.cpp -o test -O3

    ... against a build with no auto-vectorization:

         clang++ test.cpp -o test -O3 -fno-vectorize -fno-slp-vectorize 

    ... so the autovec C code build is a lot faster (48 ms vs 129 ms for a 10K array).

    My attempt at a manually optimized NEON version (compiles and runs on macOS, but I've not tested output). It's faster, but only just (43ms vs 48ms). I was trying to force the compiler to use a conditional select that always stores at the end, instead of a conditional branch, but objdump says I failed ;)

    #include <arm_neon.h>

    __attribute__((noinline)) void neon_code(rect* in1, int rect_count, int num)
    {
        __builtin_assume(rect_count > 0);
    
        int32x4_t const5 = vdupq_n_s32(5);
        int32x4_t const9 = vdupq_n_s32(9);
    
        for (int x = 0 ; x < rect_count; x++)
        {
            int* x_base = &(in1[x].raw_col_min);
            int32x4_t xv = vld1q_s32(x_base);
    
            for (int y = x + 1 ; y < rect_count; y++)
            {
                int* y_base = &(in1[y].raw_col_min);
                int32x4_t yv = vld1q_s32(y_base);
    
                // Compute diff of min-min and max-max
                int32x4_t diff1_4 = vabsq_s32(vsubq_s32(xv, yv));
    
                // Compute diff of min-max and max-min
                int32x4_t yv_swap = vextq_s32(yv, yv, 2);
                int32x4_t diff5_8 = vabsq_s32(vsubq_s32(xv, yv_swap));
    
                // Generate diff conditions
                uint32x4_t mask1_4 = vcltq_s32(diff1_4, const9);
                uint32x4_t mask5_8 = vcltq_s32(diff5_8, const5);
                bool all_mask1_4 = vminvq_u32(mask1_4) != 0;
                bool all_mask5_8 = vminvq_u32(mask5_8) != 0;
                bool any_diff = all_mask1_4 || all_mask5_8;
    
                // Use conditional selects rather than branches
                bool opt1 = any_diff && (in1[x].rupture == num || in1[x].Y_depart == 0);
                in1[y].rupture = opt1 ? num : in1[y].rupture;
    
                bool opt2 = any_diff && !opt1;
                in1[x].rupture = opt2 ? num : in1[x].rupture;
            }
        }
    }

    This code is assuming that your rect struct looks like:

    struct rect {
        int raw_col_min;
        int min;
        int raw_col_max;
        int max;
        int Y_depart;
        int rupture;
    };

    So loading 128 bits loads the min/min/max/max values in that order. Other orders would be possible to support, but you might need to replace the vextq with something else.

Children
  • The main challenge with this code in terms of vectorizability is that you have a dependent chain where one loop iteration can change the value of rupture used by the next loop iteration. This makes vectorizing by packing multiple iterations (structure of arrays style) difficult, and that means you end up trying to vectorize single iterations. This causes quite a lot of scalar code in the loop to handle the addressing and the update of rupture, which erodes the benefit of SIMD quite quickly. 

    In my vectorized version, the core inner loop is implemented in ~10 NEON instructions, but the overall inner loop is about 30 instructions because of the loop overhead and the scalar update of rupture, so the scalar parts erode a lot of the possible gains. There is also not much computation relative to the number of memory accesses being made, so it is quite possible you end up load-limited.