SIMD help with examples

Hi,

I decided to have a look at SIMD intrinsics. There is a lot of documentation, but I cannot find examples.

So I have decided, once again, to ask a question about how to use SIMD, with examples.

I only need two examples. Then I think I should be able to mix practice and knowledge.

The first example is the case where (*in1) is an array of structs of int. The processing happens inside a loop and computes (*in1)[x] - (*in1)[y]. The intrinsics should be VSUB and VABS, if I read correctly, but I need the actual syntax.

           ONE:

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max                  - (*in1)[y].max);
            int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
            int diff6 = std::abs((*in1)[x].min                  - (*in1)[y].max);
            int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
            int diff8 = std::abs((*in1)[x].max                  - (*in1)[y].min);

and

           TWO :

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max                  - (*in1)[y].max);

and

           FOUR:

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min                  - (*in1)[y].min);

and how to do

           if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) || (diff5 < 5 && diff6 < 5 && diff7 < 5 && diff8 < 5) ){

and

          if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) ){

and

         if ( (diff1 < 9 || diff2 < 9)  &&  (diff3 < 9 || diff4 < 9) ){

I think that would be enough. Then I should be able to find my way, or I will come back to you. ;))

Thanks a lot in advance.

PS: I work with a MediaTek 9200+ and a Mali-G715-Immortalis MC11 r1p2.

Parents
  • But I did not have a look at the assembler produced by Clang (android-ndk-r27c). I have not done that for the last 35 years. Maybe I should. ;))

    I would always check this - you might find Clang has done a half-decent job vectorizing your code anyway.

    Testing a build of the C implementation with auto-vectorization on macOS:

        clang++ test.cpp -o test -O3

    ... against a build with no auto-vectorization:

         clang++ test.cpp -o test -O3 -fno-vectorize -fno-slp-vectorize 

    ... so the autovec C code build is a lot faster (48 ms vs 129 ms for a 10K array).

    My attempt at a manually optimized NEON version (compiles and runs on macOS, but I've not tested output). It's faster, but only just (43ms vs 48ms). I was trying to force the compiler to use a conditional select that always stores at the end, instead of a conditional branch, but objdump says I failed ;)

    #include <arm_neon.h>

    __attribute__((noinline)) void neon_code(rect* in1, int rect_count, int num)
    {
        __builtin_assume(rect_count > 0);
    
        int32x4_t const5 = vdupq_n_s32(5);
        int32x4_t const9 = vdupq_n_s32(9);
    
        for (int x = 0 ; x < rect_count; x++)
        {
            int* x_base = &(in1[x].raw_col_min);
            int32x4_t xv = vld1q_s32(x_base);
    
            for (int y = x + 1 ; y < rect_count; y++)
            {
                int* y_base = &(in1[y].raw_col_min);
                int32x4_t yv = vld1q_s32(y_base);
    
                // Compute diff of min-min and max-max
                int32x4_t diff1_4 = vabsq_s32(vsubq_s32(xv, yv));
    
                // Compute diff of min-max and max-min
                int32x4_t yv_swap = vextq_s32(yv, yv, 2);
                int32x4_t diff5_8 = vabsq_s32(vsubq_s32(xv, yv_swap));
    
                // Generate diff conditions
                uint32x4_t mask1_4 = vcltq_s32(diff1_4, const9);
                uint32x4_t mask5_8 = vcltq_s32(diff5_8, const5);
                bool all_mask1_4 = vminvq_u32(mask1_4) != 0;
                bool all_mask5_8 = vminvq_u32(mask5_8) != 0;
                bool any_diff = all_mask1_4 || all_mask5_8;
    
                // Use conditional selects rather than branches
                bool opt1 = any_diff && (in1[x].rupture == num || in1[x].Y_depart == 0);
                in1[y].rupture = opt1 ? num : in1[y].rupture;
    
                bool opt2 = any_diff && !opt1;
                in1[x].rupture = opt2 ? num : in1[x].rupture;
            }
        }
    }

    This code is assuming that your rect struct looks like:

    struct rect {
        int raw_col_min;
        int min;
        int raw_col_max;
        int max;
        int Y_depart;
        int rupture;
    };

    So loading 128 bits loads the min/min/max/max values in that order. Other orders would be possible to support, but you might need to replace the vextq with something else.

Children
  • The main challenge with this code in terms of vectorizability is that you have a dependent chain where one loop iteration can change the value of rupture used by the next loop iteration. This makes vectorizing by packing multiple iterations (structure of arrays style) difficult, and that means you end up trying to vectorize single iterations. This causes quite a lot of scalar code in the loop to handle the addressing and the update of rupture, which erodes the benefit of SIMD quite quickly. 

    In my vectorized version, the core inner loop is implemented in ~10 NEON instructions, but the overall inner loop is about 30 instructions because of the loop overhead and the scalar update of rupture, so the scalar parts erode a lot of the possible gains. There is also not much computation relative to the number of memory accesses being made, so it is quite possible you end up load-limited.