This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

SIMD help for exemple

hterrolle 8 months ago

hi,

i decided to have a look at SIMD intrinsics instructions but there is a lt of documentation but i cannot find exemple.

So i decide once again to ask question about how to use SIMD with exemple.

i need only 2 exemple. Than i think a should be able to mixte practique et knowledge.

the first axemple is how to do when (*in1) are INT array . the traitment is inside this append in loop (*in1)[x] - (*in1)[y], the intrincis should be VSUB if i read correctky and VABS. But i need the syntaxe code.

ONE:

            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min        - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max        - (*in1)[y].max);
            int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
            int diff6 = std::abs((*in1)[x].min                  - (*in1)[y].max);
            int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
            int diff8 = std::abs((*in1)[x].max        - (*in1)[y].min);

and

TWO :

and

FOUR:

int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);

and how to do

if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) || (diff5 < 5 && diff6 < 5 && diff7 < 5 && diff8 < 5) ){

and

if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) ){

and

if ( (diff1 < 9 || diff2 < 9) && (diff3 < 9 || diff4 < 9) ){

i think that would be enough. Than i should be able to find my way. Or i will come back to you. ;))

Thanks a lot in advence.

PS: i work with médiatek 9200+ and Mali-G715-Immortalis MC11 r1p2

Top replies

Parents

0 Peter Harris 8 months ago in reply to Peter Harris

The main challenge with this code in terms of vectorizability is that you have a dependent chain where one loop iteration can change the value of rupture used by the next loop iteration. This makes vectorizing by packing multiple iterations (structure of arrays style) difficult, and that means you end up trying to vectorize single iterations. This causes quite a lot of scalar code in the loop to handle the addressing and the update of rupture, which erodes the benefit of SIMD quite quickly.

In my vectorized version, the core inner loop is implemented in ~10 neon instructions but the overall inner loop is about 30 instructions because of the loop overhead and scalar update of rupture, so the scalar bits are eroding a lot of possible gains. There is also not that much computation vs the amount of memory accesses being made, so it's quite possible you are ending up load limited.
Cancel
Vote up 0 Vote down

Cancel

Reply

0 Peter Harris 8 months ago in reply to Peter Harris

The main challenge with this code in terms of vectorizability is that you have a dependent chain where one loop iteration can change the value of rupture used by the next loop iteration. This makes vectorizing by packing multiple iterations (structure of arrays style) difficult, and that means you end up trying to vectorize single iterations. This causes quite a lot of scalar code in the loop to handle the addressing and the update of rupture, which erodes the benefit of SIMD quite quickly.

In my vectorized version, the core inner loop is implemented in ~10 neon instructions but the overall inner loop is about 30 instructions because of the loop overhead and scalar update of rupture, so the scalar bits are eroding a lot of possible gains. There is also not that much computation vs the amount of memory accesses being made, so it's quite possible you are ending up load limited.
Cancel
Vote up 0 Vote down

Cancel

Children

0 hterrolle 8 months ago in reply to Peter Harris

Thanks for the answer but i am not sure to anderstoud everyting, excep that the update of rupture is the problem.

So i will remove the check on rupture and just update (*in1)[x].rupture and (*in1)[y].rupture without any check. It will change nothing for me it was just used to check if work was well done.

I just check without the IF and it is quite faster. thanks you.

So, SIMD is very good if we do not use conditional check ? So, for massive calculation and update résult only.

So, it loock like i do not need to rewrite everything using NEON instructions clang should do the work ?
Cancel
Vote up 0 Vote down

Cancel
0 Peter Harris 8 months ago in reply to hterrolle

hterrolle said:
So, SIMD is very good if we do not use conditional check ?

There are two separate issues.

The first issue is just about what percentage of the code can be vectorized. If you have 8 scalar instructions that turn into 2 vector instructions then you go 4x faster. If you have 8 scalar instructions that turn into 1 vector instruction and 4 scalar instructions then you only go 1.6x faster. It doesn't take much serial code to reduce the amount of uplift you get, so good SIMD-friendly algorithms try and stay in SIMD and avoid scalar code as much as possible.

The second issue around use of conditional selects is about branches and branch prediction. Modern CPUs run deep pipelines with a lot of instructions in flight, and rely on branch prediction. For a lot of data-driven algorithms branches that branch off data can be hard to predict, because there isn't a regular pattern in the data. For these unpredictable branches, using conditional selects may often actually need to execute more instructions but avoids the misprediction overhead so runs faster in practice.

hterrolle said:
everything using NEON instructions clang should do the work ?

It's a question of effort vs reward. If you really care about optimizing then hand written code can often beat the compiler (especially if you know that your problem allows you to make assumptions the compiler cannot do in a generic way). However, it's more expensive to develop and maintain, and you might be happy with auto-vectorized code because it's lower long-term cost and more portable across different architectures.

HTH,
Pete
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 8 months ago in reply to Peter Harris

Yes ! the best way to know is to tried.

So, here are the result betwwen the 2 functions. "doublonBis" my version in bleu and the peter code "neon_code" in red.

for about 100 frames.

I would be interrested by some comment about the graph because SIMD code look not being inpacted by the scalling frequencies ?
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 8 months ago in reply to hterrolle

here is the last test.

I run both function inside the pthread this time.

Both doublon and neon functions take nrealy the same time. In yelow is the pthread time running alone. The pthread is much more complicated and it is faster.

Is there a limit to phtread instructions ? I will try both doublon and neon functions inside pthread. Let see.

So doublon or neon inside other pthread run faster than 1 pthread + 1 pthread with doublon or neon.

My conclusion would be to use compilator. The first picure a send in the prévious post is when i run 4 time the doublon and 4 time the neon. Dans in this case neon is faster but both are slower than function inside other pthread or function inside pthread.

PS: do not take care of the absysse, only point i a frame.

It is good to know ;))
Cancel
Vote up 0 Vote down

Cancel
0 Peter Harris 8 months ago in reply to hterrolle

The only limits are likely to be down to frequency scaling and thermal management - if you run a lot of threads on a lot of cores it can get hot, so frequencies get throttled if it overheats to give the device a chance to cool down
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 7 months ago in reply to Peter Harris

i would resume it by, the number of instructions multipled by the number of data to process.

So, the problem of heating is the bootleneck for mobile and laptop but also for desktop.

I do not know if it is feasible. But space between chip layer to drive heat outside should be possible. this would make the chip a little thicker but cooling system make it also thicker. It is just an idea. May be not the best one. ;))
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 7 months ago in reply to hterrolle

hi,

I forgot to ask a question concerning the optimization using SIMD.

I use a lot of INT array in my project. And if i anderstoud SIMD should work with 128 bytes long on the médiatek 9200+

So, if i change The INT Array to USHORT or UCHAR in certain case would i get better performance ?
Cancel
Vote up 0 Vote down

Cancel
0 Ben Clark 7 months ago in reply to hterrolle

Yes, SIMD operations allow more to be done at once on smaller datatypes, so you should be able to improve performance with them in many cases (as long as they still have the needed accuracy)
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 7 months ago in reply to Ben Clark

hi,

So, in case i had to load 8 short rather than 4 int. But if i load only 4 short there is no interest ? If i anderstoud how it work ;))
Cancel
Vote up 0 Vote down

Cancel
0 Peter Harris 7 months ago in reply to hterrolle

Correct.

Loading shorts might be slightly faster because of reduced cache pressure if you are memory-bound, but computationally 4 int16 vs 4 int32 won't make any difference because you just leave half the vector width unused.
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 7 months ago in reply to Peter Harris

do you min that int 64 got the same size as int 32 ? oups.
Cancel
Vote up 0 Vote down

Cancel
0 Peter Harris 7 months ago in reply to hterrolle

Sorry typo - fixed.
Cancel
Vote up 0 Vote down

Cancel
0 hterrolle 7 months ago in reply to Peter Harris

hi,

I think i need once again your compétence in SIMD to add a test on my doublon function.

I need to add an IF.

for (int x = 0 ; x < (*indnbObj) ; x++){

    for (int y = (x+1) ; y < (*indnbObj) ; y++){

        if ((*in1)[x].A == (*in1)[y].A && (*in1)[x].B == (*in1)[y].B && (*in1)[x].C == (*in1)[y].C && (*in1)[x].D == (*in1)[y].D){

          // netoyage des doublons aux extrémités
            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max - (*in1)[y].max);
            int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
            int diff6 = std::abs((*in1)[x].min - (*in1)[y].max);
            int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
          int diff8 = std::abs((*in1)[x].max - (*in1)[y].min);

i plan to rewrite part of your code like this

for (int x = 0 ; x < rect_count; x++)
{

    for (int y = (x + 1) ; y < rect_count; y++)
    {

        int* x_base2 = &(in1[x].A);
        int32x4_t xv2 = vld1q_s32(x_base2);
        int* y_base2 = &(in1[y].A);
        int32x4_t yv2 = vld1q_s32(y_base2);

       And here i should Use branches selects rather than conditional. But i do not know how to do.

       int* x_base = &(in1[x].raw_col_min);
       int32x4_t xv = vld1q_s32(x_base);
       int* y_base = &(in1[y].raw_col_min);
       int32x4_t yv = vld1q_s32(y_base);

if you could explain me how to do it would be nice. ;))

PS: if i do the x_base inside the second loop. does it change something. Or should i keep x_base and x_base2 inside the first loop.

thanks in advance ;))
Cancel
Vote up 0 Vote down

Cancel

0 hterrolle 7 months ago in reply to hterrolle

i think i should use

int* x_base2 = &(in1[x].A);
int32x4_t xv2 = vld1q_s32(x_base2);
int* y_base2 = &(in1[y].A);
int32x4_t yv2 = vld1q_s32(y_base2);

uint32x4_t mask2 = vceqq_s32(xv2 , yv2); // i do the compare

if (mask2){ // if compare ok
   continu the work;

      // Using SIMD, it is better to put these two line inside
      // the first loop. Data load are done only one time.
          //int* x_base = &(in1[x].raw_col_min); 
          //int32x4_t xv = vld1q_s32(x_base); 
   int* y_base = &(in1[y].raw_col_min);
   int32x4_t yv = vld1q_s32(y_base);

 .............
}

0 hterrolle 6 months ago in reply to hterrolle

hi,

I just implement the modification like this : (not waiting for answer)

    for (int x = 0 ; x < rect_count; x++)
    {

        int* x_base2 = &(in1[x].A);
        int32x4_t xv2 = vld1q_s32(x_base2);

        for (int y = x + 1 ; y < rect_count; y++)
        {

            int* y_base2 = &(in1[y].A);
            int32x4_t yv2 = vld1q_s32(y_base2);

            uint32x4_t mask2 = vceqq_s32(xv2 , yv2); // i do the compare
            float32_t all_mask2_4 = vminvq_u32(mask2) != 0;

            if (all_mask2_4 == 1){ // if compare ok

but i was surprised that i could not use bool as the result of vminvq_u32(mask2) != 0 like in the original exemple if i use vceqq_s32 rather than vcltq_s32 .

the problem was the "if (mask2)" that said it is not a bool

I do not anderstand why ?
Cancel
Vote up 0 Vote down

Cancel