Hi,
I decided to have a look at SIMD intrinsics. There is a lot of documentation, but I cannot find any examples.
So once again I am asking how to use SIMD, with examples.
I only need two examples. After that I think I should be able to combine practice and knowledge.
The first example is for the case where (*in1) is an array of int. The processing happens inside a loop and computes (*in1)[x] - (*in1)[y]. If I read correctly, the intrinsics should be VSUB and VABS, but I need the code syntax.
ONE:
int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);
int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
int diff4 = std::abs((*in1)[x].max - (*in1)[y].max);
int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
int diff6 = std::abs((*in1)[x].min - (*in1)[y].max);
int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
int diff8 = std::abs((*in1)[x].max - (*in1)[y].min);
and
TWO:
int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);
int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
int diff4 = std::abs((*in1)[x].max - (*in1)[y].max);
FOUR:
int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);
And how to do these tests:
if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) || (diff5 < 5 && diff6 < 5 && diff7 < 5 && diff8 < 5) ){
if ( (diff1 < 9 && diff2 < 9 && diff3 < 9 && diff4 < 9) ){
if ( (diff1 < 9 || diff2 < 9) && (diff3 < 9 || diff4 < 9) ){
I think that would be enough. Then I should be able to find my way, or I will come back to you. ;))
Thanks a lot in advance.
PS: I work with a MediaTek 9200+ and a Mali-G715-Immortalis MC11 r1p2.
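For readers looking for the syntax, here is a minimal sketch of case ONE with the first test (all of diff1..diff4 below 9, or all of diff5..diff8 below 5). It assumes a hypothetical struct `Rect` whose four int fields are contiguous in memory in the order shown, so one 128-bit load picks them up; the helper name `near_duplicate` is also made up for illustration. The NEON path uses AArch64-only reductions, and a scalar fallback is included so the sketch also builds on other targets.

```cpp
#include <cstdint>
#include <cstdlib>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical layout: the four fields must be contiguous, in this order.
struct Rect {
    int32_t raw_col_min;
    int32_t min;
    int32_t raw_col_max;
    int32_t max;
};

// True when (diff1..diff4 all < 9) || (diff5..diff8 all < 5).
bool near_duplicate(const Rect& a, const Rect& b) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    int32x4_t av = vld1q_s32(&a.raw_col_min);
    int32x4_t bv = vld1q_s32(&b.raw_col_min);
    // Swap the halves of b: {raw_col_max, max, raw_col_min, min}.
    int32x4_t bs = vextq_s32(bv, bv, 2);

    int32x4_t d_same  = vabsq_s32(vsubq_s32(av, bv));  // diff1..diff4
    int32x4_t d_cross = vabsq_s32(vsubq_s32(av, bs));  // diff5..diff8

    // Each lane becomes all-ones when the comparison holds, else zero.
    uint32x4_t m_same  = vcltq_s32(d_same,  vdupq_n_s32(9));
    uint32x4_t m_cross = vcltq_s32(d_cross, vdupq_n_s32(5));

    // The minimum across lanes is non-zero only if every lane passed.
    return vminvq_u32(m_same) != 0 || vminvq_u32(m_cross) != 0;
#else
    // Scalar fallback with the same meaning.
    const int32_t av[4] = {a.raw_col_min, a.min, a.raw_col_max, a.max};
    const int32_t bv[4] = {b.raw_col_min, b.min, b.raw_col_max, b.max};
    bool same = true;
    bool cross = true;
    for (int i = 0; i < 4; ++i) {
        same  = same  && std::abs(av[i] - bv[i]) < 9;
        cross = cross && std::abs(av[i] - bv[(i + 2) % 4]) < 5;
    }
    return same || cross;
#endif
}
```

Cases TWO and FOUR use the same pattern with fewer lanes of interest; the unused lanes then have to be masked out before the reduction.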
Yes, SIMD operations allow more to be done at once on smaller datatypes, so you should be able to improve performance with them in many cases (as long as the smaller types still have the accuracy you need).
So in that case I would have to load 8 shorts rather than 4 ints. But if I load only 4 shorts, there is no benefit? If I understood how it works ;))
Correct. Loading shorts might be slightly faster because of reduced cache pressure if you are memory-bound, but computationally 4 int16 vs 4 int32 won't make any difference because you just leave half the vector width unused.
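To illustrate the point about vector width, here is a minimal sketch that processes 8 int16 values in one 128-bit register, where the int32 version would handle only 4. The helper name `all_close_8xs16` is hypothetical, and the sketch assumes every difference fits in int16. The NEON path is AArch64-only; a scalar fallback is included for other targets.

```cpp
#include <cstdint>
#include <cstdlib>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: true when |a[i] - b[i]| < limit for all 8 pairs.
// Assumes the differences do not overflow int16.
bool all_close_8xs16(const int16_t* a, const int16_t* b, int16_t limit) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    int16x8_t av = vld1q_s16(a);                     // 8 lanes of int16
    int16x8_t bv = vld1q_s16(b);
    int16x8_t diff = vabsq_s16(vsubq_s16(av, bv));   // 8 abs diffs at once
    uint16x8_t mask = vcltq_s16(diff, vdupq_n_s16(limit));
    return vminvq_u16(mask) != 0;                    // all 8 lanes passed?
#else
    // Scalar fallback with the same meaning.
    for (int i = 0; i < 8; ++i) {
        if (std::abs(a[i] - b[i]) >= limit) {
            return false;
        }
    }
    return true;
#endif
}
```

With only 4 shorts of useful data, the other 4 lanes would be idle, which is the "no difference" case described above.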
Do you mean that int64 has the same size as int32? Oops.
Sorry typo - fixed.
I think I need your SIMD expertise once again, to add a test to my duplicate-removal function.
I need to add an if.
for (int x = 0; x < (*indnbObj); x++) {
    for (int y = (x + 1); y < (*indnbObj); y++) {
        if ((*in1)[x].A == (*in1)[y].A && (*in1)[x].B == (*in1)[y].B && (*in1)[x].C == (*in1)[y].C && (*in1)[x].D == (*in1)[y].D) {
            // clean up duplicates at the ends
            int diff1 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_min);
            int diff2 = std::abs((*in1)[x].min - (*in1)[y].min);
            int diff3 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_max);
            int diff4 = std::abs((*in1)[x].max - (*in1)[y].max);
            int diff5 = std::abs((*in1)[x].raw_col_min - (*in1)[y].raw_col_max);
            int diff6 = std::abs((*in1)[x].min - (*in1)[y].max);
            int diff7 = std::abs((*in1)[x].raw_col_max - (*in1)[y].raw_col_min);
            int diff8 = std::abs((*in1)[x].max - (*in1)[y].min);
I plan to rewrite part of your code like this:
for (int x = 0; x < rect_count; x++) {
    for (int y = (x + 1); y < rect_count; y++) {
        int* x_base2 = &(in1[x].A);
        int32x4_t xv2 = vld1q_s32(x_base2);
        int* y_base2 = &(in1[y].A);
        int32x4_t yv2 = vld1q_s32(y_base2);
And here I should use selects rather than conditional branches, but I do not know how to do that.
        int* x_base = &(in1[x].raw_col_min);
        int32x4_t xv = vld1q_s32(x_base);
        int* y_base = &(in1[y].raw_col_min);
        int32x4_t yv = vld1q_s32(y_base);
If you could explain how to do it, that would be nice. ;))
PS: If I compute x_base inside the second loop, does that change anything? Or should I keep x_base and x_base2 in the first loop?
Thanks in advance ;))
I think I should use:
uint32x4_t mask2 = vceqq_s32(xv2, yv2); // do the compare
if (mask2) { // if the compare is OK, continue the work
    // Using SIMD, it is better to put these two lines inside
    // the first loop, so the data is loaded only once.
    //int* x_base = &(in1[x].raw_col_min);
    //int32x4_t xv = vld1q_s32(x_base);
    int* y_base = &(in1[y].raw_col_min);
    int32x4_t yv = vld1q_s32(y_base);
    .............
}
I just implemented the modification like this (without waiting for an answer):
for (int x = 0; x < rect_count; x++) {
    int* x_base2 = &(in1[x].A);
    int32x4_t xv2 = vld1q_s32(x_base2);
    for (int y = x + 1; y < rect_count; y++) {
        int* y_base2 = &(in1[y].A);
        int32x4_t yv2 = vld1q_s32(y_base2);
        uint32x4_t mask2 = vceqq_s32(xv2, yv2); // do the compare
        float32_t all_mask2_4 = vminvq_u32(mask2) != 0;
        if (all_mask2_4 == 1) { // if the compare is OK
But I was surprised that I could not use bool for the result of vminvq_u32(mask2) != 0, as in the original example, when I use vceqq_s32 rather than vcltq_s32.
The problem was the "if (mask2)": the compiler said it is not a bool.
I do not understand why.
hterrolle said: float32_t all_mask2_4 = vminvq_u32(mask2) != 0;
This should be a bool result, not a float32_t result. The rest looks OK though as far as I can tell.
Yes, you are right. I made a mistake by using: if (mask2)
This is much better ;))
bool all_mask2_4 = vminvq_u32(mask2) != 0;
if (all_mask2_4) { // if the compare is OK
Thanks.
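As a self-contained illustration of the corrected pattern: a `uint32x4_t` is a vector of four lane masks and cannot be used directly as an `if` condition, so it first has to be reduced to a scalar. The sketch below wraps the vceqq_s32 plus vminvq_u32 steps in a hypothetical helper named `all_four_equal`; the NEON path is AArch64-only and a scalar fallback is included for other targets.

```cpp
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: true when x[i] == y[i] for all four ints.
bool all_four_equal(const int32_t* x, const int32_t* y) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    // Each lane is 0xFFFFFFFF where the values match, 0 otherwise.
    uint32x4_t mask = vceqq_s32(vld1q_s32(x), vld1q_s32(y));
    // Reduce the four lane masks to one scalar: the minimum is
    // non-zero only when every lane matched.
    return vminvq_u32(mask) != 0;
#else
    // Scalar fallback with the same meaning.
    for (int i = 0; i < 4; ++i) {
        if (x[i] != y[i]) {
            return false;
        }
    }
    return true;
#endif
}
```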
Sorry to come back, but I have another question.
When I do the diff:
// Compute diff
int32x4_t diff1_4 = vabsq_s32(vsubq_s32(A, B));
I get 4 results, one for each test: (A1,B1), (A2,B2), (A3,B3) and (A4,B4).
And then I have to do the comparison:
uint32x4_t mask1_4 = vcltq_s32(diff1_4, X);
So in "uint32x4_t mask1_4" I get the comparison for each test, so 4 results.
And I would like to check:
if (mask1_4[0] > 0 && mask1_4[1] > 0) and if (mask1_4[2] > 0 && mask1_4[3] > 0)
If it is possible, how do I do this?
Thanks again. ;))
If you want "any of the 4 lanes" then do something like this:
if (vmaxvq_u32(mask1_4) > 0) { ... }
If you only want to match two lanes out of the four, then I would "vandq_u32()" the mask to zero out the mask lanes you don't want before doing the vmaxvq_u32().
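A minimal sketch of the lane-selection idea, using hypothetical helpers `any_of_lanes01` and `both_lanes01`. The first follows the suggestion above (AND with a selector, then a max reduction), which answers "either of lanes 0 and 1". The second is an assumption about what the `&&` in the question needs: it forces the unwanted lanes to all-ones and uses a min reduction instead, so both selected lanes must pass. The NEON path is AArch64-only, with a scalar fallback for other targets.

```cpp
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// True when diff[0] < limit OR diff[1] < limit (lanes 2 and 3 ignored).
bool any_of_lanes01(const int32_t* diff, int32_t limit) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    static const uint32_t kSel01[4] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u};
    uint32x4_t mask = vcltq_s32(vld1q_s32(diff), vdupq_n_s32(limit));
    // Zero the lanes we do not care about, then test for any set lane.
    return vmaxvq_u32(vandq_u32(mask, vld1q_u32(kSel01))) > 0;
#else
    return diff[0] < limit || diff[1] < limit;
#endif
}

// True when diff[0] < limit AND diff[1] < limit (lanes 2 and 3 ignored).
bool both_lanes01(const int32_t* diff, int32_t limit) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    static const uint32_t kSel01[4] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u};
    uint32x4_t mask = vcltq_s32(vld1q_s32(diff), vdupq_n_s32(limit));
    // Force the ignored lanes to all-ones so they cannot fail the min.
    uint32x4_t ignored = vmvnq_u32(vld1q_u32(kSel01));
    return vminvq_u32(vorrq_u32(mask, ignored)) != 0;
#else
    return diff[0] < limit && diff[1] < limit;
#endif
}
```

The same two helpers with a selector of {0, 0, all-ones, all-ones} would cover lanes 2 and 3.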
The other option is to reduce the mask to a 4-bit bitmask you can then test with normal C bit-wise arithmetic. Example of how to do this here:
https://github.com/ARM-software/astc-encoder/blob/701503966b1ac2ebd2616cba94adee5ae8ba6363/Source/astcenc_vecmathlib_neon_4.h#L410
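A minimal sketch of one common way to do this kind of reduction; it is written for illustration and is not copied from the linked file. The hypothetical helper `lane_bits` turns the four lane masks into a 4-bit integer (bit i set when lane i passed), which can then be tested with ordinary C bitwise operators. The NEON path is AArch64-only, with a scalar fallback for other targets.

```cpp
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Returns a 4-bit value; bit i is set when diff[i] < limit.
unsigned lane_bits(const int32_t* diff, int32_t limit) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    static const int32_t kShift[4] = {0, 1, 2, 3};
    uint32x4_t mask = vcltq_s32(vld1q_s32(diff), vdupq_n_s32(limit));
    // Keep only the top bit of each lane: 1 if the lane passed, else 0.
    uint32x4_t bit = vshrq_n_u32(mask, 31);
    // Move lane i's bit to bit position i, then sum the four lanes.
    uint32x4_t shifted = vshlq_u32(bit, vld1q_s32(kShift));
    return vaddvq_u32(shifted);
#else
    // Scalar fallback with the same meaning.
    unsigned bits = 0;
    for (int i = 0; i < 4; ++i) {
        if (diff[i] < limit) {
            bits |= 1u << i;
        }
    }
    return bits;
#endif
}
```

With this, the two tests from the question become `(bits & 0x3u) == 0x3u` for lanes 0 and 1, and `(bits & 0xCu) == 0xCu` for lanes 2 and 3.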
P.S. In future, please raise new questions as a new forum post - it's easier to track questions and answers that way.
Cheers, Pete