__builtin_ia32_cvtb2mask512() is the GNU C builtin for vpmovb2m k, zmm. The Intel intrinsic for it is _mm512_movepi8_mask.
It extracts the most-significant bit from each byte, producing an integer mask.
The SSE2 and AVX2 instructions pmovmskb and vpmovmskb (intrinsics _mm_movemask_epi8 and _mm256_movemask_epi8) do the same thing for 16-byte or 32-byte vectors, producing the mask in a GPR instead of an AVX-512 mask register.
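To make the bit layout concrete, here is a small portable C sketch of what these instructions compute for a 16-byte input: bit i of the result is the most-significant bit of byte i. The names movemask16_ref and movemask16_ref_example are made up for illustration; this is a reference model of the behaviour, not the intrinsic itself.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model (hypothetical helper): bit i of the result is the
 * most-significant bit of bytes[i], the layout pmovmskb produces. */
uint16_t movemask16_ref(const uint8_t bytes[16])
{
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    return mask;
}

/* Worked example: only bytes 0, 2 and 15 have their top bit set. */
void movemask16_ref_example(void)
{
    uint8_t bytes[16] = { 0x80, 0x7f, 0xff, 0x00 };  /* rest are zero */
    bytes[15] = 0x90;
    assert(movemask16_ref(bytes) == 0x8005);         /* bits 0, 2, 15 */
}
```

The 64-byte case follows the same rule with a 64-bit result.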
I have attached a basic scalar implementation in C. For anyone implementing this on Arm: what matters is each byte's high bit, and in a 128-bit vector every byte's high bit can easily be shifted down to the low bit with the NEON intrinsic vshrq_n_u8(). Note that I would prefer not to store the bitmap to memory; it should simply be the return value of the function, as in the following function.
#define _(n) __attribute((vector_size(1<<n),aligned(1)))
typedef char V _(6);      // 64 bytes, 512 bits
typedef unsigned long U;  // assumed to be 64 bits wide (LP64)
#undef _

U generic_cvtb2mask512(V v)
{
    U mask = 0;
    // Walk from byte 63 down to byte 0 so that byte i ends up in bit i,
    // as vpmovb2m does: shift mask left by 1 and OR in the MSB of v[i].
    for (int i = 63; i >= 0; i--)
        mask = (mask << 1) | ((v[i] & 0x80) >> 7);
    return mask;
}
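As a rough sketch of how the vshrq_n_u8() idea could be carried through without storing anything to memory: after the shift every byte holds 0 or 1, and three shift-right-and-accumulate steps (vsraq_n_u16, vsraq_n_u32, vsraq_n_u64) fold those bits together until byte 0 of each 64-bit half holds 8 mask bits. This is a commonly used NEON movemask idiom rather than anything specified for the x86 builtin. The names neon_movemask_shift and movemask16_bytes are made up here, the code assumes a little-endian target, and the scalar branch exists only so the snippet also builds where NEON is unavailable.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: bit i of the result is the MSB of byte i of v. */
static inline uint16_t neon_movemask_shift(uint8x16_t v)
{
    /* Each byte becomes 0 or 1. */
    uint16x8_t b1 = vreinterpretq_u16_u8(vshrq_n_u8(v, 7));
    /* Fold neighbours: low byte of each 16-bit lane now holds 2 bits. */
    uint32x4_t b2 = vreinterpretq_u32_u16(vsraq_n_u16(b1, b1, 7));
    /* Low byte of each 32-bit lane now holds 4 bits. */
    uint64x2_t b4 = vreinterpretq_u64_u32(vsraq_n_u32(b2, b2, 14));
    /* Low byte of each 64-bit lane now holds 8 bits. */
    uint8x16_t b8 = vreinterpretq_u8_u64(vsraq_n_u64(b4, b4, 28));
    return (uint16_t)(vgetq_lane_u8(b8, 0) | (vgetq_lane_u8(b8, 8) << 8));
}
#endif

/* Wrapper over plain bytes so the result can be checked on any target. */
uint16_t movemask16_bytes(const uint8_t bytes[16])
{
#if defined(__ARM_NEON)
    return neon_movemask_shift(vld1q_u8(bytes));
#else
    uint16_t mask = 0;                      /* scalar fallback */
    for (int i = 0; i < 16; i++)
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    return mask;
#endif
}

void movemask16_bytes_example(void)
{
    uint8_t bytes[16] = { 0 };
    bytes[1] = 0x80;   /* -> bit 1 */
    bytes[9] = 0xc3;   /* -> bit 9 */
    assert(movemask16_bytes(bytes) == 0x0202);
}
```

Only neon_movemask_shift is the part of interest: it takes the vector by value and returns the mask directly. The byte-array wrapper is there purely for testing.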
The following is one possible algorithm for 16 bytes (a 128-bit vector); for 64 bytes (a 512-bit vector) it would just need to be run four times, once per 16-byte chunk:
#include <arm_neon.h>

unsigned short get_16msb(uint8x16_t v)
{
    unsigned short ret = 0;
    // per byte, make every bit the same as the msb
    // (vtstq_u8 gives 0xFF where (v & msb) != 0; vceqq_u8 would only
    // match bytes exactly equal to 0x80)
    uint8x16_t msb = vdupq_n_u8(0x80);
    uint8x16_t filled = vtstq_u8(v, msb);
    // create a mask of each bit value (byte i of each half selects bit i)
    const uint8x16_t b = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                          0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
    // and vectors together
    uint8x16_t z = vandq_u8(filled, b);
    // extract lower 8 bytes, hi 8 bytes
    uint8x8_t lo = vget_low_u8(z);
    uint8x8_t hi = vget_high_u8(z);
    // 'or' the 8 bytes of lo together ...
    // put in byte 1 of ret
    // 'or' the 8 bytes of hi together ...
    // put in byte 2 of ret
    return ret;
}
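One way the two "or the 8 bytes together" steps might be filled in, assuming an AArch64 target: because every byte within a half carries a different bit, adding the eight bytes gives the same result as OR-ing them, and vaddv_u8() performs that horizontal add and returns the sum as a scalar. Four 16-byte results then combine into the 64-bit mask. The function names, the choice of passing the 512 bits as four uint8x16_t values, and the scalar fallback for non-AArch64 builds are illustrative assumptions, not an established API.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Hypothetical helper: bit i of the result is the MSB of byte i of v. */
static inline uint16_t neon_movemask16(uint8x16_t v)
{
    static const uint8_t bit_table[16] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
    /* 0xFF in every byte whose top bit is set, 0x00 elsewhere. */
    uint8x16_t filled = vtstq_u8(v, vdupq_n_u8(0x80));
    /* Keep one distinct bit per byte within each 8-byte half. */
    uint8x16_t z = vandq_u8(filled, vld1q_u8(bit_table));
    /* The bits are distinct, so a horizontal add acts as a horizontal OR. */
    uint16_t lo = vaddv_u8(vget_low_u8(z));
    uint16_t hi = vaddv_u8(vget_high_u8(z));
    return (uint16_t)(lo | (hi << 8));
}

/* 512 bits passed as four 128-bit vectors; v0 holds bytes 0..15. */
static inline uint64_t neon_movemask64(uint8x16_t v0, uint8x16_t v1,
                                       uint8x16_t v2, uint8x16_t v3)
{
    return  (uint64_t)neon_movemask16(v0)
         | ((uint64_t)neon_movemask16(v1) << 16)
         | ((uint64_t)neon_movemask16(v2) << 32)
         | ((uint64_t)neon_movemask16(v3) << 48);
}
#endif

/* Wrapper over plain bytes so the result can be checked on any target. */
uint64_t movemask64_bytes(const uint8_t bytes[64])
{
#if defined(__aarch64__)
    return neon_movemask64(vld1q_u8(bytes),      vld1q_u8(bytes + 16),
                           vld1q_u8(bytes + 32), vld1q_u8(bytes + 48));
#else
    uint64_t mask = 0;                      /* scalar fallback */
    for (int i = 0; i < 64; i++)
        mask |= (uint64_t)(bytes[i] >> 7) << i;
    return mask;
#endif
}

void movemask64_bytes_example(void)
{
    uint8_t bytes[64] = { 0 };
    bytes[0]  = 0xff;   /* -> bit 0  */
    bytes[17] = 0x80;   /* -> bit 17 */
    bytes[63] = 0x81;   /* -> bit 63 */
    assert(movemask64_bytes(bytes) == 0x8000000000020001ULL);
}
```

In neon_movemask64 the mask itself is only ever the return value; the only memory access in the vector path is the load of the constant bit table. The byte-array wrapper exists so the sketch can be exercised from plain C.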
Hello Anuj,
You may find the below learning path useful:
Porting architecture specific intrinsics
https://learn.arm.com/learning-paths/cross-platform/intrinsics/
Regards Ronan