__builtin_ia32_cvtb2mask512() is the GNU C builtin for vpmovb2m k, zmm. The Intel intrinsic for it is _mm512_movepi8_mask.
It extracts the most-significant bit from each of the 64 bytes, producing a 64-bit integer mask in which bit i is the MSB of byte i.
The SSE2 and AVX2 instructions pmovmskb and vpmovmskb (intrinsics _mm_movemask_epi8 and _mm256_movemask_epi8) do the same thing for 16-byte or 32-byte vectors, producing the mask in a general-purpose register instead of an AVX-512 mask register.
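To make the bit ordering concrete, here is a minimal sketch of the 16-byte case. The helper name movemask16 is hypothetical, not part of any library. It uses _mm_movemask_epi8 when the compiler targets SSE2 and otherwise falls back to a plain loop that models the same result.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_movemask_epi8 */
#endif

/* Hypothetical helper: bit i of the result is the MSB of bytes[i]. */
static unsigned movemask16(const uint8_t bytes[16])
{
#if defined(__SSE2__)
    /* pmovmskb: the 16 byte MSBs land in bits 0..15 of a GPR. */
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    return (unsigned)_mm_movemask_epi8(v);
#else
    /* Portable model of the same operation. */
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(bytes[i] >> 7) << i;
    return mask;
#endif
}
```

For example, an input whose bytes 0, 3 and 15 are 0x80, 0xFF and 0x90 (all others zero) gives 0x8009: byte 0 maps to bit 0 and byte 15 to bit 15.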
I have attached a basic scalar implementation in C. For those trying to implement this on Arm: we care about the high bit of each byte, and in a 128-bit vector each byte's high bit can easily be shifted down to the low bit with the NEON intrinsic vshrq_n_u8(). Note that I would prefer not to store the bitmap to memory; it should simply be the return value of the function, as in the following function.
#define _(n) __attribute((vector_size(1<<n),aligned(1)))
typedef char V _(6);      // 64 bytes, 512 bits
typedef unsigned long U;  // assumed to be 64 bits wide (LP64)
#undef _

U generic_cvtb2mask512(V v) {
    U mask = 0;
    int i = 0;
    while (i < 64) {
        // shift mask by 1 and OR with MSB of v[i] byte
        // (byte 0 ends up in bit 63; vpmovb2m itself puts byte 0 in bit 0)
        mask = (mask << 1) | ((v[i] & 0x80) >> 7);
        i++;
    }
    return mask;
}
This is one possible algorithm for 16 bytes (a 128-bit vector); it would just need to be put into a loop for 64 bytes (a 512-bit vector):
#define _(n) __attribute((vector_size(1<<n),aligned(1)))
typedef char g4 _(4); // 16 bytes, 128 bits
typedef char g3 _(3); // 8 bytes, 64 bits
typedef unsigned long U;
#undef _

unsigned short get_16msb(g4 v) {
    unsigned short ret = 0;
    // per byte, make every bit same as msb
    // (vtstq_u8 gives 0xFF where the msb is set; vceqq_u8 would only match bytes equal to 0x80)
    g4 msb = vdupq_n_u8(0x80);
    g4 filled = vtstq_u8(v, msb);
    // create a mask of each bit value
    g4 b = {0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
            0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01};
    // and vectors together
    g4 z = vandq_u8(filled, b);
    // extract lower 8 bytes, hi 8 bytes
    g3 lo = vget_low_u8(z);
    g3 hi = vget_high_u8(z);
    // 'or' the 8 bytes of lo together ...
    // put in byte 1 of ret
    // 'or' the 8 bytes of hi together ...
    // put in byte 2 of ret
    return ret;
}
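One way the missing "or the 8 bytes together" step might be completed is sketched below. It is an illustration under stated assumptions, not a tuned implementation. The function names msb_mask16 and msb_mask64 are hypothetical. The NEON path assumes AArch64, because it relies on the across-lanes add vaddv_u8. Adding is equivalent to OR-ing here since each lane contributes a distinct bit. Other targets get a scalar fallback so the code builds anywhere. Unlike the sketch above, the bit weights run from 0x01 upward so that byte i lands in bit i, the same order vpmovb2m uses. The 64-byte version keeps the mask in a register and returns it by value.

```c
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: bit i of the result is the MSB of p[i], 16 bytes. */
static inline uint16_t msb_mask16(const uint8_t *p)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* One distinct bit per lane within each 8-byte half. */
    static const uint8_t weights[16] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };
    uint8x16_t v = vld1q_u8(p);
    /* 0xFF in every lane whose MSB is set, 0x00 elsewhere. */
    uint8x16_t filled = vtstq_u8(v, vdupq_n_u8(0x80));
    /* Keep only each lane's own bit. */
    uint8x16_t bits = vandq_u8(filled, vld1q_u8(weights));
    /* Sum each half across lanes; the sum is at most 0xFF. */
    uint8_t lo = vaddv_u8(vget_low_u8(bits));
    uint8_t hi = vaddv_u8(vget_high_u8(bits));
    return (uint16_t)(lo | (hi << 8));
#else
    /* Portable fallback with the same result. */
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (uint16_t)((p[i] >> 7) << i);
    return mask;
#endif
}

/* Hypothetical helper: 64 bytes in, 64-bit mask out, byte i -> bit i. */
static inline uint64_t msb_mask64(const uint8_t *p)
{
    uint64_t mask = 0;
    for (int i = 0; i < 4; i++)
        mask |= (uint64_t)msb_mask16(p + 16 * i) << (16 * i);
    return mask;
}
```

If the reversed order of the scalar version above is wanted instead (byte 0 in the highest bit), the weights table and the shift direction in msb_mask64 would both need to be flipped.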
Hello Anuj,
You may find the below learning path useful:
Porting architecture specific intrinsics
https://learn.arm.com/learning-paths/cross-platform/intrinsics/
Regards,
Ronan