x86 _mm_sign_epi8(__m128i a, __m128i b) intrinsic NEON equivalent

I have the x86 intrinsic __m128i _mm_sign_epi8(__m128i a, __m128i b), which performs the following task:

for (i = 0; i < 16; i++) {
    if (b[i] < 0) {
        r[i] = -a[i];
    } else if (b[i] == 0) {
        r[i] = 0;
    } else {
        r[i] = a[i];
    }
}

I am wondering what the best way is to do this same operation using NEON intrinsics.

  • Hi,

    NEON doesn't have an equivalent intrinsic for _mm_sign_epi8. According to the Arm architects, designing SIMD instructions to match other, non-Arm architectures one-to-one has never been a goal. Compilers might not vectorize this piece of code, so I have come up with the following sequence of NEON instructions and intrinsics for your consideration. I am not guaranteeing this is the best solution; the best solution always depends on context, e.g. on how many elements you have.

    In instructions:

    CMGT V1.16B, Vb.16B, #0      // V1 = all-ones (0xFF) where b > 0
    CMLT V2.16B, Vb.16B, #0      // V2 = all-ones (-1 as signed) where b < 0
    NEG  V3.16B, V1.16B          // negate the b > 0 mask: V3 = +1 where b > 0
    ORR  V4.16B, V3.16B, V2.16B  // V4 = +1, 0 or -1 per lane
    MUL  Vr.16B, V4.16B, Va.16B  // Vr = V4 * Va

    In intrinsics:

    int8x16_t vdupq_n_s8 (int8_t value)
    uint8x16_t vcgtq_s8 (int8x16_t a, int8x16_t b)
    uint8x16_t vcltq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t vreinterpretq_s8_u8 (uint8x16_t a)
    int8x16_t vnegq_s8 (int8x16_t a)
    int8x16_t vorrq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b)

    A small test; at -O0 armclang won't optimize foo away:

    #include <arm_neon.h>

    int8x16_t foo(int8x16_t a, int8x16_t b)
    {
        int8x16_t r, v3;
        int8x16_t v0 = vdupq_n_s8(0);              // zero vector to compare against
        uint8x16_t v1, v2;

        v1 = vcgtq_s8(b, v0);  // v1[i] = 0xFF when b[i] > 0, else 0
        v2 = vcltq_s8(b, v0);  // v2[i] = 0xFF when b[i] < 0, else 0
        v3 = vnegq_s8(vreinterpretq_s8_u8(v1));    // turn the all-ones (-1) b>0 mask into +1
        r = vorrq_s8(v3, vreinterpretq_s8_u8(v2)); // r[i] is now +1, 0 or -1
        r = vmulq_s8(r, a);                        // r[i] = r[i] * a[i]

        return r;
    }

    Does this help?
