I have the x86 intrinsic __m128i _mm_sign_epi8(__m128i a, __m128i b). It performs the following task:

    for (i = 0; i < 16; i++) {
        if (b[i] < 0) {
            r[i] = -a[i];
        } else if (b[i] == 0) {
            r[i] = 0;
        } else {
            r[i] = a[i];
        }
    }

I am wondering what is the best way to do the same thing using NEON intrinsics.
Hi,
NEON doesn't have an equivalent intrinsic for _mm_sign_epi8. According to Arm architects, it has never been a goal to design SIMD instructions that match other, non-Arm architectures. Compilers might not vectorize this piece of code either. I have come up with the following sequence of NEON instructions and intrinsics for your consideration. I can't guarantee this is the best solution; the best solution always depends on context, e.g. how many elements you have.
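In instructions:

    CMGT V1.16B, Vb.16B, #0      // V1[i] = 0xFF (-1) where b[i] > 0, else 0
    CMLT V2.16B, Vb.16B, #0      // V2[i] = 0xFF (-1) where b[i] < 0, else 0
    NEG  V3.16B, V1.16B          // V3[i] = +1 where b[i] > 0, else 0
    ORR  V4.16B, V3.16B, V2.16B  // V4[i] = +1, 0 or -1, i.e. the sign of b[i]
    MUL  Vr.16B, V4.16B, Va.16B  // Vr[i] = sign(b[i]) * a[i]

The compares set every bit of a lane when the condition holds, so a true lane reads as -1 when treated as a signed byte. That is why the b > 0 mask is the one that gets negated.

In intrinsics:

    int8x16_t  vdupq_n_s8 (int8_t value)
    uint8x16_t vcgtq_s8 (int8x16_t a, int8x16_t b)
    uint8x16_t vcltq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t  vreinterpretq_s8_u8 (uint8x16_t a)
    int8x16_t  vnegq_s8 (int8x16_t a)
    int8x16_t  vorrq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t  vmulq_s8 (int8x16_t a, int8x16_t b)

A small test, armclang won't optimize the foo away with -O0:

    #include "arm_neon.h"

    int8x16_t foo(int8x16_t a, int8x16_t b)
    {
        int8x16_t r, v0, v3;
        uint8x16_t v1, v2;

        v0 = vdupq_n_s8(0);                          // all-zero vector to compare against
        v1 = vcgtq_s8(b, v0);                        // v1[i] = 0xFF when b[i] > 0, else 0
        v2 = vcltq_s8(b, v0);                        // v2[i] = 0xFF when b[i] < 0, else 0
        v3 = vnegq_s8(vreinterpretq_s8_u8(v1));      // negate v1: v3[i] = +1 when b[i] > 0, else 0
        r  = vorrq_s8(v3, vreinterpretq_s8_u8(v2));  // bitwise ORR: r[i] is +1, 0 or -1
        r  = vmulq_s8(r, a);                         // r[i] = r[i] * a[i]
        return r;
    }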
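If you want to sanity-check a compare/negate/OR/multiply sequence like this without NEON hardware, one option is to model a single 8-bit lane in plain C and compare it with the reference behaviour of _mm_sign_epi8 over every (a, b) pair. The sketch below is only that, a model: the names sign_ref, sign_lane_model and count_mismatches are made up for illustration, and it assumes the usual two's-complement wrap when converting to int8_t, and that a true compare lane is 0xFF, so the b > 0 mask is the one negated.

```c
#include <stdint.h>

/* Reference behaviour of _mm_sign_epi8 for one lane (hypothetical helper). */
static int8_t sign_ref(int8_t a, int8_t b)
{
    if (b < 0)
        return (int8_t)(uint8_t)(0u - (uint8_t)a); /* -a, wrapping for a == -128 */
    if (b == 0)
        return 0;
    return a;
}

/* One lane of the compare/negate/OR/multiply sequence (hypothetical helper). */
static int8_t sign_lane_model(int8_t a, int8_t b)
{
    uint8_t gt  = (b > 0) ? 0xFF : 0x00;   /* models CMGT against zero        */
    uint8_t lt  = (b < 0) ? 0xFF : 0x00;   /* models CMLT against zero        */
    uint8_t pos = (uint8_t)(0u - gt);      /* models NEG: 0xFF becomes 0x01   */
    uint8_t sgn = (uint8_t)(pos | lt);     /* models ORR: 0x01, 0x00 or 0xFF  */
    /* models MUL: only the low 8 bits of the product are kept */
    return (int8_t)(uint8_t)((unsigned)sgn * (uint8_t)a);
}

/* Counts the (a, b) pairs where the model disagrees with the reference. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (sign_lane_model((int8_t)a, (int8_t)b) !=
                sign_ref((int8_t)a, (int8_t)b))
                bad++;
    return bad;
}
```

Calling count_mismatches() from a small main() and checking that it returns 0 covers all 65536 input pairs, including the a == -128 corner case where the negation wraps.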
Does this help?