I have the x86 intrinsic __m128i _mm_sign_epi8(__m128i a, __m128i b). It performs the following task:

    for (i = 0; i < 16; i++) {
        if (b[i] < 0) {
            r[i] = -a[i];
        } else if (b[i] == 0) {
            r[i] = 0;
        } else {
            r[i] = a[i];
        }
    }

I am wondering what the best way is to do the same thing using NEON intrinsics.
Hi,
NEON doesn't have an equivalent intrinsic for _mm_sign_epi8. According to Arm architects, matching the SIMD instructions of non-Arm architectures was never a design goal. The compiler might not vectorize this piece of code either. I have come up with the following sequence of NEON instructions and intrinsics for your consideration. I am not guaranteeing that this is the best solution; the best solution always depends on context, e.g. how many elements you have.
In instructions:
    CMGT V1.16B, Vb.16B, #0
    CMLT V2.16B, Vb.16B, #0
    NEG  V3.16B, V1.16B
    ORR  V4.16B, V3.16B, V2.16B
    MUL  Vr.16B, V4.16B, Va.16B

Note that the compares set a true lane to all ones (0xFF, i.e. -1 as a signed byte), not to 1, so it is the greater-than mask that has to be negated to get +1.

In intrinsics:

    uint8x16_t vcgtq_s8 (int8x16_t a, int8x16_t b)
    uint8x16_t vcltq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t vnegq_s8 (int8x16_t a)
    int8x16_t vorrq_s8 (int8x16_t a, int8x16_t b)
    int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b)

The test below also uses vdupq_n_s8 for the zero vector and vreinterpretq_s8_u8 to convert the compare masks to signed vectors.

A small test, armclang won't optimize the foo away with -O0:

    #include "arm_neon.h"

    int8x16_t foo(int8x16_t a, int8x16_t b)
    {
        int8x16_t r, v3;
        int8x16_t v0 = vdupq_n_s8(0);  // all-zero vector to compare against
        uint8x16_t v1, v2;

        v1 = vcgtq_s8(b, v0);  // v1[i] = 0xFF (-1 as signed) when b[i] > 0, else 0
        v2 = vcltq_s8(b, v0);  // v2[i] = 0xFF (-1 as signed) when b[i] < 0, else 0
        v3 = vnegq_s8(vreinterpretq_s8_u8(v1));     // negate v1: v3[i] = 1 when b[i] > 0, else 0
        r = vorrq_s8(v3, vreinterpretq_s8_u8(v2));  // bitwise ORR of v3 and v2: r[i] is 1, 0 or -1, the sign of b[i]
        r = vmulq_s8(r, a);    // r[i] = r[i] * a[i]
        return r;
    }
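If it helps to check the lane arithmetic without Arm hardware, here is a rough one-lane scalar model of a compare, negate, ORR, multiply sequence, checked against the loop from the question. It is only a sketch: the names sign_ref, sign_lane_model and count_mismatches are made up for illustration, and it assumes the compares yield all ones (-1 as a signed byte) for a true lane and that narrowing to int8_t wraps, as it does with GCC and Clang.

```c
#include <stdint.h>

/* Reference: the per-element behaviour described in the question. */
static int8_t sign_ref(int8_t a, int8_t b)
{
    if (b < 0)  return (int8_t)(-a);   /* wraps for a == -128 (assumed) */
    if (b == 0) return 0;
    return a;
}

/* Hypothetical scalar model of one 8-bit lane of the NEON sequence. */
static int8_t sign_lane_model(int8_t a, int8_t b)
{
    int8_t gt  = (b > 0) ? -1 : 0;     /* CMGT #0: all ones when b > 0 */
    int8_t lt  = (b < 0) ? -1 : 0;     /* CMLT #0: all ones when b < 0 */
    int8_t pos = (int8_t)(-gt);        /* NEG: 1 when b > 0, else 0    */
    int8_t s   = (int8_t)(pos | lt);   /* ORR: 1, 0 or -1              */
    return (int8_t)(s * a);            /* MUL keeps the low 8 bits     */
}

/* Compare model and reference over every (a, b) pair of 8-bit values. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = -128; a <= 127; a++)
        for (int b = -128; b <= 127; b++)
            if (sign_lane_model((int8_t)a, (int8_t)b) !=
                sign_ref((int8_t)a, (int8_t)b))
                bad++;
    return bad;
}
```

Calling count_mismatches() from a small main and checking that it returns 0 covers all 65536 input pairs, including the a = -128 case where the negation wraps.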
Does this help?