x86 _mm_sign_epi8(__m128i a, __m128i b) intrinsic NEON equivalent

I have the x86 intrinsic __m128i _mm_sign_epi8(__m128i a, __m128i b), which performs the following operation on each of the 16 byte lanes:
for (i = 0; i < 16; i++) {
    if (b[i] < 0) {
        r[i] = -a[i];
    }
    else if (b[i] == 0) {
        r[i] = 0;
    }
    else {
        r[i] = a[i];
    }
}
I am wondering what the best way is to perform this same operation using NEON intrinsics.

0 Juan Gao over 7 years ago

Hi,

NEON has no direct equivalent of _mm_sign_epi8. According to Arm's architects, matching the SIMD instructions of non-Arm architectures was never a design goal. Compilers might not vectorize this piece of code either. I have come up with the following sequence of NEON instructions and intrinsics for your consideration. I cannot guarantee that it is the best solution; the best solution always depends on context, e.g. how many elements you have.

In instructions:

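CMGT V1.16B, Vb.16B, #0       // V1[i] = 0xFF (-1) where b[i] > 0, else 0
CMLT V2.16B, Vb.16B, #0       // V2[i] = 0xFF (-1) where b[i] < 0, else 0
NEG  V3.16B, V1.16B           // V3[i] = +1 where b[i] > 0, else 0
ORR  V4.16B, V3.16B, V2.16B   // V4[i] = +1, 0 or -1
MUL  Vr.16B, V4.16B, Va.16B   // r[i] = V4[i] * a[i]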
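
One detail that is easy to get wrong in a compare/negate/OR/multiply sequence of this kind: the NEON compares write all ones (0xFF, which is -1 as a signed byte) into a lane when the condition holds, not 1. So it is the b > 0 mask that has to be negated to obtain a +1 multiplier, while the b < 0 mask can be used as -1 directly. A minimal per-lane sketch in plain C, assuming that mask convention; `sign_lane` is a hypothetical helper that models a single byte lane and is not a NEON API:

```c
#include <stdint.h>

/* Hypothetical helper: models one byte lane of a CMGT/CMLT/NEG/ORR/MUL
   sequence. NEON compares yield 0xFF (-1 as int8_t) when true, else 0. */
static int8_t sign_lane(int8_t a, int8_t b)
{
    int8_t gt  = (int8_t)(b > 0 ? -1 : 0);   /* CMGT: -1 where b > 0 */
    int8_t lt  = (int8_t)(b < 0 ? -1 : 0);   /* CMLT: -1 where b < 0 */
    int8_t pos = (int8_t)-gt;                /* NEG:  +1 where b > 0 */
    int8_t s   = (int8_t)(pos | lt);         /* ORR:  +1, 0 or -1    */

    /* MUL keeps only the low 8 bits of the product, so the multiply is
       done in unsigned arithmetic and truncated: -(-128) wraps to -128. */
    return (int8_t)(uint8_t)((unsigned)s * (unsigned)a);
}
```

For example, `sign_lane(5, -3)` yields -5, `sign_lane(5, 0)` yields 0, and `sign_lane(5, 3)` yields 5.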

In intrinsics:

uint8x16_t vcgtq_s8 (int8x16_t a, int8x16_t b)
uint8x16_t vcltq_s8 (int8x16_t a, int8x16_t b)
int8x16_t vnegq_s8 (int8x16_t a)
int8x16_t vorrq_s8 (int8x16_t a, int8x16_t b)
int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b)

A small test (armclang will not optimize foo away at -O0):

#include <arm_neon.h>

int8x16_t foo(int8x16_t a, int8x16_t b)
{
    int8x16_t r, v0, v3;
    uint8x16_t v1, v2;

    v0 = vdupq_n_s8(0);                        // all lanes zero, used as the comparison operand
    v1 = vcgtq_s8(b, v0);                      // v1[i] = 0xFF where b[i] > 0, else 0
    v2 = vcltq_s8(b, v0);                      // v2[i] = 0xFF where b[i] < 0, else 0
    v3 = vnegq_s8(vreinterpretq_s8_u8(v1));    // 0xFF is -1 as int8, so v3[i] = +1 where b[i] > 0
    r = vorrq_s8(v3, vreinterpretq_s8_u8(v2)); // bitwise OR: r[i] is +1, 0 or -1
    r = vmulq_s8(r, a);                        // r[i] = r[i] * a[i]

    return r;
}
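
To check a routine like foo on a target, it helps to have a plain C reference to compare against. The following is a minimal sketch, assuming the semantics of the loop in the question; `sign_epi8_ref` is a hypothetical name, and the -128 wrap-around is handled explicitly because 8-bit negation of -128 stays -128:

```c
#include <stdint.h>

/* Hypothetical scalar reference for the _mm_sign_epi8 behaviour described
   in the question, operating on 16 signed byte lanes. */
static void sign_epi8_ref(const int8_t a[16], const int8_t b[16], int8_t r[16])
{
    for (int i = 0; i < 16; i++) {
        if (b[i] < 0) {
            /* Negate in unsigned arithmetic so that -128 wraps to -128
               instead of overflowing. */
            r[i] = (int8_t)(uint8_t)(0u - (uint8_t)a[i]);
        } else if (b[i] == 0) {
            r[i] = 0;
        } else {
            r[i] = a[i];
        }
    }
}
```

On an Arm target, the inputs could be loaded with vld1q_s8, passed to foo, stored back with vst1q_s8, and compared against the reference output with memcmp.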

Does this help?