
How to shuffle bytes and check high-bit values using NEON intrinsics?

Note: This was originally posted on 1st November 2011 at http://forums.arm.com

Hi,

I am trying to convert code written with SSE3 intrinsics to NEON SIMD and am stuck on a shuffle function. I have looked at the GCC intrinsics and the ARM manuals but have not been able to find a solution.

Is there an equivalent of the SSSE3 _mm_shuffle_epi8 function? Any suggestions on how to implement this would be really appreciated, since I can't seem to get past this. I know that a table-lookup instruction exists, but it does not do an initial comparison like _mm_shuffle_epi8 does, so I am not sure how to implement this.

Also, I need to check only the high bit of each byte in a register. Is there any way to check the high-bit value of each element in a vector? I looked at the manual and was not able to find anything relevant. Any help or info would be sincerely appreciated.

Cheers,



  • Note: This was originally posted on 2nd November 2011 at http://forums.arm.com

    vtbl actually does have a special case for setting the value to zero. The only difference between it and SSSE3's pshufb is that it sets the result to zero if any of the out-of-range bits of the index are set, not just the most significant bit. If you're using a table of 16 values like pshufb, that means bits 4 through 7 of each index. If for some reason your input can have any of bits 4 through 6 set, you can clear them before the vtbl by using vand or vbic.

    If you're working with 128-bit vectors, you do have to use vtbl twice: once for the lower half of the result and once for the upper half.
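    As a rough illustration of the two points above, here is a minimal sketch of a pshufb-style shuffle over 16 bytes. The function name pshufb_emul and the array-based interface are made up for this example. The NEON path uses vandq_u8 with 0x8F to clear bits 4 through 6 of each index, then two vtbl2_u8 lookups. The scalar branch is only there so the sketch also builds on non-ARM hosts, and it doubles as a statement of the intended pshufb semantics.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: emulate pshufb on 16 bytes with two vtbl2 lookups. */
static void pshufb_emul(const uint8_t table[16], const uint8_t idx[16],
                        uint8_t out[16])
{
    uint8x16_t t = vld1q_u8(table);
    uint8x16_t i = vld1q_u8(idx);

    /* Keep bit 7 (the "zero this lane" flag) and bits 0-3 (the index).
       Bits 4-6 are ignored by pshufb but would push the index out of
       range for vtbl, so clear them first. */
    i = vandq_u8(i, vdupq_n_u8(0x8F));

    /* The 16-byte table is passed to vtbl2 as a pair of 64-bit registers. */
    uint8x8x2_t tbl;
    tbl.val[0] = vget_low_u8(t);
    tbl.val[1] = vget_high_u8(t);

    /* One lookup per 64-bit half of the index vector. Any index >= 16
       (here: any index with bit 7 set) yields 0. */
    uint8x8_t lo = vtbl2_u8(tbl, vget_low_u8(i));
    uint8x8_t hi = vtbl2_u8(tbl, vget_high_u8(i));

    vst1q_u8(out, vcombine_u8(lo, hi));
}

#else

/* Scalar fallback with the same semantics, for hosts without NEON. */
static void pshufb_emul(const uint8_t table[16], const uint8_t idx[16],
                        uint8_t out[16])
{
    for (int k = 0; k < 16; ++k)
        out[k] = (idx[k] & 0x80) ? 0 : table[idx[k] & 0x0F];
}

#endif
```

    If the indices are already known to have bits 4 through 6 clear, the vandq_u8 step can presumably be dropped.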

    As for your second question, we need to know more about what you mean by "check" the most significant bits. If you want to generate a byte mask that's 0xFF where the MSB is set and 0x00 where it isn't, you can accomplish it with vclt.s8 #0, vtst.8 against 0x80, or vshr.s8 #7 (I recommend the first one). If you want to pack the MSBs into an 8-bit mask, one bit per byte, the way pmovmskb does, that'll take more code. If at all possible, it'd be best to change the algorithm to not need this. But if you must have it, you can do it with the following:

    - Expand the MSB to a byte mask using one of the above methods
    - Isolate a different single bit in each byte by ANDing the byte mask with a vector containing { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 }
    - Combine the bits using a series of three pairwise adds (vpadd)

    This works best if you can do it over more than one vector's worth of bytes, so the later vpadds have more data to work with and can hide latency better.
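    The three steps above could look roughly like this for a single 64-bit vector of 8 bytes. The function name msb_mask8 and its array-based interface are invented for this sketch. It handles one vector only, so it does not show the multi-vector batching mentioned above. The scalar branch is there so the sketch also builds on hosts without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: pack the MSB of each of 8 bytes into one 8-bit
   mask, with byte 0 mapped to bit 0. */
static uint8_t msb_mask8(const uint8_t bytes[8])
{
    static const uint8_t bit_per_lane[8] =
        {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};

    int8x8_t v = vreinterpret_s8_u8(vld1_u8(bytes));

    /* Step 1: byte mask, 0xFF where the MSB is set (signed value < 0). */
    uint8x8_t mask = vclt_s8(v, vdup_n_s8(0));

    /* Step 2: keep a different single bit in each lane. */
    uint8x8_t bits = vand_u8(mask, vld1_u8(bit_per_lane));

    /* Step 3: three pairwise adds fold 8 lanes into one. The lanes hold
       distinct bits, so the sum cannot overflow a byte. */
    bits = vpadd_u8(bits, bits);
    bits = vpadd_u8(bits, bits);
    bits = vpadd_u8(bits, bits);

    return vget_lane_u8(bits, 0);
}

#else

/* Scalar fallback with the same result, for hosts without NEON. */
static uint8_t msb_mask8(const uint8_t bytes[8])
{
    uint8_t m = 0;
    for (int k = 0; k < 8; ++k)
        if (bytes[k] & 0x80)
            m |= (uint8_t)(1u << k);
    return m;
}

#endif
```

    For example, the input { 0x80, 0x00, 0xFF, 0x7F, 0x01, 0x81, 0x00, 0xC0 } has the MSB set in bytes 0, 2, 5 and 7, so the result is 0xA5.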