NEON: fast 128 bit comparison

Note: This was originally posted on 30th January 2012 at http://forums.arm.com

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating-point comparison:

vcmp.f64        d0, d6
vmrs            APSR_nzcv, fpscr
vcmpeq.f64      d1, d7
vmrseq          APSR_nzcv, fpscr


This version does not work if either 64-bit half holds a NaN bit pattern, because a NaN never compares equal, not even to itself.
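The NaN caveat can be reproduced in portable C, independently of the assembly above. The sketch below is only an illustration: `as_double` and `fp_compare_matches_bits` are hypothetical helper names, not part of the original post. It reinterprets 64-bit patterns as doubles and checks whether a floating-point equality test agrees with a bitwise one.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 64-bit pattern as an IEEE 754 double. */
static double as_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 when the floating-point compare gives the same answer
 * as the bitwise compare, 0 when the two disagree. */
static int fp_compare_matches_bits(uint64_t a, uint64_t b) {
    int bits_equal = (a == b);
    int fp_equal = (as_double(a) == as_double(b));
    return bits_equal == fp_equal;
}
```

Comparing a NaN pattern such as 0x7FF8000000000001 with itself makes the function return 0. The same happens for 0x0000000000000000 against 0x8000000000000000, because +0.0 and -0.0 compare equal as doubles although their bits differ.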

(2) Using NEON narrowing and a single VFP comparison (NaN-safe this time):

vceq.i32        q15, q0, q3
vmovn.i32       d31, q15
vshl.s16        d31, d31, #8
vcmp.f64        d31, d29
vmrs            APSR_nzcv, fpscr

The D29 register is preloaded beforehand with the matching 16-bit pattern:

vmov.i16        d29, #65280  ; 0xff00
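To see why version (2) avoids the NaN problem, the lane arithmetic can be modelled in scalar C. This is a sketch under stated assumptions, not NEON code: `model_mask`, `is_nan_pattern` and `ALL_EQUAL_PATTERN` are hypothetical names, and lane 0 is assumed to sit in the least significant bits of the D register.

```c
#include <stdint.h>

/* Contents of d29 after "vmov.i16 d29, #0xff00": four 16-bit lanes of 0xFF00. */
#define ALL_EQUAL_PATTERN 0xFF00FF00FF00FF00ull

/* Scalar model of vceq.i32 + vmovn.i32 + vshl.s16 #8.
 * a and b each hold one 128-bit value as four 32-bit lanes. */
static uint64_t model_mask(const uint32_t a[4], const uint32_t b[4]) {
    uint64_t d31 = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t eq32 = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u; /* vceq.i32     */
        uint16_t eq16 = (uint16_t)eq32;                     /* vmovn.i32    */
        uint16_t lane = (uint16_t)(eq16 << 8);              /* vshl.s16 #8  */
        d31 |= (uint64_t)lane << (16 * i);
    }
    return d31;
}

/* IEEE 754 binary64 NaN: exponent field all ones and a nonzero fraction. */
static int is_nan_pattern(uint64_t bits) {
    return ((bits >> 52) & 0x7FFu) == 0x7FFu
        && (bits & 0xFFFFFFFFFFFFFull) != 0;
}
```

Every lane ends up as either 0xFF00 or 0x0000, so the exponent field of the 64-bit result is 0x7F0 or 0x000 and never the all-ones value a NaN requires. Without the shift, an all-equal result would be 0xFFFFFFFFFFFFFFFF, which is a NaN pattern.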


My question is: is there anything better than this? Am I overlooking some obvious way to do it?



  • Note: This was originally posted on 31st January 2012 at http://forums.arm.com

    The scenario is the following: I have a 128-bit constant K and I want to compare it with the 128-bit "variables" A1, A2, ..., An in the following manner (with the loop unrolled):


    for i = 1, n do
      if A[i] != K then
        break;
    done


    So... should I lock down two cache lines and use them to communicate between VFP/NEON and ARM?

    Do the same performance penalties (regarding the pipelines) apply when using 64 bits-at-a-time comparisons (VFP only)?
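For reference, the loop described in this reply can be written as a portable C baseline. This is only a sketch: the `u128` type and `first_mismatch` function are hypothetical names, the 128-bit values are assumed to be stored as two 64-bit halves, and indices are 0-based rather than 1-based as in the pseudocode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128;

/* Returns the index of the first a[i] that differs from k,
 * or n if every element equals k. */
static size_t first_mismatch(const u128 *a, size_t n, u128 k) {
    for (size_t i = 0; i < n; i++) {
        /* XOR each half with k and OR the results: nonzero means
         * at least one bit differs, so one test covers all 128 bits. */
        if (((a[i].lo ^ k.lo) | (a[i].hi ^ k.hi)) != 0)
            return i;
    }
    return n;
}
```

Because it compares integers rather than floating-point values, this form has no NaN or signed-zero special cases. Whether it beats the NEON or VFP versions on a Cortex-A9 would have to be measured.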