NEON: fast 128 bit comparison

Note: This was originally posted on 30th January 2012 at http://forums.arm.com

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

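(1) Using the VFP floating point comparison:

vcmp.f64        d0, d6              ; compare the low halves of Q0 and Q3
vmrs            APSR_nzcv, fpscr    ; copy the VFP flags to the ARM flags
vcmpeq.f64      d1, d7              ; only if equal so far: compare the high halves
vmrseq          APSR_nzcv, fpscr    ; Z is set only if both halves matched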


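This version does not work if either 64-bit half happens to hold a NaN bit pattern: a NaN compares as unordered, even against an identical NaN, so bitwise-equal registers are reported as different. It also treats +0.0 and -0.0 as equal although their bits differ.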
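
The failure mode can be reproduced without any NEON hardware. The following is a minimal sketch in portable C, assuming IEEE 754 doubles and default floating point compiler settings; the helper names (bits_to_double, equal_as_f64, equal_as_bits) are made up for illustration and are not part of the original post.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, which is how vcmp.f64
 * treats the contents of a D register. */
static double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: the verdict of a floating point comparison. */
static bool equal_as_f64(uint64_t a, uint64_t b) {
    return bits_to_double(a) == bits_to_double(b);
}

/* Hypothetical helper: the verdict of a bitwise comparison. */
static bool equal_as_bits(uint64_t a, uint64_t b) {
    return a == b;
}
```

For the pattern 0x7ff8000000000001 (a NaN), equal_as_bits says the two copies match while equal_as_f64 says they do not. For 0x0000000000000000 and 0x8000000000000000 (+0.0 and -0.0) the disagreement goes the other way.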

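(2) Using the NEON narrowing and the VFP comparison (this time only once and in a NaN-safe manner):

vceq.i32        q15, q0, q3         ; each 32-bit lane: all ones if equal, else zero
vmovn.i32       d31, q15            ; narrow to four 16-bit lanes (0xffff or 0x0000)
vshl.s16        d31, d31, #8        ; 0xffff -> 0xff00, so the pattern is never a NaN
vcmp.f64        d31, d29            ; all lanes equal <=> d31 matches d29
vmrs            APSR_nzcv, fpscr    ; copy the VFP flags to the ARM flags

The D29 register is preloaded beforehand with the matching 16-bit pattern, 0xff00 in every lane. Without the shift, an all-equal result would be 0xffffffffffffffff, which is a NaN when read as a double: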

vmov.i16        d29, #65280  ; 0xff00
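
To check the NaN-safety argument of version (2), here is a scalar model of the same steps in portable C. It is only a sketch under the assumption of IEEE 754 doubles; the function names (narrowed_mask, equal_128, is_nan_pattern) are invented for this illustration, and the loop stands in for what the NEON instructions do in parallel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static double bits_to_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* True when the 64-bit pattern is a NaN if read as a double. */
static bool is_nan_pattern(uint64_t bits) {
    double d = bits_to_double(bits);
    return d != d;
}

/* Scalar model of vceq.i32 + vmovn.i32 + vshl #8 over four lanes. */
static uint64_t narrowed_mask(const uint32_t a[4], const uint32_t b[4]) {
    uint64_t d31 = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t eq32 = (a[lane] == b[lane]) ? 0xffffffffu : 0u; /* vceq.i32  */
        uint16_t eq16 = (uint16_t)eq32;                          /* vmovn.i32 */
        uint16_t shifted = (uint16_t)(eq16 << 8);                /* vshl #8   */
        d31 |= (uint64_t)shifted << (16 * lane);
    }
    return d31;
}

/* Model of the final vcmp.f64 d31, d29 against the preloaded constant. */
static bool equal_128(const uint32_t a[4], const uint32_t b[4]) {
    const uint64_t d29 = 0xff00ff00ff00ff00ULL; /* vmov.i16 d29, #0xff00 */
    return bits_to_double(narrowed_mask(a, b)) == bits_to_double(d29);
}
```

Every 16-bit lane of the mask is either 0xff00 or 0x0000, so the exponent field of the resulting double can never be all ones, and the comparison is never unordered.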


My question is: is there anything better than this? Am I overlooking some obvious way to do it?



  • Note: This was originally posted on 30th January 2012 at http://forums.arm.com

    Could you say more about what you're going to use the comparison for? Whichever way you do it, there will be a performance penalty for using NEON to influence control flow, even on Cortex-A9. And you probably don't gain much by performing the comparison in VFP and doing a vmrs instead of moving a compare mask to ARM registers and testing it there.
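
As a rough illustration of the "move a compare mask to ARM registers" alternative, here is a sketch using NEON intrinsics in C. The function name equal_128 and the byte-array interface are assumptions made for this example; the non-NEON branch exists only so the snippet also builds on other targets. No cycle counts are claimed for either path.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Returns true when the 16 bytes at a and b are bitwise identical. */
static bool equal_128(const uint8_t a[16], const uint8_t b[16]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t va = vreinterpretq_u32_u8(vld1q_u8(a));
    uint32x4_t vb = vreinterpretq_u32_u8(vld1q_u8(b));
    uint32x4_t eq = vceqq_u32(va, vb);   /* all ones in each equal lane */
    uint16x4_t narrow = vmovn_u32(eq);   /* four 16-bit masks in one D register */
    /* Transfer the mask from NEON to the integer side and test it there. */
    uint64_t mask = vget_lane_u64(vreinterpret_u64_u16(narrow), 0);
    return mask == UINT64_MAX;
#else
    /* Portable fallback so the sketch runs without NEON. */
    return memcmp(a, b, 16) == 0;
#endif
}
```

Because the test is an integer comparison, NaN bit patterns need no special handling and no shift is required.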

    If at all possible you should keep things in NEON and perform masking selects instead of control flow. If you can't do this then you should re-evaluate whether NEON makes much sense here.
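
A masking select of the kind suggested above might look like the following sketch, again with NEON intrinsics in C. It works per 32-bit lane rather than on the whole 128-bit value, and the function name select_if_equal and its array interface are hypothetical; the scalar branch is a fallback for targets without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* For each 32-bit lane: out = (a == b) ? x : y, without branching. */
static void select_if_equal(const uint32_t a[4], const uint32_t b[4],
                            const uint32_t x[4], const uint32_t y[4],
                            uint32_t out[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t mask = vceqq_u32(vld1q_u32(a), vld1q_u32(b));
    /* vbsl takes bits from x where the mask is set, from y elsewhere. */
    vst1q_u32(out, vbslq_u32(mask, vld1q_u32(x), vld1q_u32(y)));
#else
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t mask = (a[lane] == b[lane]) ? 0xffffffffu : 0u;
        out[lane] = (x[lane] & mask) | (y[lane] & ~mask);
    }
#endif
}
```

The result stays in NEON registers until it is stored, so nothing has to cross over to the ARM flags.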