
NEON: fast 128 bit comparison

Note: This was originally posted on 30th January 2012 at http://forums.arm.com

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating point comparison:






vcmp.f64        d0, d6
vmrs            APSR_nzcv, fpscr
vcmpeq.f64      d1, d7
vmrseq          APSR_nzcv, fpscr


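This version does not work if either 64-bit half happens to hold a NaN bit pattern: a NaN compares unequal to everything, including itself, so two identical registers would be reported as different.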
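To make the failure mode concrete, here is a small portable C sketch (not NEON code) that models what a floating-point compare does to raw 64-bit patterns. The helper names `as_double` and `fp_equal` are made up for this illustration. It also shows a second case that seems worth watching for: +0.0 and -0.0 have different bits but compare equal as floats.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a raw 64-bit pattern as a double, the way vcmp.f64
 * interprets the contents of a D register. */
static double as_double(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Models version (1) for one 64-bit half: floating-point equality
 * applied to arbitrary bit patterns. Returns 1 if "equal", else 0. */
static int fp_equal(uint64_t a, uint64_t b) {
    return as_double(a) == as_double(b);
}

/* Bit patterns used in the discussion. */
static const uint64_t QUIET_NAN_BITS = 0x7FF8000000000001ULL; /* a quiet NaN   */
static const uint64_t POS_ZERO_BITS  = 0x0000000000000000ULL; /* +0.0          */
static const uint64_t NEG_ZERO_BITS  = 0x8000000000000000ULL; /* -0.0          */
```

With these definitions, `fp_equal(QUIET_NAN_BITS, QUIET_NAN_BITS)` returns 0 although the bits are identical, and `fp_equal(POS_ZERO_BITS, NEG_ZERO_BITS)` returns 1 although the bits differ.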

(2) Using the NEON narrowing and the VFP comparison (this time only once and in a NaN-safe manner):



vceq.i32        q15, q0, q3
vmovn.i32       d31, q15
vshl.s16        d31, d31, #8
vcmp.f64        d31, d29
vmrs            APSR_nzcv, fpscr




The D29 register is preloaded beforehand with the matching 16-bit pattern:

vmov.i16        d29, #65280  ; 0xff00
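For anyone following along, this is a scalar C model of version (2), written only to show why the final floating-point compare is NaN-safe. It is a sketch, not NEON code: the function names `narrowed_mask`, `is_nan_pattern` and `q_equal` are invented for the illustration, and each Q register is represented as four 32-bit lanes.

```c
#include <stdint.h>
#include <string.h>

/* Build the 64-bit value that ends up in d31 after
 * vceq.i32 / vmovn.i32 / vshl.s16 #8, one lane at a time. */
static uint64_t narrowed_mask(const uint32_t a[4], const uint32_t b[4]) {
    uint64_t d31 = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t eq      = (a[lane] == b[lane]) ? 0xFFFFFFFFu : 0u; /* vceq.i32    */
        uint16_t narrow  = (uint16_t)eq;                            /* vmovn.i32   */
        uint16_t shifted = (uint16_t)(narrow << 8);                 /* vshl.s16 #8 */
        d31 |= (uint64_t)shifted << (16 * lane);
    }
    return d31;
}

/* A double is NaN when the exponent field is all ones and the
 * fraction is non-zero. */
static int is_nan_pattern(uint64_t bits) {
    return ((bits >> 52) & 0x7FFu) == 0x7FFu &&
           (bits & 0x000FFFFFFFFFFFFFULL) != 0;
}

/* Models vcmp.f64 d31, d29 with d29 = 0xFF00 in every 16-bit lane. */
static int q_equal(const uint32_t a[4], const uint32_t b[4]) {
    const uint64_t d29 = 0xFF00FF00FF00FF00ULL;
    uint64_t d31 = narrowed_mask(a, b);
    double x, y;
    memcpy(&x, &d31, sizeof x);
    memcpy(&y, &d29, sizeof y);
    return x == y;
}
```

Every 16-bit lane of d31 is either 0xFF00 or 0x0000, so the top 16 bits give an exponent field of 0x7F0 or 0x000, never 0x7FF. That is why the shift by 8 is there: without it an all-equal result would be 0xFFFF in the top lane, which is a NaN pattern.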


My question is: is there anything better than this? Am I overlooking some obvious way to do it?



  • Note: This was originally posted on 31st January 2012 at http://forums.arm.com

    If you are interested in performance,
    you can forget instructions like


    vmrs

    or any instruction that transfers data from a NEON register to an ARM register.

    In fact, you have two solutions.
    - If you have a single comparison to do, you should consider doing it on the ARM side and not with NEON.
    - If you have a lot of comparisons to do (in a loop, for example), then use a memory buffer to store the comparison results. In this case, you do 8 or 16 (or more) 128-bit comparisons and then do the ARM part on the stored results. This is the best way to keep NEON and the ARM working together.
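    A rough sketch of the second option in C, under my own assumptions: the function names `compare_blocks` and `collect_results` and the buffer layout (one 128-bit mask per pair) are hypothetical, and the scalar fallback is only there so the code also builds where NEON is not available. How the compiler actually schedules this is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Stage 1 (NEON side): compare n pairs of 128-bit blocks and store one
 * 128-bit mask per pair in `masks`. Nothing is moved to a core register. */
static void compare_blocks(const uint32_t *a, const uint32_t *b,
                           uint32_t *masks, size_t n) {
    for (size_t i = 0; i < n; ++i) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint32x4_t eq = vceqq_u32(vld1q_u32(a + 4 * i), vld1q_u32(b + 4 * i));
        vst1q_u32(masks + 4 * i, eq);
#else
        /* Scalar fallback producing the same per-lane masks. */
        for (int lane = 0; lane < 4; ++lane)
            masks[4 * i + lane] =
                (a[4 * i + lane] == b[4 * i + lane]) ? 0xFFFFFFFFu : 0u;
#endif
    }
}

/* Stage 2 (ARM side): read the stored masks back from memory and reduce
 * each one to a single flag: 1 if all four lanes matched, else 0. */
static void collect_results(const uint32_t *masks, uint8_t *equal, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint32_t all = masks[4 * i] & masks[4 * i + 1] &
                       masks[4 * i + 2] & masks[4 * i + 3];
        equal[i] = (all == 0xFFFFFFFFu);
    }
}
```

    The idea is to run `compare_blocks` over the whole batch first and only then call `collect_results`, so the ARM side works on results that are already in memory.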

    When you use VMRS, the ARM has to wait for NEON to finish its computation.
    Because NEON has an instruction queue, it can take 10, 20 or more cycles just for NEON to reach your comparison instruction.
    After that you have to wait extra cycles for the data to be transferred from the NEON register to the ARM register.

    So the rule is:
    Never try to transfer a NEON register to an ARM register.


    PS: this is off-topic, but the reverse operation is fine. Transferring an ARM register to a NEON register is very fast because the content of the ARM register is copied into the instruction queue.