
NEON: fast 128 bit comparison

Note: This was originally posted on 30th January 2012 at http://forums.arm.com

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating-point comparison:

vcmp.f64        d0, d6
vmrs            APSR_nzcv, fpscr
vcmpeq.f64      d1, d7
vmrseq          APSR_nzcv, fpscr

If any of the 64-bit values happen to be NaN when interpreted as doubles, this version will not work.

(2) Using NEON narrowing and the VFP comparison (this time only once, and in a NaN-safe manner):

vceq.i32        q15, q0, q3
vmovn.i32       d31, q15
vshl.s16        d31, d31, #8
vcmp.f64        d31, d29
vmrs            APSR_nzcv, fpscr

The D29 register is preloaded in advance with the right 16-bit pattern:

vmov.i16        d29, #65280  ; 0xff00


My question is: is there anything better than this? Am I overlooking some obvious way to do it?



  • Note: This was originally posted on 30th January 2012 at http://forums.arm.com

    Could you describe in more detail what you're going to be using the comparison for? Either way you do it, there will be a performance penalty involved in using NEON to influence control flow, even on Cortex-A9. And you probably don't gain much by performing the comparison in VFP and reading the flags with vmrs instead of moving a compare mask to ARM registers and testing it there.

    If at all possible you should keep things in NEON and perform masking selects instead of control flow. If you can't do this, then you should re-evaluate whether NEON makes much sense here.
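
    The masking-select idea can be sketched in plain C (a portable illustration, not actual NEON code; with intrinsics the same pattern would map onto vceqq_u32 and vbslq_u32):

```c
#include <stdint.h>

/* Branchless select: for each 32-bit lane, keep a[i] where it equals
 * key[i], otherwise take b[i]. This mirrors what vceq + vbsl do in
 * NEON: no data ever leaves the SIMD unit to drive control flow. */
static void masked_select(uint32_t out[4], const uint32_t a[4],
                          const uint32_t b[4], const uint32_t key[4])
{
    for (int i = 0; i < 4; i++) {
        uint32_t mask = (a[i] == key[i]) ? 0xFFFFFFFFu : 0u; /* vceq */
        out[i] = (a[i] & mask) | (b[i] & ~mask);             /* vbsl */
    }
}
```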
  • Note: This was originally posted on 31st January 2012 at http://forums.arm.com

    If you are interested in performance, you can forget instructions like

    vmrs

    or any other instruction that transfers data from a NEON register to an ARM register.

    In fact, you have two options:
    - If you only have a simple comparison to do, consider doing it on the ARM side rather than with NEON.
    - If you have many comparisons to do (in a loop, for example), use a memory buffer to store the comparison results. In that case, perform 8 or 16 (or more) 128-bit comparisons, then process the stored results on the ARM side. This is the best way to let NEON and the ARM core work together.

    When you use VMRS, the ARM core has to wait for NEON to finish its computation.
    Because NEON has an instruction queue, it can take 10, 20 or more cycles just for NEON to reach your comparison instruction.
    After that, you have to wait extra cycles to transfer the data from the NEON register to the ARM register.

    So the rule is:
    never transfer a NEON register to an ARM register.

    PS: this is off-topic, but the reverse operation is fine. Transferring an ARM register to a NEON register is very fast, because the content of the ARM register is copied into the instruction queue.
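
    The buffered-results pattern above might look like this in plain C (a sketch assuming 16-byte blocks; memcmp stands in for the NEON vceq sequence, and the results buffer takes the place of the vmrs transfer):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Batch phase ("NEON side"): compare n 16-byte blocks against a key
 * and store one flag per block in a plain memory buffer. */
static void batch_compare(const uint8_t *blocks, size_t n,
                          const uint8_t key[16], uint8_t results[])
{
    for (size_t i = 0; i < n; i++)
        results[i] = (memcmp(blocks + 16 * i, key, 16) == 0);
}

/* Scan phase ("ARM side"): walk the stored flags and report the first
 * mismatch, without ever reading a SIMD register directly. */
static long first_mismatch(const uint8_t results[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!results[i])
            return (long)i;
    return -1;
}
```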
  • Note: This was originally posted on 31st January 2012 at http://forums.arm.com

    The scenario is the following: I have a 128-bit constant K and I want to compare it against the 128-bit "variables" A1, A2, ..., An in the following manner (with the loop unrolled):


    for i = 1, n do
      if A[i] != K then
        break;
    done


    So... should I lock down two cache lines and use them to communicate between VFP/NEON and ARM?

    Do the same performance penalties (regarding the pipelines) apply when using 64-bit-at-a-time comparisons (VFP only)?
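
    For comparison, the scalar 64-bit-at-a-time version of that loop can be sketched in C (an illustration on the integer side only; u128 is a hypothetical two-word struct, and the data is assumed to be 8-byte aligned):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t lo, hi; } u128; /* hypothetical 128-bit pair */

/* Return the index of the first A[i] != K, or -1 if all match.
 * Two 64-bit compares per element, done entirely on the integer
 * core, so no NEON-to-ARM transfer is needed at all. */
static long find_first_diff(const u128 *a, size_t n, u128 k)
{
    for (size_t i = 0; i < n; i++)
        if (a[i].lo != k.lo || a[i].hi != k.hi)
            return (long)i;
    return -1;
}
```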
  • Note: This was originally posted on 7th February 2012 at http://forums.arm.com

    Thanks a lot for the detailed solution. I do, however, have a number of questions:

    • Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?
    • Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop"?
    • Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?
    • How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?
    • Why is the transfer from VFP registers to ARM registers (more precisely the VMRS instruction) so frowned upon? The "Cortex A9 MPE TRM" (section 3.4.10) states that a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles. How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?
    Thanks!
  • Note: This was originally posted on 8th February 2012 at http://forums.arm.com


    Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?

    On modern processors, branch prediction is good enough that unrolling the loop is usually not worthwhile.
    You may win a few cycles, but this is not the most interesting optimisation.


    Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop"?

    The idea is to fill the NEON queue; there is no strict reason to use exactly 4 iterations.
    With fewer than 4 you risk two problems: not enough instructions in the NEON queue, and a possible interaction between NEON memory writes and ARM memory reads.
    With more than 4, the algorithm needs a larger n (iteration count) to be efficient.


    Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?

    No! This is because NEON can't write a single 32-bit value: the smallest NEON store is 64 bits (2 × 32).
    So I do two comparisons per iteration to get two 32-bit results to write.
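
    The 64-bit store constraint can be illustrated in C (a sketch: two 32-bit comparison results packed into a single 64-bit write, the way a NEON d-register store writes both lanes at once):

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit all-ones/all-zeros comparison results into one
 * 64-bit value and write it with a single store, mirroring the
 * smallest NEON store (one d register = 2 x 32 bits). */
static void store_pair(uint64_t *dst, uint32_t r0, uint32_t r1)
{
    uint64_t packed = ((uint64_t)r1 << 32) | r0; /* lane1:lane0 */
    memcpy(dst, &packed, sizeof packed);         /* one 64-bit store */
}
```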


    How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?

    Using a memory buffer to transfer data from NEON to ARM lets the two units work together without any dependency problem, as long as they are not working on the same data.
    That's all: it simply avoids the data transfer path.
    I ran some tests on this on a Cortex-A8 a few months ago: http://pulsar.websha...n-arm-and-neon/


    Why is the transfer from VFP registers to ARM registers (more precisely the VMRS instruction) so frowned upon?
    The "Cortex A9 MPE TRM" (section 3.4.10) states that a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles.
    How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?

    The main problem is the dependency chain between:
    - the test,
    - the VMRS,
    - the conditional instruction.
    We are talking about a pipelined processor, and these three steps are fully dependent.
    The quoted cycle counts assume fully pipelined execution, so in real life you will not get the move from the VFP to the ARM unit in 3 cycles.

    The best you can do is run some benchmarks :)
    After all, it's not impossible that the VFP version is the fastest one!