
NEON: fast 128-bit comparison

Note: This was originally posted on 30th January 2012 at http://forums.arm.com

I'm interested in finding the fastest way (lowest cycle count) of comparing the values stored in two NEON registers (say Q0 and Q3) on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating-point comparison:

vcmp.f64        d0, d6              ; compare low halves:  D0 (Q0 low) vs D6 (Q3 low)
vmrs            APSR_nzcv, fpscr    ; copy the FP status flags to the ARM flags
vcmpeq.f64      d1, d7              ; if equal so far, compare high halves: D1 vs D7
vmrseq          APSR_nzcv, fpscr    ; and update the ARM flags again

If the 64-bit values happen to form NaN bit patterns when interpreted as doubles, this version does not work (NaN never compares equal to anything, not even to itself).
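
To see why, here is a minimal C sketch of the failure mode (my own illustration, not part of the original post): two bit-identical 64-bit patterns that happen to encode NaN compare as unequal when interpreted as doubles, which is exactly what VCMP.F64 reports.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t bits = 0x7FF8000000000001ULL;   /* an arbitrary quiet-NaN bit pattern */
    double a, b;
    memcpy(&a, &bits, sizeof a);
    memcpy(&b, &bits, sizeof b);

    printf("bitwise equal: %d\n", memcmp(&a, &b, sizeof a) == 0);   /* prints 1 */
    printf("fp equal:      %d\n", a == b);                          /* prints 0: NaN != NaN */
    return 0;
}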

(2) Using NEON narrowing and the VFP comparison (this time only once, and in a NaN-safe manner):

vceq.i32        q15, q0, q3         ; each 32-bit lane -> all-ones if equal, else zero
vmovn.i32       d31, q15            ; narrow to 16-bit lanes: 0xFFFF or 0x0000
vshl.s16        d31, d31, #8        ; shift to 0xFF00 / 0x0000 (the result is never a NaN bit pattern)
vcmp.f64        d31, d29            ; compare against the expected pattern as a double
vmrs            APSR_nzcv, fpscr    ; copy the FP status flags to the ARM flags

The D29 register is preloaded beforehand with the matching 16-bit pattern:

vmov.i16        d29, #65280  ; 0xff00
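
For reference, here is a rough C sketch using NEON intrinsics (my own rendering of the same compare-and-narrow idea, not part of the original post; the function name is made up), with the final check done in the integer core instead of via VCMP/VMRS:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch only: test two 128-bit values held in Q registers for bitwise equality. */
static inline int q_equal(uint32x4_t a, uint32x4_t b)
{
    uint32x4_t eq  = vceqq_u32(a, b);             /* each 32-bit lane: all-ones if equal */
    uint16x4_t nrw = vmovn_u32(eq);               /* narrow the lanes to 0xFFFF / 0x0000 */
    uint64x1_t all = vreinterpret_u64_u16(nrw);   /* view the D register as one 64-bit value */
    return vget_lane_u64(all, 0) == 0xFFFFFFFFFFFFFFFFULL;
}

Note that the final vget_lane_u64 is a NEON-to-ARM register transfer, which is precisely the kind of move the reply below argues against; only a benchmark can tell whether it beats the VCMP/VMRS variants on a given core.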


My question is: is there anything better than this? Am I overlooking some obvious way to do it?



  • Note: This was originally posted on 8th February 2012 at http://forums.arm.com


    Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?

    On modern processors, branch prediction is good enough that unrolling the loop is usually not worthwhile.
    You may gain a few cycles, but this is not the most interesting optimisation.


    Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop" ?

    The idea is to fill the NEON instruction queue; there is no particular reason to do exactly 4 iterations.
    With fewer than 4 you can hit two problems: not enough instructions in the NEON queue, and a possible interaction between the NEON memory writes and the ARM memory reads.
    With more than 4, the algorithm needs a larger n (number of iterations) before it becomes efficient.


    Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?

    No! It is because NEON cannot write a single 32-bit value: the smallest NEON write is 64 bits (2 × 32).
    So I do two comparisons per iteration in order to have two 32-bit results to write, as sketched below.
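
    A hedged C-intrinsics sketch of that per-iteration step (the function name and the "all-ones means equal" convention are my assumptions, not the author's code): two 128-bit comparisons are each folded down to one 32-bit flag, and the two flags leave the NEON unit as a single 64-bit store.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Sketch only: compare two pairs of 128-bit blocks and write both flags with one store. */
    static inline void write_pair_of_flags(uint32x4_t a0, uint32x4_t b0,
                                           uint32x4_t a1, uint32x4_t b1,
                                           uint32_t flags[2])
    {
        uint32x4_t eq0 = vceqq_u32(a0, b0);
        uint32x4_t eq1 = vceqq_u32(a1, b1);

        /* Fold each 128-bit result to one 32-bit flag (all-ones iff every lane matched). */
        uint32x2_t f0 = vand_u32(vget_low_u32(eq0), vget_high_u32(eq0));
        uint32x2_t f1 = vand_u32(vget_low_u32(eq1), vget_high_u32(eq1));
        f0 = vand_u32(f0, vrev64_u32(f0));   /* both lanes of f0 now hold flag 0 */
        f1 = vand_u32(f1, vrev64_u32(f1));   /* both lanes of f1 now hold flag 1 */

        /* Pack {flag0, flag1} into one D register and issue a single 64-bit store. */
        vst1_u32(flags, vext_u32(f0, f1, 1));
    }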


    How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?

    Using a memory buffer to transfer data from NEON to ARM lets the two units work in parallel without any dependency problem, as long as they are not working on the same data.
    That's all. It simply avoids using the register-transfer path between the two units.
    I ran some tests about this on the Cortex-A8 a few months ago: http://pulsar.websha...n-arm-and-neon/
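
    For illustration, here is a heavily hedged C sketch of that decoupling. It is my reconstruction of the general pattern, not the author's actual code; SKEW, the buffer layout and the function name are all assumptions. The NEON side writes each flag to a small memory buffer, and the ARM side only inspects flags produced a few iterations earlier, so the two units rarely wait on each other.

    #include <arm_neon.h>
    #include <stdint.h>

    #define SKEW 4   /* how far the NEON producer runs ahead of the ARM consumer (assumption) */

    /* Sketch only: index of the first mismatching 128-bit block, or -1 if all n blocks match. */
    static int first_mismatch(const uint32_t *a, const uint32_t *b, int n)
    {
        uint32_t flagbuf[SKEW + 1][2];

        for (int i = 0; i < n; i++) {
            /* NEON producer: compare one 128-bit block and store a 64-bit flag pair. */
            uint32x4_t eq = vceqq_u32(vld1q_u32(a + 4 * i), vld1q_u32(b + 4 * i));
            uint32x2_t f  = vand_u32(vget_low_u32(eq), vget_high_u32(eq));
            vst1_u32(flagbuf[i % (SKEW + 1)], f);

            /* ARM consumer: check the flag written SKEW iterations ago using plain
               integer loads (no VMRS, no NEON-to-ARM register move). */
            if (i >= SKEW) {
                const uint32_t *old = flagbuf[(i - SKEW) % (SKEW + 1)];
                if ((old[0] & old[1]) != 0xFFFFFFFFu)
                    return i - SKEW;
            }
        }

        /* Drain: check the flags of the last SKEW (or fewer) blocks. */
        for (int i = (n > SKEW ? n - SKEW : 0); i < n; i++) {
            const uint32_t *old = flagbuf[i % (SKEW + 1)];
            if ((old[0] & old[1]) != 0xFFFFFFFFu)
                return i;
        }
        return -1;
    }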


    Why is the transfer from VFP registers to ARM registers (more precisely the VMRS instruction) so frowned upon?
    The "Cortex A9 MPE TRM" (section 3.4.10) states the a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles.
    How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?

    The main problem is the dependency chain between:
    - the test
    - the VMRS
    - the conditional instruction.
    We are talking about a pipelined processor, and these three steps are fully dependent on each other.
    The quoted cycle counts are for fully pipelined instructions.
    So in real life you will not get the move from the VFP unit to the ARM core done in just 3 cycles.

    The best you can do is run some benchmarks :)
    After all, it is not impossible that the VFP version is the fastest one!