
NEON conditional execution

Note: This was originally posted on 17th February 2012 at http://forums.arm.com

Hi, I would like to perform the following operation on all eight 16-bit lanes of a Q register simultaneously:

Q0 = Q0 + Q1
if (Q0 >= Q2) Q0 = Q3

I am not sure whether this is possible.
In normal (scalar) ARM code I know it is, since the MOV can be conditional, but I don't know about SIMD.
  • Note: This was originally posted on 17th February 2012 at http://forums.arm.com

    Thank you Marcus. So is using the VBIT instruction the common way to do "conditional" execution?
    I can see that all the comparison instructions give either all ones or all zeros in each lane. I was wondering what I could do with that, and VBIT seems to make sense.

    If I wanted to do:

    Q0 = Q0 + Q1
    if (Q0 >= Q2) Q0 = Q0 - Q3

    Since there are no conditional instructions in SIMD mode, is the idea to always compute Q5 = Q0 - Q3, then do the compare, and use VBIT to assign Q5 to Q0?
    That seems less efficient than true conditional execution, but I don't see any other way.

    Are there general guidelines or tricks for making efficient use of the way the VCxx instructions work?
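For readers following the conditional-subtract question above, here is a minimal sketch in portable C that models the lane behaviour, not real NEON code. It assumes unsigned 16-bit lanes with wrapping arithmetic; the function names are made up for illustration, and the scratch registers q4/q5 in the comments are only suggestions. The first variant is the "always compute, then select" idea from the post. The second is a different technique, not from the thread: it ANDs the mask into the subtrahend so that only one scratch register is needed.

```c
#include <stdint.h>

#define LANES 8  /* a Q register viewed as eight unsigned 16-bit lanes */

/* Variant 1 (select): compute the difference unconditionally, then merge it
   under the mask. Needs two scratch registers: the difference and the mask. */
static void add_cond_sub_select(uint16_t q0[LANES], const uint16_t q1[LANES],
                                const uint16_t q2[LANES], const uint16_t q3[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint16_t sum  = (uint16_t)(q0[i] + q1[i]);          /* VADD.I16 q0, q0, q1 */
        uint16_t diff = (uint16_t)(sum - q3[i]);            /* VSUB.I16 q5, q0, q3 */
        uint16_t mask = (sum >= q2[i]) ? 0xFFFF : 0x0000;   /* VCGE.U16 q4, q0, q2 */
        q0[i] = (uint16_t)((sum & ~mask) | (diff & mask));  /* VBIT     q0, q5, q4 */
    }
}

/* Variant 2 (masked operand): zero the subtrahend in lanes where the
   condition is false, then subtract everywhere. One scratch register. */
static void add_cond_sub_masked(uint16_t q0[LANES], const uint16_t q1[LANES],
                                const uint16_t q2[LANES], const uint16_t q3[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint16_t sum  = (uint16_t)(q0[i] + q1[i]);          /* VADD.I16 q0, q0, q1 */
        uint16_t mask = (sum >= q2[i]) ? 0xFFFF : 0x0000;   /* VCGE.U16 q4, q0, q2 */
        uint16_t sub  = (uint16_t)(q3[i] & mask);           /* VAND     q4, q4, q3 */
        q0[i] = (uint16_t)(sum - sub);                      /* VSUB.I16 q0, q0, q4 */
    }
}
```

The masked-operand form works whenever the "do nothing" case can be expressed as an identity operand (subtracting or adding 0, for example); the select form is the general fallback.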
  • Note: This was originally posted on 17th February 2012 at http://forums.arm.com

    Yes, I understand what you say regarding continuous instruction flow.

    > Why do you believe this to be less efficient?


    My example

    Q0 = Q0 + Q1
    if (Q0 >= Q2) Q0 = Q0 - Q3

    could be implemented without VBIT and without using two extra registers: one that stores the temporary result of Q0 - Q3, and one that stores the result of VCGE (which really carries just one bit of information per lane, by the way) for use with VBIT or VBIF.

    So, really just like in scalar (SISD) code:

    add r0,r0,r1
    cmp r0,r2
    subge r0, r0, r3

    But for that, each lane would have to maintain its own set of flags (the result of a generic CMP instruction) and then conditionally execute the instruction depending on that lane's flags. There are some DSPs that do this in SIMD, like the Analog Devices ADSP-213xx family. This is the only SIMD DSP I have used, so I first assumed NEON did the same.

    I wonder what the technological problem is in doing this in the chip, or at least in maintaining a single flag bit per lane (the result of a VCxx instruction) and preventing the instruction from executing (or affecting any registers) in that lane when its flag is 0?
    Looking at the NEON instruction encoding, the 4-bit condition field is not used yet, so maybe it is a future feature?

    Anyway, I can see the possibilities with VBIT, and with doing the "else" counterpart with VBIF. That is a good thing already; it makes things possible.
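As a rough illustration of the "then"/"else" pairing mentioned above, the following C sketch models the per-lane bit logic of VBIT and VBIF, plus VBSL, which is the related NEON instruction that picks between two sources in one step. The function names are made up for this example, and the code only models the bitwise behaviour on one 16-bit lane; it is not NEON code.

```c
#include <stdint.h>

/* VBIT d, n, m: insert bits of n into d where the mask m is 1 ("then" case) */
static uint16_t vbit_lane(uint16_t d, uint16_t n, uint16_t m) {
    return (uint16_t)((d & ~m) | (n & m));
}

/* VBIF d, n, m: insert bits of n into d where the mask m is 0 ("else" case) */
static uint16_t vbif_lane(uint16_t d, uint16_t n, uint16_t m) {
    return (uint16_t)((d & m) | (n & ~m));
}

/* VBSL d, n, m: d holds the mask on entry; the result takes bits of n
   where the mask is 1 and bits of m where it is 0 (a full if/else select) */
static uint16_t vbsl_lane(uint16_t d, uint16_t n, uint16_t m) {
    return (uint16_t)((n & d) | (m & ~d));
}
```

With a VCxx result as the mask, every lane is either all ones or all zeros, so each of these acts as a whole-lane select. Note that VBSL overwrites the mask register with the result, which can save a register when the mask is not needed afterwards.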
  • Note: This was originally posted on 21st February 2012 at http://forums.arm.com


    > Isn't that really exactly what the VBIT instruction does, except that it does so in a manner which is (1) generic and fairly flexible so you can use it for other things and (2) it doesn't need a load of extra special logic just for this specific use?
    >
    > The only downside of the current NEON approach is that you need one extra register to store the condition pattern, but this is rarely an issue in most algorithms.


    In most cases, using VBIT and VBIF will require at least two extra registers: one to store the condition pattern, and another for the result of the operation that you may want to assign to the destination register. Of course, this gets worse when several conditional instructions depend on a common condition result.
    VBIT/VBIF do the job, but they are not the most flexible solution.

    I am not saying NEON instructions need to be conditional with a 4-bit condition field (like many simple ARM instructions).
    I really like the approach of the VCxx instructions, which generate a one-bit flag; this has made it possible to add new compare instructions like VACxx. Very nice.
    However, it would be better if they did not require a whole temporary register to store that one-bit flag, and instead stored it next to the corresponding lane.
    After that, yes, we would need a 2-bit field per NEON instruction: 00 = execute if flag = 0, 01 = execute if flag = 1, 1X = always execute.

    This approach would be more efficient, but you are right that it requires extra logic in the chip, and I can understand why it has not been done (yet).
  • Note: This was originally posted on 23rd February 2012 at http://forums.arm.com


    > Is my sample too easy, or is it always possible to evaluate a conditional expression without any branch (using only conditional instructions)?

    You can generate all other boolean operations from AND and NOT (just write out a truth table to verify quickly).
    The only issue I can see is code readability when you have only AND and NOT.

    If you saw if (!(!A && !B)) in your code instead of your original if (A || B), it would not look very good.
    When I write asm code, I always put a comment on each line to describe what it does, so it's not really an issue...
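The truth-table check suggested above can be sketched in a few lines of C. The function names are made up for illustration. The second function applies the same identity to a 16-bit lane mask, which is what a chain of VMVN and VAND instructions would compute; in practice NEON also provides VORR directly, so this rewrite is rarely needed for OR itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* A || B written with only AND and NOT (De Morgan) */
static bool or_from_and_not(bool a, bool b) {
    return !(!a && !b);
}

/* The same identity on a lane mask (all ones or all zeros per lane) */
static uint16_t mask_or_from_and_not(uint16_t a, uint16_t b) {
    return (uint16_t)~(~a & ~b);
}

/* Walks the full truth table and returns the number of rows where the
   rewritten form disagrees with the plain || operator. */
static int truth_table_mismatches(void) {
    int mismatches = 0;
    for (int a = 0; a <= 1; a++) {
        for (int b = 0; b <= 1; b++) {
            if (or_from_and_not(a != 0, b != 0) != (a || b)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```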

    So yes, I think it is always possible to evaluate a boolean expression without any branch and generate the final one-bit result. From experience, most tests, and most operations depending on those tests, are fairly simple in DSP algorithms, and that is where I find NEON not very efficient with its VCxx and VBIT instructions. The worst thing is the use of whole registers, which are often very precious.

    All that said, I would rather have something that is not the best but does the job than nothing at all...
    So I'm happy we can at least do some form of conditional execution with NEON!
  • Note: This was originally posted on 17th February 2012 at http://forums.arm.com

    This untested sequence should do what you want. Depending on the greater context you may be able to reuse some registers.

    ; Q0 = Q0 + Q1
    VADD.U16 q0, q0, q1
    ; if (Q0 >= Q2) Q0 = Q3
    VCGE.U16 q4, q0, q2
    VBIT  q0, q3, q4

    --
    Marcus
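To make the lane-by-lane effect of the three instructions above easier to check, here is a small model in portable C. It is only a sketch of the semantics (eight unsigned 16-bit lanes, wrapping add), not NEON code; the helper names are made up, and q4 is the scratch register from the sequence above.

```c
#include <stdint.h>

#define LANES 8  /* one 128-bit Q register viewed as eight 16-bit lanes */

/* VADD.U16: lane-wise add, wrapping modulo 2^16 */
static void vadd_u16(uint16_t d[LANES], const uint16_t n[LANES], const uint16_t m[LANES]) {
    for (int i = 0; i < LANES; i++) d[i] = (uint16_t)(n[i] + m[i]);
}

/* VCGE.U16: each lane becomes 0xFFFF if n >= m, else 0x0000 */
static void vcge_u16(uint16_t d[LANES], const uint16_t n[LANES], const uint16_t m[LANES]) {
    for (int i = 0; i < LANES; i++) d[i] = (n[i] >= m[i]) ? 0xFFFF : 0x0000;
}

/* VBIT: copy each bit of n into d where the matching bit of mask is 1 */
static void vbit(uint16_t d[LANES], const uint16_t n[LANES], const uint16_t mask[LANES]) {
    for (int i = 0; i < LANES; i++) d[i] = (uint16_t)((d[i] & ~mask[i]) | (n[i] & mask[i]));
}

/* Q0 = Q0 + Q1; if (Q0 >= Q2) Q0 = Q3, applied to every lane */
static void add_then_select(uint16_t q0[LANES], const uint16_t q1[LANES],
                            const uint16_t q2[LANES], const uint16_t q3[LANES]) {
    uint16_t q4[LANES];    /* scratch register holding the condition mask */
    vadd_u16(q0, q0, q1);  /* VADD.U16 q0, q0, q1 */
    vcge_u16(q4, q0, q2);  /* VCGE.U16 q4, q0, q2 */
    vbit(q0, q3, q4);      /* VBIT     q0, q3, q4 */
}
```

For example, with q0 = {1..8}, q1 all ones and q2 all fives, the sums are {2..9}; the lanes holding 5 or more are replaced by the matching lanes of q3, and the first three lanes keep their sums.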
  • Note: This was originally posted on 17th February 2012 at http://forums.arm.com


    > It seems not as efficient as if it was a true conditional execution, but I don't see any other way.

    Why do you believe this to be less efficient? What would an imaginary conditional vector instruction look like? NEON works best with a continuous, linear instruction stream. Traditional conditional execution just wouldn't work. In today's processors, conditional execution is implemented quite differently from the good old ARM7TDMI. Many (if not all) conditional instructions execute the same way that they would normally execute. The condition code only determines whether the result is discarded or not.

    Kindly
    Marcus
  • Note: This was originally posted on 20th February 2012 at http://forums.arm.com

    > I wonder what is the technological problem in doing this in the chip

    Isn't that really exactly what the VBIT instruction does, except that it does so in a manner which is (1) generic and fairly flexible so you can use it for other things and (2) it doesn't need a load of extra special logic just for this specific use?

    The only downside of the current NEON approach is that you need one extra register to store the condition pattern, but this is rarely an issue in most algorithms.