
Division with NEON

Note: This was originally posted on 30th September 2011 at http://forums.arm.com

Hi.

I have 4 unsigned 16-bit values in a Dn register (or 8 in a Qn register)

[v1] [v2] [v3] [v4]

I'm looking for code that ends up with

[65536 / v1] [65536 / v2] [65536 / v3] [65536 / v4]

in another (or the same) Dn (or Qn) register.
Thanks

Etienne
  • Note: This was originally posted on 30th September 2011 at http://forums.arm.com


    One option would be to do what a compiler normally does for ARM: use fixed-point VRECPE to calculate 1/v1, etc., then multiply rather than divide, and then renormalize.


    Yes.
    That's exactly what I'm looking for.

    I'd like to know how to use
    VRECPE.U32

    I don't understand what the estimate of 1 / 1234 is when I'm using the U32 data type!

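    I found this code:


    vrecpe.f32  d1, d5        @ d1 = initial estimate of 1/d5
    vrecps.f32  d2, d1, d5    @ d2 = 2 - d1 * d5
    vmul.f32    d1, d1, d2    @ d1 = estimate after the first Newton-Raphson step
    vrecps.f32  d2, d1, d5    @ d2 = 2 - d1 * d5
    vmul.f32    d5, d1, d2    @ d5 = estimate after the second step


    and it works correctly with float operations.
    I'm looking for the same code using unsigned integers!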
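
For readers who want to see what that float sequence computes, here is a minimal scalar sketch in C of the same two Newton-Raphson steps for one lane. It assumes that vrecps.f32 returns 2 - a * b. The helper `crude_estimate` is a hypothetical stand-in that keeps about 8 mantissa bits; it is not the real vrecpe.f32 lookup.

```c
#include <stdint.h>
#include <string.h>

/* Models vrecps.f32: the Newton-Raphson correction factor 2 - a * b. */
static float recps_model(float a, float b)
{
    return 2.0f - a * b;
}

/* Hypothetical stand-in for vrecpe.f32: the true reciprocal with all but
   the top 8 mantissa bits cleared, so it is only good to about 8 bits. */
static float crude_estimate(float x)
{
    float r = 1.0f / x;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;          /* keep sign, exponent, 8 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* Same data flow as the five-instruction sequence, for one lane. */
float refined_reciprocal(float x)
{
    float y = crude_estimate(x);  /* vrecpe.f32            */
    y = y * recps_model(y, x);    /* vrecps.f32 + vmul.f32 */
    y = y * recps_model(y, x);    /* vrecps.f32 + vmul.f32 */
    return y;
}
```

Each step roughly squares the relative error, which is why two steps are enough to take an 8-bit estimate to about single-precision accuracy.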
  • Note: This was originally posted on 4th October 2011 at http://forums.arm.com

    Thank you for this explanation!

    I've checked the precision of the divide approximation. You're right, it is close to 8 bits!
    That's enough for me, so finally the code I used is this:



    vcvt.f32.u32  q0, q0
    vrecpe.f32    q0, q0
    vmul.f32      q0, q0, q1   @ q1 = 65536.0 in every lane
    vcvt.u32.f32  q0, q0


    The precision is enough for my colour processing!
    The speed is quite good!
  • Note: This was originally posted on 5th October 2011 at http://forums.arm.com


    Glad that's working out for you. Out of curiosity, does this work?


            vcvt.f32.u32            q0, q0
            vrecpe.f32        q0, q0
            vcvt.u32.f32            q0, q0, #16



    I'll check this evening, but it should work!
    Thanks for this useful optimisation!
  • Note: This was originally posted on 3rd October 2011 at http://forums.arm.com

    vrecpe.u32 takes normalized inputs, similar to how a floating-point significand is usually stored. What that means is that the input has no leading zeroes, so the top bit is always 1. If the top bit is 0, the ARM Architecture Reference Manual specifies that the result is simply 0xFFFFFFFF.

    Another way to look at it is that vrecpe.u32 works on values from 0.5 (inclusive) to 1.0 (non-inclusive), where the input format is 0.0.32. That means no sign bits, no whole bits, and 32 fraction bits.

    The reason for this format is to limit the possible range of the calculated reciprocal, which you'll notice must be between 1.0 and 2.0. The result therefore comes back in 0.1.31 format, with one whole bit (always 1) and 31 fraction bits. If you didn't perform this range limiting you wouldn't be able to define very useful data representations for integer reciprocals, since the reciprocal of any whole number greater than 1 is a fraction.

    What normalization does is convert an input x to the format:

    x_normalized = x * 2^shift
    x = x_normalized * 2^-shift

    where the multiplication can be performed by a bit shift. Note that for the reciprocal:

    x_reciprocal = 1 / x = 1 / (x_normalized * 2^-shift) = (1 / x_normalized) * 2^shift

    This means that you end up performing a left shift at the end to undo the normalization. It is a left shift rather than a right shift because taking the reciprocal changes the sign of the power.

    Then for the actual division:

    a = y / x
    a = y * (1 / x)
    a = y * ((1 / x_normalized) * 2^shift)
    a = (y * (1 / x_normalized)) * 2^shift

    You can find the normalization shift with a count-leading-zeroes instruction. In your case you'll want to use vclz.i16 on the 16-bit values, or vclz.i32 once they have been widened to 32-bit lanes for vrecpe.u32. The shift that puts the leading 1 in the top bit is simply clz(x).
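
The normalize, estimate, and shift-back steps can be sketched for one lane in portable C. This is a scalar model under stated assumptions, not NEON code: `recpe_u32_model` follows my reading of the UnsignedRecipEstimate pseudocode in the ARM Architecture Reference Manual (input with bit 31 set, 0.32 format; output in 1.31 format), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeroes of a nonzero 32-bit value (vclz.i32 per lane). */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Assumed model of vrecpe.u32 for one lane:
   input is 0.32 fixed point in [0.5, 1.0), output is 1.31 fixed point. */
uint32_t recpe_u32_model(uint32_t operand)
{
    if ((operand & 0x80000000u) == 0)
        return 0xFFFFFFFFu;                  /* input was not normalized */
    uint32_t q = operand >> 23;              /* top 9 bits, 256..511 */
    /* round(256 * 512 / (q + 0.5)), computed in integers */
    uint32_t s = (524288u + 2u * q + 1u) / (4u * q + 2u);
    return s << 23;                          /* 256..511 placed in 1.31 */
}

/* Approximate 65536 / x for a nonzero 16-bit x, estimate only (~8 bits). */
uint32_t div65536_estimate(uint16_t x)
{
    assert(x != 0);
    unsigned shift = clz32(x);               /* 16..31 for a 16-bit value */
    uint32_t xn = (uint32_t)x << shift;      /* x_normalized, top bit set */
    uint32_t r = recpe_u32_model(xn);        /* 1 / x_normalized in 1.31 */
    /* 65536 / x = 2^16 * (r / 2^31) * 2^(shift - 32) = r >> (47 - shift) */
    return r >> (47u - shift);
}
```

With this model, `div65536_estimate(1234)` gives 53, which matches the exact quotient, while `div65536_estimate(3)` gives 21824 instead of 21845. That is the roughly 8-bit accuracy showing through.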

    However, you will not always get the correct answer using vrecpe.u32, because it's only correct to ~8 bits. In order to improve the result to get correct 16 bit values you need to use Newton-Raphson iteration. That is, for y = 1 / x,

    y_refined = y * (2 - (x * y))

    This is kind of a pain to do in integer on NEON because there's no vrecps equivalent instruction and since this is a fixed point multiplication you need the long answer, only to throw out the bottom bits. Honestly you're probably better off just converting to floating point and back. You don't even have to do the final multiplication, you can use vcvt to convert between floating point and fixed point and do the multiplication (left shift by 16) for free. Of course, you can do something similar if you stick with integer.
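
To show why the integer version is awkward, here is a hedged scalar sketch in C of one Newton-Raphson step in fixed point. It assumes the divisor is normalized to 0.32 format (top bit set) and the estimate is in 1.31 format; `refine_recip_q31` is a hypothetical name, and the 64-bit products stand in for NEON long multiplies.

```c
#include <stdint.h>

/* One Newton-Raphson step y * (2 - x * y) in fixed point.
   xn: normalized divisor, 0.32 format, value in [0.5, 1.0)
   y : reciprocal estimate, 1.31 format, value in [1.0, 2.0) */
uint32_t refine_recip_q31(uint32_t xn, uint32_t y)
{
    /* Long multiply, then throw away the bottom 32 bits: x * y in 1.31. */
    uint64_t p = ((uint64_t)xn * y) >> 32;
    /* 2.0 in the same format is 2^32, which needs a 33rd bit. */
    uint64_t t = ((uint64_t)1 << 32) - p;
    /* Second long multiply, again keeping only the top part. */
    uint64_t r = ((uint64_t)y * t) >> 31;
    /* Truncation can push r to exactly 2.0; saturate to stay in 32 bits. */
    return r > 0xFFFFFFFFu ? 0xFFFFFFFFu : (uint32_t)r;
}
```

For a divisor of 3 (normalized to 0xC0000000) and a starting estimate of 341 << 23, one step gives 2863308800 and a second gives 0xAAAAAAAA, which is 4/3 in 1.31 format to the last bit. Because the intermediate products are truncated, the refined value can land just under the true reciprocal, so an exact quotient can still come out one too low; a final correction would be needed if that matters.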
  • Note: This was originally posted on 4th October 2011 at http://forums.arm.com

    Glad that's working out for you. Out of curiosity, does this work?


            vcvt.f32.u32            q0, q0
            vrecpe.f32        q0, q0
            vcvt.u32.f32            q0, q0, #16
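
As a rough illustration of why the #16 form can replace the separate multiply by 65536, here is a scalar C sketch of the conversion for one lane. It assumes that vcvt.u32.f32 with 16 fraction bits scales by 2^16, rounds toward zero, and saturates; the function names are hypothetical, and an exact float reciprocal stands in for the vrecpe.f32 estimate.

```c
#include <stdint.h>

/* Assumed per-lane model of vcvt.u32.f32 with #16 fraction bits:
   scale by 2^16, round toward zero, saturate to the u32 range. */
uint32_t cvt_u32_f32_fbits16(float f)
{
    float scaled = f * 65536.0f;          /* power-of-two scaling */
    if (!(scaled > 0.0f))                 /* negatives, zero and NaN -> 0 */
        return 0u;
    if (scaled >= 4294967296.0f)          /* includes +infinity */
        return 0xFFFFFFFFu;
    return (uint32_t)scaled;              /* C cast truncates toward zero */
}

/* 65536 / x for one nonzero lane, using an exact float reciprocal as a
   stand-in for the vrecpe.f32 estimate. */
uint32_t div65536_via_float(uint32_t x)
{
    float r = 1.0f / (float)x;
    return cvt_u32_f32_fbits16(r);
}
```

Scaling by a power of two only changes the exponent, so folding it into the conversion loses nothing compared with the explicit vmul.f32.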
  • Note: This was originally posted on 26th September 2012 at http://forums.arm.com

    When dividing signed numbers (by an unsigned value) one way to do it is to multiply the numbers by -1 before and after the division if necessary, so the division still takes place as unsigned. This can be accomplished with the following code:


    // 8x16-bit signed inputs are in q0
    // Elements in q1 are 0xFFFF for negative values, 0x0000 for positive (or zero) values
    vclt.s16 q1, q0, #0
    // Make negative values positive
    vabs.s16 q0, q0

    // ... Division performed here, results in q0 ...

    // Negate values that were negative. This is done by observing that neg(x) = not(x) + 1.
    // For values that were negative the field in q1 was 0xFFFF, therefore we get ((x ^ 0xFFFF) - 0xFFFF) which is not(x) + 1.
    // For values that were positive the field in q1 was 0x0000, therefore we get (x ^ 0x0000) - 0x0000 which is just x.
    // If you can, put some other operation between these two instructions to avoid a stall.
    veor.s16 q0, q0, q1
    vsub.s16 q0, q0, q1


    Note that this will round the negative result towards zero if the positive result was rounded towards zero. This is how most CPU integer divide instructions work, but if you want one that rounds negative values towards negative infinity you'll have to do this differently.
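
The same sign handling can be written as a scalar C sketch for one lane, which makes the round-toward-zero behaviour easy to check. The function name is hypothetical, and a plain C division stands in for whichever unsigned NEON division you use.

```c
#include <assert.h>
#include <stdint.h>

/* Divide a signed 16-bit value by an unsigned one, doing the actual
   division on unsigned magnitudes as in the NEON sequence above. */
int16_t div_s16_by_u16(int16_t num, uint16_t den)
{
    assert(den != 0);
    uint16_t mask = (num < 0) ? 0xFFFFu : 0x0000u;              /* vclt.s16 #0 */
    uint16_t mag  = (uint16_t)(((uint16_t)num ^ mask) - mask);  /* vabs.s16 */
    uint16_t q    = (uint16_t)(mag / den);                      /* unsigned divide */
    uint16_t r    = (uint16_t)((q ^ mask) - mask);              /* veor + vsub */
    /* Reinterpret the 16 bits as signed without relying on
       implementation-defined narrowing. */
    return (r & 0x8000u) ? (int16_t)((int32_t)r - 65536) : (int16_t)r;
}
```

For example, `div_s16_by_u16(-7, 2)` gives -3, not -4. The input -32768 also works, because its 16-bit pattern 0x8000 read as unsigned is already the correct magnitude.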

    For the actual division please refer to my post from October 3. Note that webshaker didn't need fully accurate results, so he was able to just use the reciprocal approximation instruction by itself. But if you want an accurate result, that won't work. He also must have been starting with 4x32-bit unsigned values, so he probably did the conversion from 16-bit to 32-bit earlier in code he didn't show (with vmovl.u16 or something similar).
  • Note: This was originally posted on 30th September 2011 at http://forums.arm.com

    One option would be to do what a compiler normally does for ARM: use fixed-point VRECPE to calculate 1/v1, etc., then multiply rather than divide, and then renormalize.
  • Note: This was originally posted on 26th September 2012 at http://forums.arm.com

    Hi,
    I ended up on this discussion while looking for a way to do integer division in ARM NEON registers.

    My problem is similar; the only difference is that I need to work with signed 16-bit integers instead of unsigned ones.

    Is there any way to do this? Also, I don't understand how the original uint16 problem in this thread was solved by converting uint32 to float32. How is that possible when the Qn register holds a uint16x8_t?

    Thanks,

    Francesco