Division with NEON

Note: This was originally posted on 30th September 2011 at http://forums.arm.com

Hi.

I have 4 unsigned 16-bit values in a Dn register (or 8 in a Qn register):

[v1] [v2] [v3] [v4]

I'm looking for code that ends up with

[65536 / v1] [65536 / v2] [65536 / v3] [65536 / v4]

in another (or the same) Dn (or Qn) register.
Thanks,

Etienne
  • Note: This was originally posted on 3rd October 2011 at http://forums.arm.com

    vrecpe.u32 takes normalized inputs, similar to how a floating point significand is usually stored. That means the input has no leading zeroes: the top bit is always 1. If the top bit is clear, the instruction simply returns 0xFFFFFFFF.

    Another way to look at it is that vrecpe.u32 works on values from 0.5 up to (but not including) 1.0, where the input format is 0.0.32. That means no sign bits, no whole bits, and 32 fraction bits, so the top bit is worth 0.5 and the input constraint keeps it set.

    The reason for this format is to limit the possible range of the calculated reciprocal, which you'll notice must be between 1.0 and 2.0. The result comes back in 0.1.31 format, where the one whole number bit is there to satisfy this range. If you didn't perform this range limiting you wouldn't be able to define very useful data representations for integer reciprocals, since the reciprocal of any whole number greater than 1 is a fraction.

    What normalization does is convert an input x to the form:

    x_normalized = x * 2^shift
    x = x_normalized * 2^-shift

    Where the multiplication can be performed by a bit-shift. Note that for the reciprocal:

    x_reciprocal = 1 / x = 1 / (x_normalized * 2^-shift) = (1 / x_normalized) * 2^shift

    Which means that you end up performing a left shift at the end to undo the normalization. It is a left shift rather than a right shift because taking the reciprocal flips the sign of the exponent.

    Then for the actual division:

    a = y / x
    a = y * (1 / x)
    a = y * (1 / (x_normalized * 2^-shift))
    a = (y * (1 / x_normalized)) * 2^shift
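
    To sanity-check the algebra, here is a minimal scalar sketch in C (not NEON code). It assumes x is treated as a plain real number, so the standard frexp function can play the role of normalization: it returns x_normalized in [0.5, 1.0) together with an exponent, and shift is the negated exponent. That is why shift comes out negative for x >= 1 here, unlike in the bit-pattern view. The function name is made up for illustration.

```c
#include <assert.h>
#include <math.h>

/* Compute y / x as (y * (1 / x_normalized)) * 2^shift,
 * where x_normalized = x * 2^shift lies in [0.5, 1.0). */
double divide_via_normalized_reciprocal(double y, double x)
{
    int exponent;
    assert(x > 0.0);

    /* frexp gives m in [0.5, 1.0) with x == m * 2^exponent,
     * so x_normalized = m and shift = -exponent. */
    double x_normalized = frexp(x, &exponent);
    int shift = -exponent;

    /* Reciprocal of the normalized value, in (1.0, 2.0]. */
    double recip_normalized = 1.0 / x_normalized;

    /* Undo the normalization by scaling with 2^shift. */
    return ldexp(y * recip_normalized, shift);
}
```

    For example, divide_via_normalized_reciprocal(65536.0, 4.0) normalizes 4 to 0.5 with shift = -3, takes 1 / 0.5 = 2, and scales 65536 * 2 by 2^-3 to give 16384.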

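    You can find the normalization shift with a count leading zeroes instruction. In your case you'll want to use vclz.u16. vrecpe.u32 returns 0xFFFFFFFF for any input whose top bit is clear, so the leading 1 has to land in bit 31 of the 32-bit lane: set shift equal to clz(x) of the 16-bit value, plus 16 for the widening to 32 bits.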
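
    As a reference for what the count looks like per lane, here is a small scalar model in C of a 16-bit count leading zeroes. The names clz16 and clz16x4 are hypothetical. On NEON the instruction does this for every lane at once, and as far as I know it also reports the full lane width for a zero input.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 16-bit count leading zeroes for one lane.
 * Returns 16 for x == 0, i.e. the full lane width. */
unsigned clz16(uint16_t x)
{
    unsigned n = 0;
    uint16_t probe = 0x8000u;              /* start at the top bit */
    while (probe != 0 && (x & probe) == 0) {
        probe >>= 1;                       /* move down one bit */
        n++;
    }
    assert(n <= 16);
    return n;
}

/* Apply it to four lanes, as a D register of 16-bit values would hold. */
void clz16x4(const uint16_t in[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)clz16(in[i]);
}
```

    For instance clz16(3) is 14, so 3 << 14 = 0xC000 has its leading 1 in the top bit of the 16-bit lane, while 3 << 13 = 0x6000 leaves one zero bit above it.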

    However, you will not always get the correct answer using vrecpe.u32, because its estimate is only accurate to about 8 bits. To get results that are correct to 16 bits you need to refine it with Newton-Raphson iteration. That is, given an estimate y of 1 / x,

    y_refined = y * (2 - (x * y))
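
    A scalar sketch of that refinement step in fixed point, in C, may make the cost clearer. The Q2.30 format, the helper names, and the truncated-division "estimate" are all assumptions chosen for illustration. They are not what vrecpe.u32 actually produces, only a stand-in with similar (about 8-bit) accuracy.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative Q2.30 format (an assumption for this sketch): 1.0 == 1 << 30. */
#define FRAC_BITS 30
#define FIX_ONE ((uint32_t)1 << FRAC_BITS)
#define FIX_TWO ((uint32_t)2 << FRAC_BITS)

/* Fixed-point multiply: take the 64-bit "long" product,
 * then throw out the bottom FRAC_BITS bits. */
uint32_t fix_mul(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> FRAC_BITS);
}

/* Stand-in for a coarse estimate: the exact reciprocal of x
 * (x in [0.5, 1.0)) truncated to 8 fraction bits. */
uint32_t recip_estimate_8bit(uint32_t x)
{
    assert(x >= FIX_ONE / 2 && x < FIX_ONE);
    uint32_t exact = (uint32_t)(((uint64_t)1 << (2 * FRAC_BITS)) / x);
    return exact & ~(((uint32_t)1 << (FRAC_BITS - 8)) - 1);
}

/* One Newton-Raphson step: y_refined = y * (2 - x * y). */
uint32_t recip_refine(uint32_t x, uint32_t y)
{
    uint32_t xy = fix_mul(x, y);       /* close to 1.0 for a good estimate */
    return fix_mul(y, FIX_TWO - xy);
}
```

    With x = 0.75 (0x30000000) the stand-in estimate is 1.33203125 (0x55400000). One step brings it to about 1.333332, already better than 16 bits, and a second step lands on 0x55555555, which is 4/3 to the precision of this format. Each step roughly doubles the number of correct bits, and each needs two long multiplies.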

    This is kind of a pain to do in integer on NEON: there's no integer equivalent of vrecps, and since this is a fixed point multiplication you need the long (double-width) product, only to throw out the bottom bits. Honestly, you're probably better off just converting to floating point and back. You don't even have to do the final multiplication by 65536: vcvt can convert from floating point to fixed point, which gives you that multiplication (a left shift by 16) for free. Of course, you can do something similar if you stick with integer.
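
    To show what the floating point route computes per lane, here is a scalar model in C. It is a sketch under assumptions, not NEON code: the function name is invented, the plain division stands in for a vrecpe.f32 estimate refined with vrecps.f32 steps, and the multiply by 65536 with truncation stands in for a float to fixed point conversion with 16 fraction bits. Saturating to 65535 for v = 0 and v = 1 is also an assumption, since 65536 / 1 does not fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane: returns 65536 / v, saturated to 16 bits. */
uint16_t recip_65536_lane(uint16_t v)
{
    /* Assumption: treat division by zero as the largest representable value. */
    if (v == 0)
        return UINT16_MAX;

    /* Reciprocal in single precision (stand-in for estimate + refinement). */
    float r = 1.0f / (float)v;

    /* Float to fixed point with 16 fraction bits: scale by 2^16, truncate. */
    uint32_t q = (uint32_t)(r * 65536.0f);
    assert(q <= 65536u);               /* r <= 1.0, so q cannot exceed 2^16 */

    /* Narrow back to 16 bits with saturation (only v == 1 overflows). */
    return q > UINT16_MAX ? UINT16_MAX : (uint16_t)q;
}
```

    For the inputs 2, 3, 256 and 65535 this gives 32768, 21845, 256 and 1. A refined NEON estimate is not guaranteed to be correctly rounded the way this division is, so on real hardware the last bit may differ and is worth checking against a reference like this one.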