Any equivalent NEON instruction to SMULWy?

Note: This was originally posted on 6th July 2013 at http://forums.arm.com

Hi everybody,

I'm currently working on a 7x7 Gaussian blur filter for NEON.

Since anything bigger than 3x3 is hard to handle with a 2D algorithm, I split it into two 1D passes.

The horizontal pass runs first, and every pixel (Y value) is temporarily stored as a 16-bit uq8 value. So far so good. It's super fast, with zero latency and dual issue everywhere possible.

The problem begins when doing it vertically.

I know the wide form isn't available for the multiply instructions. Widening the coefficients to 16 bits for 16-bit x 16-bit multiplications is no problem. With VMULL.U16, however, the result is 32 bits wide, so I have to narrow twice to get the final 8-bit result, and I really don't like that.
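To make the cost concrete, here is a minimal scalar sketch in C of what one output pixel of the vertical pass would compute with the widen-then-narrow-twice approach. It is a model under assumptions, not the poster's actual code: the name `vertical_tap_model`, the constant `TAPS`, the reading of the intermediate as 8.8 fixed point (at most 0xFF00) and the 0.16 fixed-point coefficients summing to at most 65536 are all hypothetical. The comments name the NEON instructions each step would roughly correspond to.

```c
#include <stdint.h>

#define TAPS 7  /* hypothetical: one column of the 7x7 kernel */

/* Scalar model of one output pixel of the vertical pass.
 * Assumptions (not from the original post):
 *   samples[i] : 8.8 unsigned fixed point, at most 0xFF00
 *   coeffs[i]  : 0.16 unsigned fixed point, summing to at most 65536
 * Under these bounds the 32-bit accumulator cannot overflow. */
static uint8_t vertical_tap_model(const uint16_t samples[TAPS],
                                  const uint16_t coeffs[TAPS])
{
    uint32_t acc = 0;
    for (int i = 0; i < TAPS; ++i) {
        /* 16 x 16 -> 32-bit multiply-accumulate (VMULL.U16 / VMLAL.U16) */
        acc += (uint32_t)samples[i] * coeffs[i];
    }

    /* First narrowing, 32 -> 16 bits: keep the upper half (VSHRN.U32 #16) */
    uint16_t hi16 = (uint16_t)(acc >> 16);

    /* Second narrowing, 16 -> 8 bits, with rounding (VRSHRN.U16 #8) */
    return (uint8_t)(((uint32_t)hi16 + 128u) >> 8);
}
```

With illustrative binomial coefficients {1024, 6144, 15360, 20480, 15360, 6144, 1024}, which sum to 65536, a column of constant samples 0x8000 (128.0) comes back as 128, so the two shifts only undo the two fixed-point scalings.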

I read through the assembly reference several times, but there seems to be no multiply instruction that returns the upper 16 bits of the result. Am I right about this? Do I really have to accept narrowing twice?

I badly need something like SMULWy....

VQDMULH, which returns the upper half of the result, won't do the trick, since it works only with signed values and doubles the result, if I understood correctly.

I'm really curious: what is VQDMULH good for? I can hardly imagine a case where that doubling would be useful. Can someone enlighten me?

Thanks in advance
  • Note: This was originally posted on 7th July 2013 at http://forums.arm.com


    VQDMULH performs the "shift-by-Q" required to compute a fixed-point multiply.
    Using "0.8 * 0.8 = 0.64" in Q15 format as an example:

    • The Q15 register value is given by 0.8 * 2^15 = 26214.4, truncated to 26214
    • 26214 in hexadecimal = 0x6666
    • 0x6666 * 0x6666 = 0x28F570A4
    • 0x28F570A4 * 2 = 0x51EAE148
    • Top half of 0x51EAE148 = 0x51EA
    • 0x51EA in decimal = 20970
    • The interpretation of this Q15 register value is 20970 / 2^15 ≈ 0.64
    hth
    s.
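The steps above can be sketched as a scalar model in C of a single 16-bit lane. This is only an illustration of the arithmetic, not the instruction's formal definition: the function name `vqdmulh_s16_model` is hypothetical, and the code assumes the compiler implements right shift of negative signed values as an arithmetic shift (as GCC and Clang do).

```c
#include <stdint.h>

/* Scalar model of one lane of VQDMULH.S16:
 * saturating doubling multiply, returning the high half.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int16_t vqdmulh_s16_model(int16_t a, int16_t b)
{
    /* Exact 32-bit product of the two Q15 operands (a Q30 value) */
    int32_t product = (int32_t)a * (int32_t)b;

    /* Doubling overflows only for (-32768) * (-32768) = 0x40000000;
     * the instruction saturates in that case. */
    if (product == INT32_C(0x40000000)) {
        return INT16_MAX;
    }

    /* Doubling and then taking bits [31:16] is the same as shifting
     * the undoubled product right by 15, which yields Q15 again. */
    return (int16_t)(product >> 15);
}
```

Calling `vqdmulh_s16_model(0x6666, 0x6666)` reproduces the 0x51EA from the worked example. The doubling is what turns the Q30 product back into Q15 when only the top 16 bits are kept: without it, the upper half would be a Q14 value.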


    Thank you very much. Now I see that VQDMULH might be really useful for Q31 and Q15 arithmetic.