
Fast duplicate lane

Note: This was originally posted on 17th April 2013 at http://forums.arm.com

Hi,
I have a small problem.

My input is a Dn register holding 8 bytes:
[a, b, c, d, e, f, g, h]

I'd like to end up with two Dn registers containing
[a, a, b, b, c, c, d, d]
and
[e, e, f, f, g, g, h, h]

The goal is to do this with as few NEON registers as possible.
So far, the best way I have found is something like:


vmovl.u8              Qn, Dn        @ widen each byte to a halfword
vmul.u16              Qn, Qn, Qx    @ Qx holds 8 halfword lanes of 257 (0x0101)
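
For anyone wondering why 257 is the magic constant: a widened byte x sits in the low half of a 16-bit lane, and x * 257 = (x << 8) + x, so both bytes of the lane end up equal to x. Below is a rough C sketch of that arithmetic, not a tuned implementation. The function names are made up for illustration. When `__ARM_NEON` is defined it uses the standard `arm_neon.h` intrinsics (the compiler still has to keep the constant in a register somewhere); otherwise it falls back to a scalar model of the same lanes.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: from in = [a..h] produce
 * lo = [a,a,b,b,c,c,d,d] and hi = [e,e,f,f,g,g,h,h]. */
void dup_lanes_mul(const uint8_t in[8], uint8_t lo[8], uint8_t hi[8])
{
#if defined(__ARM_NEON)
    uint16x8_t q = vmovl_u8(vld1_u8(in));   /* vmovl.u8: each lane is 0x00XX */
    q = vmulq_n_u16(q, 0x0101);             /* vmul.u16: each lane is 0xXXXX */
    uint8x16_t r = vreinterpretq_u8_u16(q); /* view the halfwords as bytes   */
    vst1_u8(lo, vget_low_u8(r));
    vst1_u8(hi, vget_high_u8(r));
#else
    /* Scalar model of the same arithmetic, one 16-bit lane at a time. */
    for (int i = 0; i < 8; ++i) {
        uint16_t lane = in[i];                       /* widen              */
        lane = (uint16_t)(lane * 0x0101u);           /* x*257 = (x<<8) + x */
        uint8_t *out = (i < 4) ? lo : hi;
        out[2 * (i & 3)]     = (uint8_t)(lane & 0xFF);
        out[2 * (i & 3) + 1] = (uint8_t)(lane >> 8);
    }
#endif
}

/* Small self-check: every output pair must equal its source byte. */
void dup_lanes_mul_check(void)
{
    const uint8_t in[8] = {0x00, 0x01, 0x7F, 0x80, 0xAB, 0xCD, 0xFE, 0xFF};
    uint8_t lo[8], hi[8];
    dup_lanes_mul(in, lo, hi);
    for (int i = 0; i < 4; ++i) {
        assert(lo[2 * i] == in[i]     && lo[2 * i + 1] == in[i]);
        assert(hi[2 * i] == in[i + 4] && hi[2 * i + 1] == in[i + 4]);
    }
}
```

Since both bytes of each lane are identical, the result does not depend on byte order within the halfword.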


I'm looking for a solution that does not use an extra register!

Do you have any ideas?
Thanks.
  • Note: This was originally posted on 17th April 2013 at http://forums.arm.com

    I'm looking for a solution not using extra register !


    I think you're out of luck - whatever you do is going to need a scratch register because you either need to hold the temporary modified value of the data before you combine it, or a constant for the multiply trick.

    You might get away without the multiply, and with a transient register rather than one holding the constant, by shifting the widened value left by 8 bits and ORing it with the unshifted version. That might be faster (untested theory), but it is technically one instruction longer, so it depends on the pipeline: a shift and an OR should be "simple" operations compared with a MUL, but YMMV.
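
A rough C sketch of the shift-and-combine idea, in case it helps. Note that the combine step has to be a bitwise OR: the widened value only has bits in the low byte and the shifted copy only in the high byte, so ANDing the two always gives zero. The function names are hypothetical. With `__ARM_NEON` defined it uses the `arm_neon.h` intrinsics, otherwise a scalar model of the same lanes, and `t` plays the role of the transient register.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: same result as the multiply trick, but using
 * a shift plus an OR instead of a multiply by a constant. */
void dup_lanes_shift_or(const uint8_t in[8], uint8_t lo[8], uint8_t hi[8])
{
#if defined(__ARM_NEON)
    uint16x8_t q = vmovl_u8(vld1_u8(in));   /* widen: each lane is 0x00XX  */
    uint16x8_t t = vshlq_n_u16(q, 8);       /* transient: each lane 0xXX00 */
    q = vorrq_u16(q, t);                    /* combine: each lane 0xXXXX   */
    uint8x16_t r = vreinterpretq_u8_u16(q);
    vst1_u8(lo, vget_low_u8(r));
    vst1_u8(hi, vget_high_u8(r));
#else
    /* Scalar model of the same steps, one 16-bit lane at a time. */
    for (int i = 0; i < 8; ++i) {
        uint16_t lane = in[i];                  /* widen                  */
        uint16_t t = (uint16_t)(lane << 8);     /* shifted transient copy */
        lane = (uint16_t)(lane | t);            /* OR; (lane & t) is 0    */
        uint8_t *out = (i < 4) ? lo : hi;
        out[2 * (i & 3)]     = (uint8_t)(lane & 0xFF);
        out[2 * (i & 3) + 1] = (uint8_t)(lane >> 8);
    }
#endif
}

/* Small self-check: every output pair must equal its source byte. */
void dup_lanes_shift_or_check(void)
{
    const uint8_t in[8] = {0x00, 0x01, 0x7F, 0x80, 0xAB, 0xCD, 0xFE, 0xFF};
    uint8_t lo[8], hi[8];
    dup_lanes_shift_or(in, lo, hi);
    for (int i = 0; i < 4; ++i) {
        assert(lo[2 * i] == in[i]     && lo[2 * i + 1] == in[i]);
        assert(hi[2 * i] == in[i + 4] && hi[2 * i + 1] == in[i + 4]);
    }
}
```

Whether this beats the multiply on a given core is something to measure rather than assume.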

    FWIW the multiply trick you are doing already is what I've used in the past for cross-lane data duplication, so that's my answer ;)


    Iso


