NEON performance; instruction order

Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

Hi Guys!

I have two short code sequences that produce the same result.

Code1
     
"vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q8, d11, d3   \r\n"

      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7
      "vmlal.s16    q12, d13, d5   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



Code2

      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7

      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q12, d13, d5   \r\n"

      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"

      "vmlal.s16    q8, d11, d3   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"
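
For reference, here is a rough scalar C sketch of what both sequences compute: eight widening 16-bit multiplies per lane, accumulated into two 32-bit chains (q8 and q12) that are added at the end. The function names and the flat array layout (a[0..31] standing in for d8-d15, b[0..31] for d0-d7) are assumptions made for illustration only. The sketch shows that the two orderings are arithmetically identical; it says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* One vmull/vmlal.s16 step on 4 lanes: acc[i] += a[i] * b[i].
   The 16-bit inputs are widened to 32 bits before multiplying.
   Accumulation is done in uint32_t so it wraps like the NEON lanes do. */
static void mla4(uint32_t acc[4], const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < 4; i++)
        acc[i] += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
}

/* Final vadd.i32 q8, q8, q12. */
static void add4(const uint32_t q8[4], const uint32_t q12[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)(q8[i] + q12[i]);
}

/* Code1 order: complete the q8 chain, then the q12 chain. */
static void dot_code1(const int16_t *a, const int16_t *b, int32_t out[4])
{
    uint32_t q8[4] = {0}, q12[4] = {0};
    for (int r = 0; r < 4; r++) mla4(q8,  a + 4 * r, b + 4 * r);
    for (int r = 4; r < 8; r++) mla4(q12, a + 4 * r, b + 4 * r);
    add4(q8, q12, out);
}

/* Code2 order: alternate between the q8 and q12 chains. */
static void dot_code2(const int16_t *a, const int16_t *b, int32_t out[4])
{
    uint32_t q8[4] = {0}, q12[4] = {0};
    for (int r = 0; r < 4; r++) {
        mla4(q8,  a + 4 * r,       b + 4 * r);
        mla4(q12, a + 4 * (r + 4), b + 4 * (r + 4));
    }
    add4(q8, q12, out);
}

/* Hypothetical self-check: both orderings must give the same 4 lanes. */
static void check_orderings_agree(const int16_t *a, const int16_t *b)
{
    int32_t r1[4], r2[4];
    dot_code1(a, b, r1);
    dot_code2(a, b, r2);
    for (int i = 0; i < 4; i++)
        assert(r1[i] == r2[i]);
}
```

Calling check_orderings_agree() with any two 32-element int16_t arrays should pass, since only the order of independent accumulations differs between the two functions.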



It seems that Code2 should be faster: consecutive instructions use different accumulator registers, so each one should not have to wait for the previous one.
In practice, though, Code1 is faster (measured with a timer around a loop).

Could anybody explain this, please?

You are welcome to suggest a better option than Code1/Code2 :)

Thanks in advance.

PS: The core is Cortex-A9!!!
  • Note: This was originally posted on 24th October 2012 at http://forums.arm.com

    Hi Exophase,

    Thanks for the detailed reply. I see what you mean now. If a VMUL or VMLA issues at T=0 and a normal NEON instruction that needs its result (such as a VADD at T=1) comes next, that instruction must stall until T=7. If the second instruction is a VMLA instead of a VADD, it uses the special VMLA forwarding path to execute at T=1 without stalling, although its own result is still only available 6 cycles later, at T=7. So you could potentially run an independent third instruction such as VSHL at T=2 (or a dependent one at T=8), whereas with VADD as the second instruction the third instruction wouldn't start until at least T=9.

    So in this case it would be slightly faster to replace a VADD with a VMLA! Very surprising!

    Thanks,
    Shervin.