NEON performance; Operators order;

Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

Hi Guys!

I have two short code snippets that give the same result.

Code1
     
      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q8, d11, d3   \r\n"

      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7
      "vmlal.s16    q12, d13, d5   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



Code2

      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7

      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q12, d13, d5   \r\n"

      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"

      "vmlal.s16    q8, d11, d3   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



It seems that Code2 should be faster, because consecutive instructions use different registers and so shouldn't have to wait for the previous one.
In practice, though, Code1 is faster (measured with a timer around a loop).

Could anybody explain it, please?

You are welcome to suggest a better option than Code1/Code2 :)

Thanks in advance.

PS: The core is Cortex-A9!!!
  • Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

    On Cortex-A8 and Cortex-A9, integer multiply and multiply-accumulate have a special forwarding path: you can issue them back to back without stalling if the result of the first is the value accumulated into by the second. But this only works if you don't perform other multiplies in between, so Code2, which interleaves the two accumulators, breaks this and makes you pay for the full latencies shown in the timing tables in the TRMs.

    The forwarding is so useful that sometimes I use it even when I don't want a full multiply-add, for instance to replace just an add or a shift + OR. But this means you need to have registers set up for this purpose. For instance, if you can have a register that has 1 in all of the fields, you can replace that vadd at the end with another vmlal.
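
    One possible reading of the advice above, as a minimal sketch rather than a measured result: if back-to-back multiply-accumulates into the same register do not stall, all eight products could go into a single accumulator (q8), which would drop q12 and the final vadd entirely. The C below is only a scalar model of the lane arithmetic, checking that this ordering gives the same 32-bit result as Code1/Code2. The function names (`lane_product`, `two_chains`, `single_chain`) and the flat row-major array layout are assumptions made for illustration, and the instruction sequence in the comment has not been timed on a Cortex-A9.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 4, ROWS = 8 };

/* One 16x16 -> 32-bit lane product, as vmull.s16 / vmlal.s16 form it.
   Kept as uint32_t so accumulation wraps modulo 2^32 like the hardware,
   without signed-overflow undefined behaviour in C. */
static uint32_t lane_product(int16_t a, int16_t b) {
    return (uint32_t)((int32_t)a * (int32_t)b);
}

/* Structure of Code1/Code2: rows 0-3 accumulate into q8, rows 4-7 into q12,
   then vadd.i32 joins them. x and c are row-major, ROWS x LANES. */
static void two_chains(const int16_t *x, const int16_t *c, uint32_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint32_t q8 = 0, q12 = 0;
        for (int r = 0; r < 4; r++)
            q8 += lane_product(x[r * LANES + i], c[r * LANES + i]);
        for (int r = 4; r < ROWS; r++)
            q12 += lane_product(x[r * LANES + i], c[r * LANES + i]);
        out[i] = q8 + q12; /* vadd.i32 q8, q8, q12 */
    }
}

/* Hypothetical single-accumulator variant (untimed assumption):
       vmull.s16 q8, d8,  d0
       vmlal.s16 q8, d9,  d1
       vmlal.s16 q8, d10, d2
       vmlal.s16 q8, d11, d3
       vmlal.s16 q8, d12, d4
       vmlal.s16 q8, d13, d5
       vmlal.s16 q8, d14, d6
       vmlal.s16 q8, d15, d7
   Every instruction accumulates into the previous result; no vadd needed. */
static void single_chain(const int16_t *x, const int16_t *c, uint32_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint32_t q8 = 0;
        for (int r = 0; r < ROWS; r++)
            q8 += lane_product(x[r * LANES + i], c[r * LANES + i]);
        out[i] = q8;
    }
}
```

    Because 32-bit addition wraps and is associative, both orderings produce identical lanes, so the choice between them is purely a scheduling question that would need to be confirmed with the same timer loop on the target core.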