NEON performance: instruction order

Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

Hi Guys!

I have two short code snippets that produce the same result.

Code1
     
      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q8, d11, d3   \r\n"

      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7
      "vmlal.s16    q12, d13, d5   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



Code2

      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7

      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q12, d13, d5   \r\n"

      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"

      "vmlal.s16    q8, d11, d3   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"
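
For reference, here is a minimal scalar sketch in plain C of what both snippets compute, assuming d0-d15 each hold four signed 16-bit lanes and that the 32-bit accumulators wrap on overflow. The types and the `model_*` helpers are hypothetical names made up for illustration; this is not NEON code and says nothing about timing, it only shows that the two instruction orders give identical lanes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-ins for NEON registers (illustrative names). */
typedef struct { int16_t lane[4]; } dreg_t;   /* one D register: 4 x s16 */
typedef struct { uint32_t lane[4]; } qreg_t;  /* one Q register: 4 x 32-bit bit
                                                 patterns; unsigned so that
                                                 wraparound is well defined */

/* Models vmull.s16 qd, dn, dm: widening multiply of each lane pair. */
static qreg_t model_vmull(dreg_t n, dreg_t m) {
    qreg_t r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (uint32_t)((int32_t)n.lane[i] * (int32_t)m.lane[i]);
    return r;
}

/* Models vmlal.s16 qd, dn, dm: widening multiply, then accumulate into qd. */
static qreg_t model_vmlal(qreg_t acc, dreg_t n, dreg_t m) {
    qreg_t p = model_vmull(n, m);
    for (int i = 0; i < 4; i++)
        acc.lane[i] += p.lane[i];
    return acc;
}

/* Models vadd.i32 qd, qn, qm: lane-wise 32-bit add. */
static qreg_t model_vadd(qreg_t a, qreg_t b) {
    for (int i = 0; i < 4; i++)
        a.lane[i] += b.lane[i];
    return a;
}

/* Code1 order: finish the q8 chain, then the q12 chain, then add. */
static qreg_t model_code1(const dreg_t d[16]) {
    qreg_t q8 = model_vmull(d[8], d[0]);
    q8 = model_vmlal(q8, d[9], d[1]);
    q8 = model_vmlal(q8, d[10], d[2]);
    q8 = model_vmlal(q8, d[11], d[3]);

    qreg_t q12 = model_vmull(d[12], d[4]);
    q12 = model_vmlal(q12, d[13], d[5]);
    q12 = model_vmlal(q12, d[14], d[6]);
    q12 = model_vmlal(q12, d[15], d[7]);

    return model_vadd(q8, q12);
}

/* Code2 order: the q8 and q12 chains are interleaved step by step. */
static qreg_t model_code2(const dreg_t d[16]) {
    qreg_t q8  = model_vmull(d[8], d[0]);
    qreg_t q12 = model_vmull(d[12], d[4]);

    q8  = model_vmlal(q8, d[9], d[1]);
    q12 = model_vmlal(q12, d[13], d[5]);

    q8  = model_vmlal(q8, d[10], d[2]);
    q12 = model_vmlal(q12, d[14], d[6]);

    q8  = model_vmlal(q8, d[11], d[3]);
    q12 = model_vmlal(q12, d[15], d[7]);

    return model_vadd(q8, q12);
}

/* Returns 1 when both orders produce identical lanes for the given inputs. */
static int orders_match(const dreg_t d[16]) {
    qreg_t a = model_code1(d);
    qreg_t b = model_code2(d);
    return memcmp(a.lane, b.lane, sizeof a.lane) == 0;
}
```

The two chains never read each other's accumulator until the final add, so reordering them cannot change the result. Any speed difference between Code1 and Code2 therefore comes from how the core issues the instructions, not from the arithmetic.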



It seems that Code2 should be faster: consecutive instructions write to different registers, so each one shouldn't have to wait for the previous one to finish.
But in practice Code1 is faster (measured with a timer around a loop).

Could anybody explain it, please?

You are welcome to suggest a better option than Code1/Code2 :)

Thanks in advance.

PS: The core is a Cortex-A9!