
NEON performance; operator order

Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

Hi Guys!

I have two short code sequences which give the same result.

Code1
      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q8, d11, d3   \r\n"

      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7
      "vmlal.s16    q12, d13, d5   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



Code2

      "vmull.s16    q8, d8, d0   \r\n"    //Col 0-3
      "vmull.s16    q12, d12, d4   \r\n"  //Col 4-7

      "vmlal.s16    q8, d9, d1   \r\n"
      "vmlal.s16    q12, d13, d5   \r\n"

      "vmlal.s16    q8, d10, d2   \r\n"
      "vmlal.s16    q12, d14, d6   \r\n"

      "vmlal.s16    q8, d11, d3   \r\n"
      "vmlal.s16    q12, d15, d7   \r\n"

      "vadd.i32  q8, q8, q12     \r\n"



It seems that Code2 should be faster, because consecutive instructions use different registers and so shouldn't have to wait for the previous one.
But in practice Code1 is faster (checked with a timer in a loop).

Could anybody explain it, please?

You're welcome to suggest a better option than Code1/Code2 :)

Thanks in advance.

PS: The core is Cortex-A9!!!
  • Note: This was originally posted on 24th October 2012 at http://forums.arm.com

    Hi guys!

    Exophase! Thanks a lot for such a detailed answer!!!

    I was really surprised by this result. The code was revised following your advice. Now it's:

      "vmull.s16    q8, d8, d0      \r\n"    //Col 0-3
      "vmlal.s16    q8, d9, d1      \r\n"
      "vmlal.s16    q8, d10, d2      \r\n"
      "vmlal.s16    q8, d11, d3      \r\n"

      "vmlal.s16    q8, d12, d4      \r\n"  //Col 4-7
      "vmlal.s16    q8, d13, d5      \r\n"
      "vmlal.s16    q8, d14, d6      \r\n"
      "vmlal.s16    q8, d15, d7      \r\n"



    There was about a 20% increase in performance compared with Code1 from my first post.
    It's amazing!

    Thanks everyone for the discussion and your answers!
  • Note: This was originally posted on 22nd October 2012 at http://forums.arm.com

    On Cortex-A8 and Cortex-A9 integer multiply and multiply-accumulate have a special forwarding path where you can issue them back to back without stalling if the result from the first is the value added to the second. But this only works if you don't perform other multiplies in between, so your second example breaks this and makes you pay for the full latencies shown in the timing tables in the TRMs.

    The forwarding is so useful that sometimes I use it even when I don't want a full multiply-accumulate - for instance, to replace just an add or a shift + OR. But this means you need to have registers set up for this purpose. For instance, if you can spare a register that has 1 in all of the lanes, you can replace that vadd at the end with another multiply-accumulate.
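
    A minimal sketch of that trick (the choice of q15 for the constant is mine, not from the post above): load 1 into every lane once outside the loop, then use a non-widening vmla in place of the final vadd so the sum stays on the multiplier's forwarding path:

      "vmov.i32     q15, #1        \r\n"  // set up once, outside the loop: 1 in all four 32-bit lanes
      ...
      "vmla.i32     q8, q12, q15   \r\n"  // q8 += q12 * 1; replaces "vadd.i32 q8, q8, q12"

    Note that a non-widening vmla.i32 is what's needed at this point, since q8 and q12 already hold the widened 32-bit accumulators.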
  • Note: This was originally posted on 23rd October 2012 at http://forums.arm.com


    Exophase, I knew there is a special forwarding path for VMLA, but I never imagined it can allow VMLA to be faster than VADD! Can you give any more info on why an extra VMLA would be faster than a VADD in this case (ie: at the end of the code shown here)? I would have expected VADD to always be faster.

    -Shervin.


    It is pretty unexpected, isn't it? But going by the TRM integer VMLA is the only instruction that can issue with dependencies back to back without stalling. And in my testing I've yet to find a contradiction to this, and I've used and counted the timing of most of the integer NEON instructions.

    A simple NEON integer instruction will take its operands in stage N2 and produce results in stage N3. So if you issue these instructions back to back with dependencies there'll be a one cycle stall.

    Some instructions will use stage N1 to perform a pre-process stage: it's used for negations (vsub, vneg), widening, shifts (where N2 is used for inserts for instance, with vsli/vsri). Stage N4 is then used for post-processing, like for saturation, narrowing, and mask generation. So the latency can be three or four cycles instead of two. There are some more complex instructions that go all the way up to N6 but this covers the majority of them.

    The multiply/multiply-accumulate instructions are worse. They take their inputs in N2 or N1 (except the accumulate part, that's taken later in N3) and produce results in N6. So the back to back latency for a multiply output to a non-multiply instruction is four cycles. In this case you'd incur 3 cycles of stalling while vadd waits for q12 to be ready.

    Of course, you could just be putting the stall off for later if you follow it with another instruction that accesses it. With a vstN instruction it's even worse, because it needs its inputs in N1. And even if you put this dependent instruction after a bunch of scalar instructions (like loop control) they'll still issue back to back in the NEON pipeline unless you're dominating the code with the scalar instructions. For these reasons it's best to try to software pipeline your loops, so you can interleave multiple iterations to hide dependencies. With only two iterations you can often space apart critical parts (like going from a vmla to a vstN) by more than one cycle, by interleaving big chunks of independent instructions.
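
    As a rough illustration of that software pipelining (the register allocation here is hypothetical, not from the code in this thread), independent work from iteration i+1 can be slotted between the last accumulate of iteration i and its store, so the store no longer issues right behind the multiply it depends on:

      "vmlal.s16    q8, d15, d7       \r\n"  // last accumulate of iteration i
      "vmull.s16    q9, d20, d0       \r\n"  // first multiply of iteration i+1 (d20 loaded earlier)
      "vmlal.s16    q9, d21, d1       \r\n"  // second accumulate of iteration i+1
      "vst1.32      {d16-d17}, [r0]!  \r\n"  // store q8 (= d16-d17): by now its result is ready

    The two independent instructions in between give the vmull result time to reach N6 before the vst1 needs it in N1.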
  • Note: This was originally posted on 24th October 2012 at http://forums.arm.com

    Hi Exophase,

    Thanks for the detailed reply. Oh, I see what you mean now: if you have a VMUL or VMLA at T=0, then any normal NEON instruction (such as a VADD at T=1) that needs its result before T=7 must stall until T=7. But if it's a VMLA instead of a VADD, it can use the special VMLA forwarding path and execute at T=1 without stalling, while still providing its result six cycles later at T=7. So you could then run an independent (third) instruction such as VSHL at T=2 (or a dependent instruction at T=8), whereas if you used a VADD as the second instruction the next (third) instruction wouldn't start until at least T=9.
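
    To make that timeline concrete (the cycle numbers follow the rough model discussed in this thread, not measured figures):

      T=0  vmull.s16 q8, d8, d0    ; issues; result ready around T=7 for a normal consumer
      T=1  vmlal.s16 q8, d9, d1    ; issues immediately via the accumulator forwarding path
      T=2  (next independent instruction can issue)

    versus:

      T=0  vmull.s16 q8, d8, d0    ; result ready around T=7
      T=1  vadd.i32  q9, q8, q10   ; must wait for q8, so it stalls until about T=7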

    So in this case it would be slightly faster to replace a VADD with a VMLA! Very surprising!

    Thanks,
    Shervin.