
Cascading multiplication; NEON;

Note: This was originally posted on 5th March 2013 at http://forums.arm.com

Hi Guys!

I've just run into something confusing about cascaded multiplication in the NEON unit.
source1- http://pulsar.websha...sample-25fce1da
source2- http://pulsar.websha...sample-58fce985

1:
n.18-1   1c  __            vmull.u8 q12,d9 ,d17
n.19-0   1c          __ vmlal.u8 q12,d11,d19
n.20-0   1c n0     vmlal.u8 q12,d12,d20
n.21-0   1c n0     vmlal.u8 q12,d14,d22
n.22-0   1c n0     vmlal.u8 q12,d2 ,d4
                      
n.23-0   1c n0     vmlsl.u8 q12,d8 ,d16
n.24-0   1c n0     vmlsl.u8 q12,d10,d18
n.25-0   1c n0     vmlsl.u8 q12,d13,d21
n.26-0   1c n0     vmlsl.u8 q12,d15,d23


2:
n.35-0   2c n0   _____ vmul.s16 q12,q4 ,d0[0]
n.39-0   2c n0 q12l:5 vmla.s16 q12,q5 ,d0[1]
n.43-0   2c n0 q12l:5 vmla.s16 q12,q6 ,d0[2]
n.47-0   2c n0 q12l:5 vmla.s16 q12,q7 ,d0[3]
n.51-0   2c n0 q12l:5 vmla.s16 q12,q8 ,d1[0]
n.55-0   2c n0 q12l:5 vmla.s16 q12,q9 ,d1[1]
n.59-0   2c n0 q12l:5 vmla.s16 q12,q10,d1[2]
n.63-0   2c n0 q12l:5 vmla.s16 q12,q11,d1[3]
n.67-0   2c n0 q12l:5 vmla.s16 q12,q1 ,q2


As you can see, in the first case the multiply instructions issue one after another without any delay, while in the second case each instruction starts only after the previous one has finished.
I was very surprised when I found this out.

The only description of cascaded multiplication that I could find is in "DDI0409I_cortex_a9_neon_mpe_r4p1_trm":



If a multiply-accumulate follows a multiply or another
multiply-accumulate, and depends on the result of that first instruction, then
if the dependency between both instructions are of the same type and size,
the processor uses a special multiplier accumulator forwarding. This special
forwarding means the multiply instructions can issue back-to-back because
the result of the first instruction in cycle 5 is forwarded to the accumulator
of the second instruction in cycle 4. If the size and type of the instructions
do not match, then Dd or Qd is required in cycle 3. This applies to
combinations of the multiply-accumulate instructions VMLA, VMLS, VQDMLA,
and VQDMLS, and the multiply instructions VMUL and VQDMUL.



But unfortunately this doesn't explain why the forwarding works in one case and not in the other...

Does anybody know where I can find a full description of the multiplication instructions, so that I can use them correctly and efficiently?
Thanks for your attention.
  • Note: This was originally posted on 6th March 2013 at http://forums.arm.com

    Exophase!

    To tell the truth, I tried this web simulator only after I saw a significant performance difference on a Motorola phone (Cortex-A9, 1.2 GHz).
    As you can see, the function interpolates a data block. For an 8x8 block, run 500*10^6 times, I got 188 sec in one case against 101 sec in the other, so the software simulator and the hardware gave almost the same results: Perf1/Perf2 = 2.02 in the simulator and 1.86 on the hardware.
    So it seems that the simulator works correctly.
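As a rough sanity check on those numbers, the totals can be turned into a per-call cost. This is only a sketch: it assumes the core really ran at 1.2 GHz for the whole benchmark and that the measured time is dominated by the interpolation function. The names `RUNS`, `CLOCK_HZ` and `per_call` are made up for illustration.

```python
# Back-of-the-envelope conversion of the benchmark totals quoted above.
# Assumptions: constant 1.2 GHz clock, loop overhead ignored.

RUNS = 500 * 10**6      # number of 8x8 block calls
CLOCK_HZ = 1.2e9        # nominal Cortex-A9 clock from the post


def per_call(total_seconds):
    """Return (nanoseconds, clock cycles) spent per call."""
    seconds = total_seconds / RUNS
    return seconds * 1e9, seconds * CLOCK_HZ


slow_ns, slow_cycles = per_call(188.0)   # version with stalls
fast_ns, fast_cycles = per_call(101.0)   # version without stalls
hw_ratio = 188.0 / 101.0                 # hardware Perf1/Perf2

print(f"slow: {slow_ns:.0f} ns, about {slow_cycles:.0f} cycles per call")
print(f"fast: {fast_ns:.0f} ns, about {fast_cycles:.0f} cycles per call")
print(f"hardware ratio: {hw_ratio:.2f}")
```

Under those assumptions the two versions cost roughly 451 and 242 cycles per block, a difference of about 209 cycles per call.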

    Also, here is another variant of the multiplication - http://pulsar.websha...sample-42bd665a
    In this case, all the multiplications follow each other without delays, despite using scalar operands.

    n.9-0    1c n0     vmull.s16 q13,d9 ,d0[0]
    n.10-0   1c n0     vmlal.s16 q13,d11,d0[1]
    n.11-0   1c n0     vmlal.s16 q13,d13,d0[2]
    n.12-0   1c n0     vmlal.s16 q13,d15,d0[3]
    n.13-0   1c n0     vmlal.s16 q13,d17,d1[0]
    n.14-0   1c n0     vmlal.s16 q13,d19,d1[1]
    n.15-0   1c n0     vmlal.s16 q13,d21,d1[2]
    n.16-0   1c n0     vmlal.s16 q13,d23,d1[3]


    I'm completely confused by all these options.
    All in all, my question is still open.
  • Note: This was originally posted on 15th March 2013 at http://forums.arm.com

    UP!
  • Note: This was originally posted on 2nd April 2013 at http://forums.arm.com


    If you think about it, this behavior is what you would expect. The NEON unit on Cortex-A8 and Cortex-A9 can do 4x16-bit multiplications in one cycle. Doing an 8x16-bit multiplication is like doing two 4x16-bit ones back to back, and takes two cycles. Because it's like alternating between two different multiplications it can't forward the result because it'd have to forward two things where there's probably only one internal accumulator.


    Exophase!

    Where did you get the information about "4x16-bit" and "8x16-bit"? Actually, the first example used 8x8-bit multiplications, and that worked well...
  • Note: This was originally posted on 2nd April 2013 at http://forums.arm.com

    4x16-bit and 8x8-bit both take one cycle to issue. 8x16-bit and 16x8-bit take two cycles. You can see this in the Cortex-A8 and A9 TRMs.

    You can't issue the two cycle versions back to back without stalling for more cycles since it breaks the forwarding.

    Basically, as far as the NEON unit is concerned, what you're doing here:

    vmul.s16 q12,q4 ,d0[0]
    vmla.s16 q12,q5 ,d0[1]
    vmla.s16 q12,q6 ,d0[2]
    vmla.s16 q12,q7 ,d0[3]
    vmla.s16 q12,q8 ,d1[0]
    vmla.s16 q12,q9 ,d1[1]
    vmla.s16 q12,q10,d1[2]
    vmla.s16 q12,q11,d1[3]
    vmla.s16 q12,q1 ,q2


    Is the same as this:

    vmul.s16 d24,d8 ,d0[0]
    vmul.s16 d25,d9 ,d0[0]
    vmla.s16 d24,d10,d0[1]
    vmla.s16 d25,d11,d0[1]
    vmla.s16 d24,d12,d0[2]
    vmla.s16 d25,d13,d0[2]
    vmla.s16 d24,d14,d0[3]
    vmla.s16 d25,d15,d0[3]
    vmla.s16 d24,d16,d1[0]
    vmla.s16 d25,d17,d1[0]
    vmla.s16 d24,d18,d1[1]
    vmla.s16 d25,d19,d1[1]
    vmla.s16 d24,d20,d1[2]
    vmla.s16 d25,d21,d1[2]
    vmla.s16 d24,d22,d1[3]
    vmla.s16 d25,d23,d1[3]
    vmla.s16 d24,d2 ,d4
    vmla.s16 d25,d3 ,d5


    It stalls because the operations aren't really back to back and it can't forward between two interleaved operations.

    If you instead did this manually:

    vmul.s16 d24,d8 ,d0[0]
    vmla.s16 d24,d10,d0[1]
    vmla.s16 d24,d12,d0[2]
    vmla.s16 d24,d14,d0[3]
    vmla.s16 d24,d16,d1[0]
    vmla.s16 d24,d18,d1[1]
    vmla.s16 d24,d20,d1[2]
    vmla.s16 d24,d22,d1[3]
    vmla.s16 d24,d2 ,d4

    vmul.s16 d25,d9 ,d0[0]
    vmla.s16 d25,d11,d0[1]
    vmla.s16 d25,d13,d0[2]
    vmla.s16 d25,d15,d0[3]
    vmla.s16 d25,d17,d1[0]
    vmla.s16 d25,d19,d1[1]
    vmla.s16 d25,d21,d1[2]
    vmla.s16 d25,d23,d1[3]
    vmla.s16 d25,d3 ,d5


    The result should be the same but you shouldn't get any stalls.
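The reason the reordering is safe can be checked with a small lane-level model. This is a sketch in Python, not NEON code: it assumes `vmul.s16`/`vmla.s16` wrap modulo 2^16 in each lane, treats the initial `vmul` as an accumulate onto zero, and uses made-up input data. The helper names (`s16`, `q_chain`, `split_chains`) are hypothetical.

```python
# Lane-level model: one 8-lane (q register) chain versus two independent
# 4-lane (d register) chains. Timing is not modelled, only the arithmetic.

def s16(x):
    """Wrap an integer to a signed 16-bit lane value."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x


def broadcast(scalar, lanes=8):
    """Model a scalar operand such as d0[1]: the same value in every lane."""
    return [scalar] * lanes


def q_chain(srcs, muls):
    """Every step updates all 8 lanes, as in 'vmla.s16 q12, qN, ...'."""
    acc = [0] * 8
    for src, mul in zip(srcs, muls):
        acc = [s16(a + x * m) for a, x, m in zip(acc, src, mul)]
    return acc


def split_chains(srcs, muls):
    """Finish the low half (d24) completely, then the high half (d25)."""
    halves = []
    for lo in (0, 4):
        acc = [0] * 4
        for src, mul in zip(srcs, muls):
            acc = [s16(a + x * m)
                   for a, x, m in zip(acc, src[lo:lo + 4], mul[lo:lo + 4])]
        halves.append(acc)
    return halves[0] + halves[1]


# Made-up data: nine source vectors, eight scalar multipliers, one vector multiplier.
srcs = [[s16(1000 * i + 37 * lane) for lane in range(8)] for i in range(9)]
muls = [broadcast(c) for c in (3, -5, 7, -11, 13, -17, 19, -23)]
muls.append([s16(29 * lane - 100) for lane in range(8)])

assert q_chain(srcs, muls) == split_chains(srcs, muls)
```

Since no lane ever reads another lane's accumulator, the order in which the two halves are processed does not change the result; only the timing differs.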
  • Note: This was originally posted on 5th March 2013 at http://forums.arm.com

    You need to try timing it on real hardware to see if it's a limitation there and not just a problem with webshaker's simulator. I can confirm that you can issue dependent vmla instructions back to back, but I haven't tried the scalar operand version.
  • Note: This was originally posted on 19th March 2013 at http://forums.arm.com

    Well, this isn't something that ARM documents, but there are a lot of things about NEON timing that you can't find in datasheets.

    I hadn't noticed originally that the second set of examples was using 128-bit registers. I think it's this, and not the fact that you're using scalars, that's causing the stalls. The fact that you don't get them with 64-bit multiplications with scalars also supports this.

    If you think about it, this behavior is what you would expect. The NEON unit on Cortex-A8 and Cortex-A9 can do 4x16-bit multiplications in one cycle. Doing an 8x16-bit multiplication is like doing two 4x16-bit ones back to back, and takes two cycles. Because it's like alternating between two different multiplications it can't forward the result because it'd have to forward two things where there's probably only one internal accumulator.

    If the forwarding gets broken you have to wait the full latency which is 6 cycles. The second multiplication hides one of the latency cycles so you get a 4 cycle stall instead of 5 cycles.
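The spacing seen in the simulator listings earlier in the thread can be put into a very rough estimate. This sketch assumes that the leading `n.NN` field in those listings is the cycle at which each instruction starts, and that a manually split d-register version would reach the same 1-cycle spacing as the stall-free listings; neither is confirmed by ARM documentation. The names below are made up for illustration.

```python
# Start cycles copied from the two simulator listings in the first post.
D_CHAIN_STARTS = [18, 19, 20, 21, 22, 23, 24, 25, 26]   # vmull/vmlal/vmlsl.u8 chain
Q_CHAIN_STARTS = [35, 39, 43, 47, 51, 55, 59, 63, 67]   # vmul/vmla.s16 q12 chain


def spacing(starts):
    """Cycles between consecutive instruction starts."""
    return [b - a for a, b in zip(starts, starts[1:])]


def chain_cycles(n_instr, step):
    """Rough cost of a chain: instruction count times start-to-start spacing."""
    return n_instr * step


q_step = spacing(Q_CHAIN_STARTS)[0]   # 4 cycles between q-register vmla starts
d_step = spacing(D_CHAIN_STARTS)[0]   # 1 cycle between forwarded starts

q_version = chain_cycles(9, q_step)       # nine 8x16-bit instructions
split_version = chain_cycles(18, d_step)  # eighteen 4x16-bit instructions (assumed stall-free)

print(q_version, split_version)
```

The estimate ignores the latency at the end of each chain and any other instructions in the loop, so it only indicates the size of the effect, not an exact cycle count.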

    You can also find the forwarding broken if you put some (maybe any, haven't tested) NEON instructions between the multiplications.