This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Cascading multiplication; NEON;

Note: This was originally posted on 5th March 2013 at http://forums.arm.com

Hi Guys!

I've just found out the next confusing me thing in cascading multiplication in NEON module.
source1- http://pulsar.websha...sample-25fce1da
source2- http://pulsar.websha...sample-58fce985

1:

n.18-1   1c __            vmull.u8 q12,d9 ,d17
n.19-0   1c          __ vmlal.u8 q12,d11,d19
n.20-0   1c n0     vmlal.u8 q12,d12,d20
n.21-0   1c n0     vmlal.u8 q12,d14,d22
n.22-0   1c n0     vmlal.u8 q12,d2 ,d4

n.23-0   1c n0     vmlsl.u8 q12,d8 ,d16
n.24-0   1c n0     vmlsl.u8 q12,d10,d18
n.25-0   1c n0     vmlsl.u8 q12,d13,d21
n.26-0   1c n0     vmlsl.u8 q12,d15,d23

n.35-0   2c n0   _____ vmul.s16 q12,q4 ,d0[0]
n.39-0   2c n0 q12l:5 vmla.s16 q12,q5 ,d0[1]
n.43-0   2c n0 q12l:5 vmla.s16 q12,q6 ,d0[2]
n.47-0   2c n0 q12l:5 vmla.s16 q12,q7 ,d0[3]
n.51-0   2c n0 q12l:5 vmla.s16 q12,q8 ,d1[0]
n.55-0   2c n0 q12l:5 vmla.s16 q12,q9 ,d1[1]
n.59-0   2c n0 q12l:5 vmla.s16 q12,q10,d1[2]
n.63-0   2c n0 q12l:5 vmla.s16 q12,q11,d1[3]
n.67-0   2c n0 q12l:5 vmla.s16 q12,q1 ,q2

As you can see, in first case multiply commands follow one to another without any delays while in second case each command is started when the previous one have been finished.
I was very surprized when I found out it.

The only one description about cascading multiplication which I found is "DDI0409I_cortex_a9_neon_mpe_r4p1_trm"

If a multiply-accumulate follows a multiply or another
multiply-accumulate, and depends on the result of that first instruction, then
if the dependency between both instructions are of the same type and size,
the processor uses a special multiplier accumulator forwarding. This special
forwarding means the multiply instructions can issue back-to-back because
the result of the first instruction in cycle 5 is forwarded to the accumulator
of the second instruction in cycle 4. If the size and type of the instructions
do not match, then Dd or Qd is required in cycle 3. This applies to
combinations of the multiply-accumulate instructions VMLA, VMLS, VQDMLA,
and VQDMLS, and the multiply instructions VMUL andVQDMUL.

but unfortunatelly this doesn't explain why it works in one case and not to work in another...

Does anybody know how I can get the full description of multiplication commands to use them right and effective?
Thanks for your attention.