"vmull.s16 q8, d8, d0 \r\n"    //Col 0-3
"vmlal.s16 q8, d9, d1 \r\n"
"vmlal.s16 q8, d10, d2 \r\n"
"vmlal.s16 q8, d11, d3 \r\n"
"vmull.s16 q12, d12, d4 \r\n"  //Col 4-7
"vmlal.s16 q12, d13, d5 \r\n"
"vmlal.s16 q12, d14, d6 \r\n"
"vmlal.s16 q12, d15, d7 \r\n"
"vadd.i32 q8, q8, q12 \r\n"

"vmull.s16 q8, d8, d0 \r\n"    //Col 0-3
"vmull.s16 q12, d12, d4 \r\n"  //Col 4-7
"vmlal.s16 q8, d9, d1 \r\n"
"vmlal.s16 q12, d13, d5 \r\n"
"vmlal.s16 q8, d10, d2 \r\n"
"vmlal.s16 q12, d14, d6 \r\n"
"vmlal.s16 q8, d11, d3 \r\n"
"vmlal.s16 q12, d15, d7 \r\n"
"vadd.i32 q8, q8, q12 \r\n"
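
For reference, here is a rough scalar sketch in plain C of what the NEON code above computes, plus one possible reading of the "VMLA instead of VADD" idea: accumulating all eight products into a single register so the final VADD is not needed. The function names and the `a`/`b` array layout (`a[r]` standing in for d8..d15, `b[r]` for d0..d7) are made up for illustration, and the single-chain variant is an assumption about what was suggested, not something confirmed in this thread. It only models the arithmetic and says nothing about cycle counts.

```c
#include <stdint.h>

#define LANES 4  /* four s16 lanes per D register */
#define ROWS  8  /* eight D-register operand pairs */

/* Models vmull.s16 qD, dN, dM: widening 16x16 -> 32-bit multiply per lane. */
static void vmull_s16_model(int32_t acc[LANES],
                            const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < LANES; i++)
        acc[i] = (int32_t)a[i] * (int32_t)b[i];
}

/* Models vmlal.s16 qD, dN, dM: widening multiply, then accumulate into qD.
 * The add goes through uint32_t so it wraps instead of being undefined. */
static void vmlal_s16_model(int32_t acc[LANES],
                            const int16_t *a, const int16_t *b)
{
    for (int i = 0; i < LANES; i++) {
        int32_t prod = (int32_t)a[i] * (int32_t)b[i];
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)prod);
    }
}

/* Same arithmetic as the listings above: two accumulators (q8 and q12),
 * combined by a final vadd.i32. */
static void dot_two_accumulators(int32_t out[LANES],
                                 int16_t a[ROWS][LANES],
                                 int16_t b[ROWS][LANES])
{
    int32_t q8[LANES], q12[LANES];

    vmull_s16_model(q8, a[0], b[0]);        /* Col 0-3 */
    for (int r = 1; r < 4; r++)
        vmlal_s16_model(q8, a[r], b[r]);

    vmull_s16_model(q12, a[4], b[4]);       /* Col 4-7 */
    for (int r = 5; r < ROWS; r++)
        vmlal_s16_model(q12, a[r], b[r]);

    for (int i = 0; i < LANES; i++)         /* vadd.i32 q8, q8, q12 */
        out[i] = (int32_t)((uint32_t)q8[i] + (uint32_t)q12[i]);
}

/* Assumed alternative: one vmull followed by seven vmlal into the same
 * accumulator, with no final vadd. */
static void dot_single_chain(int32_t out[LANES],
                             int16_t a[ROWS][LANES],
                             int16_t b[ROWS][LANES])
{
    vmull_s16_model(out, a[0], b[0]);
    for (int r = 1; r < ROWS; r++)
        vmlal_s16_model(out, a[r], b[r]);
}
```

Both functions return the same four 32-bit sums for any input, so the choice between them is purely about how the instructions schedule on the NEON pipeline.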

Exophase, I knew there was a special forwarding path for VMLA, but I never imagined it could make VMLA faster than VADD! Can you give any more info on why an extra VMLA would be faster than a VADD in this case (i.e. at the end of the code shown here)? I would have expected VADD to always be faster.

-Shervin.