"vmull.s16 q8, d8, d0 \r\n" //Col 0-3 "vmlal.s16 q8, d9, d1 \r\n" "vmlal.s16 q8, d10, d2 \r\n" "vmlal.s16 q8, d11, d3 \r\n" "vmull.s16 q12, d12, d4 \r\n" //Col 4-7 "vmlal.s16 q12, d13, d5 \r\n" "vmlal.s16 q12, d14, d6 \r\n" "vmlal.s16 q12, d15, d7 \r\n" "vadd.i32 q8, q8, q12 \r\n"
"vmull.s16 q8, d8, d0 \r\n" //Col 0-3 "vmull.s16 q12, d12, d4 \r\n" //Col 4-7 "vmlal.s16 q8, d9, d1 \r\n" "vmlal.s16 q12, d13, d5 \r\n" "vmlal.s16 q8, d10, d2 \r\n" "vmlal.s16 q12, d14, d6 \r\n" "vmlal.s16 q8, d11, d3 \r\n" "vmlal.s16 q12, d15, d7 \r\n" "vadd.i32 q8, q8, q12 \r\n"
"vmull.s16 q8, d8, d0 \r\n" //Col 0-3 "vmlal.s16 q8, d9, d1 \r\n" "vmlal.s16 q8, d10, d2 \r\n" "vmlal.s16 q8, d11, d3 \r\n" "vmlal.s16 q8, d12, d4 \r\n" //Col 4-7 "vmlal.s16 q8, d13, d5 \r\n" "vmlal.s16 q8, d14, d6 \r\n" "vmlal.s16 q8, d15, d7 \r\n"
Exophase, I knew there was a special forwarding path for VMLA, but I never imagined it could make VMLA faster than VADD! Can you give any more info on why an extra VMLA would be faster than a VADD in this case (i.e. at the end of the code shown here)? I would have expected VADD to always be faster.
-Shervin
For instance, if you keep a register that has 1 in every lane, you can replace that vadd at the end with another vmlal.
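
The arithmetic behind the all-ones idea can be sketched in scalar C. This is only an illustration under assumptions: the function names and values are made up, it models a 16-bit source register being folded into a 32-bit accumulator, and it does not show how the substitution would be applied to the 32-bit q12 in the listings above, which the post does not spell out. It also says nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vmlal.s16 qD, dN, dM: widening multiply-accumulate. */
void mla_widen_s16(int32_t acc[4], const int16_t dn[4], const int16_t dm[4]) {
    for (int i = 0; i < 4; i++)
        acc[i] += (int32_t)dn[i] * (int32_t)dm[i];
}

/* Scalar model of vaddw.s16 qD, qD, dM: plain widening add. */
void add_widen_s16(int32_t acc[4], const int16_t dm[4]) {
    for (int i = 0; i < 4; i++)
        acc[i] += (int32_t)dm[i];
}

/* With 1 in every lane of one multiplicand, the multiply-accumulate
   degenerates into an add of the other operand. Values are made up. */
void ones_trick_check(void) {
    const int16_t ones[4] = {1, 1, 1, 1};
    const int16_t x[4] = {1, -2, 3, -4};
    int32_t via_mla[4] = {10, -20, 30, -40};
    int32_t via_add[4] = {10, -20, 30, -40};
    mla_widen_s16(via_mla, x, ones);
    add_widen_s16(via_add, x);
    for (int i = 0; i < 4; i++)
        assert(via_mla[i] == via_add[i]);
}
```

Whether the multiply-accumulate form is actually faster depends on the core's forwarding paths, so it needs to be measured on the target.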