"vmull.s16 q8, d8, d0 \r\n" //Col 0-3 "vmlal.s16 q8, d9, d1 \r\n" "vmlal.s16 q8, d10, d2 \r\n" "vmlal.s16 q8, d11, d3 \r\n" "vmull.s16 q12, d12, d4 \r\n" //Col 4-7 "vmlal.s16 q12, d13, d5 \r\n" "vmlal.s16 q12, d14, d6 \r\n" "vmlal.s16 q12, d15, d7 \r\n" "vadd.i32 q8, q8, q12 \r\n"
"vmull.s16 q8, d8, d0 \r\n" //Col 0-3 "vmull.s16 q12, d12, d4 \r\n" //Col 4-7 "vmlal.s16 q8, d9, d1 \r\n" "vmlal.s16 q12, d13, d5 \r\n" "vmlal.s16 q8, d10, d2 \r\n" "vmlal.s16 q12, d14, d6 \r\n" "vmlal.s16 q8, d11, d3 \r\n" "vmlal.s16 q12, d15, d7 \r\n" "vadd.i32 q8, q8, q12 \r\n"
For instance, if you can keep a register with 1 in every lane, you can replace that vadd at the end with another vmlal.
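
The identity behind this is that a multiply-accumulate against a vector of ones is just a widening add. Here is a minimal scalar C sketch of that identity, with hypothetical helper names (`vmlal_model`, `vaddw_model`, `add_via_ones`) that only imitate lane-wise behaviour. Note that it models folding a 16-bit operand into the 32-bit accumulator, since that is the operand width `vmlal.s16` takes:

```c
#include <stdint.h>

/* Hypothetical scalar stand-ins for a 4x16-bit D register
   and a 4x32-bit Q register. */
typedef struct { int16_t lane[4]; } vec16x4;
typedef struct { int32_t lane[4]; } acc32x4;

/* Models vmlal.s16: widening multiply, added to the accumulator.
   Assumes the sum does not overflow 32 bits. */
static void vmlal_model(acc32x4 *q, vec16x4 a, vec16x4 b) {
    for (int i = 0; i < 4; i++)
        q->lane[i] += (int32_t)a.lane[i] * (int32_t)b.lane[i];
}

/* Models a plain widening add of 16-bit lanes into 32-bit lanes. */
static void vaddw_model(acc32x4 *q, vec16x4 a) {
    for (int i = 0; i < 4; i++)
        q->lane[i] += (int32_t)a.lane[i];
}

/* Adds `a` to the accumulator using only the multiply-accumulate
   model and a register holding 1 in every lane. */
static acc32x4 add_via_ones(acc32x4 q, vec16x4 a) {
    const vec16x4 ones = {{1, 1, 1, 1}};
    vmlal_model(&q, a, ones);
    return q;
}
```

The cost of the trick is one register permanently holding the ones vector, so it only pays off when a register is free.
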