    vld3.8      {d0-d2}, [r1]!      @ cycles 0-3, result in N2 of last cycle
    vmull.u8    q3, d0, d5          @ cycle 4 (can't dual issue due to previous result in N2)
    vmlal.u8    q3, d1, d4          @ cycle 5
    vmlal.u8    q3, d2, d3          @ cycle 6, result in N6
    vshrn.u16   d6, q3, #8          @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
    vst1.8      {d6}, [r0]!         @ cycle 15 (value needed in N1, 2 cycle stall)
    subs        r2, r2, #1          @ overlaps w/NEON
    bne         .loop               @ overlaps w/NEON

So that's 16 cycles, as predicted. Note that you'd get much better performance if you unrolled this loop to fill the latency after the last multiply and after the shift. Unrolling it 4 times should be sufficient.
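As a rough sketch of what a 4x unroll could look like, here is one possible interleaving. The extra registers (d16-d24 for the loads, q13-q15 for the accumulators) are my own choice, not something from the original loop. It assumes r2 is a nonzero multiple of 4 and that d3-d5 still hold the weights. I haven't cycle-counted this version, so treat the scheduling as a starting point rather than a measured result.

```
@ Sketch only: assumes r2 is a nonzero multiple of 4 and d3-d5 hold the weights.
@ Register choices beyond the original loop (d16-d24, q13-q15) are illustrative;
@ they avoid the callee-saved d8-d15.
.loop:
    vld3.8      {d0-d2},   [r1]!    @ group 0
    vld3.8      {d16-d18}, [r1]!    @ group 1
    vld3.8      {d19-d21}, [r1]!    @ group 2
    vld3.8      {d22-d24}, [r1]!    @ group 3

    vmull.u8    q3,  d0,  d5        @ four independent accumulators, so each
    vmull.u8    q13, d16, d5        @ multiply has other work issued behind it
    vmull.u8    q14, d19, d5
    vmull.u8    q15, d22, d5
    vmlal.u8    q3,  d1,  d4
    vmlal.u8    q13, d17, d4
    vmlal.u8    q14, d20, d4
    vmlal.u8    q15, d23, d4
    vmlal.u8    q3,  d2,  d3
    vmlal.u8    q13, d18, d3
    vmlal.u8    q14, d21, d3
    vmlal.u8    q15, d24, d3

    vshrn.u16   d6,  q3,  #8        @ q3 was finished 3 instructions ago
    vshrn.u16   d7,  q13, #8        @ d7 is free once q3 has been narrowed
    vshrn.u16   d26, q14, #8        @ d26/d27 are free once q13 has been narrowed
    vshrn.u16   d27, q15, #8

    vst1.8      {d6-d7},   [r0]!    @ results for groups 0 and 1
    vst1.8      {d26-d27}, [r0]!    @ results for groups 2 and 3

    subs        r2, r2, #4          @ four groups consumed per iteration
    bne         .loop
```

The idea is just that each group's multiply-accumulate chain and shift now have three other groups' instructions behind them, so the shift and store should no longer sit waiting on a result that is still in the pipeline. If the count isn't a multiple of 4 you'd need a tail loop (the original single-group loop works for that).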