    vld3.8      {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle
    vmull.u8    q3, d0, d5       @ cycle 4 (can't dual issue due to previous result in N2)
    vmlal.u8    q3, d1, d4       @ cycle 5
    vmlal.u8    q3, d2, d3       @ cycle 6, result in N6
    vshrn.u16   d6, q3, #8       @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
    vst1.8      {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)
    subs        r2, r2, #1       @ overlaps w/NEON
    bne         .loop            @ overlaps w/NEON

So 16 cycles like predicted. Note that you'd get a lot better performance if you unrolled this loop to fill up the latency after the last multiply and shift. Doing it 4 times should be sufficient.
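If you do unroll it, it helps to have a scalar reference to compare the output against, since reordering the instructions must not change the result. Here is a rough C sketch of what the loop computes, as far as I can tell from the instructions. The function names and the `w0`/`w1`/`w2` parameters are made up for illustration, and it assumes d5, d4 and d3 each hold a single weight broadcast to all eight lanes, which the listing itself doesn't say.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one loop iteration: 24 interleaved bytes in, 8 bytes out.
 * w0, w1, w2 stand in for the contents of d5, d4, d3 (assumed to be one
 * broadcast weight per register; hypothetical names). */
void weighted_sum_8(uint8_t *dst, const uint8_t *src,
                    uint8_t w0, uint8_t w1, uint8_t w2)
{
    for (int lane = 0; lane < 8; lane++) {
        /* vld3.8 de-interleaves: channel 0 -> d0, 1 -> d1, 2 -> d2 */
        uint8_t c0 = src[3 * lane + 0];
        uint8_t c1 = src[3 * lane + 1];
        uint8_t c2 = src[3 * lane + 2];

        /* vmull.u8 then two vmlal.u8: 16-bit accumulator, wraps modulo 2^16 */
        uint16_t acc = (uint16_t)(c0 * w0);
        acc = (uint16_t)(acc + c1 * w1);
        acc = (uint16_t)(acc + c2 * w2);

        /* vshrn #8: shift right by 8 and narrow to 8 bits */
        dst[lane] = (uint8_t)(acc >> 8);
    }
}

/* Whole loop: `groups` plays the role of the counter in r2, and the
 * pointer bumps mirror the [r1]! and [r0]! post-increments. */
void weighted_sum_loop(uint8_t *dst, const uint8_t *src, size_t groups,
                       uint8_t w0, uint8_t w1, uint8_t w2)
{
    assert(dst && src);
    for (size_t g = 0; g < groups; g++) {
        weighted_sum_8(dst, src, w0, w1, w2);
        src += 24; /* 8 lanes x 3 channels */
        dst += 8;
    }
}
```

Run both this and the NEON version over the same buffer and compare the bytes. Note the sum is not saturated, so weights that add up to more than 256 can wrap in the 16-bit accumulator in both versions.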