vld3.8    {d0-d2}, [r1]!   @ cycles 0-3, result in N2 of last cycle
vmull.u8  q3, d0, d5       @ cycle 4 (can't dual issue due to previous result in N2)
vmlal.u8  q3, d1, d4       @ cycle 5
vmlal.u8  q3, d2, d3       @ cycle 6, result in N6
vshrn.u16 d6, q3, #8       @ cycle 12 (value needed in N1, 5 cycle stall), result in N3
vst1.8    {d6}, [r0]!      @ cycle 15 (value needed in N1, 2 cycle stall)
subs      r2, r2, #1       @ overlaps w/NEON
bne       .loop            @ overlaps w/NEON

So 16 cycles, as predicted. Note that you'd get much better performance if you unrolled this loop to fill the latency after the last multiply and after the shift. Unrolling it 4 times should be sufficient.
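When unrolling a loop like this by hand, it helps to have a scalar reference to compare the output against. The following is a minimal sketch of what the loop appears to compute, not code from the original post: the function name `weighted_sum_ref` and the parameters `w0`, `w1`, `w2` are made up for illustration. They stand for the contents of d5, d4 and d3, whose values are not given above, and the sketch assumes every lane of each weight register holds the same value.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the NEON loop body.
 * src holds 3 interleaved 8-bit channels per pixel (what vld3.8 deinterleaves
 * into d0, d1, d2). Each channel is multiplied by an 8-bit weight and summed
 * in a 16-bit accumulator that wraps on overflow (vmull.u8 / vmlal.u8), then
 * bits 15..8 are kept (vshrn.u16 #8) and stored (vst1.8).
 * w0, w1, w2 are hypothetical stand-ins for the lanes of d5, d4, d3. */
void weighted_sum_ref(uint8_t *dst, const uint8_t *src, size_t pixels,
                      uint8_t w0, uint8_t w1, uint8_t w2)
{
    for (size_t i = 0; i < pixels; i++) {
        uint16_t acc = (uint16_t)(src[3 * i + 0] * w0 +
                                  src[3 * i + 1] * w1 +
                                  src[3 * i + 2] * w2);
        dst[i] = (uint8_t)(acc >> 8);
    }
}
```

One iteration of the NEON loop corresponds to 8 pixels here, so a 4x unrolled version should match this function on any pixel count that is a multiple of 32.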
From the very beginning, I didn't think the AML8726-M was a good platform, because of its 128KB L2 cache and 65nm fab process, but its multimedia performance is pretty good: 1080P, Mali400. What are the differences between imx515 and imx535? Just the frequency?
Does anyone have documentation on the Cortex-A9 pipeline?