Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.
This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.
I'd still like to see some evidence that this causes problems and what exactly.
vld1.32 {d16,d17},[r1:128] vmul.f32 d0,d15,d14 vld1.32 {d18,d19},[r2:128] vmul.f32 d1,d15,d14 vld1.32 {d20,d21},[r3:128] vmul.f32 d2,d15,d14 vld1.32 {d22,d23},[r4:128] vmul.f32 d3,d15,d14 vld1.32 {d24,d25},[r1:128] vmul.f32 d4,d15,d14 vld1.32 {d26,d27},[r2:128] vmul.f32 d5,d15,d14 vld1.32 {d28,d29},[r3:128] vmul.f32 d6,d15,d14 vld1.32 {d30,d31},[r4:128] vmul.f32 d7,d15,d14
vld1.32 {d16,d17},[r1:128] vmul.f32 d0,d15,d14 vld1.32 {d18,d19},[r1:128] vmul.f32 d1,d15,d14 vld1.32 {d20,d21},[r1:128] vmul.f32 d2,d15,d14 vld1.32 {d22,d23},[r1:128] vmul.f32 d3,d15,d14 vld1.32 {d24,d25},[r1:128] vmul.f32 d4,d15,d14 vld1.32 {d26,d27},[r1:128] vmul.f32 d5,d15,d14 vld1.32 {d28,d29},[r1:128] vmul.f32 d6,d15,d14 vld1.32 {d30,d31},[r1:128] vmul.f32 d7,d15,d14