Hmm.

It was so strange that I ran the test myself. I do not get exactly the same results as you, but the problem is still there. That's really strange!

If you replace VMLA.F32 with VMUL.F32 or VMLA.U32, the problem goes away. So I assume that the vmla.f32 shortcut is not applied if there is another instruction between the mul and the mla. It seems this problem only affects the floating-point MLA. That's strange.

What is even stranger is why the first version takes so long when it should take 9 cycles (if we don't use vmla.f32).

Finally, I changed the address register values so that the loads do not all use the same register:

    add r2, r1, #16
    add r3, r2, #16
    add r4, r3, #16
    b .loop1
    .align 4
.loop1:
    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14
    subs r0, r0, #1
    bgt .loop1

This code (the NEON part only) now takes 10 cycles. It should take only 8 cycles. I assume there is a conflict in the NEON memory path when you use the same address register.

So:

1 - Don't put an instruction between the MUL and the MLA when you use float operations.
2 - Don't read the same data repeatedly with NEON. (You never have to do that anyway. You only did it because you were writing a benchmark. In real code this case never happens.)

NEON is not fully detailed in the documentation. There are a lot of hints you will have to find by testing. I did not know about the two you found!

Etienne.