Hmm. In fact, you may be right!

I've tried copying the NEON code into the loop 2, 3 and 4 times. Normally, 8 pairs of instructions should take 8 cycles, but:

8 pairs take 10 cycles
16 pairs take 20 cycles
24 pairs take 38 cycles
32 pairs take 48 cycles

The time increases strangely.

I've replaced the vld1 with vtrn. The timings are:

8 pairs take 8 cycles
16 pairs take 19 cycles
24 pairs take 32 cycles
...

So now I think you're right. There is a bottleneck due to bandwidth. And it's more visible with vld1 than with vtrn because when the vld1 is pushed to the NEON queue, the VALUE of the address register is pushed too.

We don't agree that this could happen. Just replace

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r2:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r3:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r4:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r2:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r3:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r4:128]
    vmul.f32 d7,d15,d14

with

    vld1.32 {d16,d17},[r1:128]
    vmul.f32 d0,d15,d14
    vld1.32 {d18,d19},[r1:128]
    vmul.f32 d1,d15,d14
    vld1.32 {d20,d21},[r1:128]
    vmul.f32 d2,d15,d14
    vld1.32 {d22,d23},[r1:128]
    vmul.f32 d3,d15,d14
    vld1.32 {d24,d25},[r1:128]
    vmul.f32 d4,d15,d14
    vld1.32 {d26,d27},[r1:128]
    vmul.f32 d5,d15,d14
    vld1.32 {d28,d29},[r1:128]
    vmul.f32 d6,d15,d14
    vld1.32 {d30,d31},[r1:128]
    vmul.f32 d7,d15,d14

...and you'll have your proof. What you won't have is a logical explanation.

For a moment, I thought that maybe the address register could not be used directly on the next cycle, but if you replace the ADD with a MOV that just duplicates the register value, the problem is the same!
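To make the timings quoted in this post easier to compare, here is a small Python sketch that turns them into cycles per instruction pair and extra cycles over the expected 1 cycle per pair. The numbers are the ones reported above (10, 20, 38, 48 cycles for vld1 and 8, 19, 32 for vtrn); the names `vld1_cycles`, `vtrn_cycles` and `overhead` are just illustrative, and the 1-cycle-per-pair ideal is the assumption stated in the post, not something measured here.

```python
# Cycle counts as reported in the post: {number of instruction pairs: measured cycles}.
vld1_cycles = {8: 10, 16: 20, 24: 38, 32: 48}   # vld1 + vmul pairs
vtrn_cycles = {8: 8, 16: 19, 24: 32}            # vtrn + vmul pairs


def overhead(measured):
    """Return {pairs: (cycles per pair, extra cycles over 1 cycle per pair)}."""
    return {n: (c / n, c - n) for n, c in measured.items()}


if __name__ == "__main__":
    for name, data in (("vld1", vld1_cycles), ("vtrn", vtrn_cycles)):
        for n, (per_pair, extra) in overhead(data).items():
            print(f"{name}: {n:2d} pairs -> {per_pair:.2f} cycles/pair, "
                  f"{extra:+d} over ideal")
```

Run as-is, it prints one line per measurement. For instance, the 32-pair vld1 run comes out at 1.50 cycles per pair, 16 cycles over the ideal, while the 8-pair vtrn run sits exactly at 1.00.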