    pld         [r1, #64*0]
    pld         [r1, #64*1]
    pld         [r1, #64*2]
    pld         [r1, #64*3]
    ldr         r12, [sp]
    vdup.16     d0, r2                  //coeff
    vdup.16     d1, r3                  //intercept
1:
    pld         [r1, #64*4]
    vld1.16     {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
    vaddw.s16   q12, q12, d1
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1
    vqrshrun.s32 d24, q12, #8
    vqrshrun.s32 d25, q13, #8
    vqrshrun.s32 d26, q14, #8
    vqrshrun.s32 d27, q15, #8
    vst1.16     {d24-d27}, [r0,:256]!
    subs        r12, r12, #16
    bgt         1b
    bx          lr

    pld         [r1, #64*0]
    pld         [r1, #64*1]
    pld         [r1, #64*2]
    pld         [r1, #64*3]
    pld         [r1, #64*4]
    ldr         r12, [sp]
    vdup.16     d0, r2                  //coeff
    vdup.16     d1, r3                  //intercept
    vld1.16     {d28-d31}, [r1,:256]!
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
1:
    vaddw.s16   q12, q12, d1
    vld1.16     {d20-d23}, [r1,:256]!
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1
    vqrshrun.s32 d24, q12, #8
    vqrshrun.s32 d25, q13, #8
    vqrshrun.s32 d26, q14, #8
    vqrshrun.s32 d27, q15, #8
    vmull.s16   q8, d20, d0[0]
    vmull.s16   q9, d21, d0[0]
    vmull.s16   q10, d22, d0[0]
    vmull.s16   q11, d23, d0[0]
    vst1.16     {d24-d27}, [r0,:256]!
    vaddw.s16   q8, q8, d1
    vaddw.s16   q9, q9, d1
    vaddw.s16   q10, q10, d1
    vaddw.s16   q11, q11, d1
    subs        r12, r12, #32
    vqrshrun.s32 d16, q8, #8
    vqrshrun.s32 d17, q9, #8
    vqrshrun.s32 d18, q10, #8
    ble         2f
    pld         [r1, #64*4]
    vld1.16     {d28-d31}, [r1,:256]!
    vqrshrun.s32 d19, q11, #8
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vst1.16     {d16-d19}, [r0,:256]!
    vmull.s16   q15, d31, d0[0]
    b           1b
2:
    vqrshrun.s32 d19, q11, #8
    vst1.16     {d16-d19}, [r0,:256]!
    bx          lr
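
For anyone checking results against these two loops, here is a minimal scalar sketch in C of what each lane appears to compute: a signed 16x16 multiply (vmull.s16), a widening add of the intercept (vaddw.s16), then a rounding shift right by 8 with unsigned saturation to 16 bits (vqrshrun.s32 #8). The names scale_one and scale_ref, and the argument order (dst in r0, src in r1, coeff in r2, intercept in r3, count on the stack), are assumptions read off the register usage, not something the original post states.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane. Hypothetical helper name. */
static uint16_t scale_one(int16_t x, int16_t coeff, int16_t intercept)
{
    /* vmull.s16 then vaddw.s16: 32-bit product plus sign-extended intercept. */
    int32_t acc = (int32_t)x * coeff + intercept;

    /* vqrshrun.s32 #8: add the rounding constant 1 << 7, shift right by 8,
     * then saturate to the unsigned 16-bit range. Any negative value ends
     * up as 0, so it is handled before the shift to avoid shifting a
     * negative number in C. */
    int32_t rounded = acc + 128;
    if (rounded < 0)
        return 0;
    rounded >>= 8;
    return rounded > 65535 ? 65535 : (uint16_t)rounded;
}

/* Reference loop, assumed to match the NEON routines element by element. */
void scale_ref(uint16_t *dst, const int16_t *src,
               int16_t coeff, int16_t intercept, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = scale_one(src[i], coeff, intercept);
}
```

Comparing the output of scale_ref with the NEON output on the same buffer is a cheap way to confirm that a rescheduled loop still produces identical results before timing it.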
Your problem seems very strange.
- Do r0 and r1 reference the same memory region? If so, try pointing them at different memory.
- Can you try your code without any NEON instructions (except the memory accesses, of course)? Maybe you have saturated the memory bandwidth (what hardware are you using?).
- Finally, is there any chance your benchmark method is wrong?
I will try your code on my BeagleBoard to see!
Etienne
> In the extended version It's exactly 64Bytes per iteration. No mistakes here.
What I meant is that you have 2 VLD instructions per iteration (loading a total of 2 cache lines), but only 1 PLD instruction. So you should add an extra PLD instruction to your loop, to preload 2 cache lines instead of just 1. Otherwise you are only preloading half of your data!
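
A rough C sketch of that suggestion, for illustration only: one prefetch per cache line consumed in each iteration. It assumes 32-byte cache lines (which is what "2 cache lines per 64 bytes" implies; some cores use 64-byte lines, in which case one prefetch per iteration already covers the data), a prefetch distance of 256 bytes to mirror pld [r1, #64*4], and the GCC/Clang builtin __builtin_prefetch. The names scale_prefetch, scale_lane, LINE_BYTES and PREFETCH_AHEAD are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     32   /* assumed cache-line size */
#define PREFETCH_AHEAD 256  /* bytes ahead of the current read position */

/* Hypothetical per-element operation: multiply, add, round, saturate. */
static uint16_t scale_lane(int16_t x, int16_t coeff, int16_t intercept)
{
    int32_t t = (int32_t)x * coeff + intercept + 128;
    if (t < 0)
        return 0;
    t >>= 8;
    return t > 65535 ? 65535 : (uint16_t)t;
}

void scale_prefetch(uint16_t *dst, const int16_t *src,
                    int16_t coeff, int16_t intercept, size_t count)
{
    const size_t per_iter = 32;  /* 32 x int16 = 64 bytes read per iteration */
    size_t i = 0;

    for (; i + per_iter <= count; i += per_iter) {
        /* Two prefetches, one for each assumed 32-byte line this iteration
         * reads. The address is built as an integer so that no out-of-range
         * pointer arithmetic takes place. */
        uintptr_t ahead = (uintptr_t)(src + i) + PREFETCH_AHEAD;
        __builtin_prefetch((const void *)ahead, 0, 3);
        __builtin_prefetch((const void *)(ahead + LINE_BYTES), 0, 3);

        for (size_t j = 0; j < per_iter; j++)
            dst[i + j] = scale_lane(src[i + j], coeff, intercept);
    }

    /* Leftover elements; the assembly assumes count is a multiple of 32. */
    for (; i < count; i++)
        dst[i] = scale_lane(src[i], coeff, intercept);
}
```

In the assembly version the equivalent change would be a second pld inside the loop, offset by one cache line from the first.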
Hi.
The main NEON optimisation method is to leave some time between the register load/store instructions and the compute operations that use those registers.
Try something like this:

@ preload
    vld1.16     {d28-d31}, [r1,:256]!
@ first operation, to transfer registers
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
@ you'll have to make 1 less iteration
    subs        r12, r12, #16
.loop:
@ load for next iteration
    vld1.16     {d28-d31}, [r1,:256]!
@ working on preloaded data
    vaddw.s16   q12, q12, d1
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1
@ prepare for saving
    vqrshrun.s32 d20, q12, #8
    vqrshrun.s32 d21, q13, #8
    vqrshrun.s32 d22, q14, #8
    vqrshrun.s32 d23, q15, #8
@ leave some time before saving - transfer registers
    vmull.s16   q12, d28, d0[0]
    vmull.s16   q13, d29, d0[0]
    vmull.s16   q14, d30, d0[0]
    vmull.s16   q15, d31, d0[0]
@ saving
    vst1.16     {d20-d23}, [r0,:256]!
    subs        r12, r12, #16
    bgt         .loop
@ need to finish last iteration
    vaddw.s16   q12, q12, d1
    vaddw.s16   q13, q13, d1
    vaddw.s16   q14, q14, d1
    vaddw.s16   q15, q15, d1
    vqrshrun.s32 d20, q12, #8
    vqrshrun.s32 d21, q13, #8
    vqrshrun.s32 d22, q14, #8
    vqrshrun.s32 d23, q15, #8
    vst1.16     {d20-d23}, [r0,:256]!

That should work, but I have not tested the code!
Yes, I think Webshaker's idea will give you good results. It is easy to read the ARM & NEON docs and think that counting the CPU clock cycles of your instructions is enough to estimate the total time of your loop, when in fact most code on mobile spends more time loading & storing memory than performing CPU calculations. So your main goal should be to minimize the time wasted waiting for data to load & store. This usually means accessing memory in cache-friendly ways, touching main memory as little as you can inside your loop, AND trying to increase the number of useful operations that happen between loading a value from memory and actually using it in another instruction.

Also, having longer loops can sometimes reduce the speed of your code, so this might also be influencing your perf.
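
To make the "more work between a load and its first use" idea concrete, here is a hedged C sketch of the same prologue / loop / epilogue structure as the suggested assembly. It only shows the shape of the schedule: a C compiler is free to reorder these statements, so any real gain has to come from the hand-scheduled assembly. BLOCK, mul_block, finish_block and scale_pipelined are hypothetical names, and the sketch processes whole 16-element blocks only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16  /* 16 x int16 = 32 bytes, like one vld1.16 {d28-d31} */

/* Stage 1: widening multiply (the vmull.s16 group). */
static void mul_block(int32_t acc[BLOCK], const int16_t in[BLOCK], int16_t coeff)
{
    for (int i = 0; i < BLOCK; i++)
        acc[i] = (int32_t)in[i] * coeff;
}

/* Stage 2: add intercept, round, shift, saturate, store
 * (the vaddw.s16 / vqrshrun.s32 / vst1.16 group). */
static void finish_block(uint16_t *dst, const int32_t acc[BLOCK], int16_t intercept)
{
    for (int i = 0; i < BLOCK; i++) {
        int32_t t = acc[i] + intercept + 128;
        if (t < 0)
            t = 0;
        else
            t >>= 8;
        dst[i] = t > 65535 ? 65535 : (uint16_t)t;
    }
}

void scale_pipelined(uint16_t *dst, const int16_t *src,
                     int16_t coeff, int16_t intercept, size_t count)
{
    int16_t in[BLOCK];
    int32_t acc[BLOCK];
    size_t blocks = count / BLOCK;

    if (blocks == 0)
        return;

    /* Prologue: load and multiply the first block. */
    memcpy(in, src, sizeof in);
    mul_block(acc, in, coeff);

    for (size_t b = 1; b < blocks; b++) {
        /* Issue the load for the next block early ... */
        memcpy(in, src + b * BLOCK, sizeof in);
        /* ... finish the previous block while that load is in flight ... */
        finish_block(dst + (b - 1) * BLOCK, acc, intercept);
        /* ... and only then touch the freshly loaded data. */
        mul_block(acc, in, coeff);
    }

    /* Epilogue: finish the last block. */
    finish_block(dst + (blocks - 1) * BLOCK, acc, intercept);
}
```

Unlike the assembly, this sketch checks the block count before entering the loop, so a count of exactly one block does not load past the end of the input.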