@ simple version: out[i] = usat16((in[i]*coeff + intercept + 128) >> 8)
        pld     [r1, #64*0]
        pld     [r1, #64*1]
        pld     [r1, #64*2]
        pld     [r1, #64*3]
        ldr     r12, [sp]               @ element count
        vdup.16 d0, r2                  @ coeff
        vdup.16 d1, r3                  @ intercept
1:
        pld     [r1, #64*4]
        vld1.16 {d28-d31}, [r1,:256]!   @ load 16 input samples
        vmull.s16 q12, d28, d0[0]       @ widening multiply by coeff
        vmull.s16 q13, d29, d0[0]
        vmull.s16 q14, d30, d0[0]
        vmull.s16 q15, d31, d0[0]
        vaddw.s16 q12, q12, d1          @ add intercept (widened)
        vaddw.s16 q13, q13, d1
        vaddw.s16 q14, q14, d1
        vaddw.s16 q15, q15, d1
        vqrshrun.s32 d24, q12, #8       @ round, narrow, saturate
        vqrshrun.s32 d25, q13, #8
        vqrshrun.s32 d26, q14, #8
        vqrshrun.s32 d27, q15, #8
        vst1.16 {d24-d27}, [r0,:256]!   @ store 16 results
        subs    r12, r12, #16
        bgt     1b
        bx      lr
@ 2x-unrolled, software-pipelined version
        pld     [r1, #64*0]
        pld     [r1, #64*1]
        pld     [r1, #64*2]
        pld     [r1, #64*3]
        pld     [r1, #64*4]
        ldr     r12, [sp]               @ element count
        vdup.16 d0, r2                  @ coeff
        vdup.16 d1, r3                  @ intercept
        @ prologue: start the first block before entering the loop
        vld1.16 {d28-d31}, [r1,:256]!
        vmull.s16 q12, d28, d0[0]
        vmull.s16 q13, d29, d0[0]
        vmull.s16 q14, d30, d0[0]
        vmull.s16 q15, d31, d0[0]
1:
        vaddw.s16 q12, q12, d1
        vld1.16 {d20-d23}, [r1,:256]!   @ load the next block early
        vaddw.s16 q13, q13, d1
        vaddw.s16 q14, q14, d1
        vaddw.s16 q15, q15, d1
        vqrshrun.s32 d24, q12, #8
        vqrshrun.s32 d25, q13, #8
        vqrshrun.s32 d26, q14, #8
        vqrshrun.s32 d27, q15, #8
        vmull.s16 q8, d20, d0[0]
        vmull.s16 q9, d21, d0[0]
        vmull.s16 q10, d22, d0[0]
        vmull.s16 q11, d23, d0[0]
        vst1.16 {d24-d27}, [r0,:256]!
        vaddw.s16 q8, q8, d1
        vaddw.s16 q9, q9, d1
        vaddw.s16 q10, q10, d1
        vaddw.s16 q11, q11, d1
        subs    r12, r12, #32
        vqrshrun.s32 d16, q8, #8
        vqrshrun.s32 d17, q9, #8
        vqrshrun.s32 d18, q10, #8
        ble     2f
        pld     [r1, #64*4]
        vld1.16 {d28-d31}, [r1,:256]!
        vqrshrun.s32 d19, q11, #8
        vmull.s16 q12, d28, d0[0]
        vmull.s16 q13, d29, d0[0]
        vmull.s16 q14, d30, d0[0]
        vst1.16 {d16-d19}, [r0,:256]!
        vmull.s16 q15, d31, d0[0]
        b       1b
2:      @ epilogue: finish the last block
        vqrshrun.s32 d19, q11, #8
        vst1.16 {d16-d19}, [r0,:256]!
        bx      lr
Yes, I think Webshaker's idea will give you good results. It is easy to read the ARM & NEON docs and think that counting the CPU clock cycles of your instructions is enough to estimate the total time of your loop, when in fact most mobile code spends more time loading & storing memory than performing CPU calculations. So your main goal should be to minimize the time wasted waiting for data to load & store. That usually means accessing memory in cache-friendly ways, touching main memory as little as you can inside your loop, AND increasing the number of useful operations between loading a value from memory and actually using it in another instruction (sketched below). Also, longer loops can sometimes reduce the speed of your code, so that might also be influencing your perf.
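To make the load-use distance point concrete, here is the pattern to avoid, as a rough sketch with register names borrowed from the simple loop above:

        @ producer and consumer back to back: the vmull has to wait for
        @ the vld1 result, so the NEON pipeline stalls between them
        vld1.16   {d28-d31}, [r1,:256]!
        vmull.s16 q12, d28, d0[0]       @ consumes d28 immediately

        @ the reply below hides this latency by multiplying the data
        @ loaded on the previous iteration while this vld1 is in flight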
Hi. The main NEON optimisation method is to leave some time between the register load/store instructions and the compute operations that use them. Try something like this:

@ preload
        vld1.16 {d28-d31}, [r1,:256]!
@ first operations on the transferred registers
        vmull.s16 q12, d28, d0[0]
        vmull.s16 q13, d29, d0[0]
        vmull.s16 q14, d30, d0[0]
        vmull.s16 q15, d31, d0[0]
@ you'll have to do 1 less iteration
        subs    r12, r12, #16
.loop:
@ load for the next iteration
        vld1.16 {d28-d31}, [r1,:256]!
@ work on the preloaded data
        vaddw.s16 q12, q12, d1
        vaddw.s16 q13, q13, d1
        vaddw.s16 q14, q14, d1
        vaddw.s16 q15, q15, d1
@ prepare for saving
        vqrshrun.s32 d20, q12, #8
        vqrshrun.s32 d21, q13, #8
        vqrshrun.s32 d22, q14, #8
        vqrshrun.s32 d23, q15, #8
@ leave some time before saving - work on the transferred registers
        vmull.s16 q12, d28, d0[0]
        vmull.s16 q13, d29, d0[0]
        vmull.s16 q14, d30, d0[0]
        vmull.s16 q15, d31, d0[0]
@ saving
        vst1.16 {d20-d23}, [r0,:256]!
        subs    r12, r12, #16
        bgt     .loop
@ need to finish the last iteration
        vaddw.s16 q12, q12, d1
        vaddw.s16 q13, q13, d1
        vaddw.s16 q14, q14, d1
        vaddw.s16 q15, q15, d1
        vqrshrun.s32 d20, q12, #8
        vqrshrun.s32 d21, q13, #8
        vqrshrun.s32 d22, q14, #8
        vqrshrun.s32 d23, q15, #8
        vst1.16 {d20-d23}, [r0,:256]!

That should work. But I have not tested the code!
> In the extended version it's exactly 64 bytes per iteration. No mistakes here.

What I meant is that you have 2 VLD instructions (loading a total of 2 cachelines) per iteration, but only 1 PLD instruction. So you should add an extra PLD inside your loop to preload 2 cachelines, not just 1; otherwise you are only preloading half of your data!
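For what it's worth, a minimal sketch of that fix against the unrolled loop above, assuming 32-byte cachelines (which is what the two-cachelines-per-iteration count implies) and keeping the original prefetch distance of 256 bytes:

1:
        pld     [r1, #32*8]             @ prefetch the line for the first vld1
        pld     [r1, #32*9]             @ extra pld: the second 32-byte line
        vld1.16 {d28-d31}, [r1,:256]!   @ first 32-byte load
        vld1.16 {d20-d23}, [r1,:256]!   @ second 32-byte load
        @ ... arithmetic and stores as in the loop above ...
        subs    r12, r12, #32
        bgt     1b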
Your problem seems very strange.

- Do r0 and r1 reference the same memory zone? If so, try pointing them at separate memory areas.
- Can you try your code without any NEON instructions (except the memory accesses, of course)? Maybe you have saturated the memory access capacity (what hardware are you using?).
- Finally, is there any chance your benchmarking method is wrong?

I will try your code on my BeagleBoard to see!

Etienne
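One way to run that no-arithmetic experiment, keeping only the memory accesses of the simple loop above (an untested sketch): if this copy loop runs no faster than the full kernel, the loop is bound by memory bandwidth rather than NEON throughput.

1:
        vld1.16 {d28-d31}, [r1,:256]!   @ same load as the real kernel
        vst1.16 {d28-d31}, [r0,:256]!   @ store the data straight back out
        subs    r12, r12, #16
        bgt     1b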
I've found really weird, inexplicable things with NEON too :/ One thing I've suspected for a while, but have little hard data on, is that there may be penalties for issuing too many NEON instructions before hitting a non-NEON instruction. At the very least we know that the 16-entry queue will fill up and the core will no longer be able to decode non-NEON instructions ahead of NEON execution. You can also try aligning the loop entry point to 8 bytes to maximize fetch throughput.

One kind of obvious question: are your iteration counts actually multiples of 32? If you're rounding them up to 32 you'll be doing up to half a loop of extra work. I don't think you need to unroll like this to get enough distance to avoid stalls; you can probably do it just by staggering/software-pipelining the loop.

The only other thing I can think of is that I don't know how well the multi-cycle loads and stores dual-issue. You may want to try splitting them into separate smaller loads and stores to get both sides to dual-issue. But you can't put them back to back, or the address increment will stall in the integer pipeline. If you really need them back to back you can do it with separate pointers at interleaved addresses and a register increment, but that increases register pressure a lot (a sketch of that follows).
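A rough sketch of the two-pointer trick on the 16-bit stream above; the choice of r4/r5 is illustrative (they are callee-saved under the AAPCS, so they would need to be preserved), and whether the dual-issue win beats the extra register pressure would need measuring:

        add     r4, r1, #16             @ second pointer, half a block ahead
        mov     r5, #32                 @ both pointers stride the full block
1:
        vld1.16 {d28-d29}, [r1,:128], r5   @ register post-increment: no
        vld1.16 {d30-d31}, [r4,:128], r5   @ write-back dependency between them
        @ ... arithmetic and stores as before ...
        subs    r12, r12, #16
        bgt     1b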
Yes, the Cortex-A15 is a beast: it can dual-issue NEON and has numerous NEON ALUs. E.g. VADDQ sustains 1.5 instructions per clock (i.e. you can add 24 x 8-bit numbers per clock!) per CPU core. So a quad-core A15 running at 1.6 GHz with 4 cores getting 85% parallelization can process 4 x 0.85 x 24 x 1.6 x 10^9 ≈ 130 GB/s!