        pld             [r1, #64*0]
        pld             [r1, #64*1]
        pld             [r1, #64*2]
        pld             [r1, #64*3]
        ldr             r12, [sp]
        vdup.16         d0, r2          // coeff
        vdup.16         d1, r3          // intercept
1:
        pld             [r1, #64*4]
        vld1.16         {d28-d31}, [r1,:256]!
        vmull.s16       q12, d28, d0[0]
        vmull.s16       q13, d29, d0[0]
        vmull.s16       q14, d30, d0[0]
        vmull.s16       q15, d31, d0[0]
        vaddw.s16       q12, q12, d1
        vaddw.s16       q13, q13, d1
        vaddw.s16       q14, q14, d1
        vaddw.s16       q15, q15, d1
        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8
        vst1.16         {d24-d27}, [r0,:256]!
        subs            r12, r12, #16
        bgt             1b
        bx              lr
        pld             [r1, #64*0]
        pld             [r1, #64*1]
        pld             [r1, #64*2]
        pld             [r1, #64*3]
        pld             [r1, #64*4]
        ldr             r12, [sp]
        vdup.16         d0, r2          // coeff
        vdup.16         d1, r3          // intercept
        vld1.16         {d28-d31}, [r1,:256]!
        vmull.s16       q12, d28, d0[0]
        vmull.s16       q13, d29, d0[0]
        vmull.s16       q14, d30, d0[0]
        vmull.s16       q15, d31, d0[0]
1:
        vaddw.s16       q12, q12, d1
        vld1.16         {d20-d23}, [r1,:256]!
        vaddw.s16       q13, q13, d1
        vaddw.s16       q14, q14, d1
        vaddw.s16       q15, q15, d1
        vqrshrun.s32    d24, q12, #8
        vqrshrun.s32    d25, q13, #8
        vqrshrun.s32    d26, q14, #8
        vqrshrun.s32    d27, q15, #8
        vmull.s16       q8,  d20, d0[0]
        vmull.s16       q9,  d21, d0[0]
        vmull.s16       q10, d22, d0[0]
        vmull.s16       q11, d23, d0[0]
        vst1.16         {d24-d27}, [r0,:256]!
        vaddw.s16       q8,  q8,  d1
        vaddw.s16       q9,  q9,  d1
        vaddw.s16       q10, q10, d1
        vaddw.s16       q11, q11, d1
        subs            r12, r12, #32
        vqrshrun.s32    d16, q8,  #8
        vqrshrun.s32    d17, q9,  #8
        vqrshrun.s32    d18, q10, #8
        ble             2f
        pld             [r1, #64*4]
        vld1.16         {d28-d31}, [r1,:256]!
        vqrshrun.s32    d19, q11, #8
        vmull.s16       q12, d28, d0[0]
        vmull.s16       q13, d29, d0[0]
        vmull.s16       q14, d30, d0[0]
        vst1.16         {d16-d19}, [r0,:256]!
        vmull.s16       q15, d31, d0[0]
        b               1b
2:
        vqrshrun.s32    d19, q11, #8
        vst1.16         {d16-d19}, [r0,:256]!
        bx              lr
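For reference, here is a scalar C sketch of what each 16-bit element goes through in the loops above (the function name and signature are my own, not from the original routine): vmull.s16 widens the product to 32 bits, vaddw.s16 adds the sign-extended intercept, and vqrshrun.s32 performs a rounding right shift by 8 with saturation to the unsigned 16-bit range.

```c
#include <stdint.h>

/* Scalar model of one lane of the NEON loop (a sketch, not the real code):
 *   vmull.s16    -> 32-bit signed product coeff * x
 *   vaddw.s16    -> widening add of the signed intercept
 *   vqrshrun.s32 -> rounding shift right by 8, saturate to [0, 65535]
 */
static uint16_t scale_one(int16_t x, int16_t coeff, int16_t intercept)
{
    int32_t acc = (int32_t)coeff * x;   /* vmull.s16: widening multiply */
    acc += intercept;                   /* vaddw.s16: widening add      */
    acc = (acc + 128) >> 8;             /* rounding shift right by #8   */
    if (acc < 0)     return 0;          /* unsigned saturation, as in   */
    if (acc > 65535) return 65535;      /* vqrshrun.s32                 */
    return (uint16_t)acc;
}
```

The loops apply this to 16 (simple version) or 32 (unrolled version) elements per iteration.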
Two vld's loading 32 bytes each = 64 bytes = 1 cache line. Adding one more PLD didn't help. Thanks anyway.
Or... does NEON on the A15 dual-issue?
Note: This was originally posted on 13th July 2013 at http://forums.arm.com
Actually, that's another thing you need to do a LOT of testing with: PLD. I noticed that your optimized loop above is loading 2 cache lines per iteration but only preloading 1 cache line per iteration. With PLD it seems too hard to predict what will work best, so you need to do lots of testing with different numbers of PLD instructions, different locations of PLD within your code, and, most importantly, different numbers of cache lines to read ahead. In general, though, you should preload 2 cache lines if you are loading 2 cache lines, and PLD instructions typically work better when they are away from the VLD instructions (but not always) and interleaved among other code. You should also experiment with preloading 1, 2, 3, 4, 5, 6, 8, 10, 20, and 30 cache lines ahead to see which gives the best results. (Obviously, if you notice performance getting worse as you increase or decrease the PLD distance, you don't need to test every value, but you get the idea.)
Each cache line is 64 bytes on the Cortex-A8, and in the extended version it's exactly 64 bytes loaded per iteration, so no mistake there. I already tried everything you are suggesting, repositioning the PLD to every possible position, and reading further ahead only reduced the performance. Through my experiments, I found 4 lines ahead to be the most efficient.
BTW, I've known your site for two years or so. Good stuff there.