    pld [r1, #64*0]
    pld [r1, #64*1]
    pld [r1, #64*2]
    pld [r1, #64*3]
    ldr r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept
1:
    pld [r1, #64*4]
    vld1.16 {d28-d31}, [r1,:256]!
    vmull.s16 q12, d28, d0[0]
    vmull.s16 q13, d29, d0[0]
    vmull.s16 q14, d30, d0[0]
    vmull.s16 q15, d31, d0[0]
    vaddw.s16 q12, q12, d1
    vaddw.s16 q13, q13, d1
    vaddw.s16 q14, q14, d1
    vaddw.s16 q15, q15, d1
    vqrshrun.s32 d24, q12, #8
    vqrshrun.s32 d25, q13, #8
    vqrshrun.s32 d26, q14, #8
    vqrshrun.s32 d27, q15, #8
    vst1.16 {d24-d27}, [r0,:256]!
    subs r12, r12, #16
    bgt 1b
    bx lr

    pld [r1, #64*0]
    pld [r1, #64*1]
    pld [r1, #64*2]
    pld [r1, #64*3]
    pld [r1, #64*4]
    ldr r12, [sp]
    vdup.16 d0, r2 //coeff
    vdup.16 d1, r3 //intercept
    vld1.16 {d28-d31}, [r1,:256]!
    vmull.s16 q12, d28, d0[0]
    vmull.s16 q13, d29, d0[0]
    vmull.s16 q14, d30, d0[0]
    vmull.s16 q15, d31, d0[0]
1:
    vaddw.s16 q12, q12, d1
    vld1.16 {d20-d23}, [r1,:256]!
    vaddw.s16 q13, q13, d1
    vaddw.s16 q14, q14, d1
    vaddw.s16 q15, q15, d1
    vqrshrun.s32 d24, q12, #8
    vqrshrun.s32 d25, q13, #8
    vqrshrun.s32 d26, q14, #8
    vqrshrun.s32 d27, q15, #8
    vmull.s16 q8, d20, d0[0]
    vmull.s16 q9, d21, d0[0]
    vmull.s16 q10, d22, d0[0]
    vmull.s16 q11, d23, d0[0]
    vst1.16 {d24-d27}, [r0,:256]!
    vaddw.s16 q8, q8, d1
    vaddw.s16 q9, q9, d1
    vaddw.s16 q10, q10, d1
    vaddw.s16 q11, q11, d1
    subs r12, r12, #32
    vqrshrun.s32 d16, q8, #8
    vqrshrun.s32 d17, q9, #8
    vqrshrun.s32 d18, q10, #8
    ble 2f
    pld [r1, #64*4]
    vld1.16 {d28-d31}, [r1,:256]!
    vqrshrun.s32 d19, q11, #8
    vmull.s16 q12, d28, d0[0]
    vmull.s16 q13, d29, d0[0]
    vmull.s16 q14, d30, d0[0]
    vst1.16 {d16-d19}, [r0,:256]!
    vmull.s16 q15, d31, d0[0]
    b 1b
2:
    vqrshrun.s32 d19, q11, #8
    vst1.16 {d16-d19}, [r0,:256]!
    bx lr
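For anyone checking a rewritten loop against the listings above, a scalar C model can serve as a reference. This is only a sketch of how the instructions read, not code from the thread. It assumes r0 is the destination, r1 the source, r2 the coefficient, r3 the intercept, and the stack argument a count of 16-bit elements. The names `scale_sample` and `scale_row_ref` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane: widening multiply (vmull.s16), widening add
   (vaddw.s16), then rounding shift right by 8 with unsigned saturation
   (vqrshrun.s32 #8). */
static uint16_t scale_sample(int16_t in, int16_t coeff, int16_t intercept)
{
    int64_t acc = (int64_t)in * coeff + intercept; /* fits in 32 bits */
    acc += 128;                 /* rounding constant: 1 << (8 - 1) */
    if (acc < 0)
        return 0;               /* negative results saturate to 0 */
    acc >>= 8;
    return acc > UINT16_MAX ? UINT16_MAX : (uint16_t)acc;
}

/* Hypothetical wrapper: count is the number of 16-bit elements
   (assumption based on the `subs r12, r12, #16` per 16-lane block). */
static void scale_row_ref(uint16_t *dst, const int16_t *src,
                          int16_t coeff, int16_t intercept, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = scale_sample(src[i], coeff, intercept);
}
```

Comparing the NEON output against `scale_row_ref` element by element is a cheap way to catch a misplaced store or an off-by-one-block loop count after reordering instructions.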
Note: This was originally posted on 13th July 2013 at http://forums.arm.com
Actually, that's another thing you need to do a LOT of testing with: PLD. I noticed that your optimized loop above is loading 2 cachelines per iteration but only preloading 1 cacheline per iteration. It is hard to predict what will work best with PLD: you need to do lots of testing with different numbers of PLD instructions, different locations of PLD within your code, and, most importantly, different numbers of cachelines to load ahead. But in general, you should preload 2 cachelines if you are loading 2 cachelines, and PLD instructions typically work better if they are away from the VLD instructions (but not always) and interleaved among other code. You should also experiment with loading 1, 2, 3, 4, 5, 6, 8, 10, 20, and 30 cachelines ahead to see which gives the best results. (Obviously, if you notice the performance getting worse as you increase or decrease the PLD distance, then you don't need to test every value, but you get the idea.)
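The "try different distances" advice above can be sketched in C with the GCC/Clang builtin `__builtin_prefetch`, which compilers typically lower to PLD on ARM. This is a minimal illustration rather than the poster's method: the function name `sum_with_prefetch`, the summing workload, and the `lines_ahead` parameter are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* Cortex-A8 line size, per the thread */

/* Sum 16-bit samples while prefetching `lines_ahead` cache lines in front
   of the current read position. `lines_ahead` is the knob to benchmark. */
static int64_t sum_with_prefetch(const int16_t *src, size_t count,
                                 size_t lines_ahead)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(int16_t); /* 32 */
    const size_t ahead = lines_ahead * per_line;
    int64_t sum = 0;

    for (size_t i = 0; i < count; i += per_line) {
        /* One prefetch per cache line consumed; skipped near the end so
           no out-of-range address is ever formed. */
        if (i + ahead < count)
            __builtin_prefetch(&src[i + ahead], 0 /* read */, 0 /* streaming */);

        size_t end = i + per_line < count ? i + per_line : count;
        for (size_t j = i; j < end; j++)
            sum += src[j];
    }
    return sum;
}
```

Timing this over a large buffer for each candidate `lines_ahead` value gives the kind of sweep described above. The result is the same for every distance; only the run time should change.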
Each cache line consists of 64 bytes on the Cortex-A8. In the extended version, it's exactly 64 bytes per iteration. No mistakes here. I already tried everything you are suggesting, repositioning PLD to every possible position. And reading further ahead only reduced the performance. Through my experiments, I found 4 lines ahead to be the most efficient.
BTW, I've known your site for two years or so. Good stuff there.
Or... does NEON on the A15 dual-issue?
Two vld's loading 32 bytes each = 64 bytes = 1 cache line. Putting in one additional PLD didn't help. Thanks anyway.
    @ preload
    vld1.16 {d28-d31}, [r1,:256]!
    @ first operation, to transfer registers
    vmull.s16 q12, d28, d0[0]
    vmull.s16 q13, d29, d0[0]
    vmull.s16 q14, d30, d0[0]
    vmull.s16 q15, d31, d0[0]
    @ you'll have to do one less iteration
    subs r12, r12, #16
.loop:
    @ load for next iteration
    vld1.16 {d28-d31}, [r1,:256]!
    @ working on preloaded data
    vaddw.s16 q12, q12, d1
    vaddw.s16 q13, q13, d1
    vaddw.s16 q14, q14, d1
    vaddw.s16 q15, q15, d1
    @ prepare for saving
    vqrshrun.s32 d20, q12, #8
    vqrshrun.s32 d21, q13, #8
    vqrshrun.s32 d22, q14, #8
    vqrshrun.s32 d23, q15, #8
    @ leave some time before saving - transfer registers
    vmull.s16 q12, d28, d0[0]
    vmull.s16 q13, d29, d0[0]
    vmull.s16 q14, d30, d0[0]
    vmull.s16 q15, d31, d0[0]
    @ saving
    vst1.16 {d20-d23}, [r0,:256]!
    subs r12, r12, #16
    bgt .loop
    @ need to finish the last iteration
    vaddw.s16 q12, q12, d1
    vaddw.s16 q13, q13, d1
    vaddw.s16 q14, q14, d1
    vaddw.s16 q15, q15, d1
    vqrshrun.s32 d20, q12, #8
    vqrshrun.s32 d21, q13, #8
    vqrshrun.s32 d22, q14, #8
    vqrshrun.s32 d23, q15, #8
    vst1.16 {d20-d23}, [r0,:256]!
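The structure of the loop above (prologue, steady state that loads block N+1 while finishing block N, epilogue) can be modelled in plain C to check the control flow. This is a sketch under assumptions: `BLOCK`, `finish`, and `scale_pipelined` are hypothetical names, and the count is assumed to be a non-zero multiple of 16 elements.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };  /* elements per vld1.16 {d28-d31} */

/* Add the intercept, then rounding shift right by 8 with unsigned
   saturation (models vaddw.s16 followed by vqrshrun.s32 #8). */
static uint16_t finish(int32_t prod, int16_t intercept)
{
    int64_t acc = (int64_t)prod + intercept + 128;
    if (acc < 0)
        return 0;
    acc >>= 8;
    return acc > UINT16_MAX ? UINT16_MAX : (uint16_t)acc;
}

/* Assumes count is a multiple of BLOCK and count >= BLOCK. */
static void scale_pipelined(uint16_t *dst, const int16_t *src,
                            int16_t coeff, int16_t intercept, size_t count)
{
    int32_t prod[BLOCK];

    /* Prologue: load and multiply block 0. */
    for (size_t k = 0; k < BLOCK; k++)
        prod[k] = (int32_t)src[k] * coeff;

    /* Steady state: each pass loads the next block, finishes and stores
       the previous one, and starts the multiply for the next. */
    for (size_t i = BLOCK; i < count; i += BLOCK) {
        int16_t next[BLOCK];
        uint16_t out[BLOCK];

        for (size_t k = 0; k < BLOCK; k++)
            next[k] = src[i + k];                   /* load for next pass */
        for (size_t k = 0; k < BLOCK; k++)
            out[k] = finish(prod[k], intercept);    /* finish previous */
        for (size_t k = 0; k < BLOCK; k++)
            prod[k] = (int32_t)next[k] * coeff;     /* start next */
        for (size_t k = 0; k < BLOCK; k++)
            dst[i - BLOCK + k] = out[k];            /* store previous */
    }

    /* Epilogue: finish and store the last block. */
    for (size_t k = 0; k < BLOCK; k++)
        dst[count - BLOCK + k] = finish(prod[k], intercept);
}
```

One difference worth noting: this C model handles a single block correctly because its loop is checked before the first pass. As far as I can tell, the assembly above falls into `.loop` unconditionally, so it appears to need at least two blocks (32 elements).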
Yes, the Cortex-A15 is a beast: it can dual-issue NEON and it has numerous NEON ALUs. E.g., VADDQ runs at 1.5 instructions per clock (i.e., you can add 24 x 8-bit numbers per clock!) per CPU core. So a quad-core A15 running at 1.6 GHz, with the 4 cores getting 85% parallelization, can process 4 x 0.85 x 24 x 1.6 x 10^9 ≈ 130 GB/s!
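The arithmetic in that estimate is easy to reproduce. The helper below is a hypothetical illustration; the inputs (4 cores, 85% efficiency, 24 bytes per clock, 1.6 GHz) are the figures quoted in the post, not measurements.

```c
/* Peak bytes per second under the post's assumptions: every core issues
   `bytes_per_clock` bytes of NEON adds each cycle, scaled by how well the
   work parallelizes across cores. */
static double peak_bytes_per_second(int cores, double parallel_efficiency,
                                    double bytes_per_clock, double clock_hz)
{
    return cores * parallel_efficiency * bytes_per_clock * clock_hz;
}
```

Calling `peak_bytes_per_second(4, 0.85, 1.5 * 16, 1.6e9)` gives roughly 1.3056e11, i.e. about 130 GB/s. The 24 bytes per clock come from 1.5 VADDQ instructions per clock times 16 bytes per 128-bit register. This is an ALU issue ceiling; it does not account for how fast memory can feed the cores.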
I've found really weird, inexplicable things with NEON too :/ One thing I've had a suspicion of for a while, but little really hard data on, is that maybe there are penalties for too many NEON instructions before hitting a non-NEON instruction. At the very least we do know that the 16-entry queue will fill up and you'll no longer be able to decode non-NEON instructions ahead of NEON execution. You can also try aligning the loop entry point to 8 bytes to maximize fetch throughput.

One kind of obvious question: are your iterations actually multiples of 32? If you're rounding them up to 32 you'll be doing half a loop more work. I don't think you need to unroll like this to get enough distance to avoid stalls; you can probably do it just by staggering/software pipelining the loop.

The only other thing I can think of is that I don't know how well the multi-cycle loads and stores dual-issue. You may want to try splitting them into separate loads and stores to try to get both sides to dual-issue. But you can't put them back to back or the address increment will stall in the integer pipeline. If you really need them back to back, you can do it with separate pointers at interleaved addresses and a register increment, but that increases register pressure a lot.
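On the iteration-count question above, the extra work from rounding up is simple to quantify. The helpers below are hypothetical and only illustrate the point: with a 32-element unrolled loop, a count that is an odd multiple of 16 gets padded by 16 elements, which is the "half a loop" of extra work.

```c
#include <stddef.h>

/* Round count up to the next multiple of step (step must be non-zero). */
static size_t round_up(size_t count, size_t step)
{
    return (count + step - 1) / step * step;
}

/* Number of extra elements processed when count is rounded up to step. */
static size_t padded_elements(size_t count, size_t step)
{
    return round_up(count, step) - count;
}
```

For example, `padded_elements(48, 32)` is 16, while `padded_elements(48, 16)` is 0, so a row of 48 elements costs an extra half iteration in the 32-element loop but nothing in the 16-element one.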