I am impressed by how fast you modified your code. I wish I were as good as you. I have just found some instructions that your module reports as unrecognized. Please check them:
vqdmulh.s16 d0, d1, d2[0]
vqrdmulh.s16 d0, d1, d2
vqrdmulh.s16 d0, d1, d2[0]
vqshlu.s32 q1, q2, #1
vrecpe.u32 d1, d0
vrecpe.u32 q1, q0
vrsqrte.u32 d1, d0
vrsqrte.u32 q1, q0
vpmax.s16 d0, d1, d2
vpmin.s16 d2, d1, d0
vqdmulh.s16 d0, d1, d2
vshll.s16 d2, q0, #1
vshll.u16 d2, q0, #1

mov r0, #5
add r0, r0, r2
Okay, I don't have a kind of epilogue at the end of my loop. This greatly changes the problem, and is more similar to my addition of nops. I don't think the NEON part is really taking 10 cycles, it's just that some of those cycles are overlapping with the non-NEON part.
This is conventional wisdom, but on NEON I would suggest the exact opposite. Loads make their data available in N1 and many NEON instructions need their data available in N2. In this case it's even possible to load and use the result of the load in the same cycle. If the memory is stalled for some other reason like a cache miss I don't think having the instruction ahead will help you.
I'd still like to see some evidence that this causes problems and what exactly.
vld1.32 {d16,d17},[r1:128]
vmul.f32 d0,d15,d14
vld1.32 {d18,d19},[r2:128]
vmul.f32 d1,d15,d14
vld1.32 {d20,d21},[r3:128]
vmul.f32 d2,d15,d14
vld1.32 {d22,d23},[r4:128]
vmul.f32 d3,d15,d14
vld1.32 {d24,d25},[r1:128]
vmul.f32 d4,d15,d14
vld1.32 {d26,d27},[r2:128]
vmul.f32 d5,d15,d14
vld1.32 {d28,d29},[r3:128]
vmul.f32 d6,d15,d14
vld1.32 {d30,d31},[r4:128]
vmul.f32 d7,d15,d14
vld1.32 {d16,d17},[r1:128]
vmul.f32 d0,d15,d14
vld1.32 {d18,d19},[r1:128]
vmul.f32 d1,d15,d14
vld1.32 {d20,d21},[r1:128]
vmul.f32 d2,d15,d14
vld1.32 {d22,d23},[r1:128]
vmul.f32 d3,d15,d14
vld1.32 {d24,d25},[r1:128]
vmul.f32 d4,d15,d14
vld1.32 {d26,d27},[r1:128]
vmul.f32 d5,d15,d14
vld1.32 {d28,d29},[r1:128]
vmul.f32 d6,d15,d14
vld1.32 {d30,d31},[r1:128]
vmul.f32 d7,d15,d14
If I use three buffers of 4096 bytes each, the cycle count increases to around 140 cycles. In the code I have given here, r1, r2 and r3 hold the addresses of buffer1, buffer2 and buffer3 respectively. However, if the buffer size is not a multiple of 4096, the cycle count is normal.
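One way to reason about the 4096-byte effect is to look at which low address bits the buffers share. The sketch below is only an illustration, not a confirmed explanation of the slowdown: it assumes a hypothetical L1 data cache with 64-byte lines and 64 sets (16 KB, 4-way), and the function names are made up for this example. Under that assumption, buffers spaced exactly 4096 bytes apart select the same cache set and share the same low 12 address bits, while any other spacing spreads them out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache geometry, for illustration only (not taken from the thread). */
#define LINE_BYTES 64u
#define NUM_SETS   64u   /* 16 KB / (4 ways * 64-byte lines) */

/* Set index an address would select under the assumed geometry. */
unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Returns 1 if two addresses agree in their low 12 bits (same offset in a 4 KB page). */
int same_4k_offset(uintptr_t a, uintptr_t b)
{
    return ((a ^ b) & 0xFFFu) == 0;
}

/* Returns 1 if `count` buffers, laid out `stride` bytes apart starting at
 * `base`, all start in the same cache set; 0 otherwise. */
int buffers_share_set(uintptr_t base, size_t stride, unsigned count)
{
    assert(count > 0);
    unsigned first = cache_set_index(base);
    for (unsigned i = 1; i < count; i++) {
        if (cache_set_index(base + (uintptr_t)i * stride) != first)
            return 0;
    }
    return 1;
}
```

Calling buffers_share_set(base, 4096, 3) models the three-buffer case above; a stride such as 4096 + 64 breaks the alignment. Whether this aliasing is what actually costs the extra cycles would still need to be measured.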
This chapter provides the information to estimate how much execution time particular code sequences require. The complexity of the processor makes it impossible to guarantee precise timing information with hand calculations.
There are also similar restrictions to the ARM integer pipeline in terms of dual issue pairing with multi-cycle instructions. The NEON engine can potentially dual issue on both the first and last cycle of a multi-cycle instruction, but not on any of the intermediate cycles.
I thought I remembered issuing on both first and last cycle but I'm having trouble doing it now too. I'm also having trouble getting the loop you mentioned earlier down to 10 cycles. It looks like it's taking at least 12. The entire loop is taking 14 - since there is stalling, it's difficult to tell how much, if any, is overlapping the 2 cycles of integer loop overhead. You would think that at least one cycle would be overlapped since it's purely a fetch cycle.
movw r1, #:lower16:coef
movt r1, #:upper16:coef
add r2, r1, #16
add r3, r2, #16
add r4, r3, #16
b .loop1
.align 4
.loop1:
vld1.32 {d16,d17},[r1:128]
vmul.f32 d0,d15,d14
vld1.32 {d18,d19},[r2:128]
vmul.f32 d1,d15,d14
vld1.32 {d20,d21},[r3:128]
vmul.f32 d2,d15,d14
vld1.32 {d22,d23},[r4:128]
vmul.f32 d3,d15,d14
vld1.32 {d24,d25},[r1:128]
vmul.f32 d4,d15,d14
vld1.32 {d26,d27},[r2:128]
vmul.f32 d5,d15,d14
vld1.32 {d28,d29},[r3:128]
vmul.f32 d6,d15,d14
vld1.32 {d30,d31},[r4:128]
vmul.f32 d7,d15,d14
smuad r10, r10, r10
nop
nop
smuad r11, r11, r11
nop
subs r0, r0, #1
smuad r12, r12, r12
bgt .loop1
The number of cycles stays the same for me regardless of whether I load to different registers or use different base registers with the same arrangement as in your example. Maybe we're using different versions of Cortex-A8? I'm using an OMAP3530, how about you?
Here are some interesting things I've observed:

1) If I add one or two pairs of nops in the middle I get the same speed (14 cycles for the loop). If I add a third pair the speed goes down to 13 cycles. With the fourth pair it goes back up to 14 cycles, and with every pair after that it adds 2 cycles. So, with 3 nop pairs I get no stalls in the NEON code, because there are 12 pairs of instructions (+1 cycle for fetch stall).

2) If I change three or more of the vld1s to independent vext.8 I get 10 cycles, or full pairing. Same with vmovn, vswp, vrev16, vzip, and vuzp. So the bottleneck is not dual-issue, it's loads and stores.

3) If I change to 64-bit loads instead of 128-bit I still get 14 cycles for the loop. So I don't think it's a bandwidth limitation.

4) If I change to 64-bit or 128-bit store I get 21 cycles for the loop. However, here if I store to separate 16-byte addresses in a 64-byte block I get something like 15.5 cycles (this is with a cache-line aligned destination). This is probably due to coalescing filling a whole cache line in the write buffer, where otherwise the cache line has to be loaded. I tried "warming" the buffer by memcpying it to itself to make sure it was in L1 cache, but that didn't make a difference.

5) If I change the vmul.f32s to vmla.f32 things get bad. If I start at a baseline of no-pairing I get the expected 9 cycles. Then pairing a single vmovn turns it into 12. And from there every new pair adds 4 cycles. I get the same cycles with vrecps.f32, and presumably will with the other chained pipeline instructions.

So I guess the lessons are to not do too many loads/stores in a row, and that chained pipeline instructions hate being dual issued with anything for some reason. We should do some more testing to see if there are any other instructions that cause a big penalty over dual-issue like this.
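The arithmetic behind point 1 can be written down as a tiny helper. This is only an idealized lower bound (perfect pairing plus a fixed overhead), not a model of the real pipeline, and the function names are hypothetical. It is useful mainly as a baseline to compare measured cycle counts against.

```c
/* Lower bound on issue cycles if every instruction pairs perfectly:
 * ceil(instructions / 2). */
unsigned ideal_dual_issue_cycles(unsigned instructions)
{
    return (instructions + 1u) / 2u;
}

/* Ideal pairing plus a fixed per-iteration overhead, such as the
 * one-cycle fetch stall mentioned in point 1. */
unsigned loop_cycle_estimate(unsigned instructions, unsigned overhead_cycles)
{
    return ideal_dual_issue_cycles(instructions) + overhead_cycles;
}
```

For example, 24 instructions form 12 pairs, so loop_cycle_estimate(24, 1) gives the 13 cycles quoted above; any measured count beyond that estimate would be stall cycles.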
10 NOP instructions on my beagle take 2.508 s
I'm using this code because I'm sure that the ARM code at the end takes exactly 5 cycles and leaves 2 bubbles in the pipeline for the branch. Remember this post: http://pulsar.websha...h-instructions/
I have a BeagleBoard-xM (DM3730), but I don't believe the processor is the problem. Try the code I gave and tell me whether you get 15 cycles (10 for the NEON part and 5 for the ARM part).
I do not understand point 5 or how you get 9 cycles!
- try to load (if possible) a long time before using the data (there are enough registers to load the data of the next iteration during the previous one).
- try to write as soon as possible (that is, as soon as the registers are available for VSAVE).
- and now, don't read the same memory block with consecutive VLOAD
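The first two points describe what is usually called software pipelining. Here is a minimal sketch of the idea in plain C rather than NEON assembly; the function name and the scaling operation are made up for illustration, and a compiler is free to reorder these statements, so in hand-written assembly you would place the load for the next iteration among the arithmetic of the current one yourself.

```c
#include <stddef.h>

/* Scales n floats by k, loading element i+1 before computing element i. */
void scale_pipelined(float *dst, const float *src, size_t n, float k)
{
    if (n == 0)
        return;

    float cur = src[0];                  /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        float next = src[i + 1];         /* load for the next iteration, issued early */
        dst[i] = cur * k;                /* store as soon as the result is ready */
        cur = next;
    }
    dst[n - 1] = cur * k;                /* epilogue: last element */
}
```

Whether loading this far ahead actually helps on NEON is exactly what is being debated in this thread, so treat it as something to measure, not as a rule.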