        PRESERVE8
        AREA    newarea, CODE, READONLY, ALIGN=4
        ARM
        EXPORT  ne10_radix4_butterfly_float_neon_ls

ne10_radix4_butterfly_float_neon_ls FUNCTION
        PUSH    {r4-r12,lr}             ; push r12 to keep stack 8 bytes aligned
        VPUSH   {d8-d15}

pSrc     RN R0
count    RN R6
const256 RN R7
pSrcB    RN R8

        MOV     count, #256             ; 256 iterations x 128 bytes = 32KB streamed
        MOV     const256, #64           ; post-increment step for each pointer
        ADD     pSrcB, pSrc, #32        ; second pointer, 32 bytes ahead of pSrc
fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 },    [ pSrc@256 ],  const256
        vld1.f32 { d4, d5, d6, d7 },    [ pSrcB@256 ], const256
        vld1.f32 { d8, d9, d10, d11 },  [ pSrc@256 ],  const256
        vld1.f32 { d12, d13, d14, d15 },[ pSrcB@256 ], const256
        SUBS    count, #1
        BGT     fftCopyLoop

        ; Return from function
        VPOP    {d8-d15}
        POP     {r4-r12,pc}
        ENDP
        END
You say you're running without an operating system - did you enable the L2 cache? If not, that would explain why you're getting a different cycle count at different clock speeds for the 2048-loads case. Each of the load instructions accesses 32 bytes, so 1024 loads touch 32KB and fit in the CPU's L1 cache, but 2048 loads touch 64KB and don't. Once the code has to go out to main memory, the number of cycles will vary with clock speed because the memory bus is on a different clock.

I don't know how much work your unrolled version does, so I can't really say what it should take in an ideal case. It's possible that you're exhausting the L1 instruction cache. The cycles for the non-unrolled version look like what I would expect. Each load instruction takes four cycles. Three cycles are for the load itself: NEON can access the L1 cache 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned, it has to perform three 128-bit accesses to cover 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back; the base register's access and its update happen in different cycles, so you get a stall when you do that.

You can save two cycles if you use 128-bit aligned addresses and two different pointers, like this:

    mov      const64, #64
    add      pSrcB, pSrc, #32
fftCopyLoop
    vld1.f32 { d0, d1, d2, d3 },    [ pSrc, :128 ],  const64
    vld1.f32 { d4, d5, d6, d7 },    [ pSrcB, :128 ], const64
    vld1.f32 { d8, d9, d10, d11 },  [ pSrc, :128 ],  const64
    vld1.f32 { d12, d13, d14, d15 },[ pSrcB, :128 ], const64
    subs     count, #1
    bgt      fftCopyLoop

Each load is now two cycles: 256 bits at 128 bits per cycle, with no stall, because consecutive instructions never read and write the same base register. This assumes that pSrc is 128-bit aligned in the first place (a sketch of one way to guarantee that is at the end of this post). Unrolling the code more than this is pointless; even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads, as in the sketch below.
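On that last point, here's a minimal sketch of what interleaving could look like with a single pointer, assuming one independent instruction between the loads is enough to separate the base register's access from its update. The ! writeback form and the subs placement are my own choices for illustration, not measured code:

fftCopyLoop
    vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ]!   @ writeback adds the 32 bytes just loaded
    subs     count, #1                             @ independent work fills the update slot
    vld1.f32 { d4, d5, d6, d7 }, [ pSrc, :128 ]!
    bgt      fftCopyLoop

The principle is the same as with the dual pointers: no two back-to-back instructions touch the same auto-incremented base register.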
Not that any of this really matters, since the code doesn't do anything, but maybe it'll help you in the future.

(Also, in the future could you please copy and paste code inside code tags instead of posting a screenshot, so we don't have to manually re-type it?)
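As for the alignment assumption above: one way to guarantee it, sketched in the armasm style of your source (fftdata and fftBuffer are made-up names), is to give the buffer's AREA an explicit alignment. In armasm, ALIGN=n on an AREA means a 2^n-byte boundary, so ALIGN=4 gives 16 bytes, i.e. 128 bits:

        AREA    fftdata, DATA, READWRITE, ALIGN=4   ; 2^4 = 16 bytes = 128-bit aligned
fftBuffer
        SPACE   65536                               ; 64KB: enough for the 2048-load case

Any pSrc derived from fftBuffer (plus a multiple of 16) then satisfies the :128 hint. If the buffer comes from C instead, the allocation on that side has to provide the same guarantee.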