        PRESERVE8
        AREA newarea,CODE,READONLY,ALIGN=4
        ARM
        EXPORT ne10_radix4_butterfly_float_neon_ls

ne10_radix4_butterfly_float_neon_ls FUNCTION
        PUSH {r4-r12,lr}        ;push r12 to keep stack 8 bytes aligned
        VPUSH {d8-d15}

pSrc     RN R0
count    RN R6
const256 RN R7
pSrcB    RN R8

        MOV count,#256
        MOV const256,#64
        ADD pSrcB, pSrc, #32

fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 }, [ pSrc@256 ] ,const256
        vld1.f32 { d4, d5, d6, d7 }, [ pSrcB@256 ] ,const256
        vld1.f32 { d8, d9, d10, d11 }, [ pSrc@256 ] ,const256
        vld1.f32 { d12, d13, d14, d15 },[ pSrcB@256 ] ,const256
        SUBS count,#1
        BGT fftCopyLoop

        ;/* Return From Function*/
        VPOP {d8-d15}
        POP {r4-r12,pc}
        ENDP
        END
You say you're running without an operating system. Did you enable L2 cache? If not, that would explain why you're getting a different number of cycles/MHz for the 2048 loads case. Each load instruction accesses 32 bytes, so 1024 loads access 32KB, which fits in the CPU's L1 cache, but 2048 loads access 64KB, which doesn't. If the CPU has to go to main memory then the number of cycles will vary with clock speed, because the memory bus is on a different clock.

I don't know how much work your unrolled version does, so I can't really say what it should take in an ideal case. It's possible that you're exhausting the L1 instruction cache.

The cycles for the non-unrolled version look to be what I would expect. Each load instruction takes four cycles. Three of those cycles are for the load itself: NEON can access the L1 cache at 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned it needs to perform three 128-bit loads to access 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back. Since the memory base register access and update happen in different cycles, you get a stall when you do this.

You can save two cycles if you use 128-bit aligned addresses and use two different pointers, like this:

        mov const64, #64
        add pSrcB, pSrc, #32
fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ], const64
        vld1.f32 { d4, d5, d6, d7 }, [ pSrcB, :128 ], const64
        vld1.f32 { d8, d9, d10, d11 }, [ pSrc, :128 ], const64
        vld1.f32 { d12, d13, d14, d15 }, [ pSrcB, :128 ], const64
        subs count, #1
        bgt fftCopyLoop

This assumes that pSrc is 128-bit aligned in the first place. Unrolling the code more than this is pointless. Even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads.

Not that any of this really matters since the code doesn't do anything, but maybe it'll help you in the future.

(Also, in the future could you please copy and paste code inside code tags instead of pasting a screenshot of your code, so we don't have to manually re-type it.)
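The cache and cycle arithmetic in this reply can be written down as a rough sketch in C. The constants are only the figures quoted in the reply (32 bytes per load, an L1 data cache that holds the 1024-load case, four cycles per unaligned load, two cycles saved with aligned addresses and two pointers). They are assumptions for illustration, not datasheet values, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Figures quoted in the reply above; assumed, not taken from a datasheet. */
enum {
    BYTES_PER_LOAD      = 32,        /* one vld1.f32 {d0-d3} moves 256 bits      */
    L1_DATA_CACHE_BYTES = 32 * 1024, /* size implied by "1024 loads fit in L1"   */
    CYCLES_UNALIGNED    = 4,         /* 3 for the access + 1 base-register stall */
    CYCLES_ALIGNED_2PTR = 2          /* :128 hint plus alternating pointers      */
};

/* Total bytes touched by a given number of 32-byte loads. */
static uint32_t working_set_bytes(uint32_t loads)
{
    return loads * BYTES_PER_LOAD;
}

/* Non-zero if the whole working set fits in the assumed L1 data cache. */
static int fits_in_l1(uint32_t loads)
{
    return working_set_bytes(loads) <= L1_DATA_CACHE_BYTES;
}

/* Load cycles for a run that hits L1 every time (no miss penalty modelled). */
static uint32_t load_cycles(uint32_t loads, uint32_t cycles_per_load)
{
    return loads * cycles_per_load;
}
```

With these numbers, 1024 loads touch 32768 bytes and fit, while 2048 loads touch 65536 bytes and do not, which is where the clock-dependent memory bus starts to matter. The model ignores loop overhead and any cache-miss cost.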
In the above figures:

MPU operating frequency: MPU_CLK = ARM_FCLK (hardware divider = 1). ARM_FCLK = core clock of the Cortex-A8.
L3 interconnect frequency: AXI_FCLK (L3 interconnect) = ARM_FCLK/2 (hardware divider of 2).
SRAM frequency: operates at the L3 interconnect frequency.
#define CM_CLKEN_PLL_MPU   0x48004904
#define CM_CLKSEL1_PLL_MPU 0x48004940
#define CM_CLKSEL2_PLL_MPU 0x48004944
#define CM_IDLEST_PLL_MPU  0x48004924

#define CM_CLKSEL2_PLL_MPU_var (volatile unsigned int *)(CM_CLKSEL2_PLL_MPU)
#define CM_CLKSEL1_PLL_MPU_var (volatile unsigned int *)(CM_CLKSEL1_PLL_MPU)
#define CM_IDLEST_PLL_MPU_var  (volatile unsigned int *)(CM_IDLEST_PLL_MPU)
#define CM_CLKEN_PLL_MPU_var   (volatile unsigned int *)(CM_CLKEN_PLL_MPU)
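// Unlocking the DPLL
*CM_CLKEN_PLL_MPU_var = 0x35;

// Waiting for the PLL to be put into bypass mode.
// Note: the original test "& 0x00" is always false and never waits.
// Assuming bit 0 of CM_IDLEST_PLL_MPU reads 1 while the DPLL is locked,
// wait for it to clear.
while (*CM_IDLEST_PLL_MPU_var & 0x01);

//// clksel1 = 0x12580c for 600MHz
//// clksel1 = 0x112c0c for 300MHz
//// clksel1 = 0x10640c for 100MHz
//// clksel1 = 0x10320c for 50MHz

// Setting M and clock dividers
*CM_CLKSEL1_PLL_MPU_var = 0x10320c;

// Setting M2
*CM_CLKSEL2_PLL_MPU_var = 0x01;

// Putting the DPLL back into normal (lock) mode
*CM_CLKEN_PLL_MPU_var = 0x37;

// Waiting for DPLL1 to lock on the frequency.
// Under the same assumption about bit 0, wait until it is set; the original
// "while (... & 0x01);" exits immediately while the DPLL is still unlocked.
while (!(*CM_IDLEST_PLL_MPU_var & 0x01));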
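The clksel1 values listed in the comments follow a visible pattern, which the sketch below decodes. It assumes the multiplier M sits in bits 18:8 and the divider N in bits 6:0 of CM_CLKSEL1_PLL_MPU, and that the DPLL reference clock is 13 MHz. Neither assumption is stated in this thread; they are chosen because they reproduce the frequencies in the comments. The names `decode_clksel1`, `dpll_output_hz`, and `ASSUMED_REF_CLK_HZ` are hypothetical.

```c
#include <stdint.h>

/* Assumed reference clock; not stated in the thread. */
#define ASSUMED_REF_CLK_HZ 13000000u

typedef struct {
    uint32_t m; /* DPLL multiplier, assumed to be bits 18:8 */
    uint32_t n; /* DPLL divider,    assumed to be bits 6:0  */
} dpll_fields;

/* Split a clksel1 word into its assumed M and N fields. */
static dpll_fields decode_clksel1(uint32_t clksel1)
{
    dpll_fields f;
    f.m = (clksel1 >> 8) & 0x7FFu;
    f.n = clksel1 & 0x7Fu;
    return f;
}

/* Output frequency under the assumed formula: Fref * M / ((N + 1) * M2). */
static uint32_t dpll_output_hz(uint32_t clksel1, uint32_t m2)
{
    dpll_fields f = decode_clksel1(clksel1);
    uint64_t numerator = (uint64_t)ASSUMED_REF_CLK_HZ * f.m;
    uint64_t denominator = (uint64_t)(f.n + 1u) * m2;
    return (uint32_t)(numerator / denominator);
}
```

Under these assumptions every listed value has N = 12, so the output in MHz is simply M: 0x12580c gives M = 600, 0x112c0c gives 300, 0x10640c gives 100, and 0x10320c gives 50, with M2 = 1 as written to CM_CLKSEL2_PLL_MPU above.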
Yes, please post your code. If it's large I recommend using something like pastebin.com instead of posting it directly.

When you say you're using on-chip SRAM, are you referring to the 64KB at 0x40200000? This is still on the other side of the L3 bus, so you'd definitely be accessing it at bus speed and not something derived from the CPU clock. I can't find any SRAM internal to the CPU unless you're using part of L2 cache in lockdown. If you're going through L3 that's a 200MHz clock, if set up correctly of course.

Still not sure why you're getting what appears to be variable perf/MHz for your larger data set if L2 cache is enabled. It could be that page attributes aren't set up correctly, or that something in NEON is uninitialized. There's a lot of stuff to set up.
*(CM_CLKEN_PLL_var) = 0x00110015;
*(CM_CLKEN_PLL_var) = 0x00110015;

// ; MPU
*(CM_CLKEN_PLL_MPU_var) = 0x00000015;

// ; EMU
*(CM_CLKSEL1_EMU_var) = 0x02030A50;

// ; Setup PLL's
// ; Clock control registers
*(CM_CLKSEL1_PLL_var) = 0x094C0C00;
*(CM_CLKSEL2_PLL_var) = 0x0001B00C;
*(CM_CLKSEL3_PLL_var) = 0x00000009;
*(CM_CLKEN_PLL_var) = 0x00310035;

// ; WKUP
*(CM_CLKSEL_WKUP_var) = 0x00000015;

// ; Core
*(CM_ICLKEN1_CORE_var) = 0x00000042;
*(CM_CLKSEL_CORE_var) = 0x0000020A;

// ; MPU BYPASS
// Setting up the frequency M multiplier, N divider
//// clksel1 = 0x12580c for 600MHz
//// clksel1 = 0x11f40c for 500MHz
//// clksel1 = 0x112c0c for 300MHz
//// clksel1 = 0x10640c for 100MHz
//// clksel1 = 0x10320c for 50MHz
*(CM_CLKSEL1_PLL_MPU_var) = 0x12580c;
*(CM_CLKEN_PLL_MPU_var) = 0x00000035;
// The original test "& 0x00" is always false and never waits.
// Assuming bit 0 of CM_IDLEST_PLL_MPU reads 1 while locked, wait for it to clear.
while (*CM_IDLEST_PLL_MPU_var & 0x01);

// ; Enable PLL's
// ; Clock control registers
*(CM_CLKEN_PLL_var) = 0x00370037;

// ; MPU LOCK
*(CM_CLKEN_PLL_MPU_var) = 0x00000037;
// Under the same assumption, wait until bit 0 is set; the original
// "while (... & 0x01);" exits immediately while the DPLL is still unlocked.
while (!(*CM_IDLEST_PLL_MPU_var & 0x01));

// ; Increase trace clock.
*(CM_CLKSEL1_EMU_var) = 0x03020A55;