Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing FFT code using NEON on ARM. My target is a BeagleBoard-xM, and I am running my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near that frequency. I am currently having difficulty with the basics of NEON. Could anyone please help me with this?

The following are sample programs I ran. LOOP CODE:









Loop Unrolled code:





The following are the results I got at different frequencies.

The above does not make any sense: why are the cycles per instruction different at different frequencies?




  • Note: This was originally posted on 26th March 2013 at http://forums.arm.com

    Thanks, "Exophase", for your time.
    I am running directly on the board (BeagleBoard), using the ARM compiler suite.
    The L1 and L2 caches are enabled in the processor.
    The code below is from ARM itself.
     

    Final assembly


      PRESERVE8


      AREA  newarea,CODE,READONLY,ALIGN=4

      ARM

      EXPORT ne10_radix4_butterfly_float_neon_ls

    ne10_radix4_butterfly_float_neon_ls FUNCTION


            PUSH    {r4-r12,lr}    ;push r12 to keep stack 8 bytes aligned
            VPUSH   {d8-d15}


    pSrc        RN  R0
    count    RN  R6
    const256    RN  R7
    pSrcB    RN  R8
            MOV        count,#256
            MOV        const256,#64
            ADD        pSrcB, pSrc, #32

    fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 }, [ pSrc@256 ] ,const256
        vld1.f32 { d4, d5, d6, d7 }, [ pSrcB@256 ] ,const256
        vld1.f32 { d8, d9, d10, d11 }, [ pSrc@256 ] ,const256
        vld1.f32 { d12, d13, d14, d15 },[ pSrcB@256 ] ,const256
            SUBS        count,#1
            BGT   fftCopyLoop

            ;/* Return from function */
            VPOP    {d8-d15}
            POP  {r4-r12,pc}
            ENDP
            END




    When I used alignment in the instructions of my original FFT code, I was able to reduce the cycle count by around 10000 cycles. Thank you very much!
    If you do not mind, can I post my FFT code here, to see if there is any scope for further improvement in performance?






  • Note: This was originally posted on 27th March 2013 at http://forums.arm.com

    I am extremely sorry! I seriously regret it. You have been so kind to me, and instead of upvoting your reply I accidentally downvoted it. There was no option to undo it.

    "I had partially solved the issue by the evening, but I thought I would keep the post and put my updated query in this reply."

    Yes, I am using the on-chip SRAM, 64 KB starting at 0x40200000. I did some study on the SRAM and found the following.
    http://s20.postimg.o...cks_picture.png


    From the above I inferred the following.

    In the above figures:
    MPU operating frequency:
    MPU_CLK = ARM_FCLK (Hardware Divider =1 ).
    ARM_FCLK =Core clock of Cortex-A8.
    L3 interconnect frequency:
    AXI_FCLK(L3 interconnect ) = ARM_FCLK/2 (Hardware divider of 2).
    SRAM frequency:
    Operates at L3 interconnect frequency.

    Please correct me if I am wrong. I thought I only needed to configure DPLL1 in the DM3730 for the Cortex-A8 and the SRAM to be up and running.
    After endless digging through the manuals and going through the boot code (X-Loader), I finally devised some clock code to set the frequency of DPLL1. (I assumed that the Cortex-A8 runs at the DPLL1 output frequency and the SRAM runs at half the DPLL1 output frequency.)


    Clock address definitions


    #define CM_CLKEN_PLL_MPU    0x48004904
    #define CM_CLKSEL1_PLL_MPU  0x48004940
    #define CM_CLKSEL2_PLL_MPU  0x48004944
    #define CM_IDLEST_PLL_MPU   0x48004924
    #define CM_CLKSEL2_PLL_MPU_var      (volatile unsigned int *)(CM_CLKSEL2_PLL_MPU)
    #define CM_CLKSEL1_PLL_MPU_var      (volatile unsigned int *)(CM_CLKSEL1_PLL_MPU)
    #define CM_IDLEST_PLL_MPU_var    (volatile unsigned int *)(CM_IDLEST_PLL_MPU)
    #define CM_CLKEN_PLL_MPU_var        (volatile unsigned int *)(CM_CLKEN_PLL_MPU)





    // Unlocking the DPLL
    *CM_CLKEN_PLL_MPU_var = 0x35;
    // Waiting for the PLL to be put into bypass mode
    while (*CM_IDLEST_PLL_MPU_var & 0x00);

    // clksel1 = 0x12580c for 600 MHz
    // clksel1 = 0x112c0c for 300 MHz
    // clksel1 = 0x10640c for 100 MHz
    // clksel1 = 0x10320c for 50 MHz
    // Setting M and clock dividers
    *CM_CLKSEL1_PLL_MPU_var = 0x10320c;
    // Setting M2
    *CM_CLKSEL2_PLL_MPU_var = 0x01;
    // Putting the DPLL back into normal mode
    *CM_CLKEN_PLL_MPU_var = 0x37;
    // Waiting for DPLL1 to lock onto the frequency
    while (*CM_IDLEST_PLL_MPU_var & 0x01);
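
    A note on wait loops of this shape: a loop written as `while (reg & 0x00);` can never spin, because the masked value is always zero. Below is a minimal sketch of a bounded poll. It assumes (this is not confirmed anywhere in this thread, so check the DM3730 TRM) that bit 0 of CM_IDLEST_PLL_MPU reads 0 while DPLL1 is bypassed and 1 once it is locked. `wait_for_bits` and `max_polls` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: poll *reg until (*reg & mask) == expected.
 * Returns 0 on a match, -1 if max_polls reads pass without one, so a
 * wrong mask or polarity shows up as a timeout instead of a silent
 * fall-through or a hang. */
int wait_for_bits(volatile const uint32_t *reg, uint32_t mask,
                  uint32_t expected, uint32_t max_polls)
{
    assert(reg != NULL);
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((*reg & mask) == expected)
            return 0;
    }
    return -1;
}
```

    Under the assumption above, the bypass wait would poll for `(mask 0x1, expected 0x0)` and the lock wait for `(mask 0x1, expected 0x1)`.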


    By the way, what is "L2 cache in lockdown"? Is it related to the DSP L2 cache, which can be memory mapped? If L2 could be used directly as program memory, my code would run faster.

    Luckily I am using ARM DS-5, and I got the initialization code with it. It looks like everything is configured correctly: http://pastebin.com/hKeFGRmW
    After a hardware reset, Reset_Handler() takes over and configures everything related to the Cortex-A8. In main(), I explicitly call enable_caches() and then configure the above clocks.
    main(): http://pastebin.com/3QRdJjRX


    Coming to my FFT code: I am using it as a kind of benchmark of the capability of ARM's NEON unit. The code is borrowed and customized from an open-source library called NE10.
    ARM NEON FFT code to be optimized: http://pastebin.com/d6DNVygs

    I do not know whether I have violated any open-source license terms. The entire code is at https://code.google..../downloads/list
    Initially I was getting 45000 cycles at 50 MHz and 76000 cycles at 300 MHz. I was stuck here for two weeks. After you suggested the alignment in the instructions, the results were 33000 cycles at 50 MHz and 66000 cycles at 300 MHz.
    As you can see, the cycle count dropped by at least 10000 at both frequencies. I suspect that memory is an issue.
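
    Converting these cycle counts to wall-clock time makes the comparison between frequencies easier to read. This is only a small arithmetic sketch; `cycles_to_ns` is a hypothetical helper name, and the inputs are the counts quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed time in nanoseconds for a cycle count at a given core clock.
 * Integer arithmetic; fine for the small counts discussed here. */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t cpu_hz)
{
    assert(cpu_hz != 0);
    return cycles * 1000000000ull / cpu_hz;
}
```

    With the numbers above, 33000 cycles at 50 MHz is 660 microseconds and 66000 cycles at 300 MHz is 220 microseconds: the clock is six times faster but the run is only three times shorter, which fits the suspicion that memory is the limit.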

    I ran the FFT NEON code with only the load, store and loop-counter instructions (removing the VMUL, VADD, ... instructions), and the following are the results.
    http://s20.postimg.o..._and_stores.png



    As you can see, in my case the actual computation runs for only 20,000 cycles, and the loads and stores take a lot of cycles. Comparing the last two columns of the above table, I find that the difference between the "loads and stores" counts at different frequencies is the same as the difference between the "FFT" counts at different frequencies.

    My main aim is to reduce the number of cycles taken and also to fix the variation of the cycle count with frequency.

    Any inputs will be very helpful to me. I have been working on this for 3 months now.




  • Note: This was originally posted on 27th March 2013 at http://forums.arm.com

    Hey,
    Thank you very much for the inputs.
    I fixed the issue of the code's performance varying with frequency. I was not configuring the DPLLs correctly.
    New clock code:


    *( CM_CLKEN_PLL_var) = 0x00110015;
        *( CM_CLKEN_PLL_var) = 0x00110015;
        //    ; MPU
        *( CM_CLKEN_PLL_MPU_var) = 0x00000015;
        //    ; EMU
        *( CM_CLKSEL1_EMU_var) = 0x02030A50;
        //
        //    ; Setup PLL's
        //    ; Clock control registers
        *(CM_CLKSEL1_PLL_var) = 0x094C0C00;
        *(CM_CLKSEL2_PLL_var) = 0x0001B00C;
        *(CM_CLKSEL3_PLL_var) = 0x00000009;
        *(CM_CLKEN_PLL_var)   = 0x00310035;
        //
        //    ; WKUP
        *(CM_CLKSEL_WKUP_var) = 0x00000015;
        //
        //    ; Core
        *(CM_ICLKEN1_CORE_var) =  0x00000042;
        *(CM_CLKSEL_CORE_var) = 0x0000020A;
        //
        //    ; MPU BYPASS
        //Setting up the frequency M multiplier ,N divider
        ////  clksel1= 0x12580c  for 600Mhz
        ////  clksel1= 0x11f40c for 500Mhz
        ////  clksel1= 0x112c0c  for 300Mhz
        ////  clksel1= 0x10640c  for 100Mhz
        ////  clksel1= 0x10320c  for 50Mhz

        *(CM_CLKSEL1_PLL_MPU_var) =0x12580c;
        *(CM_CLKEN_PLL_MPU_var) = 0x00000035;
        while (*CM_IDLEST_PLL_MPU_var & 0x00);
        //
        //    ; Enable PLL's
        //    ; Clock control registers
        *(CM_CLKEN_PLL_var) = 0x00370037;
        //
        //    ; MPU LOCK
        *(CM_CLKEN_PLL_MPU_var) = 0x00000037;
        while (*CM_IDLEST_PLL_MPU_var & 0x01);
        //
        //    ; Increase trace clock.
        *(CM_CLKSEL1_EMU_var) = 0x03020A55;
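
    The `clksel1` values listed in the comments above can be decoded with a short sketch. It assumes a field layout of M in bits [18:8] and N in bits [6:0] of CM_CLKSEL1_PLL_MPU, the relation f_out = f_ref * M / ((N + 1) * M2), and a 13 MHz reference clock. None of these is stated in this thread, so verify them against the DM3730 TRM. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of CM_CLKSEL1_PLL_MPU (verify against the TRM):
 * bits [18:8] hold the multiplier M, bits [6:0] hold the divider N. */
uint32_t dpll_mult(uint32_t clksel1) { return (clksel1 >> 8) & 0x7FFu; }
uint32_t dpll_div(uint32_t clksel1)  { return clksel1 & 0x7Fu; }

/* Assumed relation: f_out = f_ref * M / ((N + 1) * M2). */
uint64_t dpll_out_hz(uint64_t f_ref_hz, uint32_t clksel1, uint32_t m2)
{
    assert(m2 != 0);
    uint64_t divisor = (uint64_t)(dpll_div(clksel1) + 1u) * m2;
    return f_ref_hz * dpll_mult(clksel1) / divisor;
}
```

    Under those assumptions and M2 = 1, 0x12580c decodes to M = 600 and N = 12, giving 600 MHz, and 0x10320c decodes to M = 50, giving 50 MHz, which agrees with the comments in the listing.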


    It took me a lot of time to figure this out.

    Coming to my FFT code: I am using it as a kind of benchmark of the capability of ARM's NEON unit. The code is borrowed and customized from an open-source library called NE10.
    My entire code is at https://code.google..../downloads/list . The NEON FFT code to be optimized is NE10_cfft.neon1.s and the corresponding C code is NE10_cfft.c. Please help me optimize the FFT code.
    By the way, what is "L2 cache in lockdown"? Is it related to the DSP L2 cache, which can be memory mapped? If L2 could be used directly as program memory, my code would run faster. Please shed some light on this.

    Thanks again
    Vamsi
  • Note: This was originally posted on 2nd April 2013 at http://forums.arm.com

    Thanks for the inputs. I have been able to achieve a considerable performance improvement so far, and I am happy with the performance. Please just glance through the code at https://code.google.com/p/neon-fft/downloads/detail?name=NE10_cfft.neon1.s&can=2&q= . I don't expect you to understand the algorithm. If possible, please give any general suggestions for optimization (like the alignment in the instructions, which was very useful).



  • Note: This was originally posted on 25th March 2013 at http://forums.arm.com

    You say you're running without an operating system - did you enable L2 cache? If not that would explain why you're getting a different number of cycles/MHz for the 2048 loads case. One of the load instructions accesses 32 bytes, so 1024 accesses 32KB and fits in the CPU's L1 cache. But 2048 accesses 64KB and doesn't. If it has to go to main memory then the number of cycles will vary with clock speed because the memory bus is on a different clock.
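
    The footprint arithmetic above can be sketched as follows. The 32 KB L1 data cache size is taken from the reasoning in this reply; the function names are hypothetical, and the check ignores associativity conflicts and anything else competing for the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes touched by n_loads loads of bytes_per_load each, with no reuse. */
size_t footprint_bytes(size_t n_loads, size_t bytes_per_load)
{
    /* Guard against size_t overflow in the multiplication. */
    assert(bytes_per_load == 0 || n_loads <= SIZE_MAX / bytes_per_load);
    return n_loads * bytes_per_load;
}

/* True if the whole footprint could be resident in a cache of cache_bytes. */
bool fits_in_cache(size_t n_loads, size_t bytes_per_load, size_t cache_bytes)
{
    return footprint_bytes(n_loads, bytes_per_load) <= cache_bytes;
}
```

    For 32-byte loads and a 32 KB cache, 1024 loads just fit while 2048 loads (64 KB) do not.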

    I don't know how much work your unrolled version does, so I can't really say what it should take in an ideal case. It's possible that you're exhausting the L1 instruction cache. The cycles for the non-unrolled version look like what I would expect. Each load instruction takes four cycles. There are 3 cycles for the load itself: NEON can access the L1 cache at 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned it needs to perform three 128-bit loads to access 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back. Since the memory base register access and update happen in different cycles, you get a stall when you do this.
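
    The cycle accounting in this paragraph can be written down as a toy model. It is only a restatement of the reasoning here, not a Cortex-A8 timing simulator: it ignores loop overhead, cache misses and dual issue. The function names and the 256 x 4 loop shape in the note below are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rough issue cost of one 256-bit vld1:
 *  - L1 delivers 128 bits per cycle at an aligned address, so an aligned
 *    256-bit load needs 2 accesses and a possibly unaligned one needs 3;
 *  - one extra stall cycle when the previous instruction post-incremented
 *    the same base register. */
uint32_t load_cycles(bool aligned_128, bool same_base_back_to_back)
{
    uint32_t cycles = aligned_128 ? 2u : 3u;
    if (same_base_back_to_back)
        cycles += 1u;
    return cycles;
}

/* Total load cycles for a loop issuing loads_per_iter loads per iteration. */
uint32_t loop_load_cycles(uint32_t iterations, uint32_t loads_per_iter,
                          bool aligned_128, bool same_base_back_to_back)
{
    return iterations * loads_per_iter *
           load_cycles(aligned_128, same_base_back_to_back);
}
```

    For a loop of 256 iterations with four loads each, the model gives 4096 cycles for unaligned back-to-back loads on one pointer and 2048 cycles for aligned loads on alternating pointers.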

    You can save two cycles if you use 128-bit aligned addresses and use two different pointers, like this:


    mov const64, #64
    add pSrcB, pSrc, #32

    fftCopyLoop
    vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ], const64
    vld1.f32 { d4, d5, d6, d7 }, [ pSrcB, :128 ], const64
    vld1.f32 { d8, d9, d10, d11 }, [ pSrc, :128 ], const64
    vld1.f32 { d12, d13, d14, d15 }, [ pSrcB, :128 ], const64
    subs count, #1
    bgt fftCopyLoop


    This assumes that pSrc is 128-bit aligned in the first place. Unrolling the code more than this is pointless. Even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads.
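
    On the C side, the 128-bit alignment assumption can be checked before calling the assembly routine. This is a minimal sketch using C11 `alignas`; `fft_buf` and `is_aligned` are hypothetical names, and a toolchain without C11 support would need its own alignment attribute instead.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input buffer, forced to a 16-byte (128-bit) boundary. */
alignas(16) float fft_buf[2048];

/* True if p is a multiple of alignment bytes (alignment must be non-zero). */
bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}
```

    A call such as `assert(is_aligned(fft_buf, 16));` ahead of the NEON routine catches a misaligned buffer early, since an alignment hint like `:128` on a misaligned address faults rather than silently running slower.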

    Not that any of this really matters since the code doesn't do anything, but maybe it'll help you in the future.

    (also, in the future could you please copy and paste code inside code tags instead of pasting a screenshot of your code, so we don't have to manually re-type it)
  • Note: This was originally posted on 26th March 2013 at http://forums.arm.com

    Yes, please post your code. If it's large I recommend using something like pastebin.com instead of posting it directly.

    When you say you're using on-chip SRAM, are you referring to the 64KB at 0x40200000? This is still on the other side of the L3 bus, so you'd definitely be accessing it at bus speed and not at something derived from the CPU clock. I can't find any SRAM internal to the CPU unless you're using part of L2 cache in lockdown. If you're going through L3 that's a 200MHz clock, if set up correctly of course.
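
    To see why a fixed-speed bus makes the cycle count depend on the CPU clock, here is a small sketch. The 200 MHz bus clock is the figure mentioned above; the 10-bus-cycle latency in the note below is a made-up illustrative number, and `cpu_cycles_for_bus_wait` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles that elapse while waiting bus_cycles cycles of a bus clocked
 * at bus_hz, as counted by a core clocked at cpu_hz (rounded up). */
uint64_t cpu_cycles_for_bus_wait(uint64_t bus_cycles, uint64_t cpu_hz,
                                 uint64_t bus_hz)
{
    assert(bus_hz != 0);
    return (bus_cycles * cpu_hz + bus_hz - 1) / bus_hz;
}
```

    A wait of 10 bus cycles at 200 MHz is always 50 ns, but the core counts it as 3 cycles at 50 MHz, 15 cycles at 300 MHz and 30 cycles at 600 MHz, so a memory-bound loop appears to take more cycles as the CPU clock rises.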

    Still not sure why you're getting what appears to be variable perf/MHz for your larger data set if L2 cache is enabled. It could be that the page attributes aren't set up correctly, or that something is uninitialized with NEON. There's a lot of stuff to set up.
  • Note: This was originally posted on 29th March 2013 at http://forums.arm.com

    As far as the whole configuration for the DM3730 goes, I don't have any real experience with it and I don't think you'll get a lot of help here. Maybe you should ask on TI's forums? For instance here: http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/537.aspx You could also try the BeagleBoard newsgroup http://beagleboard.org/discuss

    I think from what you've said that it's clear at least that the block labeled local interconnect running on ARM_FCLK isn't connected to L3. That you have to set the two separate PLLs correctly proves that they're not on the same clock domain. You can happen to set it to a value that scales the way you want because you're using such low CPU clock speeds, but if you want to run the CPU at 1GHz you won't be able to run L3 at half the clock rate.

    Still not really sure why the performance seems to suggest your data isn't going through L2 cache. Maybe the page tables aren't set up to allow this for the internal SRAM. That makes sense since it's supposed to be shared, but it doesn't make sense that it'd still be cached in L1, which is what appears to be the case.

    When I mentioned L2 cache in lockdown I'm referring to this feature:

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Chdeghcb.html

    If you use L2 in lockdown you can treat it kind of like a scratchpad memory, but it still needs to be backed by some real RAM. Anyway, since you've confirmed you aren't doing this it isn't really important.