
Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing the code for an FFT algorithm using NEON on ARM. My target is a BeagleBoard-xM, and I am running my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near operating at that frequency. Currently I am having difficulty with the basic understanding of NEON. Could anyone please help me with this?

The following are sample programs I ran. LOOP CODE:









Loop Unrolled code:





The following are the results I ran for different frequencies
The above does not make any sense: why are the cycles per instruction different at different frequencies?




  • Note: This was originally posted on 27th March 2013 at http://forums.arm.com

    I am extremely sorry!! I seriously regret it. You have been so kind to me. Instead of upvoting your reply I accidentally downvoted it, and there was no option to edit the vote.

    "I had partially solved the issue by the evening, but I thought I would keep the post and put my updated query in a reply to it."

    Yes, I am using the 64 KB on-chip SRAM starting at 0x40200000. I did some study on the SRAM and found the following.
    http://s20.postimg.o...cks_picture.png


    From the above figures I inferred the following:

    MPU operating frequency:
    MPU_CLK = ARM_FCLK (hardware divider = 1).
    ARM_FCLK = core clock of the Cortex-A8.
    L3 interconnect frequency:
    AXI_FCLK (L3 interconnect) = ARM_FCLK/2 (hardware divider of 2).
    SRAM frequency:
    Operates at the L3 interconnect frequency.

    Please correct me if I am wrong. I thought I only needed to configure DPLL1 in the DM3730 for the Cortex-A8 and the SRAM to be up and running.
    After endless digging through the manuals and going through the boot code (X-Loader), I finally devised some clock code to set the frequency of DPLL1. (I assumed that the Cortex-A8 runs at the DPLL1 output frequency and the SRAM runs at half the DPLL1 output frequency.)


    Clock address definitions


    #define CM_CLKEN_PLL_MPU    0x48004904
    #define CM_CLKSEL1_PLL_MPU  0x48004940
    #define CM_CLKSEL2_PLL_MPU  0x48004944
    #define CM_IDLEST_PLL_MPU   0x48004924
    #define CM_CLKSEL2_PLL_MPU_var      (volatile unsigned int *)(CM_CLKSEL2_PLL_MPU)
    #define CM_CLKSEL1_PLL_MPU_var      (volatile unsigned int *)(CM_CLKSEL1_PLL_MPU)
    #define CM_IDLEST_PLL_MPU_var    (volatile unsigned int *)(CM_IDLEST_PLL_MPU)
    #define CM_CLKEN_PLL_MPU_var        (volatile unsigned int *)(CM_CLKEN_PLL_MPU)





    // Unlock the DPLL: request low-power bypass mode (EN_MPU_DPLL = 0x5)
    *CM_CLKEN_PLL_MPU_var = 0x35;
    // Wait for the DPLL to enter bypass mode (ST_MPU_CLK, bit 0, reads 0).
    // The original mask of 0x00 made this loop a no-op.
    while (*CM_IDLEST_PLL_MPU_var & 0x01);

    // clksel1 = 0x12580c for 600 MHz
    // clksel1 = 0x112c0c for 300 MHz
    // clksel1 = 0x10640c for 100 MHz
    // clksel1 = 0x10320c for 50 MHz
    // Set M, N and the clock dividers
    *CM_CLKSEL1_PLL_MPU_var = 0x10320c;
    // Set M2
    *CM_CLKSEL2_PLL_MPU_var = 0x01;
    // Put the DPLL back into lock mode (EN_MPU_DPLL = 0x7)
    *CM_CLKEN_PLL_MPU_var = 0x37;
    // Wait for DPLL1 to lock on the frequency (ST_MPU_CLK, bit 0, reads 1).
    // The original test was inverted and exited while still in bypass.
    while (!(*CM_IDLEST_PLL_MPU_var & 0x01));
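
As a sanity check on the clksel1 values in the comments above, here is a minimal sketch that decodes them. It assumes the CM_CLKSEL1_PLL_MPU layout puts the multiplier M in bits 18:8 and the divider N in bits 6:0, and that the DPLL output is ref * M / ((N + 1) * M2); please verify both against the DM37x TRM. The helper names `decode_clksel1` and `dpll_out_khz` are made up for this illustration, and the 13 MHz reference in the note below is an assumption, not something stated in the post.

```c
#include <stdint.h>

/* Assumed CM_CLKSEL1_PLL_MPU field layout (check the DM37x TRM):
 *   bits 18:8 = multiplier M
 *   bits  6:0 = divider N (the hardware divides by N + 1) */
typedef struct {
    uint32_t m;  /* DPLL multiplier */
    uint32_t n;  /* DPLL divider field */
} dpll_mn;

/* Hypothetical helper: extract M and N from a raw CLKSEL1 value. */
static dpll_mn decode_clksel1(uint32_t clksel1)
{
    dpll_mn f;
    f.m = (clksel1 >> 8) & 0x7FFu;
    f.n = clksel1 & 0x7Fu;
    return f;
}

/* Hypothetical helper: DPLL output in kHz, assuming
 * out = ref * M / ((N + 1) * M2). The caller supplies the
 * reference clock in kHz and the M2 post-divider. */
static uint32_t dpll_out_khz(uint32_t ref_khz, dpll_mn f, uint32_t m2)
{
    uint64_t num = (uint64_t)ref_khz * f.m;
    uint64_t den = (uint64_t)(f.n + 1u) * m2;
    return (uint32_t)(num / den);
}
```

With an assumed 13 MHz reference and M2 = 1, `dpll_out_khz(13000, decode_clksel1(0x10320c), 1)` gives 50000 kHz, and the other three values decode to 100, 300 and 600 MHz, which matches the comments. If the reference on your board is different, the same register values would give different frequencies.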


    By the way, what is "L2 cache in lockdown"? Is it related to the DSP L2 cache, which can be used as memory-mapped RAM? If L2 could be used directly as program memory, my code would run faster.

    Luckily I am using ARM DS-5 and the initialization code came with it. It looks like everything is configured correctly: http://pastebin.com/hKeFGRmW
    After a hardware reset, Reset_Handler() takes over and configures everything related to the Cortex-A8. In main(), I explicitly call enable_caches() and then configure the clocks as shown above.
    main(): http://pastebin.com/3QRdJjRX


    Coming to my FFT code: I am using it as a kind of benchmark of the capability of ARM's NEON unit. The code is borrowed and customized from an open-source library called Ne10.
    ARM NEON FFT code to be optimized: http://pastebin.com/d6DNVygs

    I do not know whether I have violated any open-source license terms. The entire code is at https://code.google..../downloads/list
    Initially I was getting 45000 cycles at 50 MHz and 76000 cycles at 300 MHz. I was stuck here for two weeks. After you suggested the instruction alignment, the results were 33000 cycles at 50 MHz and 66000 cycles at 300 MHz.
    As you can see, the cycle count dropped by at least 10000 at both frequencies. I suspect that memory is the issue.

    I ran the FFT NEON code with only the load, store and loop counter instructions (removing the VMUL, VADD, ... instructions), and the following are the results.
    http://s20.postimg.o..._and_stores.png



    As you can see, in my case the actual computation runs for only 20,000 cycles and the loads and stores take up most of the cycles. Comparing the last two columns of the above table, I find that the difference between the "loads and stores" counts at different frequencies is the same as the difference between the "FFT" counts at different frequencies.
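
One way to read these numbers, as a rough sketch only: assume the cycle count splits into a part that scales with the CPU clock and a part that is fixed in wall-clock time (for example, waiting on a memory that does not speed up with the core). Then cycles(f) = core_cycles + stall_us * f_mhz, and two measurements are enough to solve for both terms. The model and the names `stall_fit` and `fit_stall_model` are assumptions made for this illustration; nothing here is measured on the board.

```c
/* Two-point fit of the assumed model:
 *   cycles(f) = core_cycles + stall_us * f_mhz
 * core_cycles: work that takes the same number of CPU cycles at any clock.
 * stall_us:    time in microseconds that stays constant in wall-clock
 *              terms, so it costs more CPU cycles as the clock rises. */
typedef struct {
    double core_cycles;
    double stall_us;
} stall_fit;

/* Hypothetical helper: solve the model from two (MHz, cycles) samples.
 * The two frequencies must differ. */
static stall_fit fit_stall_model(double f1_mhz, double cycles1,
                                 double f2_mhz, double cycles2)
{
    stall_fit r;
    /* Slope: extra cycles per extra MHz equals the fixed time in us. */
    r.stall_us = (cycles2 - cycles1) / (f2_mhz - f1_mhz);
    /* Intercept: what remains once the fixed-time part is removed. */
    r.core_cycles = cycles1 - r.stall_us * f1_mhz;
    return r;
}
```

Feeding in the aligned results quoted above (33000 cycles at 50 MHz, 66000 cycles at 300 MHz) gives stall_us = 132 and core_cycles = 26400. Under this assumption, about 132 microseconds of every run would be spent waiting regardless of the CPU clock, which would be consistent with the loads and stores accounting for the whole difference between frequencies. A third measurement at another frequency would show whether the straight-line model actually holds.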

    My main aim is to reduce the number of cycles taken and also fix the issues related to variation of cycles with respect to frequencies.

    Any inputs will be very helpful to me. I have been working on this for 3 months now.




    Yes, please post your code. If it's large I recommend using something like pastebin.com instead of posting it directly.

    When you say you're using on-chip SRAM, are you referring to the 64KB at 0x40200000? This is still on the other side of the L3 bus, so you'd definitely be accessing it at bus speed and not something derived from the CPU clock. I can't find any SRAM internal to the CPU unless you're using part of L2 cache in lockdown. If you're going through L3, that's a 200MHz clock, if set up correctly of course.

    Still not sure why you're getting what appears to be variable perf/MHz for your larger data set if L2 cache is enabled. It could be that page attributes aren't set up correctly, or something is uninitialized with NEON. There's a lot of stuff to set up.