Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing FFT code using NEON on ARM. My target is a BeagleBoard-xM, and I run my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near operating at that frequency. I am currently struggling with a basic understanding of NEON. Could anyone please help me with this?

The following are the sample programs I ran.

Loop code:









Loop-unrolled code:





The following are the results I obtained at different frequencies:
The above does not make any sense: why are the cycles per instruction different at different frequencies?




  • Note: This was originally posted on 27th March 2013 at http://forums.arm.com

    Hey,
    Thank you very much for the inputs.
    I fixed the issue of the code's performance varying with frequency. I was not configuring the DPLLs correctly.
    New clock code:


        *(CM_CLKEN_PLL_var) = 0x00110015;
        // MPU
        *(CM_CLKEN_PLL_MPU_var) = 0x00000015;
        // EMU
        *(CM_CLKSEL1_EMU_var) = 0x02030A50;

        // Set up the PLLs
        // Clock control registers
        *(CM_CLKSEL1_PLL_var) = 0x094C0C00;
        *(CM_CLKSEL2_PLL_var) = 0x0001B00C;
        *(CM_CLKSEL3_PLL_var) = 0x00000009;
        *(CM_CLKEN_PLL_var)   = 0x00310035;

        // WKUP
        *(CM_CLKSEL_WKUP_var) = 0x00000015;

        // Core
        *(CM_ICLKEN1_CORE_var) = 0x00000042;
        *(CM_CLKSEL_CORE_var)  = 0x0000020A;

        // MPU bypass
        // Set the frequency through the M multiplier and N divider:
        //   clksel1 = 0x12580c for 600 MHz
        //   clksel1 = 0x11f40c for 500 MHz
        //   clksel1 = 0x112c0c for 300 MHz
        //   clksel1 = 0x10640c for 100 MHz
        //   clksel1 = 0x10320c for 50 MHz
        *(CM_CLKSEL1_PLL_MPU_var) = 0x12580c;
        *(CM_CLKEN_PLL_MPU_var) = 0x00000035;
        // Wait for bypass. Assumption: bit 0 of CM_IDLEST_PLL_MPU reads 0 in
        // bypass and 1 when locked.
        while (*CM_IDLEST_PLL_MPU_var & 0x01);

        // Enable the PLLs
        // Clock control registers
        *(CM_CLKEN_PLL_var) = 0x00370037;

        // MPU lock
        *(CM_CLKEN_PLL_MPU_var) = 0x00000037;
        // Wait for lock (bit 0 set, under the same assumption).
        while (!(*CM_IDLEST_PLL_MPU_var & 0x01));

        // Increase the trace clock
        *(CM_CLKSEL1_EMU_var) = 0x03020A55;
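
The clksel1 values listed in the comments follow a pattern that is easier to see when the word is split into fields. The sketch below is only an illustration: it assumes the multiplier M sits in bits 18:8 and the divider N in bits 6:0, and that the reference clock is 13 MHz (an assumption chosen because it makes the table's values come out; the actual reference is not stated here). Any further post-divider is ignored, and the helper names are hypothetical, not from any vendor header.

```c
#include <stdint.h>

/* Assumed field layout of CM_CLKSEL1_PLL_MPU (verify against the TRM):
 * bits 18:8 = multiplier M, bits 6:0 = divider N. */
#define MPU_DPLL_MULT_SHIFT 8u
#define MPU_DPLL_MULT_MASK  0x7FFu
#define MPU_DPLL_DIV_MASK   0x7Fu

/* Hypothetical helper: extract the multiplier M. */
static inline uint32_t clksel1_mult(uint32_t clksel1)
{
    return (clksel1 >> MPU_DPLL_MULT_SHIFT) & MPU_DPLL_MULT_MASK;
}

/* Hypothetical helper: extract the divider field N. */
static inline uint32_t clksel1_div(uint32_t clksel1)
{
    return clksel1 & MPU_DPLL_DIV_MASK;
}

/* Replace only the M field, keeping N and the upper bits unchanged. */
static inline uint32_t clksel1_with_mult(uint32_t clksel1, uint32_t m)
{
    uint32_t field = MPU_DPLL_MULT_MASK << MPU_DPLL_MULT_SHIFT;
    return (clksel1 & ~field) | ((m & MPU_DPLL_MULT_MASK) << MPU_DPLL_MULT_SHIFT);
}

/* Output in MHz for a reference clock given in kHz: ref * M / (N + 1). */
static inline uint32_t clksel1_mhz(uint32_t clksel1, uint32_t ref_khz)
{
    return (ref_khz * clksel1_mult(clksel1)) /
           ((clksel1_div(clksel1) + 1u) * 1000u);
}
```

Under these assumptions every value in the table has N = 12, so the reference is divided down to 1 MHz and M is simply the target frequency in MHz. For example, `clksel1_with_mult(0x12580c, 500)` gives `0x11f40c`, matching the 500 MHz entry.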


    It took me a long time to figure this out.
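
The busy-wait loops on CM_IDLEST_PLL_MPU_var above would spin forever if the DPLL never reached the expected state, which is hard to debug on bare metal. A bounded variant can be sketched as follows. The helper name `wait_for_bits` and the poll budget are hypothetical, and the register pointer is assumed to be declared `volatile` so the compiler re-reads it on every iteration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: poll a status register until (reg & mask) == expected,
 * giving up after max_polls reads. Returns true if the state was reached. */
static bool wait_for_bits(const volatile uint32_t *reg, uint32_t mask,
                          uint32_t expected, uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if ((*reg & mask) == expected) {
            return true;
        }
    }
    return false; /* timed out: the caller decides how to report it */
}
```

A lock wait would then read something like `if (!wait_for_bits(CM_IDLEST_PLL_MPU_var, 0x01, 0x01, 100000)) { /* handle failure */ }`, again assuming bit 0 signals lock.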

    Coming to my FFT code: I am using it as a kind of benchmark of the capability of ARM's NEON unit. The code is borrowed and customized from the open-source library NE10, and it is this ARM NEON FFT code that I want to optimize.
    My entire code is at https://code.google..../downloads/list. The NEON FFT code is NE10_cfft.neon1.s and the corresponding C code is NE10_cfft.c. Please help me optimize the FFT code.
    By the way, what is "L2 cache in lockdown"? Is it related to the DSP L2 cache, which is memory mapped? If L2 could be used directly as program memory, my code would run faster. Please shed some light on this.

    Thanks again
    Vamsi