Why is my Cortex-M4 taking too many cycles?

Note: This was originally posted on 10th September 2012 at http://forums.arm.com

Dear Arm-experts,

I wanted to use the FPU of my STM32F4 (Cortex-M4). To check that it is working properly, I compared my results with this page:
http://www.micromouseonline.com/2011/10/26/stm32f4-the-first-taste-of-speed/?doing_wp_cron=1347294891.0981290340423583984375

He is using exactly the same processor and toolchain (with the GCC compiler).
Here is how long it takes with my settings:


Columns: reference // my controller running from flash // my controller running from SRAM

long lX, lY, lZ;
lX = 123L; // 2 cycle // 2 cycle // 5 cycles
lY = 456L; // 2 cycle // 3 cycles // 3 cycles
lZ = lX*lY; // 5 cycles // 7 cycles // 9 cycles
fX = 123.456; // 3 cycles // 5 cycles // 4 cycles
fY = 9.99; // 3 cycles // 5 cycles // 4 cycles
fZ = fX * fY; // 6 cycles // 10 cycles // 10 cycles
fZ = sqrt(fY); // 20 cycles // 2742 cycles // 3405 cycles
fZ = sin(1.23); // 124 cycles // 1918 cycles // 2552 cycles

The settings are:
       Arm architecture: v7EM
       Arm core type: Cortex-M4
       Arm FP ABI type: Soft-FP (or Hard, doesn't make a huge difference)
       Arm FPU type: FPv4-SP-D16
       GCC target: arm-unknown-eabi

So not only is the floating-point arithmetic running slower, but the integer arithmetic as well! And sin and sqrt are horrible!!
The offset of my cycle measurement has been deducted.
CP10 and CP11 are both set to 0b11, so the FPU should be enabled properly.
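
For readers who want to repeat the CP10/CP11 check, here is a minimal sketch. It assumes the fields are read from the Coprocessor Access Control Register (CPACR) at 0xE000ED88, where CP10 is bits 21:20 and CP11 is bits 23:22. The helper names are hypothetical and not part of the original post. The functions take the register value as an argument so that the logic can also be tried on a host PC.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPACR_ADDR      0xE000ED88u   /* Coprocessor Access Control Register */
#define CPACR_CP10_CP11 (0xFu << 20)  /* CP10 = bits 21:20, CP11 = bits 23:22 */

/* Hypothetical helper: true if both CP10 and CP11 are 0b11 (full access). */
bool fpu_access_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_CP10_CP11) == CPACR_CP10_CP11;
}

/* Hypothetical helper: the same register value with both fields set to 0b11. */
uint32_t fpu_access_enable(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11;
}

/* On the target (not on a host PC) the register would be accessed like this:
 *   volatile uint32_t *cpacr = (volatile uint32_t *)CPACR_ADDR;
 *   if (!fpu_access_enabled(*cpacr)) { *cpacr = fpu_access_enable(*cpacr); }
 */
```

Note that an enabled FPU only allows FPU instructions to execute. Whether the compiler emits them depends on its own options, for example GCC's `-mfpu=fpv4-sp-d16` together with `-mfloat-abi=softfp` or `-mfloat-abi=hard`.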


Do you have any idea what is wrong with my settings or my toolchain or whatever??

Thank you so much for your efforts!

Florian
  • Note: This was originally posted on 10th September 2012 at http://forums.arm.com

    Try using the single precision sinf() and sqrtf() rather than the double precision functions.
    Also, you should probably try using single precision constants, such as 1.23f rather than double precision 1.23.

    hth
    s.
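
The advice above can be illustrated with a small sketch. It uses `sizeof`, which does not evaluate its operand, to show which type the compiler picks for each expression. The function names are hypothetical. A result of type `double` means the arithmetic is done in double precision, which the single-precision FPv4-SP FPU cannot do in hardware, so it falls back to software routines.

```c
#include <math.h>
#include <stddef.h>

/* Each function returns the size of the expression's result type.
 * sizeof does not evaluate its operand, so no math call is made here. */

/* sqrt() takes and returns double: the float argument is promoted. */
size_t size_of_sqrt_result(float y)  { return sizeof(sqrt(y)); }

/* sqrtf() takes and returns float: the value stays in single precision. */
size_t size_of_sqrtf_result(float y) { return sizeof(sqrtf(y)); }

/* 9.99 is a double constant, so the multiplication is done in double. */
size_t size_of_mul_double_const(float x) { return sizeof(x * 9.99); }

/* 9.99f is a float constant, so the multiplication stays in float. */
size_t size_of_mul_float_const(float x)  { return sizeof(x * 9.99f); }
```

Applied to the code in the question, this suggests writing `fZ = sqrtf(fY);` and `fZ = sinf(1.23f);`. GCC's `-Wdouble-promotion` warning can help to spot the remaining implicit promotions.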
  • Note: This was originally posted on 10th September 2012 at http://forums.arm.com

    Thank you for your advice! The sinf and sqrtf functions save a lot of time! I didn't know that.
    But I don't understand why he is running exactly the same code/compiler/MCU (even the integer arithmetic!) and needs fewer cycles than my controller.

    Best regards,
    Florian



  • Note: This was originally posted on 10th September 2012 at http://forums.arm.com

    I forgot to say that fX, fY and fZ are normal float variables!!

  • Note: This was originally posted on 11th September 2012 at http://forums.arm.com

    To keep things simple, the compiler was also told to treat doubles as floats to restrict everything to the 32-bit float format.

    hth
    s.
  • Note: This was originally posted on 11th September 2012 at http://forums.arm.com

    Thanks for paying so much attention to this! I played around with the settings (treat doubles as floats, optimization levels 0-3, etc.).
    The best results were 12 cycles for integer, against 9 from the website, and 18 cycles for float, compared to 12 from the website...

    This is sooo confusing and I don't know what to do!

    Have a nice evening!
  • Note: This was originally posted on 12th September 2012 at http://forums.arm.com

    How are you attempting to time the execution of each of the instructions?

    s.
  • Note: This was originally posted on 12th September 2012 at http://forums.arm.com

    I saw some code by Joseph Yiu for the Cortex-M3 to count cycles. I added a part to subtract the offset, and this is what came out:


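    int cyc[2], offset;
    float x;
    volatile unsigned int *DWT_CYCCNT  = (volatile unsigned int *)0xE0001004; // cycle counter register
    volatile unsigned int *DWT_CONTROL = (volatile unsigned int *)0xE0001000; // DWT control register
    volatile unsigned int *SCB_DEMCR   = (volatile unsigned int *)0xE000EDFC; // debug exception and monitor control register
    #define STOPWATCH_START { cyc[0] = *DWT_CYCCNT; }
    #define STOPWATCH_STOP  { cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0] - offset; }
    // Enable the cycle counter. These two lines were not in the post as written
    // (presumably done elsewhere); CYCCNT does not count without them.
    *SCB_DEMCR   |= 0x01000000; // TRCENA
    *DWT_CONTROL |= 1;          // CYCCNTENA
    // Measure the overhead of the stopwatch itself around a single nop
    STOPWATCH_START
    __asm volatile("nop");
    cyc[1] = *DWT_CYCCNT; cyc[1] = cyc[1] - cyc[0];
    offset = cyc[1] - 1; // the nop itself is counted as 1 cycle
    // Measure the three integer operations (comments are the reference values)
    STOPWATCH_START
    lX = 123L;   // 2 cycle
    lY = 456L;   // 2 cycle
    lZ = lX*lY;  // 5 cycles
    STOPWATCH_STOP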
    I'm running with optimization level 0, but if I switch to level 3 I save 1 cycle on the 3 integer operations but lose 4 cycles on the 3 float operations...

    This is all so strange!!!

    Thank you very much, Sim!
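
The offset subtraction described above can also be written as two small functions, shown here as a sketch with hypothetical names. The counter is passed in as a pointer, so the same code can read DWT_CYCCNT on the target or an ordinary variable on a host PC. Using unsigned arithmetic keeps the difference correct if the 32-bit counter wraps around once between the two readings.

```c
#include <stdint.h>

/* Hypothetical helper: cycles between two counter readings, minus a
 * previously measured overhead. Unsigned subtraction is modulo 2^32,
 * so a single wrap-around of the counter does not corrupt the result. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop, uint32_t overhead)
{
    return (stop - start) - overhead;
}

/* Hypothetical helper: difference between two back-to-back readings,
 * which approximates the cost of the measurement itself. */
uint32_t measure_overhead(const volatile uint32_t *counter)
{
    uint32_t first  = *counter;
    uint32_t second = *counter;
    return second - first;
}

/* On the target this might be used as:
 *   const volatile uint32_t *cyccnt = (const volatile uint32_t *)0xE0001004;
 *   uint32_t overhead = measure_overhead(cyccnt);
 *   uint32_t t0 = *cyccnt;
 *   ... code under test ...
 *   uint32_t cycles = elapsed_cycles(t0, *cyccnt, overhead);
 */
```

At optimization level 0 the compiler typically stores every variable back to memory, so a measurement like this includes those loads and stores as well as the arithmetic itself.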
