An algorithm on a M7 is slower than on M4 - why?

Hi there,

We are working on an audio project in which we are moving some firmware from an STM32F407 (ARM Cortex-M4) to an ATSAME70 (ARM Cortex-M7). Although the ATSAME70 runs at 300 MHz and the STM32F407 at only 168 MHz, the ATSAME70 is clearly slower in execution. The MCU clocks have been checked and are (seemingly?) correct.

Here are some details:

- Audio is captured by an external audio codec (SGTL5000) and transferred via I2S to the MCU as 16-bit stereo at a 44.1 kHz sampling rate. On the MCU, the DMA collects 32 stereo samples before it triggers an ISR, where the captured data is processed in real time.
- In the ISR a reverberation algorithm is called. The entire source code uses only internal memory and is compiled at the "O3" optimisation level on both MCUs.
- The ISR period, and therefore the time available for each call, is 0.726 msec (= 1/(44100/32) s). All times have been measured with an oscilloscope, by setting and resetting a GPIO pin at the start and end of the ISR.
- I made sure the ATSAME70 actually runs at 300 MHz by programming Timer0 to output a 1 kHz signal, which is correct.

Now, on the STM32F407 the algorithm needs about 0.22 msec to execute. On the ATSAME70 it needs 0.40 msec instead, which is far too long. What am I missing here?
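To put the two measurements on a common scale, they can be converted into core cycles per audio block. The following is a minimal sketch in C that uses only the figures quoted in the post; the function names are illustrative and do not come from the project's firmware.

```c
#include <stdint.h>
#include <stdio.h>

/* Time between two DMA block interrupts in microseconds (rounded down). */
static uint32_t block_period_us(uint32_t frames_per_block, uint32_t sample_rate_hz)
{
    return (uint32_t)(((uint64_t)frames_per_block * 1000000u) / sample_rate_hz);
}

/* Core cycles spent in a routine: microseconds multiplied by MHz gives cycles. */
static uint32_t cycles_used(uint32_t duration_us, uint32_t core_clock_mhz)
{
    return duration_us * core_clock_mhz;
}

/* Share of the block period consumed by the routine, in whole percent. */
static uint32_t load_percent(uint32_t duration_us, uint32_t period_us)
{
    return (duration_us * 100u) / period_us;
}

/* Print the comparison for the numbers given in the post. */
static void print_report(void)
{
    uint32_t period = block_period_us(32u, 44100u);

    printf("block period: %u us\n", (unsigned)period);
    printf("STM32F407: %u cycles, %u%% load\n",
           (unsigned)cycles_used(220u, 168u), (unsigned)load_percent(220u, period));
    printf("ATSAME70 : %u cycles, %u%% load\n",
           (unsigned)cycles_used(400u, 300u), (unsigned)load_percent(400u, period));
}
```

Calling `print_report()` from `main` shows 36960 cycles on the STM32F407 against 120000 cycles on the ATSAME70, so the M7 spends roughly three times as many cycles on the same block. That would be consistent with the core waiting on something (memory, wait states) rather than with a wrong clock setting, although the post itself does not confirm the cause.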

Please give me any clue you can think of and thanks a lot in advance!
Michael

0 Muragavino over 7 years ago in reply to Muragavino

Yes, we enabled both caches. They give us about 8-10% more speed compared to disabling them. That's it. The DMA buffer is non-cacheable.

Then we moved the more important routines, and all const data, from Flash into RAM, which gives another 15% more speed.

The wait-state setting in the Flash controller register EEFC_FMR.FWS is 5; we tried to lower it. At 4 wait states it still works, but I don't trust it; at 3, execution is definitely impossible.

EEFC_FMR.SCOD and .CLOE are both enabled.

BTW, we did not experience any speed differences between optimization levels O2, O3 or Ofast.

I guess we have to consider another, more powerful MCU. Any suggestions?

Thanks a lot for your hints!

0 42Bastian Schick over 7 years ago in reply to Muragavino

How about the i.MX RT: a 600 MHz Cortex-M7?

Regarding optimization, try -Os (small is good for caches)

0 Joseph Yiu over 7 years ago in reply to 42Bastian Schick

On the other hand, a compiler might not do any loop unrolling at -Os, which can result in poor performance.
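As an illustration of what loop unrolling means here, the sketch below shows a generic multiply-accumulate loop in C next to a hand-unrolled version with four independent accumulators. It is not taken from the reverb code in this thread (which was not posted), the function names are made up for the example, and whether a given compiler performs a similar transformation at -O3 but not at -Os depends on the toolchain and version.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain multiply-accumulate loop, as it is typically written in source. */
static int64_t mac_rolled(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}

/* Same result, unrolled by four. The four accumulators do not depend on
 * each other, which gives the compiler and a dual-issue pipeline more
 * freedom to schedule loads and multiplies back to back. */
static int64_t mac_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 += (int32_t)x[i]     * h[i];
        a1 += (int32_t)x[i + 1] * h[i + 1];
        a2 += (int32_t)x[i + 2] * h[i + 2];
        a3 += (int32_t)x[i + 3] * h[i + 3];
    }
    /* Handle the remaining 0..3 elements. */
    for (; i < n; i++) {
        a0 += (int32_t)x[i] * h[i];
    }
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value for any input; the unrolled form trades code size for fewer loop-control instructions, which is the trade-off -Os tends to decide the other way.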

The performance also depends on the toolchain: due to pipeline differences between Cortex-M4 and Cortex-M7 processors, different instruction scheduling is needed for best performance. As you mentioned that you have tried -Ofast, you might be using Arm Compiler 6? With Arm Compiler 6 you can also enable LTO (Link Time Optimisation) using

-Omax -flto

(If using that, you also need to add -Omax to the linker options.)

regards,

Joseph

0 Muragavino over 7 years ago in reply to 42Bastian Schick

Thanks, but -Os is far too slow.

Regarding another MCU, we are thinking about making a bigger jump to something like the OSD355x with a Cortex-A in the 1 GHz range and 512 MB of fast SDRAM (800 MHz). That is definitely future-proof. :-)

0 Muragavino over 7 years ago in reply to Joseph Yiu

Thanks a lot, I didn't know about that! I will look into it.

0 42Bastian Schick over 7 years ago in reply to Muragavino

If the algo is fast enough on a Cortex-M4 at 168 MHz, I am pretty sure it should be fast enough on a 300 MHz CM7. Anyway, take care with the caches and MMU setup on the CA9 ;-)