Hi there,

we are working on an audio project in which we are porting some firmware from an STM32F407 (ARM Cortex-M4) to an ATSAME70 (ARM Cortex-M7). Although the ATSAME70 runs at 300 MHz and the STM32F407 at only 168 MHz, the ATSAME70 is definitely slower in execution speed. The MCU clocks have been checked and are (seemingly?) correct.

Here are some details:
- Audio is captured by an external audio codec (SGTL5000) and transferred via I2S to the MCU as 16-bit stereo at a 44.1 kHz sampling rate.
- On the MCU, the DMA collects 32 stereo samples before it calls an ISR, where the captured data is processed in real time.
- In the ISR a reverberation algorithm is called. The entire source code uses only internal memory and is compiled at the "O3" optimisation level on both MCUs.
- The ISR is called every 0.726 msec (= 1/(44100/32)), so this is the time available for its execution. All times have been measured with an oscilloscope, by setting and resetting a GPIO pin at the start and end of the ISR.
- I made sure the ATSAME70 actually runs at 300 MHz by programming Timer0 to output a 1 kHz signal, which is correct.
Now, on the STM32F407 the algorithm needs about 0.22 msec to execute; on the ATSAME70 it needs 0.40 msec instead, which is far too long. What am I missing here?

Please give me any clue you can think of, and thanks a lot in advance!

Michael
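To put the measured times into perspective, it can help to convert them into core clock cycles and into a share of the buffer period. The following is only a minimal sketch in C; the function names are made up for illustration, and the inputs are the figures quoted in the post (32 frames at 44.1 kHz, 0.22 msec at 168 MHz, 0.40 msec at 300 MHz).

```c
/* Time between two DMA interrupts: one buffer of `frames` stereo frames
   at `sample_rate_hz` frames per second. */
double buffer_period_s(unsigned frames, double sample_rate_hz)
{
    return (double)frames / sample_rate_hz;
}

/* Core clock cycles consumed by a routine that runs for `time_s` seconds
   on a core clocked at `clock_hz`. */
double cycles_spent(double time_s, double clock_hz)
{
    return time_s * clock_hz;
}

/* Fraction of the buffer period that the ISR occupies (0.0 to 1.0). */
double isr_load(double isr_time_s, double period_s)
{
    return isr_time_s / period_s;
}
```

With the numbers from the post, `cycles_spent(0.22e-3, 168e6)` gives 36960 cycles on the STM32F407 and `cycles_spent(0.40e-3, 300e6)` gives 120000 cycles on the ATSAME70, so the Cortex-M7 spends roughly 3.2 times as many cycles on the same algorithm. The ISR load is about 30% of the 0.726 msec period on the STM32F407 and about 55% on the ATSAME70.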
Yes, we enabled both caches. They give us about 8-10% more speed compared to disabling them. That's it. The DMA buffer is non-cacheable.
Then we moved the more important routines, and all const data, from Flash into RAM, which gives another 15% more speed.
The wait states in the Flash controller register field EEFC_FMR.FWS are set to 5; we tried to lower them. At 4 wait states it still works, but I don't trust it; at 3, execution is definitely impossible.
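As a rough illustration of what the wait state setting means in time, the sketch below converts an FWS value into a flash read latency. It assumes that one flash read takes FWS + 1 cycles of the clock feeding the flash controller; the 150 MHz clock used in the note afterwards is purely an assumption (half of the 300 MHz core clock), since the thread does not state the master clock. The function name is hypothetical.

```c
/* Flash read latency in nanoseconds, assuming one read takes (fws + 1)
   cycles of the clock that feeds the flash controller.
   `flash_clock_hz` is an assumed input, not a value from the thread. */
double flash_read_ns(unsigned fws, double flash_clock_hz)
{
    return (double)(fws + 1u) * 1e9 / flash_clock_hz;
}
```

Under the assumed 150 MHz flash clock, `flash_read_ns(5, 150e6)` is 40 ns, `flash_read_ns(4, 150e6)` is about 33 ns and `flash_read_ns(3, 150e6)` is about 27 ns. Every code or data fetch that misses the cache and goes to Flash pays this latency, which is consistent with the gain seen from moving routines and const data into RAM.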
EEFC_FMR.SCOD and .CLOE are both enabled.
BTW, we did not experience any speed differences between optimization levels O2, O3 or Ofast.
I guess we have to consider another, more powerful MCU. Any suggestions?
Thanks a lot for your hints!
How about the i.MX RT, a 600 MHz Cortex-M7?
Regarding optimization, try -Os (small code is good for caches).
On the other hand, a compiler might not do any loop unrolling at -Os, which might result in poor performance.
The performance also depends on the toolchain: due to pipeline differences between Cortex-M4 and Cortex-M7 processors, different instruction scheduling is needed for best performance. As you mentioned that you have tried -Ofast, you might be using Arm Compiler 6? With Arm Compiler 6 you can also enable LTO (Link Time Optimisation) using
-Omax -flto
(If you use that, you also need to add -Omax to the linker options.)
regards,
Joseph
Thanks, but -Os is far too slow.
Regarding another MCU, we are thinking about making a bigger jump to something like the OSD355x, with a Cortex-A in the 1 GHz range and 512 MB of fast SDRAM (800 MHz). That is definitely future-proof. :-)
Thanks a lot, I didn't know about that! I will look into it.
If the algo is fast enough on a Cortex-M4 @ 168 MHz, I am pretty sure it should be fast enough on a 300 MHz CM7. Anyway, take care with the cache and MMU setup on the CA9 ;-)