Hi there,
we are working on an audio project in which we are moving some firmware from an STM32F407 (ARM Cortex-M4) to an ATSAME70 (ARM Cortex-M7). Although the ATSAME70 runs at 300 MHz and the STM32F407 at only 168 MHz, the ATSAME70 is definitely slower in execution speed. The MCU clocks have been checked and are (seemingly?) correct.
Here are some details:
- Audio is captured by an external audio codec (SGTL5000) and transferred via I2S to the MCU as 16-bit stereo at a 44.1 kHz sampling rate.
- On the MCU, the DMA collects 32 stereo samples before it calls an ISR, where the captured data is processed in real time.
- In the ISR a reverberation algorithm is called. The entire source code uses only internal memory and is compiled at the "O3" optimisation level on both MCUs.
- The ISR is called every 0.726 msec (= 1/(44100/32)), which is the time available for its execution. All times have been measured with an oscilloscope, by setting and resetting a GPIO pin at the start and end of the ISR.
- I made sure the ATSAME70 actually runs at 300 MHz by programming Timer0 to output a 1 kHz signal, which is correct.
Now, on the STM32F407 the algorithm needs about 0.22 msec to execute; on the ATSAME70 it needs 0.40 msec instead, which is far too long. What am I missing here?
Please give me any clue you can think of, and thanks a lot in advance!
Michael
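A minimal sketch of the arithmetic behind these numbers: the block period for 32 frames at 44.1 kHz, the share of that period used by the ISR, and the number of CPU clock cycles each MCU spends in it. The function names are made up for illustration; the inputs are the figures quoted above.

```c
#include <assert.h>

/* Time between two DMA interrupts, in milliseconds. */
double block_period_ms(unsigned frames_per_block, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1000.0 * (double)frames_per_block / sample_rate_hz;
}

/* CPU clock cycles consumed by a routine that runs for time_ms at clock_mhz.
   1 ms at 1 MHz corresponds to 1000 cycles. */
double cycles_used(double time_ms, double clock_mhz)
{
    return time_ms * clock_mhz * 1000.0;
}

/* Fraction of the block period consumed by the ISR (0.0 .. 1.0). */
double isr_load(double isr_time_ms, double period_ms)
{
    assert(period_ms > 0.0);
    return isr_time_ms / period_ms;
}
```

With the measured values, block_period_ms(32, 44100.0) is about 0.726 msec. cycles_used(0.22, 168.0) gives about 37,000 cycles on the STM32F407, while cycles_used(0.40, 300.0) gives 120,000 cycles on the ATSAME70, so the M7 spends roughly 3.2 times as many clock cycles on the same work. The ISR load is about 30% of the block period on the STM32F407 versus about 55% on the ATSAME70.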
Did you enable the caches?
Did you set up the correct wait states?
Thanks for your swift reply!
We do not use any external memory and have not experienced any difference by enabling/disabling the D or I caches. We do know that the D-Cache does not interfere with the DMA transfers between the hardware registers and the internal RAM.
Are there any wait states to consider for accessing the MCU internal RAM or flash (from where the firmware runs)? Are there any reasons those are slower than expected? I am really a bit lost with this issue...
You should at least enable the I-cache, as it accelerates the execution speed.
Sure, there are wait states. I know of no flash that can execute at 300 MHz.
RAM might be zero-wait, but I guess you have at least one.
Regarding D-cache, I suggest marking the part of the RAM used for DMA as non-cacheable.
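A minimal sketch of what such a region could look like, assuming the ARMv7-M MPU of the Cortex-M7. It only computes the value for the MPU region attribute and size register (RASR) and runs on a host, so the encoding can be checked before it is written to hardware. The helper names, the 4 KB region size, and the choice of full read/write access are assumptions for illustration; the real region base, size, and region number depend on the linker script.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR bit fields. */
#define RASR_ENABLE    (1u << 0)
#define RASR_SIZE_POS  1u          /* region size = 2^(SIZE+1) bytes */
#define RASR_B         (1u << 16)  /* bufferable, left at 0 here */
#define RASR_C         (1u << 17)  /* cacheable, left at 0 here */
#define RASR_S         (1u << 18)  /* shareable */
#define RASR_TEX_POS   19u
#define RASR_AP_POS    24u
#define RASR_AP_FULL   3u          /* read/write, privileged and unprivileged */
#define RASR_XN        (1u << 28)  /* execute never */

/* Returns the SIZE field for a power-of-two region of size_bytes (>= 32). */
uint32_t mpu_size_field(uint32_t size_bytes)
{
    uint32_t exponent = 0u;

    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    while ((1u << exponent) < size_bytes) {
        exponent++;
    }
    return exponent - 1u;
}

/* RASR value for a Normal, non-cacheable, shareable, non-executable
   data region: TEX = 0b001, C = 0, B = 0. */
uint32_t mpu_rasr_noncacheable(uint32_t size_bytes)
{
    return RASR_XN
         | (RASR_AP_FULL << RASR_AP_POS)
         | (1u << RASR_TEX_POS)
         | RASR_S
         | (mpu_size_field(size_bytes) << RASR_SIZE_POS)
         | RASR_ENABLE;
}
```

For a 4 KB region, mpu_rasr_noncacheable(4096) yields 0x130C0017. On the target, this value would be written to MPU->RASR after selecting a region number and setting MPU->RBAR to the size-aligned base address of the DMA buffers, with the MPU enabled afterwards.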
Thanks a lot for your hints. Currently I am traveling but next Monday I will try your suggestions and then come back here with what I found out.
Thanks!
You should enable the D-cache as well. Your firmware will contain constants and other literal data that are accessed via the data side. For data regions that are shared with the DMA controller, either:
- place them in D-TCM so that they are not cached (no data coherency issue in such arrangement), or
- use cache maintenance routines to ensure coherency between processor and DMA controller, or
- use the MPU to mark part of the internal data RAM as Non-cacheable.
regards,
Joseph
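For the cache maintenance option above, one detail that is easy to get wrong is that the Cortex-M7 data cache works on whole 32-byte lines. The range passed to a clean or invalidate routine (for example the CMSIS-Core functions SCB_CleanDCache_by_Addr and SCB_InvalidateDCache_by_Addr) should therefore cover complete lines, and a DMA buffer should not share a line with unrelated data. The sketch below only shows the address arithmetic and runs on any host; the names cache_range_t, cache_align_range and is_cache_line_exclusive are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t start;   /* first byte of the first cache line touched */
    uint32_t  length;  /* number of bytes, a multiple of the line size */
} cache_range_t;

/* Expands [addr, addr + size) to whole cache lines. */
cache_range_t cache_align_range(uintptr_t addr, uint32_t size)
{
    cache_range_t r;
    uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    uintptr_t end;

    assert(size > 0u);
    end      = (addr + size + mask) & ~mask;   /* round the end up */
    r.start  = addr & ~mask;                   /* round the start down */
    r.length = (uint32_t)(end - r.start);
    return r;
}

/* Returns 1 if the buffer starts on a line boundary and fills whole lines,
   so invalidating it cannot discard neighbouring data; otherwise 0. */
int is_cache_line_exclusive(uintptr_t addr, uint32_t size)
{
    uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;

    return ((addr & mask) == 0u) && ((size & (uint32_t)mask) == 0u);
}
```

A block of 32 stereo 16-bit frames is 128 bytes, which is exactly four cache lines when the buffer is aligned to 32 bytes. If the same buffer started 16 bytes into a line, the maintenance range would grow to five lines (160 bytes) and would also touch whatever else lives in the first and last line.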
Yes, we enabled both caches. They give us about 8-10% more speed compared to running with them disabled. That's it. The DMA buffer is non-cacheable.
Then we moved the more important routines, and all const data, from Flash into RAM, which gives another 15% more speed.
The wait states in the flash controller register field EEFC_FMR.FWS are set to 5; we tried to lower them. With 4 wait states it still works, but I don't trust it; with 3, execution is definitely impossible.
EEFC_FMR.SCOD and .CLOE are both enabled.
BTW, we did not experience any speed differences between optimization levels O2, O3 or Ofast.
I guess we have to consider another, more powerful MCU. Any suggestions?
Thanks a lot for your hints!
How about the i.MX RT: a 600 MHz Cortex-M7?
Regarding optimization, try -Os (small is good for caches)
On the other hand, a compiler might not do any loop unrolling at -Os, which might result in poor performance.
The performance also depends on tool chains - due to pipeline differences between Cortex-M4 and Cortex-M7 processors, different instruction scheduling is needed for best performance. As you mentioned that you have tried -Ofast, you might be using Arm Compiler 6? With Arm Compiler 6 you can also enable LTO (Link Time Optimisation) using
-Omax -flto
(If using that, you also need to add -Omax to the linker options.)
Thanks, but -Os is far too slow.
Regarding another MCU, we are thinking about making a bigger jump to something like the OSD355x with a Cortex A in the 1 GHz range and 512 Mb of fast SDRAM (800 MHz). That is definitely future-proof. :-)
Thanks a lot, I didn't know about that! I will look into it.
If the algo is fast enough on a Cortex-M4 @ 168 MHz, I am pretty sure it should be fast enough on a 300 MHz CM7. Anyway, take care of the cache and MMU setup on the CA9 ;-)