In the DSP library files such as arm_conv_f32 and arm_fir_f32, the algorithm is implemented differently for Cortex-M3/M4 and for Cortex-M0: loop unrolling is used for M3/M4 but not for M0.
Please tell me the reason behind this. Is there any advantage to using loop unrolling on M3/M4?
Thanks
Indu
Loop-unrolling is used in order to gain speed.
This is accomplished by reducing the number of times the loop branches back, because those branches use clock cycles without doing any useful work.
But the cost is program space.
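To make the trade-off concrete, here is a minimal sketch in C of the same dot product written rolled and unrolled by four. The function names dot_rolled and dot_unrolled4 are made up for illustration and are not taken from the DSP library. The unrolled version executes one loop branch per four multiply-accumulates instead of one per multiply-accumulate, but its loop body is roughly four times as large.

```c
#include <stddef.h>

/* Rolled: one compare-and-branch for every multiply-accumulate. */
float dot_rolled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Unrolled by 4: one compare-and-branch for every four
 * multiply-accumulates, at the cost of a larger loop body. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
    size_t blocks = n / 4;          /* number of full groups of four */

    while (blocks-- > 0) {
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
        i += 4;
    }

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Both functions add the products in the same order, so they return the same value; only the number of branches and the code size differ.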
I believe the reason the loops are not normally unrolled on the Cortex-M0 is that most Cortex-M0 microcontrollers have very little program space.
But if you are using an LPC43xx, you will have plenty of program space, so you can change the settings to enable loop unrolling.
Loop-unrolling will gain speed on all Cortex-M0, Cortex-M0+, Cortex-M3 and Cortex-M4 devices.
However, I think that loop unrolling will not gain any speed at all on the Cortex-M7, because this core has a branch predictor and a Branch Target Address Cache (BTAC), which I believe would make branches execute in zero clock cycles.
The main advantage of loop unrolling is that it lets you schedule the memory accesses better. There can also be savings in register moves and branches, but those are normally secondary. Sometimes one can also merge memory accesses and save a bit that way.
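As a rough illustration of the memory-access point, the sketch below computes a FIR-style sum y[n] = sum over k of h[k] * x[n + k], first one output at a time and then four outputs at a time. The names fir_simple and fir_4out and the indexing convention are assumptions made for this example; this is not the code from arm_fir_f32. In the four-output version each coefficient and each new input sample is loaded once and used for four accumulators, so there are far fewer loads per multiply-accumulate.

```c
#include <stddef.h>

/* One output at a time: two loads (h[k] and x[n + k]) per multiply-accumulate.
 * x must hold nout + taps - 1 samples. */
void fir_simple(const float *x, const float *h, size_t taps,
                float *y, size_t nout)
{
    for (size_t n = 0; n < nout; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++) {
            acc += h[k] * x[n + k];
        }
        y[n] = acc;
    }
}

/* Four outputs at a time: two loads per FOUR multiply-accumulates. */
void fir_4out(const float *x, const float *h, size_t taps,
              float *y, size_t nout)
{
    size_t n = 0;

    for (; n + 4 <= nout; n += 4) {
        float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

        /* Three samples are carried in locals between taps. */
        float x0 = x[n], x1 = x[n + 1], x2 = x[n + 2];

        for (size_t k = 0; k < taps; k++) {
            float c  = h[k];           /* one coefficient load */
            float x3 = x[n + k + 3];   /* one new sample load  */

            acc0 += c * x0;
            acc1 += c * x1;
            acc2 += c * x2;
            acc3 += c * x3;

            /* Slide the sample window along by one. */
            x0 = x1;
            x1 = x2;
            x2 = x3;
        }

        y[n]     = acc0;
        y[n + 1] = acc1;
        y[n + 2] = acc2;
        y[n + 3] = acc3;
    }

    /* Remaining outputs when nout is not a multiple of 4. */
    for (; n < nout; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; k++) {
            acc += h[k] * x[n + k];
        }
        y[n] = acc;
    }
}
```

The loads in the inner loop are also grouped ahead of the arithmetic that uses them, which is the kind of scheduling freedom an unrolled body gives the compiler or the assembly programmer.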
daith wrote: The main advantage of loop unrolling is to schedule the memory accesses better.
Yes, this is true; I didn't think about that, because the question was about the Cortex-M0, where scheduling of LDR instructions won't matter.
Still, it's possible to merge memory accesses on the Cortex-M0, which in some cases can turn an impossible task into a possible one.
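One way to read "merging memory accesses" is fetching two 16-bit samples with a single 32-bit load once the loop is unrolled by two. The sketch below shows the idea in portable C; the names dot_q15_halfword and dot_q15_merged are hypothetical. The memcpy calls stand in for a word load: a compiler may turn each one into a single LDR when it knows the data is word-aligned, but that is an assumption about the toolchain, not a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference: one 16-bit load per sample. */
int32_t dot_q15_halfword(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}

/* Unrolled by 2: each 32-bit word carries two 16-bit samples. */
int32_t dot_q15_merged(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        uint32_t wa, wb;
        memcpy(&wa, &a[i], sizeof wa);   /* two samples of a in one word */
        memcpy(&wb, &b[i], sizeof wb);   /* two samples of b in one word */

        /* Split each word into halves. Both words are split the same way,
         * so the pairing is correct on either endianness. The narrowing
         * casts rely on the usual two's-complement wrap-around. */
        int16_t a_lo = (int16_t)(uint16_t)(wa & 0xFFFFu);
        int16_t a_hi = (int16_t)(uint16_t)(wa >> 16);
        int16_t b_lo = (int16_t)(uint16_t)(wb & 0xFFFFu);
        int16_t b_hi = (int16_t)(uint16_t)(wb >> 16);

        sum += (int32_t)a_lo * b_lo;
        sum += (int32_t)a_hi * b_hi;
    }

    /* Odd sample left over. */
    for (; i < n; i++) {
        sum += (int32_t)a[i] * b[i];
    }
    return sum;
}
```

Without unrolling there is only one sample per iteration, so there is nothing to merge; the saving only appears once two or more samples are handled in the same loop body.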