How many clock cycles does a "for" loop take?

Raul77 over 3 years ago

Hello,

I am working with a Cortex-M3 microcontroller (LPC1768) and I want to know how many clock cycles the following for loop takes:

for(i=0;i<1;i++);

Thanks

0 Former Member over 3 years ago
Using Godbolt's online Compiler Explorer (set to gcc 9.2.1, although the clang output is similar enough), you will find that with no optimization the loop compiles to:

    movs r3, #0
    str  r3, [r7, #4]
.loop:
    ldr  r3, [r7, #4]
    cmp  r3, #0
    bgt  .loopexit
    ldr  r3, [r7, #4]
    adds r3, r3, #1
    str  r3, [r7, #4]
    b    .loop
.loopexit:
    ...

With any optimization enabled, the loop is removed, because it has no observable effect; its only possible use would be as a timing delay.

So then, you'd need to look up the cycle count for each instruction, which is in the technical reference manual (TRM).

MOV/CMP/ADD are 1 cycle

STR/LDR are 2 cycles

Branches are 1 cycle, plus a pipeline reload if the branch is taken (adding 2 cycles).
The TRM lists unconditional branches with and without stalls, but I don't really know which applies here: 1 cycle or 3. Some professor somewhere says 3.

This would loop once, so ...
1 + 2 + 2 + 1 + 1 (branch not taken) + 2 + 1 + 2 + 3 (branch taken) + 2 + 1 + 3 (branch taken)
In other words, 21 instruction cycles without optimization if I'm not missing anything (like byte alignment or anything else). Just glancing at the clang output, it looks like it has one more taken branch and would thus take 24 instruction cycles.
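
The tally above can also be written out as a small host-side calculation. This is only a sketch using the cycle counts quoted in this post (1 for MOV/CMP/ADD, 2 for LDR/STR, 1 for a branch not taken, 3 for a branch taken). The function name estimate_cycles and the generalisation to more than one iteration are my own assumptions, and it ignores flash wait states and alignment.

```c
/* Cycle costs as quoted in the post above (assumed, not measured). */
enum {
    CYC_ALU      = 1,  /* MOVS / CMP / ADDS */
    CYC_MEM      = 2,  /* LDR / STR */
    CYC_BR_FALL  = 1,  /* conditional branch, not taken */
    CYC_BR_TAKEN = 3   /* taken branch: 1 + 2 for the pipeline reload */
};

/* Hypothetical helper: estimated cycles for the unoptimized loop
 * when its body runs `iterations` times. */
unsigned estimate_cycles(unsigned iterations)
{
    unsigned setup = CYC_ALU + CYC_MEM;           /* movs, str */
    unsigned check = CYC_MEM + CYC_ALU;           /* ldr, cmp */
    unsigned body  = CYC_MEM + CYC_ALU + CYC_MEM  /* ldr, adds, str */
                   + CYC_BR_TAKEN;                /* b .loop */

    return setup
         + iterations * (check + CYC_BR_FALL + body) /* passes that stay in the loop */
         + check + CYC_BR_TAKEN;                     /* final check, bgt taken */
}
```

Under these assumptions estimate_cycles(1) gives 21, matching the sum above, and each additional iteration adds 12 cycles.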

Still, Andy has the right answer: it depends. Without an instruction that has an observable effect (like changing a pin output), the optimizer will erase the loop. Even with useful code in the loop, you'd probably need more than one iteration to see meaningful differences between the optimization levels.
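
If the loop really is meant as a delay, one common way to keep the optimizer from deleting it is to make the counter volatile. A minimal sketch, assuming a hypothetical helper named spin; the time per iteration still depends on the compiler and optimization level, so it has to be measured rather than assumed.

```c
#include <stdint.h>

/* Hypothetical busy-wait helper. Because `i` is volatile, every read
 * and write of it must actually be performed, so the compiler cannot
 * remove the loop. Returns the number of passes executed. */
uint32_t spin(uint32_t iterations)
{
    volatile uint32_t i;

    for (i = 0; i < iterations; i++) {
        /* intentionally empty */
    }
    return i;
}
```

Called as spin(1000), this runs 1000 passes at any optimization level, although the cycles per pass can still differ between levels.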

0 WestfW over 3 years ago in reply to Former Member

Don't forget that the operation of the "flash accelerator" may (or may not) make the timing of instruction fetches somewhat non-deterministic :-(

0 42Bastian Schick over 3 years ago in reply to WestfW

If I want "deterministic" instruction execution, I'd go for a 6502 ;-)

0 WestfW over 3 years ago in reply to 42Bastian Schick

Some chips offer the feature of turning off any "flash acceleration" in order to achieve determinism (but slower.)

And some have "tightly coupled RAM" memory where you can stick code to run at full rate...

But some SAMD21 cycle-counting delay code in Arduino went all wonky when ported to the SAMD51 (which has an actual cache rather than just flash acceleration), even though it was theoretically adjusted for the change in clock rate. github.com/.../71
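
For context, cycle-counting delay code of this kind usually boils down to a conversion like the one below. This is a hypothetical sketch, not the Arduino code referred to above: the function name and the cycles_per_iter parameter are assumptions. Rescaling cpu_hz accounts for a clock change, but cycles_per_iter is only constant if instruction fetch timing is, which is exactly what a cache or flash accelerator can break.

```c
#include <stdint.h>

/* Hypothetical helper: number of delay-loop passes needed for a given
 * delay, assuming every pass costs exactly `cycles_per_iter` cycles.
 * `cycles_per_iter` must be non-zero. */
uint32_t delay_iterations(uint32_t microseconds,
                          uint32_t cpu_hz,
                          uint32_t cycles_per_iter)
{
    /* 64-bit intermediate so microseconds * cpu_hz cannot overflow. */
    uint64_t cycles = (uint64_t)microseconds * cpu_hz / 1000000u;

    return (uint32_t)(cycles / cycles_per_iter);
}
```

If the real cost per pass varies with cache hits and misses, the result is wrong no matter how carefully cpu_hz is adjusted.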
