Hi.
Page 22 of the document below says that the Cortex-M7 has a "zero overhead loops" capability. How is this done? Is there a special instruction for it?
http://community.arm.com/servlet/JiveServlet/downloadBody/9595-102-4-18606/ARM_Cortex_M7_MCU_Johnson.pdf
Ari.
Hello,
A zero-overhead loop usually implies a dedicated loop instruction. However, there is no such instruction in the Thumb-2 ISA. I think the Cortex-M7 "zero-overhead loop" just refers to branch prediction by the BTAC (Branch Target Address Cache).
Best regards, Yasuhiko Koumoto.
Yasuhiko,
If "branch prediction by BTAC" is not present in the Cortex-M4, you are probably right.
According to the document above, this enhancement is only present in the Cortex-M7, not in the Cortex-M4.
Thanks.
The BTAC is only present in the Cortex-M7.
Unfortunately I do not have any hands-on experience regarding this, but I do have a few suggestions.
My suggestions are based on experience with the Cortex-M4 and other architectures, some of which have out-of-order execution.
I think if jyiu is reading this, he can probably provide you with a much better answer.
Try placing your loop condition generation early in the loop, then the action after the condition, and finally the branch at the end.
For instance:
copy_l:
cmp r2,r3
ittt lo
ldrlo r0,[r1],#4
strlo r0,[r2],#4
blo copy_l
Here I've included the branch in the IT block; it might help in getting a zero-overhead loop. Try comparing it with ...
itt lo
... and see if there is any difference.
For a counter-type loop, also keep the condition generation early; like this:
copy_l:
subs r3,r3,#1
ldr r0,[r1],#4
str r0,[r2],#4
bhs copy_l
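For comparing what a compiler generates against the hand-written loops above, here is a minimal C sketch of both loop shapes. The function and parameter names (copy_until, copy_counted, dst_end, words_minus_1) are made up for illustration and are not from the original code. Note that the counter form with subs/bhs runs the body one more time than the starting counter value, so the sketch assumes the counter is passed as the word count minus one.

```c
#include <stdint.h>

/* Pointer-bound form, mirroring: cmp r2,r3 / ldrlo / strlo / blo.
 * dst plays the role of r2, dst_end of r3, src of r1. */
void copy_until(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Counter form, mirroring: subs / ldr / str / bhs.
 * The decrement comes first and the branch tests the carry flag,
 * which is set when the value before the decrement was >= 1.
 * The body therefore runs (words_minus_1 + 1) times. */
void copy_counted(uint32_t *dst, const uint32_t *src, uint32_t words_minus_1)
{
    uint32_t n = words_minus_1;
    uint32_t before;
    do {
        before = n;       /* value the "subs" sees */
        n -= 1;           /* condition generated early */
        *dst++ = *src++;  /* the load/store do not change the flags */
    } while (before >= 1);
}
```

Whether the compiler keeps the condition generation early depends on the toolchain and optimization level, so it is worth checking the disassembly.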
Hi,
There is no new/special instruction for loops in Cortex-M7.
The design helps reduce loop overhead in a number of ways:
- The BTAC enables good accuracy in branch prediction, so in most cases there is no branch penalty (of course you still get a branch penalty if the prediction is wrong)
- a branch instruction can execute in the same cycle as another data processing instruction
- moving the condition generation instruction earlier helps in some cases too (but in general the design of the Cortex-M7 enables high performance without too much compiler optimization).
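To illustrate the first point, here is a toy C model of why a predicted loop branch is nearly free over many iterations. It is only a sketch under strong assumptions: it models a single-entry predictor that remembers whether the loop-closing branch was taken last time, which is not the actual Cortex-M7 BTAC algorithm, and the function name count_mispredictions is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only, NOT the real Cortex-M7 BTAC: a one-entry predictor
 * that predicts "same as last time" for the loop-closing branch and
 * predicts "not taken" when it has no entry yet. */
uint32_t count_mispredictions(uint32_t iterations)
{
    bool have_entry = false;      /* nothing cached before the first branch */
    bool predicted_taken = false;
    uint32_t misses = 0;

    for (uint32_t i = 0; i < iterations; i++) {
        /* The branch goes back to the loop top except on the last pass. */
        bool taken = (i + 1 < iterations);
        bool prediction = have_entry ? predicted_taken : false;

        if (prediction != taken) {
            misses++;             /* this is where a branch penalty is paid */
        }
        have_entry = true;
        predicted_taken = taken;
    }
    return misses;
}
```

In this model a loop of 100 iterations mispredicts only twice, once on the first branch and once on the loop exit, so the penalty is amortized over the whole loop.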
So in strict computer "geek" language, I wouldn't call it zero-overhead loops. But the result is essentially the same as some zero-overhead-loop designs, so in "PR" language this description is "correct".
And Ari is right that there is no BTAC in Cortex-M4.
regards,
Joseph
(Disclaimer: this message was written before my 2nd cup of coffee this morning... may not be suitable for human consumption.)