Cortex-M7 "zero overhead loop"

Hi.

Page 22 of the document below says that the Cortex-M7 has a "zero overhead loops" capability. How is this done? Is there a special instruction for it?

http://community.arm.com/servlet/JiveServlet/downloadBody/9595-102-4-18606/ARM_Cortex_M7_MCU_Johnson.pdf

Ari.

  • Hello,


    A zero-overhead loop normally implies a dedicated loop instruction.
    However, there is no such instruction in the Thumb-2 ISA.
    I think the Cortex-M7 "zero-overhead loop" just means branch prediction by the BTAC (Branch Target Address Cache).

    Best regards,
    Yasuhiko Koumoto.

  • Yasuhiko,

      If "branch prediction by BTAC" is not present in the Cortex-M4, you are probably right.

      According to the document above, this enhancement is present only in the Cortex-M7, not in the Cortex-M4.

     

      Thanks.

       Ari.

  • The BTAC is only present in the Cortex-M7.

    Unfortunately I do not have any hands-on experience regarding this, but I do have a few suggestions.

    My suggestions are based on experience with the Cortex-M4 and other architectures, some of which have out-of-order execution.

    I think if jyiu is reading this, he can probably provide you with a much better answer.

    Try placing the loop-condition generation early in the loop, then the action that depends on the condition, and finally the branch at the end.

    For instance:

    copy_l:

        cmp     r2,r3

        ittt      lo

        ldrlo   r0,[r1],#4

        strlo   r0,[r2],#4

        blo     copy_l

    Here I've included the branch in the IT block; it might help in getting a zero-overhead loop. Try comparing it with ...

    copy_l:

        cmp     r2,r3

        itt       lo

        ldrlo   r0,[r1],#4

        strlo   r0,[r2],#4

        blo     copy_l

    ... and see if there is any difference.

    For a counter-type loop, also keep the condition generation early; like this:

    copy_l:

       subs    r3,r3,#1

       ldr     r0,[r1],#4

       str     r0,[r2],#4

       bhs     copy_l
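    For comparison at the C level, here is a rough sketch of the two loop shapes above. The function names `copy_until` and `copy_counted` are made up for illustration, and the mapping to the assembly is only my reading of it; a compiler is free to order the instructions differently.

```c
#include <stdint.h>

/* Pointer-compare loop: mirrors "cmp r2,r3 / ldrlo / strlo / blo".
 * Copies words until dst reaches dst_end; copies nothing if dst >= dst_end. */
void copy_until(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Counter loop: mirrors "subs / ldr / str / bhs".
 * The flags are set by the decrement at the top and tested at the bottom,
 * so a starting count of n copies n + 1 words. */
void copy_counted(uint32_t *dst, const uint32_t *src, uint32_t n)
{
    int more;
    do {
        more = (n >= 1u);   /* carry flag after "subs": set when no borrow */
        n -= 1u;            /* wraps on the final pass, like the register */
        *dst++ = *src++;
    } while (more);
}
```

    Note that with the `subs` ... `bhs` pairing the body runs one more time than the starting count, so load the counter with the word count minus one.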


  • Hi,

    There is no new/special instruction for loops in Cortex-M7.

    The design helps reduce loop overhead in a number of ways:

    - The BTAC enables accurate branch prediction, so in most cases there is no branch penalty (of course, you still pay a branch penalty when the prediction is wrong).

    - A branch instruction can execute in the same cycle as another data processing instruction.

    - Moving the condition-generating instruction earlier helps in some cases too (but in general the design of the Cortex-M7 enables high performance without much compiler optimization).
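    To give a feel for the first point, here is a minimal sketch that counts how often a loop-closing branch is mispredicted. It assumes a textbook 2-bit saturating-counter predictor, which is not a description of how the Cortex-M7 BTAC actually works; the function name `loop_mispredictions` is hypothetical.

```c
/* Counts mispredictions of a loop-closing branch under a textbook 2-bit
 * saturating-counter predictor (an assumption, NOT the Cortex-M7 BTAC).
 * The branch is taken on every pass except the last pass of each run. */
unsigned loop_mispredictions(unsigned iterations, unsigned runs)
{
    unsigned state = 0;     /* 0..1 predict not-taken, 2..3 predict taken */
    unsigned misses = 0;

    for (unsigned r = 0; r < runs; r++) {
        for (unsigned i = 0; i < iterations; i++) {
            int taken = (i + 1 < iterations);   /* last pass falls through */
            int predicted_taken = (state >= 2);

            if (predicted_taken != taken) {
                misses++;
            }
            /* Train the counter towards the actual outcome. */
            if (taken) {
                if (state < 3) state++;
            } else {
                if (state > 0) state--;
            }
        }
    }
    return misses;
}
```

    In this model a 100-iteration loop started with a cold predictor mispredicts 3 times (two while warming up, one at the exit), and running the same loop again adds only 1 more. The other 97 or so branches cost nothing extra, which is where the "almost zero overhead" comes from.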

    So in strict computer "geek" language, I wouldn't call these zero-overhead loops. But the result is essentially the same as with some zero-overhead-loop designs, so in "PR" language the description is "correct".

    And Ari is right that there is no BTAC in Cortex-M4.

    regards,

    Joseph

    (Disclaimer: this message was written before my 2nd cup of coffee this morning... may not be suitable for human consumption.)

    A zero-overhead loop has nothing at all to do with branches. It is a well-known technique widely used in DSPs. Almost every decent DSP that I have used had this feature.

    To make a zero-overhead loop work, two things are needed: (1) a "loop" instruction that tells the processor the number of iterations to execute (there are other fancy, mostly academic, techniques that can do without a dedicated loop instruction, but I have never seen them used in a commercial processor) and (2) optionally, depending on the processor's instruction memory configuration, a loop buffer to store the loop instructions. When the loop instruction is decoded, the number of iterations is loaded into a loop counter that automatically counts down until it reaches zero and the loop terminates. The loop counter is updated in parallel with the execution of the loop instructions and therefore doesn't consume any additional CPU cycles, hence "zero-overhead loop". A branch instruction, on the other hand, consumes multiple cycles (the condition check, which might take several cycles to execute, plus the branch itself). I hope this explanation clarifies the difference between zero-overhead loops and branches.
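    The difference can be sketched with a toy model that only counts issued instructions. It assumes every instruction takes one issue slot and ignores pipeline effects, so it is not a model of any real DSP or of the Cortex-M7; the function names are hypothetical.

```c
#include <stdint.h>

/* Software-counted loop: every pass issues the body plus a decrement
 * and a conditional branch. One extra slot loads the counter up front. */
uint64_t run_sw_loop(uint32_t iterations, uint32_t body_len)
{
    uint64_t issued = 1;            /* mov: load the counter register */
    uint32_t counter = iterations;

    while (counter != 0) {
        issued += body_len;         /* loop body */
        issued += 1;                /* subs: update the counter */
        counter--;
        issued += 1;                /* bne: branch back */
    }
    return issued;
}

/* DSP-style hardware loop: one "loop" instruction loads a dedicated
 * counter; after that only the body occupies issue slots. */
uint64_t run_hw_loop(uint32_t iterations, uint32_t body_len)
{
    uint64_t issued = 1;            /* the "loop" setup instruction */
    uint32_t counter = iterations;  /* loaded when "loop" is decoded */

    while (counter != 0) {
        issued += body_len;         /* loop body only */
        counter--;                  /* done by hardware, costs no slot */
    }
    return issued;
}
```

    For 100 iterations of a 3-instruction body, this model issues 501 instructions for the software loop and 301 for the hardware loop; the 200-instruction difference is the overhead that a zero-overhead loop removes.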

    Now, considering that the Cortex-M7 doesn't have a loop instruction (I searched infocenter.arm.com for "cortex-m7 zero overhead loop" and got 0 hits, so I am relying on the other comments in this thread), it should not claim to support zero-overhead loops. I am not that familiar with the Cortex-M7, though, and some processors have a separate instruction to load the loop counter followed by a regular unconditional branch instruction that is executed only once, at the start of the loop. You might need to ask ARM to clarify what exactly is meant by "zero-overhead loop". The alternative is to ignore that marketing document and rely on the programmer's manual, which is more accurate.