Hi folks,
Some weeks ago, I discovered the IT instruction folding mechanism supported by the Cortex-M3.
As mentioned in the 'Cortex-M3 Devices Generic User Guide', "In some situations, the processor can start executing the first instruction in an IT block while it is still executing the IT instruction. This behavior is called IT folding...".
This means the IT instruction can cost zero cycles. Wonderful!
I would like to know what those situations or conditions are, so that I can anticipate and favour this behaviour.
Before posting here, I made several unsuccessful searches on the net.
Are those conditions related to the instruction before the IT? Its alignment? Its type (16- or 32-bit, data processing, load/store)?
Are those conditions related to the instruction after the IT? Its alignment? Its type (16- or 32-bit, data processing, load/store)?
I also have some subsidiary questions, out of personal curiosity, whose answers may also help with my main question.
Based on my knowledge of this chip after reading some articles, I made the following assumptions that I would like to confirm:
Is 'IT folding' linked to the fact that the first instruction of an IT block is always executed (always marked as THEN)?
Is 'IT folding' linked to the fact that the EPSR is not directly accessible [Cortex™-M3 Technical Reference Manual, §2.3.2]?
For this kind of simultaneous execution, I suppose that the IT and another instruction need to be present in the decode stage at the same time?
But the behaviour of the fetch/decode pair of stages is not clear to me: could the fetch contain two 16-bit instructions, and does the decode stage then request only one or two instructions?
I'm new to this kind of topic, so don't hesitate to correct me if my assumptions are wrong.
Thanks for your help.
IT folding happens when:
- the instruction preceding the IT instruction is 16-bit (and not a branch instruction), and
- the IT instruction has already been fetched into the instruction buffer. (Folding might not happen shortly after a branch, as the IT instruction might still be being fetched.)
There is no other alignment requirement.
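The two conditions above can be written down as a simple predicate. This is only a sketch to make the logic explicit: the struct, its field names and `it_can_fold` are made up for illustration and do not correspond to any hardware register or ARM-provided API.

```c
#include <stdbool.h>

/* Hypothetical description of the situation at decode time.
   The field names are illustrative only. */
typedef struct {
    bool prev_is_16bit;      /* instruction before the IT is a 16-bit encoding */
    bool prev_is_branch;     /* instruction before the IT is a branch */
    bool it_in_fetch_buffer; /* the IT halfword has already been fetched */
} it_fold_context;

/* Returns true when the conditions listed above are all met. */
static bool it_can_fold(it_fold_context c)
{
    return c.prev_is_16bit && !c.prev_is_branch && c.it_in_fetch_buffer;
}
```

For example, a 16-bit CMP followed by an IT that is already in the buffer satisfies the predicate, while the same pair right after a taken branch may not, because the IT may still be in flight.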
Regarding your assumptions:
> Is 'IT folding' linked to the fact that the first instruction of an IT block is always executed (always marked as THEN)?
No. The first instruction is always "T" so that we can save a bit in the encoding of the condition.
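To illustrate the encoding point: in the ARMv7-M Thumb encoding, the IT instruction is the halfword 0xBF00 with a 4-bit first condition and a 4-bit mask. The mask only carries the then/else choice for instructions 2 to 4, followed by a terminating 1 bit, so the first "T" never needs to be stored. The decoder below is a minimal sketch; the function name `it_decode_pattern` is made up, and condition values that are unpredictable in an IT are not checked.

```c
#include <stdint.h>

/* Decodes a 16-bit Thumb IT instruction into a pattern such as "T",
   "TE" or "TTEE". Returns the block length (1..4), or 0 if the halfword
   is not an IT instruction. `out` must have room for 5 chars. */
static int it_decode_pattern(uint16_t hw, char out[5])
{
    unsigned firstcond = (hw >> 4) & 0xFu;
    unsigned mask = hw & 0xFu;
    int low = 0;
    int len;
    int i;

    out[0] = '\0';
    if ((hw & 0xFF00u) != 0xBF00u || mask == 0u)
        return 0; /* mask == 0 is used by hint instructions such as NOP */

    /* The lowest set bit of the mask is the terminator; its position
       gives the number of instructions in the block. */
    while (((mask >> low) & 1u) == 0u)
        low++;
    len = 4 - low;

    out[0] = 'T'; /* the first instruction is implied, not encoded */
    for (i = 1; i < len; i++) {
        unsigned bit = (mask >> (4 - i)) & 1u;
        /* A bit equal to firstcond[0] means "then", otherwise "else". */
        out[i] = (bit == (firstcond & 1u)) ? 'T' : 'E';
    }
    out[len] = '\0';
    return len;
}
```

With this sketch, 0xBF08 (IT EQ) decodes to "T" and 0xBF0C (ITE EQ) decodes to "TE".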
> Is 'IT folding' linked to the fact that the EPSR is not directly accessible [Cortex™-M3 Technical Reference Manual, §2.3.2]?
Not as far as I know.
> For this kind of simultaneous execution, I suppose that the IT and another instruction need to be present in the decode stage at the same time?
Yes. That's why the preceding instruction needs to be 16-bit.
> could the fetch contain two 16-bit instructions, and does the decode stage then request only one or two instructions?
The path from instruction fetch to the decode stage is 32 bits wide. But the second 16-bit half might not hold a valid instruction yet (e.g. it is still being fetched because of memory wait states). So it is possible that the IT fold cannot take place and the IT needs another decode cycle later.
regards,
Joseph
Those are excellent questions; I wish there were a "helpful question" button to reward them.
And I like the detailed answers jyiu gives on this subject.
There is one situation, which I think is not fully covered.
A CMP instruction can, if it's 16-bit, be folded into a preceding LOAD instruction, if the LOAD instruction is 16-bit.
An IT instruction can be folded into a preceding 16-bit instruction.
If I understand this correctly, those two cases cannot happen at the same time; it's "either / or".
My understanding is that if the 16-bit load instruction is located on a 32-bit aligned address, then the 16-bit CMP instruction following it will be folded into the load instruction; and the CMP may execute in one clock cycle, but the load instruction uses one clock cycle less, if it's a two-cycle instruction.
The IT instruction will then not be folded into the 16-bit CMP instruction, because the IT instruction is not part of the 32-bit word that was fetched with CMP.
However, if the load instruction is located on a non-32-bit aligned address, the IT instruction will be folded into the 16-bit CMP instruction.
In other words: The fold only happens if the preceding instruction is aligned on a 32-bit address.
(Did I understand this correctly ?)
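The alignment reasoning above comes down to whether two halfword addresses fall inside the same 32-bit word. A small sketch of that check follows; the helper name `same_fetch_word` and the example addresses are assumptions for illustration, and whether sharing a word actually matters for folding is exactly the open question here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when two instruction addresses lie in the same 32-bit aligned word. */
static bool same_fetch_word(uint32_t addr_a, uint32_t addr_b)
{
    return (addr_a & ~UINT32_C(3)) == (addr_b & ~UINT32_C(3));
}
```

With a 16-bit load at 0x100, the CMP sits at 0x102 and the IT at 0x104, so CMP and IT are in different words. With the load at 0x102, the CMP is at 0x104 and the IT at 0x106, so they share a word.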
I haven't looked into this in detail for a while, so I could be wrong:
I don't think it is necessary for the instruction pair (16-bit Thumb + IT) to be aligned on a 32-bit address.
But sometimes it helps, because the flash might not deliver the IT instruction to the instruction queue in time if the IT instruction is in the next instruction fetch.
(Don't forget that flash memories are usually a bit slow.) The Cortex-M3 and Cortex-M4 both have a 3-word instruction buffer, and if the instruction buffer is filled up I think the IT fold could work, as the IT instruction can be decoded at the same time as the preceding instruction.
If a double-fold is possible, that's truly amazing (and unexpected)!
Thank you for these details. It's very helpful, when the cycles available are few (due to tight timing or low clock frequencies).
Happy that you appreciate my questions.
That's a good start for my first question on this forum.
I will try to continue in this way.
Joseph,
Thank you for the detailed answer.
It is clearer to me now.
The IT instruction is often preceded by a CMP instruction.
If the 16-bit encoding is used for the CMP, the IT should often be folded, as long as it does not come shortly after a branch.
Paying special attention to the choice of CMP encoding could therefore favour IT folding.
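As a rough aid for that choice: in Thumb, `CMP Rn, #imm` only has a 16-bit encoding when Rn is a low register (R0-R7) and the immediate fits in 8 bits (0 to 255). The check below is a minimal sketch of that rule; the function name is made up for illustration. Assemblers that support the `.N` width qualifier should also let you write `CMP.N` to request the narrow encoding and get an error when it is not available.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when CMP Rn, #imm can use the 16-bit Thumb encoding:
   low register (R0-R7) and an 8-bit unsigned immediate. */
static bool cmp_imm_fits_16bit(unsigned rn, uint32_t imm)
{
    return rn <= 7u && imm <= 255u;
}
```

So `CMP R0, #10` can be 16-bit, while `CMP R8, #10` and `CMP R1, #256` need a 32-bit encoding.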
Regards,
Rémi.