Hi ARM specialists,
I have a question about Cortex-M series pipeline behavior.
According to page 15 of the "ARM Cortex-M Programming Guide to Memory Barrier Instructions" (Application Note 321), "Instruction fetch can happen several cycles before decode and execution". If the fetch, decode, and execution stages were synchronized, the decode and execution stages would take the same number of cycles as the fetch stage, and a long prefetch (or fetch) stage would then lower performance. Because that assumption seems strange, I think the prefetch and decode stages are decoupled.
Is that correct? I would like to know the relationship between the prefetch and decode stages of the Cortex-M0/M0+/M3/M4. That is, does the prefetch stage latency affect the following stages or not?
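For context, the quoted statement is also the reason the application note requires barriers after writing new instructions to memory: instructions may already have been prefetched ahead of execution. A minimal sketch of that usage, assuming CMSIS intrinsics (the device header name below is only a placeholder):

```c
#include <stdint.h>
#include <string.h>
#include "cmsis_device.h"  /* placeholder: your vendor's CMSIS device header,
                              which provides the __DSB()/__ISB() intrinsics */

/* Copy a routine into RAM and make sure the core does not run stale,
 * already-prefetched instructions from that region afterwards. */
static void load_and_call(void *ram_dst, const void *code_src, size_t len)
{
    memcpy(ram_dst, code_src, len);  /* new instructions written as data */

    __DSB();  /* ensure the writes have completed before any further fetch */
    __ISB();  /* flush the pipeline so anything already prefetched is
                 discarded and refetched from the updated memory */

    /* Call the freshly loaded code; bit 0 set selects the Thumb state. */
    ((void (*)(void))((uintptr_t)ram_dst | 1u))();
}
```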
Yasuhiko Koumoto.
Hello all,
my intention with this question is to find out whether the prefetch is performed independently of the corresponding decode and execution stages. I think this is true for the Cortex-M0/M3/M4, but not for the Cortex-M0+. Is that correct?
I will attach a figure describing my guess at the Cortex-M3/M4 pipeline scheme. I am looking forward to your feedback.
Best regards,
Yasuhiko Koumoto.
Hi yasuhikokoumoto, have you managed to solve your problem here? I wonder whether some of our regular contributors might have some thoughts on this.
Hi Tom-san,
No, I have not. It is not a problem, just a question. I think that an independent prefetch into the prefetch buffer would hide the performance loss caused by the slow access speed of the instruction flash. If the prefetch buffer has any significance, the pipeline prefetch stage should be decoupled from the decode and execution stages. Is my guess correct? I would like to hear comments from the contributors.
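As an aside, such prefetch buffers are usually implemented in the vendor's flash interface rather than in the Cortex-M core itself. A minimal illustration, assuming an STM32F4 device header purely as an example (the register and bit names come from that vendor header, not from ARM's core documentation):

```c
#include "stm32f4xx.h"   /* assumed vendor header; an STM32F4 is used only
                            as an example of a device with a flash prefetch buffer */

/* Enabling the flash-interface prefetch buffer (and the ART accelerator
 * caches) lets instruction fetches proceed without stalling on flash
 * wait states, independently of the core's own pipeline stages. */
static void flash_prefetch_enable(void)
{
    FLASH->ACR |= FLASH_ACR_PRFTEN     /* flash prefetch buffer   */
               |  FLASH_ACR_ICEN       /* ART instruction cache   */
               |  FLASH_ACR_DCEN;      /* ART data cache          */
}
```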
Best regards,
Yasuhiko Koumoto.
Hi Yasuhiko,
I'm afraid we don't generally publish detailed microarchitectural information like this for any of our cores. The main reason is that we don't want developers to come to rely on particular behaviour which is not architectural but is a feature of a specific implementation. Since different implementations of the same architecture may behave quite differently when it comes to things like precise timing and pipeline behaviour, it makes sense for developers not to rely on specific behaviour, as it may change at some point in the future.
Hope this helps.
Chris
Hello Chris-san,
thank you for your reply. As mentioned above, my guess comes from the document "ARM Cortex-M Programming Guide to Memory Barrier Instructions" (Application Note 321). I also believe that ARM would have a good solution for the performance loss when a Cortex-M core fetches instructions from slower flash memory. However, I understand ARM's position, and I will close this case.