Hi ARM specialists,
I have a question about Cortex-M series pipeline behavior.
According to page 15 of "ARM Cortex-M Programming Guide to Memory Barrier Instructions Application Note 321", "Instruction fetch can happen several cycles before decode and execution". If the fetch, decode, and execution stages were synchronized, the decode and execution stages would take the same number of cycles as the fetch stage. If that were true, a long prefetch (or fetch) stage would lower performance. That seems strange to me, so I think the prefetch and decode stages are decoupled.
Is that correct? I would like to know the relationship between the prefetch and decode stages of Cortex-M0/M0+/M3/M4. That is, does the latency of the prefetch stage affect the following stages or not?
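To make the question concrete, the difference between synchronized and decoupled stages can be sketched with a toy timing model in C. This is purely illustrative and is not based on any ARM documentation or on any actual Cortex-M implementation: the function name `total_cycles`, the buffer depth, the fixed fetch latency, and the per-instruction execute cycles are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS 64  /* hypothetical upper bound for this toy model */

/*
 * Toy model (an assumption, not a real core): a fetch unit feeds an
 * execute stage through a prefetch buffer holding `depth` instructions.
 * Every fetch takes `fetch_lat` cycles; instruction i executes for
 * exec[i] cycles. With depth == 1 the stages advance together (each
 * step costs the slower of fetch and execute); a larger depth lets the
 * fetch unit run ahead while a long instruction is executing.
 * Returns the cycle at which the last instruction completes.
 */
static unsigned total_cycles(const unsigned *exec, size_t n,
                             unsigned fetch_lat, size_t depth)
{
    unsigned fetch_done[MAX_INSNS];
    unsigned exec_start[MAX_INSNS];
    unsigned exec_done = 0;

    assert(n <= MAX_INSNS && depth >= 1);

    for (size_t i = 0; i < n; i++) {
        /* Fetches are issued one at a time, in order. */
        unsigned fetch_start = (i > 0) ? fetch_done[i - 1] : 0;

        /* A buffer slot is free only once instruction i - depth
           has left the buffer, i.e. has started executing. */
        if (i >= depth && exec_start[i - depth] > fetch_start)
            fetch_start = exec_start[i - depth];

        fetch_done[i] = fetch_start + fetch_lat;

        /* Execution needs the instruction fetched and the
           previous instruction finished. */
        exec_start[i] = (fetch_done[i] > exec_done) ? fetch_done[i]
                                                    : exec_done;
        exec_done = exec_start[i] + exec[i];
    }
    return exec_done;
}
```

For example, with execute times {1, 1, 1, 1, 9, 1, 1, 1} and a 3-cycle fetch, this model gives 31 cycles at depth 1 and 27 cycles at depth 3; with a 1-cycle fetch, both depths give 17 cycles. In this sketch the buffer hides fetch latency only when execution time varies enough for the fetch unit to run ahead, which is the relationship I would like to confirm for the real cores.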
Yasuhiko Koumoto.
Hi Yasuhiko,
I'm afraid we don't generally publish microarchitectural details like this for any of our cores. The main reason is that we don't want developers to come to rely on behaviour that is not architectural but is a feature of a specific implementation. Different implementations of the same architecture may behave quite differently when it comes to things like precise timing and pipeline behaviour, so it makes sense for developers not to rely on implementation-specific behaviour, as it may change at some point in the future.
Hope this helps.
Chris
Hello Chris-san,
Thank you for your reply. As mentioned above, my guess comes from the document "ARM Cortex-M Programming Guide to Memory Barrier Instructions Application Note 321". I also believe that ARM has a good solution for the performance decrease when a Cortex-M core fetches instructions from slower flash memory. However, I understand ARM's position, so I will close this case.
Best regards, Yasuhiko Koumoto.