Recently I've been studying the implementation of the ARM Cortex-M cores in order to make my software more efficient and its execution time easier to predict. But now I'm confused about clock cycles, machine cycles, and the Cortex-M pipeline.
The first one is Cortex-M7.
Two days ago I posted a question, https://community.arm.com/processors/f/discussions/8952/where-to-find-the-execution-cycles-of-cortex-m7-instruction, asking for the execution time of Cortex-M7 instructions. The answer was that, because this core is dual-issue, knowing how many clock cycles a single instruction takes is not useful.
I measured some instructions like ADD, CMP, VLDR, STR, SMUL, MOV in Keil MDK on an STM32F746NG, and got an exact value: they all take 12 clock cycles. That confuses me.
According to this post, https://community.arm.com/processors/f/discussions/5219/how-long-are-the-cortex-m7-pipeline-stages/26926#26926, the pipeline stages in the M7 look like this:
Take ADD r1, r2 as an example. The steps it goes through are:
1. Fetch (instruction)
2. Decode (1st decode)
3. Issue (2nd decode; here there is no other instruction that can be dual-issued with the ADD, but since I don't know how the core handles this, I assume it takes 1 clock cycle)
4. Execute #1 (I don't know what happens here or how it differs from Execute #2)
5. Execute #2
6. Write/Store
The whole process is 6 stages. If every stage takes 1 clock cycle, that is just 6 clock cycles rather than the 12 clock cycles I measured.
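To make that arithmetic explicit, here is a minimal sketch of the textbook ideal-pipeline model. It is only an illustration under assumptions of my own: single issue, one cycle per stage, no stalls or memory wait states. It is not a model of the real Cortex-M7, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal pipeline: the first instruction needs `stages` cycles to pass
 * through, and each following instruction completes one cycle later.
 * Assumes single issue and no stalls (NOT the real Cortex-M7). */
static uint32_t ideal_pipeline_cycles(uint32_t stages, uint32_t n_instructions)
{
    assert(stages > 0);
    if (n_instructions == 0) {
        return 0;
    }
    return stages + (n_instructions - 1);
}
```

Under this model a single ADD has a latency of 6 cycles in a 6-stage pipeline, while 100 back-to-back instructions take 105 cycles, which is close to 1 cycle each. Neither figure gives 12 cycles per instruction.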
So my questions are:
1. Why does a 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?
2. Is the issue stage skipped if there is no dual-issuable instruction following?
As for the Cortex-M4 and M3, both are 3-stage pipeline cores:
http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores
According to Document DAI0321A:
If we ignore exceptions, branches, etc., an instruction takes only 1 clock cycle on the Cortex-M0, M3, and M4. Is that right?
Finally, the third question is:
3. How is the pipeline organized in the Cortex-M7? If that can't be shown because of dual-issue, then based only on my measurement (12 clock cycles for each instruction), could I say the Cortex-M7 is less efficient than the M4, M3, M0, and M0+ when running this kind of code?
Read about the SysTick system timer and try using it in your measurements. Connect the processor clock to the SysTick timer and load a value, which will then be decremented on every processor clock cycle. Halt the processor and sample the SysTick current value register.
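A minimal sketch of the arithmetic behind such a measurement, assuming the reload value is set to the full 24-bit range (0xFFFFFF) and that you first time an empty region to find the cost of the two register reads themselves. The helper names are hypothetical. The register names in the comment are the standard CMSIS ones, but that part is not compiled here and the exact setup on your board may differ.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Cycles elapsed between two samples of the current value register.
 * The counter counts DOWN, so elapsed = start - end. Masking to 24 bits
 * handles a single wrap through zero (assumes reload = 0xFFFFFF). */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* Average cycles per instruction after removing the measurement overhead
 * (the reading obtained with nothing between the two samples). */
static uint32_t cycles_per_instruction(uint32_t measured, uint32_t overhead,
                                       uint32_t n)
{
    assert(n > 0 && measured >= overhead);
    return (measured - overhead) / n;
}

/* On the target (CMSIS names, shown for orientation only):
 *   SysTick->LOAD = 0x00FFFFFF;
 *   SysTick->VAL  = 0;
 *   SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;
 *   uint32_t t0 = SysTick->VAL;
 *   ...instructions under test, repeated n times...
 *   uint32_t t1 = SysTick->VAL;
 *   uint32_t cycles = systick_elapsed(t0, t1);
 */
```

Timing a long run of the same instruction and dividing by the count gives an average that is far less sensitive to the fixed cost of sampling the timer than a single-instruction reading.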
Thank you very much. I got it.