Recently I have been studying the implementation of the ARM Cortex-M cores in order to make my software more efficient and its execution time easier to predict. But now I'm confused about the clock cycles, machine cycles and the pipeline of Cortex-M.
The first one is Cortex-M7.
Two days ago I posted a question, https://community.arm.com/processors/f/discussions/8952/where-to-find-the-execution-cycles-of-cortex-m7-instruction, to find out the execution time of Cortex-M7 instructions. The answer was that, because of dual-issue in this core, it is of little use to ask how many clock cycles a single instruction takes.
I measured some instructions such as ADD, CMP, VLDR, STR, SMUL and MOV in Keil MDK on an STM32F746NG, and finally got an exact value: they all take 12 clock cycles. That confuses me.
According to this post, https://community.arm.com/processors/f/discussions/5219/how-long-are-the-cortex-m7-pipeline-stages/26926#26926, the pipeline stages in the M7 look like this:
As an example, take ADD r1, r2. The steps it goes through are:
1. Fetch (instruction)
2. Decode (1st dec)
3. Issue (2nd dec; in fact there is no other instruction that can be dual-issued with the ADD, but I don't know the principle or how the core works, so I assume it takes 1 clock cycle)
4. Execute #1 (I don't know what happens here or how it differs from Execute #2)
5. Execute #2
6. Write/Store
The whole process is 6 stages. If every stage takes 1 clock cycle, that is just 6 clock cycles rather than the 12 clock cycles I measured.
So my questions are:
1. Why does a 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?
2. Will the issue stage be skipped if no dual-issuable instruction follows?
Next, the Cortex-M4 and M3, which are both 3-stage pipeline cores:
http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores
According to Document DAI0321A:
If we ignore exceptions, branches, etc., an instruction takes only 1 clock cycle on the Cortex-M0, M3 and M4. Is that right?
Given all of the above, the third question is:
3. How is the pipeline organized in the Cortex-M7? If that can't be shown because of the dual-issue problem, then going only by my measurement (12 clock cycles for each instruction), could I say the Cortex-M7 is less efficient than the M4, M3, M0 and M0+ when dealing with this kind of code?
First, how did you get the 12-cycle result? That looks incorrect. The point of a longer pipeline is to improve performance; otherwise why build one? CoreMark and other benchmarks show that the Cortex-M7 is indeed faster than the Cortex-M3/M4.
So, for us, the first question is: what is your method for measuring the CPI? Furthermore, have you taken memory wait states into consideration? If you want the best performance, you should use TCM to store code and data (so there are no extra wait states when fetching instructions and accessing data).
We cannot tell you how the pipeline is organized in detail, but your conclusion that the Cortex-M7 is less efficient than the M4, M3, M0 or M0+ for this kind of code is wrong. You can simply test it with existing benchmarks.
Thanks for your response.
I measured the clock cycles like this: in a debug session on the STM32F746NG,
using Keil MDK, I set 2 breakpoints at the start of the code and watched how the Internal->states value changed.
I found that every break takes 12 clock cycles. Is that an incorrect method?
Step by step
Dual-issue means that in 1 processor cycle 2 instructions are issued to different pipelines (you can see the Cortex-M7 pipelines in the image you posted). For example, the Cortex-M7 may issue two instructions (if their operands are in registers) into two different pipelines, e.g. a load or store operation to the load/store unit and an integer operation to ALU #1, so that two operations that are sequential in program order execute in parallel instead of one after the other. As a result, the execution time of the pair is determined by the instruction with the longer execution time, whereas with purely sequential (single-issue) execution you have to add up the execution time of each instruction. That is why this architecture has better performance (parallel execution).
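To make the "longer of the two" point concrete, here is a minimal C sketch of an idealized model. The function names and the cycle costs fed into them are hypothetical and are not real Cortex-M7 timings; the model assumes neighbouring instructions can always pair, and it ignores hazards and the core's actual pairing rules.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: cost[i] is the execute time, in cycles, of instruction i. */

/* Single issue: instructions run one after another, so their times add up. */
unsigned cycles_single_issue(const unsigned *cost, size_t n)
{
    assert(cost != NULL || n == 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += cost[i];
    return total;
}

/* Idealized dual issue: instructions 2k and 2k+1 always pair, and a pair
 * takes as long as its slower member. A real core cannot always pair. */
unsigned cycles_dual_issue(const unsigned *cost, size_t n)
{
    assert(cost != NULL || n == 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i += 2) {
        unsigned a = cost[i];
        unsigned b = (i + 1 < n) ? cost[i + 1] : 0; /* odd one out issues alone */
        total += (a > b) ? a : b;
    }
    return total;
}
```

For the made-up costs {1, 2, 1, 1, 3}, the single-issue model gives 8 cycles and the idealized dual-issue model gives 6.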
If you want to calculate the overall execution time for some portion of code, you need deep knowledge of the in-order and out-of-order execution paradigms. You have to take into account the time for loading instructions into the pipeline (fetching) and the time for decoding them. At the decode stage instructions may be executed speculatively, which shortens the runtime of your code. With dual-issue, on the other hand, the second instruction may not be ready to issue into the execution stage because it is waiting for an operand. It is therefore difficult to calculate the precise execution time of individual instructions within a sequence of instructions.
If I am not mistaken, the Cortex-M7 has a dual-issue superscalar architecture.
The issue stage is the stage where instructions are prepared for execution, so it can't be skipped.
Cortex-M7 performance is in general better than that of the Cortex-M3 or M4 because of its superscalar architecture.
But if you calculate the execution time of one individual instruction through all stages of the pipeline, you will see that the Cortex-M3 or M4 spends fewer cycles than the Cortex-M7 :)
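The difference between the latency of one instruction and the throughput of a stream can be sketched with a simple formula. This is an idealized model with no stalls, written as a hypothetical helper; the 3-stage and 6-stage figures come from this thread, and the width of 2 stands for dual issue.

```c
#include <assert.h>

/* Idealized pipeline with no stalls: the first instruction needs `stages`
 * cycles to pass through, and after that each cycle completes up to
 * `width` instructions. */
unsigned pipeline_cycles(unsigned stages, unsigned width, unsigned n)
{
    assert(stages > 0 && width > 0);
    if (n == 0)
        return 0;
    unsigned issue_groups = (n + width - 1) / width; /* ceil(n / width) */
    return stages + issue_groups - 1;
}
```

Under these assumptions a single instruction costs 3 cycles on the 3-stage single-issue model and 6 cycles on the 6-stage dual-issue model, but 100 instructions cost 102 cycles versus 55. A per-instruction figure taken in isolation therefore says little about how fast real code runs.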
Hi, thanks for the reply. I'm not sure whether the 12-clock-cycle latency is caused by the debugger, ST-LINK. Maybe it needs an instruction's position for the break?
About the issue stage: if no dual-issuable instruction follows, the pipeline does nothing in this stage. Isn't that a waste of pipeline capacity?
That's the wrong method. You should use the PMU, without the debugger.
If no dual-issuable instruction follows, only one instruction is issued, provided there is no hazard.
Read about the SysTick system timer and try to use it in your measurements. Select the processor clock as the SysTick clock source and load a value, which is then decremented on every processor clock cycle. Halt the processor and sample the SysTick current value register.
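A minimal sketch of that SysTick approach in C follows. The register layout, the base address 0xE000E010 and the CTRL bits are those defined by the ARMv7-M architecture; the type and function names are made up for this example. The registers are reached through a pointer so the same code can be tried on a host with a fake register block.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick register layout per ARMv7-M (base address 0xE000E010 on target). */
typedef struct {
    volatile uint32_t CTRL;   /* control and status */
    volatile uint32_t LOAD;   /* reload value, 24 bits */
    volatile uint32_t VAL;    /* current value, 24 bits, counts down */
    volatile uint32_t CALIB;  /* calibration value */
} systick_regs_t;

#define SYSTICK_MAX        0x00FFFFFFu
#define SYSTICK_ENABLE     (1u << 0)
#define SYSTICK_CLKSOURCE  (1u << 2)   /* 1 = processor clock */

/* Start a free-running down-counter on the processor clock, no interrupt. */
void systick_start(systick_regs_t *st)
{
    assert(st != 0);
    st->CTRL = 0;             /* stop the timer while configuring */
    st->LOAD = SYSTICK_MAX;   /* longest possible period */
    st->VAL  = 0;             /* any write clears the current value */
    st->CTRL = SYSTICK_CLKSOURCE | SYSTICK_ENABLE;
}

/* Cycles between two VAL samples. The counter counts down, so this is
 * start minus end, modulo 2^24. Valid if it wrapped at most once. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MAX;
}
```

On a target, `st` would be `(systick_regs_t *)0xE000E010`, and `st->VAL` would be read immediately before and after the code under test. Timing an empty before/after pair first and subtracting that result removes the overhead of the reads themselves.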
Thank you very much. I got it.
Hi, I recently tested the Data Watchpoint and Trace (DWT) unit of the Cortex-M4. If you still want to count instruction cycles, try the DWT counters in your measurements.
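A rough sketch of using the DWT cycle counter is shown below. The register addresses in the comment and the TRCENA and CYCCNTENA bits are from the ARMv7-M architecture; the function names are hypothetical. The registers are passed as pointers so the logic can also be exercised on a host. On some Cortex-M7 devices the DWT may also need to be unlocked through its lock access register first, so check the reference manual of the part.

```c
#include <assert.h>
#include <stdint.h>

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT and ITM blocks */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

/* On target (ARMv7-M): demcr = 0xE000EDFC, dwt_ctrl = 0xE0001000,
 * dwt_cyccnt = 0xE0001004. */
void dwt_cyccnt_enable(volatile uint32_t *demcr,
                       volatile uint32_t *dwt_ctrl,
                       volatile uint32_t *dwt_cyccnt)
{
    assert(demcr && dwt_ctrl && dwt_cyccnt);
    *demcr      |= DEMCR_TRCENA;        /* turn the trace blocks on */
    *dwt_cyccnt  = 0;                   /* reset the counter */
    *dwt_ctrl   |= DWT_CTRL_CYCCNTENA;  /* start counting core cycles */
}

/* CYCCNT is a 32-bit up-counter; unsigned subtraction handles one wrap. */
uint32_t dwt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

As with SysTick, read CYCCNT right before and right after the code under test while the core runs freely, and subtract the cost of an empty measurement. Measuring a longer sequence and dividing gives a more useful average than timing one instruction.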