Question about the Pipeline, clock cycle and machine cycle in Cortex-M Series.

tyskin over 8 years ago

Recently I've been studying how the ARM Cortex-M cores are implemented, in order to make my software more efficient and its execution time easier to predict. But I'm now confused about clock cycles, machine cycles, and the pipeline of the Cortex-M.

The first one is Cortex-M7.

Two days ago I posted a question, https://community.arm.com/processors/f/discussions/8952/where-to-find-the-execution-cycles-of-cortex-m7-instruction, to find out the execution time of Cortex-M7 instructions. The answer was that, because this core is dual-issue, knowing how many clock cycles a single instruction takes is not useful.

I measured some instructions such as ADD, CMP, VLDR, STR, SMUL, and MOV in Keil MDK on an STM32F746NG, and I got an exact value: they all take 12 clock cycles. That confuses me.

According to this post, https://community.arm.com/processors/f/discussions/5219/how-long-are-the-cortex-m7-pipeline-stages/26926#26926, the pipeline stages in the M7 look like this:

Take ADD r1, r2 as an example. The steps it goes through are:

1. Fetch (instruction)

2. Decode (1st decode)

3. Issue (2nd decode; in fact there is no other instruction that can be dual-issued with the ADD, but I don't know the principle or how the core works, so I assume it takes 1 clock cycle)

4. Execute #1 (I don't know what happens here or how it differs from Execute #2)

5. Execute #2

6. Write/Store

The whole process has 6 stages. If every stage takes 1 clock cycle, that is just 6 clock cycles rather than the 12 clock cycles I measured.

So my questions are:

1. Why does the 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?

2. Is the issue stage skipped if no dual-issuable instruction follows?

For the Cortex-M4 and M3, which are both 3-stage pipeline cores:

http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores

According to Document DAI0321A:

If we ignore exceptions, branches, etc., an instruction takes only 1 clock cycle on the Cortex-M0, M3, and M4. Is that right?

Finally, the third question is:

3. How is the pipeline organized in the Cortex-M7? If that can't be shown because of the dual-issue design, then going by my measurement alone (12 clock cycles for each instruction), could I say the Cortex-M7 is less efficient than the M4, M3, M0, and M0+ when dealing with this kind of code?

+1 Vanhealsing over 8 years ago

Step by step

1. Why does the 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?

Dual-issue means that in one processor cycle two instructions are issued to different pipelines (you can see the Cortex-M7 pipelines in the image you posted). For example, the Cortex-M7 may issue two instructions (if their operands are in registers) into two different pipelines, e.g. a load or store to the load/store unit and an integer operation to ALU #1, so that two operations that are sequential in program order execute in parallel instead of one after the other. As a result, the execution time of the pair is set by the instruction with the longer execution time, whereas on a single-issue core you have to add the execution times of both instructions. That is why this architecture performs better (parallel execution).
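
The effect described above can be put into numbers with a small idealized model. This is only a sketch: the function name `ideal_cycles` and its parameters are made up for illustration, and it assumes no stalls, no cache misses, no branches, and that instructions can always pair when the issue width is 2. Real Cortex-M7 code will not meet these assumptions exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized pipeline model (illustrative only, not taken from ARM
 * documentation): returns the number of clock cycles needed to push
 * n_instr independent instructions through a pipeline with `stages`
 * stages that can issue `issue_width` instructions per cycle. */
uint32_t ideal_cycles(uint32_t n_instr, uint32_t stages, uint32_t issue_width)
{
    assert(stages > 0u && issue_width > 0u);
    if (n_instr == 0u) {
        return 0u;
    }
    /* Number of issue groups, rounded up. */
    uint32_t groups = (n_instr + issue_width - 1u) / issue_width;
    /* The first group needs `stages` cycles to travel through the
     * pipeline; each later group completes one cycle after the
     * previous one. */
    return stages + groups - 1u;
}
```

Under this model a single instruction in a 6-stage pipeline has a latency of 6 cycles, but 100 independent instructions finish in 105 cycles when issued one per cycle and in 55 cycles with ideal dual issue. So the number of stages describes latency, while the sustained cost per instruction approaches 1 or 0.5 cycles.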

If you want to calculate the overall execution time for a portion of code, you need a solid understanding of the in-order and out-of-order execution paradigms. You have to take into account the time for loading instructions into the pipeline (fetching) and the time for decoding them. At the decode stage instructions may be executed speculatively, which makes the run time of your code shorter. With dual-issue, for example, the second instruction may not be ready to issue into the execute stage because it is waiting for an operand. It is therefore difficult to calculate a precise time for an individual instruction within a sequence of instructions.

If I am not mistaken, the Cortex-M7 has a dual-issue superscalar architecture.

0 tyskin over 8 years ago in reply to Vanhealsing

Hi, thanks for the reply. I'm not sure whether the 12-clock-cycle latency is caused by the debugger, ST-LINK. Maybe it needs an instruction's position for a breakpoint?

0 Vanhealsing over 8 years ago in reply to tyskin

Read about the SysTick system timer and try using it in your measurements. Clock the SysTick timer from the processor clock and load a value that will be decremented on every processor clock cycle. Then halt the processor and sample the SysTick current value register.
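
A minimal sketch of that SysTick approach follows. The helper names and the `MEASURE_ON_TARGET` build flag are hypothetical, and the device header `stm32f7xx.h` is an assumption based on the STM32F746NG mentioned earlier; the register and mask names are the usual CMSIS-Core ones. It also assumes SysTick is not in use for anything else (for example an RTOS or HAL tick) and that the counter wraps at most once during the measurement.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter that reloads from LOAD after it
 * reaches 0. */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu

/* Cycles elapsed between two samples of the current value register,
 * assuming at most one wrap between the samples. */
uint32_t systick_elapsed(uint32_t start, uint32_t end, uint32_t reload)
{
    assert(start <= reload && end <= reload);
    if (start >= end) {
        return start - end;              /* counter did not wrap */
    }
    return start + (reload + 1u) - end;  /* counter wrapped through 0 once */
}

#ifdef MEASURE_ON_TARGET   /* hypothetical flag: build only for the MCU */
#include "stm32f7xx.h"     /* assumed CMSIS device header */

/* Times one call of code_under_test; the result includes the call
 * overhead and the cost of the two register reads. */
uint32_t measure_with_systick(void (*code_under_test)(void))
{
    SysTick->LOAD = SYSTICK_MAX_RELOAD;
    SysTick->VAL  = 0u;  /* writing any value clears the counter */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;

    uint32_t start = SysTick->VAL;
    code_under_test();
    uint32_t end = SysTick->VAL;

    return systick_elapsed(start, end, SYSTICK_MAX_RELOAD);
}
#endif
```

Because the counter runs from the processor clock, the difference is in core clock cycles and does not depend on where the debugger halts the core.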

0 tyskin over 8 years ago in reply to Vanhealsing

Thank you very much. I got it.

+1 Vanhealsing over 8 years ago in reply to tyskin

Hi, I recently tested the Data Watchpoint and Trace (DWT) unit of the Cortex-M4. If you still want to count instruction cycles, try using the DWT counters in your measurements.
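
As a rough sketch of how the DWT cycle counter could be used: the helper names and the `MEASURE_ON_TARGET` flag below are hypothetical, and `stm32f7xx.h` is an assumed device header (use the one for your part). `CoreDebug->DEMCR`, `DWT->CTRL`, and `DWT->CYCCNT` are the CMSIS-Core names. On some Cortex-M7 devices the DWT may additionally need to be unlocked before the counter can be enabled; check the reference manual for your device.

```c
#include <stdint.h>

/* DWT->CYCCNT is a free-running 32-bit up-counter, so unsigned
 * subtraction gives the correct difference across a single wrap. */
uint32_t cyccnt_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Subtract the cost of the measurement itself, obtained by timing an
 * empty sequence the same way. The result is clamped at 0. */
uint32_t net_cycles(uint32_t measured, uint32_t baseline)
{
    return (measured > baseline) ? (measured - baseline) : 0u;
}

#ifdef MEASURE_ON_TARGET   /* hypothetical flag: build only for the MCU */
#include "stm32f7xx.h"     /* assumed CMSIS device header */

/* Turn on the trace block and start the cycle counter from 0. */
void cyccnt_enable(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0u;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;
}
#endif
```

Timing an empty sequence first gives a baseline. If every instruction reports the same count, most of that count may be the measurement overhead rather than the instruction, so it is worth timing a longer run of instructions and subtracting the baseline.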