Question about the Pipeline, clock cycle and machine cycle in Cortex-M Series.

Recently I have been studying the implementation of the ARM Cortex-M cores in order to make my software more efficient and its execution time easier to predict. But now I'm confused about the clock cycle, the machine cycle and the pipeline of the Cortex-M.

The first one is Cortex-M7. 

Two days ago I posted a question, https://community.arm.com/processors/f/discussions/8952/where-to-find-the-execution-cycles-of-cortex-m7-instruction, in order to find the execution time of Cortex-M7 instructions. The answer was that, because of dual-issue in this core, it is not useful to know how many clock cycles a single instruction takes.

I measured some instructions such as ADD, CMP, VLDR, STR, SMUL and MOV in Keil MDK, using an STM32F746NG, and finally got exact values. They all take 12 clock cycles. That confuses me.

According to this post, https://community.arm.com/processors/f/discussions/5219/how-long-are-the-cortex-m7-pipeline-stages/26926#26926, the pipeline stages in the M7 look like this:

As an example, take ADD r1, r2. The steps it needs are:

1. Fetch (instruction)

2. Decode (1st decode)

3. Issue (2nd decode; in fact there is no other instruction that can be dual-issued with the ADD, but I don't know the principle or how the core works, so I assume it takes 1 clock cycle)

4. Execute #1 (I don't know what happens here or how it differs from Execute #2)

5. Execute #2

6. Write/Store

The whole process is 6 stages. If every stage takes 1 clock cycle, that is just 6 clock cycles rather than the 12 clock cycles I measured.
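The stage-counting arithmetic above can be written down as a small sketch. It assumes an ideal, stall-free pipeline in which a new instruction enters every cycle; the function name `pipeline_cycles` and its parameters are hypothetical, and the model says nothing about real Cortex-M7 timing. It only separates the latency of one instruction (the number of stages) from the cost of a long run of instructions, where the stages overlap.

```python
def pipeline_cycles(num_instructions, num_stages=6, cycles_per_stage=1):
    """Total cycles for back-to-back instructions in an ideal pipeline.

    Assumptions (not taken from any ARM document): no stalls, no
    dual-issue, and one new instruction enters the pipeline per stage time.
    """
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages stage times to complete;
    # each following instruction completes one stage time later.
    return (num_stages + num_instructions - 1) * cycles_per_stage


# Latency of a single instruction in a 6-stage pipeline.
single = pipeline_cycles(1)
# 100 instructions do not cost 600 cycles, because the stages overlap.
run_of_100 = pipeline_cycles(100)
# The "2 clock cycles per stage" hypothesis from question 1, for comparison.
hypothesis = pipeline_cycles(1, cycles_per_stage=2)
```

Under these assumptions the average cost per instruction in a long run approaches one stage time, however many stages the pipeline has.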

So my questions are:

1. Why does the 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?

2. Will the issue stage be skipped if no dual-issuable instruction follows?

For the Cortex-M4 and M3, which are both 3-stage pipeline cores:

http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores

According to Document DAI0321A:

If we ignore exceptions, branches, etc., an instruction takes only 1 clock cycle on the Cortex-M0, M3 and M4. Is that right?

Finally, the third question is:

3. How is the pipeline organized in the Cortex-M7? If it can't be shown because of the dual-issue issue, then going just by my measurement (12 clock cycles for each instruction), could I say the Cortex-M7 is less efficient than the M4, M3, M0 and M0+ when dealing with this kind of code?

  • Step by step

    1. Why does the 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in the Cortex-M7?

    Dual-issue means that in one processor cycle two instructions are issued to different pipelines (you can see the Cortex-M7 pipelines in the image you posted). For example, the Cortex-M7 may issue two instructions (if their operands are available in registers) into two different pipelines, e.g. a load or store operation to the load/store unit and an integer operation to ALU #1, so that two operations that are sequential in program order execute in parallel instead of one after the other. As a result, the execution time of the pair is determined by the instruction with the longer execution time, whereas with single-issue execution you have to add the execution times of the two instructions. That is why this architecture has better performance (parallel execution).
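A minimal sketch of the "longer one wins" versus "add them up" comparison described above. The function names and the latency values are hypothetical and chosen only for illustration; they are not real Cortex-M7 timings, and the model ignores pipelining and pairing restrictions.

```python
def single_issue_cycles(latencies):
    """Cost when instructions execute strictly one after another."""
    return sum(latencies)


def dual_issue_cycles(latencies):
    """Cost when adjacent instructions are issued in pairs.

    Toy assumption: every adjacent pair is independent and can be
    issued together, so a pair costs as much as its slower member.
    """
    total = 0
    for i in range(0, len(latencies), 2):
        total += max(latencies[i:i + 2])
    return total


# Hypothetical latencies: a 2-cycle load followed by a 1-cycle add.
pair = [2, 1]
sequential_cost = single_issue_cycles(pair)  # 2 + 1
paired_cost = dual_issue_cycles(pair)        # max(2, 1)
```

With these made-up numbers the pair costs 3 cycles when executed one after the other and 2 cycles when issued together.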

    If you want to calculate the overall execution time for some portion of code, you need a deep understanding of the in-order and out-of-order execution paradigms, and you have to take into account the time for loading instructions into the pipeline (fetching) and the time for decoding them. At the decode stage instructions may be executed speculatively, so the runtime of your code becomes shorter. In the case of dual-issue, the second instruction may, for example, not be ready to issue into the execution stage because it is waiting for an operand. It is therefore difficult to calculate the precise time taken by an individual instruction within a sequence of instructions.
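The operand-dependency case can be illustrated with a toy in-order, dual-issue model. This is a sketch under simplifying assumptions: every instruction has one destination register, a pair may issue together only if the second instruction neither reads nor overwrites the destination of the first, and nothing else restricts pairing. The function name `issue_cycles` and the tuple format are hypothetical, and the real pairing rules of the Cortex-M7 are more involved.

```python
def issue_cycles(program):
    """Count issue cycles for an in-order, dual-issue toy core.

    program: list of (dest, sources) tuples, e.g. ("r1", ("r2", "r3")).
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        first_dest, _ = program[i]
        if i + 1 < len(program):
            second_dest, second_srcs = program[i + 1]
            # The second instruction can join this cycle only if it does
            # not wait for (or overwrite) the result of the first one.
            independent = (first_dest not in second_srcs
                           and first_dest != second_dest)
            if independent:
                i += 2
                continue
        i += 1
    return cycles


# Two independent operations: both issue in the same cycle.
independent_pair = [("r1", ("r2", "r3")), ("r4", ("r5", "r6"))]
# The second operation reads r1, which the first one produces.
dependent_pair = [("r1", ("r2", "r3")), ("r4", ("r1", "r6"))]
```

In this model the same two operations take one issue cycle or two depending only on their operands, which is why a per-instruction cycle count cannot be read off in isolation.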

    If I am not mistaken, the Cortex-M7 has a dual-issue superscalar architecture.
