Question about the Pipeline, clock cycle and machine cycle in Cortex-M Series.

tyskin over 8 years ago

Recently I've been studying the implementation of the ARM Cortex-M cores so that I can make my software more efficient and its execution time easier to predict. But now I'm confused about the clock cycle, the machine cycle, and the pipeline of Cortex-M.

The first one is Cortex-M7.

Two days ago I posted a question, https://community.arm.com/processors/f/discussions/8952/where-to-find-the-execution-cycles-of-cortex-m7-instruction, to find out the execution time of Cortex-M7 instructions. The answer was that, because this core is dual-issue, it is not meaningful to ask how many clock cycles a single instruction takes.

I measured some instructions such as ADD, CMP, VLDR, STR, SMUL, and MOV in Keil MDK on an STM32F746NG, and I got the same value for all of them: each takes 12 clock cycles. That confuses me.

According to this post, https://community.arm.com/processors/f/discussions/5219/how-long-are-the-cortex-m7-pipeline-stages/26926#26926, the pipeline stages in the M7 look like this:

As an example, take ADD r1, r2. The steps it goes through are:

1. Fetch (instruction),

2. Decode (1st decode),

3. Issue (2nd decode; in fact there is no other instruction that can be dual-issued with the ADD, but I don't know the principle or how the core works, so I assume it takes 1 clock cycle)

4. Execute #1 (I don't know what happens here or how it differs from Execute #2)

5. Execute #2

6. Write/Store

The whole process is 6 stages. If every stage takes 1 clock cycle, that is just 6 clock cycles rather than the 12 clock cycles I measured.

So my questions are:

1. Why does the 6-stage instruction (ADD) take 12 clock cycles? Does every pipeline stage cost 2 clock cycles in Cortex-M7?

2. Is the issue stage skipped if no dual-issuable instruction follows?

For Cortex-M4 and M3, which are both 3-stage pipeline cores:

http://www.anandtech.com/show/8400/arms-cortex-m-even-smaller-and-lower-power-cpu-cores

According to Document DAI0321A:

If we ignore exceptions, branches, etc., an instruction takes only 1 clock cycle on Cortex-M0, M3, and M4. Is that right?

Finally, the third question is:

3. How is the pipeline organized in Cortex-M7? If that can't be shown because of the dual-issue design, then based only on my measurement (12 clock cycles for each instruction), could I say that the Cortex-M7 is less efficient than the M4, M3, M0, and M0+ when dealing with this kind of code?

Gabriel Wang over 8 years ago

First, how did you get the 12-cycle result? That looks incorrect. The whole point of a long pipeline is to improve performance; otherwise why build one? CoreMark and other benchmarks show that the Cortex-M7 does indeed outperform the Cortex-M3/M4.
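
To make the difference between pipeline depth and cycles per instruction concrete, here is a minimal sketch of an idealized single-issue pipeline. It assumes one clock cycle per stage and no stalls, branches, or dual issue, so it is not a model of the real Cortex-M7; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Latency of ONE instruction: it must pass through every stage once. */
uint32_t ideal_latency_cycles(uint32_t stages)
{
    assert(stages > 0u);
    return stages;
}

/* Total cycles for a stream of n instructions in an ideal pipeline.
 * The first instruction needs `stages` cycles to drain; after that,
 * one instruction completes every cycle because the stages overlap. */
uint32_t ideal_pipeline_cycles(uint32_t stages, uint32_t n_instructions)
{
    assert(stages > 0u);
    if (n_instructions == 0u) {
        return 0u;
    }
    return stages + (n_instructions - 1u);
}
```

Under these assumptions, `ideal_pipeline_cycles(6, 100)` gives 105 cycles, or about 1.05 cycles per instruction, while `ideal_pipeline_cycles(3, 100)` gives 102. The depth sets the latency of a single instruction, not the sustained rate, so a 6-stage pipeline does not by itself imply 6 (or 12) cycles per instruction.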

So, for us, the first question is: what is your method for measuring the CPI? Furthermore, have you taken memory wait states into consideration? If you want the best performance, you should use TCM to store code and data (so there will be no EXTRA wait states when fetching instructions and accessing data).
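
As a rough illustration of why wait states matter for a CPI measurement, the sketch below adds a stall penalty to a base CPI. It is a deliberately simple model: `stall_fraction` (the share of instructions whose fetch or data access actually reaches wait-state memory) and the function name are assumptions for illustration, and real parts usually have caches or prefetch logic that lower that share.

```c
#include <assert.h>
#include <stdint.h>

/* Effective cycles per instruction when some accesses hit slow memory.
 * base_cpi       : CPI with zero-wait-state memory (e.g. code and data in TCM)
 * wait_states    : extra cycles added by each slow access
 * stall_fraction : 0.0 .. 1.0, share of instructions that pay the penalty */
double effective_cpi(double base_cpi, uint32_t wait_states, double stall_fraction)
{
    assert(base_cpi > 0.0);
    assert(stall_fraction >= 0.0 && stall_fraction <= 1.0);
    return base_cpi + stall_fraction * (double)wait_states;
}
```

With everything in TCM the stall fraction is 0 and the result is just the base CPI. With a hypothetical 7 wait states and every access stalling, `effective_cpi(1.0, 7, 1.0)` is 8.0, which shows how memory configuration alone can dominate a per-instruction cycle count.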

We cannot tell you how the pipeline is organized in detail, but your conclusion that the Cortex-M7 is less efficient than the M4, M3, M0, or M0+ on this kind of code is wrong. You can simply test it with existing benchmarks.

tyskin over 8 years ago in reply to Gabriel Wang

Thanks for your response.

I measured the clock cycles like this: in a debug session on the STM32F746NG,

using Keil MDK, I set 2 breakpoints at the start of the code and watched how the value of Internal->states changed.

I found that every break takes 12 clock cycles. Is that an incorrect method?

Gabriel Wang over 8 years ago in reply to tyskin

That's the wrong method. You should use the PMU, without the debugger.
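
A minimal sketch of a debugger-free measurement, assuming the counter meant here is the DWT cycle counter (DWT_CTRL at 0xE0001000 and DWT_CYCCNT at 0xE0001004 on Cortex-M3/M4/M7). The struct and helper names are hypothetical, and the code assumes trace has already been enabled by setting TRCENA (bit 24 of DEMCR at 0xE000EDFC). Some Cortex-M7 parts may also need the DWT unlocked through its lock access register, so check the reference manual for your device.

```c
#include <assert.h>
#include <stdint.h>

/* The two adjacent DWT registers used here (hypothetical struct name). */
typedef struct {
    volatile uint32_t ctrl;    /* DWT_CTRL: bit 0 (CYCCNTENA) enables the counter */
    volatile uint32_t cyccnt;  /* DWT_CYCCNT: free-running 32-bit cycle count */
} cycle_counter_t;

/* On real hardware, pass this pointer to the helpers below. */
#define DWT_HW ((cycle_counter_t *)0xE0001000u)

/* Reset the count and start the counter. */
void cycle_counter_start(cycle_counter_t *cc)
{
    cc->cyccnt = 0u;
    cc->ctrl |= 1u;
}

/* Elapsed cycles; unsigned subtraction stays correct across one wrap. */
uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average cycles per instruction, scaled by 100 to avoid floating point.
 * `overhead` is the cost of an empty measurement (reading the counter twice
 * with nothing in between); `n` is how many instructions were timed. */
uint32_t avg_cycles_x100(uint32_t total, uint32_t overhead, uint32_t n)
{
    assert(n > 0u);
    assert(total >= overhead);
    return (uint32_t)(((uint64_t)(total - overhead) * 100u) / n);
}
```

On the target, one would call `cycle_counter_start(DWT_HW)`, read `DWT_HW->cyccnt` before and after a long unrolled run of the instruction under test, and divide by the instruction count. Timing many instructions at once and subtracting the empty-measurement overhead keeps the fixed cost of the measurement itself from being mistaken for the cost of each instruction.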