Recently ARM updated the Cortex-M7 information.
I think the biggest news is that the pipeline details were made public.
The new information says that the integer pipeline has 4 stages and the floating point pipeline has 5 stages.
However, the past information said that it had 6 stages.
Where does this difference come from?
I would like to know the concrete explanation for each stage.
What is the first stage, what is the second stage, what is the third stage, what is the fourth stage, and so on?
This is the first time I have looked at the Cortex-M7, so thanks for sharing this info. My first observation is that this pipeline diagram looks more like a CISC design (instructions with different latencies) than a pure RISC pipeline, hence the confusion. The shortest ALU operation takes 4 cycles, which explains the 4-stage pipe. The FPU takes an additional cycle to access the FP-RF and therefore uses 5 stages.
I would just ignore the old diagram as it has incorrect info (it might have been created by the marketing department without consulting the engineering team). For example, the write/store in the ALU pipe is not a separate stage because writes are executed at the end of the execute stage. The same goes for the prefetch: it is only activated when predicting branches, so it is not really a separate pipeline stage. In normal program execution, instructions are fetched in order.
PS: I doubt that ARM will share with you the architectural details for each pipeline stage.
I don't need the details of each pipeline stage. I just want to know the name of each stage.
In the old slide, I read the stages as follows.
1st = Fetch
2nd = Decode
3rd = Issue
4th = Execute #1
5th = Execute #2
6th = Write/Store
In the new slide, I think they are as follows.
1st = Instruction Decoders
2nd = Integer Register File
3rd = Shift
4th = ALU
Is this correct?
I would like to know why the pipeline figure has changed so drastically.
I would also like to know the relationship between the old and the new figures.
Thank you and best regards,
"I would like to know the concrete explanation for each stage." sounded like you wanted more than just stage names.
I don't see any difference between these two diagrams. The new one is polished and more detailed, whereas the first one has confusing terminology. Here is a stage-by-stage comparison of the two diagrams (old on the left, new on the right). I hope it helps.
1. Fetch = 1. Instruction Buffer
2. Decode = 2. Instruction Decoder
3. Issue = 3. Integer Register File (RF) access (which can be considered a part of the decoder). It consumes one cycle because ports are interleaved between the two ALU pipelines (I do have some questions on this, see below).
4. Execute = 4. Execute
   ALU #1 = ALU0 (sequential operations, e.g. shift and/or ALU)
   ALU #2 = ALU1 (parallel operations, shift+ALU)
The FP pipeline has an additional stage to access the FP-RegisterFile.
What is confusing for me is that:
1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.
2. The Integer RF has 4 write ports. Two for the two integer ALU pipelines and one, I assume, for the MAC, am I right? How about the 4th write port?
If there is an ARM Cortex-M7 FAE (or a Cortex-M7 expert) patrolling this community he/she can answer these questions.
Hello Hanni Lozano,
I cannot follow your explanation.
In the older pipeline, the number of stages is six.
Your explanation seems to say that both have four.
Good morning Yasuhiko,
The six stages in the old diagram refer to the longest integer pipeline (load pipeline).
The shortest integer ALU pipeline (ALU #2) has only 4 stages (Fetch, Decode, Issue, Execute #2). The Write/Store step included in the ALU #2 pipeline is for writing/storing ALU results into the Register File and is not a separate pipeline stage.
The ALU #1 and MAC pipelines have five stages each.
I would really ignore the old diagram and focus on the new one which is more detailed.
BTW, is there a technical product brief for Cortex-M7 that explains the basic architecture? It might have answers to my previous questions, which I couldn't find on infocenter.arm.com.
I wonder why ARM mentioned ALU#2 for the 4-stage pipeline.
I think that the six-stage pipe in the old slide indicated the ALU#1 pipe.
Now the ALU#1 pipe seems to have four stages, and I would like to know why the other 2 stages have vanished.
Regarding ALU#2, it performs the shift or ALU operation for simple instructions (not the parallel execution).
Also, regarding the MAC pipeline, it would be 4 stages:
the 1st is Decode, the 2nd is Register File access, the 3rd is Multiply and the 4th is Accumulate.
By the way, are you an ARM person?
I would like to get an answer from an ARM person.
Sorry, I am not an ARM employee, just a partner. We are not using Cortex-M7 currently, but we are definitely interested in it because of the potentially big performance improvement over Cortex-M4. There are not that many commercially available M7 parts anyway.
The MAC is actually a 5-stage pipeline. You forgot the Instruction Fetch stage.
Getting an answer from ARM would be really nice. Is there a way to poke an ARM FAE with these specific questions? We've just recently joined the ARM community so not very familiar with the protocol.
> I would like to get an answer from an ARM person.
> Getting an answer from ARM would be really nice. Is there a way to poke an ARM FAE with these specific questions? We've just recently joined the ARM community so not very familiar with the protocol.
In general we avoid commenting on the implementation details of the processor microarchitectures, above and beyond any public presentations which have been made. If you are an ARM licensee then you can get more detailed information via your usual ARM commercial contact or firstname.lastname@example.org, but it's not something which we will usually discuss on the public forums unless the information is already public in some other form.
We are however very willing to discuss the programmer visible behaviour at the architectural level as this is something which developers need to know to use ARM effectively.
Hello Peter Harris,
I don't want to know the implementation details.
I only want to know what is the name for each pipeline stage.
What does the 6 stage pipeline stage consist of in the old slide and 4 stage in the new slide?
Does it also include implementation matter?
I think you've got all of the information in your first post - the diagram of the stages includes all of the names.
In terms of stage counting, it basically depends if you include instruction fetch, instruction decode, and/or register writeback as part of the processing pipeline or not.
There is no entirely consistent standard convention here - and a lot depends on how the architecture functions and what makes sense for one design may not make sense for another. I've seen multiple different approaches across multiple architectures (not just ARM).
It is a relatively common convention in CPUs not to include instruction fetch in the pipeline length, as that can be very aggressively pipelined and, as instructions are small, can be "over fetched", meaning that it is rarely a critical path.
Retire and/or register writeback may or may not be included. If you have result forwarding from the execute stage so that the next instruction can use the result of an instruction without waiting for the physical register writeback then it may make sense not to count the writeback cycle as it doesn't impact performance. If you have a simpler design where data results are only exchanged via the main register file then that cycle becomes important as you may get bubbles, so you probably want to include it.
At a guess (I don't work on the CPUs, so I'm guessing slightly) the "4 cycles" number in this case seems to not include instruction fetch, decode, or retire, so only counts the issue cycle and three data processing cycles. I assume the floating point pipeline has one extra data processing cycle, hence 5 not 4.
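To make the counting point concrete, here is a minimal sketch of how the same pipeline yields different "stage counts" depending on which stages a convention leaves out. The stage list is the one read off the old slide earlier in this thread; the function name and the three conventions are made up for illustration and are not a statement of how ARM actually counts.

```python
# Stage names as read from the old slide earlier in this thread.
OLD_SLIDE_STAGES = ["Fetch", "Decode", "Issue", "Execute #1", "Execute #2", "Write/Store"]


def count_stages(stages, not_counted=()):
    """Return how many stages remain after dropping the ones a convention ignores."""
    return sum(1 for stage in stages if stage not in not_counted)


# Three hypothetical counting conventions applied to the same pipeline.
count_everything = count_stages(OLD_SLIDE_STAGES)
skip_fetch = count_stages(OLD_SLIDE_STAGES, not_counted={"Fetch"})
skip_fetch_and_writeback = count_stages(
    OLD_SLIDE_STAGES, not_counted={"Fetch", "Write/Store"}
)

print(count_everything, skip_fetch, skip_fetch_and_writeback)  # 6 5 4
```

The hardware is identical in all three cases; only the bookkeeping changes, which is why two slides can quote different numbers without either describing a different design.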
I think Yasuhiko has a valid point here. These two diagrams are conflicting and it makes a lot of sense to set the record straight to determine which one is the golden reference. I prefer the new diagram because it is more accurate and has more details in it.
Now in terms of your explanation of the discrepancy in pipeline stage counting (I will just focus on the main points):
1. Instruction fetch is probably the most critical stage in the pipeline especially if you're fetching instructions from flash (slower memory/cache). The fetch width, speed, etc. will set the pace for the rest of the stages. You can either starve or flood the processor at this stage. In a dual-issue processor like Cortex-M7 the fetch bandwidth is even more critical. I very much doubt that the fetch stage was not included in the stage count. I am not too worried about this as I know the answer.
2. Result forwarding or RegisterFile bypass is not an architectural feature but an implementation decision. Result forwarding is performed in conjunction with, not instead of, writing the result back into the RF. The write-back to the RF is therefore always executed regardless of whether the result is forwarded, and so does not affect pipeline stage counting.
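Personally, I would really appreciate an answer to these two questions.
1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.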
2. The Integer RF has 4 write ports which implies that the Cortex-M7 can retire 4 instructions in parallel. Two integer ALU operations and one, I assume, for the MAC, am I right? How about the 4th operation?
Actually, these questions are almost identical and can be replaced with: how many operations can the Cortex-M7 issue simultaneously in a single cycle? and what are these operations? The official Cortex-M7 manual is not very clear on this.
> 1. Instruction fetch is probably the most critical stage in the pipeline especially if you're fetching instructions from flash (slower memory/cache).
If you're limited by flash and memory bandwidth on instruction fetch, then I'm not sure counting pipeline stages is going to solve your problem. Your memory fetch latency will far outweigh the differences in pipeline stage counting.
Speaking generally (not about Cortex-M7 in particular) - due to speculative prefetch, branch prediction, instruction caching, and loop buffers, instruction fetch is often not on the critical path in many CPU designs, even very wide ones. If you have a cache which can return 128 bits per clock, you can issue 8 16-bit Thumb2 instructions on the back of that single fetch cycle. Decode may or may not be more tightly coupled - YMMV - it does depend on the design. The convention is not to include it, as the assumption is you're not thrashing the I-cache and don't have fetch bandwidth problems.
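As a quick check of the fetch arithmetic above, here is a small sketch. The 128-bit width is the generic example from the post, not a documented Cortex-M7 figure, and the helper names are hypothetical. The issue width of 2 stands in for a dual-issue core and assumes every issue slot is filled with no branches or stalls.

```python
def instructions_per_fetch(fetch_width_bits, instruction_bits):
    """Whole instructions delivered by one fetch of the given width."""
    return fetch_width_bits // instruction_bits


def cycles_covered(fetch_width_bits, instruction_bits, issue_width):
    """Issue cycles one fetch can feed, assuming every issue slot is filled."""
    return instructions_per_fetch(fetch_width_bits, instruction_bits) / issue_width


print(instructions_per_fetch(128, 16))          # 8
print(instructions_per_fetch(128, 32))          # 4
print(cycles_covered(128, 16, issue_width=2))   # 4.0
```

Under these idealized assumptions a single wide fetch keeps a dual-issue back end busy for several cycles, which is the sense in which fetch can run ahead of the rest of the pipeline.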
> 1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.
I don't believe so. I would expect maximum issue rate is dual issue. Note that 6 ports is entirely consistent with this (two 32-bit input operands for an ALU op, two 32-bit and one 64-bit input operands for a 32*32 + 64 MAC running in parallel). I'm not entirely sure how the 32x32+64 MAC pipelines - based on the comments in the slide about 16x16+32 being single cycle I assume that the wider one isn't.
6 would also be needed for a 16x16+32 MAC (three registers) and a parallel 64-bit store (two data registers, one address register).
> 2. The Integer RF has 4 write ports which implies that the Cortex-M7 can retire 4 instructions in parallel. Two integer ALU operations and one, I assume, for the MAC, am I right? How about the 4th operation?
The MAC result can be 64-bit so needs two write ports, the load/store unit is 64-bit, so needs two write ports.
Actually, if you update the address as part of the load or store then the store unit needs three write ports (two for data one for modified address), so normal ALU + ST could use 4 easily enough.
Again nothing is inconsistent with simple dual issue here.
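The port arithmetic in this post can be tallied with a small sketch. The per-operation port counts below are taken from the reasoning above, not from ARM documentation. The table and function names are hypothetical, and the single write port for a plain ALU result is an assumption.

```python
# 32-bit register-file read ports per operation, following the post above.
READ_PORTS = {
    "alu": 2,            # two 32-bit input operands
    "mac_32x32_64": 4,   # two 32-bit operands + one 64-bit accumulator
    "mac_16x16_32": 3,   # three registers
    "store_64": 3,       # two data registers + one address register
}

# 32-bit write ports per operation. The "alu" entry is an assumption
# (one 32-bit result); the others follow the post above.
WRITE_PORTS = {
    "alu": 1,
    "mac_64bit_result": 2,
    "load_store_64": 2,
    "load_store_64_addr_update": 3,  # two data + one modified address
}


def ports_needed(table, first, second):
    """Ports used when two operations are issued in the same cycle."""
    return table[first] + table[second]


print(ports_needed(READ_PORTS, "alu", "mac_32x32_64"))                 # 6
print(ports_needed(READ_PORTS, "mac_16x16_32", "store_64"))            # 6
print(ports_needed(WRITE_PORTS, "alu", "load_store_64_addr_update"))   # 4
print(ports_needed(WRITE_PORTS, "mac_64bit_result", "load_store_64"))  # 4
```

Every pairing listed reaches the 6 read or 4 write ports shown in the diagram with only two operations in flight, so the port counts alone do not imply triple issue.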
I think I saw somewhere that a MAC or FMAC instruction could be completed out of order, something about enabling it to start another one every cycle, whereas everything else was retired strictly in order. Sounds a little hairy to me, but they're important operations so worth some extra work. If so, those pipelines are quite different from anything else, so that may be why the MAC is shown as a different pipeline.
Not sure what you mean by "completed out of order". No instruction can complete/commit out of order, otherwise the program will crash. Actually, there has been academic research on out-of-order completion, but the overhead is so high that it's counterproductive to implement in an actual processor, especially a low-power embedded GPP. Most likely you're referring to out-of-order issue. However, Cortex-M7 is an in-order issue processor. Plus there is nothing special about the MAC operation; on most processors a MAC will consume a single cycle. On the Cortex-M7 the MAC latency is 2 cycles with a throughput of 1 MAC/cycle.
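The difference between latency and throughput can be shown with a short sketch. It takes the 2-cycle latency and 1-per-cycle throughput quoted in this post at face value, and assumes an idealized, fully pipelined unit running independent operations with no stalls. The function names are hypothetical.

```python
def pipelined_cycles(n_ops, latency, ops_per_cycle=1):
    """Cycles to finish n independent ops on a fully pipelined unit with no stalls."""
    if n_ops == 0:
        return 0
    issue_cycles = -(-n_ops // ops_per_cycle)  # ceiling division
    return latency + issue_cycles - 1


def unpipelined_cycles(n_ops, latency):
    """Cycles if each op had to finish before the next one could start."""
    return n_ops * latency


print(pipelined_cycles(1, latency=2))      # 2
print(pipelined_cycles(100, latency=2))    # 101
print(unpipelined_cycles(100, latency=2))  # 200
```

For long runs of independent MACs the cost per operation approaches one cycle even though each individual result takes two cycles to appear. A chain where each MAC depends on the previous result would not fit this model unless the accumulator is forwarded.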
As for why the MAC uses so many RF ports (which also explains why it's in a track/pipeline by itself), I had to review the Cortex-M7 technical manual to understand what's going on. The MAC operation in the Cortex-M7, as in the rest of the Cortex-M processors, unfortunately is not a pure MAC but a Multiply+Add operation. A MAC instruction takes 3 operands (2 for the multiplication and 1 for the accumulator), as the accumulator is stored in the RF and has to be loaded and stored with every MAC operation. This explains the additional ports in the RF, as Pete mentioned in his last message (2 read and 2 write ports to load/store 64-bit operands).
During my investigation I checked Cortex-M7 machine description in the GNU compiler source code and saw that none of the instructions are single latency! The diagrams provided by Yasuhiko suggest that the ALU#2 pipeline has 1 cycle latency and the ALU#1 pipeline has 2 cycles latency which is not the case in the GNU compiler description. Did anyone use the GNU compiler for Cortex-M7? Any problems or issues? The version I am using is (gcc-arm-none-eabi-4_9-2015q1-20150306).
I also couldn't find any instruction timing details for the Cortex-M7 in the ARM documentation. It would be nice to have something similar to this Cortex-M4 instruction summary: ARM Information Center. Does anyone know if this is available somewhere?
Found where it said that
Supercharging the Embedded Device: ARM Cortex-M7
which is a much more reliable source, doesn't say anything like that.