Hello experts,
recently ARM updated the Cortex-M7 information.
I think the biggest news is that the pipeline details have been disclosed.
The new information says that the integer pipeline has 4 stages and the floating-point pipeline has 5 stages.
However, the earlier information said that the pipeline had 6 stages.
Where does this difference come from?
I would like a concrete explanation of each stage.
What is the first stage, what is the second stage, what is the third stage, what is the fourth stage, and so on?
Best regards,
Yasuhiko Koumoto.
Hello Peter Harris,
I don't want to know the implementation details.
I only want to know the name of each pipeline stage.
What do the 6 pipeline stages in the old slide consist of, and what do the 4 stages in the new slide consist of?
Does that also count as an implementation matter?
I think you've got all of the information in your first post - the diagram of the stages includes all of the names.
In terms of stage counting, it basically depends if you include instruction fetch, instruction decode, and/or register writeback as part of the processing pipeline or not.
There is no entirely consistent standard convention here - and a lot depends on how the architecture functions and what makes sense for one design may not make sense for another. I've seen multiple different approaches across multiple architectures (not just ARM).
It is a relatively common convention in CPUs not to include instruction fetch in the pipeline length, as fetch can be very aggressively pipelined and, because instructions are small, they can be "over fetched", meaning that fetch is rarely on the critical path.
Retire and/or register writeback may or may not be included. If you have result forwarding from the execute stage so that the next instruction can use the result of an instruction without waiting for the physical register writeback then it may make sense not to count the writeback cycle as it doesn't impact performance. If you have a simpler design where data results are only exchanged via the main register file then that cycle becomes important as you may get bubbles, so you probably want to include it.
At a guess (I don't work on the CPUs, so I'm guessing slightly) the "4 cycles" number in this case seems to not include instruction fetch, decode, or retire, so only counts the issue cycle and three data processing cycles. I assume the floating point pipeline has one extra data processing cycle, hence 5 not 4.
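To illustrate how much the number depends on the counting convention, here is a minimal sketch. The stage names and the single fetch/decode/retire stages are hypothetical placeholders, not the real Cortex-M7 stage names, and the excluded set is just my guess from above.

```python
# Hypothetical stage lists: one fetch, one decode, one issue stage,
# some data-processing stages and one retire stage. Illustrative names only.
INTEGER_PIPE = ["fetch", "decode", "issue",
                "execute1", "execute2", "execute3", "retire"]
# Assumption: the floating-point pipe has one extra data-processing stage.
FLOAT_PIPE = INTEGER_PIPE[:-1] + ["execute4", "retire"]


def count_stages(pipe, excluded=()):
    """Count the stages of `pipe` whose names are not in `excluded`."""
    return sum(1 for stage in pipe if stage not in excluded)


# The guessed convention: leave out fetch, decode and retire.
GUESSED_EXCLUSIONS = ("fetch", "decode", "retire")

print(count_stages(INTEGER_PIPE, GUESSED_EXCLUSIONS))  # 4
print(count_stages(FLOAT_PIPE, GUESSED_EXCLUSIONS))    # 5
# A different convention gives a different number for the same pipe.
print(count_stages(INTEGER_PIPE, ("fetch",)))          # 6
```

The last line only shows that the count is sensitive to the convention; it does not establish how the older slide was actually counted.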
HTH, Pete
Pete,
I think Yasuhiko has a valid point here. These two diagrams are conflicting and it makes a lot of sense to set the record straight to determine which one is the golden reference. I prefer the new diagram because it is more accurate and has more details in it.
Now in terms of your explanation of the discrepancy in pipeline stage counting (I will just focus on the main points):
1. Instruction fetch is probably the most critical stage in the pipeline especially if you're fetching instructions from flash (slower memory/cache). The fetch width, speed, etc. will set the pace for the rest of the stages. You can either starve or flood the processor at this stage. In a dual-issue processor like Cortex-M7 the fetch bandwidth is even more critical. I very much doubt that the fetch stages were not included in the stage count. I am not too worried about this as I know the answer.
2. Result forwarding or register file bypass is not an architectural feature but an implementation decision. Result forwarding is performed in conjunction with, not instead of, writing the result back into the RF. This means that the write-back to the RF is always executed regardless of whether the result is forwarded, and therefore forwarding does not affect pipeline stage counting.
Personally, I would really appreciate answers to these two questions:
1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.
2. The Integer RF has 4 write ports which implies that the Cortex-M7 can retire 4 instructions in parallel. Two integer ALU operations and one, I assume, for the MAC, am I right? How about the 4th operation?
Actually, these questions are almost identical and can be replaced with: how many operations can the Cortex-M7 issue simultaneously in a single cycle? and what are these operations? The official Cortex-M7 manual is not very clear on this.
Thanks,
HBL
> 1. Instruction fetch is probably the most critical stage in the pipeline especially if you're fetching instructions from flash (slower memory/cache).
If you're limited by flash and memory bandwidth on instruction fetch, then I'm not sure counting pipeline stages is going to solve your problem. Your memory fetch latency will far outweigh the differences in pipeline stage counting.
Speaking generally (not about Cortex-M7 in particular) - due to speculative prefetch, branch prediction, instruction caching, and loop buffers, instruction fetch is not often on the critical path for many CPU designs. Even very wide ones - if you have a cache which can return 128 bits per clock you can issue 8 16-bit Thumb-2 instructions on the back of that single fetch cycle. Decode may or may not be more tightly coupled - YMMV - it does depend on the design. The convention is not to include fetch, as the assumption is you're not thrashing the I-cache and don't have fetch bandwidth problems.
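The arithmetic behind that example is easy to sketch. The 128-bit fetch width is just the generic figure used above, not a statement about the Cortex-M7 fetch interface, and the function name is made up for illustration.

```python
def instructions_per_fetch(fetch_width_bits, instruction_bits):
    """How many whole instructions one fetch of the given width delivers."""
    return fetch_width_bits // instruction_bits


ISSUE_WIDTH = 2  # a dual-issue core consumes at most two instructions per cycle

# Generic example from the paragraph above: 128-bit fetch, 16-bit encodings.
narrow = instructions_per_fetch(128, 16)
# Same fetch width with 32-bit encodings.
wide = instructions_per_fetch(128, 32)

print(narrow, wide)          # 8 4
print(narrow / ISSUE_WIDTH)  # 4.0 cycles of issue covered by one fetch
```

With numbers like these a single fetch covers several issue cycles, which is the "over fetch" headroom that usually keeps fetch off the critical path.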
> 1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.
I don't believe so. I would expect maximum issue rate is dual issue. Note that 6 ports is entirely consistent with this (two 32-bit input operands for an ALU op, two 32-bit and one 64-bit input operands for a 32*32 + 64 MAC running in parallel). I'm not entirely sure how the 32x32+64 MAC pipelines - based on the comments in the slide about 16x16+32 being single cycle I assume that the wider one isn't.
6 would also be needed for a 16x16+32 MAC (three registers) and a parallel 64-bit store (two data registers, one address register).
> 2. The Integer RF has 4 write ports which implies that the Cortex-M7 can retire 4 instructions in parallel. Two integer ALU operations and one, I assume, for the MAC, am I right? How about the 4th operation?
The MAC result can be 64-bit so needs two write ports, the load/store unit is 64-bit, so needs two write ports.
Actually, if you update the address as part of the load or store then the store unit needs three write ports (two for data, one for the modified address), so a normal ALU + ST could use 4 easily enough.
Again, nothing is inconsistent with simple dual issue here.
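The port counting above can be written down as a small sketch. The per-operation numbers are the ones argued in this post, counted in 32-bit registers; the operation names are made up, and the single write port for a plain ALU op is my assumption.

```python
# Read ports per operation, in 32-bit registers.
READ_PORTS = {
    "alu": 2,            # two 32-bit input operands
    "mac_32x32_64": 4,   # two 32-bit operands + one 64-bit accumulator
    "mac_16x16_32": 3,   # three registers
    "store_64": 3,       # two data registers + one address register
}

# Write ports per operation, in 32-bit registers.
WRITE_PORTS = {
    "alu": 1,               # assumption: one 32-bit result
    "mac_32x32_64": 2,      # 64-bit result
    "load_64": 2,           # 64-bit load/store unit
    "mem_64_writeback": 3,  # two for data + one for the modified address
}


def ports_needed(table, *ops):
    """Total ports used when the given operations issue in the same cycle."""
    return sum(table[op] for op in ops)


print(ports_needed(READ_PORTS, "alu", "mac_32x32_64"))       # 6
print(ports_needed(READ_PORTS, "mac_16x16_32", "store_64"))  # 6
print(ports_needed(WRITE_PORTS, "alu", "mem_64_writeback"))  # 4
```

Every pairing here is only two operations, which is why 6 read ports and 4 write ports do not by themselves imply anything wider than dual issue.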
HTH,
Pete
I think I read somewhere that a MAC or FMAC instruction could complete out of order, something about enabling another one to start every cycle, whereas everything else was retired strictly in order. Sounds a little hairy to me, but they're important operations so worth some extra work. If so, those pipelines are quite different from anything else, so that may be why the MAC is shown as a different pipeline.
Not sure what you mean by "completed out of order". No instruction can complete/commit out of order, otherwise the program will crash. Actually, there have been several academic research efforts on out-of-order completion, but the overhead is so high that it's counterproductive to implement in an actual processor, especially a low-power embedded GPP. Most likely you're referring to out-of-order issue. However, the Cortex-M7 is an in-order issue processor. Plus there is nothing special about the MAC operation; on most processors a MAC will consume a single cycle. On the Cortex-M7 the MAC latency is 2 cycles with 1 MAC/cycle throughput.
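To make the difference between latency and throughput concrete, here is a minimal sketch for a run of independent MACs. It uses only the two figures quoted above (2-cycle latency, one MAC started per cycle); the function names are made up and nothing else about the core is modelled.

```python
def pipelined_cycles(n_ops, latency, issue_interval=1):
    """Cycles to finish n_ops independent operations on a pipelined unit
    that can start a new operation every `issue_interval` cycles."""
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # operation adds one issue interval.
    return latency + (n_ops - 1) * issue_interval


def unpipelined_cycles(n_ops, latency):
    """Cycles if each operation had to finish before the next could start."""
    return n_ops * latency


print(pipelined_cycles(8, latency=2))    # 9
print(unpipelined_cycles(8, latency=2))  # 16
```

So a 2-cycle latency costs one extra cycle over the whole run, as long as the MACs do not depend on each other.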
As for why the MAC uses so many RF ports (which also explains why it's in a track/pipeline by itself), I had to review the Cortex-M7 technical manual to understand what's going on. The MAC operation in the Cortex-M7, like that of the other Cortex-M processors, is unfortunately not a pure MAC but a multiply+add operation. A MAC instruction takes 3 operands (2 for the multiplication and 1 for the accumulator), because the accumulator is stored in the RF and has to be read and written back with every MAC operation. This explains the additional ports in the RF, as Pete mentioned in his last message (2 read and 2 write ports to load/store 64-bit operands).
During my investigation I checked the Cortex-M7 machine description in the GNU compiler source code and saw that none of the instructions has single-cycle latency! The diagrams provided by Yasuhiko suggest that the ALU#2 pipeline has a 1-cycle latency and the ALU#1 pipeline has a 2-cycle latency, which is not the case in the GNU compiler description. Has anyone used the GNU compiler for Cortex-M7? Any problems or issues? The version I am using is (gcc-arm-none-eabi-4_9-2015q1-20150306).
I also couldn't find any instruction timing details for the Cortex-M7 in ARM's documentation. It would be nice to have something similar to this Cortex-M4 instruction summary: ARM Information Center. Does anyone know if this is available somewhere?
I found where it said that:
https://semiaccurate.com/2015/04/30/arm-goes-great-detail-m7-core/
However,
Supercharging the Embedded Device: ARM Cortex-M7
which is a much more reliable source, doesn't say anything like that.
I seriously wouldn't waste any time on that 1st article. It is a complete joke. I also noticed that numerous online references use the term "out-of-order completion" synonymously with "out-of-order execution" or "out-of-order issue". Anyway, the Cortex-M7 is an in-order processor so none of these techniques are relevant to it.
I am a bit disappointed, though, that SIMD is only supported in a single pipeline. It would be interesting to see how the Cortex-M7 compares to the Cortex-M4 in terms of SIMD performance. Has anyone done such a comparison and can share the results?