Hello experts,
ARM recently updated its Cortex-M7 information.
I think the biggest news is that the pipeline details have been disclosed.
The new information says that the integer pipeline has 4 stages and the floating-point pipeline has 5 stages.
However, the earlier information said that it had 6 stages.
Where does this difference come from?
I would like a concrete explanation of each stage.
What is the first stage, the second, the third, the fourth, and so on?
Best regards,
Yasuhiko Koumoto.
> 1. Instruction fetch is probably the most critical stage in the pipeline especially if you're fetching instructions from flash (slower memory/cache).
If you're limited by flash and memory bandwidth on instruction fetch, then I'm not sure counting pipeline stages is going to solve your problem. Your memory fetch latency will far outweigh the differences in pipeline stage counting.
Speaking generally (not about Cortex-M7 in particular): due to speculative prefetch, branch prediction, instruction caching, and loop buffers, instruction fetch is not often on the critical path for many CPU designs. That holds even for very wide ones: if you have a cache which can return 128 bits per clock, you can issue 8 16-bit Thumb2 instructions on the back of that single fetch cycle. Decode may or may not be more tightly coupled; YMMV, it does depend on the design. The convention is not to count fetch as a pipeline stage, as the assumption is you're not thrashing the I-cache and don't have fetch bandwidth problems.
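To put numbers on the fetch-width point, here is a minimal sketch of the arithmetic. The 128-bit fetch width is the illustrative figure from the paragraph above, not a documented Cortex-M7 parameter, and `instructions_per_fetch` is a hypothetical helper name.

```python
# Rough fetch-bandwidth arithmetic. The widths are illustrative values
# from the discussion above, not documented Cortex-M7 figures.
def instructions_per_fetch(fetch_bits: int, instr_bits: int) -> int:
    """Whole instructions of instr_bits that fit in one fetch of fetch_bits."""
    return fetch_bits // instr_bits

narrow = instructions_per_fetch(128, 16)  # all 16-bit Thumb encodings
wide = instructions_per_fetch(128, 32)    # all 32-bit Thumb-2 encodings
print(narrow, wide)  # 8 4
```

Even in the worst case of all 32-bit encodings, one such fetch would cover more instructions than a dual-issue core can consume in a cycle.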
> 1. the MAC seems to be in a pipeline stage by itself. Does this mean that a Cortex-M7 is ultimately capable of issuing 2 ALU + 1 MAC/MUL operations per cycle? which explains the 6 read ports in the RF.
I don't believe so. I would expect the maximum issue rate to be dual issue. Note that 6 ports is entirely consistent with this (two 32-bit input operands for an ALU op, plus two 32-bit and one 64-bit input operands for a 32x32+64 MAC running in parallel). I'm not entirely sure how the 32x32+64 MAC pipelines; based on the comments in the slide about 16x16+32 being single cycle, I assume that the wider one isn't.
6 would also be needed for a 16x16+32 MAC (three registers) and a parallel 64-bit store (two data registers, one address register).
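The read-port counting above can be tallied in a few lines. This is only a sketch: the per-operation register counts are the ones assumed in this reply, and the names `READS` and `read_ports` are made up for illustration.

```python
# Register-file read ports per operation, counted in 32-bit registers.
# These counts are the assumptions made in this thread, not ARM data.
READS = {
    "alu": 2,           # two 32-bit source operands
    "mac_32x32_64": 4,  # two multiplicands + 64-bit accumulator (two registers)
    "mac_16x16_32": 3,  # two multiplicands + 32-bit accumulator
    "store_64": 3,      # two data registers + one address register
}

def read_ports(*ops: str) -> int:
    """Total read ports needed to issue the given operations together."""
    return sum(READS[op] for op in ops)

alu_plus_wide_mac = read_ports("alu", "mac_32x32_64")
narrow_mac_plus_store = read_ports("mac_16x16_32", "store_64")
print(alu_plus_wide_mac, narrow_mac_plus_store)  # 6 6
```

Both pairings land on exactly 6, so a dual-issue design already accounts for the port count.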
> 2. The Integer RF has 4 write ports which implies that the Cortex-M7 can retire 4 instructions in parallel. Two integer ALU operations and one, I assume, for the MAC, am I right? How about the 4th operation?
The MAC result can be 64-bit, so it needs two write ports; the load/store unit is 64-bit, so it needs two write ports as well.
Actually, if you update the address as part of the load or store, then the load/store unit needs three write ports on a load (two for data, one for the modified address), so a normal ALU + LD could use 4 easily enough.
Again nothing is inconsistent with simple dual issue here.
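The same kind of tally works for the write side. Again, this is a sketch under the assumptions in this reply; `WRITES` and `write_ports` are hypothetical names, and the counts are not taken from ARM documentation.

```python
# Register-file write ports per operation, counted in 32-bit registers.
# These counts follow the reasoning in this thread, not ARM data.
WRITES = {
    "alu": 1,                # one 32-bit result
    "mac_32x32_64": 2,       # 64-bit result spans two registers
    "load_64": 2,            # two loaded data registers
    "load_64_writeback": 3,  # two loaded data registers + updated address
}

def write_ports(*ops: str) -> int:
    """Total write ports needed to retire the given operations together."""
    return sum(WRITES[op] for op in ops)

mac_plus_load = write_ports("mac_32x32_64", "load_64")
alu_plus_writeback = write_ports("alu", "load_64_writeback")
print(mac_plus_load, alu_plus_writeback)  # 4 4
```

Either pairing fills 4 write ports with only two instructions in flight.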
HTH,
Pete
I think I saw somewhere that one MAC or FMAC instruction could be completed out of order, something about enabling another one to start every cycle, whereas everything else was retired strictly in order. Sounds a little hairy to me, but they're important operations so worth some extra work. If so, those pipelines are quite different from anything else, so that may be why the MAC is shown as a different pipeline.
Not sure what you mean by "completed out of order". No instruction can complete/commit out of order, otherwise the program will crash. Actually, there are several academic studies on out-of-order completion, but the overhead is so high that it's counterproductive to implement in an actual processor, especially a low-power embedded GPP. Most likely you're referring to out-of-order issue. However, the Cortex-M7 is an in-order issue processor. Plus, there is nothing special about the MAC operation; on most processors a MAC will consume a single cycle. On the Cortex-M7 the MAC latency is 2 cycles with a throughput of 1 MAC/cycle.
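To make the latency versus throughput distinction concrete, here is a small sketch. It uses the 2-cycle latency and 1-per-cycle throughput quoted above; the function names are hypothetical, and the dependent-chain case assumes no early forwarding of the accumulator, which may not match the real core.

```python
# Cycle counts for a pipelined unit with 2-cycle latency, 1 op/cycle throughput.
# The dependent case assumes no accumulator forwarding (an assumption).
def independent_cycles(n_ops: int, latency: int = 2, issue_interval: int = 1) -> int:
    """Cycles until the last of n independent ops completes."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval

def dependent_cycles(n_ops: int, latency: int = 2) -> int:
    """Cycles when each op must wait for the previous result."""
    return n_ops * latency

print(independent_cycles(8), dependent_cycles(8))  # 9 16
```

With independent operations the extra cycle of latency is paid only once, which is how a 2-cycle MAC can still sustain one result per cycle while staying strictly in order.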
As for why the MAC uses so many RF ports (which also explains why it's in a track/pipeline by itself), I had to review the Cortex-M7 technical manual to understand what's going on. The MAC operation in the Cortex-M7, as in the rest of the Cortex-M processors, unfortunately is not a pure MAC but a multiply+add operation. A MAC instruction takes 3 operands (2 for the multiplication and 1 for the accumulator), as the accumulator is stored in the RF and has to be read and written back with every MAC operation. This explains the additional ports in the RF, as Pete mentioned in his last message (2 read and 2 write ports to read and write back 64-bit operands).
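As an illustration of the register traffic, here is a small Python model of a signed 32x32+64 multiply-accumulate in the style of SMLAL. It is only an arithmetic sketch, not a description of the hardware; `smlal_model` and `to_signed32` are hypothetical names.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def smlal_model(rd_lo: int, rd_hi: int, rn: int, rm: int) -> tuple:
    """Signed 32x32+64 multiply-accumulate with 64-bit wraparound.

    Reads four 32-bit registers (rd_lo, rd_hi, rn, rm) and returns the
    two 32-bit halves of the result as (new rd_lo, new rd_hi).
    """
    acc = ((rd_hi & MASK32) << 32) | (rd_lo & MASK32)
    result = (acc + to_signed32(rn) * to_signed32(rm)) & MASK64
    return result & MASK32, result >> 32

# Carry out of the low word: 0x00000000_FFFFFFFF + 1*1
print(smlal_model(0xFFFFFFFF, 0, 1, 1))  # (0, 1)
```

The accumulator goes in as two registers and comes back out as two registers, which is where the extra read and write ports per MAC come from.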
During my investigation I checked the Cortex-M7 machine description in the GNU compiler source code and saw that none of the instructions have single-cycle latency! The diagrams provided by Yasuhiko suggest that the ALU#2 pipeline has 1 cycle of latency and the ALU#1 pipeline has 2 cycles of latency, which is not the case in the GNU compiler description. Has anyone used the GNU compiler for Cortex-M7? Any problems or issues? The version I am using is (gcc-arm-none-eabi-4_9-2015q1-20150306).
I also couldn't find any instruction timing details for the Cortex-M7 in the ARM documentation. It would be nice to have something similar to this Cortex-M4 instruction summary: ARM Information Center. Does anyone know if this is available somewhere?
Found where it said that
https://semiaccurate.com/2015/04/30/arm-goes-great-detail-m7-core/
However
Supercharging the Embedded Device: ARM Cortex-M7
which is a much more reliable source, doesn't say anything like that.
I seriously wouldn't waste any time on that first article. It is a complete joke. I also noticed that numerous online references use the term "out-of-order completion" synonymously with "out-of-order execution" or "out-of-order issue". Anyway, the Cortex-M7 is an in-order processor, so none of these techniques are relevant to it.
I am a bit disappointed, though, that SIMD is only supported in a single pipeline. It would be interesting to see how the Cortex-M7 compares to the Cortex-M4 in terms of SIMD performance. Has anyone done such a comparison and can share the results?