Hello experts,
ARM recently updated the Cortex-M7 information.
I think the biggest news is that the pipeline details have been published.
The new information says that the integer pipeline has 4 stages and the floating-point pipeline has 5 stages.
However, the earlier information said that the pipeline had 6 stages.
Where does this difference come from?
I would like a concrete explanation of each stage.
What happens in the first stage, the second stage, the third stage, the fourth stage, and so on?
Best regards,
Yasuhiko Koumoto.
I think I read somewhere that a MAC or FMAC instruction could complete out of order, so that another one could start every cycle, whereas everything else is retired strictly in order. That sounds a little hairy to me, but these are important operations, so they are worth some extra work. If so, those pipelines are quite different from everything else, which may be why the MAC is shown as a separate pipeline.
I'm not sure what you mean by "completed out of order". No instruction can complete/commit out of order, otherwise the program will crash. There is academic research on out-of-order completion, but the overhead is so high that it is counterproductive to implement in an actual processor, especially a low-power embedded GPP. Most likely you are referring to out-of-order issue. However, the Cortex-M7 is an in-order issue processor. Besides, there is nothing special about the MAC operation; on most processors a MAC takes a single cycle. On the Cortex-M7 the MAC latency is 2 cycles, with a throughput of 1 MAC per cycle.
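To make the latency/throughput distinction concrete, here is a rough sketch in C. It is an idealized model, not a description of the actual Cortex-M7 pipeline: it assumes the MACs are independent, that one can be issued every cycle, and that nothing stalls. The function names are made up for illustration.

```c
#include <assert.h>

/* Idealized cycle count for n independent operations on a fully
 * pipelined unit: the first result appears after `latency` cycles,
 * then one more result completes every cycle (throughput of 1/cycle).
 * Assumption: no stalls and no dependencies between operations. */
unsigned pipelined_cycles(unsigned n, unsigned latency)
{
    assert(latency >= 1);
    return n == 0 ? 0 : latency + (n - 1);
}

/* Same work on a unit that is NOT pipelined: each operation must
 * finish before the next one can start. */
unsigned unpipelined_cycles(unsigned n, unsigned latency)
{
    assert(latency >= 1);
    return n * latency;
}
```

With a latency of 2, 100 independent MACs take 101 cycles in the pipelined model and 200 cycles in the unpipelined one, which is why a 2-cycle latency can coexist with a throughput of 1 MAC per cycle.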
As for why the MAC uses so many RF ports (which also explains why it is in a track/pipeline by itself), I had to review the Cortex-M7 technical manual to understand what is going on. The MAC operation in the Cortex-M7, as in the rest of the Cortex-M processors, is unfortunately not a pure MAC but a multiply+add operation. A MAC instruction takes 3 operands (2 for the multiplication and 1 for the accumulator), because the accumulator is stored in the RF and has to be read and written back with every MAC operation. This explains the additional RF ports that Pete mentioned in his last message (2 read and 2 write ports to load/store 64-bit operands).
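A small C model of the instruction semantics may make the register traffic easier to see. This is only a sketch of what the ARM MLA and SMLAL instructions compute, not of how the Cortex-M7 implements them; the function names mla32 and smlal64 are made up for illustration.

```c
#include <stdint.h>

/* Models MLA Rd, Rn, Rm, Ra: Rd = Ra + Rn * Rm, keeping the low 32 bits.
 * Three source registers are read and one destination is written. */
uint32_t mla32(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return (uint32_t)(ra + (uint32_t)((uint64_t)rn * rm));
}

/* Models SMLAL RdLo, RdHi, Rn, Rm: the 64-bit accumulator lives in two
 * 32-bit registers, so each MAC reads RdLo and RdHi (plus Rn and Rm)
 * and writes both halves back. */
void smlal64(uint32_t *rd_lo, uint32_t *rd_hi, int32_t rn, int32_t rm)
{
    uint64_t acc = ((uint64_t)*rd_hi << 32) | *rd_lo;

    /* The signed 32x32 product always fits in 64 bits; the addition is
     * done unsigned so that wrap-around is well defined in C. */
    acc += (uint64_t)((int64_t)rn * (int64_t)rm);

    *rd_lo = (uint32_t)acc;
    *rd_hi = (uint32_t)(acc >> 32);
}
```

In the 64-bit case the accumulator alone accounts for two register reads and two register writes per instruction, which matches the extra ports mentioned above.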
During my investigation I checked the Cortex-M7 machine description in the GNU compiler source code and saw that none of the instructions have single-cycle latency! The diagrams provided by Yasuhiko suggest that the ALU#2 pipeline has a 1-cycle latency and the ALU#1 pipeline has a 2-cycle latency, which is not the case in the GNU compiler description. Has anyone used the GNU compiler for the Cortex-M7? Any problems or issues? The version I am using is gcc-arm-none-eabi-4_9-2015q1-20150306.
I also couldn't find any instruction timing details for the Cortex-M7 in the ARM documentation. It would be nice to have something similar to this Cortex-M4 instruction summary: ARM Information Center. Does anyone know if this is available somewhere?
I found where it said that:
https://semiaccurate.com/2015/04/30/arm-goes-great-detail-m7-core/
However, "Supercharging the Embedded Device: ARM Cortex-M7", which is a much more reliable source, doesn't say anything like that.
I seriously wouldn't waste any time on that first article. It is a complete joke. I also noticed that numerous online references use the term "out-of-order completion" synonymously with "out-of-order execution" or "out-of-order issue". Anyway, the Cortex-M7 is an in-order processor, so none of these techniques are relevant to it.
I am a bit disappointed, though, that SIMD is only supported in a single pipeline. It would be interesting to see how the Cortex-M7 compares to the Cortex-M4 in terms of SIMD performance. Has anyone done such a comparison and can share the results?