Hi all,
I did some investigation comparing FPU-based algorithms on CM4 and CM7 cores. All code and data were placed in single-cycle memory, fully utilizing the modified / true Harvard architecture, which means:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus, fully utilizing the modified Harvard architecture
- on CM7 - code in I-TCM memory, data in DTCM memory
Most of the code (99% of the instructions) is floating point, meaning the instructions are not interleaved with integer instructions (this is most probably down to the compiler; to be honest, I have checked the assembly for both the CM4 and CM7 builds and they looked the same). The code mostly contains general math calculations (mul, mac, sqrt, div) plus load / store, all in floating point. The results I am getting are confusing me: Cortex-M4 shows even better results than Cortex-M7.
Questions:
- are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works: is it really possible to take a branch in a single cycle, or does the whole pipeline have to be flushed (6 cycles) in the case of the floating point pipeline on CM7?
- what are the best coding practices to get the most out of CM7 over CM4 for floating point? (I am not sure whether the compilers are in the best shape yet with regard to CM7)
thanks in advance.
regards
Rastislav
Hello Rastislav,
- The key advantage of doing floating point arithmetic on CM7 over CM4 is the higher CPU clock
Basically yes. However, several floating point benchmarks show CM7 is about 10% faster than CM4 at the same CPU clock.
- The floating point instruction latencies of CM7 (number of cycles for execution) are not present in the technical reference manual. So it is quite difficult to compare execution cycles with CM4 (whose manual includes this data) from a theoretical point of view.
Yes.
- It is also possible to take advantage of CM7 when the code interleaves integer and floating point instructions (parallel instruction execution thanks to the superscalar pipeline of CM7). This is quite complex and difficult to implement. It looks like the compilers are not prepared for that yet, so an (inline) assembler implementation is required.
Basically yes. However, I believe a compiler's Cortex-M7 specific option would mix integer and floating point instructions as much as possible.
- Considering the same core clock and non-linear code (embedded functions, a number of conditional/unconditional branch instructions), the CM4 can achieve even better results due to pipeline flush / fill. (My experience is that linear code can achieve slightly better results on CM7, but non-linear code cannot)
Probably yes.
Best regards,
Yasuhiko Koumoto.
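To illustrate the point about non-linear code and pipeline flush / fill, here is a minimal C sketch of one common way to reduce data-dependent branches in a floating point kernel. The function names (clamp_branchy, clamp_select, saturate_buffer) are hypothetical and not taken from the original code. Whether the compiler really emits branch-free code for the select form depends on the toolchain and options, so the disassembly should be checked on both cores.

```c
#include <stddef.h>

/* Branchy form: up to two data-dependent conditional branches per sample.
 * A mispredicted branch costs a pipeline refill, which hurts more on a
 * deeper pipeline. */
float clamp_branchy(float x, float lo, float hi)
{
    if (x < lo) {
        return lo;
    }
    if (x > hi) {
        return hi;
    }
    return x;
}

/* Select form: same result as clamp_branchy (assuming lo <= hi), written
 * as two selects that a compiler is free to lower to predicated or
 * select instructions instead of branches. This is an assumption about
 * code generation, not a guarantee. */
float clamp_select(float x, float lo, float hi)
{
    float t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}

/* Hypothetical usage: saturate a whole buffer in a straight-line loop. */
void saturate_buffer(float *buf, size_t n, float lo, float hi)
{
    for (size_t i = 0; i < n; ++i) {
        buf[i] = clamp_select(buf[i], lo, hi);
    }
}
```

Both forms return the same values, so they can be swapped and timed on each core to see how much of the difference comes from branching.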
Yasuhiko,
Please correct me if my short conclusion regarding floating point arithmetic on CM4 vs. CM7 is wrong:
- The floating point instruction latencies of CM7 (number of cycles for execution) are not present in the technical reference manual (https://www.google.sk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB8QFjAAahUKEwi2y8qVhu7GAhWlmtsKHU5sC24&url=http%3A%2F%2Finfocenter.arm.com%2Fhelp%2Ftopic%2Fcom.arm.doc.ddi0489b%2FDDI0489B_cortex_m7_trm.pdf&ei=xS-vVfbUK6W17gbO2K3wBg&usg=AFQjCNEobfK8Z60RjjkwQe933V5MAobEyQ&bvm=bv.98197061,d.ZGU&cad=rja ). So it is quite difficult to compare execution cycles with CM4 (whose manual includes this data) from a theoretical point of view.
Of course, everything mentioned above relates to pure floating point calculations. For integer / fractional calculations we can expect a strong performance improvement of CM7 over CM4.
Regards
From: yasuhikokoumoto
Sent: Tuesday, July 21, 2015 23:40
To: Pavlanin Rastislav-B34185
Subject: Re: - Re: What is the advantage of floating point of CM7 versus CM4
Hello,
- What can we offer to a customer using pure floating point arithmetic on CM7?
I think it would offer double precision operations.
- What are the benefits of CM7 over CM4 that they can utilize (excluding integer / fractional calculation)?
CM7 can run at a higher clock frequency than CM4. Also, its DSP performance is 2 times better than CM4's.
- Could you please explain the CM7 superscalar pipeline to me (especially the floating point pipeline)? (the documentation is quite poor, and the presentations we can find on the web are too general when it comes to the pipeline)
ARM has published slides on this; in them the floating point operation latency appears to be 4 cycles. The CM7 TRM doesn't include the floating point instruction latencies, so the true latency is unknown. On the other hand, the CM4 floating point latencies are mostly one cycle according to the CM4 TRM. Pipelining will hide the latency, but the long pipeline will be affected by hazards. Anyway, the EEMBC FPMark shows CM7 performing 1.6 times better than CM4 (at the usual frequency for each?). According to STMicro presentations, CM7 performs 1.7 times better than CM4, with CM7 at 200MHz and CM4 at 180MHz.
- Is it true that we can take advantage of CM7 over CM4 (floating point arithmetic) if we interleave integer and float instructions (using assembler; in my experience compilers are not doing it yet) to utilize pipeline parallelism (dual issue of integer and float instructions)?
Yes. CM7 can execute integer and floating point operations simultaneously. Because CM7 has an in-order pipeline, the parallel execution has to be arranged by hand.
Therefore, you will gain an advantage on CM7 over CM4 if you interleave integer and float instructions using assembler.
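The latency hiding and interleaving ideas discussed above can be sketched at C level rather than in assembler. This is only an illustration under assumptions: the function names are hypothetical, the 4-cycle figure from the slides is unconfirmed, and the compiler decides the final instruction order. The first function forms one long dependency chain, the second keeps four independent accumulators so that consecutive MACs do not wait on each other, and the third places independent integer work next to the floating point work, which is the kind of pairing a dual-issue core could overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every MAC depends on the previous one, so each
 * iteration may wait for the full floating point latency. */
float dot_single(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Four independent accumulators: consecutive MACs have no data
 * dependency on each other. Note that this reassociates the sum, so
 * rounding can differ slightly from dot_single in general. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) {        /* remaining 0..3 elements */
        acc0 += a[i] * b[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}

/* Independent integer work (a running sum over w) in the same loop as
 * the floating point MAC; neither stream depends on the other. */
float mac_with_checksum(const float *a, const float *b,
                        const uint32_t *w, size_t n, uint32_t *checksum)
{
    float acc = 0.0f;
    uint32_t sum = 0u;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];     /* floating point stream */
        sum += w[i];            /* integer stream */
    }
    *checksum = sum;
    return acc;
}
```

Comparing the cycle counts of dot_single and dot_unrolled4 on both cores, and checking the generated assembly, would show whether the dependency chain is what limits CM7 in a given kernel.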
Hi Yasuhiko san,
Well, I was focused on the number of cycles needed to execute a specific floating point arithmetic function. So the total execution time did not play a role in my case.
Could you please explain it to me in more detail:
Thanks.
Sent: Tuesday, July 21, 2015 03:43
Subject: Re: - What is the advantage of floating point of CM7 versus CM4
I would like to know at what clock frequency each processor operates.
If they run at the same frequency, Cortex-M7 would perform worse than Cortex-M4 because of Cortex-M7's deeper pipeline.
Best regards,
Yasuhiko Koumoto.
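Since the clock frequency matters for any comparison, here is a small sketch of how a benchmark ratio measured at different clocks can be normalized to a per-MHz figure. The helper name per_mhz_speedup is hypothetical. As an example, it can be fed the STMicro figures quoted in this thread (1.7 times, with CM7 at 200MHz and CM4 at 180MHz).

```c
/* Convert an overall speedup measured at two different clock
 * frequencies into a per-MHz (same clock) speedup.
 *   overall_speedup : new core performance / reference core performance
 *   f_new_mhz       : clock of the new core (e.g. CM7)
 *   f_ref_mhz       : clock of the reference core (e.g. CM4) */
double per_mhz_speedup(double overall_speedup,
                       double f_new_mhz, double f_ref_mhz)
{
    return overall_speedup * f_ref_mhz / f_new_mhz;
}
```

For the quoted figures, per_mhz_speedup(1.7, 200.0, 180.0) gives 1.7 * 180 / 200 = 1.53, which is the ratio that part of the clock difference does not explain.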