Hi all,
I have been comparing FPU-based algorithms on the CM4 and CM7 cores. All code and data were placed in single-cycle memory so that the modified / true Harvard architecture is fully utilized, which means:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus, so that the modified Harvard architecture is fully utilized
- on CM7 - code in ITCM memory, data in DTCM memory
Almost all of the instructions (99%) are floating point, meaning they are not interleaved with integer instructions (this is most probably caused by the compiler; to be honest, I have checked the assembly for both the CM4 and CM7 builds and they looked the same). The code mostly consists of general math calculations (mul, mac, sqrt, div) plus load / store, all in floating point. The results I am getting confuse me: the Cortex-M4 shows even better results than the Cortex-M7.
Questions:
- Are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works: is it really possible to take a branch in a single cycle, or is it necessary to flush the whole pipeline (6 cycles) in the case of the floating point pipeline on CM7?
- What are the best coding practices to get the most out of CM7 over CM4 for floating point? (I am not sure whether the compilers are yet in the best condition with regard to CM7.)
thanks in advance.
regards
Rastislav
Hello Rastislav,
I would like to know the clock frequency at which each processor operates.
If they run at the same frequency, Cortex-M7 would have lower performance than Cortex-M4 because of Cortex-M7's deeper pipeline.
Best regards,
Yasuhiko Koumoto.
Hi Yasuhiko san,
Well, I was focused on the number of cycles needed to execute a specified floating point arithmetic function, so the total execution time did not play a role in my case.
Could you please explain in more detail:
- What can we offer to a customer who uses pure floating point arithmetic on CM7?
- What benefits of CM7 over CM4 can they utilize (excluding integer / fractional calculation)?
- Could you please explain the CM7 superscalar pipeline to me (especially the floating point pipeline)? (The documentation is quite poor, and the presentations we can find on the web are too general as far as the pipeline is concerned.)
- Is it true that we can take advantage of CM7 over CM4 (in floating point arithmetic) if we interleave integer and float instructions (using assembler; in my experience compilers are not doing this yet) to utilize pipeline parallelism (dual issue of an integer and a float instruction)?
Thanks.
Regards
From: yasuhikokoumoto
Sent: Tuesday, July 21, 2015 03:43
To: Pavlanin Rastislav-B34185
Subject: Re: - What is the advantage of floating point of CM7 versus CM4
Hello,
I think it would offer double precision operations.
CM7 can operate faster than CM4 from a clock frequency viewpoint. Also, the DSP performance is 2 times better than CM4.
Below are the slides published by ARM. The floating point operation latency appears to be 4 cycles in the slides. The CM7 TRM does not include the floating point instruction latencies, so the true latency is unknown. On the other hand, the latencies of the CM4 floating point instructions are mostly one cycle according to the CM4 TRM. Pipelining will hide the latency, but a long pipeline is affected more by hazards. Anyway, the EEMBC FPMark shows that CM7 has 1.6 times better performance than CM4 (each at its usual frequency?). According to STMicro presentations, the CM7 has 1.7 times better performance than the CM4, with the CM7 at 200 MHz and the CM4 at 180 MHz.
Yes. CM7 can execute integer and floating point operations simultaneously. Because CM7 has an in-order pipeline, the parallel execution has to be arranged by hand.
Therefore, you can take advantage of CM7 over CM4 if you interleave integer and float instructions using assembler.
Best regards,
Yasuhiko Koumoto.
Yasuhiko,
Please correct me on my short conclusion regarding floating point arithmetic on CM4 vs. CM7:
- The key advantage of doing floating point arithmetic on CM7 over CM4 is the higher CPU clock.
- The floating point instruction latencies of CM7 (number of cycles for execution) are not present in the technical reference manual (https://www.google.sk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB8QFjAAahUKEwi2y8qVhu7GAhWlmtsKHU5sC24&url=http%3A%2F%2Finfocenter.arm.com%2Fhelp%2Ftopic%2Fcom.arm.doc.ddi0489b%2FDDI0489B_cortex_m7_trm.pdf&ei=xS-vVfbUK6W17gbO2K3wBg&usg=AFQjCNEobfK8Z60RjjkwQe933V5MAobEyQ&bvm=bv.98197061,d.ZGU&cad=rja ). So it is quite difficult to compare execution cycles with CM4 (whose manual includes this data) from a theoretical point of view.
- It is also possible to take advantage of CM7 when the code includes interleaved integer and floating point instructions (parallelism in instruction execution due to the superscalar pipeline of CM7). This is quite complex and difficult to implement. It looks like the compilers are not prepared for that nowadays, so an (inline) assembler implementation is required.
- Considering the same core clock and non-linear code (embedded functions, a number of conditional/unconditional branch instructions), the CM4 can achieve even better results due to pipeline flush / fill. (My experience is that linear code achieves slightly better results on CM7, but non-linear code does not.)
Of course, everything mentioned above relates to pure floating point calculations. For integer / fractional calculations we can expect a strong improvement in performance of CM7 over CM4.
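To make the distinction between cycle count and clock frequency in the summary above concrete, here is a minimal C sketch that converts cycle counts into execution time. The 180 MHz and 200 MHz figures are the ones quoted earlier in this thread from the STMicro presentations; the cycle counts in the note after the code are made-up values for illustration, not measurements, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Execution time in microseconds for a cycle count at a given core clock.
   cycles / MHz gives microseconds directly. */
double exec_time_us(uint32_t cycles, double core_mhz)
{
    assert(core_mhz > 0.0);
    return (double)cycles / core_mhz;
}

/* Wall-clock speedup of core B over core A for the same function.
   A value above 1.0 means core B finishes sooner. */
double speedup(uint32_t cycles_a, double mhz_a,
               uint32_t cycles_b, double mhz_b)
{
    assert(cycles_b > 0u);
    return exec_time_us(cycles_a, mhz_a) / exec_time_us(cycles_b, mhz_b);
}
```

With these hypothetical numbers, `speedup(1800, 180.0, 2000, 200.0)` evaluates to 1.0: a 200 MHz part that needs about 11% more cycles only breaks even with a 180 MHz part, which is why a cycle-count comparison and an execution-time comparison can point in different directions.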
Sent: Tuesday, July 21, 2015 23:40
Subject: Re: - Re: What is the advantage of floating point of CM7 versus CM4
Basically yes. However, several floating point benchmarks show that CM7 is about 10% faster than CM4 at the same CPU clock.
- The floating point instruction latencies of CM7 (number of cycles for execution) are not present in the technical reference manual. So it is quite difficult to compare execution cycles with CM4 (which includes this data) from a theoretical point of view.
Yes.
Basically yes. However, I believe a compiler's Cortex-M7 specific option would mix integer and floating point instructions as much as possible.
Probably yes.
I am product manager for Cortex-M7 and Cortex-M4. This is an interesting thread, so I thought I would add my comments.
Of course, it will always be important which compiler, version and options combination you are using - and also the exact arrangement of memory.
Just for our information, which compiler are you using, and with which options?
The "underlying" cycle timings (ie ignoring dual issue and dependencies) of the Cortex-M7 FP instructions are the same as for the Cortex-M4 FP instructions.
FP loads can be dual-issued so that is one area where the Cortex-M7 can have an advantage over Cortex-M4.
Also, as has been pointed out, FP instructions can dual-issue with integer instructions.
During Cortex-M7 development we measured a body of code as benchmarks, including the standard EEMBC FPMark on which we found a 60% uplift from Cortex-M4 to Cortex-M7.
We also measured key functions (FFT, FIR, IIR, Biquad etc) from our CMSIS DSP Library by:
- Compiling for Cortex-M4 and running on Cortex-M4
- Compiling for Cortex-M4 and running on Cortex-M7 (ie run the same binary we just ran on Cortex-M4, but on Cortex-M7)
- Compiling for Cortex-M7 and running on Cortex-M7
- Source-level rearrangement and then compiling for Cortex-M7 and running on Cortex-M7
This was done for Q15, Q31 and F32 data types, so we did get to compare SIMD integer and single-precision FP between Cortex-M4 and Cortex-M7.
In most cases, simply running the same (Cortex-M4) code on Cortex-M7 gave most of the uplift.
There were obviously some cases where due to the coding style of the particular function, we got only a small uplift due to the microarchitecture and this uplift grew when compiling explicitly for Cortex-M7.
In some cases we found even more uplift when rearranging the DSP Lib source to encourage more dual issue, by making source-level changes which a compiler is unlikely to find, since they involve changes in the algorithm itself.
For example, some DSP Lib functions had been coded to perform "loads" first, then a block of arithmetic operations mostly using register values loaded at the beginning of a loop, then a block of "saves" back to memory.
By interleaving the "loads" and the arithmetic operations we were able to get more uplift by exploiting the dual-issue of loads and arithmetic operations.
In an ideal world this would be done entirely by the compiler, but this is not always possible.
This was all done in C, by the way: we looked at the resulting disassembly of the compiled code and moved C statements around, which is reasonably easy to do for a RISC ISA.
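As a rough illustration of the rearrangement described above, the following C sketch writes the same multiply-accumulate loop in both styles. The function names, the unroll factor of four, and the data layout are assumptions made for this example; they are not taken from the CMSIS DSP Library source. An optimizing compiler is free to reschedule either version, so whether the second form actually dual-issues better has to be checked in the disassembly and with a cycle counter on the target.

```c
#include <assert.h>
#include <stddef.h>

/* Style 1: a block of loads first, then a block of arithmetic. */
float dot_loads_first(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);  /* sketch only handles lengths that are multiples of 4 */
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        float a0 = a[i],     b0 = b[i];
        float a1 = a[i + 1], b1 = b[i + 1];
        float a2 = a[i + 2], b2 = b[i + 2];
        float a3 = a[i + 3], b3 = b[i + 3];
        acc += a0 * b0;
        acc += a1 * b1;
        acc += a2 * b2;
        acc += a3 * b3;
    }
    return acc;
}

/* Style 2: loads placed next to arithmetic that does not depend on them,
   so a load and a multiply-accumulate are candidates for dual issue. */
float dot_interleaved(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        float a0 = a[i],     b0 = b[i];
        float a1 = a[i + 1], b1 = b[i + 1];
        acc += a0 * b0;                      /* arithmetic on pair 0 ...     */
        float a2 = a[i + 2], b2 = b[i + 2];  /* ... beside the pair 2 loads  */
        acc += a1 * b1;
        float a3 = a[i + 3], b3 = b[i + 3];
        acc += a2 * b2;
        acc += a3 * b3;
    }
    return acc;
}
```

Both functions accumulate in the same order, so they return the same value; only the position of the loads relative to the arithmetic differs.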
In a very small number of cases, there are functions which consume more cycles on Cortex-M7 than Cortex-M4 without source-level rearrangement, but we didn't find any that we were not able to optimize using a combination of compiler options and algorithmic changes.
That said, I am sure it is possible to find individual cases where code runs in fewer cycles on Cortex-M4 than Cortex-M7 due to the nature of the code and register dependencies between instructions.
If the original poster would like to send us the disassembly of the code (or even the source itself if it is not confidential), then we can take a look at it to explain the results.
(Ian.Johnson@arm.com).
Ian
Hello Ian,
Thank you for your comments. My current opinions come from many internet articles, and I do not yet have experience with a real device. I will get the STM32F7 Discovery board in a few days and will check the facts myself. I think the original poster's sample code consists only of floating point operations (excluding loads). In this case, I guess the register dependencies will affect the Cortex-M7 performance, as you say. Anyway, I am very glad to get comments from the developer of Cortex-M7.
Hello again Ian,
May I ask a question?
ARM says Cortex-M7 has 1.6 times better performance than Cortex-M4 on EEMBC FPMark.
Is that at the same clock frequency?
I had thought it would be at different frequencies (e.g. Cortex-M7 at 200MHz and Cortex-M4 at 100MHz).
Which is correct?
Because FPMark performs a lot of memory accesses, I would not be surprised if it is at the same clock frequency.
Thank you in advance.
I'm sorry.
I have found your presentation material http://community.arm.com/servlet/JiveServlet/previewBody/9595-102-4-18606/ARM_Cortex_M7_MCU_Johnson.pdf .
In the material, the FPMark score is described as assuming all processors run at the same clock frequency.
Yes, the EEMBC benchmarks are run at the same frequency.
Some of our other benchmarks include uplift due to frequency. For example, in our DSP benchmarks we see on average 1.6-1.7x speedup due to IPC and a further 0.3-0.4x due to being able to run the processor at a faster frequency.
Of course these are "average" results across which we have taken a geometric mean.
Individual benchmarks will show varying results depending on the exact mix of FP vs integer arithmetic vs load/store.
All,
Well, I am a bit confused. The answers here are not yet clear to me. Some of you are saying one thing and some of you are saying something different.
The clear performance benefits of CM7 over CM4 are:
1. Doing pure integer / fractional math (DSP)
2. Doing mixed (interleaved) integer and floating point math (DSP)
What is still completely unclear is:
3. Doing pure floating point math (DSP). For example: a motor control algorithm executed in the ADC ISR. Between reading the ADC results and setting the PWM duty (integer pipelines active), a floating point motor control algorithm has to run quickly (quite a lot of pure floating point instructions with some branching due to embedded functions and conditional branches). Benchmarking this kind of code results in higher performance on CM4 in my case. To be honest, I have checked the assembler for both cases and the instructions are the same. I start measuring the number of executed core cycles at the beginning of the algorithm (after the ADC reading) and stop at the end of the algorithm (when writing to the PWM registers starts).
My understanding is that when the code consists purely of floating point instructions (which is what CM7 compilers produce when the code is written in a general way) and the code is non-linear (because of conditional / unconditional branching), the floating point pipeline (6 stages) causes wait states (flushing / refilling the pipeline). It looks like the dynamic branch prediction does not apply in that case.
Could you please comment on point 3, and only on that case, as the previously mentioned points make sense and work in my case? Thanks.
NOTE: I see a slight improvement on CM7 when linear code is executed (no branching at all).
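One way to probe the branching hypothesis above is to write the same small routine twice, once with explicit branches and once as selects, and compare cycle counts and disassembly on both cores. The sketch below uses an output limiter of the kind common in motor control loops; the routine and its names are assumptions chosen for illustration, not code from this thread. A compiler may well generate identical code for both forms, so the disassembly has to be checked before drawing any conclusion.

```c
#include <assert.h>

/* Saturation written with explicit branches and early returns. */
float limit_branchy(float x, float lo, float hi)
{
    assert(lo <= hi);
    if (x > hi) {
        return hi;
    }
    if (x < lo) {
        return lo;
    }
    return x;
}

/* The same saturation written as two selects, which gives the compiler
   the option of emitting conditional code instead of branches. */
float limit_select(float x, float lo, float hi)
{
    assert(lo <= hi);
    float t = (x > hi) ? hi : x;
    return (t < lo) ? lo : t;
}
```

Both functions return the same value for every input, so swapping one for the other inside a benchmarked algorithm changes only the code shape, not the result.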
From: ianjohnson
Sent: Friday, July 24, 2015 12:08
Can you provide your benchmark code?
The C code would be preferable.
Without the code, I think Ian cannot give any comment.
I have got the STM32F7 Discovery board and can now cross-check your results.
Can you provide the code that performed worse on Cortex-M7 than on Cortex-M4?
As a trial, I measured the performance of a 4x4 floating point matrix multiply using SysTick.
The results are:
Cortex-M7: 303 CPU cycles
Cortex-M4: 452 CPU cycles
According to my trial, Cortex-M7 has 1.5 times better performance than Cortex-M4.
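A measurement of this kind can be sketched in C as follows. This is not the code used for the numbers above, which was not posted; it is a minimal sketch under stated assumptions: matrices are stored row-major in flat arrays, SysTick is clocked from the core clock with a reload value of 0x00FFFFFF, and the counter wraps at most once during the measurement. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* SysTick is a 24-bit down-counter */

/* Elapsed ticks between two reads of the SysTick current value.
   The counter counts down, so the result is start minus end, modulo 2^24.
   Assumes a reload value of 0x00FFFFFF and at most one wrap. */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* c = a * b for 4x4 single-precision matrices stored row-major. */
void mat4_mul(const float a[16], const float b[16], float c[16])
{
    assert(c != a && c != b);  /* output must not alias an input */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++) {
                sum += a[i * 4 + k] * b[k * 4 + j];
            }
            c[i * 4 + j] = sum;
        }
    }
}
```

On a target, `start` and `end` would be read from the SysTick current-value register (`SysTick->VAL` in CMSIS) immediately before and after the call to `mat4_mul`, and the cost of an empty measurement would be subtracted from the result.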
I very much appreciate your help. I am currently on vacation with limited access to the evaluation setup. However, after my vacation I will also try your test (with a higher-order matrix). Thanks.
Sent: Sunday, July 26, 2015 23:17
I also measured the performance of the Linpack and Whetstone benchmarks.
Linpack: Cortex-M7 is 1.62 times faster at the same clock.
Whetstone: Cortex-M7 is 1.91 times faster at the same clock.