Hi all,
I did some investigation comparing FPU-based algorithms on the CM4 and CM7 cores. All the code/data were placed into single-cycle memory with full utilization of the modified / true Harvard architecture, meaning:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus, with the modified Harvard architecture fully utilized
- on CM7 - code in ITCM memory, data in DTCM memory
Most of the code (instructions) is floating point (99%), meaning it is not interleaved with integer instructions (this is most probably caused by the compiler - to be honest, I have checked the assembly for both the CM4 and CM7 code and they looked the same). The code mostly contains general math calculations - mul, mac, sqrt, div + load / store - all in floating point. The results I am getting confuse me: the Cortex-M4 shows even better results than the Cortex-M7.
Questions:
- are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works - whether it is really possible to take a branch in a single cycle, or whether the whole pipeline (6 cycles) must be flushed in the case of the floating point pipeline on the CM7
- what are the best coding practices to get the best from the CM7 over the CM4 for floating point? (I am not sure the compilers are in their best shape yet with regard to the CM7)
thanks in advance.
regards
Rastislav
Hello,
I am the product manager for Cortex-M7 and Cortex-M4. This is an interesting thread, so I thought I would add my comments.
Of course, it will always be important which compiler, version and option combination you are using - and also the exact arrangement of memory.
Just for our info, which compiler are you using? and which options?
The "underlying" cycle timings (ie ignoring dual issue and dependencies) of the Cortex-M7 FP instructions are the same as for the Cortex-M4 FP instructions.
FP loads can be dual-issued so that is one area where the Cortex-M7 can have an advantage over Cortex-M4.
Also, as has been pointed out, FP instructions can dual-issue with integer instructions.
During Cortex-M7 development we measured a body of code as benchmarks, including the standard EEMBC FPMark on which we found a 60% uplift from Cortex-M4 to Cortex-M7.
We also measured key functions (FFT, FIR, IIR, Biquad etc) from our CMSIS DSP Library by:
1. Compiling for Cortex-M4 and running on Cortex-M4
2. Compiling for Cortex-M4 and running on Cortex-M7 (i.e. running the same binary we just ran on Cortex-M4, but on Cortex-M7)
3. Compiling for Cortex-M7 and running on Cortex-M7
4. Source-level rearrangement, then compiling for Cortex-M7 and running on Cortex-M7
This was done for Q15, Q31 and F32 data types, so we did get to compare SIMD integer and single-precision FP between Cortex-M4 and Cortex-M7.
In most cases, simply running the same (Cortex-M4) code on Cortex-M7 gave most of the uplift.
There were of course some cases where, due to the coding style of the particular function, we got only a small uplift from the microarchitecture alone, and this uplift grew when compiling explicitly for Cortex-M7.
In some cases we found even more uplift when rearranging the DSP Lib source to encourage more dual issue, by making source-level changes which a compiler is unlikely to find, since they involve changes in the algorithm itself.
For example, some DSP Lib functions had been coded to perform "loads" first, then a block of arithmetic operations mostly using register values loaded at the beginning of a loop, then a block of "saves" back to memory.
By interleaving the "loads" and the arithmetic operations we were able to get more uplift by exploiting the dual-issue of loads and arithmetic operations.
In an ideal world this would be done entirely by the compiler, but this is not always possible.
This was all done in C, by the way - looking at the resultant disassembly of the compiled code and moving C statements around, which is reasonably easy to do for a RISC ISA.
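The kind of source-level rearrangement described above can be sketched as follows. This is my own illustrative C, not the actual CMSIS-DSP source, and the function names are invented; both versions compute the same result, but the second gives a dual-issue pipeline more chances to pair a load with a floating point multiply-accumulate.

```c
#include <stddef.h>

/* Style 1: a block of loads, then a block of arithmetic. */
float dot_blocked(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* all loads first... */
        float a0 = a[i], a1 = a[i+1], a2 = a[i+2], a3 = a[i+3];
        float b0 = b[i], b1 = b[i+1], b2 = b[i+2], b3 = b[i+3];
        /* ...then all the arithmetic */
        acc += a0 * b0;
        acc += a1 * b1;
        acc += a2 * b2;
        acc += a3 * b3;
    }
    return acc;
}

/* Style 2: loads interleaved with the multiply-accumulates. */
float dot_interleaved(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        float a0 = a[i],   b0 = b[i];   acc += a0 * b0;
        float a1 = a[i+1], b1 = b[i+1]; acc += a1 * b1;
        float a2 = a[i+2], b2 = b[i+2]; acc += a2 * b2;
        float a3 = a[i+3], b3 = b[i+3]; acc += a3 * b3;
    }
    return acc;
}
```

Whether the compiler preserves this ordering depends on its scheduler, which is why inspecting the disassembly afterwards is still necessary.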
In a very small number of cases, there are functions which consume more cycles on Cortex-M7 than on Cortex-M4 without source-level rearrangement, but we didn't find any that we were not able to optimize using a combination of compiler options and algorithmic changes.
That said, I am sure it is possible to find individual cases where code runs in fewer cycles on Cortex-M4 than Cortex-M7 due to the nature of the code and register dependencies between instructions.
If the original poster would like to send us the disassembly of his code (or even the source itself if it is not confidential), then we can take a look at it to explain what you are seeing.
(Ian.Johnson@arm.com).
Regards
Ian
Hello Ian,
thank you for your comments. Currently my opinions come from many internet articles, and I do not yet have experience with a real device. I will get an STM32F7 discovery board in a few days and check the facts for myself. I think the sample code of the original poster consists only of floating point operations (excluding loads). In that case, I guess the register dependencies will affect the Cortex-M7 performance, as you say. Anyway, I am very glad to get comments from the developer of the Cortex-M7.
Best regards,
Yasuhiko Koumoto.
Hello again Ian,
may I ask a question?
ARM says the Cortex-M7 has 1.6 times the performance of the Cortex-M4 on EEMBC FPMark.
Is this measured at the same clock frequency for both?
I had thought it would be at different frequencies (e.g. Cortex-M7 at 200MHz and Cortex-M4 at 100MHz).
What is the truth?
Because FPMark performs a lot of memory accesses, I would not be surprised if it were at the same clock frequency.
Thank you in advance.
Best regards,
Yasuhiko Koumoto.
I'm sorry.
I have found your presentation material http://community.arm.com/servlet/JiveServlet/previewBody/9595-102-4-18606/ARM_Cortex_M7_MCU_Johnson.pdf .
In the materials, the FPMark score is described as assuming all processors run at the same clock frequency.
Yes, the EEMBC benchmarks are run at the same frequency.
Some of our other benchmarks include uplift due to frequency. For example, in our DSP benchmarks we see on average 1.6-1.7x speedup due to IPC and a further 0.3-0.4x due to being able to run the processor at a faster frequency.
Of course these are "average" results across which we have taken a geometric mean.
Individual benchmarks will show varying results depending on the exact mix of FP vs integer arithmetic vs load/store.
All,
Well, I am a bit confused. The answers here are not yet clear to me. Some of you are saying one thing and some of you are saying another.
The clear performance benefits of the CM7 over the CM4 are:
1. Doing pure integer / fraction math (DSP)
2. Doing mixed (interleaved) integer and floating math (DSP)
What is still completely unclear is:
3. Doing pure floating point math (DSP). For example: a motor control algorithm executed in the ADC ISR. Between reading the ADC results and setting the PWM duty cycle (integer pipelines active), a floating point motor control algorithm must be executed quickly (quite a lot of pure floating point instructions, with some branching due to inlined functions and conditional branches). Benchmarking this kind of code results in higher performance on the CM4 in my case. To be honest, I have checked the assembly in both cases and the instructions are the same. I start measuring the number of executed core cycles at the beginning of the algorithm (after the ADC read) and stop at the end of the algorithm (just before writing to the PWM registers).
My understanding is that when the code consists of pure floating point instructions (which is what the CM7 compilers produce when the code is written in the usual way) and the code is non-linear (because of conditional / unconditional branching), the pure floating point pipeline (6 stages) causes wait states (pipeline flush / refill). It looks like the dynamic branch prediction does not apply in that case.
Could you please comment on point 3 and only that case, as the previously mentioned points make sense and work in my case? Thanks.
NOTE: I see a slight improvement on the CM7 when linear code is executed (no branching at all).
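The kind of short, branchy single-precision routine described in point 3 might look like the following sketch. This is hypothetical code of my own, not the original poster's algorithm; it assumes a simple PI current controller with clamping, and all names are invented.

```c
/* A handful of single-precision FP operations interrupted by
 * data-dependent conditional limits.  On a deeper pipeline, each
 * mispredicted branch costs a refill that such a short FP sequence
 * cannot amortize. */
float pi_controller(float error, float *integrator,
                    float kp, float ki, float limit)
{
    *integrator += ki * error;                       /* integral term */
    if (*integrator >  limit) *integrator =  limit;  /* anti-windup clamp */
    if (*integrator < -limit) *integrator = -limit;

    float out = kp * error + *integrator;            /* P + I */
    if (out >  limit) out =  limit;                  /* output clamp */
    if (out < -limit) out = -limit;
    return out;
}
```

Note that compilers can sometimes turn such clamps into conditional moves or IT blocks rather than branches, which is one reason results vary between compilers and option sets.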
Reply from Ian Johnson, Friday, July 24, 2015:
Hello Rastislav,
can you provide your benchmark code?
The C code would be preferable.
Unless there is code, Ian cannot give any comment, I think.
I investigated the issue further.
I found something that tricks the Cortex-M7 performance comparison.
Generally speaking, a C compiler converts single precision variables to double precision during computation.
This would be a disadvantage for the Cortex-M4, because the Cortex-M7 has double precision instructions.
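As a minimal illustration of that promotion rule (this is standard C behavior, independent of any particular core; the helper names below are my own):

```c
#include <stddef.h>

/* In C, a literal without the f suffix (1.0) has type double, so an
 * expression such as x * 1.0 is evaluated in double precision even
 * when x is a float.  On a Cortex-M4 (single-precision FPU only) this
 * falls back to software double-precision routines, while a Cortex-M7
 * with the DP FPU can execute it in hardware.  Writing 1.0f keeps the
 * whole expression single precision.  sizeof reveals the promoted type. */
static size_t width_with_double_literal(float x)
{
    return sizeof(x * 1.0);   /* operand promoted to double */
}

static size_t width_with_float_literal(float x)
{
    return sizeof(x * 1.0f);  /* stays float */
}
```

This is why single-precision benchmark sources often suffix every constant with f (or cast to float), as in the polynomial example later in this thread.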
I measured the sort of benchmarks which have no load instructions.
They are Mandelbrot and Cosine (probably the same as in the previous post).
The disassembly code is below.
The relative performance at the same clock frequency is as follows:
Mandelbrot: Cortex-M7 is 0.76 vs Cortex-M4 is 1.0 (bigger is better).
Cosine: Cortex-M7 is 0.94 vs Cortex-M4 is 1.0 (bigger is better).
I guess these results come from the pipeline overhead.
That is, the dependencies between the floating point registers affect the performance.
Is forwarding of the floating point registers implemented in the Cortex-M7 pipeline?
What do you think?
Hi Ian, Yasuhiko,
I really appreciate your feedback; it is very helpful for me. However, I am still not fully clear on this topic. Let's exclude any general benchmarks. My customer is focused on their specific functions. Most probably they have an existing project (C language based) and they want to simply move to another MCU (now ARM Cortex-M based). When I did performance tests on some of these functions, I got in some cases (non-linear code with pure SP FPU instructions) even worse results (number of MCU cycles) than on the CM4 with +/- the same SP FPU. The assembly looks pretty much the same. I used the IAR C compiler for ARM (ICCARM) version 7.40.3.
There are some additional questions based on your last statement:
- Were your tests (benchmarks) done on a Cortex-M7 with both single and double precision FP available on the MCU?
- Are there any advantages to doing single precision FP operations on an MCU with both units implemented (single and double precision FP)?
In my case I was using an MCU with only a single precision FPU implemented (no double precision) – from CMSIS, SCB_GetFPUType() returns 1 (not 2). To be honest, I am not clear on how the FP pipeline is shared between single and double precision, or whether they have completely independent pipelines (which in the end would allow parallelism).
Reply from yasuhikokoumoto, Wednesday, July 29, 2015:
I also use only SP FP; what I said about calling DP FP routines was my misunderstanding.
The answers to your questions are below.
Were your tests (benchmarks) done on a Cortex-M7 with both single and double precision FP available on the MCU?
No.
Are there any advantages to doing single precision FP operations on an MCU with both units implemented (single and double precision FP)?
Regarding these benchmarks, only SP FP was used, so the answer here is also No.
For your information, I tested the same code on a Cortex-A9.
Mandelbrot: 0.0516 (vs Cortex-M4 at 1.0; bigger is better).
Cosine: 2.07 (vs Cortex-M4 at 1.0; bigger is better).
I wonder whether the Cortex-A9 is really that slow at the same clock.
I tried both NEON and VFP, and the results are almost the same.
Now I am confused.
Best regards.
The benchmark codes were not identical, so I tried again with the same assembly code.
The results are:
Mandelbrot: 0.0351 (vs Cortex-M4 at 1.0; bigger is better).
Cosine: 0.861 (vs Cortex-M4 at 1.0; bigger is better).
These mean the performance of the Cortex-A9 is lower than that of the Cortex-M7 at the same clock frequency.
I guess the pipeline scheme has something to do with the performance.
Yasuhiko,
Did you compare the A9 with the M4, or with the M7? Below you mention CM4 and then CM7, which is a bit misleading for me. What compiler are you using, ARMCC? Thanks.
Reply from yasuhikokoumoto, Tuesday, August 04, 2015:
Of course, I am comparing the CM4 and the CM7.
The CA9 is just for information.
I think the reason for the lower performance of the CM7 compared with the CM4 is the depth of the pipeline.
Therefore, I measured a case with an even deeper pipeline (i.e. the CA9).
The results suggest my guess is correct.
The compiler I used is the same as yours; that is, the IAR EWARM 4.7.0 compiler.
However, the optimization level is medium, because the UART did not work at the high level.
For the CA9, I used the code generated by the EWARM compiler at the assembly level for the target functions.
For the other parts of the test, I used GCC 4.9.
Hi Yasuhiko san,
Yes, I agree with you. The pipeline is affecting the SP FPU performance of the CM7. That is why linear code looks better on the CM7 than on the CM4. However, non-linear code (with branches) can result in lower performance. I wrote "can" because in some cases the result can still be slightly better on the CM7; this really depends on the code structure:
- If the executed floating point code (mostly DSP) is large enough, then the result is of course better on the CM7 (because the pipeline flush / refill due to branching does not really outweigh the overall advantages of the CM7 – faster instruction execution than the CM4, e.g. VMLA in 1 cycle versus 3 cycles on the CM4)
- If the executed floating point code contains only a couple of floating point instructions (e.g. a cosine function, or a first-order FIR, etc.), and there is any branching (e.g. operand value limiting, etc.), then it can result in better performance on the CM4 (due to the pipeline depth).
Do you agree with that simplified explanation?
Reply from yasuhikokoumoto, Tuesday, August 04, 2015:
Of course, I fully agree with you.
I would also like to get ARM's agreement, because I am not an ARM employee.
I tried a linear (i.e. no branch) case. It is a polynomial calculation.
1) source code

float Poly(float x)
{
    return ((((x + 1.0f) * x + 1.0f) * x + 1.0f) * x + 1.0f) * x + 1.0f;
}

2) assembly code

Poly:
    VMOV.F32  S1,#1.0
    VADD.F32  S1,S0,S1
    VMOV.F32  S2,#1.0
    VMLA.F32  S2,S1,S0
    VMOV.F32  S1,#1.0
    VMLA.F32  S1,S2,S0
    VMOV.F32  S2,#1.0
    VMLA.F32  S2,S1,S0
    VMOV.F32  S1,#1.0
    VMLA.F32  S1,S2,S0
    VMOV.F32  S0,S1
    BX        LR

However, the result is below:
Cortex-M7: 50 cycles
Cortex-M4: 47 cycles
Can you show us source code for a case where the Cortex-M7 is faster?