Hi all,
I did some investigation comparing FPU-based algorithms on the CM4 and CM7 cores. All code and data were placed in single-cycle memory, making full use of the modified / true Harvard architecture, that is:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus (modified Harvard architecture fully utilized)
- on CM7 - code in ITCM memory, data in DTCM memory
Most of the code (about 99% of the instructions) is floating point, meaning the floating point instructions are not interleaved with integer instructions (this is most probably down to the compiler; to be honest, I have checked the assembly for both the CM4 and CM7 builds and they look the same). The code mostly consists of general math calculations (mul, mac, sqrt, div) plus load / store, all in floating point. The results I am getting confuse me: the Cortex-M4 shows even better results than the Cortex-M7.
Questions:
- are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works: is it really possible to take a branch in a single cycle, or does the whole pipeline have to be flushed (6 cycles) in the case of the floating point pipeline on CM7?
- what are the best coding practices to get the most out of CM7 over CM4 for floating point? (I am not sure whether the compilers are in their best shape yet with regard to CM7)
Thanks in advance.
Regards
Rastislav
All,
Well, I am a bit confused. The answers here are not yet clear to me. Some of you are saying one thing and some of you are saying something different.
The clear performance benefits of CM7 over CM4 are:
1. Doing pure integer / fractional math (DSP)
2. Doing mixed (interleaved) integer and floating point math (DSP)
What is still completely unclear is:
3. Doing pure floating point math (DSP). For example: a motor control algorithm executed in the ADC ISR. Between reading the ADC results and setting the PWM duty (integer pipelines active), the floating point motor control algorithm has to run quickly (quite a lot of pure floating point instructions with a few branches, due to embedded function calls and conditional branching). Benchmarking this kind of code shows higher performance on CM4 in my case. To be honest, I have checked the assembly for both cases and the instructions are the same. I start counting executed core cycles at the beginning of the algorithm (after the ADC reading) and stop at the end of the algorithm (just before writing to the PWM registers).
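The post does not say how the core cycles were counted. One common way on both cores is the DWT cycle counter, sketched below under that assumption. The register addresses are the architectural ARMv7-M ones; the helper names (`cyc_init`, `cyc_now`, `cyc_elapsed`) and `run_motor_control()` are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Architecturally defined ARMv7-M debug registers (Cortex-M4 / Cortex-M7). */
#define DEMCR_ADDR       0xE000EDFCu   /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u   /* DWT control register                */
#define DWT_CYCCNT_ADDR  0xE0001004u   /* DWT cycle counter                   */
#define DEMCR_TRCENA     (1u << 24)    /* enables the DWT block               */
#define DWT_CYCCNTENA    (1u << 0)     /* enables CYCCNT                      */

#if defined(__arm__) || defined(__ICCARM__)
/* Target-only helpers: these touch memory-mapped registers.
 * Assumption: on some Cortex-M7 parts the DWT lock access register
 * may need unlocking first; that step is not shown here. */
static inline void cyc_init(void)
{
    *(volatile uint32_t *)DEMCR_ADDR      |= DEMCR_TRCENA;
    *(volatile uint32_t *)DWT_CYCCNT_ADDR  = 0u;
    *(volatile uint32_t *)DWT_CTRL_ADDR   |= DWT_CYCCNTENA;
}

static inline uint32_t cyc_now(void)
{
    return *(volatile uint32_t *)DWT_CYCCNT_ADDR;
}
#endif

/* Elapsed cycles between two samples. Unsigned subtraction gives the
 * right answer even if the 32-bit counter wrapped once in between. */
static inline uint32_t cyc_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Intended use inside the ISR (run_motor_control() is hypothetical):
 *
 *     uint32_t t0 = cyc_now();
 *     run_motor_control();
 *     uint32_t cycles = cyc_elapsed(t0, cyc_now());
 */
```

Reading the counter costs a few cycles itself, so when comparing cores it may be worth measuring an empty region first and subtracting that overhead from both results.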
My understanding is that when the code consists of pure floating point instructions (which is what the CM7 compilers produce when the code is written in a general way) and the code is non-linear (because of conditional / unconditional branching), the floating point pipeline (6 stages) causes wait states, since the pipeline has to be flushed and refilled. It looks like dynamic branch prediction does not apply in that case.
Could you please comment on point 3, and only on that case, since the previous two make sense and work in my case? Thanks.
NOTE: I do see a slight improvement on CM7 when linear code is executed (no branching at all).
Regards
From: ianjohnson
Sent: Friday, July 24, 2015 12:08
To: Pavlanin Rastislav-B34185
Subject: Re: - What is the advantage of floating point of CM7 versus CM4
Hello Rastislav,
can you provide your benchmark code?
The C code would be preferable.
Without the code, I do not think Ian can comment.
Best regards,
Yasuhiko Koumoto.
Hi Yasuhiko san,
I cannot provide the benchmark code as it is customer code. However, for simplicity I did something similar (but simpler) with a cosine function (9th order polynomial approximation), which is attached. I used the ICCARM compiler (IAR 7.40). The assembly code is exactly the same for both cases, as you can see in the attachment. The number of cycles differs by 3 (CM4 takes 42 cycles, CM7 takes 45 cycles).
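The attachment is not reproduced here, so the following is only a minimal sketch of what such a kernel might look like, not the code that was actually benchmarked. It assumes plain Taylor coefficients for a 9th order sine polynomial, evaluated with Horner's scheme, and obtains the cosine as sin(pi/2 - x). The function names and the coefficient choice are assumptions. The result is a short, branch-free chain of dependent single-precision multiply and add operations, which is the kind of pure floating point code discussed above.

```c
/* 9th order odd polynomial for sin(x), Taylor coefficients (assumption:
 * the original attachment may use different, e.g. minimax, coefficients). */
static float sin_poly9(float x)
{
    const float c3 = -1.0f / 6.0f;
    const float c5 =  1.0f / 120.0f;
    const float c7 = -1.0f / 5040.0f;
    const float c9 =  1.0f / 362880.0f;

    float x2 = x * x;

    /* Horner's scheme in x^2: every step depends on the previous one. */
    float p = c9;
    p = c7 + x2 * p;
    p = c5 + x2 * p;
    p = c3 + x2 * p;
    p = 1.0f + x2 * p;

    return x * p;
}

/* cos(x) = sin(pi/2 - x); intended for x in [0, pi] so that the
 * polynomial argument stays within [-pi/2, pi/2]. */
static float cos_poly9(float x)
{
    const float half_pi = 1.57079632679f;
    return sin_poly9(half_pi - x);
}
```

Because every Horner step waits for the previous result, a kernel like this leaves a dual-issue core little independent work to overlap, which may be one reason the cycle counts of the two cores end up so close.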