Hi all,
I did some investigation comparing FPU-based algorithms on CM4 and CM7 cores. All code and data were placed in single-cycle memory, making full use of the modified / true Harvard architecture, that is:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus (modified Harvard architecture fully utilized)
- on CM7 - code in ITCM memory, data in DTCM memory
Almost all of the instructions (99%) are floating point, meaning they are not interleaved with integer instructions (this is most probably down to the compiler; to be honest, I have checked the assembly for both the CM4 and CM7 builds and they looked the same). The code mostly consists of general math calculations (mul, mac, sqrt, div) plus load / store, all in floating point. The results I am getting confuse me: Cortex-M4 shows even better results than Cortex-M7.
Questions:
- Are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works: is it really possible to take a branch in a single cycle, or does the whole pipeline (6 cycles) have to be flushed in the case of the floating point pipeline on CM7?
- What are the best coding practices to get the most out of CM7 over CM4 for floating point? (I am not sure whether the compilers are in good shape for CM7 yet.)
Thanks in advance.
regards
Rastislav
Hi Yasuhiko,
Yes, this is quite an interesting investigation. I have also done some tests with different compilers (ICCARM, ARMCC), different optimization levels etc., and then used inline assembler to play with the order of instructions in the code, as well as with mixing fixed / floating point operations. In all of the configurations mentioned, the same mathematical calculation (with the same parameters and coefficients) was used. In some special cases CM7 showed better results than CM4, but overall the best floating point results were still achieved on CM4. I am becoming convinced that the compilers are not yet ready for the CM7 features. It would be perfect to have a definitive answer from ARM showing us how to write code / use a compiler to get the full performance out of CM7.
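To illustrate the kind of instruction reordering I mean, here is a minimal sketch in plain C. It is not my actual benchmark code: the function names dot_single and dot_split4 are made up for this example, and it assumes the limiting factor is a chain of dependent multiply-accumulates, which I have not confirmed. The first version has one accumulator, so every MAC has to wait for the previous one. The second version splits the sum across four independent accumulators, which gives the compiler (and a dual-issue core such as CM7) independent operations to schedule.

```c
#include <stddef.h>

/* Single accumulator: each multiply-accumulate depends on the result
 * of the previous one, so the operations form one serial chain. */
float dot_single(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Four independent accumulators (hypothetical variant for illustration):
 * the four multiply-accumulates in the loop body do not depend on each
 * other, so they can be overlapped or reordered. */
float dot_split4(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Tail loop: handle the remaining 0..3 elements. */
    for (; i < n; i++) {
        acc0 += a[i] * b[i];
    }

    /* Combine the partial sums. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the split version adds the terms in a different order, so for general float data the result can differ from the single-accumulator version in the last bits. Whether it is actually faster on CM7 or CM4 has to be measured on the target with the same memory placement as described above.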
Regards