Hi all,
I did some investigation comparing FPU-based algorithms on the CM4 and CM7 cores. All code/data were placed in single-cycle memory with full utilization of the modified / true Harvard architecture, meaning:
- on CM4 - code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus, fully utilizing the modified Harvard architecture
- on CM7 - code in ITCM memory, data in DTCM memory
Most of the instructions (99%) are floating point, i.e. they are not interleaved with integer instructions (this is most probably down to the compiler - to be honest, I have checked the assembly for both the CM4 and CM7 builds and they looked the same). The code mostly consists of general math operations - mul, mac, sqrt, div - plus loads/stores, all in floating point. The results I am getting confuse me: the Cortex-M4 shows even better results than the Cortex-M7.
Questions:
- are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works - is it really possible to take a branch in a single cycle, or does the whole pipeline (6 cycles) have to be flushed in the case of the floating-point pipeline on the CM7?
- what are the best coding practices to get the most out of the CM7 over the CM4 in floating point? (I am not sure whether the compilers are in the best shape yet with regard to the CM7)
thanks in advance.
regards
Rastislav
Hello,
I tried to reply earlier, but I think I accidentally closed the window... or maybe you will see two replies.
These results are broadly what we would expect.
Linpack, Whetstone and matrix multiply are all long enough and varied enough code to benefit from the Cortex-M7's microarchitecture.
The short instruction sequence to approximate cos will execute roughly the same on Cortex-M4 and Cortex-M7 - you gain a little from dual issue of the loads, but then lose a little from dependencies between the FP arithmetic instructions.
You may get slightly different results using the ARM compiler, but there is not much chance for the compiler to avoid the dependencies in such a short sequence.
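To make the dependency point concrete, a short polynomial cos approximation (a generic Taylor/Horner sketch, not necessarily the exact code under discussion) looks like this:

```c
#include <math.h>

/* cos(x) ~= 1 - x^2/2 + x^4/24 - x^6/720, evaluated in Horner form.
 * Each multiply/add below consumes the result of the previous one,
 * so the operations form one serial dependency chain: a dual-issue
 * core like the CM7 cannot overlap them with each other, only with
 * the surrounding loads. */
static float cos_poly(float x)
{
    float x2 = x * x;
    float r  = -1.0f / 720.0f;
    r = r * x2 + 1.0f / 24.0f;  /* depends on r */
    r = r * x2 - 0.5f;          /* depends on r */
    r = r * x2 + 1.0f;          /* depends on r */
    return r;
}
```

This is accurate to a few units in the fifth decimal place for small `x`, which is enough to show the shape of the instruction sequence.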
Regards
Ian
I'm not ignoring you but we have the ARM Partner's meeting in Cambridge this week.
I'll talk to our engineering team and get back to you.
Hello Ian,
what is your opinion about the issue?
Best regards,
Yasuhiko Koumoto.
Hi Ian,
what did you mean by:
- The short instruction sequence to approximate cos will execute roughly the same on Cortex-M4 and Cortex-M7 - you gain a little from dual issue of the loads, but then lose a little from dependencies between the FP arithmetic instructions.
I am still getting confused from one thread to another on this topic.
Is dual issue of floating point load/store and another floating point instruction possible on CM7?
We are clear that dual issue is possible with integer load/store. Imagine that the benchmarked code includes only single-precision floating-point instructions plus a couple of branch instructions (no additional integer instructions, which we know can help utilize dual issue). I still think that such code will (or rather, can) take longer on the CM7 than on the CM4 due to the differences in the pipelines (flush / refill). If the code is linear (no branches), it should definitely take fewer cycles on the CM7 than on the CM4.
I know from experience that when the code is written manually in assembly with interleaved integer <-> floating-point instructions, it takes fewer cycles on the CM7 than on the CM4. However, this is difficult to implement in most cases, and I am pretty sure the compilers do not do this yet.
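At the C level, the closest one can get to that hand-scheduled interleaving is to give the scheduler independent work - for example, splitting an accumulation into two independent chains so the integer index updates and the second FP chain have something to pair with (an illustrative sketch; the actual pairing depends entirely on the compiler's scheduling):

```c
/* Dot product with two independent accumulators. acc0 and acc1 do not
 * depend on each other, so the FP ops of one chain can issue while the
 * other chain's result is still in flight; the integer pointer/index
 * arithmetic provides further dual-issue candidates on the CM7. */
static float dot_unrolled(const float *a, const float *b, int n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i]     * b[i];      /* chain 0 */
        acc1 += a[i + 1] * b[i + 1];  /* chain 1, independent of chain 0 */
    }
    if (i < n)                        /* tail element for odd n */
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```

Whether the compiler actually emits the interleaved schedule is another matter - checking the generated assembly, as you did, is the only way to be sure.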