Hi all,
I did some investigation comparing FPU-based algorithms on the CM4 and CM7 cores. All code and data were placed in single-cycle memory with full utilization of the modified / true Harvard architecture, which means:
- on CM4: code in SRAM accessible via the CODE bus, data in SRAM accessible via the SYSTEM bus, with the modified Harvard architecture fully utilized
- on CM7: code in ITCM, data in DTCM
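For reference, this kind of placement can be requested with compiler section attributes; the section names below are illustrative and depend on the vendor's linker script:

/* Illustrative GCC section attributes (names are hypothetical): */
__attribute__((section(".itcm_text")))
void math_kernel(float *y, const float *x, int n);   /* hypothetical kernel    */

__attribute__((section(".dtcm_data")))
static float coeff_table[256];                       /* hypothetical constants */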
Most of the code (99% of the instructions) is floating point, meaning the FP instructions are not interleaved with integer instructions (this is most probably down to the compiler - to be honest, I have checked the assembly for both the CM4 and CM7 builds and they looked the same). The code mostly contains general math calculations - mul, mac, sqrt, div, plus load/store - all in floating point. The results I am getting confuse me: the Cortex-M4 shows even better results than the Cortex-M7.
Questions:
- are the differences caused by the cores' pipelines? I am not sure how "dynamic branch prediction" works - is it really possible to take a branch in a single cycle, or does the whole pipeline (6 cycles) have to be flushed in the case of the floating point pipeline on the CM7?
- what are the best coding practices to get the most from the CM7 over the CM4 for floating point? (I am not sure the compilers are in the best shape yet with regard to the CM7)
thanks in advance.
regards
Rastislav
Hello Rastislav,
I have gotten hold of an STM32F7 Discovery board, so now I can cross-check your results.
Can you provide the code for which the Cortex-M7 performance was worse?
As a trial, I measured the performance of a 4x4 floating point matrix multiply using SysTick.
The results are:
Cortex-M7: 303 CPU cycles
Cortex-M4: 452 CPU cycles
According to my trial, the Cortex-M7 gives 1.5 times the performance of the Cortex-M4.
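The measurement harness looked roughly like this (a minimal sketch, not my exact code; matmul4 is a plain C 4x4 multiply and the SysTick registers are the architectural ones):

#include <stdint.h>

#define SYST_CSR (*(volatile uint32_t *)0xE000E010) /* control and status */
#define SYST_RVR (*(volatile uint32_t *)0xE000E014) /* reload value       */
#define SYST_CVR (*(volatile uint32_t *)0xE000E018) /* current value      */

static void matmul4(const float a[4][4], const float b[4][4], float c[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

uint32_t cycles_for_matmul4(const float a[4][4], const float b[4][4], float c[4][4])
{
    SYST_RVR = 0x00FFFFFF;        /* maximum 24-bit reload              */
    SYST_CVR = 0;                 /* any write clears the counter       */
    SYST_CSR = 5;                 /* enable, clocked from the CPU clock */
    uint32_t start = SYST_CVR;
    matmul4(a, b, c);
    uint32_t end = SYST_CVR;
    return start - end;           /* SysTick counts down                */
}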
Best regards,
Yasuhiko Koumoto.
Hi Yasuhiko san,
I very much appreciate your help. I am on vacation at the moment with limited access to the evaluation hardware. However, after the vacation I will also try your test (with larger matrices). Thanks.
Regards
Reply from yasuhikokoumoto:
I also measured performance of Linpack and Whetstone benchmarks.
Linpack: 1.62 times faster on Cortex-M7 at the same clock.
Whetstone: 1.91 times faster on Cortex-M7 at the same clock.
Hello,
I tried to reply earlier, but I think I accidentally closed the window... or maybe you will see two replies.
These results are broadly what we would expect.
Linpack, Whetstone, and matrix multiply are all long enough and varied enough to benefit from the Cortex-M7's microarchitecture.
The short instruction sequence to approximate cos will execute roughly the same on Cortex-M4 and Cortex-M7 - you gain a little from dual issue of the loads, but then lose a little from dependencies between the FP arithmetic instructions.
You may get slightly different results using the ARM compiler, but there is not much chance for the compiler to avoid the dependencies in such a short sequence.
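For illustration, such a sequence might look like the following (a sketch; the coefficients are rough Taylor values, not a vetted polynomial):

/* Horner-form cos approximation: the FP operations form a single
   dependency chain, so a wider pipeline has little to overlap. */
static inline float cos_approx(float x)
{
    const float c2 = -0.5f, c4 = 0.0416667f;  /* illustrative coefficients */
    float x2 = x * x;                         /* vmul.f32                  */
    return 1.0f + x2 * (c2 + x2 * c4);        /* each op waits on the last */
}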
Ian
I'm not ignoring you but we have the ARM Partner's meeting in Cambridge this week.
I'll talk to our engineering team and get back to you.
Hello Ian,
what is your opinion about the issue?
I evaluated the cores again from another angle.
This time, I prepared three versions of one function.
The C source code is below.
float ip(float a, float b, float c, float d, float e)
{
    float x, y, z, u, v;
    x = a * a;
    y = b * b;
    z = c * c;
    u = d * d;
    v = e * e;
    /* note: z is accumulated twice, matching the duplicated
       vadd.f32 s0, s0, s2 in the assembly versions below */
    return (float)(x + y + z + z + u + v);
}
I compiled it with three different tuning options.
ip0 is tuned for Cortex-M3.
ip1 is tuned for Cortex-M7.
ip2 is tuned for Cortex-R7 (same as Cortex-A9).
float ip0(float a, float b, float c, float d, float e)
{
    asm("vmul.f32 s1, s1, s1");
    asm("vmul.f32 s0, s0, s0");
    asm("vmul.f32 s2, s2, s2");
    asm("vadd.f32 s0, s0, s1");
    asm("vmul.f32 s3, s3, s3");
    asm("vadd.f32 s0, s0, s2");
    asm("vmul.f32 s4, s4, s4");
    asm("vadd.f32 s0, s0, s2");
    asm("vadd.f32 s0, s0, s3");
    asm("vadd.f32 s0, s0, s4");
    asm("bx lr");
    asm("nop");
}

float ip1(float a, float b, float c, float d, float e)
{
    asm("vmul.f32 s0, s0, s0");
    asm("vmul.f32 s1, s1, s1");
    asm("vmul.f32 s2, s2, s2");
    asm("vmul.f32 s3, s3, s3");
    asm("vadd.f32 s1, s0, s1");
    asm("vmul.f32 s4, s4, s4");
    asm("vadd.f32 s1, s1, s2");
    asm("vadd.f32 s1, s1, s2");
    asm("vadd.f32 s0, s1, s3");
    asm("vadd.f32 s0, s0, s4");
    asm("bx lr");
    asm("nop");
}

float ip2(float a, float b, float c, float d, float e)
{
    asm("vmul.f32 s1, s1, s1");
    asm("vmul.f32 s2, s2, s2");
    asm("vmla.f32 s1, s0, s0");
    asm("vadd.f32 s0, s1, s2");
    asm("vadd.f32 s0, s0, s2");
    asm("vmla.f32 s0, s3, s3");
    asm("vmla.f32 s0, s4, s4");
    asm("bx lr");
    asm("nop");
}
ip0 and ip1 contain the same instructions, only in a different order; ip1 has fewer register dependencies.
ip2 is just for information; it uses the multiply-accumulate (vmla) instruction.
The execution times on Cortex-M7 and Cortex-M4 are shown below.
- Cortex-M7
ip0 40 cycles
ip1 33 cycles
ip2 27 cycles
- Cortex-M4
ip0 32 cycles
ip1 37 cycles
ip2 30 cycles
The results show that register dependencies affect performance.
However, by using the tuning option specific to each core, the best performance can be obtained.
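With GCC-style tools, for example, the per-core tuning would look like the command lines below (illustrative only; these are not my exact build settings):

arm-none-eabi-gcc -O2 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -c ip.c -o ip_m4.o
arm-none-eabi-gcc -O2 -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard -c ip.c -o ip_m7.o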
Hi Yasuhiko,
Yes, this is a quite interesting investigation. I have also done some tests based on different compilers (ICCARM, ARMCC), different optimizations, etc., and then used the inline assembler to play with the order of instructions in the code, and with mixing fixed and floating point operations. However, in the end the best floating point results I got were achieved on the CM4. In all the configurations mentioned, the same mathematical calculation (with the same parameters and coefficients) was used. Of course, in some special cases the CM7 showed better results than the CM4, but the best results were achieved on the CM4. I am becoming sure that the compilers are not yet prepared for the CM7's features. It would be perfect to have a definitive answer from ARM showing us how to write code / use a compiler to utilize the CM7's full performance.
Hi Ian,
what did you mean by:
- The short instruction sequence to approximate cos will execute roughly the same on Cortex-M4 and Cortex-M7 - you gain a little from dual issue of the loads, but then lose a little from dependencies between the FP arithmetic instructions.
I am still getting confused from one thread to another on this topic.
Is dual issue of floating point load/store and another floating point instruction possible on CM7?
We are clear that dual issue is possible with integer load/store. Imagine that the benchmarked code includes only single precision floating point instructions plus a couple of branch instructions (no additional integer instructions, which we know can help utilize dual issue). I still think that such code will (or rather, can) take longer on the CM7 than on the CM4 due to the differences in the pipelines (flush / fill). If the code is linear (no branches), it should definitely take fewer cycles on the CM7 than on the CM4.
I know (and have experience confirming) that when the code is written manually in assembly with interleaved integer <-> floating point instructions, it takes fewer cycles on the CM7 than on the CM4. However, in most cases this is difficult to implement, and I am pretty sure the compilers do not do this yet.
I have rerun your test on an M4F processor (a K64F from NXP). For this I put the floating point instructions into a simple loop:
asm( "MOV R1, #200"); loop: // floating point goes here asm( "SUBS R1, R1, #1"); asm goto( "BNE %l[loop]" : : : : loop);
and I measured the following, excluding the 3 cycles needed for the loop:
ip0 12 cycles
ip1 13 cycles
ip2 13 cycles
ip0 and ip1 run at exactly the speeds you would expect from reading the technical reference manual; ip2 is surprisingly fast - I would have expected two more cycles because of the VADD data dependencies.
pavlanin asked a question above which is also my question; I don't think it was answered.
A big part of our selection of the M7 was the improved DSP FPU performance due to the parallel loads and MACs. I'm writing strictly in assembly code, so there's no compiler confusion. Due to the lack of detailed documentation, I have been trying all sorts of permutations and tiny changes in hopes of finding the delicate combination that allows parallel operation. I'm aware of, and avoiding, register conflicts (register dependencies >= 4 instructions apart, so lots of margin).
Then it hit me that maybe it was a human communication issue, and the parallel load/MAC operation only applies to integer, not floating point. I sure hope I'm wrong.
Is this true? It seems like it shouldn't be, since there are pipeline diagrams showing two FPU execution paths. Hopes: 1) the two FPU paths can issue a load and a MAC together; 2) the FPU issues a MAC while the load/store unit loads an FP register.
We just need a single concrete example in assembly code of how any FPU parallelism is possible then I can work my way from there to my application math.
Below is a sample of my asm code that does not execute the line pairs in 1 cycle as hoped for.
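A representative fragment of that pattern (register numbers and offsets here are illustrative, in the inline-asm style used earlier in the thread; the VLDRs fill registers for the next group of MACs, so there are no register conflicts with the VFMAs):

void pair_test(const float *p)   /* p arrives in R0 per the AAPCS */
{
    asm("VLDR.F32 S8,  [R0, #0]");   /* load for the NEXT group of MACs */
    asm("VFMA.F32 S0, S24, S16");    /* MAC on data loaded earlier      */
    asm("VLDR.F32 S9,  [R0, #4]");
    asm("VFMA.F32 S1, S25, S17");
    asm("VLDR.F32 S10, [R0, #8]");
    asm("VFMA.F32 S2, S26, S18");
    asm("VLDR.F32 S11, [R0, #12]");
    asm("VFMA.F32 S3, S27, S19");
}

Each VLDR/VFMA pair was hoped to issue together in one cycle; measured, they do not.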
I took my baseline asm and changed my VLDRs to LDRs, and the cycle count dropped by about half, so I can confirm that int loads and VFMAs can run in parallel.
Thanks, Chris
As a follow-up: if FP parallel load/MACs are not possible, then what is the next-best plan?
From my fumbling around, it appears that a burst load of some size (<1 cycle/load) followed by a string of single-cycle VFMAs gives the smallest cycle count I've seen.
If the above burst method is the best, given the N-port FPU load structure, what is the optimal way to load FP regs (instruction to use, number of regs per burst, alignment effects, etc.)? A concrete asm code example of the optimal load/MAC method would be much appreciated.
I can't spend any more time on this, and I haven't received any feedback. I always like to help the next person, so here's my full report of discoveries and conclusions. I would still very much like official ARM confirmation of my findings. You have great CPU designs; please up your game in terms of detailed optimization information -- let's all get better together.
- The only FPU advantage of the M7 over the M4 is the single-cycle MAC.
- Various web sources (ARM, others) talk of "2 FPU ALU pipes". I've seen no evidence from my experiments that this is true. My guess is that it is communication confusion -- my swag: there ARE 2 paths, creatively designed to enable a 1-cycle MAC, but useless in all other regards. I don't like this conclusion and would LOVE to hear of tricks otherwise as to how I can parallelize anything.
- KEY learning: due to pipelining you CAN'T use just 2 intermediate calculation regs for some algorithms; you need 4:
// VFMA.F32 S0,S31,S8     195 MHz  (S0=Real, S1=Imag)
// VFMA.F32 S1,S30,S9
// VFMA.F32 S0,S29,S10    <- S0/S1 reused before the earlier results retire: stalls
// VFMA.F32 S1,S28,S11
//
// VFMA.F32 S0,S31,S8     317 MHz  (S0, S1=Real [sum later], S2, S3=Imag [sum later])
// VFMA.F32 S1,S30,S9
// VFMA.F32 S2,S29,S10    <- four accumulators: no dependency inside the pipeline window
// VFMA.F32 S3,S28,S11
- Simple alternating load/MAC as suggested by the literature does quite poorly; bursts of loads do much better -- do that. My hunch is that I assumed (my bad -- read the details) that the marketing of parallel load/MACs applied equally to FPU and integer math. From my testing, it appears this is only true for int. What a shame. Again, I'd love to be wrong. Please show me the trick.
- There is an art to reg bursting. Study your algorithm; really break it down to its essence. The FPU in general is MUCH preferred to the int ALU (only 13-14 usable regs) because we have 32 SP regs -- 32! A huge resource. Load/stores are trouble. Move what you can into regs, keep it there, and run lots of cycles. For the remaining data that must be loaded on the fly, do this: identify the bare minimum number of regs needed for intermediate calculations, and allocate the rest as a "load buffer" that is filled with a single VLDM (e.g. VLDM R1!, {S4-S15}). The code should have a pattern: one VLDM, lots of VFMAs, repeat. That's it -- the only trick I was able to find (see the sketch after this list). This trick was also my only one in M4 land.
- Cache. TCM is MUCH preferred but, for various reasons, sometimes it can't be used. ARM: push your licensees to provide a better selection of TCM configurations; don't force me to waste huge amounts -- let me have a highly granular config. I needed just a bit of DTCM (little benefit from ITCM) and can't select it on my chip. OK, back to cache. You're fighting 16 KB of DCache; understand that and embrace it. If your twiddle/etc. tables are bigger than that, your code will be horribly slow. The trick is simple: find a way to break your algorithm into bursts so that a portion of the table WILL fit in cache; the first set of math is slow (loading the cache), then the rest is fast. For my algorithm I used 16 bursts -- 1 slow, 15 fast -- and had to break my processing into 3 sections so that a part of the table would fully fit in cache most of the time (break it into more sections if it misses fitting too often [interrupts, task switching, etc.]).
- Load/store density reduction. This is a small fine-tuning item, but I was able to save a few cycles and it only takes a few minutes of fiddling. After all my heavy-duty MAC math in the main inner loop, there's a bottom section of the obligatory load/store, update type. I was able to rearrange items (where order didn't matter) so that no load/store was next to another one -- it's faster. The code is less logical, but no big deal: put in a long comment, disks are free. ARM: again, we need data. What's going on here? Just share the inner workings so we know the silicon with which we dance. It's not super-secret IP; almost all CPU IP shops must be dealing with this, and I don't see any competitive loss. It could even become an advantage, by showing you are willing to help your customers wring every last cycle out of your parts. A less preferred path, but I'll take it: let's do an NDA, then share the tricks. This is much less preferred, though, because all the really innovative small shops (that turn into big ones) will never do it -- lawyers are expensive and they don't have time to waste on NDAs, which can take months to execute.
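Here is the sketch promised in the reg-bursting item (register allocation illustrative, not my production code: S0-S3 are the four accumulators, S16-S27 hold resident coefficients, S4-S15 are the 12-register load buffer):

void burst_macs(const float *p)   /* p arrives in R0 per the AAPCS */
{
    asm("VLDM R0!, {S4-S15}");       /* one burst load: 12 samples      */
    asm("VFMA.F32 S0, S4,  S16");    /* then a straight run of MACs,    */
    asm("VFMA.F32 S1, S5,  S17");    /* rotating through 4 accumulators */
    asm("VFMA.F32 S2, S6,  S18");    /* so no result is reused before   */
    asm("VFMA.F32 S3, S7,  S19");    /* it retires                      */
    asm("VFMA.F32 S0, S8,  S20");
    asm("VFMA.F32 S1, S9,  S21");
    asm("VFMA.F32 S2, S10, S22");
    asm("VFMA.F32 S3, S11, S23");
    asm("VFMA.F32 S0, S12, S24");
    asm("VFMA.F32 S1, S13, S25");
    asm("VFMA.F32 S2, S14, S26");
    asm("VFMA.F32 S3, S15, S27");    /* repeat: one VLDM, many VFMAs    */
}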
Happy optimizing, Chris