Using Arm Compiler 6 on a Cortex-M7, I ran into a hard-to-find performance problem with fmaf in a linear interpolation that runs inside a large 2D image loop. When I replace the fmaf call with a plain a*b+c, the generated assembly changes from a call to __fmaf_hardfp() to a vmla.f32 instruction.
With that change the image loop now runs in real time. I also tried inline assembly to emit vfma.f32 directly, but haven't been successful.
What on earth is going on in __fmaf_hardfp() ???
I'm using the latest MDK-ARM with all the fast optimizations enabled.
STATIC_INLINE_PURE float lerp(float const A, float const B, float const tNorm)
{
    // fma(t, v1, fma(-t, v0, v0));
    return( fmaf(tNorm, B, fmaf(-tNorm, A, A)) );
}