Hi,
I am using S32K14x controllers (Cortex-M4F), which have a floating-point math unit. I need to perform many mathematical operations as fast as possible. Which will be faster: fixed-point q16, fixed-point q32, or single-precision (32-bit) floating point?
Regards,
Pramod
Pramod Ranade said: On Cortex-M4F microcontrollers: is fixed point math faster ?
Probably not:
https://blogs.sw.siemens.com/embedded-software/2012/09/10/the-floating-point-argument/
why don't you run some tests to find out ?
But you do have to be careful to stick to single precision:
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/10-useful-tips-to-using-the-floating-point-unit-on-the-arm-cortex--m4-processor
https://dzone.com/articles/be-aware-floating-point-operations-on-arm-cortex-m
Hi Pramod,
Further to Andy's excellent reply above, the CPU has a cycle count register you can use to easily compare code performance: https://developer.arm.com/documentation/ddi0439/b/Data-Watchpoint-and-Trace-Unit/DWT-Programmers-Model (enabled by bit 0 of DWT_CTRL). Most development tools (such as Keil MDK) have this integrated into the environment.
By mathematical operations, do you mean low-level operations (MAC etc.) or higher-level operations (FFT or similar)? The CMSIS DSP library contains a number of optimized routines (with and without VFP) to further help you analyze: www.keil.com/.../index.html
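As a sketch of that approach (register addresses are from the ARM DDI 0439 / ARMv7-M documentation; the macro and function names here are my own, not from any vendor header):

```c
#include <stdint.h>

/* Cortex-M4 debug registers (addresses per ARM DDI 0439); macro names are mine. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu) /* Debug Exception and Monitor Control */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u) /* DWT control register */
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u) /* free-running cycle counter */

static inline void cycle_counter_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;
    DWT_CTRL |= (1u << 0); /* CYCCNTENA: bit 0 of DWT_CTRL starts the counter */
}

/* Unsigned subtraction yields the correct delta even if the counter wraps. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Typical use: read `DWT_CYCCNT` before and after the code under test and take the difference with `cycles_elapsed()`.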
ARM documentation says that most FPU instructions (except division) complete in one clock cycle, but there is an overhead of moving operands between the Rx registers and the FPU registers. The DSP instructions also seem to perform most basic arithmetic on q31 numbers in a single clock cycle, but the compiler can't generate DSP instructions automatically, which means we must use the CMSIS DSP library. In both cases there is some overhead, but I don't know which is worse. Hence the question.
Andy Neil said:why don't you run some tests to find out ?
Yes, I am planning the same right now. I am using the CMSIS DSP library. It has functions to perform arithmetic on q31 as well as float32; I assume they use the DSP and FPU instructions, respectively. Will compare performance of the _q31 and _f32 variants of the same functions. Will post the results here when done.
Ronan Synnott said:By mathematical operations do you mean low level operations (MAC etc) or higher level operations (FFT or similar)
Low level operations, to start with. May evaluate higher math functions later (e.g. matrix multiplication)
Update:
Tried to perform MAC on q31 and float32 numbers. Used DSP instruction for q31 and FPU instruction for float32 numbers. Used gcc with highest optimization (-O3). The statement
a = __SMMLA(x, y, a); // a += high 32 bits of (x * y)
requires 9 clock cycles to execute, where a, x and y are local q31 variables.
Equivalent statement for floating point variables
f32Var1 += (f32Var2 * f32Var3);
requires 10 clock cycles to execute, where f32Var1, f32Var2 and f32Var3 are local float variables.
So fixed point math (using the DSP instructions) is faster than floating point math (using FPU). But the difference is marginal.
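For reference, a portable C model of what the SMMLA-based MAC above computes (the instruction adds the top 32 bits of the 64-bit product to the accumulator; the function name is mine, not from CMSIS):

```c
#include <stdint.h>

typedef int32_t q31_t;

/* Model of SMMLA: acc plus the top 32 bits of the 64-bit product.
   Note the product is scaled by 2^-32, i.e. one bit right of a pure
   q31 * q31 -> q31 result, so a left shift by 1 is sometimes needed. */
static q31_t q31_mac(q31_t acc, q31_t x, q31_t y)
{
    return acc + (q31_t)(((int64_t)x * (int64_t)y) >> 32);
}
```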
Pramod Ranade said:fixed point math (using the DSP instructions) is faster than floating point math (using FPU). But the difference is marginal
Thanks for the feedback.
An issue often noted with fixed point is that, aside from the actual calculations, it adds overhead & complexity to the code which needs to supply the input data and/or use the results. So I guess one would need a wider benchmark to see if that tips the balance for the overall system ... ?
#FloatingVsFixedPoint
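As one example of that boundary overhead: if input data arrives as floats (calibration values, say), fixed point needs conversion code at the edges. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>

typedef int32_t q31_t;

/* Boundary conversions between float and q31.
   Valid only for f in [-1.0, 1.0); no saturation is performed here. */
static q31_t float_to_q31(float f) { return (q31_t)(f * 2147483648.0f); }
static float q31_to_float(q31_t q) { return (float)q / 2147483648.0f; }
```

These conversions, plus the scaling bookkeeping to keep intermediate results in range, are exactly the kind of cost a MAC-only benchmark doesn't capture.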
Andy Neil said:An issue often noted with fixed point is that, aside from the actual calculations, it adds overhead & complexity to the code which needs to supply the input data and/or use the results.
I agree.
Another problem with fixed point is that the code won't be portable, due to the use of non-standard intrinsic functions such as __SMMLA.
One problem with using float is that it will increase ISR entry and exit times, due to the need to save and restore the FPU registers.
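On the Cortex-M4F that cost is reduced by lazy stacking, which is enabled by default: exception entry reserves stack space for the FP registers, but they are only actually saved if the handler executes an FP instruction. A minimal sketch of enabling it explicitly (register address from the ARMv7-M Architecture Reference Manual; macro names are mine):

```c
#include <stdint.h>

/* FP Context Control Register (ARMv7-M, 0xE000EF34); macro names are mine. */
#define FPU_FPCCR   (*(volatile uint32_t *)0xE000EF34u)
#define FPCCR_ASPEN (1u << 31) /* automatic hardware FP state preservation */
#define FPCCR_LSPEN (1u << 30) /* lazy stacking: defer the actual register save */

static inline void fpu_enable_lazy_stacking(void)
{
    FPU_FPCCR |= (FPCCR_ASPEN | FPCCR_LSPEN);
}
```

So ISRs that never touch the FPU pay only the stack-frame reservation, not the full register save/restore.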
Hypothetically yes, however do your ISRs use the FPU registers?
Ronan Synnott said:Hypothetically yes, however do your ISRs use the FPU registers?
Yes!
Intrinsic functions are more commonly used for 'non-C' actions, such as barrier instructions. In higher-level code, the compiler will generate VFP instructions automatically when compiling for VFP. If you really want to hand-craft a function, you would use assembler rather than intrinsics. For example:
float foo(float a, float b) { return (a + b); }
Compiled with:
armclang -c -O2 --target=arm-arm-none-eabi -mcpu=cortex-m4 -mfpu=vfpv3 float.c
outputs:
foo
    0x00000000:  ee001a10    VMOV      s0,r1
    0x00000004:  ee010a10    VMOV      s2,r0
    0x00000008:  ee310a00    VADD.F32  s0,s2,s0
    0x0000000c:  ee100a10    VMOV      r0,s0
    0x00000010:  4770        BX        lr
For completeness: when compiled without VFP, it calls a library function, which will take many cycles:
foo
    0x00000000:  f7ffbffe    B.W    __aeabi_fadd