Hello!
I am trying to implement IIR filter algorithms on an STM32F767. I'm using the CMSIS library for the filter algorithms and they are working as expected.
However, the execution speed is much lower than expected and I'm not sure why.
I'm calculating the output samples from the input and output (circular) buffer with coefficients and data structures in place as recommended in the CMSIS documentation. Calculating a single sample takes around 285 clock cycles for the floating point implementation, and around 580 if I use fixed point data.
A 2nd-order IIR sample calculation should only need five multiplications plus some additions and rounding/shifting, so I am very surprised by the large cycle count. I would expect a figure in the low two digits.
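To illustrate what I mean, the bare per-sample arithmetic of a Direct Form I biquad is only this (a plain-C sketch of the textbook form, not the CMSIS code; note that CMSIS-DSP stores the a-coefficients pre-negated, so they are added here as well):

```c
#include <stdint.h>

/* Direct Form I biquad, one sample:
 *   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2]
 * with a1/a2 already negated, as in the CMSIS-DSP convention. */
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients */
    float x1, x2, y1, y2;       /* state: previous inputs/outputs */
} biquad_df1;

static inline float biquad_df1_step(biquad_df1 *f, float x)
{
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
            + f->a1 * f->y1 + f->a2 * f->y2;
    f->x2 = f->x1;  f->x1 = x;      /* shift input history */
    f->y2 = f->y1;  f->y1 = y;      /* shift output history */
    return y;
}
```

That is five multiplies and four adds, which on a Cortex-M7 with the FPU active I would expect to map to a handful of VFMA-type instructions, hence my surprise.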
For cycle counting I am using the DWT_CYCCNT feature and a subtraction to display the actual cycle counts used by certain function calls/code snippets. I'm using a sampling frequency of 48 kHz. A 12.288 MHz external clock gets the core running at 215.808 MHz. The data comes in from a pair of 8-channel ADCs via DMA upon an SAI interrupt and is sent out via DMA as well.
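The measurement pattern I'm using looks roughly like this (a sketch; the register setup shown in the comments is the standard Cortex-M CoreSight sequence, and the unsigned subtraction handles a wrap of the 32-bit counter between the two reads):

```c
#include <stdint.h>

/* On target, before measuring (CMSIS core register names):
 *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT unit
 *   DWT->CYCCNT = 0;
 *   DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start the cycle counter
 * then read DWT->CYCCNT immediately before and after the code under test. */

/* Elapsed cycles between two CYCCNT reads. Plain unsigned subtraction is
 * correct even if the 32-bit counter wraps once between the reads. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```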
I want faster calculation because the end application will pass 16 channels of audio through the µC and should apply filtering to each of them.
If I had to guess at the source of the problem, I would suspect that the data I am working with takes a long time to reach the FPU and is therefore slowing down the operation. However, I would be surprised if a CMSIS function were not optimized in this regard. While I'm aware that writing assembly code for this calculation would probably speed things up, I'm trying not to go that deep unless absolutely necessary.
Has anyone here had this experience and/or could point me to the source of the problem?
Thank you very much in advance!
Michele
Hi Michele,
the source code for CMSIS DSP is available on GitHub if you want to explore the detailed implementation: https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/DSP/Source/FilteringFunctions
I haven't tried it myself, but perhaps it's worth checking that you're building the code with the FPU enabled? It might also be useful to post the generated assembly of the function you're testing here (or, even better, a trace).
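For reference, the FPU access enable that SystemInit() normally performs on Cortex-M7 parts is the standard CMSIS device-startup fragment below; it is worth confirming it actually runs before any floating-point code:

/* from the standard CMSIS system_<device>.c (register setup, runs on target only) */
#if (__FPU_PRESENT == 1) && (__FPU_USED == 1)
  SCB->CPACR |= ((3UL << 10*2) | (3UL << 11*2));  /* grant full access to CP10/CP11 (the FPU) */
#endif

If CP10/CP11 are not enabled, every FP instruction traps, which would show up as a dramatic slowdown rather than a crash in some configurations.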
I hope this helps.
Best Regards,
Stefano
Hello Stefano,
thanks for your reply. I have checked that "__FPU_USED" is set, and that is the case. I have also checked the compiler options (-mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-d16); to me the setup seems to be in order. I have tried different optimization levels for the compiler, but the numbers stay in the range mentioned above.
I have also stepped through the CMSIS filter function (arm_biquad_cascade_df1_f32) and observed that just moving samples and coefficients from one place to another wastes 24 cycles per transfer. So the code below alone takes around 200 cycles to execute.
from "arm_biquad_cascade_df1_f32":
    /* Reading the coefficients */
    b0 = *pCoeffs++;
    b1 = *pCoeffs++;
    b2 = *pCoeffs++;
    a1 = *pCoeffs++;
    a2 = *pCoeffs++;

    /* Reading the pState values */
    Xn1 = pState[0];
    Xn2 = pState[1];
    Yn1 = pState[2];
    Yn2 = pState[3];
The actual operation needed to calculate a sample (five multiply-accumulates) takes only part of the remaining time.
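If I read the structure right, those coefficient/state loads happen once per call rather than once per sample, so calling the filter with one sample at a time pays that cost on every sample, while a larger blockSize amortizes it. A plain-C sketch of that structure (my simplification, not the CMSIS source):

```c
#include <stddef.h>

/* One DF1 biquad stage over a block of samples. Coefficients and state are
 * read into locals once per call, so their load cost is shared across all
 * blockSize samples -- the same shape as arm_biquad_cascade_df1_f32. */
static void biquad_df1_block(const float coeffs[5], float state[4],
                             const float *src, float *dst, size_t blockSize)
{
    /* loaded once per call, not once per sample */
    float b0 = coeffs[0], b1 = coeffs[1], b2 = coeffs[2];
    float a1 = coeffs[3], a2 = coeffs[4];
    float x1 = state[0], x2 = state[1], y1 = state[2], y2 = state[3];

    for (size_t n = 0; n < blockSize; n++) {
        float x = src[n];
        float y = b0 * x + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
        x2 = x1; x1 = x;    /* shift input history */
        y2 = y1; y1 = y;    /* shift output history */
        dst[n] = y;
    }

    /* written back once per call */
    state[0] = x1; state[1] = x2; state[2] = y1; state[3] = y2;
}
```

So with 16 channels at 48 kHz, processing the samples in DMA-sized blocks instead of per-sample calls might already recover a lot of the overhead I measured.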
I could imagine that in debug mode the compiler's optimization is far less aggressive than it can be in a release build.
My question at this point:
Do I affect execution times by using debug mode? Or in other words: will the compiler be able to recognize and optimize away those "useless" data transfers if I switch to a release build?
If yes, how could I verify this, and how could I find the actual execution time/cycle count while the system is in operation?
thanks for your answers!