Hello ARM support team,
I hope you can help me.
I'm making use of your very nice DSP library, specifically the arm_fir_decimate_fast_q15 function. I see the convolution multiplications are implemented using the dual 16-bit multiply-accumulate instruction (smlad), like this:
__ASM volatile ("smlad %0, %1, %2, %3" : "=r" (acc0) : "r" (x0), "r" (c0), "r" (acc0) );
However, when I inspect the generated assembly, I see that each instance of smlad is only doing a single 16x16-bit multiply, and the upper halfwords of the input registers are empty.
It seems the dual aspect of the smlad instruction is wasted, and I get the same performance when I substitute a regular 16x16-bit multiply. This is a problem for me, as I could really do with the performance gain of a dual multiply.
Can you please confirm whether this is the expected behaviour? And if so, what is the reason this function can't take advantage of the dual multiply?
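To make clear what I mean by the "dual aspect", here is a minimal sketch of my understanding of the smlad arithmetic in portable C. The helper names smlad_model and pack_q15x2 are hypothetical, made up for illustration only, and are not part of the DSP library. The sketch ignores the Q flag that the real instruction sets on overflow.

```c
#include <stdint.h>

/* Hypothetical helper (not a library function): pack two q15 values
 * into one 32-bit word, 'lo' in bits [15:0] and 'hi' in bits [31:16]. */
static uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Hypothetical portable model of smlad:
 *   result = acc + x[15:0] * c[15:0] + x[31:16] * c[31:16]
 * The sum is formed in unsigned arithmetic so that it wraps modulo 2^32
 * instead of invoking signed-overflow undefined behaviour. */
static int32_t smlad_model(uint32_t x, uint32_t c, int32_t acc)
{
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t c_lo = (int16_t)(c & 0xFFFFu);
    int16_t c_hi = (int16_t)(c >> 16);

    uint32_t sum = (uint32_t)acc
                 + (uint32_t)((int32_t)x_lo * c_lo)
                 + (uint32_t)((int32_t)x_hi * c_hi);
    return (int32_t)sum;
}
```

With both halfwords populated, one call accumulates two products, for example smlad_model(pack_q15x2(1, 2), pack_q15x2(3, 4), 0) gives 1*3 + 2*4. When the upper halfwords are zero, as in the code I am seeing, the second product contributes nothing and the call reduces to a single multiply-accumulate.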
My system details:
CPU: Cortex-M33
MCU: STM32U3
Compiler: GNU 13.3 (STM32CubeIDE 1.18)
Optimisation: -Ofast
Many thanks!
John