Hello ARM support team,
I hope you can help me.
I'm using your very nice DSP library, specifically the arm_fir_decimate_fast_q15 function. I see that the convolution multiplications are implemented with the signed dual multiply-accumulate instruction (SMLAD), like this:
__ASM volatile ("smlad %0, %1, %2, %3" : "=r" (acc0) : "r" (x0), "r" (c0), "r" (acc0) );
However, when I inspect the generated assembly, I see that each smlad instance performs only a single 16x16-bit multiply: the upper halfwords of the input registers are empty.
It seems the dual aspect of the smlad instruction is wasted, and I get the same performance when I substitute a regular 16x16-bit multiply. This is a problem for me, as I could really use the performance gain of a dual multiply.
Can you please confirm whether this is the expected behaviour? If so, why can't this function take advantage of the dual multiply?
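To make the point concrete, here is a minimal portable sketch of the arithmetic SMLAD performs, as described in the Arm architecture documentation: both 16-bit halves of each operand are multiplied and both products are added to the accumulator. The helper names pack_q15x2 and smlad_model are hypothetical and are not part of CMSIS-DSP. The model ignores the Q (overflow) flag and assumes the usual two's-complement conversions that GCC provides.

```c
#include <stdint.h>

/* Hypothetical helper: pack two q15 values into one 32-bit word,
 * `lo` in bits 0..15 and `hi` in bits 16..31. */
static uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Hypothetical reference model of SMLAD:
 *   result = acc + lo(x) * lo(y) + hi(x) * hi(y)
 * The sum wraps modulo 2^32; the Q flag is not modelled. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    /* Sign-extend each halfword before multiplying. */
    int32_t lo = (int32_t)(int16_t)(x & 0xFFFFu) * (int16_t)(y & 0xFFFFu);
    int32_t hi = (int32_t)(int16_t)(x >> 16) * (int16_t)(y >> 16);

    /* Accumulate in unsigned arithmetic to avoid signed-overflow UB. */
    return (int32_t)((uint32_t)acc + (uint32_t)lo + (uint32_t)hi);
}
```

If the upper halfwords of x and c are zero, as observed in the disassembly, the hi product is always zero and each smlad contributes only one useful multiply.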
My system details:
CPU: Cortex-M33
MCU: STM32U3
Compiler: GNU 13.3 (STM32CubeIDE 1.18)
Optimisation: -Ofast
Many thanks!
John