Use of smlad instruction in arm_fir_decimate_fast_q15

Hello ARM support team,

I hope you can help me.

I'm making use of your very nice DSP library, specifically the arm_fir_decimate_fast_q15 function. I see the convolution multiplications are implemented using the dual multiply-accumulate instruction (SMLAD, Signed Multiply Accumulate Dual), like this:

__ASM volatile ("smlad %0, %1, %2, %3" : "=r" (acc0) : "r" (x0), "r" (c0), "r" (acc0) );
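
For reference, this is my understanding of what smlad computes, written as a portable C sketch. The helper name smlad_ref is my own and is not part of CMSIS; it ignores the Q flag and simply wraps on overflow.

```c
#include <stdint.h>

/* Reference model of SMLAD (hypothetical helper, not a CMSIS function):
 * result = acc + lo16(x) * lo16(y) + hi16(x) * hi16(y), all halfwords signed.
 * The Q (overflow) flag is not modelled; the sum wraps modulo 2^32. */
static int32_t smlad_ref(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu);   /* bottom halfword of x */
    int16_t xh = (int16_t)(x >> 16);       /* top halfword of x    */
    int16_t yl = (int16_t)(y & 0xFFFFu);   /* bottom halfword of y */
    int16_t yh = (int16_t)(y >> 16);       /* top halfword of y    */

    /* Accumulate in unsigned arithmetic so wraparound is well defined. */
    uint32_t sum = (uint32_t)acc
                 + (uint32_t)((int32_t)xl * yl)
                 + (uint32_t)((int32_t)xh * yh);
    return (int32_t)sum;
}
```

With both halfwords populated, one call performs two products; if the top halfwords of x and y are zero, the second product contributes nothing and the call reduces to a single 16x16-bit multiply-accumulate.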

However, when I inspect the generated assembly I see that each instance of smlad is only doing a single 16x16-bit multiply, and the upper halfwords of the input registers are empty.

It seems the dual aspect of the smlad instruction is wasted, and I get the same performance when I substitute a regular 16x16-bit multiply. This is a problem for me, as I could really do with the performance gain of a dual multiply.
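
To make concrete what I was hoping for, here is a minimal sketch of a 4-tap dot product where each dual multiply-accumulate consumes two packed q15 samples and two packed coefficients. This is only an illustration of the idea, not how the library is written; pack_q15x2, dual_mac and dot4_packed are hypothetical names, and dual_mac is a plain C stand-in for the instruction.

```c
#include <stdint.h>

/* Pack two 16-bit samples into one 32-bit word (lo in bits 15:0, hi in 31:16). */
static uint32_t pack_q15x2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Plain C stand-in for a dual 16x16 multiply-accumulate (wraps on overflow). */
static int32_t dual_mac(uint32_t x, uint32_t c, int32_t acc)
{
    int32_t p_lo = (int32_t)(int16_t)(x & 0xFFFFu) * (int16_t)(c & 0xFFFFu);
    int32_t p_hi = (int32_t)(int16_t)(x >> 16) * (int16_t)(c >> 16);
    return (int32_t)((uint32_t)acc + (uint32_t)p_lo + (uint32_t)p_hi);
}

/* Four taps handled by two dual MACs instead of four single MACs. */
static int32_t dot4_packed(const int16_t x[4], const int16_t c[4])
{
    int32_t acc = 0;
    acc = dual_mac(pack_q15x2(x[0], x[1]), pack_q15x2(c[0], c[1]), acc);
    acc = dual_mac(pack_q15x2(x[2], x[3]), pack_q15x2(c[2], c[3]), acc);
    return acc;
}
```

In real code the packing would ideally come for free from a 32-bit load of two adjacent q15 values, which is the case where I would expect the dual multiply to pay off.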

Can you please confirm whether this is the expected behaviour? And if so, why can't this function take advantage of the dual multiply?

My system details:

CPU: Cortex-M33

MCU: STM32U3

Compiler: GNU 13.3 (STM32CubeIDE 1.18)

Optimisation: -Ofast

Many thanks!

John