This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question you can start a new discussion

Can't efficiently use SMLAL instruction in Cortex M4

Hi

We are using a Cortex M4 part from NXP, in part because of the (apparently) powerful DSP-style instructions it features. In particular, I am trying to use SMLAL to implement a multiply-accumulate inside a tight loop. I am using Keil uVision 4.23.

I have tried many routes, and there does not seem to be a way to efficiently use this instruction at all. The following function I would expect to use the instruction:

inline long long asm_smlal (long long Acc, long a, long b) {
   return Acc+(a*b);
}


but it does not (I tried many variations on this, including splitting the long long into 2 longs to more closely match the SMLAL parameters). Instead, I get a multiplication with two additions in the disassembly listing. These extra cycles are significant in my application.

I tried to implement the instruction using inline assembler, but, for a reason I could not find explained anywhere, assembly inlining is not supported at all for Thumb-32 (really very frustrated by this missing feature...). Numerous tricks to get around this didn't work, all pointing back to the same problem (e.g. #pragma arm doesn't work, as the M4 does not support the arm instruction set. Trying to force an assembly function to be inline gives the same error, etc).

I was able to get the SMLAL instruction in a 1-line assembly function, but this resulted in a function call every time. A function call that 'linker inlining' didn't seem to want to remove, when enabling this feature in its parameters. Even if the linker inlining had worked, it would not have really helped, as the accumulator registers would have been needlessly reloaded every loop.

How am I supposed to use this efficient and useful function, without writing my whole loop in assembly?

Thanks

- Jeff

Parents Reply Children
No data