Hi
We are using a Cortex-M4 part from NXP, partly because of the (apparently) powerful DSP-style instructions it features. In particular, I am trying to use SMLAL to implement a multiply-accumulate inside a tight loop. I am using Keil uVision 4.23.
I have tried many routes, and there does not seem to be a way to use this instruction efficiently at all. I would expect the following function to compile to the instruction:
inline long long asm_smlal (long long Acc, long a, long b) { return Acc+(a*b); }
but it does not (I tried many variations on this, including splitting the long long into 2 longs to more closely match the SMLAL parameters). Instead, I get a multiplication with two additions in the disassembly listing. These extra cycles are significant in my application.
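One possible cause, offered as a sketch rather than a confirmed fix: with a 32-bit long, a*b in the function above is evaluated as a 32-bit multiply and only then widened for the addition, which does not match SMLAL's 32x32->64 multiply-accumulate. The variant below widens one operand before multiplying. The names mac64 and dot32 are hypothetical, the fixed-width types are an assumption standing in for long and long long, and whether a given armcc version and optimization level actually emits SMLAL here has to be checked in the disassembly listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): casting one operand to
   64 bits BEFORE the multiply makes the product a full 64-bit value, which is
   the arithmetic that SMLAL performs. */
static inline int64_t mac64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* Hypothetical dot product: the accumulator is a local variable, so the
   compiler is free to keep it in a register pair across iterations instead
   of reloading it on every pass. */
static int64_t dot32(const int32_t *x, const int32_t *y, size_t n)
{
    int64_t acc = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        acc = mac64(acc, x[i], y[i]);
    }
    return acc;
}
```

Keeping the multiply-accumulate as a small inlined C function, rather than a separate assembly routine, also avoids the per-iteration function call, provided the compiler does inline it.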
I tried to implement the instruction using inline assembler but, for a reason I could not find explained anywhere, inline assembly is not supported at all for Thumb-2 (I am really very frustrated by this missing feature...). Numerous tricks to get around this didn't work, all pointing back to the same problem (e.g. #pragma arm doesn't work, as the M4 does not support the ARM instruction set; trying to force an assembly function to be inline gives the same error, etc.).
I was able to get the SMLAL instruction in a one-line assembly function, but this resulted in a function call every time, and enabling 'linker inlining' in the linker options did not remove that call. Even if the linker inlining had worked, it would not have really helped, as the accumulator registers would have been needlessly reloaded on every loop iteration.
How am I supposed to use this efficient and useful instruction without writing my whole loop in assembly?
Thanks
- Jeff
Hi Jeff,
No, I haven't seen any specific info on how to avoid stalling. I looked over the NXP DSP lib for some hints: www.nxp.com/.../AN10913.pdf and www.nxp.com/.../AN10913_CM3_DSP_library_v1_0_0.zip
Moving the "subs" instruction earlier in the code is one optimization I've seen; that way the branch won't get delayed.
Even if you're not using an STM32, the following document is worth reading for general M3 info:
www.hitex.com/.../isg-stm32-v18d-scr.pdf
Andrew