Hi
We are using a Cortex M4 part from NXP, in part because of the (apparently) powerful DSP-style instructions it features. In particular, I am trying to use SMLAL to implement a multiply-accumulate inside a tight loop. I am using Keil uVision 4.23.
I have tried many routes, and there does not seem to be a way to use this instruction efficiently at all. I would expect the following function to use the instruction:
inline long long asm_smlal (long long Acc, long a, long b) { return Acc+(a*b); }
but it does not (I tried many variations on this, including splitting the long long into two longs to more closely match the SMLAL parameters). Instead, I get a multiplication followed by two additions in the disassembly listing. These extra cycles are significant in my application.
I tried to implement the instruction using inline assembler, but, for a reason I could not find explained anywhere, inline assembly is not supported at all for Thumb-32 (really very frustrated by this missing feature...). Numerous tricks to get around this didn't work, all pointing back to the same problem (e.g. #pragma arm doesn't work, as the M4 does not support the ARM instruction set; trying to force an assembly function to be inline gives the same error; etc.).
I was able to get the SMLAL instruction into a one-line assembly function, but this resulted in a function call every time, a call that 'linker inlining' didn't seem to want to remove even when I enabled that feature in the linker options. Even if the linker inlining had worked, it would not have really helped, as the accumulator registers would still have been needlessly reloaded on every loop iteration.
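For what it's worth, the one-line function looked roughly like this (a sketch, not my exact code; per the AAPCS the long long accumulator arrives split across r0 (low word) and r1 (high word), a in r2, b in r3, and the 64-bit result is returned in r0/r1):

__asm long long asm_smlal(long long Acc, long a, long b)
{
    SMLAL r0, r1, r2, r3   ; r1:r0 += r2 * r3
    BX    lr               ; 64-bit result returned in r0/r1
}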
How am I supposed to use this efficient and useful function, without writing my whole loop in assembly?
Thanks
- Jeff
On closer inspection, this is actually performing a long multiplication of two 64-bit numbers, ...
I don't think so. This loop:
...
loop1   LDR   r10,[r12,#0x08]!
        LDR   r11,[r4,#0x08]!
        SMLAL r0,r1,r8,r9
        SMLAL r5,r6,r10,r11
        LDR   r8,[r12,#0x04]
        LDR   r9,[r4,#0x04]
        SUBS  r7,r7,#1
        BNE   loop1
...
is using SMLAL to do 32x32=>64 (and a 64-bit accumulate). It's been unrolled to do it twice per loop (four LDRs, two SMLALs), but there is no 64x64 being done.
This all makes sense because the C code Scott suggested actually asks for a 64+(64*64) operation, ...
Yes, technically, at the C level the '(long long)' cast causes the multiplication to be of long longs. But armcc realizes that SML[A]L (which does 32x32=>64) will give the correct answer for this case and so it uses it.
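For example (my sketch of the case in question):

long a, b;
long long Acc;

Acc += (long long)a * b;   // formally a 64x64 multiply, but armcc can see both
                           // operands are sign-extended 32-bit values, so SMLAL
                           // gives the same answer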
unrolled assembly loop: 299
The loop using MUL is doing a different calculation, since MUL is 32x32=>32 and SMLAL is (in part) 32x32=>64. Depending on the input data the results could be different or the same (the same in the case where the 32x32=>32 never happens to overflow).
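To make that concrete with a made-up pair of inputs (strictly, the first form overflows a signed multiply, which is the "Undefined Behavior" discussed in the aside below):

long a = 0x10000, b = 0x10000;   // 65536 * 65536 = 2^32, which does not fit in 32 bits
long long Acc = 0;

Acc += a * b;                    // 32x32=>32: the truncated product is 0
Acc += (long long)a * b;         // 32x32=>64: the product is 4294967296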
Why does my assembly loop take longer than the compiler generated one, which has at least twice the instructions?
Your mac_loop is not unrolled and spends more time branching (because it goes 'round the loop twice as many times).
Why will the compiler only generate SMLAL in the case of 64*64 bit multiplication?
Because in C (given 32 bits as the width of 'long') multiplication of two 'longs' must be 32x32=>32, and using SMLAL to do the multiplication would be wrong.
[Aside: one could argue that when long*long overflows the user has invoked the dreaded "Undefined Behavior" so armcc could in fact use SMLAL for this case since if the multiplication overflows all bets are off. But armcc doesn't do it (maybe it's "too unexpected"). If, instead, the multiplication were unsigned long*unsigned long then there's no undefined behavior and UML[A]L is definitely not a good idea.]
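A quick sketch of the unsigned case the aside describes (variable names are mine):

unsigned long ua, ub;
unsigned long long Acc;

Acc += ua * ub;                      // ua*ub wraps modulo 2^32 by definition, so the
                                     // truncated product must be kept; UMLAL would
                                     // retain the high 32 bits and change the result
Acc += (unsigned long long)ua * ub;  // the full 64-bit product is requested, so UMLAL is fine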
Not necessarily. As long as the compiler extracts the right 32 bits of signed integer result out of it, it can use whatever machine instruction it wants to.
As long as the compiler extracts the right 32 bits of signed integer result out of it, it can use whatever machine instruction it wants to.
True, but since the SMLAL does (32x32=>64)+64 it's not straightforward to use it in a situation where (32x32=>32=>64)+64 is required. And since SMULL is slower (or at least no faster) and requires more registers than MUL, there is probably no reason to use it either.
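Spelled out in C (names are mine), what plain long*long accumulation requires is:

long a, b;
long long Acc;

long p = a * b;   // 32x32=>32: the high half of the product is discarded
Acc += p;         // the truncated product is then sign-extended to 64 bits and added

which comes out as a MUL plus a 64-bit add with sign extension rather than a single SMLAL.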
Hi Scott
Thanks for the detailed reply.
I have looked again at the assembly output from your MAC operation, which includes a cast, and it looks like I was wrong in my assertion that it was performing a long multiplication. I guess I was getting a bit too involved for day #1 of ARM assembler! Thanks for this correction (also, it looks like breakpoints in the disassembly listing don't work reliably, which threw me off a bit).
long a, b;
long long Acc;

Acc += a*b;                        // long long += long*long; implicit cast of (a*b) up to long long
Acc += (long long)(a*b);           // same result, cast now explicit but assembly output is identical

// Now for your suggestion:
Acc += (long long)a*b;             // explicit cast of a to long long causes implicit cast of b up to long long also
Acc += (long long)a*(long long)b;  // same assembler output as the previous line, as expected
So the code says (64x64=>64)+64, but the compiler presumably realises that the operands are in fact guaranteed to fit in 32 bits, so reduces this by optimisation to (32x32=>64)+64. Great!
True, but what confused me was that my loop was taking longer than the apparently more complicated loop generated from the simple
for (int32_t i = 0; i<Loop; i++) { Acc+=a[i]*b[i]; }
C code. It turns out that I had identified the wrong piece of assembly code. The actual loop is:
loop    LDR   r5,[r2,r12,LSL #2]
        LDR   r6,[r3,r12,LSL #2]
        MULS  r5,r6,r5
        ADDS  r0,r5,r0
        ADC   r1,r1,r5,ASR #31
        ADD   r12,r12,#0x01
        CMP   r4,r12
        BGT   loop
which is (32*32=>32)+64, as you rightly suggested. Compare this to my loop:
loop    LDR   r6,[r1],#4    // load multiplicand a pointed to by r1, then advance r1 to the next element
        LDR   r7,[r2],#4    // load multiplicand b pointed to by r2, then advance r2 to the next element
        SMLAL r4,r5,r6,r7   // perform the MAC
        SUBS  r3,r3,#1      // subtract one from the loop counter
        BNE   loop          // back to the start of the loop if the counter is not zero
and I still do not understand why my loop takes longer to execute, despite having two fewer 1-cycle instructions in the loop.
I'm not sure that this is the case; as Hans points out, the compiler can use any instruction it likes, as long as the result is correct. The overflow behavior of long*long is not actually defined in the C standard. So it seems like the real issue is that there is no C syntax that directly expresses (32x32=>64)+64: the cast formally asks for a 64x64 multiply and relies on the compiler to narrow it. Looks to me like a prime example of where an intrinsic would be a good (if not very portable) solution; I wonder why one is not provided.
So in the end, for 64 loops:

C loop, upcasting operands to 64-bit: 394
unrolled assembly loop: 299
I guess this is pretty good. I am just starting to learn this type of programming, so now I have a much better understanding of how close in performance the compiler output can get to hand optimised/unrolled assembler. In this case about 30% slower.
Thanks to all.
Cheers