
Can't efficiently use SMLAL instruction in Cortex M4

Hi

We are using a Cortex M4 part from NXP, in part because of the (apparently) powerful DSP-style instructions it features. In particular, I am trying to use SMLAL to implement a multiply-accumulate inside a tight loop. I am using Keil uVision 4.23.

I have tried many routes, and there does not seem to be a way to use this instruction efficiently at all. I would expect the following function to use the instruction:

inline long long asm_smlal (long long Acc, long a, long b) {
   return Acc+(a*b);
}


but it does not (I tried many variations on this, including splitting the long long into 2 longs to more closely match the SMLAL parameters). Instead, I get a multiplication with two additions in the disassembly listing. These extra cycles are significant in my application.

I tried to implement the instruction using the inline assembler but, for a reason I could not find explained anywhere, assembly inlining is not supported at all for Thumb-32 (I'm really very frustrated by this missing feature...). Numerous tricks to get around this failed, all pointing back to the same problem: #pragma arm doesn't work, as the M4 does not support the ARM instruction set; trying to force an assembly function to be inline gives the same error; and so on.

I was able to get the SMLAL instruction into a one-line assembly function, but this resulted in a function call every time, a call that 'linker inlining' didn't seem to want to remove even when that feature was enabled in the linker's parameters. Even if the linker inlining had worked, it would not have really helped, as the accumulator registers would have been needlessly reloaded on every loop iteration.

How am I supposed to use this efficient and useful instruction, without writing my whole loop in assembly?

Thanks

- Jeff

  • Hi

    Thanks both for the quick and useful replies. A colleague had already suggested that I take a look at the intrinsics, but sadly there does not seem to be one for this particular instruction (that I could find, anyway).

    It's reassuring to see other people have had the same problem as me. I had ended up writing a very similar assembly code loop to the one you posted, Andrew [I called the loop label mac_loop instead of macloop :)]. I'm sure you are already aware of this, but I thought I would point out that you can do the pointer bumping as part of the LDR instructions using post-indexed addressing. Apologies if you already knew this!

    You mention optimising the loop to avoid pipeline stalls; where can I find documentation on what causes stalling? The subject is mentioned in the technical reference manual, for example in section 3.3.2, 'Load/store timings', but I wonder if there is some overview document that explains how to write assembly code that minimises stalling?

    Thanks again

    - Jeff


  • You might want to look at how the DSP library from CMSIS is written. I believe they make extensive use of intrinsics.

  • Hi Jeff,

    No, I haven't seen any specific info on how to avoid stalling. I looked over the NXP DSP lib for some hints:
    www.nxp.com/.../AN10913.pdf
    and
    www.nxp.com/.../AN10913_CM3_DSP_library_v1_0_0.zip

    Moving the "subs" instruction earlier in the code is one optimization I've seen; that way the branch won't get delayed.

    Even if you're not using an STM32, the following document is worth reading for general M3 info:

    www.hitex.com/.../isg-stm32-v18d-scr.pdf

    Andrew

  • Hi

    Thanks Andrew for the pdf links, some good information in there. Thanks also to Mike for the suggestion to look at CMSIS. I am happy to use intrinsics, but sadly there isn't one for the instruction that I would like to use.

    Thanks Scott for this suggestion. My first thought was "surely assigning the result to an int64_t performs an implicit cast". I had actually misread your code, which in fact casts only one of the operands. As expected, casting the whole result does not make any difference, as that cast is implicit anyway. Casting one of the operands causes them both to be upcast to 64-bit. An extract from the resulting assembler is below:

            LDR.W         r12,[pc,#676]
            LDR           lr,[r12,#0x08]
            SUB           r7,lr,#0x01
            CMP           r7,#0x00
            BLE           loop4
            SUBS          r4,r3,#4
            SUB           r12,r2,#0x04
            TST           lr,#0x01
            BNE           loop3
            LDR           r5,[r12,#0x04]!
            LDR           r6,[r4,#0x04]!
            SMLAL         r0,r1,r5,r6
    loop3   MOVS          r5,#0x00
            LDR           r8,[r12,#0x04]
            LDR           r9,[r4,#0x04]
            MOV           r6,r5
            ASRS          r7,r7,#1
            BEQ           loop2
            NOP
    loop1   LDR           r10,[r12,#0x08]!
            LDR           r11,[r4,#0x08]!
            SMLAL         r0,r1,r8,r9
            SMLAL         r5,r6,r10,r11
            LDR           r8,[r12,#0x04]
            LDR           r9,[r4,#0x04]
            SUBS          r7,r7,#1
            BNE           loop1
    loop2   ADDS          r0,r0,r5
            ADCS          r1,r1,r6
    loop4   SUBS          r12,lr,#0x01
            IT            MI
            POPMI         {r4-r11,pc}
            LDR           r2,[r2,r12,LSL #2]
            LDR           r3,[r3,r12,LSL #2]
            SMLAL         r0,r1,r2,r3
            POP           {r4-r11,pc}
            LDR           r0,[pc,#504]
            LDR           r1,[r0,#0x14]
            TST           r1,#0x40
            IT            EQ
    


    SMLAL!! Thanks Scott. This took 394 cycles for 64 loops. On closer inspection, this is actually performing a long multiplication of two 64-bit numbers, with optimisations that skip multiplications that would have a zero result. With breakpoints set on the first 3 SMLAL instructions, execution never stops there. This all makes sense because the C code Scott suggested actually asks for a 64+(64*64) operation, and the assembly reflects that. So asking for wider operands actually makes the code run faster, which is pretty counterintuitive!

    So this is a big improvement, but ideally I would not want those extra checks in there for values that I know are zero, since they are just the result of a cast. I have ended up spending some time analysing compiler output and writing my own loops.
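    For reference, Scott's actual snippet isn't quoted anywhere in this thread; a C form consistent with the description above (casting just one operand, which promotes the other as well) would be something like:

```c
#include <stdint.h>

/* Hypothetical reconstruction -- the snippet Scott posted isn't quoted in
   this thread.  Casting one operand to int64_t promotes the other operand
   too, so C sees a 64x64-bit multiply; Keil 4.23 lowered that to the long
   multiplication with zero-skip checks shown in the listing above. */
int64_t mac64(int64_t Acc, const int32_t *a, const int32_t *b, int32_t n)
{
    for (int32_t i = 0; i < n; i++)
        Acc += (int64_t)a[i] * b[i];    /* widening multiply-accumulate */
    return Acc;
}
```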

    I thought I would post my loops and some results. In each case I need to perform 64 operations. The cycle counts are not exact; they include a function call to the function containing the loop, and a few instructions for reading the timer.

        PUSH    { r4-r7 }       // stack the registers we are going to be using
        LDRD    r4,r5,[r0,#0]   // load r4 and r5 with the accumulator (located at r0 and r0+4)
    mac_loop
        LDR     r6,[r1],#4      // load multiplicand a pointed to by r1, then bump r1 to the next one
        LDR     r7,[r2],#4      // load multiplicand b pointed to by r2, then bump r2 to the next one
        SMLAL   r4,r5,r6,r7     // perform the MAC
        SUBS    r3,r3,#1        // subtract one from the loop counter
        BNE     mac_loop        // back to start of loop if counter not zero
        STM     r0,{r4,r5}      // store the accumulator value back to its address
        POP     { r4-r7 }       // restore the registers we used
        BX      lr              // return
    


    This took 543 cycles. Moving the SUBS to just before the SMLAL did not reduce this. That seemed like the only place worth moving it to, as splitting up the LDRs means they no longer pipeline, taking 2 cycles each instead of 3 cycles for the pair.

            for (int32_t i = 0; i<Loop; i++) {
                    Acc+=a[i]*b[i];
            }
            return Acc;
    


    This took 520 cycles. This was very confusing when I looked at the code the compiler output:

    loop:   MUL     r8,r8,r11
            LDR     r9,[r5,#0x08]!
            LDR     r10,[r6,#0x08]!
            ADDS    r0,r8,r0
            ADC     r1,r1,r8,ASR #31
            MUL     r9,r9,r10
            ADDS    r12,r9,r12
            LDR     r8,[r5,#0x04]
            LDR     r11,[r6,#0x04]
            ADC     r4,r4,r9,ASR #31
            SUBS    r7,r7,#1
            BNE     loop
    


    According to the listing, this is the assembly code output. By my calculation this should take 12 cycles (assuming no pipeline stalling) plus the SUBS and BNE, compared to 4 for my loop. If anybody can offer some sort of explanation for this, that would be superb.

    Next I tried unrolling my loop. The result took 299 cycles, or ~4.7 per MAC. Since the LDR, LDR, SMLAL sequence takes 4 cycles in the best case, I determined that the extra 43 cycles are the result of the stack pushing/popping, function-call branches, etc.
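    The unrolled assembly itself isn't posted here, but at the C level the idea is just the following (a sketch assuming the count is a multiple of 4, as it is in these tests):

```c
#include <stdint.h>

/* Sketch only -- the actual unrolled assembly isn't posted above.  Unrolling
   by four amortises the SUBS/BNE loop overhead over four MACs; n is assumed
   to be a multiple of 4 (it is 64 in these tests). */
int64_t mac64_unrolled(int64_t Acc, const int32_t *a, const int32_t *b, int32_t n)
{
    for (int32_t i = 0; i < n; i += 4) {
        Acc += (int64_t)a[i]     * b[i];
        Acc += (int64_t)a[i + 1] * b[i + 1];
        Acc += (int64_t)a[i + 2] * b[i + 2];
        Acc += (int64_t)a[i + 3] * b[i + 3];
    }
    return Acc;
}
```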

    Getting better, but it's still a real shame that inline assembly is not allowed; with that, I should be able to get something around 260 cycles for the 'loop'.

    In summary:

    Initial C loop (no SMLAL): 520 cycles
    Assembly loop (SMLAL): 543 cycles
    C loop (Scott's SMLAL): 394 cycles
    Unrolled assembly loop: 299 cycles

    Remaining questions:
    Why does my assembly loop take longer than the compiler-generated one, which has at least twice the instructions?
    Why will the compiler only generate SMLAL in the case of 64*64 bit multiplication?
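    For what it's worth, I suspect the answer to the second question is down to C semantics: with 32-bit operands, a*b wraps to 32 bits before the 64-bit add, so SMLAL, which keeps the full 64-bit product, would change the result whenever the product overflows. A host-side sketch of the difference (my own illustration, nothing Keil-specific; the wrap goes through uint32_t because signed overflow is undefined in C):

```c
#include <stdint.h>

/* What C mandates for Acc + (a*b) with 32-bit longs: the product wraps to
   32 bits *before* being widened for the 64-bit add. */
static int64_t mul_trunc(int32_t a, int32_t b)
{
    return (int64_t)(int32_t)((uint32_t)a * (uint32_t)b);  /* 32x32 -> 32, then sign-extend */
}

/* What SMLAL/SMULL compute: the full 32x32 -> 64 product.  The compiler can
   only use them when the source asks for a widening multiply. */
static int64_t mul_wide(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}
```

    The two agree while the product fits in 32 bits and diverge as soon as it overflows, which is exactly when substituting SMLAL for the MUL/ADDS/ADC sequence would be unsafe.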

    Thanks again for all the responses/interest so far.

    Cheers

    - Jeff