
Can't efficiently use SMLAL instruction in Cortex M4

Hi

We are using a Cortex M4 part from NXP, in part because of the (apparently) powerful DSP-style instructions it features. In particular, I am trying to use SMLAL to implement a multiply-accumulate inside a tight loop. I am using Keil uVision 4.23.

I have tried many routes, and there does not seem to be a way to use this instruction efficiently at all. I would expect the following function to use the instruction:

inline long long asm_smlal (long long Acc, long a, long b) {
   return Acc+(a*b);
}


but it does not (I tried many variations on this, including splitting the long long into 2 longs to more closely match the SMLAL parameters). Instead, I get a multiplication with two additions in the disassembly listing. These extra cycles are significant in my application.

I tried to implement the instruction using inline assembler, but, for a reason I could not find explained anywhere, inline assembly is not supported at all for Thumb-32 (really very frustrated by this missing feature...). Numerous tricks to get around this didn't work, all pointing back to the same problem (e.g. #pragma arm doesn't work, as the M4 does not support the ARM instruction set; trying to force an assembly function to be inlined gives the same error, etc.).

I was able to get the SMLAL instruction into a one-line assembly function, but this resulted in a function call every time, a call that 'linker inlining' did not seem to remove even when I enabled that feature in the linker options. Even if the linker inlining had worked, it would not have helped much, as the accumulator registers would still have been needlessly reloaded on every loop iteration.

How am I supposed to use this efficient and useful instruction without writing my whole loop in assembly?

Thanks

- Jeff

  • I looked into this as well and have not found a way to get the compiler to generate the instruction, nor a way to use inline assembly. Embedded assembly works, but that means effectively writing your inner loop in assembly. Here's an example (not very optimized):

    __asm int64_t DotProd(int32_t *x, int32_t *y, int len)
    {
            // r0, r1 are pointers to next data values
            // r2 is the length of the vector (and loop counter)
            // r3, r4 are the data values (dereferenced r0, r1)
            // r5, r6 are the accumulator and return value
    
                    mov     r5, #0
                    mov     r6, #0                                                  // initialize accumulator, which is also return value
    
    macloop
                    ldr     r3, [r0]                                                // dereference current pointers
                    ldr     r4, [r1]
                    smlal   r5, r6, r3, r4                  // multiply and accumulate into r5 and r6 (low part in r5)
                    add     r0, r0, #4                                      // bump pointers to next value
                    add     r1, r1, #4
                    subs            r2, r2, #1                                      // decrement loop counter and branch if not zero
                    bne macloop
    
                    mov     r0, r5                                                  // copy results to normal return registers
                    mov     r1, r6
    
                    bx lr
    }
    
    

    In our case it doesn't matter that much because we would end up writing in assembly anyway to make sure we order instructions the right way to avoid pipeline stalls and unroll our loop appropriately (not shown above). It would be nice to have a compiler with intrinsics for common DSP tasks but I'm not aware of one for M3/M4.
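
    For completeness, calling an embedded-assembly routine like this from C is just an ordinary function call; a rough sketch of a caller (the array names and length below are only illustrative):

    #include <stdint.h>

    extern int64_t DotProd(int32_t *x, int32_t *y, int len);   // the embedded-assembly routine above

    int32_t a[64], b[64];                                      // illustrative input vectors

    int64_t test(void)
    {
            return DotProd(a, b, 64);                          // 64-point MAC accumulated in 64 bits
    }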

    NXP does have a DSP lib we studied for inspiration.

  • You might look into the ARM Compiler toolchain ARMv6 SIMD Instruction Intrinsics online docs:

    Overview:
    infocenter.arm.com/.../CHDECGJB.html

    e.g. __smlald intrinsic:
    infocenter.arm.com/.../CJACAAFC.html

  • Hi

    Thanks both for the quick and useful replies. A colleague had already suggested that I take a look at the intrinsics, but sadly there does not seem to be one for this particular instruction (that I could find, anyway).

    It's reassuring to see other people have had the same problem as me. I had ended up writing a very similar assembly loop to the one you posted, Andrew [I called the loop label mac_loop instead of macloop :)]. I'm sure you are already aware of this, but I thought I would point out that you can do the pointer bumping as part of the LDR instructions by using the post-indexed addressing mode. Apologies if you already knew this!

    You mention optimising the loop to avoid pipeline stalls; where can I find documentation on what causes stalling? This subject is mentioned in the technical reference manual, for example in section 3.3.2 'Load/store timings', but I wonder if there is some overview document that explains how to write assembly code that minimises stalling?

    Thanks again

    - Jeff

  • You might want to look at how the DSP library from CMSIS is written. I believe they make extensive use of intrinsics.
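
    It also has ready-made functions for this kind of operation; from memory the q31 dot product looks roughly like the sketch below (check arm_math.h for the exact prototype, and note that the documentation describes the q31 result as being in a 16.48 format, so it is not bit-exact with a plain SMLAL accumulation):

    #include "arm_math.h"                        // CMSIS-DSP header; q31_t is int32_t, q63_t is int64_t

    q31_t a[64], b[64];                          // illustrative input vectors
    q63_t result;

    void dot_example(void)
    {
            arm_dot_prod_q31(a, b, 64, &result); // 64-point dot product into a 64-bit accumulator
    }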

  • Hi Jeff,

    No, I haven't seen any specific info on how to avoid stalling. I looked over the NXP DSP lib for some hints:
    www.nxp.com/.../AN10913.pdf
    and
    www.nxp.com/.../AN10913_CM3_DSP_library_v1_0_0.zip

    Moving the "subs" instruction earlier in the loop is one optimization I've seen; that way the branch won't get delayed.

    Even if you're not using a STM32 the following document is worth reading for general M3 info:

    www.hitex.com/.../isg-stm32-v18d-scr.pdf

    Andrew

  • I think the C code you are looking for is:

    inline long long asm_smlal (long long Acc, long a, long b) {
       return Acc + ((long long)a*b);
    }
    

    Without the '(long long)' cast the multiplication part is 32x32=>32, which does not match SMLAL. SMLAL does 32x32=>64.
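
    A quick illustration of the difference (the values are arbitrary, and this assumes a 32-bit 'long' as with armcc; strictly the overflowing 32-bit multiply is undefined behaviour, but in practice it simply wraps here):

    #include <stdio.h>

    int main(void)
    {
            long a = 100000, b = 100000;                  // a*b = 10^10, which does not fit in 32 bits
            long long Acc = 0;

            long long narrow = Acc + (a * b);             // 32x32=>32, then widened: 1410065408
            long long wide   = Acc + ((long long)a * b);  // 32x32=>64, what SMLAL computes: 10000000000

            printf("%lld %lld\n", narrow, wide);
            return 0;
    }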

  • Scott, thanks for that. First time I've seen smlal generated by the compiler.

    Andrew

  • Hi

    Thanks Andrew for the pdf links, some good information in there. Thanks also to Mike for the suggestion to look at CMSIS. I am happy to use intrinsics, but sadly there isn't one for the instruction that I would like to use.

    Thanks Scott for this suggestion. My first thought was "surely assigning the result to an int64_t performs an implicit cast", but I had misread your code, which in fact casts only one of the operands. As expected, casting the whole result makes no difference, since that cast happens implicitly anyway; casting one of the operands causes both to be promoted to 64 bits. An extract from the resulting assembler is below:

            LDR.W         r12,[pc,#676]
            LDR           lr,[r12,#0x08]
            SUB           r7,lr,#0x01
            CMP           r7,#0x00
            BLE           loop4
            SUBS          r4,r3,#4
            SUB           r12,r2,#0x04
            TST           lr,#0x01
            BNE           loop3
            LDR           r5,[r12,#0x04]!
            LDR           r6,[r4,#0x04]!
            SMLAL         r0,r1,r5,r6
    loop3   MOVS          r5,#0x00
            LDR           r8,[r12,#0x04]
            LDR           r9,[r4,#0x04]
            MOV           r6,r5
            ASRS          r7,r7,#1
            BEQ           loop2
            NOP
    loop1   LDR           r10,[r12,#0x08]!
            LDR           r11,[r4,#0x08]!
            SMLAL         r0,r1,r8,r9
            SMLAL         r5,r6,r10,r11
            LDR           r8,[r12,#0x04]
            LDR           r9,[r4,#0x04]
            SUBS          r7,r7,#1
            BNE           loop1
    loop2   ADDS          r0,r0,r5
            ADCS          r1,r1,r6
    loop4   SUBS          r12,lr,#0x01
            IT            MI
            POPMI         {r4-r11,pc}
            LDR           r2,[r2,r12,LSL #2]
            LDR           r3,[r3,r12,LSL #2]
            SMLAL         r0,r1,r2,r3
            POP           {r4-r11,pc}
            LDR           r0,[pc,#504]
            LDR           r1,[r0,#0x14]
            TST           r1,#0x40
            IT            EQ
    


    SMLAL!! Thanks Scott. This took 394 cycles for 64 loops. On closer inspection, this is actually performing a long multiplication of two 64-bit numbers, with optimisations that don't bother with multiplications that would have a zero result. Setting break points on the first three SMLAL instructions, the code never stops at them. This all makes sense because the C code Scott suggested actually asks for a 64+(64*64) operation, and the assembly reflects that. So asking for wider operands actually makes the code run faster, which is pretty counterintuitive!

    So this is a big improvement, but ideally I would not want those extra checks in there for values that I know are zero, since they are only there because of the cast. I have ended up spending some time analysing compiler output and writing my own loops.

    I thought I would post my loops and some results. In each case I need to perform 64 operations. The cycle counts are not exact; they include a function call to the function containing the loop, and a few instructions for reading the timer.
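
    (For anyone wanting to reproduce the numbers: one way to read cycle counts on the M4 is the DWT cycle counter that CMSIS exposes. A rough sketch of that approach is below; it is not necessarily identical to how my timer is read.)

    #include <stdint.h>
    // plus your part's CMSIS device header, so that DWT and CoreDebug are defined

    uint32_t time_mac_loop(void)
    {
            CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the DWT unit
            DWT->CYCCNT = 0;                                  // reset the cycle counter
            DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             // start counting

            // ... call the MAC loop under test here ...

            return DWT->CYCCNT;                               // cycles elapsed (includes call overhead)
    }

    Anyway, here is the first loop, a straightforward SMLAL version: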

            PUSH    { r4-r7 }       // stack the registers we are going to be using.
            LDRD    r4,r5,[r0,#0]   // load r4 and r5 with the accumulator (located at r0 and r0+4)
    mac_loop
            LDR     r6,[r1],#4      // load multiplicand a pointed to by r1, then point r1 to next multiplicand
            LDR     r7,[r2],#4      // load multiplicand b pointed to by r2, then point r2 to next multiplicand
            SMLAL   r4,r5,r6,r7     // perform the MAC
            SUBS    r3, r3, #1      // subtract one from loop counter
            BNE     mac_loop        // back to start of loop if counter not zero
            STM     r0,     {R4,R5} // store the accumulator value back to its address
            POP     { r4-r7 }       // restore the registers we used
            BX      lr              // return
    


    This took 543 cycles. Moving the SUBS to just before the SMLAL did not reduce this. This seemed like the only place it might be worth moving it to, as splitting up the LDRs means that they don't pipeline as well, causing them to take 2 cycles each instead of 3 cycles for both.

            for (int32_t i = 0; i<Loop; i++) {
                    Acc+=a[i]*b[i];
            }
            return Acc;
    


    This took 520 cycles. This was very confusing when I looked at the code that the compiler output:

    loop:   MUL     r8,r8,r11
            LDR     r9,[r5,#0x08]!
            LDR     r10,[r6,#0x08]!
            ADDS    r0,r8,r0
            ADC     r1,r1,r8,ASR #31
            MUL     r9,r9,r10
            ADDS    r12,r9,r12
            LDR     r8,[r5,#0x04]
            LDR     r11,[r6,#0x04]
            ADC     r4,r4,r9,ASR #31
            SUBS    r7,r7,#1
            BNE     loop
    


    According to the listing, this is the assembly code output. By my calculation this should take 12 cycles (assuming no pipeline stalling) plus the SUBS and BNE, compared to 4 for my loop. If anybody can offer some sort of explanation for this, that would be superb.

    Next I tried unrolling my loop. The result took 299 cycles, or ~4.7 per MAC. Since the LDR, LDR, SMLAL take 4 cycles in the best case, I determined that the extra 43 cycles are the result of the stack pushing/popping, function call branches, etc.

    Getting better, but it's still a real shame that inline assembly is not allowed; with that, I should be able to get something around 260 cycles for the 'loop'.

    In summary:

    Initial C loop (no SMLAL): 520 cycles
    Assembly loop (SMLAL): 543 cycles
    C loop (Scott's SMLAL): 394 cycles
    Unrolled assembly loop: 299 cycles

    Remaining questions:
    Why does my assembly loop take longer than the compiler generated one, which has at least twice the instructions?
    Why will the compiler only generate SMLAL in the case of 64*64 bit multiplication?

    Thanks again for all the responses/interest so far.

    Cheers

    - Jeff

  • Sorry, my reply has ended up in the wrong place. Please see my latest post above (sorry also for the extra post-notification email this post will generate).

    On closer inspection, this is actually performing a long multiplication of two 64-bit numbers, ...

    I don't think so. This loop:

    ...
    loop1   LDR           r10,[r12,#0x08]!
            LDR           r11,[r4,#0x08]!
            SMLAL         r0,r1,r8,r9
            SMLAL         r5,r6,r10,r11
            LDR           r8,[r12,#0x04]
            LDR           r9,[r4,#0x04]
            SUBS          r7,r7,#1
            BNE           loop1
    ...
    


    is using SMLAL to do 32x32=>64 (and a 64-bit accumulate). It's been unrolled to do it twice per loop (four LDRs, two SMLALs), but there is no 64x64 being done.

    This all makes sense because the C code Scott suggested actually asks for a 64+(64*64) operation, ...

    Yes, technically, at the C level the '(long long)' cast causes the multiplication to be of long longs. But armcc realizes that SML[A]L (which does 32x32=>64) will give the correct answer for this case and so it uses it.

    unrolled assembly loop: 299

    The loop using MUL is doing a different calculation, since MUL is 32x32=>32 and SMLAL is (in part) 32x32=>64. Depending on the input data the results could be different or the same (the same in the case where the 32x32=>32 never happens to overflow).

    Why does my assembly loop take longer than the compiler generated one, which has at least twice the instructions?

    Your mac_loop is not unrolled and spends more time branching (because it goes 'round the loop twice as many times).

    Why will the compiler only generate SMLAL in the case of 64*64 bit multiplication?

    Because in C (given 32 bits as the width of 'long') multiplication of two 'longs' must be 32x32=>32, and using SMLAL to do the multiplication would be wrong.

    [Aside: one could argue that when long*long overflows the user has invoked the dreaded "Undefined Behavior" so armcc could in fact use SMLAL for this case since if the multiplication overflows all bets are off. But armcc doesn't do it (maybe it's "too unexpected"). If, instead, the multiplication were unsigned long*unsigned long then there's no undefined behavior and UML[A]L is definitely not a good idea.]

    Because in C (given 32 bits as the width of 'long') multiplication of two 'longs' must be 32x32=>32, and using SMLAL to do the multiplication would be wrong.

    Not necessarily. As long as the compiler extracts the right 32 bits of signed integer result out of it, it can use whatever machine instruction it wants to.

  • As long as the compiler extracts the right 32 bits of signed integer result out of it, it can use whatever machine instruction it wants to.

    True, but since SMLAL does (32x32=>64)+64, it's not straightforward to use it in a situation where (32x32=>32=>64)+64 is required. And since SMULL is slower (or at least no faster) and requires more registers than MUL, there is probably no reason to use it either.

  • Hi Scott

    Thanks for the detailed reply.

    I have looked again at the assembly output from your MAC operation which includes a cast, and it looks like I was wrong in my assertion that it was performing a long multiplication. I guess I was getting a bit too involved for day #1 of ARM assembler! Thanks for this correction (also, it looks like break points in the disassembly listing don't work reliably; this threw me off a bit).

    long a,b;
    long long Acc;
    
    Acc += a*b;     // long long += long*long; implicit casting of (a*b) to long long
    
    Acc += (long long)(a*b);        // Same result, cast now explicit but assembly output is identical
    
    // Now for your suggestion:
    Acc += (long long)a*b;          // explicit cast of a to long long, causes implicit cast of b up to long long also
    
    Acc += (long long)a*(long long)b;       // same assembler output as previous line, as expected
    


    So the code says (64*64=>64)+64, but the compiler presumably realises that the operands are in fact guaranteed to fit in 32 bits, so reduces this by optimisation to (32x32=>64)+64. Great!
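
    (For reference, the "upcasting" C loop that produced the SMLAL code above is essentially the earlier loop with the cast added, i.e. roughly:)

            for (int32_t i = 0; i<Loop; i++) {
                    Acc+=(long long)a[i]*b[i];      // one operand cast up, so armcc can use SMLAL
            }
            return Acc;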

    Your mac_loop is not unrolled and spends more time branching (because it goes 'round the loop twice as many times).

    True, but what I was confused about was that the loop was taking longer than the apparently more complicated loop generated from the simple

            for (int32_t i = 0; i<Loop; i++) {
                    Acc+=a[i]*b[i];
            }
    


    C code. It turns out that I had identified the wrong piece of assembly code. The actual loop is:

    loop    LDR     r5,[r2,r12,LSL #2]
            LDR     r6,[r3,r12,LSL #2]
            MULS    r5,r6,r5
            ADDS    r0,r5,r0
            ADC     r1,r1,r5,ASR #31
            ADD     r12,r12,#0x01
            CMP     r4,r12
            BGT     loop
    


    which is (32*32=>32)+64, as you rightly suggested. Compare this to my loop:

    loop    LDR     r6,[r1],#4      // load multiplicand a pointed to by r1, then point r1 to next multiplicand
            LDR     r7,[r2],#4      // load multiplicand b pointed to by r2, then point r2 to next multiplicand
            SMLAL   r4,r5,r6,r7     // perform the MAC
            SUBS    r3, r3, #1      // subtract one from loop counter
            BNE     loop            // back to start of loop if counter not zero
    


    and I still do not understand why my loop takes longer to execute, despite having two fewer 1-cycle instructions in the loop.

    Because in C (given 32 bits as the width of 'long') multiplication of two 'longs' must be 32x32=>32, and using SMLAL to do the multiplication would be wrong.

    I'm not sure that this is the case; as Hans points out, the compiler can use any instructions it likes, as long as the result is correct. The overflow behavior of a long*long is not actually defined in the C standard. So it seems like the real issue is that there is no C syntax that directly expresses (32*32=>64)+64. Looks to me like a prime example of where an intrinsic would be a good (if not very portable) solution; I wonder why one is not provided.

    So in the end, for 64 loops:
    C loop, upcasting operands to 64-bit: 394 cycles
    Unrolled assembly loop: 299 cycles

    I guess this is pretty good. I am just starting to learn this type of programming, so now I have a much better understanding of how close in performance the compiler output can get to hand optimised/unrolled assembler. In this case about 30% slower.

    Thanks to all.

    Cheers

    - Jeff