Arm GCC lambda optimization

Hello,

I am working on an IoT project, mixing C and C++, and I am having stack issues with lambdas.

The following code was compiled by gcc-arm-none-eabi-8-2018-q4-major-win32, with -Os and runs on a NUCLEO-L476RG. I monitored stack usage with Ozone.

typedef struct structTest
{
    uint32_t var1;
    uint32_t var2;
} structTest;

// Test 1
int main()
{
    dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }() );
}

// Test 2
int main()
{
    dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var1 = 48; return $; }() );

    dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 13; $.var1 = 42; return $; }() );
}

We have some complex macros that enable us to make sure structures are always initialized before use, and those macros generate code similar to the above. "structTest $; $.var1 = 0; $.var2 = 0;" is always generated, and the macros then append the user's values for the corresponding fields.
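
As a rough illustration of the kind of macro described above, here is a minimal sketch. The macro name `MAKE_STRUCT_TEST`, the helper `makeExample`, and the temporary's name `s` (used instead of `$`, which is only accepted as a compiler extension) are assumptions for this example; the project's real macros are not shown in this post.

```cpp
#include <cstdint>

typedef struct structTest
{
    uint32_t var1;
    uint32_t var2;
} structTest;

// Hypothetical macro, not the project's real one: it always emits the
// zero-initialization of every field, then appends the caller's assignments.
#define MAKE_STRUCT_TEST(...)  \
    [&]() -> structTest {      \
        structTest s;          \
        s.var1 = 0;            \
        s.var2 = 0;            \
        __VA_ARGS__;           \
        return s;              \
    }()

// Example expansion site: only var2 is set by the caller, var1 keeps its 0.
inline structTest makeExample()
{
    return MAKE_STRUCT_TEST(s.var2 = 24);
}
```

With this shape, `MAKE_STRUCT_TEST(s.var2 = 24, s.var1 = 48)` expands to roughly the lambda used in Test 1.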

The expected behavior in both Test 1 and Test 2 was that only 8 bytes of stack would be used for data. This is the case in Test 1, but Test 2 uses 16 bytes.

Is there any way to keep this kind of structure but force the compiler to reuse the stack? Neither -fconserve-stack nor -fstack-reuse=all had any effect.

I also can't find documentation on the optimization behavior expected for lambda functions; if anyone has a link, I'll be grateful.

  • Hi B_Cartier,

    Could you give me a full test case for this, including a declaration for dostuff?

    thanks.

  • Hi Christina

    Here is a link with a better test case: https://answers.launchpad.net/gcc-arm-embedded/+question/682825

    The declaration of doStuff should not matter: in most of our project it is a function pointer that is known only at link time, so no inlining is possible.
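
    For readers who want something compilable in a single file, a minimal sketch of such a test case could look like the following. This is not the code from the linked question: the consumer `record`, the sinks `g_sink1` and `g_sink2`, the wrapper name `test2`, and the `volatile` function pointer (which stands in for "known only at link time") are all assumptions, and the temporary is named `s` instead of `$`.

    ```cpp
    #include <cstdint>

    typedef struct structTest
    {
        uint32_t var1;
        uint32_t var2;
    } structTest;

    // Assumption: volatile sinks keep the stores observable, so the calls
    // cannot be optimized away.
    volatile uint32_t g_sink1;
    volatile uint32_t g_sink2;

    // Assumption: a stand-in consumer that takes the struct by value.
    void record(structTest s)
    {
        g_sink1 = s.var1;
        g_sink2 = s.var2;
    }

    // The pointer itself is volatile, so the compiler must reload it before
    // each call and cannot inline `record` through it.
    void (*volatile dostuff)(structTest) = record;

    // Test 2 from the question: two calls, each with its own temporary.
    void test2()
    {
        dostuff( [&]() -> structTest{ structTest s; s.var1 = 0; s.var2 = 0; s.var1 = 48; return s; }() );

        dostuff( [&]() -> structTest{ structTest s; s.var1 = 0; s.var2 = 0; s.var2 = 13; s.var1 = 42; return s; }() );
    }
    ```

    Calling `test2()` from `main` and building with GCC's `-fstack-usage` option should give a per-function frame size in a `.su` file, which can be compared with what the debugger reports.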

  • Hi B_Cartier,

    The lambdas are correctly re-using the stack slots

            add     x0, sp, 16
            stp     q1, q0, [sp, 48]
            bl      _Z7doStuff10TestStruct

    is the first call and

            add     x0, sp, 16
            ldr     q0, [x3, #:lo12:.LC8]
            ldp     x2, x3, [x2]
            stp     x2, x3, [sp, 16]
            ldp     x2, x3, [x1]
            stp     x2, x3, [sp, 32]
            stp     q1, q0, [sp, 80]
            bl      _Z7doStuff10TestStruct

    is the second one with sp not being modified in between. So both structs are using the same address.

    The extra allocation comes from a generic bug in GCC in cleaning up dead stack space. The stp to [sp, 80] in the snippet above is dead, and GCC doesn't detect it.

    This happens because, before optimization (as you can see at -O1), the values for your struct are created on the stack:

            mov     w0, 11
            str     w0, [sp, 80]
            mov     w0, 22
            str     w0, [sp, 84]
            mov     w0, 33
            str     w0, [sp, 88]
            mov     w0, 44
            str     w0, [sp, 92]
            mov     w0, 55
            str     w0, [sp, 96]
            mov     w0, 66
            str     w0, [sp, 100]
            mov     w0, 77
            str     w0, [sp, 104]
            mov     w0, 88
            str     w0, [sp, 108]
            ldp     x0, x1, [sp, 80]
    

    However, at -O2 we realize we can create the constants entirely in registers:

            mov     x0, 11
            mov     x3, 33
            movk    x0, 0x16, lsl 32
            movk    x3, 0x2c, lsl 32
            mov     x2, 55
            mov     x1, 77
            stp     x0, x3, [sp, 80]
            ldp     x2, x3, [sp, 80]
            stp     x2, x3, [sp, 16]
            ldp     x2, x3, [sp, 96]
            stp     x2, x3, [sp, 32]
    

    The compiler does something stupid here (because of the copy it has to make): it decides to spill the values we created to the stack at their original address and then move them to the right place later. It doesn't realize it can just store directly at `sp+16` and `sp+32` without the spill.

    At -O3 we spill the constants to a literal pool

            adrp    x3, .LC7
            adrp    x2, .LC3
            add     x2, x2, :lo12:.LC3
            ldr     q1, [x3, #:lo12:.LC7]
            adrp    x1, .LC4
            adrp    x3, .LC8
            add     x1, x1, :lo12:.LC4
            add     x0, sp, 16
            ldr     q0, [x3, #:lo12:.LC8]
            ldp     x2, x3, [x2]
            stp     x2, x3, [sp, 16]
            ldp     x2, x3, [x1]
            stp     x2, x3, [sp, 32]
            stp     q1, q0, [sp, 80]

    which is fine, and loads them directly into `sp+16` and `sp+32`, BUT while the load from `sp+80` is marked dead and removed, the store isn't.

    That is why there is the extra stack allocation.

    In short, it's a bug in the generic parts of GCC that track usages of memory locations.

    If you're wondering where the additional copy comes from (the reason for the store to 80 to begin with), it's because structs on the stack are passed by copy.

    dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }() );

    is actually

    structTest x = [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }();
    dostuff( x );

    In order to pass it to `dostuff` a copy is made.
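
    The two forms above can be written side by side as compilable C++. This is only a sketch: the by-value `dostuff` that stores its argument in `g_last`, and the function names `callWithTemporary` and `callWithNamedLocal`, are assumptions for illustration, and the temporary is named `s` instead of `$`.

    ```cpp
    #include <cstdint>

    typedef struct structTest
    {
        uint32_t var1;
        uint32_t var2;
    } structTest;

    // Hypothetical consumer: remembers the last struct it received by value.
    static structTest g_last;
    void dostuff(structTest s) { g_last = s; }

    // Form 1: the lambda result is passed straight to dostuff.
    void callWithTemporary()
    {
        dostuff( [&]() -> structTest{ structTest s; s.var1 = 0; s.var2 = 0; s.var2 = 24; s.var1 = 48; return s; }() );
    }

    // Form 2: the intermediate object is spelled out as a named local.
    void callWithNamedLocal()
    {
        structTest x = [&]() -> structTest{ structTest s; s.var1 = 0; s.var2 = 0; s.var2 = 24; s.var1 = 48; return s; }();
        dostuff( x );   // the by-value parameter is initialized as a copy of x
    }
    ```

    Both functions hand the same values to `dostuff`; the second one just makes visible the intermediate object that the explanation above refers to.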

  • Hi Tamar Christina,

    Thanks a lot for the in depth explanation.

    If I understood correctly, the stack slots are reused and should be reused no matter the number of calls of dostuff( [&]() -> structTest{...}() ), but because of the copy of an unnamed variable, GCC does not realize the stack is being reused.

    Strangely enough, I cannot reproduce the reuse of the stack slots.

    Here is the assembly I get:

    _Z14wrapper2LAMBDAv
    $Thumb
    {
     08001404   PUSH         {R4-R6, LR}
     08001406   LDR          R4, =_etext            
     08001408   MOV          R6, R4
     0800140A   LDM          R6!, {R0-R3}
    {
     0800140C   SUB          SP, SP, #0x50
     0800140E   ADD          R5, SP, #0x10
     08001410   STM          R5!, {R0-R3}
     08001412   LDM.W        R6, {R0-R3}
     08001416   STM.W        R5, {R0-R3}
     0800141A   ADD          R3, SP, #0x20
     0800141C   LDM          R3, {R0-R3}
     0800141E   STM.W        SP, {R0-R3}
     08001422   ADD          R5, SP, #0x10
     08001424   LDM.W        R5, {R0-R3}
     08001428   ADDS         R4, #0x20
     0800142A   BL           _Z7doStuff10TestStruct
     0800142E   LDM          R4!, {R0-R3}
     08001430   ADD          R5, SP, #0x30
     08001432   STM          R5!, {R0-R3}
     08001434   LDM.W        R4, {R0-R3}
     08001438   STM.W        R5, {R0-R3}
     0800143C   ADD          R3, SP, #0x50
     0800143E   LDMDB        R3, {R0-R3}
     08001442   STM.W        SP, {R0-R3}
     08001446   ADD          R4, SP, #0x30
     08001448   LDM.W        R4, {R0-R3}
     0800144C   BL           _Z7doStuff10TestStruct 
    }
     08001450   ADD          SP, SP, #0x50
     08001452   POP          {R4-R6, PC}

    Do you know if a fix is in the making, and if I should post a bug report directly to GCC (or are they already aware of this problem)?

  • Hi B_Cartier,

    Hmm, you're right: on 32-bit Arm it doesn't re-use the stack slots. I'm not sure why that is, but an upstream ticket to GCC would be the best course of action here.

    There are two bugs here: the stack slot not being re-used, and the dead store not being removed. The latter is a known issue, but I am not sure about the former.

    The failure to remove the dead store is a fairly old issue that affects all architectures.

    Cheers,

    Tamar

  • I will post a ticket to GCC then.

    I guess that, since the stack slots are not reused, the store that is not removed is not a bug in this particular case: it is not really dead anymore.

    If it is an old issue, we can only hope a fix is in the making; that would greatly help our project.

    Thanks a lot for your time. If you want, I'll keep you posted on any answer I get from GCC.

    Cheers,

    B_Cartier