Hello,
I am working on an IoT project, mixing C and C++, and I am having stack issues with lambdas.
The following code was compiled by gcc-arm-none-eabi-8-2018-q4-major-win32, with -Os and runs on a NUCLEO-L476RG. I monitored stack usage with Ozone.
    typedef struct structTest {
        uint32_t var1;
        uint32_t var2;
    } structTest;

    // Test 1
    int main() {
        dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }() );
    }

    // Test 2
    int main() {
        dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var1 = 48; return $; }() );
        dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 13; $.var1 = 42; return $; }() );
    }
We have some complex macros that let us make sure structures are always initialized before use, and those macros generate code similar to the above. "structTest $; $.var1 = 0; $.var2 = 0;" is always generated, and the macros then append the user's values for the corresponding fields.
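To make the generated pattern concrete, here is a minimal sketch of what such a macro could look like. `MAKE_STRUCT_TEST` and `makeExample` are hypothetical names for illustration, not the project's real macros, and the local is called `s` instead of `$` because `$` in identifiers is a compiler extension.

```cpp
#include <cassert>
#include <cstdint>

typedef struct structTest {
    uint32_t var1;
    uint32_t var2;
} structTest;

// Hypothetical macro (an assumption, not the real project macro):
// zero every field first, then apply the caller's assignments,
// and return the struct by value from an immediately invoked lambda.
#define MAKE_STRUCT_TEST(...)      \
    ([&]() -> structTest {         \
        structTest s;              \
        s.var1 = 0;                \
        s.var2 = 0;                \
        __VA_ARGS__;               \
        return s;                  \
    }())

// Example use: only var2 is set explicitly, so var1 keeps the generated zero.
inline structTest makeExample() {
    return MAKE_STRUCT_TEST(s.var2 = 24);
}
```

A call such as `dostuff(MAKE_STRUCT_TEST(s.var2 = 13; s.var1 = 42))` would then expand to the same shape as the lambdas in Test 2.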
The expected behavior in both Test 1 and Test 2 was that only 8 bytes of stack would be used for data. This is the case in Test 1, but Test 2 uses 16 bytes.
Is there any way to keep this kind of structure but force the compiler to reuse the stack? -fconserve-stack and -fstack-reuse=all both had no effect.
I also can't find documentation on the optimization behavior expected for lambda functions. If anyone has a link, I'll be grateful.
Hi B_Cartier,
Could you give me a full test case for this, including a declaration for dostuff?
thanks.
Hi Christina
Here is a link with a better test case: https://answers.launchpad.net/gcc-arm-embedded/+question/682825
The declaration of doStuff should not matter. In most of our projects it is a function pointer that is known only at link time, so no inlining is possible.
The lambdas are correctly re-using the stack slots
    add x0, sp, 16
    stp q1, q0, [sp, 48]
    bl _Z7doStuff10TestStruct
is the first call and
    add x0, sp, 16
    ldr q0, [x3, #:lo12:.LC8]
    ldp x2, x3, [x2]
    stp x2, x3, [sp, 16]
    ldp x2, x3, [x1]
    stp x2, x3, [sp, 32]
    stp q1, q0, [sp, 80]
    bl _Z7doStuff10TestStruct
is the second one with sp not being modified in between. So both structs are using the same address.
The extra allocation comes from a generic bug in GCC with cleaning up dead stack space. The stp to [sp, 80] in the snippet above is dead, and GCC doesn't detect it.
It comes about because, before optimization (you can see this at -O1), the values for your struct are created on the stack
    mov w0, 11
    str w0, [sp, 80]
    mov w0, 22
    str w0, [sp, 84]
    mov w0, 33
    str w0, [sp, 88]
    mov w0, 44
    str w0, [sp, 92]
    mov w0, 55
    str w0, [sp, 96]
    mov w0, 66
    str w0, [sp, 100]
    mov w0, 77
    str w0, [sp, 104]
    mov w0, 88
    str w0, [sp, 108]
    ldp x0, x1, [sp, 80]
However at -O2 we realize we can create the constants entirely in registers
    mov x0, 11
    mov x3, 33
    movk x0, 0x16, lsl 32
    movk x3, 0x2c, lsl 32
    mov x2, 55
    mov x1, 77
    stp x0, x3, [sp, 80]
    ldp x2, x3, [sp, 80]
    stp x2, x3, [sp, 16]
    ldp x2, x3, [sp, 96]
    stp x2, x3, [sp, 32]
The compiler does something stupid here (because of the copy it has to make) in that it decides to spill the values we created to the stack at their original address and then moves them to the right place later. It doesn't realize it can just store directly at `sp+16` and `sp+32` without the spill.
At -O3 we spill the constants to a literal pool
    adrp x3, .LC7
    adrp x2, .LC3
    add x2, x2, :lo12:.LC3
    ldr q1, [x3, #:lo12:.LC7]
    adrp x1, .LC4
    adrp x3, .LC8
    add x1, x1, :lo12:.LC4
    add x0, sp, 16
    ldr q0, [x3, #:lo12:.LC8]
    ldp x2, x3, [x2]
    stp x2, x3, [sp, 16]
    ldp x2, x3, [x1]
    stp x2, x3, [sp, 32]
    stp q1, q0, [sp, 80]
which is fine, and loads them directly into `sp+16` and `sp+32`. But while the load from `sp+80` is marked dead and removed, the store isn't.
That is why there is an extra stack allocation.
In short it's a bug in generic parts of GCC that track usages of memory locations.
If you're wondering where the additional copy comes from (the reason for the store to `sp+80` to begin with), it's because structs on the stack are passed by copy.
    dostuff( [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }() );
is actually
    structTest x = [&]() -> structTest{ structTest $; $.var1 = 0; $.var2 = 0; $.var2 = 24; $.var1 = 48; return $; }();
    dostuff( x );
In order to pass it to `dostuff` a copy is made.
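The by-value copy can be observed at the language level with a small sketch. `receivesSeparateCopy` and `demoCopy` are hypothetical names standing in for `dostuff` and its caller. The sketch only shows that the callee works on a distinct object. Where that object physically lives (registers or stack) depends on the ABI and the optimization level.

```cpp
#include <cassert>
#include <cstdint>

struct structTest {
    uint32_t var1;
    uint32_t var2;
};

// Hypothetical stand-in for dostuff: the struct is taken by value, so
// `param` is a different object from the variable the caller passed in.
bool receivesSeparateCopy(structTest param, const structTest* callerObject) {
    return &param != callerObject;
}

// Builds the struct with a lambda as in the thread, stores it in a named
// variable, then passes that variable by value.
bool demoCopy() {
    structTest x = [&]() -> structTest {
        structTest s;
        s.var1 = 0;
        s.var2 = 0;
        s.var2 = 24;
        s.var1 = 48;
        return s;
    }();
    return receivesSeparateCopy(x, &x);
}
```

Because `x` and `param` are alive at the same time, they must have different addresses, which is the copy described above.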
Hi Tamar Christina,
Thanks a lot for the in depth explanation.
If I understood correctly, the stack slots are reused, and should be reused no matter how many times dostuff( [&]() -> structTest{...}() ) is called, but because of the copy of an unnamed variable, GCC does not realize the stack is being reused.
Strangely enough, I cannot reproduce the reuse of the stack slots.
Here is the assembly I get:
    _Z14wrapper2LAMBDAv
    $Thumb
    {
    08001404  PUSH   {R4-R6, LR}
    08001406  LDR    R4, =_etext
    08001408  MOV    R6, R4
    0800140A  LDM    R6!, {R0-R3}
    {
    0800140C  SUB    SP, SP, #0x50
    0800140E  ADD    R5, SP, #0x10
    08001410  STM    R5!, {R0-R3}
    08001412  LDM.W  R6, {R0-R3}
    08001416  STM.W  R5, {R0-R3}
    0800141A  ADD    R3, SP, #0x20
    0800141C  LDM    R3, {R0-R3}
    0800141E  STM.W  SP, {R0-R3}
    08001422  ADD    R5, SP, #0x10
    08001424  LDM.W  R5, {R0-R3}
    08001428  ADDS   R4, #0x20
    0800142A  BL     _Z7doStuff10TestStruct
    0800142E  LDM    R4!, {R0-R3}
    08001430  ADD    R5, SP, #0x30
    08001432  STM    R5!, {R0-R3}
    08001434  LDM.W  R4, {R0-R3}
    08001438  STM.W  R5, {R0-R3}
    0800143C  ADD    R3, SP, #0x50
    0800143E  LDMDB  R3, {R0-R3}
    08001442  STM.W  SP, {R0-R3}
    08001446  ADD    R4, SP, #0x30
    08001448  LDM.W  R4, {R0-R3}
    0800144C  BL     _Z7doStuff10TestStruct
    }
    08001450  ADD    SP, SP, #0x50
    08001452  POP    {R4-R6, PC}
Do you know if a fix is in the making, and if I should post a bug report directly to GCC (or are they already aware of this problem)?
Hmm, you're right: on Arm it doesn't reuse the stack slots. I'm not sure why that is, but an upstream ticket to GCC would be the best course of action here.
There are two bugs here: the stack slot is not reused, and the dead store is not removed. The latter is a known issue. I am not sure about the former.
The failure to remove the dead store is a fairly old issue that affects all architectures.
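Until this is fixed upstream, one possible workaround is to give each call site its own non-inlined function, so each temporary lives in a frame that is released on return. This is only a sketch: it was not suggested or tested in this thread, it costs an extra call per site, and the `dostuff` body, `g_sum`, `callFirst`, `callSecond` and `runBoth` are hypothetical stand-ins. `__attribute__((noinline))` is the GCC function attribute.

```cpp
#include <cassert>
#include <cstdint>

struct structTest {
    uint32_t var1;
    uint32_t var2;
};

// Hypothetical stand-in for the real dostuff, which is only known at link
// time in the original project. It just accumulates the fields.
static uint32_t g_sum = 0;
void dostuff(structTest s) { g_sum += s.var1 + s.var2; }

// Each wrapper owns the temporary for one call. Because the wrapper is not
// inlined, its frame (temporary included) is popped when it returns.
__attribute__((noinline)) void callFirst() {
    dostuff([&]() -> structTest {
        structTest s;
        s.var1 = 0;
        s.var2 = 0;
        s.var1 = 48;
        return s;
    }());
}

__attribute__((noinline)) void callSecond() {
    dostuff([&]() -> structTest {
        structTest s;
        s.var1 = 0;
        s.var2 = 0;
        s.var2 = 13;
        s.var1 = 42;
        return s;
    }());
}

// The caller's own frame no longer holds either temporary.
void runBoth() {
    callFirst();
    callSecond();
}
```

With this layout the peak stack use of `runBoth` should be bounded by the larger of the two wrapper frames instead of their sum. Whether that saves anything in practice would need to be checked with the actual toolchain.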
Cheers,
Tamar
I will post a ticket to GCC then.
I guess that since the stack slots are not reused, the store that is not removed is not really a bug in this particular case, because it is not actually dead anymore.
If it is an old issue we can only hope a fix is in the making, that would greatly help our project.
Thanks a lot for your time, I'll keep you posted if I get any answer from GCC if you want.
B_Cartier