Typically when teaching a class about embedded C programming, one of the early questions we ask is "Where does the memory come from for function arguments?"
Take, for example, the following simple C function:
void test_function(int a, int b, int c, int d);
When we invoke the function, where are the function arguments stored?
int main(void)
{
    ...
    test_function(1,2,3,4);
    ...
}
Unsurprisingly, the most common answer after "I don't know" is "the stack"; and of course if you were compiling for x86 this would be true. This can be seen from the following x86 assembler for main setting up the call to test_function:
...
subl $16, %esp
movl $4, 12(%esp)
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
call _test_function
...
The stack pointer is decremented by 16 bytes, then the four ints are moved onto the stack prior to the call to test_function.
In addition to the function arguments being pushed, the call will also push the return address (i.e. the program counter of the next instruction after the call) on to the stack, and the called function then typically pushes what, in x86 terms, is often referred to as the saved frame pointer. The frame pointer is used to reference local variables, which are also stored on the stack.
This stack frame format is quite widely understood and has historically been the target of malicious buffer overflow attacks that modify the return address.
But, of course, we're not here to discuss x86, it's the Arm architecture we're interested in.
The Arm is a RISC architecture, whereas the x86 is CISC. Since 2003 Arm have published a document detailing how separately compiled and linked code units work together. Over the years it has gone through a couple of name changes, but is now officially referred to as the "Procedure Call Standard for the Arm Architecture" or the AAPCS.
If we recompile main.c for Arm:
> armcc -S main.c
we get the following:
...
MOV r3,#4
MOV r2,#3
MOV r1,#2
MOV r0,#1
BL test_function
Here we can see that the four arguments have been placed in registers r0-r3. This is followed by the "Branch with Link" (BL) instruction. So how much stack has been used for this call? The short answer is none, as the BL instruction moves the return address into the Link Register (lr/r14) rather than pushing it on to the stack, as in the x86 model.
Note: Around a function call there will be other stack operations, but that's not the focus of this post.
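I'd imagine most readers are familiar with the Arm register set, but just to review: r0-r12 are the general-purpose registers, r13 is the stack pointer (sp), r14 is the link register (lr) and r15 is the program counter (pc).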
So the base function call model is that if there are four or fewer 32-bit parameters, r0 through r3 are used to pass the arguments and the call return address is stored in the link register.
If we add a fifth parameter, as in:
void test_function2(int a, int b, int c, int d, int e);
int main(void)
{
    ...
    test_function2(1,2,3,4,5);
    ...
}
We get the following:
MOV r0,#5
STR r0,[sp,#0]
BL test_function2
Here, the fifth argument (5) is being stored on the stack prior to the call.
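The base rule so far can be captured in a few lines of C. This is only a sketch of the model for word-sized (32-bit) arguments; the helper names args_in_registers and arg_stack_bytes are made up for illustration and are not part of any toolchain.

```c
/* Number of core argument registers (r0-r3) under the base AAPCS. */
#define ARG_REGS 4u

/* Hypothetical helper: how many of n_args word-sized arguments travel in r0-r3. */
unsigned args_in_registers(unsigned n_args)
{
    return n_args < ARG_REGS ? n_args : ARG_REGS;
}

/* Hypothetical helper: bytes of stack occupied by the remaining arguments. */
unsigned arg_stack_bytes(unsigned n_args)
{
    return (n_args - args_in_registers(n_args)) * 4u;
}
```

For test_function the model gives four registers and no stack; for test_function2 it gives four registers plus 4 bytes of stack. It counts argument bytes only: the compiler may reserve more than this, since the AAPCS requires the stack pointer to be 8-byte aligned at a public interface.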
Given the following code:
int test_function(int a, int b, int c, int d);
int val;
int main(void)
{
    //...
    val = test_function(1,2,3,4);
    //...
}
By analyzing the assembler we can see the return value is placed in r0:
LDR r1,|L0.40| ; load address of extern val into r1
STR r0,[r1,#0] ; store function return value in val
The AAPCS defines the size and alignment of the C base types. The C99 long long is 8 bytes in size and alignment. So how does this change our model?
Given:
long long test_ll(long long a, long long b);
long long ll_val;
extern long long ll_p1;
extern long long ll_p2;
int main(void)
{
    ...
    ll_val = test_ll(ll_p1, ll_p2);
    ...
}
We get:
...
LDR r0,|L0.40|
LDR r1,|L0.44|
LDRD r2,r3,[r0,#0]
LDRD r0,r1,[r1,#0]
BL test_ll
LDR r2,|L0.48|
STRD r0,r1,[r2,#0]
...
|L0.40|
DCD ll_p2
|L0.44|
DCD ll_p1
This code demonstrates that a 64-bit long long uses two registers (r0-r1 for the first parameter and r2-r3 for the second). In addition, the 64-bit return value has come back in r0-r1.
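To picture what ends up in each half of a register pair, here is a small host-side sketch. It assumes a little-endian target, where the lower-numbered register holds the low word; the RegPair type and the split_ll/join_ll helpers are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit value held in a register pair. */
typedef struct {
    uint32_t lo_reg;   /* r0 (or r2) on a little-endian target */
    uint32_t hi_reg;   /* r1 (or r3) on a little-endian target */
} RegPair;

/* Split a long long into the two 32-bit words a register pair would carry. */
RegPair split_ll(long long value)
{
    uint64_t bits = (uint64_t)value;
    RegPair pair;
    pair.lo_reg = (uint32_t)(bits & 0xFFFFFFFFu);
    pair.hi_reg = (uint32_t)(bits >> 32);
    return pair;
}

/* Rebuild the long long, as the callee (or the caller, for a return value) sees it. */
long long join_ll(RegPair pair)
{
    uint64_t bits = ((uint64_t)pair.hi_reg << 32) | pair.lo_reg;
    return (long long)bits;
}
```

The LDRD and STRD instructions in the listing move both words of such a pair in a single instruction.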
As with the long long, a double type (based on the IEEE 754 standard) is also 8 bytes in size and alignment on Arm. However, the code generated will depend on the actual core. For example, given the code:
double test_dbl(double a, double b);
double dval;
extern double dbl_p1;
extern double dbl_p2;
int main(void)
{
    ...
    dval = test_dbl(dbl_p1, dbl_p2);
    ...
}
When compiled for a Cortex-M3 (armcc --cpu=Cortex-M3 --c99 -S main.c) the output is almost identical to the long long example:
...
LDR r0,|L0.28|
LDR r1,|L0.32|
LDRD r2,r3,[r0,#0]
LDRD r0,r1,[r1,#0]
BL test_dbl
LDR r2,|L0.36|
STRD r0,r1,[r2,#0]
...
|L0.28|
DCD dbl_p2
|L0.32|
DCD dbl_p1
However, if we recompile this for a Cortex-A9 (armcc --cpu=Cortex-A9 --c99 -S main.c), we get quite different generated instructions:
...
LDR r0,|L0.40|
VLDR d1,[r0,#0]
LDR r0,|L0.44|
VLDR d0,[r0,#0]
BL test_dbl
LDR r0,|L0.48|
VSTR d0,[r0,#0]
...
|L0.40|
DCD dbl_p2
|L0.44|
DCD dbl_p1
The VLDR and VSTR instructions are generated as the Cortex-A9 has Vector Floating Point (VFP) technology. Here the two double arguments are passed in d0 and d1, and the result is returned in d0.
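Whether floating-point arguments travel in VFP registers is a property of the procedure-call variant the compiler selects, so it can be handy to check it at build time. The sketch below assumes a toolchain that implements the Arm C Language Extensions (ACLE) predefined macros __ARM_PCS, __ARM_PCS_VFP and __ARM_PCS_AAPCS64; whether the armcc used above defines them is something you would need to check. The function name pcs_variant is hypothetical.

```c
/* Report which procedure-call variant this translation unit was built for.
 * Relies on ACLE predefined macros; falls through when none are defined. */
const char *pcs_variant(void)
{
#if defined(__ARM_PCS_VFP)
    return "AAPCS, VFP variant (floating-point arguments in VFP registers)";
#elif defined(__ARM_PCS)
    return "AAPCS, base variant (floating-point arguments in core registers)";
#elif defined(__ARM_PCS_AAPCS64)
    return "AAPCS64";
#else
    return "not an Arm target, or the ACLE macros are not defined";
#endif
}
```

Built for the base variant, a double argument is handled like the long long case; built for the VFP variant, it is handled like the Cortex-A9 listing.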
Assuming we change our function to accept a mixture of 32-bit and 64-bit parameters, e.g.
void test_iil(int a, int b, long long c);
extern long long ll_p1;
int main(void)
{
    ...
    test_iil(1, 2, ll_p1);
    ...
}
As expected, we get a in r0, b in r1 and ll_p1 in r2-r3:
...
LDR r0,|L0.32|
MOV r1,#2
LDRD r2,r3,[r0,#0]
MOV r0,#1
BL test_iil
...
|L0.32|
DCD ll_p1
However, if we subtly change the order to:
void test_ili(int a, long long c, int b);
extern long long ll_p1;
int main(void)
{
    ...
    test_ili(1,ll_p1,2);
    ...
}
We get a different result; a is in r0, c is in r2-r3, but now b is stored on the stack.
...
MOV r0,#2
STR r0,[sp,#0] ; store parameter b on the stack
LDR r0,|L0.36|
LDRD r2,r3,[r0,#0]
MOV r0,#1
BL test_ili
...
|L0.36|
DCD ll_p1
So why doesn't parameter 'c' use r1-r2? Because the AAPCS requires an argument with 8-byte alignment to start in an even-numbered register, so a long long can only occupy r0-r1 or r2-r3.
As the compiler is not allowed to rearrange parameter ordering, the parameter 'b' unfortunately has to come in order after 'c' and therefore cannot use the unused register r1.
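The allocation behaviour seen in the last two listings can be modelled in C. This is a simplified sketch covering only 4-byte and 8-byte integer arguments under the base variant (no structs, no VFP registers); the function assign_args and the ON_STACK marker are hypothetical names, not part of any toolchain.

```c
#include <stddef.h>

#define ON_STACK (-1)   /* marker: argument was not given a register */

/*
 * Hypothetical model: for each argument size in bytes (4 or 8), record the
 * first core register used (0 for r0 ... 3 for r3) or ON_STACK.
 * Returns the number of stack bytes consumed by the arguments.
 */
unsigned assign_args(const unsigned *sizes, size_t count, int *first_reg)
{
    unsigned ncrn = 0;          /* next core register number */
    unsigned stack_bytes = 0;   /* offset of the next stacked argument */
    size_t i;

    for (i = 0; i < count; ++i) {
        unsigned words = sizes[i] / 4u;

        if (sizes[i] == 8u) {
            ncrn = (ncrn + 1u) & ~1u;   /* round up to an even register */
        }
        if (ncrn + words <= 4u) {
            first_reg[i] = (int)ncrn;
            ncrn += words;
        } else {
            ncrn = 4u;                  /* skipped registers are never back-filled */
            if (sizes[i] == 8u) {
                stack_bytes = (stack_bytes + 7u) & ~7u;   /* 8-byte align on the stack */
            }
            first_reg[i] = ON_STACK;
            stack_bytes += sizes[i];
        }
    }
    return stack_bytes;
}
```

Feeding it the sizes {4, 4, 8} yields r0, r1 and r2 with no stack, while {4, 8, 4} yields r0, r2 and one stacked argument of 4 bytes, matching the test_iil and test_ili listings.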
For any C++ programmers out there, it is important to realize that for class member functions the implicit 'this' argument is passed as the 32-bit value in r0. So, hopefully you can see the implications if targeting Arm of:
class Ex
{
public:
    void mf(long long d, int i);
};
vs.
class Ex
{
public:
    void mf(int i, long long d);
};
Even though keeping arguments in registers may be seen as "marginal gains", for large code bases, I have seen first-hand significant performance and power improvements simply by rearranging the parameter ordering.
It is also useful to know that both the Arm Accredited Engineer (AAE) Accreditation and the Arm Accredited MCU Engineer (AAME) Accreditation exams require AAPCS knowledge.
I'll leave you with one more bit of code to puzzle over, given:
typedef struct
{
    int a;
    int b;
    int c;
    int d;
} Example;
void test_struct(Example p);
Example ex = {1,2,3,4};
int main(void)
{
    ...
    test_struct(ex);
    ...
}
Can you guess how 'ex' is passed?