Function Parameters on 32-bit Arm

May 8, 2014


Function call basics

Typically when teaching a class about embedded C programming, one of the early questions we ask is "Where does the memory come from for function arguments?"

Take, for example, the following simple C function:

void test_function(int a, int b, int c, int d);

When we invoke the function, where are the function arguments stored?

int main(void)
{
  ...
  test_function(1,2,3,4);
  ...
}

Unsurprisingly, the most common answer after "I don't know" is "the stack"; and of course if you were compiling for x86 this would be true. This can be seen from the following x86 assembler for main setting up the call to test_function:

  ...
  subl $16, %esp
  movl $4, 12(%esp)
  movl $3, 8(%esp)
  movl $2, 4(%esp)
  movl $1, (%esp)
  call _test_function
  ...

The stack pointer is decremented by 16 bytes, then the four ints are stored on the stack prior to the call to test_function.

In addition to the function arguments, the call instruction pushes the return address (i.e. the address of the instruction following the call) on to the stack, and the called function typically then pushes what, in x86 terms, is often referred to as the saved frame pointer. The frame pointer is used to reference the arguments and local variables stored on the stack.

This stack frame format is quite widely understood and has historically been the target of malicious buffer overflow attacks that modify the return address.

But, of course, we're not here to discuss x86, it's the Arm architecture we're interested in.

The AAPCS

Arm is a RISC architecture, whereas x86 is CISC. Since 2003, Arm has published a document detailing how separately compiled and linked code units work together. Over the years it has gone through a couple of name changes, but it is now officially referred to as the "Procedure Call Standard for the Arm Architecture", or the AAPCS.

If we recompile main.c for Arm:

> armcc -S main.c

we get the following:

...
MOV r3,#4
MOV r2,#3
MOV r1,#2
MOV r0,#1
BL test_function
...

Here we can see that the four arguments have been placed in registers r0-r3. This is followed by the "Branch with Link" (BL) instruction. So how much stack has been used for this call? The short answer is none, as the BL instruction moves the return address into the Link Register (lr/r14) rather than pushing it on to the stack, as in the x86 model.

Note: Around a function call there will be other stack operations, but that's not the focus of this post.

The Register Set

I'd imagine most readers are familiar with the Arm register set, but just to review:

There are 16 data/core registers r0-r15
Of these 16, three are special purpose registers
- Register r13 acts as the stack pointer (sp)
- Register r14 acts as the link register (lr)
- Register r15 acts as the program counter (pc)

Basic Model

So the base function call model is that if there are four or fewer 32-bit parameters, r0 through r3 are used to pass the arguments and the call return address is stored in the link register.

If we add a fifth parameter, as in:

void test_function2(int a, int b, int c, int d, int e);

int main(void)
{
  ...
  test_function2(1,2,3,4,5);
  ...
}

We get the following:

...
MOV r0,#5
MOV r3,#4
MOV r2,#3
STR r0,[sp,#0]
MOV r1,#2
MOV r0,#1
BL test_function2
...

Here, the fifth argument (5) is being stored on the stack prior to the call.
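
The basic model above can be sketched in C. This is only an illustration for 32-bit arguments, not anything generated by the compiler; the type word_arg_location and the function locate_word_arg are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Location of one 32-bit argument under the basic model. */
typedef struct {
    int reg;              /* 0..3 for r0-r3, or -1 if passed on the stack */
    size_t stack_offset;  /* byte offset from sp at the call; valid when reg == -1 */
} word_arg_location;

/* Hypothetical helper: where does argument `index` (0-based) of a call
 * taking `count` 32-bit arguments go? */
word_arg_location locate_word_arg(size_t index, size_t count)
{
    assert(index < count);   /* the argument must exist */

    word_arg_location loc = { -1, 0 };
    if (index < 4) {
        loc.reg = (int)index;                 /* first four: r0-r3 */
    } else {
        loc.stack_offset = (index - 4) * 4;   /* fifth onwards: [sp,#0], [sp,#4], ... */
    }
    return loc;
}
```

For test_function2, locate_word_arg(4, 5) reports the stack at offset 0, which matches the STR r0,[sp,#0] in the listing.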

Return values

Given the following code:

int test_function(int a, int b, int c, int d);

int val;

int main(void)
{
  //...
  val = test_function(1,2,3,4);
  //...
}

By analyzing the assembler we can see that the return value is placed in r0:

...
MOV r3,#4
MOV r2,#3
MOV r1,#2
MOV r0,#1
BL test_function
LDR r1,|L0.40| ; load address of extern val into r1
STR r0,[r1,#0] ; store function return value in val
...

C99 long long Arguments

The AAPCS defines the size and alignment of the C base types. The C99 long long is 8 bytes in size and alignment. So how does this change our model?

Given:

long long test_ll(long long a, long long b);

long long ll_val;
extern long long ll_p1;
extern long long ll_p2;

int main(void)
{
  ...
  ll_val = test_ll(ll_p1, ll_p2);
  ...
}

We get:

        ...
        LDR      r0,|L0.40|
        LDR      r1,|L0.44|
        LDRD     r2,r3,[r0,#0]
        LDRD     r0,r1,[r1,#0]
        BL       test_ll
        LDR      r2,|L0.48|
        STRD     r0,r1,[r2,#0]
        ...
|L0.40|
        DCD      ll_p2
|L0.44|
        DCD      ll_p1

This code demonstrates that a 64-bit long long uses two registers (r0-r1 for the first parameter and r2-r3 for the second). In addition, the 64-bit return value has come back in r0-r1.
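
To make the register pairing concrete, here is a small sketch of how one 64-bit value maps onto two 32-bit register-sized words. It assumes a little-endian target, where the low word goes in the lower-numbered register; reg_pair, split_for_pair, join_from_pair and check_round_trip are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit words that carry one 64-bit value in a register pair. */
typedef struct {
    uint32_t lower_reg;   /* contents of the lower-numbered register, e.g. r0 */
    uint32_t higher_reg;  /* contents of the higher-numbered register, e.g. r1 */
} reg_pair;

/* Split a 64-bit value into its two words (assumes a little-endian target). */
reg_pair split_for_pair(long long value)
{
    uint64_t bits = (uint64_t)value;
    reg_pair pair;
    pair.lower_reg  = (uint32_t)(bits & 0xFFFFFFFFu);  /* low 32 bits */
    pair.higher_reg = (uint32_t)(bits >> 32);          /* high 32 bits */
    return pair;
}

/* Rebuild the 64-bit value from the two words. */
long long join_from_pair(reg_pair pair)
{
    uint64_t bits = ((uint64_t)pair.higher_reg << 32) | pair.lower_reg;
    return (long long)bits;
}

/* Splitting and then joining must give back the original value. */
void check_round_trip(long long value)
{
    assert(join_from_pair(split_for_pair(value)) == value);
}
```

Under that assumption, split_for_pair(0x1122334455667788LL) gives 0x55667788 for the lower-numbered register and 0x11223344 for the higher-numbered one.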

Doubles

As with long long, a double (based on the IEEE 754 standard) is also 8 bytes in size and alignment on Arm. However, the generated code depends on the target core. For example, given the code:

double test_dbl(double a, double b);

double dval;
extern double dbl_p1;
extern double dbl_p2;

int main(void)
{
  ...
  dval = test_dbl(dbl_p1, dbl_p2);
  ...
}

When compiled for a Cortex-M3 (armcc --cpu=Cortex-M3 --c99 -S main.c) the output is almost identical to the long long example:

        ...
        LDR      r0,|L0.28|
        LDR      r1,|L0.32|
        LDRD     r2,r3,[r0,#0]
        LDRD     r0,r1,[r1,#0]
        BL       test_dbl
        LDR      r2,|L0.36|
        STRD     r0,r1,[r2,#0]
        ...
|L0.28|
        DCD      dbl_p2
|L0.32|
        DCD      dbl_p1

However, if we recompile this for a Cortex-A9 (armcc --cpu=Cortex-A9 --c99 -S main.c), we get quite different generated instructions:

        ...
        LDR      r0,|L0.40|
        VLDR     d1,[r0,#0]
        LDR      r0,|L0.44|
        VLDR     d0,[r0,#0]
        BL       test_dbl
        LDR      r0,|L0.48|
        VSTR     d0,[r0,#0]
        ...
|L0.40|
        DCD      dbl_p2
|L0.44|
        DCD      dbl_p1

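The VLDR and VSTR instructions are generated because the Cortex-A9 has Vector Floating Point (VFP) hardware: the two arguments are loaded into the floating-point registers d0 and d1, and the result is stored from d0, rather than using the core registers r0-r3.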
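
In the Cortex-M3 build, by contrast, each double travels in a core register pair as its raw 64-bit IEEE 754 bit pattern. A minimal sketch of viewing that pattern from C follows; double_bits is a hypothetical helper, and it assumes an 8-byte double as the AAPCS specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the IEEE 754 bit pattern of a double into a 64-bit integer.
 * memcpy is used so that no aliasing rules are broken. */
uint64_t double_bits(double value)
{
    uint64_t bits;
    assert(sizeof bits == sizeof value);   /* assumes double is 8 bytes */
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

For example, double_bits(1.0) is 0x3FF0000000000000. On a little-endian target the low word (0x00000000) would sit in the lower-numbered register of the pair and the high word (0x3FF00000) in the higher-numbered one.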

Mixing 32-bit and 64-bit parameters

Assuming we change our function to accept a mixture of 32-bit and 64-bit parameters, e.g.

void test_iil(int a, int b, long long c);

extern long long ll_p1;

int main(void)
{
  ...
  test_iil(1, 2, ll_p1);
  ...
}

As expected, we get a in r0, b in r1 and ll_p1 in r2-r3:

        ...
        LDR      r0,|L0.32|
        MOV      r1,#2
        LDRD     r2,r3,[r0,#0]
        MOV      r0,#1
        BL       test_iil
        ...
|L0.32|
        DCD      ll_p1

However, if we subtly change the order to:

void test_ili(int a, long long c, int b);

extern long long ll_p1;

int main(void)
{
  ...
  test_ili(1,ll_p1,2);
  ...
}

We get a different result; a is in r0, c is in r2-r3, but now b is stored on the stack.

        ...
        MOV      r0,#2
        STR      r0,[sp,#0] ; store parameter b on the stack
        LDR      r0,|L0.36|
        LDRD     r2,r3,[r0,#0]
        MOV      r0,#1
        BL       test_ili
        ...
|L0.36|
        DCD      ll_p1

So why doesn't parameter 'c' use r1-r2? Because the AAPCS states:

"A double-word sized type is passed in two consecutive registers (e.g., r0 and r1, or r2 and r3). The content of the registers is as if the value had been loaded from memory representation with a single LDM instruction."

In other words, 'c' must start in an even-numbered register (r2), so r1 is skipped. As the compiler is not allowed to rearrange parameter ordering, the parameter 'b' unfortunately has to come in order after 'c' and therefore cannot use the unused register r1.
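
The allocation behaviour seen in the two listings can be modelled with a short sketch. This is a simplified illustration that handles only 4-byte and 8-byte integer arguments passed in core registers, not a full implementation of the AAPCS rules; arg_location and place_args are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_ARG_REGS = 4 };   /* r0-r3 */

typedef struct {
    int first_reg;        /* first register used (0..3), or -1 if on the stack */
    size_t stack_offset;  /* byte offset from sp at the call; valid when first_reg == -1 */
} arg_location;

/* Place arguments of 4 or 8 bytes in declaration order.
 * sizes[i] is the size of argument i; out[i] receives its location.
 * Returns the number of stack bytes used for arguments. */
size_t place_args(const size_t *sizes, size_t count, arg_location *out)
{
    size_t next_reg = 0;      /* next free core register number */
    size_t stack_bytes = 0;   /* next free stack offset */

    for (size_t i = 0; i < count; i++) {
        assert(sizes[i] == 4 || sizes[i] == 8);
        size_t words = sizes[i] / 4;

        /* A 64-bit argument must start in an even-numbered register. */
        if (words == 2 && next_reg < NUM_ARG_REGS && (next_reg % 2) != 0) {
            next_reg++;       /* the skipped register stays unused */
        }

        if (next_reg + words <= NUM_ARG_REGS) {
            out[i].first_reg = (int)next_reg;
            out[i].stack_offset = 0;
            next_reg += words;
        } else {
            /* Once one argument goes to the stack, all later ones do too. */
            next_reg = NUM_ARG_REGS;
            if (words == 2) {
                stack_bytes = (stack_bytes + 7) & ~(size_t)7;  /* 8-byte align */
            }
            out[i].first_reg = -1;
            out[i].stack_offset = stack_bytes;
            stack_bytes += sizes[i];
        }
    }
    return stack_bytes;
}
```

With sizes {4, 4, 8} (int, int, long long) everything fits in r0-r3 and no stack is used. With sizes {4, 8, 4} (int, long long, int) the long long starts at r2, r1 is left empty, and the final int lands on the stack at offset 0, as in the second listing.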

C++

For any C++ programmers out there, it is important to realize that for class member functions the implicit 'this' argument is passed as the 32-bit value in r0. So, hopefully you can see the implications if targeting Arm of:

class Ex
{
public:
  void mf(long long d, int i);
};

vs.

class Ex
{
public:
  void mf(int i, long long d);
};
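
One way to see the difference is to write the two member functions as plain C functions with the implicit 'this' made explicit. This is only a sketch: the names Ex_mf_d_then_i and Ex_mf_i_then_d, the 'total' member and the function bodies are hypothetical, and the register comments assume the allocation behaviour described earlier in this post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C stand-in for class Ex, with one member to work on. */
typedef struct {
    long long total;
} Ex;

/* Stand-in for Ex::mf(long long d, int i):
 *   self -> r0, r1 skipped, d -> r2-r3, i -> stack */
void Ex_mf_d_then_i(Ex *self, long long d, int i)
{
    assert(self != NULL);
    self->total += d + i;
}

/* Stand-in for Ex::mf(int i, long long d):
 *   self -> r0, i -> r1, d -> r2-r3, nothing on the stack */
void Ex_mf_i_then_d(Ex *self, int i, long long d)
{
    assert(self != NULL);
    self->total += d + i;
}
```

Both functions do the same work, but only the second ordering keeps every argument in registers.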

Summary

Keeping arguments in registers may be seen as "marginal gains", but for large code bases I have seen first-hand significant performance and power improvements simply from rearranging the parameter ordering.

It is also useful to know that both the Arm Accredited Engineer (AAE) Accreditation and the Arm Accredited MCU Engineer (AAME) Accreditation exams require AAPCS knowledge.

And finally...

I'll leave you with one more bit of code to puzzle over, given:

typedef struct
{
  int a;
  int b;
  int c;
  int d;
} Example;

void test_struct(Example p);

Example ex = {1,2,3,4};

int main(void)
{
  ...
  test_struct(ex);
  ...
}

Can you guess how 'ex' is passed?
