aarch64 - return by value - GNU gcc inline assembler

I'm trying to write a very simple function in two or three aarch64 instructions as 'inline assembler' inside a C++ source file.

With the aarch64 calling convention on Linux, if a function returns a very large struct by value, the address at which to store the return value is passed in the X8 register. This is unusual as far as calling conventions go. The other calling conventions I'm familiar with, for example System V x86_64, Microsoft x64, cdecl, stdcall and arm32, pass the address of the return value as a hidden first parameter. So for example with x86_64 on Linux, the RDI register contains the address of where to store the very large struct.
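To make the convention concrete, here is a minimal sketch (separate from the program further down) in which a callee builds its return value directly in storage supplied by the caller. The names `Big`, `make_big` and `built_in_callers_storage` are made up for illustration. The check assumes the object is constructed in place with no copy, which C++17 requires for this pattern and which GCC and Clang also do in earlier language modes as far as I know.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type for illustration. The user-provided destructor makes it
// non-trivial for the purposes of calls, so it is returned through memory that
// the caller provides (address in X8 on aarch64, in RDI on System V x86_64).
struct Big {
    char bytes[64];
    std::uintptr_t self;   // address of the object as seen by its constructor

    Big() : bytes(), self(reinterpret_cast<std::uintptr_t>(this)) {}
    ~Big() {}
    // The implicit copy constructor copies 'self' unchanged, so an
    // intermediate copy would show up as a mismatch in the check below.
};

// The callee constructs the result straight into the caller's storage.
Big make_big() { return Big(); }

// Returns true if the address seen inside the callee equals the address of
// the caller's local variable.
bool built_in_callers_storage()
{
    Big r = make_big();
    assert(r.self != 0);
    return r.self == reinterpret_cast<std::uintptr_t>(&r);
}
```

Calling `built_in_callers_storage()` from `main` should report a match on both aarch64 and x86_64; the two conventions differ only in which register carries the address.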

I want to try to emulate this behaviour on aarch64 on Linux. When my assembler function is entered, I want it to do two things:
(1) Put the address of the indirect return object into the first parameter register, i.e. move X8 to X0
(2) Jump to a location specified by a global function pointer

So here's how I think my assembler function should look:

    __asm("Invoke:                    \n"
          "    mov  x0, x8            \n"  // move return value address into 1st parameter
          "    mov  x9, f             \n"  // Load address of code into register
          "    br   x9                \n"  // Jump to code
    );

I don't know what's wrong here but it doesn't work. In the following complete C++ program, I use the class 'std::mutex' as it's a good example of a class that can't be copied or moved (I am relying on mandatory Return Value Optimisation).

Here is my entire program in one C++ file, could someone please help me write the assembler function properly? Am I supposed to be using the ADRP and LDR instructions instead of MOV?

#include <mutex>                  // mutex
#include <iostream>               // cout, endl
using std::cout, std::endl;

void (*f)(void) = nullptr;

extern "C" void Invoke(void);

__asm("Invoke:                    \n"
      "    mov  x0, x8            \n"  // move return value address into 1st parameter
      "    mov  x9, f             \n"  // Load address of code into register
      "    br   x9                \n"  // Jump to code
);

void Func(std::mutex *const p)
{
    cout << "Address of return value: " << p << endl;
}

int main(void)
{
    f = (void(*)(void))Func;

    auto const p = reinterpret_cast<std::mutex (*)(void)>(Invoke);

    auto retval = p();

    cout << "Address of return value: " << &retval << endl;
}
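On the ADRP/LDR question: as far as I can tell, `mov x9, f` cannot work as written, because MOV takes a register or an immediate, not the contents of a variable. A minimal sketch of the ADRP + LDR form follows. It assumes GCC or Clang targeting aarch64 Linux, with `f` defined in the same executable (code in a shared library would need a GOT-based sequence instead). `Big` and `call_through_trampoline` are hypothetical names, with `Big` standing in for `std::mutex`. The non-aarch64 branch is there only so the file still builds on other machines.

```cpp
#include <cassert>

// Hypothetical stand-in for std::mutex: too large to be returned in registers.
struct Big { char bytes[64]; };

// Global function pointer that the trampoline jumps through.
void (*f)(void) = nullptr;

#if defined(__aarch64__) && defined(__linux__)
extern "C" void Invoke(void);

__asm(".pushsection .text              \n"
      ".global Invoke                  \n"
      ".type   Invoke, %function       \n"
      "Invoke:                         \n"
      "    mov  x0, x8                 \n"  // return value address -> 1st parameter
      "    adrp x9, f                  \n"  // x9 = address of the 4 KiB page containing f
      "    ldr  x9, [x9, #:lo12:f]     \n"  // x9 = value stored in f (the jump target)
      "    br   x9                     \n"  // tail-jump; the target returns to our caller
      ".popsection                     \n"
);
#endif

// The target receives the return value address as an ordinary first argument.
void Func(Big *const p)
{
    p->bytes[0] = 42;   // leave a marker that the caller can check
}

int call_through_trampoline()
{
    f = reinterpret_cast<void (*)(void)>(Func);
    assert(f != nullptr);
#if defined(__aarch64__) && defined(__linux__)
    auto const p = reinterpret_cast<Big (*)(void)>(Invoke);
    Big retval = p();                               // X8 carries the address for retval
#else
    Big retval;                                     // fallback for other targets:
    reinterpret_cast<void (*)(Big *)>(f)(&retval);  // call the target directly
#endif
    return retval.bytes[0];
}
```

The marker byte is checked rather than the raw address so that the result does not depend on whether the compiler hands the callee the final variable or a temporary.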

  • On Monday 17 July 2023, Frederick Gotham wrote:
    >
    > If I change it to thread_local then try to re-compile, I get a linker error:
    >
    > R_AARCH64_ADR_PREL_LO21 used with TLS symbol f
    >
    > Do you know what syntax I use to access the thread_local variable from assembler? Will I need to write a separate function as follows?

    In order to try to understand how thread_local variables are accessed from aarch64 assembler, I wrote the following dynamic shared library in C:

          __thread void (*f)(void);

          void (*g)(void);

          void Func(void)
          {
              g = f;
          }

    I compiled this to 'libtest.so' and then used 'objdump' on it to see:

                 <Func>:
    Line 01: stp x29, x30, [sp, #-16]!
    Line 02: mrs x1, tpidr_el0
    Line 03: mov x29, sp
    Line 04: adrp x0, 20000 <__cxa_finalize>
    Line 05: ldr x2, [x0, #16]
    Line 06: add x0, x0, #0x10
    Line 07: blr x2
    Line 08: adrp x2, 1f000 <__FRAME_END__+0x1e8c8>
    Line 09: ldr x2, [x2, #4032]
    Line 10: ldr x0, [x1, x0]
    Line 11: str x0, [x2]
    Line 12: ldp x29, x30, [sp], #16
    Line 13: ret

    Line #2 appears to read the thread pointer (the base address of "thread local storage") from TPIDR_EL0 into the x1 register.

    Lines #4-7 at first glance seem to call the function "__cxa_finalize" (which is the one that gets called at the end of a program to invoke all the destructors of global objects)... but really I think the number 0x20000 is just being used as a base address to apply offsets to.

    Line #7 is definitely calling some function, although I don't know which one.

    Lines #8-11, I'm not sure here... but I think they're moving the value of the thread_local variable 'f' into the global variable 'g'. Line #12 just restores the frame pointer and link register before the return.

    Can anyone please help me understand this, and explain how I would go about writing aarch64 assembler to access a thread_local variable called 'f'?
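Lines #4-7 of the listing look like the TLS descriptor sequence that GCC emits for position-independent code: the `blr x2` would then call a resolver that returns the variable's offset from the thread pointer in x0, which line #10 adds to x1. For code linked into the main executable, the shorter "local-exec" sequence should be enough. The sketch below assumes that case, with GCC or Clang on aarch64 Linux and a constant-initialised `f`; it is not suitable for a shared library. `Big` and `call_through_tls_trampoline` are hypothetical names, and the non-aarch64 branch only keeps the file buildable elsewhere.

```cpp
#include <cassert>

// Hypothetical large type, returned through memory.
struct Big { char bytes[64]; };

// Per-thread jump target. Constant initialisation means no dynamic
// TLS initialisation is involved.
thread_local void (*f)(void) = nullptr;

#if defined(__aarch64__) && defined(__linux__)
extern "C" void Invoke(void);

__asm(".pushsection .text                          \n"
      ".global Invoke                              \n"
      ".type   Invoke, %function                   \n"
      "Invoke:                                     \n"
      "    mov  x0, x8                             \n"  // return value address -> 1st parameter
      "    mrs  x9, tpidr_el0                      \n"  // x9 = thread pointer
      "    add  x9, x9, #:tprel_hi12:f, lsl #12    \n"  // add high 12 bits of f's TLS offset
      "    add  x9, x9, #:tprel_lo12_nc:f          \n"  // add low 12 bits: x9 = &f for this thread
      "    ldr  x9, [x9]                           \n"  // x9 = value stored in f
      "    br   x9                                 \n"  // tail-jump to the target
      ".popsection                                 \n"
);
#endif

// The target receives the return value address as an ordinary first argument.
void Func(Big *const p)
{
    p->bytes[0] = 42;   // leave a marker that the caller can check
}

int call_through_tls_trampoline()
{
    f = reinterpret_cast<void (*)(void)>(Func);     // set this thread's copy of f
    assert(f != nullptr);
#if defined(__aarch64__) && defined(__linux__)
    auto const p = reinterpret_cast<Big (*)(void)>(Invoke);
    Big retval = p();
#else
    Big retval;                                     // fallback for other targets:
    reinterpret_cast<void (*)(Big *)>(f)(&retval);  // call the target directly
#endif
    return retval.bytes[0];
}
```

Each thread has its own `f`, so a thread must assign `f` itself before calling through `Invoke`.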
