Arm Community
Arm Community
  • Site
  • User
  • Site
  • Search
  • User
Arm Community blogs
Arm Community blogs
Architectures and Processors blog Using the Stack in AArch64: Implementing Push and Pop
  • Blogs
  • Mentions
  • Sub-Groups
  • Tags
  • Jump...
  • Cancel
More blogs in Arm Community blogs
  • AI blog

  • Announcements

  • Architectures and Processors blog

  • Automotive blog

  • Embedded and Microcontrollers blog

  • Internet of Things (IoT) blog

  • Laptops and Desktops blog

  • Mobile, Graphics, and Gaming blog

  • Operating Systems blog

  • Servers and Cloud Computing blog

  • SoC Design and Simulation blog

  • Tools, Software and IDEs blog

Tell us what you think
Tags
  • Assembly
  • AArch64
  • Tutorial
Actions
  • RSS
  • More
  • Cancel
Related blog posts
Related forum threads

Using the Stack in AArch64: Implementing Push and Pop

Jacob Bramley
Jacob Bramley
November 23, 2015
7 minute read time.

As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.

The alignment checks can be very inconvenient in code generators where it is not feasible to determine how much stack space a function will require. Many JIT compilers fall into this category; they tend to rely on being able to push individual values to the stack.

The Problem

For conventional C and C++ compilers, the stack pointer alignment restrictions in AAPCS64 don't seem to cause much trouble 1. Many C functions start with a prologue that allocates the stack space required for the whole function. This space is then accessed as needed during the function. This is possible because the C compiler can determine in advance the stack space that will be required. Special handling will be required for variable-length arrays and alloca, but these are special cases that aren't often seen in real code.

JIT compilers (and other time-constrained code generators) cannot usually do this because it is expensive to analyse the code to extract this information. Also, many simple compilers are based around a stack machine, and assume that there is an efficient push implementation for an arbitrary number of registers. This is easy to manage in AArch32 because the basic data type is usually a single 4-byte register, and these can be pushed individually (between function calls) without violating any sp alignment rules. However, for AArch64, the required stack pointer alignment is two x or four w registers, and it is not possible to push individual registers.

// Broken AArch64 implementation of `push {x1}; push {x0};`.
  str   x1, [sp, #-8]!  // This works, but leaves `sp` with only 8-byte alignment ...
  str   x0, [sp, #-8]!  // ... so the second `str` will fail.

The most appropriate method of implementing push and pop operations will depend on the nature of the engine you are using. I considered a number of possible solutions for use in the AArch64 port of the Google V8 JavaScript engine. I will present each idea along with their advantages and disadvantages.

Calculate stack sizes in advance

If the required analysis is possible, it can result in fast generated code and efficient use of stack memory, so I've included this as a kind of benchmark, even though it might not be possible for many JIT compilers. The generated code will typically look something like this:

sub   sp, sp, #(8 * 14)       // Allocate space for the whole block.
...
str   x0, [sp, #(8 * 11)]     // Write to slot 11.
...
ldr   x0, [sp, #(8 * 11)]     // Read from slot 11.
...
add   sp, sp, #(8 * 14)       // Free the space at the end of the block.

Indexed addressing modes can sometimes be used to combine some of the operations. For example:

str   x0, [sp, #-(8 * 14)]!   // Allocate space and write to slot 0 in one step.

Depending on the design of your compiler (and your source language), it might be possible to calculate stack usage for individual basic blocks, even if function-level analysis isn't feasible. You'll have a separate allocation instruction (sub) for each block, but this is still cheaper than some other approaches that I'll describe in this article.

16-byte stacks slots diagram

Use 16-byte stack slots

Let's start with a simple, quick-and-dirty approach.

It sounds wasteful – and in most cases it is – but the simplest way to handle the stack pointer can be to push each value to a 16-byte slot. This doubles the stack usage (for pointer-sized values), and it effectively reduces the available memory bandwidth. It is also awkward to implement multiple-register operations using this scheme, since each register requires a separate instruction.

In general, I don't consider this approach to be appropriate. However, it does have one significant advantage, which is that it is very simple; there might be situations where this simplicity is worth the cost.

str   x0, [sp, #-16]!         // push {x0}
...
ldr   x0, [sp], #16           // pop {x0}

Use a register other than sp as the stack pointer

This mechanism is simple in principle: if the alignment restrictions of sp are inconvenient, just use another register as your stack pointer. General-purpose registers have no special alignment restrictions. Interfaces with PCS-compliant code (such as the C or C++ parts of the virtual machine) need to synchronise sp and the replacement stack pointer, but this is usually simple and quite cheap.

There is a notable complication: memory below the architectural sp (but in the stack area) cannot be safely accessed. Notably, this area is used by signal handlers, which execute asynchronously (like interrupts). If we just copy sp to some other register and start using it as a (descending) stack pointer, our special stack area will eventually be corrupted.

Separate stack area

Diagram separate stack areas

One way to use a separate register for the stack is to have a completely separate area of memory allocated for generated code to use as a stack. The two stacks would grow and shrink independently, and the procedure-call standard would apply only to the architectural stack. You must ensure that you allocate enough memory, but on most platforms you can allocate a large range of contiguous virtual addresses without actually reserving physical memory. (This is how Linux creates the normal process stack, for example.)

There aren't very many complications with this technique. Generated code must be careful around entry and exit points, but not significantly more than usual. The biggest complication in most situations will be integration with other components. For example, in a virtual machine where a garbage collector needs to scan and update the stack, it also needs to be aware of the special stack area.

Reserve stack space in advance

If the application can reliably predict a maximum stack space for a given function, the entry point can simply move sp down temporarily to accomodate this space. It is often easier to determine the maximum stack space required than it is to determine precisely how much stack is needed.

entry:
  // Using x28 as a replacement stack pointer.
  sub   sp, x28, #max_stack_space
  ...
  str   x0, [x28, #-8]!   // push {x0}
  ...
  ldr   x0, [x28], #8     // pop {x0}

Note that sp doesn't need to be kept 16-byte aligned in the example above because it isn't used to access memory.

Sadly, although finding an upper limit on the required stack space is easier than calculating the usage exactly, it still often requires analysis that isn't easily available, so this is definitely not a drop-in solution.

Maximum stack space diagram

Shadow sp.

Another solution is to update sp just before every push. sp won't necessarily be 16-byte aligned, but since it is never used to access memory, it doesn't matter. This method is what the Google V8 JavaScript engine uses, and it's also what VIXL's MacroAssembler uses if you tell it to use a different stack pointer.

sub   sp, x28, #8           // preparation
str   x0, [x28, #-8]!       // push {x0}
...
ldr   x0, [x28], #8         // pop {x0}

In general, there is no need to unwind the architectural sp on pop instructions, since it is harmless to leave it where it is.


Shadow stacks

With some care, the preparation step for several pushes can be combined in order to minimise the code-size overhead. (If you take this far enough, it starts to look quite similar to the "reserve stack area in advance" proposal above.)

// Several pushes can share a single preparation step.
sub   sp, x28, #32          // preparation
stp   x3, x2, [x28, #-16]!  // push {x2}; push {x3};
...
stp   x1, x0, [x28, #-16]!  // push {x0}; push {x1};

Aside from the wasteful 16-byte-per-slot mechanism, this shadow-sp design is probably the simplest drop-in solution available; push and pop macros can be written to hide the alignment restrictions for ad-hoc usage, and no additional analysis is required to get it to work. It performs well, since most processors can execute the sub and str at the same time. The only significant cost to be aware of is the code size overhead, especially where you have many small pushes.

In Conclusion

None of these ideas will work well in every context, so the best choice really depends on the constraints that you have to work within. However, hopefully I've explained a few of the practical problems that you're likely to face, and given a bit of inspiration.


1I've never actually worked on a C compiler, but their stack allocation behaviour is clear from disassembly.

Anonymous
  • 42Bastian
    42Bastian over 9 years ago

    Agree for AArch32/ARMv7. But I still do not see the problem for AArch64.

    Even for code generators.

    Taking your code-generator example:

    // AArch64 implementation of `push {w0, w1, w2, w3}`. 

    1.   stp   w0, w1, [sp, #-16]!   // Allocate four words and store w0 and w1 at the lower addresses. 
    2.   stp   w2, w3, [sp, #8]      // Store w2 and w3 at the upper addresses. 

    I doubt it brings any benefit to store 32bit values on the stack. Sure, it is half the memory, but I doubt it takes half the time.

    So the generator should simply translate a "push {...}" into a number of "stp" with the X registers.

    BTW: The benefit for an Assembly programmer of AArch64 is, that one has to use the stack rather seldom :-)

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • Jacob Bramley
    Jacob Bramley over 9 years ago

    On AArch32, 8-byte alignment is only required at public interfaces. (That is, function calls.) Within a function, only 4-byte alignment is required. That is the following A32 code is valid:

        push {r0}

        push {r1}

    However, similar A64 code is not valid:

        str x0, [sp, #-8]!

        str x1, [sp, #-8]!

    I explained this in a bit more detail in my previous article: Using the Stack in AArch32 and AArch64

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
  • 42Bastian
    42Bastian over 9 years ago

    Hi

    I am a bit confused about the "trouble" you see with the stack. On the Aarch32, a 8 byte stack alignment must (Cortex-A) or should be kept.

    So a single push {r0] will not do.

    So on AArch32, always an even number of registers is pushed even if only to keep alignment.

    So what is the matter of using stp to push registers?

    • Cancel
    • Up 0 Down
    • Reply
    • More
    • Cancel
<
Architectures and Processors blog
  • When a barrier does not block: The pitfalls of partial order

    Wathsala Vithanage
    Wathsala Vithanage
    Acquire fences aren’t always enough. See how LDAPR exposed unsafe interleavings and what we did to patch the problem.
    • September 15, 2025
  • Introducing GICv5: Scalable and secure interrupt management for Arm

    Christoffer Dall
    Christoffer Dall
    Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
    • April 28, 2025
  • Getting started with AARCHMRS Features.json using Python

    Joh
    Joh
    A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
    • April 8, 2025