Using the Stack in AArch64: Implementing Push and Pop

Jacob Bramley
November 23, 2015

As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.

The alignment checks can be very inconvenient in code generators where it is not feasible to determine how much stack space a function will require. Many JIT compilers fall into this category; they tend to rely on being able to push individual values to the stack.

The Problem

For conventional C and C++ compilers, the stack pointer alignment restrictions in AAPCS64 don't seem to cause much trouble [1]. Many C functions start with a prologue that allocates the stack space required for the whole function. This space is then accessed as needed during the function. This is possible because the C compiler can determine in advance the stack space that will be required. Special handling will be required for variable-length arrays and alloca, but these are special cases that aren't often seen in real code.

JIT compilers (and other time-constrained code generators) cannot usually do this because it is expensive to analyse the code to extract this information. Also, many simple compilers are based around a stack machine, and assume that there is an efficient push implementation for an arbitrary number of registers. This is easy to manage in AArch32 because the basic data type is usually a single 4-byte register, and these can be pushed individually (between function calls) without violating any sp alignment rules. However, for AArch64, the required stack pointer alignment is two x or four w registers, and it is not possible to push individual registers.

// Broken AArch64 implementation of `push {x1}; push {x0};`.
  str   x1, [sp, #-8]!  // This works, but leaves `sp` with only 8-byte alignment ...
  str   x0, [sp, #-8]!  // ... so the second `str` will fail.
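
For comparison, the AArch32 equivalent is unproblematic, because (between function calls) individual registers can be pushed without breaking any sp alignment rules:

// AArch32 equivalent of the sequence above; each push works on its own.
  push  {r1}            // str r1, [sp, #-4]!
  push  {r0}            // str r0, [sp, #-4]!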

The most appropriate method of implementing push and pop operations will depend on the nature of the engine you are using. I considered a number of possible solutions for use in the AArch64 port of the Google V8 JavaScript engine. I will present each idea along with its advantages and disadvantages.

Calculate stack sizes in advance

If the required analysis is possible, it can result in fast generated code and efficient use of stack memory, so I've included this as a kind of benchmark, even though it might not be possible for many JIT compilers. The generated code will typically look something like this:

sub   sp, sp, #(8 * 14)       // Allocate space for the whole block.
...
str   x0, [sp, #(8 * 11)]     // Write to slot 11.
...
ldr   x0, [sp, #(8 * 11)]     // Read from slot 11.
...
add   sp, sp, #(8 * 14)       // Free the space at the end of the block.

Indexed addressing modes can sometimes be used to combine some of the operations. For example:

str   x0, [sp, #-(8 * 14)]!   // Allocate space and write to slot 0 in one step.

Depending on the design of your compiler (and your source language), it might be possible to calculate stack usage for individual basic blocks, even if function-level analysis isn't feasible. You'll have a separate allocation instruction (sub) for each block, but this is still cheaper than some other approaches that I'll describe in this article.
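
If you go down that route, the generated code for each block looks like a scaled-down version of the function-level scheme. A small sketch, where the block names and slot counts are purely illustrative:

block_a:
  sub   sp, sp, #(8 * 4)        // Allocate this block's slots (kept to a multiple of 16 bytes).
  ...
  str   x0, [sp, #(8 * 2)]      // Slots are still accessed at fixed offsets.
  ...
  add   sp, sp, #(8 * 4)        // Free them before leaving the block.
  b     block_b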

[Diagram: 16-byte stack slots]

Use 16-byte stack slots

Let's start with a simple, quick-and-dirty approach.

It sounds wasteful – and in most cases it is – but the simplest way to handle the stack pointer can be to push each value to a 16-byte slot. This doubles the stack usage (for pointer-sized values), and it effectively reduces the available memory bandwidth. It is also awkward to implement multiple-register operations using this scheme, since each register requires a separate instruction.

In general, I don't consider this approach to be appropriate. However, it does have one significant advantage, which is that it is very simple; there might be situations where this simplicity is worth the cost.

str   x0, [sp, #-16]!         // push {x0}
...
ldr   x0, [sp], #16           // pop {x0}
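
Pushing more than one register makes the awkwardness clear: every value needs its own str and its own 16-byte slot, where a single stp would otherwise have done the job.

str   x1, [sp, #-16]!         // push {x1}
str   x0, [sp, #-16]!         // push {x0}, into its own separate slot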

Use a register other than sp as the stack pointer

This mechanism is simple in principle: if the alignment restrictions of sp are inconvenient, just use another register as your stack pointer. General-purpose registers have no special alignment restrictions. Interfaces with PCS-compliant code (such as the C or C++ parts of the virtual machine) need to synchronise sp and the replacement stack pointer, but this is usually simple and quite cheap.
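
As a sketch of what that synchronisation can look like, assuming x28 is the replacement stack pointer and helper is some hypothetical PCS-compliant C function:

and   sp, x28, #0xfffffffffffffff0   // Align x28 down to 16 bytes (~15), writing the result to sp ...
bl    helper                         // ... so the callee sees a conventional, aligned stack.
// x28 is untouched; the generated code carries on pushing and popping through it.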

There is a notable complication: memory below the architectural sp (but in the stack area) cannot be safely accessed. Notably, this area is used by signal handlers, which execute asynchronously (like interrupts). If we just copy sp to some other register and start using it as a (descending) stack pointer, our special stack area will eventually be corrupted.

Separate stack area

[Diagram: separate stack areas]

One way to use a separate register for the stack is to have a completely separate area of memory allocated for generated code to use as a stack. The two stacks would grow and shrink independently, and the procedure-call standard would apply only to the architectural stack. You must ensure that you allocate enough memory, but on most platforms you can allocate a large range of contiguous virtual addresses without actually reserving physical memory. (This is how Linux creates the normal process stack, for example.)

There aren't very many complications with this technique. Generated code must be careful around entry and exit points, but not significantly more than usual. The biggest complication in most situations will be integration with other components. For example, in a virtual machine where a garbage collector needs to scan and update the stack, it also needs to be aware of the special stack area.
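
As a very rough illustration of those entry and exit points, the separate stack pointer can be threaded through a per-thread context structure. The details here (x28 as the stack register, the context pointer in x0, the JIT_SP offset) are made-up names for the sketch rather than part of any particular virtual machine:

entry_stub:
  ldr   x28, [x0, #JIT_SP]    // Pick up this thread's position in the separate stack area.
  ...                         // Generated code pushes and pops through x28; sp is untouched.
  str   x28, [x0, #JIT_SP]    // Write the position back before returning to C or C++ code.
  ret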

Reserve stack space in advance

If the application can reliably predict a maximum stack space for a given function, the entry point can simply move sp down temporarily to accommodate this space. It is often easier to determine the maximum stack space required than it is to determine precisely how much stack is needed.

entry:
  // Using x28 as a replacement stack pointer.
  sub   sp, x28, #max_stack_space
  ...
  str   x0, [x28, #-8]!   // push {x0}
  ...
  ldr   x0, [x28], #8     // pop {x0}

Note that sp doesn't need to be kept 16-byte aligned in the example above because it isn't used to access memory.

Sadly, although finding an upper limit on the required stack space is easier than calculating the usage exactly, it still often requires analysis that isn't easily available, so this is definitely not a drop-in solution.

[Diagram: maximum stack space reserved in advance]

Shadow sp

Another solution is to update sp just before every push. sp won't necessarily be 16-byte aligned, but since it is never used to access memory, it doesn't matter. This method is what the Google V8 JavaScript engine uses, and it's also what VIXL's MacroAssembler uses if you tell it to use a different stack pointer.

sub   sp, x28, #8           // preparation
str   x0, [x28, #-8]!       // push {x0}
...
ldr   x0, [x28], #8         // pop {x0}

In general, there is no need to unwind the architectural sp on pop instructions, since it is harmless to leave it where it is.


[Diagram: shadow stacks]

With some care, the preparation step for several pushes can be combined in order to minimise the code-size overhead. (If you take this far enough, it starts to look quite similar to the "reserve stack area in advance" proposal above.)

// Several pushes can share a single preparation step.
sub   sp, x28, #32          // preparation
stp   x3, x2, [x28, #-16]!  // push {x2}; push {x3};
...
stp   x1, x0, [x28, #-16]!  // push {x0}; push {x1};

Aside from the wasteful 16-byte-per-slot mechanism, this shadow-sp design is probably the simplest drop-in solution available; push and pop macros can be written to hide the alignment restrictions for ad-hoc usage, and no additional analysis is required to get it to work. It performs well, since most processors can execute the sub and str at the same time. The only significant cost to be aware of is the code size overhead, especially where you have many small pushes.
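
For ad-hoc use, the macros can be very small indeed. A GNU assembler sketch, again assuming x28 as the replacement stack pointer:

.macro push_one reg
  sub   sp, x28, #8           // Keep the architectural sp at or below the live data ...
  str   \reg, [x28, #-8]!     // ... then push through the shadow stack pointer as usual.
.endm

.macro pop_one reg
  ldr   \reg, [x28], #8       // As noted above, sp can safely be left where it is.
.endm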

In Conclusion

None of these ideas will work well in every context, so the best choice really depends on the constraints that you have to work within. However, hopefully I've explained a few of the practical problems that you're likely to face, and given a bit of inspiration.


[1] I've never actually worked on a C compiler, but their stack allocation behaviour is clear from disassembly.

  • HALivingston over 6 years ago

    I just have one question ... why? You've gone through all the details (and I really do appreciate it, because even though after reading it I'm like of course, these are all the available options, it was confusing) but you've failed to explain why the brilliant engineers at ARM decided to subject compiler writers and assembly writers to these peculiar rules.

  • Jacob Bramley over 9 years ago

    I think it would be harder than it sounds, but such a stage could work in some designs. It would cost time, though, and the kind of compilers that have this problem are the ones that want to compile as quickly as possible (for responsiveness). I think it would be more desirable in general to pick a stack-usage strategy like one of the ones above, and not have to implement an extra post-processing stage, but different contexts demand different compromises.

  • 42Bastian over 9 years ago

    Jacob,

    ok, now I have a clearer image of the problem.

    So it seems a JIT compiler should have a post-optimization stage.

  • Michael Williams over 9 years ago

    Hi Jacob, great article.

    I think another point worth making is that there might also be performance advantages of one approach over the other. If you compare, for example:

        stp  x4, x5, [sp, #-16]!
        stp  x2, x3, [sp, #-16]!
        stp  x0, x1, [sp, #-16]!

    vs.

        stp  x0, x1, [sp, #-48]!
        stp  x2, x3, [sp, #16]
        stp  x4, x5, [sp, #32]

    Then the former has a chain of dependencies between the instructions. The second instruction relies on the address calculation from the first, and the third from the second. The second form doesn't have the dependency between the second and third. Once the base address for the stack frame has been calculated, each of the stores can then proceed.

    Whether this manifests as a performance advantage depends on the microarchitecture of the processor, but it is generally good style to avoid unnecessary dependencies between instructions to maximize the opportunities for multi-issue and out-of-order execution.

  • Jacob Bramley over 9 years ago

    "Agree for AArch32/ARMv7. But I still do not see the problem for AArch64."

    For example, Google V8's internal code-generation API requires that we can push arbitrary numbers of registers. This is not an unusual requirement; almost every JIT compiler I've seen has similar expectations. Typical calling code will simply call "Push(x0)". The trouble is, you can't generate "push {x0}" without misaligning the stack pointer.

    In theory you could group all the pushes together and only ever issue them in pairs (or bigger groups), but that's very difficult in practice. At the very least, it requires analysis of the code being compiled that may not be feasible. (This has the same requirements as the "Calculate stack sizes in advance" solution mentioned in the article.) Google V8's full-codegen, for example, is a stack machine that tends to want to push a single result from each operation: https://chromium.googlesource.com/v8/v8.git/+/4.9.70/src/full-codegen/arm64/full-codegen-arm64.cc#475

    "I doubt it brings any benefit to store 32bit values on the stack. Sure, it is half the memory, but I doubt it takes half the time."

    This might be true, but it really depends on what you're doing. If you have to store a lot of them in a heavily-recursive function, or if you need them to look like an array or part of a packed structure, then you do want to push them into 32-bit slots. Also, sometimes memory is valuable, especially where there's no time compromise to worry about.

    It's also worth noting that the same pattern is useful for pushing large sets of X registers. That is, this is (probably) the most efficient way to implement "push {x0, x1, x2, x3}":

        stp   x0, x1, [sp, #-32]!

        stp   x2, x3, [sp, #16]
