Using the Stack in AArch64: Implementing Push and Pop

November 23, 2015

7 minute read time.

As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.

The alignment checks can be very inconvenient in code generators where it is not feasible to determine how much stack space a function will require. Many JIT compilers fall into this category; they tend to rely on being able to push individual values to the stack.

The Problem

For conventional C and C++ compilers, the stack pointer alignment restrictions in AAPCS64 don't seem to cause much trouble ¹. Many C functions start with a prologue that allocates the stack space required for the whole function. This space is then accessed as needed during the function. This is possible because the C compiler can determine in advance the stack space that will be required. Special handling will be required for variable-length arrays and alloca, but these are special cases that aren't often seen in real code.

JIT compilers (and other time-constrained code generators) cannot usually do this because it is expensive to analyse the code to extract this information. Also, many simple compilers are based around a stack machine, and assume that there is an efficient push implementation for an arbitrary number of registers. This is easy to manage in AArch32 because the basic data type is usually a single 4-byte register, and these can be pushed individually (between function calls) without violating any sp alignment rules. However, for AArch64, the required stack pointer alignment is two x or four w registers, and it is not possible to push individual registers.

// Broken AArch64 implementation of `push {x1}; push {x0};`.
  str   x1, [sp, #-8]!  // This works, but leaves `sp` with only 8-byte alignment ...
  str   x0, [sp, #-8]!  // ... so the second `str` will fail.

The most appropriate method of implementing push and pop operations will depend on the nature of the engine you are using. I considered a number of possible solutions for use in the AArch64 port of the Google V8 JavaScript engine. I will present each idea along with their advantages and disadvantages.

Calculate stack sizes in advance

If the required analysis is possible, it can result in fast generated code and efficient use of stack memory, so I've included this as a kind of benchmark, even though it might not be possible for many JIT compilers. The generated code will typically look something like this:

sub   sp, sp, #(8 * 14)       // Allocate space for the whole block.
...
str   x0, [sp, #(8 * 11)]     // Write to slot 11.
...
ldr   x0, [sp, #(8 * 11)]     // Read from slot 11.
...
add   sp, sp, #(8 * 14)       // Free the space at the end of the block.

Indexed addressing modes can sometimes be used to combine some of the operations. For example:

str   x0, [sp, #-(8 * 14)]!   // Allocate space and write to slot 0 in one step.

Depending on the design of your compiler (and your source language), it might be possible to calculate stack usage for individual basic blocks, even if function-level analysis isn't feasible. You'll have a separate allocation instruction (sub) for each block, but this is still cheaper than some other approaches that I'll describe in this article.

Use 16-byte stack slots

Let's start with a simple, quick-and-dirty approach.

It sounds wasteful – and in most cases it is – but the simplest way to handle the stack pointer can be to push each value to a 16-byte slot. This doubles the stack usage (for pointer-sized values), and it effectively reduces the available memory bandwidth. It is also awkward to implement multiple-register operations using this scheme, since each register requires a separate instruction.

In general, I don't consider this approach to be appropriate. However, it does have one significant advantage, which is that it is very simple; there might be situations where this simplicity is worth the cost.

str   x0, [sp, #-16]!         // push {x0}
...
ldr   x0, [sp], #16           // pop {x0}

Use a register other than `sp` as the stack pointer

This mechanism is simple in principle: if the alignment restrictions of sp are inconvenient, just use another register as your stack pointer. General-purpose registers have no special alignment restrictions. Interfaces with PCS-compliant code (such as the C or C++ parts of the virtual machine) need to synchronise sp and the replacement stack pointer, but this is usually simple and quite cheap.

There is a notable complication: memory below the architectural sp (but in the stack area) cannot be safely accessed. Notably, this area is used by signal handlers, which execute asynchronously (like interrupts). If we just copy sp to some other register and start using it as a (descending) stack pointer, our special stack area will eventually be corrupted.

Separate stack area

One way to use a separate register for the stack is to have a completely separate area of memory allocated for generated code to use as a stack. The two stacks would grow and shrink independently, and the procedure-call standard would apply only to the architectural stack. You must ensure that you allocate enough memory, but on most platforms you can allocate a large range of contiguous virtual addresses without actually reserving physical memory. (This is how Linux creates the normal process stack, for example.)

There aren't very many complications with this technique. Generated code must be careful around entry and exit points, but not significantly more than usual. The biggest complication in most situations will be integration with other components. For example, in a virtual machine where a garbage collector needs to scan and update the stack, it also needs to be aware of the special stack area.

Reserve stack space in advance

If the application can reliably predict a maximum stack space for a given function, the entry point can simply move sp down temporarily to accomodate this space. It is often easier to determine the maximum stack space required than it is to determine precisely how much stack is needed.

entry:
  // Using x28 as a replacement stack pointer.
  sub   sp, x28, #max_stack_space
  ...
  str   x0, [x28, #-8]!   // push {x0}
  ...
  ldr   x0, [x28], #8     // pop {x0}

Note that sp doesn't need to be kept 16-byte aligned in the example above because it isn't used to access memory.

Sadly, although finding an upper limit on the required stack space is easier than calculating the usage exactly, it still often requires analysis that isn't easily available, so this is definitely not a drop-in solution.

Shadow `sp`.

Another solution is to update sp just before every push. sp won't necessarily be 16-byte aligned, but since it is never used to access memory, it doesn't matter. This method is what the Google V8 JavaScript engine uses, and it's also what VIXL's MacroAssembler uses if you tell it to use a different stack pointer.

sub   sp, x28, #8           // preparation
str   x0, [x28, #-8]!       // push {x0}
...
ldr   x0, [x28], #8         // pop {x0}

In general, there is no need to unwind the architectural sp on pop instructions, since it is harmless to leave it where it is.

With some care, the preparation step for several pushes can be combined in order to minimise the code-size overhead. (If you take this far enough, it starts to look quite similar to the "reserve stack area in advance" proposal above.)

// Several pushes can share a single preparation step.
sub   sp, x28, #32          // preparation
stp   x3, x2, [x28, #-16]!  // push {x2}; push {x3};
...
stp   x1, x0, [x28, #-16]!  // push {x0}; push {x1};

Aside from the wasteful 16-byte-per-slot mechanism, this shadow-sp design is probably the simplest drop-in solution available; push and pop macros can be written to hide the alignment restrictions for ad-hoc usage, and no additional analysis is required to get it to work. It performs well, since most processors can execute the sub and str at the same time. The only significant cost to be aware of is the code size overhead, especially where you have many small pushes.

In Conclusion

None of these ideas will work well in every context, so the best choice really depends on the constraints that you have to work within. However, hopefully I've explained a few of the practical problems that you're likely to face, and given a bit of inspiration.

¹I've never actually worked on a C compiler, but their stack allocation behaviour is clear from disassembly.

42Bastian over 8 years ago
Agree for AArch32/ARMv7. But I still do not see the problem for AArch64.
Even for code generators.
Taking your code-generator example:
// AArch64 implementation of `push {w0, w1, w2, w3}`.
stp w0, w1, [sp, #-16]! // Allocate four words and store w0 and w1 at the lower addresses.
stp w2, w3, [sp, #8] // Store w2 and w3 at the upper addresses.
I doubt it brings any benefit to store 32bit values on the stack. Sure, it is half the memory, but I doubt it takes half the time.
So the generator should simply translate a "push {...}" into a number of "stp" with the X registers.
BTW: The benefit for an Assembly programmer of AArch64 is, that one has to use the stack rather seldom :-)
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
Jacob Bramley over 8 years ago

On AArch32, 8-byte alignment is only required at public interfaces. (That is, function calls.) Within a function, only 4-byte alignment is required. That is the following A32 code is valid:
    push {r0}
    push {r1}
However, similar A64 code is not valid:
    str x0, [sp, #-8]!
    str x1, [sp, #-8]!
I explained this in a bit more detail in my previous article: Using the Stack in AArch32 and AArch64
- Cancel
- Up 0 Down
- Reply
- More
- Cancel
42Bastian over 8 years ago

Hi
I am a bit confused about the "trouble" you see with the stack. On the Aarch32, a 8 byte stack alignment must (Cortex-A) or should be kept.
So a single push {r0] will not do.
So on AArch32, always an even number of registers is pushed even if only to keep alignment.
So what is the matter of using stp to push registers?
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Architectures and Processors blog

Using SVE in C#

Alan Hayward

.NET 9 introduces SVE support on Arm, allowing users to write simplified vectorised code. This blog gives examples in C# and compares it to C++.
- November 20, 2024
Part 3: Enabling PAC and BTI on AArch64 for Linux

Bill Roberts

Supporting C++ style exceptions and DWARF for Pointer Authentication Codes (PAC) signed pointers.
- November 20, 2024
Part 2: Enabling PAC and BTI on AArch64 for Linux

Bill Roberts

Utilizing Pointer Authentication Codes (PAC) and Branch Target Instructions (BTI) together and optimizations in instruction counts.
- November 19, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

Using the Stack in AArch64: Implementing Push and Pop

The Problem

Calculate stack sizes in advance

Use 16-byte stack slots

Use a register other than `sp` as the stack pointer

Separate stack area

Reserve stack space in advance

Shadow `sp`.

In Conclusion

Using SVE in C#

Part 3: Enabling PAC and BTI on AArch64 for Linux

Part 2: Enabling PAC and BTI on AArch64 for Linux

Using the Stack in AArch64: Implementing Push and Pop

The Problem

Calculate stack sizes in advance

Use 16-byte stack slots

Use a register other than sp as the stack pointer

Separate stack area

Reserve stack space in advance

Shadow sp.

In Conclusion

Use a register other than `sp` as the stack pointer

Shadow `sp`.