As described in my last article, AArch64 performs stack pointer alignment checks in hardware. In particular, whenever the stack pointer is used as the base register in an address operand, it must have 16-byte alignment.
The alignment checks can be very inconvenient in code generators where it is not feasible to determine how much stack space a function will require. Many JIT compilers fall into this category; they tend to rely on being able to push individual values to the stack.
For conventional C and C++ compilers, the stack pointer alignment restrictions in AAPCS64 don't seem to cause much trouble 1. Many C functions start with a prologue that allocates the stack space required for the whole function. This space is then accessed as needed during the function. This is possible because the C compiler can determine in advance the stack space that will be required. Special handling will be required for variable-length arrays and alloca, but these are special cases that aren't often seen in real code.
alloca
JIT compilers (and other time-constrained code generators) cannot usually do this because it is expensive to analyse the code to extract this information. Also, many simple compilers are based around a stack machine, and assume that there is an efficient push implementation for an arbitrary number of registers. This is easy to manage in AArch32 because the basic data type is usually a single 4-byte register, and these can be pushed individually (between function calls) without violating any sp alignment rules. However, for AArch64, the required stack pointer alignment is two x or four w registers, and it is not possible to push individual registers.
push
sp
x
w
// Broken AArch64 implementation of `push {x1}; push {x0};`. str x1, [sp, #-8]! // This works, but leaves `sp` with only 8-byte alignment ... str x0, [sp, #-8]! // ... so the second `str` will fail.
The most appropriate method of implementing push and pop operations will depend on the nature of the engine you are using. I considered a number of possible solutions for use in the AArch64 port of the Google V8 JavaScript engine. I will present each idea along with their advantages and disadvantages.
If the required analysis is possible, it can result in fast generated code and efficient use of stack memory, so I've included this as a kind of benchmark, even though it might not be possible for many JIT compilers. The generated code will typically look something like this:
sub sp, sp, #(8 * 14) // Allocate space for the whole block. ... str x0, [sp, #(8 * 11)] // Write to slot 11. ... ldr x0, [sp, #(8 * 11)] // Read from slot 11. ... add sp, sp, #(8 * 14) // Free the space at the end of the block.
Indexed addressing modes can sometimes be used to combine some of the operations. For example:
str x0, [sp, #-(8 * 14)]! // Allocate space and write to slot 0 in one step.
Depending on the design of your compiler (and your source language), it might be possible to calculate stack usage for individual basic blocks, even if function-level analysis isn't feasible. You'll have a separate allocation instruction (sub) for each block, but this is still cheaper than some other approaches that I'll describe in this article.
sub
Let's start with a simple, quick-and-dirty approach.
It sounds wasteful – and in most cases it is – but the simplest way to handle the stack pointer can be to push each value to a 16-byte slot. This doubles the stack usage (for pointer-sized values), and it effectively reduces the available memory bandwidth. It is also awkward to implement multiple-register operations using this scheme, since each register requires a separate instruction.
In general, I don't consider this approach to be appropriate. However, it does have one significant advantage, which is that it is very simple; there might be situations where this simplicity is worth the cost.
str x0, [sp, #-16]! // push {x0} ... ldr x0, [sp], #16 // pop {x0}
This mechanism is simple in principle: if the alignment restrictions of sp are inconvenient, just use another register as your stack pointer. General-purpose registers have no special alignment restrictions. Interfaces with PCS-compliant code (such as the C or C++ parts of the virtual machine) need to synchronise sp and the replacement stack pointer, but this is usually simple and quite cheap.
There is a notable complication: memory below the architectural sp (but in the stack area) cannot be safely accessed. Notably, this area is used by signal handlers, which execute asynchronously (like interrupts). If we just copy sp to some other register and start using it as a (descending) stack pointer, our special stack area will eventually be corrupted.
One way to use a separate register for the stack is to have a completely separate area of memory allocated for generated code to use as a stack. The two stacks would grow and shrink independently, and the procedure-call standard would apply only to the architectural stack. You must ensure that you allocate enough memory, but on most platforms you can allocate a large range of contiguous virtual addresses without actually reserving physical memory. (This is how Linux creates the normal process stack, for example.)
There aren't very many complications with this technique. Generated code must be careful around entry and exit points, but not significantly more than usual. The biggest complication in most situations will be integration with other components. For example, in a virtual machine where a garbage collector needs to scan and update the stack, it also needs to be aware of the special stack area.
If the application can reliably predict a maximum stack space for a given function, the entry point can simply move sp down temporarily to accomodate this space. It is often easier to determine the maximum stack space required than it is to determine precisely how much stack is needed.
entry: // Using x28 as a replacement stack pointer. sub sp, x28, #max_stack_space ... str x0, [x28, #-8]! // push {x0} ... ldr x0, [x28], #8 // pop {x0}
Note that sp doesn't need to be kept 16-byte aligned in the example above because it isn't used to access memory.
Sadly, although finding an upper limit on the required stack space is easier than calculating the usage exactly, it still often requires analysis that isn't easily available, so this is definitely not a drop-in solution.
Another solution is to update sp just before every push. sp won't necessarily be 16-byte aligned, but since it is never used to access memory, it doesn't matter. This method is what the Google V8 JavaScript engine uses, and it's also what VIXL's MacroAssembler uses if you tell it to use a different stack pointer.
sub sp, x28, #8 // preparation str x0, [x28, #-8]! // push {x0} ... ldr x0, [x28], #8 // pop {x0}
In general, there is no need to unwind the architectural sp on pop instructions, since it is harmless to leave it where it is.
pop
With some care, the preparation step for several pushes can be combined in order to minimise the code-size overhead. (If you take this far enough, it starts to look quite similar to the "reserve stack area in advance" proposal above.)
// Several pushes can share a single preparation step. sub sp, x28, #32 // preparation stp x3, x2, [x28, #-16]! // push {x2}; push {x3}; ... stp x1, x0, [x28, #-16]! // push {x0}; push {x1};
Aside from the wasteful 16-byte-per-slot mechanism, this shadow-sp design is probably the simplest drop-in solution available; push and pop macros can be written to hide the alignment restrictions for ad-hoc usage, and no additional analysis is required to get it to work. It performs well, since most processors can execute the sub and str at the same time. The only significant cost to be aware of is the code size overhead, especially where you have many small pushes.
str
None of these ideas will work well in every context, so the best choice really depends on the constraints that you have to work within. However, hopefully I've explained a few of the practical problems that you're likely to face, and given a bit of inspiration.
1I've never actually worked on a C compiler, but their stack allocation behaviour is clear from disassembly.
Hi Jacob, great article.
I think another point worth making is that there might also be performance advantages of one approach over the other. If you compare, for example:
stp x4, x5, [sp, #-8]!
stp x2, x3, [sp, #-8]!
stp x0, x1, [sp, #-8]!
vs.
stp x0, x1, [sp, #-24]!
stp x2, x3, [sp, #8]
stp x4, x5, [sp, #16]
Then the former has a chain of dependencies between the instructions. The second instruction relies on the address calculation from the first, and the third from the second. The second form doesn't have the dependency between the second and third. Once the base address for the stack frame has been calculated, each of the stores can then proceed.
Whether this manifests as a performance advantage depends on the microarchitecture of the processor; but it is generally good style to avoid unnecessary dependencies between instructions to maximize the opportunities for multi- and out-of-order execution.