Fact: The ARM architecture is the most widely licensed 32-bit embedded instruction set architecture in the industry.
That fact makes the ARM Instruction Set Architecture (ISA) incredibly important to a huge number of people. As an architecture, the ARM ISA has had a relatively short history (the first incarnation was defined in 1985) but it has gone through several incremental revisions between then and now. The latest mainstream version is ARMv7, a 32-bit architecture which encompasses everything from high-performance application processor platforms to the tiniest microcontrollers (see Navigating the Cortex Maze for more information). In late 2011, at the annual TechCon show in Santa Clara, California, ARM publicly announced ARMv8. In the words of ARM’s chief architect, Richard Grisenthwaite, this represents “The largest architecture change in ARM’s history.” As such, it is an incredibly significant development.
Over the intervening couple of years, ARM has released more information about the details of ARMv8. Initially, specifications were released only for the “application” profile, ARMv8-A, which provides full 64-bit capability in the ARM architecture for the first time. Recently, complete tool sets have begun to emerge on the market supporting the ARMv8 architecture, and specifically the A64 Instruction Set Architecture (ISA) that is defined as its major component. So, how has that gone? What is the new instruction set like as a compiler target, for both static and dynamic compilation?
I spoke to engineers at ARM who have been involved in this development and the message can be summarised as “About as nice as a compiler could hope for.” But I don’t expect you to take that statement, bold as it is, without any backup. So I dug a little deeper.
The industry would say that, to make a good compiler target, an ISA should exhibit five properties: ease of implementation, ease of programming, regularity, orthogonality and composability. So, what do these mean? Just how does the ARMv8 ISA stack up against those criteria?
Ease of implementation refers to the ease of designing efficient hardware which implements the ISA. I don’t propose to discuss this here as I’m concerned with the ISA as a compiler target rather than with the microarchitectural details of individual processors…fascinating though they may be.
Ease of programming refers to the ease of writing programs in the ISA. For many of the reasons which I will discuss below in relation to automatic code generation, the ARMv8 ISA is an excellent vehicle for hand coding. But here I am not really concerned with how easy it is for humans to write ARMv8 assembly code; what really interests me at the moment is how easy it is for compilers to do so. We can certainly satisfy ourselves that the ISA is, for almost any definition of the word, “complete”, and that is all that matters for a compiler to be able to produce correct and functional code.
The remaining properties matter much more to a compiler. It is these which determine how straightforward the back-end code generation can be and how efficient the resulting code is.
If an ISA is “regular”, this implies that if something is done in a particular way in one place, then it should be done that way in every place. For instance, if literal constants in arithmetic operations are defined in a particular way, then they should be defined in that way everywhere else a literal constant is used. That might include logical operations, addressing modes, branch instructions and so on.
If an ISA is “orthogonal”, then different aspects of it can be discussed and defined largely in isolation from each other. For instance, the number of registers available should not affect the way that literal constants are defined or the way that PC-relative addressing modes work.
“Composability” refers to the ease of combining different elements of an ISA definition. That could mean, for instance, that all logical and arithmetic operations use the same operand combinations and the same format for literal constants. Or that every memory access instruction uses the same addressing modes, regardless of the data type involved.
All of these properties combine to reduce the number of “special cases” which a compiler implementer has to deal with. If all arithmetic operations have the same basic structure (e.g. in terms of the number and type of operands they take), then code generation is significantly easier. This is particularly true of dynamic compilation, where it permits a single code sequence to be simply parameterized to deal with any combination of input and output data types.
The new A64 instruction set has evolved from a design effort which has, so far, taken well over five years. Starting from the existing Thumb-2 functionality, adding 64-bit capability and extending the register bank gave a starting point. The architecture team then took the opportunity to clear up known performance hazards and to adjust the functionality to better match the requirements of modern software systems. Of course, the evolving instruction set was extensively modelled and benchmarked, and its encoding is new, designed from scratch and “very clean”.
Anyone who is familiar with the existing A32 and T32 instruction sets (as supported by ARMv7) won’t find many surprises, as one of the design intentions was to provide similar functionality to those sets. However, ARM took the opportunity to rationalize and restructure the instruction set in a number of ways.
Here are some highlights of the A64 instruction set:
- A clean, fixed-length 32-bit instruction encoding
- A larger bank of general-purpose 64-bit registers, plus a dedicated zero register
- Consistent addressing modes shared by all load/store instructions and by both the core and FP/SIMD register banks
- A consistent three-operand format for data-processing instructions
- Conditional select instructions, such as CSEL and CINC
- A Procedure Call Standard which passes up to eight parameters in registers and defines a dedicated frame pointer
For the record, one or two things have been removed:
- IT blocks, replaced by the conditional select instructions
- Load/store multiple (LDM/STM), replaced by the register-pair instructions LDP/STP
The late addition of some instructions in A32 resulted in some inconsistency in the encoding scheme. For instance, LDR/STR support for halfwords (LDRH/LDRSH/STRH) is encoded slightly differently to the mainstream byte and word transfer instructions, with the result that the addressing modes are somewhat different. Here are the encodings for the immediate offset forms of LDR and LDRH in the A32 instruction set.
LDR  (A32, immediate offset):  cond | 0 1 0 | P | U | 0 | W | 1 | Rn | Rt | imm12
LDRH (A32, immediate offset):  cond | 0 0 0 | P | U | 1 | W | 1 | Rn | Rt | imm4H | 1 0 1 1 | imm4L
You can see clearly that the encoding used for the offset is quite different, meaning that the LDRH instruction does not support the same offset range as LDR. The two instructions, although they carry out very similar operations, obey quite different rules and need to be generated and managed in different ways. This is hard for a compiler.
In A64, the encoding scheme has been designed from the ground up, making it much more consistent and ensuring that all memory access instructions (with very few exceptions) have a common encoding scheme and support the same set of addressing modes. Here is the encoding for all forms of the LDR instruction in A64.
LDR (A64, unsigned immediate offset):  size | 1 1 1 | V | 0 1 | opc | imm12 | Rn | Rt
The transfer size is indicated by the two-bit “size” field and you can see that the rest of the instruction fields (including the offset) are encoded consistently across all variants. This makes generation of an LDR instruction much simpler, as a general-purpose algorithm can be used for all data types and all addressing modes.
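To make that concrete, here is a minimal sketch of how a code generator might assemble the unsigned-immediate form of LDR for any access size. The helper name and interface are my own; the field values follow the A64 encoding shown above, with the 12-bit offset scaled by the access size.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: encode LDR Rt, [Rn, #offset] (unsigned immediate form).
     * size: 0 = byte, 1 = halfword, 2 = word, 3 = doubleword.                     */
    static bool encode_ldr_imm(uint32_t *out, unsigned size,
                               unsigned rt, unsigned rn, uint64_t byte_offset)
    {
        uint64_t scaled = byte_offset >> size;      /* imm12 is scaled by the access size */
        if ((scaled << size) != byte_offset || scaled > 0xFFF)
            return false;                           /* offset not representable           */
        *out = ((uint32_t)size << 30)               /* two-bit size field                 */
             | (0x39u << 24)                        /* load/store, unsigned immediate     */
             | (1u << 22)                           /* opc = 01: zero-extending load      */
             | ((uint32_t)scaled << 10)
             | ((uint32_t)rn << 5)
             | (uint32_t)rt;
        return true;                                /* e.g. size=2, rt=rn=0 gives 0xB9400000 */
    }

The same routine covers LDRB, LDRH, LDR (32-bit) and LDR (64-bit) simply by changing the size parameter, which is exactly the kind of parameterisation that the A32 encodings above make awkward.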
The similarity between A64 and A32/T32 is easily illustrated with a simple example. The three listings below show a simple C function and the corresponding output code, first in T32 and then in A64. The correspondence between the two is very easy to see.
// C code
int foo(int val)
{
    int newval = bar(val);
    return val + newval;
}

// T32
foo:
    sub    sp, sp, #8
    strd   r4, r14, [sp]
    mov    r4, r0
    bl     bar
    add    r0, r0, r4
    ldrd   r4, r14, [sp]
    add    sp, sp, #8
    bx     lr

// A64
foo:
    sub    sp, sp, #16
    stp    x19, x30, [sp]
    mov    w19, w0
    bl     bar
    add    w0, w0, w19
    ldp    x19, x30, [sp]
    add    sp, sp, #16
    ret
Now that you know what the A64 instruction set contains, what does that mean for actually compiling software? How good is it as a target for compilers? I spoke to the teams in ARM who have been responsible for the code generation for the Dalvik engine and for armcc, the generic C compiler. Suffice it to say that they love the new instruction set! Here’s a digest of what they told me…
The basic functionality provided by A64 has evolved from that found in A32/T32, so there are few surprises for compiler implementers and code generators. In general, porting to the new instruction set is fairly straightforward. In Porting Android, for example, approximately 80% of the code required nothing more than recompilation. Translating A32 assembly code to A64 is also generally straightforward: most instructions map easily and many sequences actually become simpler. Most of the changes are in procedure entry and exit sequences, where LDP/STP needs to replace LDM/STM.
All A64 instructions are the same length, in contrast to T32, which is a variable-length instruction set. This makes management and tracking of generated code sequences easier, which particularly benefits dynamic code generators.
A64 instructions generally provide longer offsets, both for PC-relative branches and for offset addressing.
The increased branch range makes it easier to manage inter-section jumps. Dynamically generated code is generally placed on the heap, so it could, in practice, be located anywhere. The runtime system finds it much easier to manage this with increased branch ranges, and fewer fix-ups are required.
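As a rough illustration of why this matters to a JIT, here is a sketch of emitting an unconditional A64 B instruction together with its range check. The helper is hypothetical, but the 26-bit word offset, and therefore the ±128MB range, is as the architecture defines it; because every instruction is four bytes, the offset arithmetic is trivial.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: emit "B target" at address pc, or report that a fix-up is needed. */
    static bool encode_branch(uint32_t *out, uint64_t pc, uint64_t target)
    {
        int64_t offset = (int64_t)(target - pc);    /* byte offset, must be word aligned   */
        if (offset & 3)
            return false;
        int64_t words = offset / 4;
        if (words < -(1 << 25) || words > (1 << 25) - 1)
            return false;                           /* beyond +/-128MB: a veneer is needed */
        *out = 0x14000000u | ((uint32_t)words & 0x03FFFFFFu);
        return true;
    }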
The need for literal pools – blocks of literal data embedded in the code stream – has long been a feature of ARM coding. This doesn’t go away completely in A64. However, the much larger PC-relative load offset (±1MB, compared with ±4KB in A32) helps considerably with managing them, making it possible to use one literal pool per compilation unit and removing the need to manufacture locations for literal pools in long code sequences.
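The PC-relative literal load is just as straightforward to generate and to range-check. The sketch below (again, the helper is hypothetical) encodes the 64-bit form of LDR (literal), whose 19-bit word offset gives the ±1MB reach mentioned above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: encode "LDR Xt, label" where label is a literal-pool entry. */
    static bool encode_ldr_literal(uint32_t *out, unsigned rt,
                                   uint64_t pc, uint64_t literal_addr)
    {
        int64_t offset = (int64_t)(literal_addr - pc);
        if (offset & 3)
            return false;                           /* literal must be word aligned       */
        int64_t words = offset / 4;
        if (words < -(1 << 18) || words > (1 << 18) - 1)
            return false;                           /* literal pool too far away (+/-1MB) */
        *out = 0x58000000u                          /* LDR (literal), 64-bit variant      */
             | (((uint32_t)words & 0x7FFFF) << 5)
             | (uint32_t)rt;
        return true;
    }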
All load/store instructions now support consistent addressing modes. This makes it much easier, for instance, to treat char, short, int and long long in the same way when loading and storing quantities from memory. In addition, the FP/SIMD registers now support the same addressing modes as the core registers, making it easier to use the two register banks interchangeably. Previously, it was common to load data into core registers (to make use of the more flexible and extensive addressing options) and then use additional instructions to move the result into FP/SIMD registers. The increased commonality between the register banks reduces register pressure and makes register allocation easier.
When generating code (statically, but especially dynamically) for common arithmetic functions, A32/T32 often required different instructions, or instruction sequences, to cope with different data types. These operations are much more consistent in A64, so it is much easier to generate common sequences for simple operations on differently sized data types. For instance, “add int”, “add long”, “add float” and “add double” would all result in slightly different functions in A32/T32. In A64, a single, simple parameterised function can emit the same instruction sequence for all of these data types.
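As a sketch of what that parameterisation looks like in practice, a single pair of routines can cover all four cases: one flag selects the 32-bit or 64-bit integer add, and another selects single or double precision for the floating-point add. The helper names are mine; the opcode patterns follow the published A64 encodings.

    #include <stdint.h>

    /* Hypothetical helpers: integer ADD (shifted register, zero shift) and scalar FADD. */
    static uint32_t encode_add_int(int is64, unsigned rd, unsigned rn, unsigned rm)
    {
        /* is64=0: ADD Wd, Wn, Wm (0x0B000000);  is64=1: ADD Xd, Xn, Xm (0x8B000000) */
        return ((uint32_t)(is64 & 1) << 31) | 0x0B000000u
             | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | (uint32_t)rd;
    }

    static uint32_t encode_add_fp(int is_double, unsigned rd, unsigned rn, unsigned rm)
    {
        /* is_double=0: FADD Sd, Sn, Sm (0x1E202800);  is_double=1: FADD Dd, Dn, Dm (0x1E602800) */
        return 0x1E202800u | ((uint32_t)(is_double & 1) << 22)
             | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | (uint32_t)rd;
    }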
Having common literal constant encoding across different instruction types greatly simplifies instruction generation and reduces special cases.
A64 deals very naturally with 64-bit signed and unsigned data types, which would have required special-case instruction sequences in A32/T32. Particularly for languages like Java, and for Dalvik bytecode, which have native 64-bit types, this makes for much more natural code generation.
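A trivial example makes the point. The C function below needs an add-with-carry sequence, and a register pair per operand, in A32/T32, but a single instruction in A64. The sequences shown in the comments are typical compiler output rather than anything mandated.

    /* 64-bit addition of two long long values. */
    long long add64(long long a, long long b)
    {
        return a + b;
    }

    /* Typical A32/T32 output (a in r0:r1, b in r2:r3, result in r0:r1):
     *     adds  r0, r0, r2
     *     adc   r1, r1, r3
     *     bx    lr
     *
     * Typical A64 output (a in x0, b in x1, result in x0):
     *     add   x0, x0, x1
     *     ret
     */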
Analysis shows that A64 is a more efficient and sometimes more compact instruction set than A32/T32. Although it provides 64-bit capability, instructions are still 32 bits wide. For instance, the 64-bit Dalvik JIT engine produces code which is around 6% smaller than the 32-bit equivalent. The resulting code is also simpler and easier to analyze.
The A64 register bank is significantly larger and this works to reduce register pressure in almost all applications. For instance, since 64-bit operations do not require two registers to hold input and/or output parameters, register usage is reduced. In T32, such cases would have resulted in a lot of “special case” code.
The A64 Procedure Call Standard (PCS) allows up to eight parameters to be passed in registers (X0-X7). In contrast, A32/T32 allow only four parameters in registers, with any excess being passed on the stack. Linux syscalls, for example, allow up to six parameters; in A64, all of these can be passed in registers without any need to use the stack. The PCS also defines a dedicated frame pointer (FP), which makes debugging and call-graph profiling easier by making it possible to reliably unwind the stack.
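As an illustration (the register assignments follow the A64 PCS; the function itself is made up), a six-argument call like the one below needs no stack traffic at all in A64, whereas the A32/T32 PCS would pass the last two arguments on the stack.

    /* Under the A64 PCS, a..f arrive in w0..w5 and the result is returned in w0.
     * Under the A32/T32 PCS, only a..d fit in r0..r3; e and f are passed on the stack. */
    int accumulate(int a, int b, int c, int d, int e, int f)
    {
        return a + b + c + d + e + f;
    }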
The “zero register” (register number 31, accessed as XZR or WZR) is very useful; storing WZR, for example, is a convenient way to write zero to memory without tying up another register. However, caution is sometimes necessary, as the meaning of register number 31 is context-dependent and, in some encodings, it refers to the stack pointer instead.
IT blocks are a useful feature of T32, allowing for efficient sequences which avoid the need for short forward branches around unexecuted instructions. However, they are sometimes hard to generate and hard to analyze. A64 removes these and replaces them with conditional instructions like CSEL (Conditional Select) and CINC (Conditional Increment). These are more straightforward and easier to handle without special cases.
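A small example shows the difference. For the function below, a T32 compiler would typically use an IT block, whereas an A64 compiler can use a compare followed by a conditional select; the sequences in the comments are illustrative, not definitive.

    /* Return the larger of two signed integers. */
    int max(int a, int b)
    {
        return (a > b) ? a : b;
    }

    /* Typical T32 output:             Typical A64 output:
     *     cmp    r0, r1                   cmp   w0, w1
     *     it     lt                       csel  w0, w0, w1, gt
     *     movlt  r0, r1                   ret
     *     bx     lr
     */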
A32, in general, preserves a true three-operand structure. T32, on the other hand, contains a significant number of two-operand instruction formats, which make it slightly less flexible when generating code. A64 sticks to a consistent three-operand syntax, which is easier to use and maps better to the requirements of high-level code generators such as Dalvik.
The A32/T32 shift and rotate behaviour, while well defined, does not always map easily to the behaviour expected by high-level languages. In particular, the A64 behaviour specified for out-of-range shift distances is more intuitive.
A64 instructions provide a huge range of options for constants, each tailored to the requirements of specific instruction types.
The upshot of all this is that A64 provides very flexible constants, but encoding them – and even determining whether a particular constant can be legally encoded in a particular context – is sometimes quite difficult.
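As a small taste of this, here is a check (the helper is mine) for the ADD/SUB immediate form, which accepts a 12-bit unsigned value optionally shifted left by twelve bits. The equivalent test for the logical instructions’ “bitmask immediate” form, which encodes replicated, rotated runs of ones, is considerably more involved and is a well-known source of fiddly compiler code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Can this value be used directly as an ADD/SUB immediate?
     * The encoding allows a 12-bit value, optionally shifted left by 12 (LSL #12). */
    static bool is_addsub_immediate(uint64_t value)
    {
        return (value & ~0xFFFull) == 0             /* fits in the unshifted 12-bit field */
            || (value & ~0xFFF000ull) == 0;         /* fits in the field shifted by 12    */
    }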
With only one instruction set state, developing in A64 does not involve interworking.
The simpler mapping scheme between the different register sizes in the FP/SIMD register bank makes these registers much easier to use: in A64 the smaller views simply occupy the low-order bits of each register, rather than overlapping pairs of registers as in ARMv7. The mapping is easier for compilers and optimizers to model and analyze; bit manipulation is more accessible; consistent addressing modes make getting memory contents into and out of the FP/SIMD register bank much easier; and there is less need to transfer values between core registers and FP/SIMD registers, as more processing can be carried out in situ.
In summary, my investigations of what compilers need and what the A64 ISA provides seem to come to a clear conclusion: that the A64 instruction set is a very good target for automatically generated code in both static and dynamic compilation environments.