Fact: The ARM architecture is the most widely licensed 32-bit embedded instruction set architecture in the industry.
That fact makes the ARM Instruction Set Architecture (ISA) incredibly important to a huge number of people. As an architecture, the ARM ISA has had a relatively short history (the first incarnation was defined in 1985) but it has gone through several incremental revisions between then and now. The latest mainstream version is ARMv7, a 32-bit architecture which encompasses everything from high-performance application processor platforms to the tiniest microcontrollers (see Navigating the Cortex Maze for more information). In late 2011, at the annual TechCon show in Santa Clara, California, ARM publicly announced ARMv8. In the words of ARM’s chief architect, Richard Grisenthwaite, this represents “The largest architecture change in ARM’s history.” As such, it is an incredibly significant development.
Over the intervening couple of years, ARM has released more information about the details of ARMv8. Initially, specifications were released only for the “application” profile, ARMv8-A, which provides full 64-bit capability in the ARM architecture for the first time. Recently, complete tool sets have begun to emerge on the market supporting the ARMv8 architecture, and specifically the A64 Instruction Set Architecture (ISA) that is defined as its major component. So, how has that gone? What is the new instruction set like as a compiler target, for both static and dynamic compilation?
I spoke to engineers at ARM who have been involved in this development and the message can be summarised as “About as nice as a compiler could hope for.” But I don’t expect you to take that statement, bold as it is, without any backup. So I dug a little deeper.
The industry would say that, to make a good compiler target, an ISA should exhibit five properties: ease of implementation, ease of programming, regularity, orthogonality and composability. So, what do these mean? Just how does the ARMv8 ISA stack up against those criteria?
Ease of implementation refers to the ease of designing efficient hardware which implements the ISA. I don’t propose to discuss this here as I’m concerned with the ISA as a compiler target rather than with the microarchitectural details of individual processors…fascinating though they may be.
Ease of programming refers to the ease of writing programs in the ISA. For many of the reasons which I will discuss below in relation to automatic code generation, the ARMv8 ISA is an excellent vehicle for hand coding. But here I am not really concerned with how easy it is for humans to write ARMv8 assembly code; what really interests me at the moment is how easy it is for compilers to do so. We can certainly satisfy ourselves that the ISA is, for almost any definition of the word, “complete”, and that is all that matters for a compiler to be able to produce correct and functional code.
The remaining properties matter much more to a compiler. It is these which determine how straightforward the back-end code generation can be and how efficient the resulting code is.
If an ISA is “regular”, this implies that if something is done in a particular way in one place, then it should be done that way in every place. For instance, if literal constants in arithmetic operations are defined in a particular way, then they should be defined in that way everywhere else a literal constant is used. That might include logical operations, addressing modes, branch instructions and so on.
If an ISA is “orthogonal”, then different aspects of it can be discussed and defined largely in isolation from each other. For instance, the number of registers available should not affect the way that literal constants are defined or the way that PC-relative addressing modes work.
“Composability” refers to the ease of combining different elements of an ISA definition. That could mean, for instance, that all logical and arithmetic operations use the same operand combinations and the same format for literal constants. Or that every memory access instruction uses the same addressing modes, regardless of the data type involved.
All of these properties combine to reduce the number of “special cases” which a compiler implementer has to deal with. If all arithmetic operations have the same basic structure (e.g. in terms of the number and type of operands they take), then code generation is significantly easier. This is particularly true of dynamic compilation, where it permits a single code sequence to be simply parameterized to deal with any combination of input and output data types.
The new A64 instruction set has evolved from a design effort which has, so far, taken well over five years. Starting from the existing Thumb-2 functionality, adding 64-bit capability and extending the register bank gave a starting point. The architecture team then took the opportunity to clear up known performance hazards and to adjust the functionality to better match the requirements of modern software systems. Of course, the evolving instruction set was extensively modelled and benchmarked, and its encoding is new, designed from scratch and “very clean”.
Anyone who is familiar with the existing A32 and T32 instruction sets (as supported by ARMv7) won’t find many surprises, as one of the design intentions was to provide similar functionality to those sets. However, ARM took the opportunity to rationalize and restructure the instruction set in a number of ways.
Here are some highlights of the A64 instruction set:
- A clean, fixed-length 32-bit instruction encoding
- A larger bank of general-purpose 64-bit registers, plus a dedicated zero register
- Consistent addressing modes shared by all load/store instructions and by both the core and FP/SIMD register banks
- A consistent three-operand format for data-processing instructions
- Conditional select instructions, such as CSEL and CINC
- A Procedure Call Standard which passes up to eight parameters in registers and defines a dedicated frame pointer
For the record, one or two things have been removed:
- IT blocks, replaced by the conditional select instructions
- Load/store multiple (LDM/STM), replaced by the register-pair instructions LDP/STP
The late addition of some instructions in A32 resulted in some inconsistency in the encoding scheme. For instance, LDR/STR support for halfwords (LDRH/LDRSH/STRH) is encoded slightly differently to the mainstream byte and word transfer instructions, with the result that the addressing modes are somewhat different. Here are the encodings for the immediate offset forms of LDR and LDRH in the A32 instruction set.
LDR  (A32, immediate offset):  cond | 0 1 0 | P | U | 0 | W | 1 | Rn | Rt | imm12
LDRH (A32, immediate offset):  cond | 0 0 0 | P | U | 1 | W | 1 | Rn | Rt | imm4H | 1 0 1 1 | imm4L
You can see clearly that the encoding used for the offset is quite different, meaning that the LDRH instruction does not support the same offset range as LDR. The two instructions, although they carry out very similar operations, obey quite different rules and need to be generated and managed in different ways. This is hard for a compiler.
In A64, the encoding scheme has been designed from the ground up, making it much more consistent and ensuring that all memory access instructions (with very few exceptions) have a common encoding scheme and support the same set of addressing modes. Here is the encoding for all forms of the LDR instruction in A64.
LDR (A64, unsigned immediate offset):  size | 1 1 1 | V | 0 1 | opc | imm12 | Rn | Rt
The transfer size is indicated by the two-bit “size” field and you can see that the rest of the instruction fields (including the offset) are encoded consistently across all variants. This makes generation of an LDR instruction much simpler, as a general-purpose algorithm can be used for all data types and all addressing modes.
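To make that concrete, here is a minimal sketch of how a code generator might assemble the unsigned-immediate form of LDR for any access size. The helper name and interface are my own; the field values follow the A64 encoding shown above, with the 12-bit offset scaled by the access size.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: encode LDR Rt, [Rn, #offset] (unsigned immediate form).
     * size: 0 = byte, 1 = halfword, 2 = word, 3 = doubleword.                     */
    static bool encode_ldr_imm(uint32_t *out, unsigned size,
                               unsigned rt, unsigned rn, uint64_t byte_offset)
    {
        uint64_t scaled = byte_offset >> size;      /* imm12 is scaled by the access size */
        if ((scaled << size) != byte_offset || scaled > 0xFFF)
            return false;                           /* offset not representable           */
        *out = ((uint32_t)size << 30)               /* two-bit size field                 */
             | (0x39u << 24)                        /* load/store, unsigned immediate     */
             | (1u << 22)                           /* opc = 01: zero-extending load      */
             | ((uint32_t)scaled << 10)
             | ((uint32_t)rn << 5)
             | (uint32_t)rt;
        return true;                                /* e.g. size=2, rt=rn=0 gives 0xB9400000 */
    }

The same routine covers LDRB, LDRH, LDR (32-bit) and LDR (64-bit) simply by changing the size parameter, which is exactly the kind of parameterisation that the A32 encodings above make awkward.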
The similarity between A64 and A32/T32 is easily illustrated with a simple example. The three listings below show a simple C function and the corresponding output code, first in T32 and then in A64. The correspondence between the two is very easy to see.
// C code
int foo(int val)
{
    int newval = bar(val);
    return val + newval;
}

// T32
foo:
    sub    sp, sp, #8
    strd   r4, r14, [sp]
    mov    r4, r0
    bl     bar
    add    r0, r0, r4
    ldrd   r4, r14, [sp]
    add    sp, sp, #8
    bx     lr

// A64
foo:
    sub    sp, sp, #16
    stp    x19, x30, [sp]
    mov    w19, w0
    bl     bar
    add    w0, w0, w19
    ldp    x19, x30, [sp]
    add    sp, sp, #16
    ret
Now that you know what the A64 instruction set contains, what does that mean for actually compiling software? How good is it as a target for compilers? I spoke to the teams in ARM who have been responsible for the code generation for the Dalvik engine and for armcc, the generic C compiler. Suffice it to say that they love the new instruction set! Here’s a digest of what they told me…
The basic functionality provided by A64 has evolved from that found in A32/T32, so there are few surprises for compiler implementers and code generators. In general, porting to the new instruction set is fairly straightforward. In Porting Android, for example, approximately 80% of the code required nothing more than recompilation. Translating A32 assembly code to A64 is also generally straightforward: most instructions map easily and many sequences actually become simpler. Most of the changes are in procedure entry and exit sequences, where LDP/STP needs to replace LDM/STM.
All A64 instructions are the same length, in contrast to T32, which is a variable-length instruction set. This makes management and tracking of generated code sequences easier, which particularly benefits dynamic code generators.
A64 instructions generally provide longer offsets, both for PC-relative branches and for offset addressing.
The increased branch range makes it easier to manage inter-section jumps. Dynamically generated code is generally placed on the heap, so it could, in practice, be located anywhere. The runtime system finds it much easier to manage this with increased branch ranges, and fewer fix-ups are required.
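As a rough illustration of why this matters to a JIT, here is a sketch of emitting an unconditional A64 B instruction together with its range check. The helper is hypothetical, but the 26-bit word offset, and therefore the ±128MB range, is as the architecture defines it; because every instruction is four bytes, the offset arithmetic is trivial.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: emit "B target" at address pc, or report that a fix-up is needed. */
    static bool encode_branch(uint32_t *out, uint64_t pc, uint64_t target)
    {
        int64_t offset = (int64_t)(target - pc);    /* byte offset, must be word aligned   */
        if (offset & 3)
            return false;
        int64_t words = offset / 4;
        if (words < -(1 << 25) || words > (1 << 25) - 1)
            return false;                           /* beyond +/-128MB: a veneer is needed */
        *out = 0x14000000u | ((uint32_t)words & 0x03FFFFFFu);
        return true;
    }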
The need for literal pools – blocks of literal data embedded in the code stream – has long been a feature of ARM coding. This doesn’t go away completely in A64. However, the much larger PC-relative load offset (±1MB, compared with ±4KB in A32) helps considerably with managing them, making it possible to use one literal pool per compilation unit and removing the need to manufacture locations for literal pools in long code sequences.
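The PC-relative literal load is just as straightforward to generate and to range-check. The sketch below (again, the helper is hypothetical) encodes the 64-bit form of LDR (literal), whose 19-bit word offset gives the ±1MB reach mentioned above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper: encode "LDR Xt, label" where label is a literal-pool entry. */
    static bool encode_ldr_literal(uint32_t *out, unsigned rt,
                                   uint64_t pc, uint64_t literal_addr)
    {
        int64_t offset = (int64_t)(literal_addr - pc);
        if (offset & 3)
            return false;                           /* literal must be word aligned       */
        int64_t words = offset / 4;
        if (words < -(1 << 18) || words > (1 << 18) - 1)
            return false;                           /* literal pool too far away (+/-1MB) */
        *out = 0x58000000u                          /* LDR (literal), 64-bit variant      */
             | (((uint32_t)words & 0x7FFFF) << 5)
             | (uint32_t)rt;
        return true;
    }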
All load/store instructions now support consistent addressing modes. This makes it much easier, for instance, to treat char, short, int and long long in the same way when loading and storing quantities from memory. In addition, the FP/SIMD registers now support the same addressing modes as the core registers, making it easier to use the two register banks interchangeably. Previously, it was common to load data into core registers (to make use of the more flexible and extensive addressing options) and then use additional instructions to move the result into FP/SIMD registers. The increased commonality between the register banks reduces register pressure and makes register allocation easier.
When generating code (statically, but especially dynamically) for common arithmetic functions, A32/T32 often required different instructions, or instruction sequences, to cope with different data types. These operations are much more consistent in A64, so it is much easier to generate common sequences for simple operations on differently sized data types. For instance, “add int”, “add long”, “add float” and “add double” would all result in slightly different functions in A32/T32. In A64, a single, simple parameterised function can emit the same instruction sequence for all of these data types.
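As a sketch of what that parameterisation looks like in practice, a single pair of routines can cover all four cases: one flag selects the 32-bit or 64-bit integer add, and another selects single or double precision for the floating-point add. The helper names are mine; the opcode patterns follow the published A64 encodings.

    #include <stdint.h>

    /* Hypothetical helpers: integer ADD (shifted register, zero shift) and scalar FADD. */
    static uint32_t encode_add_int(int is64, unsigned rd, unsigned rn, unsigned rm)
    {
        /* is64=0: ADD Wd, Wn, Wm (0x0B000000);  is64=1: ADD Xd, Xn, Xm (0x8B000000) */
        return ((uint32_t)(is64 & 1) << 31) | 0x0B000000u
             | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | (uint32_t)rd;
    }

    static uint32_t encode_add_fp(int is_double, unsigned rd, unsigned rn, unsigned rm)
    {
        /* is_double=0: FADD Sd, Sn, Sm (0x1E202800);  is_double=1: FADD Dd, Dn, Dm (0x1E602800) */
        return 0x1E202800u | ((uint32_t)(is_double & 1) << 22)
             | ((uint32_t)rm << 16) | ((uint32_t)rn << 5) | (uint32_t)rd;
    }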
Having common literal constant encoding across different instruction types greatly simplifies instruction generation and reduces special cases.
A64 deals very naturally with 64-bit signed and unsigned data types, which would have required special-case instruction sequences in A32/T32. Particularly for languages like Java, and for Dalvik bytecode, which have native 64-bit types, this makes for much more natural code generation.
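A trivial example makes the point. The C function below needs an add-with-carry sequence, and a register pair per operand, in A32/T32, but a single instruction in A64. The sequences shown in the comments are typical compiler output rather than anything mandated.

    /* 64-bit addition of two long long values. */
    long long add64(long long a, long long b)
    {
        return a + b;
    }

    /* Typical A32/T32 output (a in r0:r1, b in r2:r3, result in r0:r1):
     *     adds  r0, r0, r2
     *     adc   r1, r1, r3
     *     bx    lr
     *
     * Typical A64 output (a in x0, b in x1, result in x0):
     *     add   x0, x0, x1
     *     ret
     */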
Analysis shows that A64 is a more efficient and sometimes more compact instruction set than A32/T32. Although it provides 64-bit capability, instructions are still 32 bits wide. For instance, the 64-bit Dalvik JIT engine produces code which is around 6% smaller than the 32-bit equivalent. The resulting code is also simpler and easier to analyze.
The A64 register bank is significantly larger and this works to reduce register pressure in almost all applications. For instance, since 64-bit operations do not require two registers to hold input and/or output parameters, register usage is reduced. In T32, such cases would have resulted in a lot of “special case” code.
The A64 Procedure Call Standard (PCS) allows up to eight parameters to be passed in registers (X0-X7). In contrast, A32/T32 allow only four parameters in registers, with any excess being passed on the stack. Linux syscalls, for example, allow up to six parameters; in A64, all of these can be passed in registers without any need to use the stack. The PCS also defines a dedicated frame pointer (FP), which makes debugging and call-graph profiling easier by making it possible to reliably unwind the stack.
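As an illustration (the register assignments follow the A64 PCS; the function itself is made up), a six-argument call like the one below needs no stack traffic at all in A64, whereas the A32/T32 PCS would pass the last two arguments on the stack.

    /* Under the A64 PCS, a..f arrive in w0..w5 and the result is returned in w0.
     * Under the A32/T32 PCS, only a..d fit in r0..r3; e and f are passed on the stack. */
    int accumulate(int a, int b, int c, int d, int e, int f)
    {
        return a + b + c + d + e + f;
    }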
The “zero register” (register number 31, accessed as XZR or WZR) is very useful; storing WZR, for example, is a convenient way to write zero to memory without tying up another register. However, caution is sometimes necessary, as the meaning of register number 31 is context-dependent and, in some encodings, it refers to the stack pointer instead.
IT blocks are a useful feature of T32, allowing for efficient sequences which avoid the need for short forward branches around unexecuted instructions. However, they are sometimes hard to generate and hard to analyze. A64 removes these and replaces them with conditional instructions like CSEL (Conditional Select) and CINC (Conditional Increment). These are more straightforward and easier to handle without special cases.
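A small example shows the difference. For the function below, a T32 compiler would typically use an IT block, whereas an A64 compiler can use a compare followed by a conditional select; the sequences in the comments are illustrative, not definitive.

    /* Return the larger of two signed integers. */
    int max(int a, int b)
    {
        return (a > b) ? a : b;
    }

    /* Typical T32 output:             Typical A64 output:
     *     cmp    r0, r1                   cmp   w0, w1
     *     it     lt                       csel  w0, w0, w1, gt
     *     movlt  r0, r1                   ret
     *     bx     lr
     */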
A32, in general, preserves a true three-operand structure. T32, on the other hand, contains a significant number of two-operand instruction formats, which make it slightly less flexible when generating code. A64 sticks to a consistent three-operand syntax, which is easier to use and maps better to the requirements of high-level code generators such as Dalvik.
The A32/T32 shift and rotate behaviour, while well defined, does not always map easily to the behaviour expected by high-level languages. In particular, the A64 behaviour specified for out-of-range shift distances is more intuitive.
A64 instructions provide a huge range of options for constants, each tailored to the requirements of specific instruction types.
The upshot of all this is that A64 provides very flexible constants, but encoding them – and even determining whether a particular constant can be legally encoded in a particular context – is sometimes quite difficult.
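As a small taste of this, here is a check (the helper is mine) for the ADD/SUB immediate form, which accepts a 12-bit unsigned value optionally shifted left by twelve bits. The equivalent test for the logical instructions’ “bitmask immediate” form, which encodes replicated, rotated runs of ones, is considerably more involved and is a well-known source of fiddly compiler code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Can this value be used directly as an ADD/SUB immediate?
     * The encoding allows a 12-bit value, optionally shifted left by 12 (LSL #12). */
    static bool is_addsub_immediate(uint64_t value)
    {
        return (value & ~0xFFFull) == 0             /* fits in the unshifted 12-bit field */
            || (value & ~0xFFF000ull) == 0;         /* fits in the field shifted by 12    */
    }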
With only one instruction set state, developing in A64 does not involve interworking.
The simpler mapping scheme between the different register sizes in the FP/SIMD register bank makes these registers much easier to use: in A64 the smaller views simply occupy the low-order bits of each register, rather than overlapping pairs of registers as in ARMv7. The mapping is easier for compilers and optimizers to model and analyze; bit manipulation is more accessible; consistent addressing modes make getting memory contents into and out of the FP/SIMD register bank much easier; and there is less need to transfer values between core registers and FP/SIMD registers, as more processing can be carried out in situ.
In summary, my investigations of what compilers need and what the A64 ISA provides seem to come to a clear conclusion: that the A64 instruction set is a very good target for automatically generated code in both static and dynamic compilation environments.