The Armv8-A architecture continues to evolve, with the additions developed through 2016 collectively known as Armv8.3-A. Grouping enhancements in this manner helps the ecosystem manage tools and software support alongside the large numbers of Armv8-A based processors and products in development or production today. These changes add to the gradual migration in cores and related products over several years.
Developed in collaboration with our architecture licensees and other key partners, Armv8.3-A adds:
All these changes are incremental to previous sets of enhancements, with the Armv8-A System register ID mechanism used to identify features in any given implementation.
Please note: Arm recently announced support for a new vector processing architecture, the Scalable Vector Extension (SVE). This extension is independent of the changes introduced with Armv8.3-A. See Technology Update: The Scalable Vector Extension (SVE) for the Armv8-A architecture for more details.
The enhancements introduced with Armv8.3 fall into the following categories:
Note: AArch64 indicates the 64-bit Execution state and AArch32 the 32-bit Execution state in the Arm architecture.
Computer attacks are becoming more sophisticated. Examples of this are exploit mechanisms such as the use of gadgets in Return-Orientated-Programming (ROP) and Jump-Orientated-Programming (JOP). To mitigate against such exploits, Armv8.3-A introduces a feature that authenticates the contents of a register before it is used as the address for an indirect branch or data reference. For address authentication, the functionality uses the upper bits in a 64-bit address value normally associated with signed extension of the address space. This allows the introduction of a Pointer Authentication Code (PAC) as a new field within the upper bits of the value.
The functionality is summarized as follows:
There is growing interest in cloud computing, and, in particular, in an increasingly common use case where a user rents a virtual machine from an Infrastructure as a Service (IaaS) provider. Nested virtualization is an attractive proposition where the workload to run on this virtual machine includes the use of a hypervisor. In this blog, the hypervisor that is run natively on the hardware is described as the host hypervisor, while the nested hypervisor that is run under the control of the host hypervisor is described as the guest hypervisor.
The Armv8.3-A nested virtualization support enables a guest hypervisor to run transparently in non-secure EL1 mode, unaware that it is not executing at EL2. Running a guest hypervisor at EL1, removes the exception trap overhead, performance, and latency costs of running this software as a non-secure user-level process. This feature is only supported in AArch64, and requires implementation of EL2.
New instructions are added to AArch32 and AArch64 to aid floating-point multiplication and addition of complex numbers, where the complex numbers are packed in a vector register as a pair of elements. The Imaginary part of the number is placed in the more significant element, and the Real part of the number is placed in the less significant element.
The instructions include:
The floating-point functionality supported is:
Armv8.3-A adds instructions that convert a double-precision floating-point number to a signed 32-bit integer with round towards zero. Where the integer result is outside the range of a signed 32-bit integer (DP float supports integer precision up to 53 bits), the value stored as the result is the integer conversion modulo 232, taking the same sign as the input float.
The Z-flag is used to determine if the original number was an integer; the other flags (N, C, and V) are always cleared. The Z-flag is set to one to indicate an integer within range, meaning it is cleared when the input number is:
This approach allows a B.NE conditional branch to be used immediately after this instruction to test if the input double-precision number is a true representation of a 32-bit signed integer.
The Armv8.0 support for release consistency is based around the “RCsc” (Release Consistency sequentially consistent) model described by Adve & Gharacholoo in , where the Acquire/Release instructions follow a sequentially consistent order with respect to each other. This is well aligned to the requirements of the C++11/C11 memory_order_seq_cst, which is the default ordering of atomics in C++11/C11.
Instructions are added as part of Armv8.3-A to support the weaker RCpc (Release Consistent processor consistent) model where it is permissible that a Store-Release followed by a Load-Acquire to a different address can be re-ordered. This model is supported by the use of memory_order_release/ memory_order_acquire /memory_order_acqrel in C++11/C11.
 Adve & Gharachorloo Shared Memory Consistency Models: A Tutorial
The Current Cache Size ID Register (CCSIDR) defines the number of sets of a cache level by using a 15-bit field, and the associativity and number of ways in a 10-bit field. To avoid one or both of these becoming limiting factors in an implementation, a second 32-bit register, CCSIDR2, is added and a new format adopted across the 64 bits provided by the existing and new registers.
For a summary of the Armv8-A architecture, see the section on Armv8 architectural concepts in Chapter A1 of the Armv8-A Architecture Reference Manual.
Read Armv8-A Architecture Reference Manual
Armv8.1-A details are currently available as a supplement. Their consolidation alongside the Armv8.2-A details will be published in early 2017.
It is expected that the Armv8.3-A details will be consolidated into the Armv8-A Architecture Reference Manual and published in mid-2017.
128-bit integers are very easy.
-If just adding and subtracting numbers, they can be quite fast by using two 64-bit integers.
But the 128-bit float will (as you know) extend the "range" of values, which is important when calculating "real-world" distances.
(I'm speaking about billions of 128-bit float calculations per second).
So far, the PPC is still the best CPU at doing this; I'd prefer handing the job over to Arm-based designs, because it's easier to obtain an Arm Cortex-A these days (and the cost is far lower), plus the performance of the Cortex-A is generally better and there are a lot of implementations to choose from.
I can do 1024-bit integer calculations, but each time I extend by an integer register, multiplying is dragged down terribly much, so parallel calculations suddenly need a whole bunch more CPUs.