Arm’s Central Processing Units (CPUs) are critical for today’s AI-enabled software, interpreting, processing, and executing instructions. Arm’s Instruction Set Architecture (ISA) acts as an interface between hardware and software, specifying what the processor can do and how it gets done. Arm’s ISA is continually evolving to meet modern computing demands, including the rise of Artificial Intelligence (AI), Machine Learning (ML), the adoption of chiplets, and advancing security threats. Constant innovation ensures Arm’s pervasiveness, performance at scale, energy efficiency, security, and developer flexibility.
To ensure developments align with such a fast-moving market, Arm spends an extensive amount of time reviewing future computing needs and affirming its understanding with its vast, one-of-a-kind ecosystem. The combination of expertise and feedback ensures relevance when creating and publishing an updated ISA. This blog post, issued annually, outlines the key additions made to Arm’s A-Profile architecture for the year (Armv9.6-A in 2024) and accompanies the release of full Instruction Set and System Register documentation. Releasing new architecture is only the first step. Arm works closely with a host of partners to enable the Arm ISA in the most widely used upstream software communities, such as the Linux kernel and distributions, empowering the broadest developer ecosystem on the planet.
Details of previous updates to the A-Profile architecture are available here: 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023. Let’s look at some of the new features we have added this year.
Matrix operations are used to weight features and calculate predictions. They underpin many of today’s important workloads, including AI and ML. The Scalable Matrix Extension (SME), already in Armv9-A, greatly increases the processing speed and efficiency of matrix multiplications on Arm CPUs. With SME, calculations can be performed on multiple values at the same time, data collation and reuse are more efficient, and there is support for more data types and more effective data compression. SME also uses quantization, reducing the computational complexity of ML models. This reduces demand on memory, saving power, and makes models viable for mobile devices. Quantization is taken one step further with SME2, which introduces a streaming mode for any application that needs to emphasize throughput-oriented operations on the CPU.
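To see why quantization reduces memory demand, here is a minimal sketch (not SME code, and the scaling scheme is an illustrative assumption) of symmetric int8 quantization, which stores each weight in a quarter of the space of a float32:

```python
import numpy as np

# Toy illustration: symmetric int8 quantization of float32 weights,
# the kind of compression that shrinks model footprint for on-device ML.
def quantize_int8(weights):
    scale = np.max(np.abs(weights)) / 127.0  # map largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.02, -1.27, 0.5, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)

assert q.nbytes == w.nbytes // 4   # int8 storage is a quarter of float32
```

Arithmetic on the int8 values is also cheaper than on float32, which is where the reduced computational complexity comes from.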
The 2024 extensions build on SME2 with new support for 2:4 structured sparsity and quarter tile operations.
Starting with the quarter-tile operations: these are intended to improve the efficiency of SME when working with small matrices. Existing SME operations support outer product operations, using a pair of input vectors to compute a result matrix:
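As a reminder of the existing behaviour, here is a conceptual NumPy model of an SME outer-product-accumulate (the vector length and loop structure are illustrative, not architectural):

```python
import numpy as np

# Conceptual model of an SME outer-product-accumulate instruction
# (e.g. FMOPA): two vectors accumulate their outer product into a tile.
SVL = 4  # streaming vector length in elements (hypothetical value)

za = np.zeros((SVL, SVL), dtype=np.float32)   # the ZA tile accumulator

def mopa(za, zn, zm):
    """Accumulate outer(zn, zm) into the tile."""
    return za + np.outer(zn, zm)

# A full matrix multiply C = A @ B decomposes into one outer product
# per step along the shared dimension K.
A = np.arange(SVL * 3, dtype=np.float32).reshape(SVL, 3)
B = np.arange(3 * SVL, dtype=np.float32).reshape(3, SVL)
for k in range(3):
    za = mopa(za, A[:, k], B[k, :])

assert np.allclose(za, A @ B)
```

When the matrices being multiplied are much smaller than the tile, most of a full-width outer product is wasted work, which is the inefficiency the quarter-tile operations target.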
To better support smaller matrices, quarter-tile operations allow the inputs to be treated as being from 4 different matrices:
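The idea can be sketched as follows. This is a hedged model, not the architectural definition: the operand layout and vector length are assumptions, chosen only to show that one instruction can advance four independent small accumulations instead of one padded full-width one:

```python
import numpy as np

# Hedged sketch of the quarter-tile idea: each input vector is treated
# as four quarters drawn from four independent small matrices, so one
# operation advances four small outer-product accumulations at once.
SVL = 8          # illustrative streaming vector length
Q = SVL // 4     # quarter length

zn = np.arange(SVL, dtype=np.float32)            # quarters from 4 matrices
zm = np.arange(SVL, 2 * SVL, dtype=np.float32)

# Four independent QxQ accumulators, one per source matrix.
tiles = np.zeros((4, Q, Q), dtype=np.float32)
for i in range(4):
    a = zn[i * Q:(i + 1) * Q]
    b = zm[i * Q:(i + 1) * Q]
    tiles[i] += np.outer(a, b)

# 4 * Q*Q useful results per step, versus Q*Q useful results if each
# small matrix were padded into a full-width outer product on its own.
```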
Another improvement brought by the 2024 extensions relates to sparsity. In the example below, an input matrix containing Activation data is being multiplied with another matrix containing Weights. Some elements in the Weights matrix are unused (zeros) and do not affect the output.
This introduces two inefficiencies: the zero-valued Weights still consume storage and memory bandwidth, and they still consume multiply-accumulate operations that contribute nothing to the result.
New structured-sparsity instructions let us address both of these.
In the earlier example, the Weights can be compressed with a metadata tag describing how the data is to be decompressed.
This approach has the advantage of optimizing both the memory footprint of the Weights and the bandwidth needed to fetch them for processing. The Weights could be decompressed in the processor and then used in calculations. However, to avoid unnecessary multiply-accumulates, new instructions allow the compressed data to be used directly as an input.
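The scheme can be sketched in Python. The storage layout below is an illustrative assumption, not the architectural encoding: in 2:4 structured sparsity, every group of 4 weights holds at most 2 non-zeros, so only those 2 values are stored, plus metadata recording their positions, and the multiply-accumulates touch only the stored values:

```python
import numpy as np

# Sketch of 2:4 structured sparsity (layout illustrative): keep 2 values
# and 2 position indices per group of 4 weights.
def compress_2_4(weights):
    values, meta = [], []
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        nz = [i for i, w in enumerate(group) if w != 0]
        assert len(nz) <= 2, "not 2:4 sparse"
        nz = (nz + [0, 0])[:2]                # pad index list to 2 entries
        values += [group[i] for i in nz]
        meta += nz
    return np.array(values), np.array(meta)

def sparse_dot(values, meta, activations):
    # Multiply-accumulate only the stored non-zeros, as the new
    # instructions do when consuming compressed weights directly.
    acc = 0.0
    for j in range(len(values)):
        group = (j // 2) * 4                  # which group of 4 this entry came from
        acc += values[j] * activations[group + meta[j]]
    return acc

w = np.array([0.0, 3.0, 0.0, -1.0, 2.0, 0.0, 0.0, 5.0])
x = np.arange(8, dtype=float)
v, m = compress_2_4(w)
assert sparse_dot(v, m, x) == float(w @ x)    # half the storage, same result
```

The compressed form halves the weight storage and fetch bandwidth, and the dot product performs half the multiply-accumulates of the dense version.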
Chiplets offer greater system composability and performance scaling, making them attractive for AI and accelerated computing. To be adopted at scale, chiplets need to be interoperable, and interoperability is achieved through the standardization of chiplet interfaces and protocols. Arm is accelerating the ecosystem’s evolution to chiplet-based SoCs through standards that provide a common language and reduce the risk of fragmentation. Arm’s Chiplet System Architecture (CSA) addresses the partitioning of an Arm-based system into multiple chiplets, including their high-level properties, to define chiplet types that can be standardized and reused. AMBA CHI C2C leverages the existing on-chip AMBA CHI protocol and defines how it is packetized so that it can be transported chiplet to chiplet. These initiatives will accelerate the move to a multi-vendor marketplace of specialized, interoperable chiplets. An open chiplet marketplace will allow OEMs to enjoy greater levels of customization and integration without the costs associated with developing and manufacturing monolithic silicon designs. Today, the benefits of chiplets are realized through vertically integrated designs. Armv9-A’s 2024 extensions consider this new approach to silicon and how resources are managed across chiplets.

Many of today’s computing needs are satisfied by shared-memory computer systems, where multiple applications or multiple virtual machines (VMs) run concurrently. To support such systems, Armv8.4-A introduced the Memory System Resource Partitioning and Monitoring (MPAM) extension, which provides controls to monitor and partition the use of shared resources. MPAM uses a partition number (PARTID) to identify the software entity each memory access is associated with. The PARTID is transported with the memory access, allowing downstream Memory System Components (MSCs) to implement partitioning policies.
2024 sees the addition of MPAM Domains to better support shared-memory computer systems on multi-chiplet and multi-chip systems. MPAM Domains allow different parts of the system to implement different PARTID namespaces, with PARTID translation when an access moves across a domain boundary.
Not requiring a uniform PARTID width across the entire system allows systems to be more easily composed. Domains can also help reduce cost, as each part of the system can support only the needed number of PARTIDs.
Armv9-A’s Trace (ETE and TRBE) and Statistical Profiling (SPE) extensions give developers the information they need to understand how their software is performing, so they can get the most from the hardware platform.
Trace and SPE data can be collected non-invasively while the system is running, with the data written into software-allocated buffers in virtual memory. When running a VM, it is important that those buffers are not paged out by the hypervisor, otherwise profiling data will be lost. At the same time, it is often not desirable to pin all of a VM’s memory.
The 2024 extensions introduce a VM Interface for TRBE and for SPE. These interfaces allow the VM and hypervisor to agree on the size and location of profiling buffers. This gives the VM confidence that its profiling data will not be lost, while at the same time allowing the hypervisor to control how much of a VM’s memory needs to be pinned.
2024’s A-profile extensions introduce two enhancements to improve the efficiency of caching.
The first feature is Producer-Consumer Data Placement Hints. A new store hint instruction allows a producing thread to hint to the processor that the data from a store or atomic operation will be consumed by a different thread. For the consuming thread, there is a new prefetch instruction that hints that the data is being generated by another thread and might not yet be present. Together, these hints enable significant scalability improvements for parallel software, enhancing the performance of message passing, lock passing, and thread barriers.
For example:
Producer:

    STSHH STRM
    STR   <payload>
    STSHH STRM
    STLR  <flag>

Consumer:

    SEVL
Loop:
    WFE
    LDAXR <flag>
    CMP   <flag>, <expected value>
    B.EQ  Ready
    PRFM  IR, <payload>
    B.NE  Loop
Ready:
    LDR   <payload>
A system might include devices or accelerators that connect into different levels of the cache hierarchy. For example, in the system below Device A can access the System Level Cache (SLC), while Device B bypasses the SLC.
To make data visible to Device A or B, software running on the CPUs needs to push data into the memory system. Today, software would use a cache operation to the Point of Coherency (PoC), which in the example system is beyond the SLC. That is correct for Device B, but for Device A it would have been sufficient to push data to the SLC.
The 2024 extensions introduce additional cache maintenance operations which target the outer cache. This gives software that is aware of the cache topology greater flexibility, allowing developers to match how far into the system data is pushed to the needs of the device consuming that data.
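A toy model makes the trade-off concrete. This is not architectural code, just a sketch of the example system above, where Device A snoops the SLC and Device B only sees memory beyond the PoC:

```python
# Toy model of choosing how far to push data: Device A can see the
# System Level Cache, Device B only sees memory beyond the Point of
# Coherency, so the cheapest sufficient clean operation differs.

slc = {}      # System Level Cache contents
memory = {}   # backing memory (beyond the PoC)

def clean_to_slc(addr, data):
    slc[addr] = data                  # sufficient for Device A, cheaper

def clean_to_poc(addr, data):
    slc.pop(addr, None)
    memory[addr] = data               # required for Device B

def device_a_read(addr):
    return slc.get(addr, memory.get(addr))   # can hit in the SLC

def device_b_read(addr):
    return memory.get(addr)                  # bypasses the SLC

clean_to_slc(0x100, "frame")
assert device_a_read(0x100) == "frame"
assert device_b_read(0x100) is None   # an SLC push is not enough for B

clean_to_poc(0x100, "frame")
assert device_b_read(0x100) == "frame"
```

Topology-aware software can therefore pick the cheaper SLC-targeted operation when only Device A will consume the data.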
Armv9-A gives developers the programming tools and environment to innovate at pace for the rapidly expanding AI market. The models and data used for such applications are particularly valuable, so security is paramount. Arm’s Confidential Compute Architecture (CCA) leverages both hardware and software to protect data and applications in use.
Armv9.1-A introduced the Realm Management Extension (RME), which creates a separate computational world on the device to run and protect applications and data. The use of a realm prevents attacks from software that runs at higher privilege levels. The contents of a realm, or its processes, cannot be accessed. Data remains encrypted when in use, in transit, and at rest. Armv9.4-A introduced an update so realms can interact with an accelerator and maintain their integrity. Granular Data Isolation (GDI) builds on Armv9-A’s RME. GDI adds two new Physical Address Spaces (PASs) that a memory location can be assigned to:
What makes these two new PASs different to the existing options is that they are inaccessible to the processors. This allows software to allocate memory buffers to other devices, while the hardware maintains the confidentiality of data within those buffers. For example, the NSP PAS could be used by trusted accelerators to process data while guaranteeing the data is inaccessible to software.
Other enhancements introduced as part of the 2024 extensions include:
The Generic Interrupt Controller (GIC) is the standard solution for A-profile Arm systems and used widely across the Arm ecosystem. The current versions, GICv3 and GICv4, were introduced alongside Armv8-A in 2013. Since then, the shape of systems has evolved, as have the workloads they host. Arm is working on a new version of the GIC architecture, and we look forward to sharing a preview early in 2025.
This blog post provides a brief introduction to the latest features included in the Arm architecture, Armv9.6-A. More detailed information can be found on our developer website.
Over the coming months, Arm will be working with our partners to ensure that the software ecosystem is able to utilize these features as soon as future processors become available.