In 2013, Arm announced the AMBA 5 CHI protocol to provide the performance and scale required for infrastructure applications such as networking and the data center. The protocol has been highly successful and has been the foundation for Arm-based many-core systems that scale up to 32 or more processors on a single system-on-chip (SoC).
Today, Arm is happy to announce a major new revision of the CHI specification (AMBA 5 CHI Issue B), which adds many new capabilities and performance enhancements. These enhancements have been used to improve memory latency and increase data throughput on Arm’s latest generation of IP, including the Cortex-A75 and Cortex-A55 processors, the CoreLink CMN-600 Coherent Mesh Network, and the CoreLink DMC-620 Dynamic Memory Controller. Some of the key features and benefits are described below.
The specification is now officially available; click the button below for access!
[CTAToken URL = "http://infocenter.arm.com/help/topic/com.arm.doc.ihi0050b/index.html" target="_blank" text="Download the new AMBA specification" class ="green"]
AMBA 5 CHI's roots go back to 2003, when AMBA 3 AXI (Advanced eXtensible Interface) was introduced. AXI went on to become the most widely adopted AMBA standard, with the ability to connect hundreds of masters and slaves in a complex SoC.
AMBA 4 not only added new capabilities with AXI4, it also introduced cache coherency with ACE (AXI Coherency Extensions). The ACE protocol was used extensively by interconnects to support big.LITTLE applications: heterogeneous processing with two processors sharing the same instruction set architecture (ISA), a “LITTLE” processor for efficiency and a “big” processor for performance.
Building on the success of ACE, AMBA 5 CHI was developed to deliver the higher performance and scale required for infrastructure applications such as networking and data center.
Figure 1: evolution from AMBA 3 AXI to AMBA 5 CHI
AMBA 5 CHI has been architected to maintain performance as the number of components and the quantity of traffic rise. It provides high-frequency, non-blocking data transfers, and its layered architecture is well suited to packetized on-chip networks. The protocol provides flow control and Quality of Service (QoS) mechanisms to control how resources shared by many processors are allocated, without needing a detailed understanding of every component and how they might interact.
Since the CHI specification separates the protocol and transport layers, it allows differing implementations to provide the optimal trade-off between performance, power, and area. Designers can choose from a range of on-chip network topologies, from a small, efficient crossbar to a high-performance, large-scale mesh network.
Figure 2: AMBA 5 CHI layered architecture for diverse topologies
When constructing a CHI system, different types of nodes, such as processors, accelerators, I/O, and memory, are connected to the on-chip network. At a high level there are three base node types: requestor, home, and slave.
Figure 3: nodes within a CHI system
Covering all the new capabilities in a single blog simply isn't possible. In the future, we'll discuss how Direct Data Transfer significantly reduces latency and transport cycles, how the RAS data-protection signaling ensures data arrives at its destination without corruption, and how poisoning is used for deferred error handling. However, there are two features, cache stashing and atomics, that are worth exploring in more detail as they typically generate the most discussion.
For data-throughput workloads, such as networking and storage, stashing provides a very valuable tool to make sure critical data (such as a network packet header) is quickly processed so the processor can get to the next packet in the queue.
To retrieve data, a processor performs a load instruction, and the data is returned from one of the caches in the hierarchy or from memory. A load that has to go all the way to external memory can stall the processor for hundreds of cycles. Cache stashing provides a mechanism for a Request Node (RN) to place data as close as possible to its point of consumption (a processor cache), eliminating these stalls.
Figure 4: cache stashing
Atomic operations were introduced with the Armv8.1-A instruction set and provide high-frequency updates to shared data resources. Atomics can be used as an alternative to load-exclusive/store-exclusive sequences and become more beneficial as the system scales to a larger number of requestors.
To illustrate the benefit of atomics, consider a simple shared counter. If multiple requestors share the data structure, it is likely to reside in the interconnect's system-level cache. If these requestors all decide to increment it at the same time, the interconnect Home Node can perform the operation atomically within the system cache, avoiding the need for requestors to hold locks for long periods.
Figure 5: far atomic operations
Building on the success of the first AMBA 5 CHI release, new capabilities such as cache stashing and atomics give system designers a new set of tools to deliver the performance and efficiency required for target markets ranging from mobile to automotive to networking to the data center. We have been working with leading SoC designers and the EDA community to develop the new specification revision, so over the coming days, weeks, and months, be on the lookout for more product announcements and discussion of the new CHI protocol enhancements.