Introducing new AMBA 5 CHI protocol enhancements | Specification now available

In 2013, Arm announced the AMBA 5 CHI protocol to provide the performance and scale required for infrastructure applications such as networking and the data center. The protocol has been highly successful and has been the foundation for Arm many-core systems that scale up to 32 or more processors on a single system-on-chip.

Figure 1: New AMBA 5 CHI protocol enhancements

Today, Arm is happy to announce a major new revision to the CHI specification (AMBA 5 CHI Issue B), which adds many new capabilities and performance enhancements. These enhancements have been used to improve memory latency and increase data throughput on Arm’s latest generation of IP, including the Cortex-A75 and Cortex-A55 processors, the CoreLink CMN-600 Coherent Mesh Network, and the CoreLink DMC-620 Dynamic Memory Controller. Some of the key features and benefits include:

  • Armv8.1-A Large System Extensions
    • Far atomic operations enable the interconnect to perform high frequency updates to shared data
    • Improved virtualization with extended virtual machine IDs and virtual host extensions for type 2 hypervisors
    • Support for up to 52-bit physical address space for more addressable memory in a coherency system
  • Performance Extensions and Latency Reduction
    • Cache stashing allows accelerators or IO devices to stash critical data within a CPU cache for low latency access
    • Direct Data Transfer offers significant latency reduction with fast data path return and memory prefetch
  • Enhanced RAS aligning with Armv8.2-A Architecture
    • Adds end-to-end data protection and poison signaling
    • Enables common error signaling, logging and reporting for CPUs, interconnects and memory controllers

The specification is now officially available; click the button below for access!

Download the new AMBA specification

Evolution from AXI to CHI

AMBA 5 CHI's roots go back to 2003, when AMBA 3 AXI (Advanced eXtensible Interface) was introduced. AXI went on to become the most widely adopted AMBA standard, with the ability to connect hundreds of masters and slaves in a complex SoC.

AMBA 4 not only added new capabilities with AXI4, but also introduced cache coherency with ACE (AXI Coherency Extensions). The ACE protocol was used extensively by interconnects to support big.LITTLE heterogeneous processing, in which two processors sharing the same instruction set architecture (ISA) are paired: a “LITTLE” processor for efficiency and a “big” processor for performance.

Building on the success of ACE, AMBA 5 CHI was developed to deliver the higher performance and scalability required for infrastructure applications such as networking and the data center.

Figure 2: evolution from AMBA 3 AXI to AMBA 5 CHI 

AMBA 5 CHI overview

AMBA 5 CHI has been architected to maintain performance as the number of components and the quantity of traffic rise. It provides high-frequency, non-blocking data transfers, and its layered architecture is well suited to packetized on-chip networks. The protocol provides flow control and Quality of Service (QoS) mechanisms that control how system resources shared by many processors are allocated, without needing a detailed understanding of every component and how they might interact.

Since the CHI specification separates the protocol and transport layers, it allows differing implementations to provide the optimal trade-off between performance, power, and area. Designers can choose from a range of on-chip network topologies, from an efficient, small crossbar to a high-performance, large-scale mesh network.

Figure 3: AMBA 5 CHI layered architecture for diverse topologies

When constructing a CHI system, different types of components, such as processors, accelerators, IO, and memory, are connected to the on-chip network. At a high level there are three base node types: Request, Home, and Slave:

  • Request Node (RN) – A node that generates protocol transactions, including reads and writes, to the interconnect.  These nodes could be fully coherent processors or IO coherent devices.
  • Home Node (HN) – A node located within the interconnect that receives protocol transactions from RNs.  The HN is the point-of-coherency for the system and may include a system level cache and/or a snoop filter to reduce redundant snoops.
  • Slave Node (SN) – A node that receives and completes requests from the HNs.  An SN could be a peripheral or main memory.

Figure 4: nodes within a CHI system

A closer look at cache stashing and atomics

Covering all the new capabilities in a single blog simply isn't possible.  In the future, we'll discuss how Direct Data Transfer significantly reduces latency and transport cycles, how the RAS data protection signaling ensures data arrives at its destination without corruption, and how poisoning is used for deferred error handling.  However, two features, cache stashing and atomics, are worth exploring in more detail as they typically generate the most discussion.

Cache stashing

For data-throughput workloads, such as networking and storage, stashing provides a very valuable tool for ensuring that critical data (such as a network packet header) is processed quickly, so the processor can move on to the next item in the queue.

To retrieve data, a processor performs a load instruction, and the data is returned from one of the caches in the hierarchy or from memory.  A miss can stall the processor for hundreds of cycles while it waits for data from external memory.  Cache stashing provides a mechanism for an RN to place data as close as possible to its point of consumption (i.e. a processor cache), eliminating these stalls.

Figure 5: cache stashing

Far atomic operations

Atomic operations were introduced with the Armv8.1-A instruction set and enable high-frequency updates to shared data resources.  Atomics can be used as an alternative to load-exclusive/store-exclusive sequences and become more beneficial as the system scales to a larger number of requestors.

To illustrate the benefit of atomics, consider a simple shared counter.  If multiple requestors share the data structure, it is likely to reside in the interconnect's system-level cache.  If these requestors all increment it at the same time, the interconnect Home Node can perform each operation atomically within the system cache, avoiding the need for each requestor to pull the line into its own cache and hold it for long periods.

Figure 6: far atomic operations

Summary

Building on the success of the first AMBA 5 CHI release, new capabilities such as cache stashing and atomics give system designers a new set of tools to achieve the performance and efficiency required for target markets ranging from mobile and automotive to networking and the data center.  We have been working with leading SoC designers and the EDA community to develop the new specification revision, so over the coming days, weeks, and months, be on the lookout for more product announcements and discussion of the new CHI protocol enhancements.

As mentioned at the beginning of this blog, the new revision is now available for download.  To get access, please use the following link.

Download the new AMBA specification

Learn more about Arm AMBA

 
