This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency; this part covers implementations of hardware cache coherency and their use cases.
ARM’s first implementations of AMBA 4 ACE include the ARM CoreLink CCI-400 Cache Coherent Interconnect and the ARM Cortex-A15 and Cortex-A7 processors. These products were first released to our silicon partners in 2011, and we saw the first ARM big.LITTLE products come to market in 2013.
CoreLink CCI-400 has been licensed by over 24 partners to date for mobile and enterprise applications such as networking and microservers. CoreLink CCI-400 supports up to two AMBA 4 ACE processor clusters, allowing up to eight processor cores to see the same view of memory and run an SMP OS.
CoreLink CCI-400 supports all big.LITTLE combinations, including Cortex-A15 + Cortex-A7, Cortex-A17 + Cortex-A7, and Cortex-A57 + Cortex-A53 with full support for ARMv8-A including 64-bit. big.LITTLE processing is a power optimization technology from ARM in which high-performance ‘big’ cores and efficiency-tuned ‘LITTLE’ cores are combined with software to dynamically transition applications to the right processor at the right time.
Hardware coherency is fundamental to big.LITTLE processing as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. big.LITTLE software such as Global Task Scheduling (GTS) places tasks on the appropriate core at a given time. For moderate workloads, all processing may be performed on the LITTLE cores while the big cores are powered down. If a workload requires higher performance, a big core is powered up and the task migrated, while other moderate workloads continue to run on the LITTLE cores. big.LITTLE GTS allows all the cores on an SoC to run simultaneously; for example, a device with four big and four LITTLE cores will appear to the operating system as an octa-core processor.
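To make the idea concrete, here is a minimal user-space sketch of moving work between clusters. Note that GTS does this automatically inside the kernel scheduler, transparently to applications; the explicit sched_setaffinity calls and the core numbering (0-3 = LITTLE, 4-7 = big) are assumptions made purely for this illustration.

```c
/*
 * Illustrative only: GTS migrates tasks inside the kernel scheduler.
 * This sketch just demonstrates the underlying idea of moving work
 * between core clusters. The core numbering is an assumption; real
 * SoCs expose their topology via /sys/devices/system/cpu/.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int run_on_cores(int first, int last)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = this thread */
}

int main(void)
{
    run_on_cores(0, 3);   /* moderate workload: LITTLE cluster */
    /* ... light processing ... */

    run_on_cores(4, 7);   /* demanding phase: big cluster */
    /* ... heavy processing: hardware coherency means any cached data
     * written on the LITTLE cores is visible here without software
     * cache maintenance ... */

    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}
```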
GPU compute, with APIs such as OpenCL 1.1 Full Profile and Google RenderScript compute, unlocks the combined processing power of the CPU and GPU.
The ARM Mali-T600 series and Mali-T760 GPUs support AMBA 4 ACE-Lite for IO coherency with the CPU. This means that the GPU can read any shared data directly from the CPU caches, and its writes to shared memory will automatically invalidate the relevant lines in the CPU caches. Hardware coherency reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.
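As a hedged illustration of what this buys you in software, the OpenCL 1.1 sketch below allocates a buffer with CL_MEM_ALLOC_HOST_PTR so the driver can share it zero-copy, and the CPU fills it through a simple map/unmap; on an IO coherent system no explicit cache cleaning is required around the GPU's reads. Error handling is omitted and the setup is the minimal single-device case.

```c
/* A minimal sketch of CPU/GPU data sharing over OpenCL 1.1. */
#include <CL/cl.h>
#include <string.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Let the driver allocate the memory so it can be shared zero-copy. */
    size_t bytes = 1024 * sizeof(float);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, NULL);

    /* Map into the CPU address space and fill the data. */
    float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                  0, NULL, NULL, NULL);
    memset(p, 0, bytes);
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);

    /* ... enqueue kernels consuming 'buf' here; with IO coherency the
     * GPU snoops the CPU caches, so no manual cache flush is needed
     * beyond the normal map/unmap protocol ... */

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```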
GPU Compute applications include: computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.
ARM is one of the founding members of the Heterogeneous System Architecture (HSA) Foundation. This foundation aims to provide a royalty-free specification that makes it easier to take advantage of the heterogeneous CPU, GPU and DSP hardware in an SoC. This includes shared virtual memory and a roadmap to a fully coherent GPU. These techniques will further reduce the cost of sharing data between processing engines.
See the HSA website for more information.
Enterprise applications such as networking and servers have high-performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared, as there will be many cases where the CPU needs to access data from these serial interfaces. The picture below shows a simplified example system.
Example: network interface
There is a trend in networking applications to move functionality to software to allow an SoC to support multiple applications. This means that the SoC needs more processing nodes.
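To show where that shareability surfaces in driver software, here is a hedged fragment of a hypothetical Linux network driver; the my_nic names are invented for this example. On an SoC with hardware IO coherency the device is marked 'dma-coherent' in the device tree, so the streaming DMA cache-maintenance calls effectively become no-ops because the interface's ACE-Lite traffic snoops the CPU caches directly.

```c
/* A sketch, not a real driver: shows where IO coherency pays off. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/skbuff.h>

struct my_nic {
    struct device *dev;
    void          *rx_ring;
    dma_addr_t     rx_ring_dma;
};

static int my_nic_setup_rx(struct my_nic *nic, size_t ring_bytes)
{
    /* Coherent allocation: CPU and NIC always see the same data. */
    nic->rx_ring = dma_alloc_coherent(nic->dev, ring_bytes,
                                      &nic->rx_ring_dma, GFP_KERNEL);
    return nic->rx_ring ? 0 : -ENOMEM;
}

static void my_nic_rx_packet(struct my_nic *nic, struct sk_buff *skb,
                             dma_addr_t buf_dma, size_t len)
{
    /* On a non-coherent system this invalidates CPU cache lines for
     * the packet buffer; with hardware IO coherency it costs almost
     * nothing, since the NIC's writes already snooped the caches. */
    dma_sync_single_for_cpu(nic->dev, buf_dma, len, DMA_FROM_DEVICE);
    /* ... hand skb up to the network stack ... */
}
```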
The CCI-400 Cache Coherent Interconnect is being designed into a range of smaller enterprise applications including residential gateways, security appliances, WLAN enterprise access points, industrial communications and micro servers. These applications use a range of ARM processors, from Cortex-A7 to Cortex-A57 depending on the performance requirements, with up to eight cores in total and no L3 cache.
ARM has a range of interconnect products to extend performance across a range of core counts:
Ian Forsyth talks more about the CoreLink CCN products in this blog post.
The following table details key features of the CoreLink CCI-400:
Two of the most commonly asked questions are: how big is it, and how fast does it run? CoreLink CCI-400 has many configuration options, including register stages and transaction tracker sizes, which allow the interconnect area and performance to be optimized for a given application. At the low end the gate count gets down towards 100k gates. In terms of clock speed, our baseline implementation trials started at 533MHz on a CMOS 32LP process, but we see a number of partners implementing at higher speeds on smaller silicon geometries and with faster implementation techniques.
The following diagram demonstrates an example mobile applications processor with Cortex-A50 series processors, CoreLink MMU-500 System MMU and a range of CoreLink 400 system IP.
In this system the Cortex-A57 and Cortex-A53 provide the big.LITTLE processor combination and are connected to CCI-400 with AMBA 4 ACE to provide full hardware coherency. The Mali-T628 and IO Coherent masters connect to CCI-400 via AMBA 4 ACE-Lite interfaces. As described in the first blog, this IO coherency allows the IO coherent agents to read from processor caches.
The other components in the system include:
So how do you optimize for the best performance and power efficiency around CCI-400? One solution is to use the Streamline Performance Analyzer, part of the ARM DS-5 Development Studio. It brings together system performance metrics, software tracing, statistical profiling and power measurement, presenting them in a system dashboard to help you optimize the system.
The CCI-400 includes a Performance Monitoring Unit (PMU) which allows events to be counted to measure items like bandwidth, transaction stalls and cache hit rates. These counters can be visualized with the Streamline Performance Analyzer as shown in the screenshot above. This data can be shown alongside SoC power and processor activity to understand what is happening at a system level.
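If you are not using Streamline, the PMU counters can in principle be sampled directly from user space. The sketch below is only illustrative: the base address is SoC-specific and the register offset shown is a placeholder, so consult the CCI-400 TRM for the real PMU register map and event encodings.

```c
/*
 * Hedged sketch: CCI_BASE and PMU_CNT0_VALUE below are placeholders
 * (the base address is SoC-specific; see the CCI-400 TRM, DDI 0470,
 * for the actual PMU registers). Requires root for /dev/mem.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CCI_BASE       0x2C090000UL  /* hypothetical SoC-specific base */
#define CCI_MAP_SIZE   0x10000
#define PMU_CNT0_VALUE 0x9004        /* placeholder offset: see the TRM */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *cci = mmap(NULL, CCI_MAP_SIZE, PROT_READ,
                                  MAP_SHARED, fd, CCI_BASE);
    if (cci == MAP_FAILED) { perror("mmap"); return 1; }

    /* Sample a counter twice to estimate an event rate; what it means
     * (bandwidth, stalls, snoop hits) depends on how the counter was
     * programmed. */
    uint32_t before = cci[PMU_CNT0_VALUE / 4];
    sleep(1);
    uint32_t after = cci[PMU_CNT0_VALUE / 4];
    printf("events/sec: %u\n", after - before);

    munmap((void *)cci, CCI_MAP_SIZE);
    close(fd);
    return 0;
}
```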
In the first blog I described how the AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system. In this blog we looked at implementations of hardware coherency and applications from mobile, like big.LITTLE processing, and enterprise. At the heart of all these applications is a cache coherent interconnect like the CoreLink CCI-400. ARM as an IP provider is in a unique position to offer the complete solution of Cortex processor, Mali graphics and CoreLink cache coherent interconnect as well as tools and physical IP. I personally look forward to seeing more products come to market in 2014 taking full advantage of hardware cache coherency and AMBA 4 ACE, and I'd be interested in your plans or views on how this technology is helping you!
Thanks a lot!
Yes exactly, same address space. e.g. if you had a DMC-400 with 4 slave ports, 2 might be connected to the CCI (one port even, e.g. 0x0000, 0x2000, 0x4000..., one port odd 0x1000, 0x3000, 0x5000...), the other ports might be connected to subsystems like display which supports the full address range, 0x0000, 0x1000, 0x2000.... etc. The address 0x2000 is the same DRAM chip/bank/row no matter what slave port of the DMC the request arrived on.
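A small worked example of the even/odd mapping described above: the address bits immediately above the stripe size select the channel. The stripe size is a parameter here, since (as noted below) recent CCI-400 revisions support striping from 128B up to 4KB.

```c
#include <stdint.h>
#include <stdio.h>

/* Which memory channel services this physical address? */
static unsigned channel_for(uint64_t addr, uint64_t stripe, unsigned nchan)
{
    return (addr / stripe) % nchan;
}

int main(void)
{
    uint64_t stripe = 0x1000; /* 4KB, as in the DMC-400 example */
    for (uint64_t a = 0; a <= 0x5000; a += 0x1000)
        printf("0x%04llx -> channel %u\n",
               (unsigned long long)a, channel_for(a, stripe, 2));
    /* Output: 0x0000, 0x2000, 0x4000 -> channel 0; 0x1000, 0x3000,
     * 0x5000 -> channel 1, matching the 'even'/'odd' port split,
     * while every master still sees one flat address space. */
    return 0;
}
```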
Hi Neil,
Thanks a lot. So it depends on a dual-channel memory controller. If separate memory controllers are used, the 'even' addresses and the 'odd' addresses access different memory controllers and DRAMs. With the DMC-400, the 'even' addresses and the 'odd' addresses access the same memory channel and DRAMs, and the non-striped connection from the display also accesses the same memory channel, so they are all accessing the same address space. Right?
Best regards.
Hi Wangyong - lots of great questions here, I'll answer them one by one.
Regarding striping it's worth noting that the most recent versions of CoreLink CCI-400 also support finer grain striping, this is configurable in powers of 2 from 128B up to 4KB. The optimal stripe size may depend on properties of your memory controller, memory type used and traffic patterns. I'd expect the most likely stripe size may be above 256B and at or below 2KB.
In terms of connectivity, many mobile designs will connect the real time traffic from display controllers and video direct to the DMC as none of this data is "sharable" in the sense of hardware cache coherency. The connectivity to the memory controller will depend on the properties of that memory controller. For example the ARM DMC-400 can support up to 4 slave ports, and could support a striping connection from the CCI and non-striped connection from the display. If you were to look at the interfaces from CCI to DMC one port would have the 'even' addresses while the other had the 'odd', but they are all accessing the same address space.
If instead you had separate memory controllers for each memory channel then you would need an interconnect interleaving block to connect from the real-time & display masters to the multiple memory controllers.
The CoreLink MMU-500 serves a different purpose: it is there to allow translation from virtual address (VA) to intermediate physical address (IPA), or to physical address (PA). For example a display controller may want to work with a contiguous region of memory; this could be contiguous in VA or IPA space and scattered in PA memory. It could also help with virtualization, for example multiple virtual OSs each with their own intermediate physical address space.
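A toy model of that remapping (the four-entry page table below is invented for illustration): the buffer is contiguous in VA but scattered in PA, and the system MMU resolves each access.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 0x1000ULL

/* VA page index -> PA page base: contiguous VA, scattered PA. */
static const uint64_t page_table[4] = {
    0x80400000, 0x80A00000, 0x80150000, 0x80008000
};

static uint64_t translate(uint64_t va)
{
    return page_table[va / PAGE_SIZE] + (va % PAGE_SIZE);
}

int main(void)
{
    for (uint64_t va = 0; va < 4 * PAGE_SIZE; va += PAGE_SIZE)
        printf("VA 0x%05llx -> PA 0x%08llx\n",
               (unsigned long long)va,
               (unsigned long long)translate(va));
    return 0;
}
```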
Regarding CoreLink CCN-504, this supports up to 2 memory channels, and yes these memory channels are interleaved with striping.
Hopefully this answers your questions! Thanks, Neil.
I find that CCI-400 supports "M1 and M2, striped in 4KB regions, used to load-balance between two memory controllers when ADDRMAPx[1:0] = 0b11" in DDI0470F_cci400_r1p0_trm. The Display and Video Subsystem in this article accesses DDR directly, without the address decode of CCI-400. So if M1 and M2 striped in 4KB regions is enabled, is the MMU-500 in the path from the Display and Video Subsystem to DDR used to ensure that the Display and Video Subsystem accesses the same address regions as the Cortex-A57/A53 and Mali-T628?
Does CCN-504 also support this feature? I didn't find it in ccn504_r1p0_trm.