Extended System Coherency: Part 2 - Implementation, big.LITTLE, GPU Compute and Enterprise

Neil Parris
February 17, 2014
7 minute read time.

A Chinese-language version (中文版) of this post is also available.

This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency. This part talks about the implementation of hardware cache coherency and use cases.

Implementing Hardware Coherency

ARM’s first implementations of AMBA 4 ACE include the ARM CoreLink CCI-400 Cache Coherent Interconnect, ARM Cortex-A15 and Cortex-A7 processors. These products were first released to our silicon partners in 2011, and we've seen the first ARM big.LITTLE products come to market in 2013.

CoreLink CCI-400 has been licensed by over 24 partners to date for mobile and enterprise applications such as networking or microservers. CoreLink CCI-400 supports up to two AMBA 4 ACE processor clusters allowing up to eight processor cores to see the same view of memory and run an SMP OS.

Mobile Applications: big.LITTLE processing

CoreLink CCI-400 supports all big.LITTLE combinations including Cortex-A15 + Cortex-A7, Cortex-A17 + Cortex-A7, and Cortex-A57 + Cortex-A53, with full support for ARMv8-A including 64-bit. big.LITTLE processing is a power optimization technology from ARM where high-performance ‘big’ cores and efficiency-tuned ‘LITTLE’ cores are combined with software to dynamically transition applications to the right processor at the right time.

Arm big.LITTLE performance graphic

Hardware coherency is fundamental to big.LITTLE processing as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. big.LITTLE software such as Global Task Scheduling (GTS) places tasks on the appropriate core at a given time. For moderate workloads all processing may be performed on the LITTLE cores while the big cores are powered down. If a workload requires higher performance, a big core is powered up and the task migrated, while other moderate workloads continue to run on the LITTLE cores. big.LITTLE GTS allows all the cores on an SoC to run simultaneously; for example, a device with four big and four LITTLE cores will appear to the operating system as an octa-core processor.
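Because both clusters share one coherent view of memory, the operating system sees a standard SMP machine and ordinary scheduling APIs work across all eight cores. The sketch below is purely illustrative and is not part of GTS (which migrates tasks automatically inside the kernel scheduler); the assumption that CPUs 4-7 form the big cluster is hypothetical and varies between SoCs.

    /* Illustrative only: shows that both clusters appear to Linux as one
     * SMP system. The assumption that CPUs 4-7 are the big cores is
     * hypothetical; check /sys/devices/system/cpu on a real device. */
    #define _GNU_SOURCE
    #include <sched.h>

    int pin_to_big_cluster(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 4; cpu <= 7; cpu++)   /* assumed big-core IDs */
            CPU_SET(cpu, &set);
        /* pid 0 = calling thread; memory stays coherent across clusters */
        return sched_setaffinity(0, sizeof(set), &set);
    }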

Mobile Applications: GPU Compute

GPU compute, with APIs such as OpenCL 1.1 Full Profile and Google RenderScript Compute, unlocks the combined processing power of the CPU and GPU.

The ARM Mali-T600 series and Mali-T760 GPUs support AMBA 4 ACE-Lite for IO coherency with the CPU. This means that the GPU can read any shared data directly from the CPU caches, and writes to shared memory will automatically invalidate the relevant lines in the CPU caches. Hardware coherency reduces the cost of sharing data between the CPU and GPU, and allows tighter coupling.
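As a rough sketch of what this enables, the following OpenCL 1.1 host code (not from the original post, error handling omitted) allocates a buffer both agents can access and hands it back and forth with map/unmap rather than explicit copies; on an IO coherent system the driver can avoid cache maintenance around these accesses.

    /* Sketch: sharing a buffer between CPU and GPU under OpenCL 1.1.
     * CL_MEM_ALLOC_HOST_PTR asks the driver to allocate memory visible to
     * both agents; map/unmap transfers ownership without a ReadBuffer/
     * WriteBuffer copy. Error handling is omitted for brevity. */
    #include <CL/cl.h>
    #include <string.h>

    cl_mem create_shared_buffer(cl_context ctx, cl_command_queue q,
                                const float *src, size_t n)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    n * sizeof(float), NULL, &err);

        /* CPU fills the buffer; with hardware IO coherency no explicit
         * cache clean/invalidate is needed around this access. */
        float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0,
                                      n * sizeof(float), 0, NULL, NULL, &err);
        memcpy(p, src, n * sizeof(float));
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL); /* GPU may read now */
        return buf;
    }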

GPU Compute applications include: computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.

ARM is one of the founding members of the Heterogeneous System Architecture (HSA) foundation. This foundation aims to provide a royalty free specification that makes it easier to take advantage of the heterogeneous CPU, GPU and DSP hardware in an SoC. This includes shared virtual memory and a roadmap to fully coherent GPU. These techniques will further reduce the cost of sharing data between processing engines.

See the HSA website for more information.

Enterprise Applications: Networking and Server

Enterprise applications such as networking and servers have high-performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared, because there are many cases where the CPU needs to access data from these serial interfaces. The picture below shows a simplified example system.

CoreLink CCI-400 Cache Coherent Interconnect diagram

Example: network interface

  • Incoming packet on Ethernet interface stored to DRAM
    • Shared writes will automatically invalidate any stale data in CPU caches
  • CPU processes packet headers
  • Ethernet interface forwards packet
    • Shared reads will look up in CPU cache and DRAM to find the latest data
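
To see what the coherent interconnect removes from software, compare the two paths in this hypothetical receive routine; the helper names (arch_invalidate_dcache_range, process_headers) are illustrative stand-ins, not taken from any real driver.

    /* Hypothetical sketch of an Ethernet receive path. Without hardware
     * coherency the driver must invalidate stale cache lines before reading
     * a freshly DMA'd packet; with ACE-Lite shared writes the interconnect
     * has already done the equivalent work. */
    #include <stddef.h>

    struct packet_desc { void *buf; size_t len; };

    /* Stand-ins for platform code; on real hardware the first would issue
     * data-cache invalidate-by-address operations. */
    static void arch_invalidate_dcache_range(void *addr, size_t len) { (void)addr; (void)len; }
    static void process_headers(void *buf, size_t len) { (void)buf; (void)len; }

    void rx_packet(struct packet_desc *d, int hw_coherent)
    {
        if (!hw_coherent) {
            /* Software-managed coherency: discard any stale cached copy. */
            arch_invalidate_dcache_range(d->buf, d->len);
        }
        /* Shared (coherent) DMA writes have already invalidated stale CPU
         * cache lines, so the headers can be read directly. */
        process_headers(d->buf, d->len);
    }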

There is a trend in networking applications to move functionality to software to allow an SoC to support multiple applications. This means that the SoC needs more processing nodes.

The CCI-400 Cache Coherent Interconnect is being designed into a range of smaller enterprise applications including residential gateways, security appliances, enterprise WLAN access points, industrial communications and micro servers. These applications use a range of ARM processors, from Cortex-A7 to Cortex-A57 depending on performance requirements, with up to eight cores in total and no L3 cache.

ARM has a range of interconnect products to extend performance across a range of core counts:

  • CoreLink CCI-400 Cache Coherent Interconnect
    • Up to 2 clusters, 8 cores
  • CoreLink CCN-504 Cache Coherent Network
    • Up to 4 clusters, 16 cores
    • Integrated L3 cache, 2 channel 72 bit DDR
  • CoreLink CCN-508 Cache Coherent Network
    • Up to 8 clusters, 32 cores
    • Integrated L3 cache, 4 channel 72 bit DDR

Ian Forsyth talks more about the CoreLink CCN products in this blog post.

CoreLink CCI-400 Cache Coherent Interconnect

The following list details the key features of the CoreLink CCI-400:

  • Slave interfaces: 2x ACE fully coherent interfaces supporting up to 8 processor cores (Cortex-A7, Cortex-A15, Cortex-A17, Cortex-A53 or Cortex-A57); 3x ACE-Lite IO coherent interfaces for GPU, accelerators and interfaces
  • Master interfaces: 2x ACE-Lite for memory, with a configurable interleaving (memory striping) option; 1x ACE-Lite for system
  • Quality of Service: integrated bandwidth and latency regulators, QoS Virtual Networks
  • Address space: 44-bit virtual, 40-bit physical (1TB); supports ARMv7-A and ARMv8-A
  • Performance: approximately 25GB/s sustained bandwidth at 533MHz with dual-channel memory
  • Area: can be optimized for the application, based on performance and frequency targets

Two of the most commonly asked questions are: how big is it, and how fast does it run? CoreLink CCI-400 has many configuration options, including register stages and transaction tracker sizes, which allow the interconnect area and performance to be optimized for a given application. At the low end the gate count gets down towards 100k gates. In terms of clock speed, our baseline implementation trials started at 533MHz on a CMOS 32LP process, but we see a number of partners implementing at higher speeds on smaller silicon geometries and with faster implementation techniques.

The following diagram demonstrates an example mobile applications processor with Cortex-A50 series processors, CoreLink MMU-500 System MMU and a range of CoreLink 400 system IP.

Mobile applications processor Cortex-A50 CoreLink MMU-500 and CoreLink-400

In this system the Cortex-A57 and Cortex-A53 provide the big.LITTLE processor combination and are connected to CCI-400 with AMBA 4 ACE to provide full hardware coherency. The Mali-T628 and IO Coherent masters connect to CCI-400 via AMBA 4 ACE-Lite interfaces. As described in the first blog, this IO coherency allows the IO coherent agents to read from processor caches.

The other components in the system include:

  • MMU-500 System MMU - provides stage 1 and/or stage 2 address translation to support virtualization of memory for system components.
  • TZC-400 TrustZone Address Space Controller - performs security checks on transactions to memory or peripherals and allows regions of memory to be marked as secure or protected.
  • DMC-400 Dynamic Memory Controller - provides dynamic memory scheduling and interfacing to external DDR2/3 or LPDDR2 memory.
  • NIC-400 Network Interconnect - provides fully configurable, hierarchical, low-latency connectivity for AMBA 4 AXI4, AMBA 3 AXI3, AHB-Lite and APB components.

Performance Analysis with ARM DS-5 Streamline Performance Analyzer

Arm DS-5 Streamline Performance Analyzer CCI-400

So how do you optimize for the best performance and power efficiency around CCI-400? One solution is to use the Streamline Performance Analyzer, which is part of the ARM DS-5 Development Studio. This brings together system performance metrics, software tracing, statistical profiling and power measurement, presenting them in a single system dashboard to help you optimize the system.

The CCI-400 includes a Performance Monitoring Unit (PMU) which allows events to be counted to measure things like bandwidth, transaction stalls and cache hit rates. These counters can be visualized with the Streamline Performance Analyzer as shown in the screenshot above. This data can be shown alongside SoC power and processor activity to understand what is happening at a system level.
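As an illustration of how such a counter might be turned into a bandwidth figure, the sketch below assumes one PMU event per data beat on the monitored interface and 16-byte (128-bit) data beats; the actual event codes, beat widths and register programming come from the CCI-400 TRM or the Linux perf driver, so treat the numbers as hypothetical.

    /* Hypothetical sketch: converting a CCI-400 PMU event count into
     * bandwidth. Assumes the counter increments once per 16-byte data beat;
     * event selection and counter access are not shown and would follow the
     * CCI-400 TRM or the Linux perf driver. */
    #include <stdint.h>
    #include <stdio.h>

    #define BYTES_PER_BEAT 16u   /* assumed 128-bit data beats */

    double bandwidth_gb_per_s(uint64_t beats, double window_s)
    {
        return (double)beats * BYTES_PER_BEAT / window_s / 1e9;
    }

    int main(void)
    {
        /* e.g. 1.6 billion beats observed over a 1-second window -> 25.6 GB/s */
        printf("%.1f GB/s\n", bandwidth_gb_per_s(1600000000ull, 1.0));
        return 0;
    }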

Summary

In the first blog I described how the AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system. In this blog we looked at implementations of hardware coherency and applications from mobile, like big.LITTLE processing, and enterprise. At the heart of all these applications is a cache coherent interconnect like the CoreLink CCI-400. ARM as an IP provider is in a unique position to offer the complete solution of Cortex processor, Mali graphics and CoreLink cache coherent interconnect as well as tools and physical IP. I personally look forward to seeing more products come to market in 2014 taking full advantage of hardware cache coherency and AMBA 4 ACE, and I'd be interested in your plans or views on how this technology is helping you!

Read Part 3 of this blog: CoreLink CCI-500
Comments
  • wangyong, over 10 years ago

    Thanks a lot!

  • Neil Parris, over 10 years ago

    Yes exactly, same address space. e.g. if you had a DMC-400 with 4 slave ports, 2 might be connected to the CCI (one port even, e.g. 0x0000, 0x2000, 0x4000..., one port odd 0x1000, 0x3000, 0x5000...), the other ports might be connected to subsystems like display which supports the full address range, 0x0000, 0x1000, 0x2000.... etc.  The address 0x2000 is the same DRAM chip/bank/row no matter what slave port of the DMC the request arrived on.

  • wangyong, over 10 years ago

    Hi Neil,

    Thanks a lot. So it depends on the dual channel memory controller. If separate memory controllers are used, the 'even' addresses and the 'odd' addresses access different memory controllers and DRAMs. Regarding DMC-400, the 'even' addresses and the 'odd' addresses will access the same memory channel and DRAMs, and the non-striped connection from the display also accesses the same memory channel, so they are all accessing the same address space. Right?

    Best regards.

  • Neil Parris, over 10 years ago

    Hi Wangyong - lots of great questions here, I'll answer them one by one.

    Regarding striping it's worth noting that the most recent versions of CoreLink CCI-400 also support finer grain striping, this is configurable in powers of 2 from 128B up to 4KB. The optimal stripe size may depend on properties of your memory controller, memory type used and traffic patterns. I'd expect the most likely stripe size may be above 256B and at or below 2KB.

    In terms of connectivity, many mobile designs will connect the real time traffic from display controllers and video direct to the DMC as none of this data is "sharable" in the sense of hardware cache coherency. The connectivity to the memory controller will depend on the properties of that memory controller. For example the ARM DMC-400 can support up to 4 slave ports, and could support a striping connection from the CCI and non-striped connection from the display. If you were to look at the interfaces from CCI to DMC one port would have the 'even' addresses while the other had the 'odd', but they are all accessing the same address space.

    If instead you had separate memory controllers for each memory channel then you would need an interconnect interleaving block to connect from the real-time & display masters to the multiple memory controllers.

    The CoreLink MMU-500 serves a different purpose: it is there to allow translation from virtual address (VA) to intermediate physical address (IPA), or to physical address (PA). For example, a display controller may want to work with a contiguous region of memory; this could be contiguous in VA or IPA space and scattered in PA memory. It could also help with virtualization, for example multiple virtual OSs each with their own intermediate physical address space.

    Regarding CoreLink CCN-504, this supports up to 2 memory channels, and yes these memory channels are interleaved with striping.

    Hopefully this answers your questions! Thanks, Neil.

  • wangyong, over 10 years ago

    Hi Neil,

    I find that CCI-400 supports "M1 and M2, striped in 4KB regions, used to load-balance between two memory controllers when ADDRMAPx[1:0] = 0b11" from DDI0470F_cci400_r1p0_trm. The Display and Video Subsystem accesses DDR directly, without the decode of CCI-400, in this article. So if M1 and M2 striped in 4KB regions is enabled, is the MMU-500 in the path from the Display and Video Subsystem to DDR used to ensure that the Display and Video Subsystem accesses the same address regions as the Cortex-A57/53 and Mali-T628?

    Does CCN-504 also support this feature? I didn't find it in ccn504_r1p0_trm.

    Best regards.
