Extended System Coherency: Part 1 - Cache Coherency Fundamentals

Neil Parris
December 3, 2013

Chinese Version 中文版:扩展系统一致性 - 第 1 部分 - 缓存一致性基本信息

Introduction

The theme of TechCon 2013 was “Where intelligence connects” and in many ways hardware system coherency is an important part of connecting the intelligence of an SoC. This year I presented "Extended System Coherency for Mobile and Beyond" which introduced the fundamentals of cache coherency, discussed implementations and looked at use cases. This blog is the first in a series and starts with cache coherency fundamentals.

So what do we mean by ‘coherency’?

Let’s go back to basics: what does coherency actually mean? Coherency is about ensuring that all processors, or bus masters, in the system see the same view of memory. For example, if a processor creates a data structure and then passes it to a DMA engine to move, both the processor and the DMA engine must see the same data. If that data were cached in the CPU and the DMA engine read from external DDR, the DMA engine would read old, stale data.

There are three mechanisms to maintain coherency:

  • Disabling caching is the simplest mechanism, but it can cost significant CPU performance. To get the highest performance, processors are pipelined to run at high frequency and to run from caches, which offer very low latency. Caching data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as “non-cached” can therefore hurt both performance and power.
  • Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean or flush dirty data from caches, and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
  • Hardware managed coherency offers an alternative to simplify software. With this solution any cached data marked ‘shared’ will always be up to date, automatically. All processors and bus masters in that sharing domain see the exact same value.

Challenges with software coherency

A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.

Software managed coherency manages cache contents with two key mechanisms:

  • Cache Cleaning (flushing):
    • If any data stored in a cache is modified, it is marked as ‘dirty’ and must be written back to DRAM at some point in the future. The process of cleaning or flushing caches forces dirty data to be written out to external memory.
  • Cache Invalidation:
    • If a processor has a local copy of data, but an external agent has since updated main memory, then the cache contents are out of date, or ‘stale’. Before reading this data, the processor must remove the stale data from its caches; this is known as ‘invalidation’ (the cache line is marked invalid). An example is a region of memory used as a shared buffer for network traffic, which may be updated by a network interface’s DMA hardware; a processor wishing to access this data must invalidate any old, stale copy before reading the new data. The sketch after this list shows both operations in a typical DMA hand-over.
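To make the clean/invalidate choreography concrete, below is a minimal sketch of a driver-style hand-over between a CPU and a DMA engine. The helper names (cache_clean_range, cache_invalidate_range, dma_copy, dma_wait_done) are hypothetical placeholders for platform-specific operations, not a real ARM or OS API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform helpers: on a real system these would map to
 * architecture-specific cache maintenance instructions or OS services
 * (for example, the Linux DMA-mapping API). */
extern void cache_clean_range(void *addr, size_t len);      /* write dirty lines to DRAM */
extern void cache_invalidate_range(void *addr, size_t len); /* discard stale lines */
extern void dma_copy(void *dst, const void *src, size_t len);
extern void dma_wait_done(void);

/* The CPU produces a buffer, the DMA engine consumes it and writes a
 * result back, then the CPU reads the result. With software-managed
 * coherency the driver must bracket each hand-over with maintenance. */
void shared_buffer_example(uint8_t *out, uint8_t *in, uint8_t *dev, size_t len)
{
    /* 1. CPU fills 'in'; the data may still be dirty in its caches. */
    for (size_t i = 0; i < len; i++)
        in[i] = (uint8_t)i;

    /* 2. Clean: force dirty lines out to DRAM so the DMA reads current data. */
    cache_clean_range(in, len);
    dma_copy(dev, in, len);
    dma_wait_done();

    /* 3. The DMA writes 'out' in DRAM behind the CPU's back, so any
     *    cached copy is now stale. Invalidate before the CPU reads it. */
    dma_copy(out, dev, len);
    dma_wait_done();
    cache_invalidate_range(out, len);

    /* 4. Safe to read: the CPU fetches the fresh data from DRAM. */
    (void)out[0];
}
```

Getting either step wrong, cleaning after the DMA has already started, or reading before invalidating, is exactly the kind of timing bug the next section describes.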

Challenge 1: Software Complexity

Quote from a system architect at an application processor vendor:

“50% of debug time is spent on SW coherency issues as these are difficult to find and pinpoint.”

Software coherency is hard to debug: cache cleaning and invalidation must be done at exactly the right time. If done too often, it wastes power and CPU effort. If done too rarely, it results in stale data which may cause unpredictable application behavior, if not a crash. Debugging this is extremely difficult, because it presents as occasional data corruption.

“We would like to connect more devices with hardware coherency to simplify software and accelerate product schedules.”

The quotes above are from an application processor vendor that is looking to connect more hardware accelerators and interfaces to a coherent interconnect to help reduce the time to market for new products.

Quote from a networking and modem partner:

“Only a few people in our software group understand the careful timing required to share data between the processor and radio subsystem. Scaling this to a hundreds-strong software team is very difficult!”

Another partner building modem systems with a Cortex-A CPU is looking to hardware coherency to simplify software.

Challenge 2: Performance and power

Where there are high rates of sharing between requesters, the cost of software cache maintenance can be significant, and can limit performance. For example, ARM benchmarking has found that a networking application which processes the header of every data packet might spend more than a third of its CPU cycles on cache maintenance. Part of the challenge is working out which data needs to be maintained. In the worst case, the complete cache contents must be flushed, which may displace valuable data that then has to be read back from DRAM.

The chart below shows a simple example of DMA transfer performance for hardware versus software coherency. In this example, the advantage of hardware coherency grows as the amount of dirty data in the processor caches (the hit rate) increases, because the software coherency version takes longer to clean and invalidate the cache when it holds more dirty data.

[Chart: DMA transfer performance, hardware coherency versus software coherency]
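As a back-of-the-envelope illustration of that trend, the toy C model below compares the two paths as the dirty fraction grows. Every constant is invented for illustration; none of these numbers come from ARM’s benchmarking:

```c
#include <stdio.h>

/* Toy cost model of the trade-off in the chart above. All constants are
 * invented for illustration and are NOT ARM benchmark data; they just
 * show why software coherency slows down as the fraction of the buffer
 * that is dirty in the CPU caches grows. Costs are in cycles per line. */
int main(void)
{
    const double lines          = 1024.0; /* buffer size in cache lines    */
    const double dram_per_line  = 8.0;    /* fetch a line from external DDR */
    const double snoop_per_line = 3.0;    /* fetch a line from a CPU cache  */
    const double clean_per_line = 6.0;    /* write back one dirty line      */
    const double inval_per_line = 1.0;    /* invalidate one line            */

    for (double dirty = 0.0; dirty <= 1.0; dirty += 0.25) {
        /* Software coherency: clean the dirty lines, invalidate the whole
         * buffer, then the DMA reads everything from DRAM. */
        double sw = lines * (dirty * clean_per_line + inval_per_line + dram_per_line);
        /* Hardware coherency: dirty lines are snooped straight from the
         * on-chip caches; only the rest comes from DRAM. */
        double hw = lines * (dirty * snoop_per_line + (1.0 - dirty) * dram_per_line);
        printf("dirty=%3.0f%%  software=%7.0f  hardware=%7.0f cycles\n",
               dirty * 100.0, sw, hw);
    }
    return 0;
}
```

With these assumed costs, the software path degrades linearly with dirty data while the snooping path improves, reproducing the shape of the chart.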

Extending hardware coherency to the system

[Diagram: full coherency between ACE processor clusters]

Hardware coherency is not a new concept. In fact, the first ARM implementation was in the ARM11 MPCore processor, where up to four processor cores are integrated in a single cluster and can run as a “Symmetric Multi-Processor” (SMP), with visibility of each other’s L1 caches and a shared L2. This technology is supported by all the latest ARM Cortex application processors.

Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM released the AMBA 4 ACE specification which introduces the “AXI Coherency Extensions” on top of the popular AXI protocol. The full ACE interface allows hardware coherency between processor clusters and allows an SMP operating system to extend to more cores. With the example of two clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory (DDR).
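As a conceptual model only, the decision flow for a shareable read in a two-cluster system looks roughly like the C sketch below. The real ACE protocol defines many transaction types and cache-line states (ReadShared, ReadUnique, and so on) that this deliberately ignores, and the cluster type with its snoop callback is a modelling construct, not any real interconnect API:

```c
#include <stdbool.h>
#include <stdint.h>

/* A deliberately simplified model of the routing decision an ACE
 * interconnect makes for a shareable read in a two-cluster system. */
typedef struct cluster cluster_t;
struct cluster {
    /* Returns true and fills *data if this cluster's caches hold the line. */
    bool (*snoop)(cluster_t *self, uintptr_t addr, uint64_t *data);
};

extern uint64_t ddr_read(uintptr_t addr); /* fallback: external memory */

/* Shareable read issued by 'requester' with one peer cluster. */
uint64_t coherent_read(cluster_t *requester, cluster_t *peer, uintptr_t addr)
{
    uint64_t data;

    /* 1. Hit in the requester's own caches: no snooping required. */
    if (requester->snoop(requester, addr, &data))
        return data;

    /* 2. Snoop the peer cluster: if the line is already on chip,
     *    transfer it cache-to-cache instead of touching DDR. */
    if (peer->snoop(peer, addr, &data))
        return data;

    /* 3. Miss everywhere: fetch the line from external memory. */
    return ddr_read(addr);
}
```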

The AMBA 4 ACE-Lite interface is designed for IO (or one-way) coherent system masters like DMA engines, network interfaces and GPUs. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but not cache shareable data.

While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency. One example is big.LITTLE Global Task Scheduling.
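To see the software simplification, compare the earlier software-coherency sketch with the same hand-over when the DMA engine is connected as an IO-coherent ACE-Lite master and the buffers are mapped as shareable (reusing the same hypothetical helper declarations):

```c
/* Same hand-over as before, but the interconnect now snoops the CPU
 * caches on the DMA engine's behalf: the explicit clean/invalidate
 * calls simply disappear. Assumes the hypothetical dma_copy() and
 * dma_wait_done() helpers declared in the earlier sketch. */
void shared_buffer_io_coherent(uint8_t *out, uint8_t *in, uint8_t *dev, size_t len)
{
    for (size_t i = 0; i < len; i++)
        in[i] = (uint8_t)i;     /* may stay dirty in the CPU cache */

    dma_copy(dev, in, len);     /* reads snoop dirty lines on chip */
    dma_wait_done();

    dma_copy(out, dev, len);    /* writes invalidate stale CPU copies */
    dma_wait_done();

    (void)out[0];               /* CPU reads coherent data directly */
}
```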

[Diagram: IO coherency with ACE-Lite]

Summary

Cache coherency is an important concept to understand when sharing data. Disabling caches can impact performance; software coherency adds overheads and complexity; and hardware coherency manages sharing automatically, which can simplify software. The AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system.

The next blog in the series will explore implementations of hardware coherency and look at applications ranging from mobile, including big.LITTLE processing and GPU compute, to enterprise, including networking and servers.

Part 2 - Implementation, big.LITTLE, GPU Compute and Enterprise


Top Comments

  • Neil Parris over 10 years ago
    Hi Wangyong. Short answer: yes, the protocol and communication between clusters via CCI-400 ensures all L1 and L2 caches are coherent. Longer answer: ARM multi-processor clusters like Cortex-A15 have something...
  • Milind T over 9 years ago

    Hi Neil,

    One proposed SoC system consists of:

    1) A single ARM CPU with an L1 cache and L2 memory.

    2) One external master without any cache of its own, which needs shared access to the L2 memory.

    To maintain coherency, is the following system suitable: a single ARM CR5 processor with an ACP port?

    Specifically, is the hardware coherency provided by the ACP sufficient in this system, or are some software coherency functions needed too?

    Would a CA7 and CCI based system be needed, or advantageous, for this single-external-master system?

    The requirements are low latency and low power, hence a CR5-based system could be favoured.

    Please provide your comments.

    Thanks.

  • Neil Parris over 10 years ago

    Hi Pallavi.t - yes that's correct. When a snoop comes into the CPU cluster it's checking the SCU (which has a duplicate of all L1 cache tags) and the L2 tags. If there are multiple clusters powered up then we would need to check in both clusters. As you can see from part 3 of this blog post, when we introduce the snoop filter we can perform these look-ups in the interconnect instead.

    Thanks!

    Neil.

  • Pallavi over 10 years ago

    Hi Neil,

    Can you please clarify: for any I/O-coherent device, such as a GPU, to access the data, do two lookups need to happen, one in each cluster's SCU and L2 cache (shared among all cores in a cluster)?

    Please correct me if I'm wrong.

    Thanks!

  • Neil Parris over 10 years ago

    divcesar, latency will be different for each of the ARM cores. If you're using a specific ARM platform I'd recommend looking at the device manufacturer documentation for further details. Thanks!

  • Divino César over 10 years ago

    Do you have any information about the cache access latencies (L1, L2, and snooped accesses)?

    Thanks!
