Bitesize Bifrost 2: System coherency

August 31, 2016

In the first Bitesize Bifrost blog we introduced you to our new GPU architecture, Bifrost, and looked specifically at the extensive optimization and power-saving benefits provided by clause shaders. This time around we’re looking at system coherency, which allows the CPU and GPU to collaborate more effectively on workloads, and at why this was considered an important focus for our newest GPU architecture.

In earlier systems there was no coherency between the CPU and GPU. If you created data on the CPU and wanted the GPU to work on it, the CPU first had to write that data to main memory, where the GPU could see and access it. However, because the CPU operates with a cache, it was difficult to be certain that all of the data had actually reached main memory rather than simply sitting in the cache. The cache therefore had to be flushed, meaning its contents were written out to main memory and cleared, to ensure all the data was available to the GPU.

The problem is that if you forget to flush the cache, the consequences are unpredictable. In some instances all the data will already have been written out to main memory and you’ll have no problem, or the data may be only marginally out of date and still not cause major issues. However, if the data is largely outdated you can experience serious, visible errors. These are difficult to diagnose because running under a debugger changes the timing, and with it what is in the cache, which makes the error hard to reproduce and subsequently address.
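To make the failure mode concrete, here is a minimal sketch in Python of a write-back CPU cache sitting in front of a shared main memory. It is a toy model only: the class `WriteBackCache`, the function `gpu_read`, and the address `0x100` are hypothetical names chosen for illustration, and no real hardware or driver API is being described.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory   # shared "main memory", modelled as a dict
        self.dirty = {}        # address -> value not yet written back

    def write(self, addr, value):
        # Main memory is NOT updated here; the new value lives only in the cache.
        self.dirty[addr] = value

    def flush(self):
        # Write every dirty entry back to main memory, then clear the cache.
        self.memory.update(self.dirty)
        self.dirty.clear()


def gpu_read(memory, addr):
    # With no coherency, the GPU can only ever see main memory.
    return memory.get(addr, 0)


memory = {0x100: 1}
cpu_cache = WriteBackCache(memory)
cpu_cache.write(0x100, 42)

stale = gpu_read(memory, 0x100)   # flush forgotten: the GPU sees the old value
cpu_cache.flush()
fresh = gpu_read(memory, 0x100)   # after the flush the GPU sees the new value

print(stale, fresh)  # 1 42
```

In this model the stale read is deterministic. On real hardware, whether the old or the new value is seen depends on what the cache happens to have evicted, which is why the bug is so hard to reproduce.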

Additionally, as CPU cache sizes grow, the cost of flushing them grows too. The GPU may then only be worth using for large, data-heavy jobs that justify the cost of the cache clean, while the majority of jobs are quicker and easier to keep on the CPU because of this overhead.
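The trade-off can be expressed as a simple break-even calculation. The sketch below assumes a deliberately crude cost model in which offloading pays off only when GPU time plus flush time is less than CPU time. The flush cost and GPU speedup are made-up numbers for illustration, not measurements of any real system.

```python
def offload_is_worthwhile(cpu_time_us, gpu_time_us, flush_time_us):
    """True if running on the GPU, including the cache flush, beats the CPU."""
    return gpu_time_us + flush_time_us < cpu_time_us


FLUSH_US = 200.0   # hypothetical fixed cost of cleaning the CPU cache
SPEEDUP = 4.0      # hypothetical: the GPU runs this kind of job 4x faster

# A small job: 100 us on the CPU, 25 us on the GPU, plus the 200 us flush.
small_job = offload_is_worthwhile(100.0, 100.0 / SPEEDUP, FLUSH_US)

# A large job: 1000 us on the CPU, 250 us on the GPU, plus the 200 us flush.
large_job = offload_is_worthwhile(1000.0, 1000.0 / SPEEDUP, FLUSH_US)

# Smallest CPU-side job time at which offloading starts to pay off:
# cpu_time > flush / (1 - 1 / speedup)
break_even_us = FLUSH_US / (1.0 - 1.0 / SPEEDUP)

print(small_job, large_job)      # False True
print(round(break_even_us, 1))   # 266.7
```

Under these assumed numbers, any job that takes less than roughly 267 microseconds on the CPU is better left there. A larger flush cost pushes that threshold higher.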

Our previous generation of GPU architecture, Midgard, used a concept known as IO coherency, which was originally used for input/output peripherals. This allows the GPU to check the CPU’s cache when it requests data from memory, effectively asking the CPU whether it has the requested data in its cache. If it has, the GPU copies the data into its own cache directly from the CPU cache, without going via external memory. Both memory latency and external read bandwidth are significantly reduced as a result. However, this was a one-way system. Whilst the GPU also has caches of its own, in an IO-coherent system the CPU cannot peek into the GPU’s caches.
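The one-way nature of IO coherency can be sketched with a toy model like the one below. It assumes single-value "cache lines" with no eviction. The names `Cache`, `gpu_read`, and `cpu_read` are hypothetical, and the model says nothing about how the real snoop transactions are implemented.

```python
class Cache:
    """Toy cache: a dict of address -> value, with no eviction."""

    def __init__(self):
        self.lines = {}


def gpu_read(addr, gpu_cache, cpu_cache, memory):
    """GPU read under the one-way (IO-coherent) model."""
    if addr in gpu_cache.lines:
        return gpu_cache.lines[addr], "gpu cache"
    if addr in cpu_cache.lines:
        # Snoop hit: copy straight from the CPU cache, skipping external memory.
        gpu_cache.lines[addr] = cpu_cache.lines[addr]
        return gpu_cache.lines[addr], "cpu cache"
    gpu_cache.lines[addr] = memory[addr]
    return gpu_cache.lines[addr], "main memory"


def cpu_read(addr, cpu_cache, memory):
    """CPU read: it can only see its own cache and main memory."""
    if addr in cpu_cache.lines:
        return cpu_cache.lines[addr], "cpu cache"
    cpu_cache.lines[addr] = memory[addr]
    return cpu_cache.lines[addr], "main memory"


memory = {0x00: 0, 0x40: 0}
cpu_cache, gpu_cache = Cache(), Cache()

cpu_cache.lines[0x00] = 7   # CPU produces data; it sits in the CPU cache
gpu_result = gpu_read(0x00, gpu_cache, cpu_cache, memory)

gpu_cache.lines[0x40] = 9   # GPU produces data; it sits in the GPU cache
cpu_result = cpu_read(0x40, cpu_cache, memory)

print(gpu_result)  # (7, 'cpu cache')
print(cpu_result)  # (0, 'main memory')
```

The GPU picks up the CPU's fresh value without a flush, but the CPU still reads the stale value from main memory until the GPU cache is cleaned.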

As most of the required data in a graphics system flows from CPU to GPU rather than the other way around, this is an efficient tool for graphical tasks. Also, as GPU caches tend to be smaller, cleaning them at the end of a rendering pass is comparatively less costly, and it occurs at a single, regulated point in time, making it less likely to be missed.

However, compute workloads vary widely in size, and their data needs to be able to travel between the CPU and GPU in both directions. This is why our new Bifrost architecture introduces full system coherency to products in the High Performance roadmap, allowing both the CPU and GPU to access each other’s caches. This eliminates the need for software to clean the caches and allows the CPU and GPU to collaborate on smaller jobs as well as larger ones. It extends the potential uses of the GPU’s compute capability and removes the risk of the difficult-to-detect errors that occur when a cache clean operation is missed.
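As an illustration of what two-way coherency buys you, the sketch below has a CPU and a GPU take turns incrementing a shared counter with no flush anywhere in the code. It assumes a very simple invalidate-on-write scheme between two agents. The class `CoherentAgent` and the address `COUNTER` are hypothetical, and the real interconnect protocol is considerably more involved than this.

```python
class CoherentAgent:
    """Toy CPU or GPU whose cache can be snooped and invalidated by its peer."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.cache = {}
        self.peer = None

    def read(self, addr):
        if addr in self.cache:
            return self.cache[addr]
        if addr in self.peer.cache:
            value = self.peer.cache[addr]     # snoop the other agent's cache
        else:
            value = self.memory.get(addr, 0)  # fall back to main memory
        self.cache[addr] = value
        return value

    def write(self, addr, value):
        self.peer.cache.pop(addr, None)       # invalidate the peer's copy
        self.cache[addr] = value


memory = {}
cpu = CoherentAgent("cpu", memory)
gpu = CoherentAgent("gpu", memory)
cpu.peer, gpu.peer = gpu, cpu

COUNTER = 0x40
for step in range(6):
    agent = cpu if step % 2 == 0 else gpu     # alternate CPU, GPU, CPU, ...
    agent.write(COUNTER, agent.read(COUNTER) + 1)

final_cpu = cpu.read(COUNTER)
final_gpu = gpu.read(COUNTER)
print(final_cpu, final_gpu, memory)  # 6 6 {}
```

Both agents agree on the final value even though main memory was never written. That is the property that makes fine-grained sharing of small jobs practical.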

Because the Bifrost architecture can scale to 32 cores, we’ve redesigned the level two cache as a modular design that the cores access as a single cache. The cache size is configurable, allowing partners to choose the right balance of size and bandwidth for their specific system.
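One common way to make several cache modules behave as a single logical cache is to interleave cache lines across them by address. The sketch below shows that general idea only. The 64-byte line size, the count of four slices, and the modulo mapping in `slice_for_address` are all assumptions for illustration, not a description of how Bifrost actually distributes addresses.

```python
LINE_BYTES = 64   # assumed cache line size


def slice_for_address(addr, num_slices):
    """Map a byte address to a cache slice by interleaving whole lines."""
    return (addr // LINE_BYTES) % num_slices


# Eight consecutive cache lines spread evenly over four hypothetical slices.
addresses = [line * LINE_BYTES for line in range(8)]
slices = [slice_for_address(addr, 4) for addr in addresses]
print(slices)  # [0, 1, 2, 3, 0, 1, 2, 3]

# Every byte within one line lands in the same slice.
same_line = slice_for_address(64, 4) == slice_for_address(70, 4)
print(same_line)  # True
```

Because the mapping depends only on the address, software never needs to know which module holds a given line. The slices together look like one cache.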

The single logical cache is simple for software to work with, both in the driver and on the GPU, so we can make the most of reusing cached data between shader cores. Partial cache line support means that we can effectively use it as a merging write buffer, resulting in fewer partial writes to DRAM and improving overall bandwidth utilization. The GPU also supports TrustZone™ memory protection, working to enforce restrictions on protected memory accesses.
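The benefit of a merging write buffer can be shown by counting DRAM transactions for a stream of small writes. The sketch below assumes a 64-byte line and tracks which bytes of each line have been written. The function names and the write pattern are hypothetical, and real hardware would also have to handle eviction and ordering, which are ignored here.

```python
LINE_BYTES = 64   # assumed cache line size


def dram_writes_without_merging(writes):
    """Each small write goes to DRAM on its own as a partial-line write."""
    return len(writes)


def dram_writes_with_merging(writes):
    """Merge writes per line; return (full_line_writes, partial_line_writes)."""
    written = {}   # line index -> set of byte offsets written so far
    for addr, size in writes:
        for byte in range(addr, addr + size):
            written.setdefault(byte // LINE_BYTES, set()).add(byte % LINE_BYTES)
    full = sum(1 for offsets in written.values() if len(offsets) == LINE_BYTES)
    partial = len(written) - full
    return full, partial


# Eight 16-byte writes that fill lines 0 and 1, then one that starts line 2.
writes = [(i * 16, 16) for i in range(8)] + [(128, 16)]

unmerged = dram_writes_without_merging(writes)
merged = dram_writes_with_merging(writes)
print(unmerged, merged)  # 9 (2, 1)
```

Nine partial writes collapse into two full-line writes and a single partial one. Fewer partial writes to DRAM is exactly the kind of reduction the paragraph above refers to.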

As we look towards our next range of Bifrost-based GPUs, further advancements are on their way, so stay tuned and we’ll keep you up to date with the very latest in mobile graphics.

Sean Lumly over 9 years ago

Thanks for the post!
I found it very informative to understand the need for cache coherency as a way to improve performance by avoiding the need to flush caches to keep memory coherent between AP units (with their own caches). This should not only improve the performance of heterogeneous compute applications dramatically, but it should also have the benefit of reducing power considerably for these applications as well.