The Heterogeneous System Architecture (HSA) Foundation is a not-for profit consortium for SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easier for software developers to take advantage of all the advanced processing hardware on a modern SoC. The CPU and GPU on a typical applications processor occupy a significant proportion of die area and applying these resources efficiently across multiple applications can improve the end user experience. Done right, efficiency can be gained in power, performance, programmability and portability.
This blog focuses on some of the hardware innovations and changes that are relevant to shared virtual memory and cache coherency, which are components of the HSA hardware specification.
Traditional memory systems defined separate memory for CPU and GPU. In the case of PCs, the GPU may have completely separate discrete memory chips on a different board. In these systems, any application that wants to share data between CPU and GPU will need to copy it from CPU memory to graphics memory at a significant cost of latency and power.
Mobile systems have had a unified memory system for many years where all processors can access the same physical memory. However, even though this is physically possible, the software APIs and memory management hardware and software may not allow this. Graphics buffers may still be defined separately from other memory regions and data sharing may still require an expensive copy of data between buffers.
Shared virtual memory (SVM) allows processors to see the same view of memory; specifically, the same virtual address on the CPU and GPU will point to the same physical memory location. With this architecture, an application only needs to pass a pointer between processors that are sharing data.
There are multiple ways to implement SVM; it doesn’t mean you have to share the exact same page table. The only requirement is that if a buffer is to be shared between processors then it must appear in the page tables for both memory management units (MMUs). With SVM in place, sharing data becomes as simple as passing a pointer between processors.
Let’s go back to basics and ask what does coherency mean? Coherency is about ensuring all processors, or bus masters in the system see the same data. For example if I have a processor which is creating a data structure in its local cache then passing it to a GPU, both the processor and GPU must see the same data. If the GPU reads from external DDR, the GPU will read old, stale data.
There are three mechanisms to maintain coherency:
A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.
Software managed coherency manages cache contents with two key mechanisms:
“We would like to connect more devices with hardware coherency to simplify software and accelerate product schedules”“50% of debug time is spent on SW coherency issues as these are difficult to find and pinpoint”
Quotes from a system architect at an application processor vendor.
Software coherency is hard to debug; the cache cleaning and invalidation must be done at the right time. If done too often it wastes power and CPU effort. If done too infrequently it will result in stale data which may cause unpredictable application behaviour, if not a crash. Debugging this is extremely difficult as it will present occasional data corruption.
Looking specifically at CPU and GPU sharing, this software cache maintenance will be difficult to optimize and applications on these systems will try and avoid sharing data due to cost and complexity. One middleware vendor using GPU compute with software coherency noted that the developers spent around 30% of their time architecting, implementing and debugging the data sharing including breaking down image data into sub-frames and careful timing of the mapping and unmapping functions.
When sharing is used with software coherency, the size of the task running on the GPU must be large enough to make it worthwhile, taking into account the cost of software coherency.
Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM released the AMBA 4 ACE specification which introduces the “AXI Coherency Extensions” on top of the popular AXI protocol. The full ACE interface allows hardware coherency between processor clusters and allows an SMP operating system to extend to more cores.
With the example of two clusters, any shared access to memory can ‘snoop’ into the other cluster’s caches to see if the data is already on chip; if not, it is fetched from external memory (DDR). In mobile, this has enabled the big.LITTLE processing model which improves performance and power efficiency by utilizing the right core to suit the size of the task.
The AMBA 4 ACE-Lite interface is designed for IO (or one-way) coherent system masters like DMA engines, network interfaces and accelerators. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but these would still need to be cleaned and invalidated by software.
While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency such as big.LITTLE processing.
While processor clusters have implemented cache coherency protocols for many years, this is a new area for GPUs. As applications look to share more data between CPU and GPU, hardware cache coherency ensures this can be done at a low cost in power and latency, which in turn makes it easier, more power efficient and higher performance than any software managed mechanism. Most importantly it makes it easy for the software developer to share data.
There are two ways a GPU could be connected with hardware coherency:
The following diagrams summarize what we’ve learned so far and also describe the coarse and fine grain shared virtual memory. These charts approximate elapsed time on the horizontal axis, and address space on the vertical axis.
The above chart shows traditional memory systems where software coherency required data to be cleaned from caches and copied between processors to ‘share’ the data. In additional to cache cleaning, the target cache would also need to invalidate any old data before reading new data from DRAM. This is time consuming and power hungry and limits the applications that can take advantage of heterogeneous processing.
With shared virtual memory the CPU and GPU can now share physical memory and operate on the same virtual address, which eliminates the copy. If we have an IO coherent GPU, in other words one-way coherent where GPU can read CPU caches, then we remove the need to clean data from CPU caches. However, because this is one-way, the CPU cannot see the GPU caches. This means the GPU caches must be cleaned with cache maintenance operations after processing completes. This ‘coarse-grain’ SVM means the processors must take turns accessing the shared buffer.
Finally, if we enable a fully coherent memory system then both CPU and GPU can see exactly the same data at all times, and we can use ‘fine-grained’ SVM. This means both processes can access the same buffer at the same time instead of taking turns. Handshaking between processors uses cross-device atomics. By removing all of the cache maintenance overheads we can get the best overall performance.
At this point it’s useful to map these hardware technologies to the software APIs. Compute APIs like OpenCL 2.0 can take full advantage of SVM and hardware coherency, and can run on HSA platforms. Not all OpenCL 2.0 implementations are the same; there are a number of optional features that can be enabled if the hardware supports it. These features can also be mapped to the HSA profiles: base profile and full profile, as shown in the table below.
HSA always requires hardware coherency, and with the base profile the scope of shared virtual memory can be limited to the shared buffers. This means only the shared buffers would appear in both CPU and GPU page tables, not the full system memory. This may be easier and lower cost to implement in hardware.
Full coherency is required for fine grain, and this enables both CPU and GPU to work on different addresses with the same data buffer at the same time.
Full coherency also allows the use of atomic operations, which allows processors to work on the same address within the same buffer. Atomic operations allow synchronization between threads, much like in a multi-core CPU. Atomics are optional for OpenCL but required for HSA.
For coarse grain, if hardware coherency is not present then it would need to use software managed coherency including cache maintenance operations, or optionally IO coherency for the GPU.
The hardware required to implement these technologies already exists today in the form of fully coherent processors and cache coherent interconnects. The interconnect is responsible for connecting processors, peripherals and memory together on the SoC. The AMD Kavari APU already has a fully coherent memory between the CPU and GPU. ARM offers IP such as the CoreLink CCI-550 Cache Coherent Interconnect, Cortex-A72 processor and the Mali-G71 GPU, which together support the full coherency and shared virtual memory techniques described above.
Interconnect innovations such as snoop filters, are essential to support scaling to higher performance memory systems. The snoop filter acts as a directory of processor cache contents and allows any memory access to be targeted directly to the processor that holds that data. More detail on this can be found in my blog on extended system coherency.
HSA, with full coherency and shared virtual memory, is all about delivering new, enhanced user experiences through advances in computing architectures that bring improvements across key areas:
Application developers now have access to the complete compute potential on an SOC, where workloads can be moved seamlessly between computing devices enabling right sized computing for the given task.
Ok, thanks this explains it. In other blog post about CCI-500 there seem to be also statement regarding invalidation of CPU cache during write.