By now you will have read the news about the latest ARM® Cortex®-A73 processor and Mali™-G71 GPU. These new processors deliver more performance in ever-thinner mobile devices and accelerate new use cases such as Virtual Reality (VR), Augmented Reality (AR) and the playback and capture of rich 4K content. However, these applications place increased demands on the system and require more data to be moved between processors, cameras, displays and memory. This is the job of the memory system.
To get the best user experience, the memory system must balance the demands of peak performance, low latency and high efficiency. The ARM CoreLink™ interconnect and memory controller IP provide the solution. ARM develops processor, multimedia and system IP together, including design, verification and performance optimization, to get the best overall system performance and to help our silicon partners get to market faster with reduced integration cost.
The memory system must deliver on three key fronts: peak performance, low latency and high efficiency. This blog describes how the latest CoreLink System IP delivers on these requirements.
The ARM CoreLink CCI-550 Cache Coherent Interconnect and DMC-500 Dynamic Memory Controller have been optimized to get the best from Cortex-A73 and Mali-G71. ARM big.LITTLE™ processing has relied on CCI products for full cache coherency between Cortex processors for a number of years. Now, for the first time, Mali-G71 offers a fully coherent memory interface with AMBA® 4 ACE. This means that sharing data between the CPU and GPU is easier to develop for, has lower latency and consumes less power.
GPU compute exists today, but with software or IO coherency it can be difficult to use. Here’s a quote from a middleware developer regarding the cost:
“30% of our development effort was spent on the design, implementation and debugging of complex software coherency.”
Mukund Srinivasan, VP Media Client Business, Ittiam Systems
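To give a sense of what that complexity looks like in practice, here is a rough sketch (my own illustration, not taken from the middleware in question) of the coarse-grained, software/IO-coherent OpenCL flow that a fully coherent memory system removes: every CPU access to a shared buffer has to be bracketed by map and unmap calls so the driver can perform cache maintenance or copies on the programmer's behalf, and the CPU and GPU can never work on the buffer at the same time.

```c
/* Illustrative only: a coarse-grained, software/IO-coherent flow.
 * Every CPU access to the shared buffer is bracketed by map/unmap calls
 * so the driver can perform cache maintenance or copies; the CPU and GPU
 * never touch the buffer at once. Error checking is omitted and the
 * kernel is assumed to take one __global float* argument. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

void run_coarse_grained(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernel, size_t n)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);

    /* CPU write access requires a map... */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = (float)i;

    /* ...and an unmap (implying a cache clean or a copy) before the GPU
     * is allowed to see the data. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Reading the result on the CPU needs another blocking map (implying a
     * cache invalidate) after the kernel has fully completed. */
    p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                    0, n * sizeof(float),
                                    0, NULL, NULL, NULL);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```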
A fully coherent CPU and GPU memory system offers a simplified programming model and improved performance efficiency. This is enabled by two fundamental technologies:
Hardware Coherency - which ensures all coherent processors see the same shared data and removes the need to clean and invalidate caches.
Shared Virtual Memory (SVM) - which lets the CPU and GPU use the same virtual addresses, so they can pass pointers to a shared buffer rather than copying data between separate allocations.
The following chart summarizes the benefit of these technologies and highlights how a fully coherent memory system can provide a ‘fine-grained’ shared virtual memory where the CPU and GPU can work on a shared buffer at the same time.
For a more detailed explanation see this blog:
Exploring How Cache Coherency Accelerates Heterogeneous Compute
OpenCL 2.0 is one API that enables programming with fine-grained SVM, and initial benchmarking at ARM is showing promising results. We have created a simple test called “Workload Balancing” that is designed to stress the processing and movement of data between the CPU and GPU. As you can see from the chart below, moving from software coherency to a fine-grained, fully coherent memory system can reduce overheads by as much as 90%.
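For reference, here is a minimal sketch of what the fine-grained SVM path looks like from the host side, assuming an OpenCL 2.0 platform and a device that reports fine-grained buffer SVM support; the "scale" kernel and buffer size are made up for illustration.

```c
/* A minimal sketch of fine-grained SVM with OpenCL 2.0.
 * Assumes the device reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER support;
 * error checking is omitted for brevity. */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    const char *src =
        "__kernel void scale(__global float *buf) {"
        "    size_t i = get_global_id(0);"
        "    buf[i] *= 2.0f;"
        "}";
    size_t n = 1024;

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* One allocation in fine-grained shared virtual memory: the same
     * pointer is valid on both the CPU and the GPU, with hardware
     * coherency keeping their caches in step. */
    float *data = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    /* The CPU writes directly through the shared pointer: no map/unmap
     * and no explicit cache maintenance. */
    for (size_t i = 0; i < n; i++)
        data[i] = (float)i;

    /* The GPU works on the very same allocation. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    printf("data[1] = %f\n", data[1]);   /* CPU reads the result in place */

    clSVMFree(ctx, data);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```

Compared with the coarse-grained sketch earlier, the key difference is that there is no map/unmap around the CPU accesses and no copy: both sides simply dereference the same pointer.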
A high performance and low latency path to memory for the Cortex processors is fundamental to providing a fluid and responsive experience for all applications. The snoop filter technology integrated into the CoreLink CCI-550 enables a higher peak performance and offers system power savings which are discussed later in the blog.
The following example shows how the snoop filter can improve the memory performance of Cortex-A73 in a system where the LITTLE Cortex-A53 cluster is idle and running at a low frequency. Under these conditions, without a snoop filter, any big-core memory access must snoop the slow-running LITTLE core and therefore sees a higher latency. This can slow down any application that accesses memory and may make the device feel sluggish and less responsive.
With the snoop filter enabled, memory requests are resolved by the snoop filter in the interconnect and see a consistently low latency, even when the LITTLE core is in a lower power state and running at a low clock frequency.
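Conceptually, the snoop filter is a directory in the interconnect that tracks which clusters may hold a copy of each cache line, so a request only triggers a snoop when another cluster might actually have the data. The sketch below is a deliberately simplified C model of that lookup, for illustration only; it does not describe the CCI-550 micro-architecture or its real sizes and policies.

```c
/* Toy model of a snoop-filter lookup (illustration only, not the CCI-550
 * design). The filter records which clusters may hold each cache line;
 * when the requesting cluster is the only possible holder, the access
 * goes straight to memory instead of snooping an idle, slow-clocked
 * cluster. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CLUSTERS   2       /* e.g. big and LITTLE (assumed)      */
#define FILTER_ENTRIES 4096    /* tracked lines - arbitrary size     */
#define LINE_SHIFT     6       /* 64-byte cache lines                */

typedef struct {
    uint64_t tag;                      /* which line this entry tracks  */
    bool     present[NUM_CLUSTERS];    /* clusters that may hold a copy */
    bool     valid;
} filter_entry_t;

static filter_entry_t snoop_filter[FILTER_ENTRIES];

/* Returns true if another cluster must be snooped (and which one),
 * false if the request can be sent directly to the memory controller. */
static bool needs_snoop(uint64_t addr, int requester, int *snoop_target)
{
    uint64_t line = addr >> LINE_SHIFT;
    filter_entry_t *e = &snoop_filter[line % FILTER_ENTRIES];

    if (e->valid && e->tag == line) {
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (c != requester && e->present[c]) {
                *snoop_target = c;
                return true;           /* line may be cached elsewhere */
            }
        }
    }
    return false;                      /* no other holder: skip the snoop */
}

int main(void)
{
    int target;
    /* With nothing recorded for this line, a big-cluster (cluster 0) read
     * avoids snooping the LITTLE cluster and sees only the DRAM latency. */
    if (!needs_snoop(0x80001040, 0, &target))
        printf("no snoop needed - fetch directly from DRAM\n");
    return 0;
}
```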
As can be seen from the chart below, when the snoop filter is enabled the memory tests in the Geekbench benchmark see a significant improvement, as much as 241%. Other tests, such as integer and floating point, run largely within the processor caches and do not access memory, so they see less of a benefit. Overall, the improvement in Geekbench score is as much as 28%. In terms of real-world applications, this would deliver a more fluid user experience.
Reducing latency can give a boost to any application that works with memory, especially gaming, VR, productivity and web browsing tasks. CoreLink CCI-550, NIC-450 and DMC-500 introduce a new interface called ‘QoSAccept’ which is designed to minimize the latency of important memory requests.
Benchmarking within ARM has shown a 38% reduction in latency through the interconnect for worst-case traffic; in this example a CPU workload is limited to one outstanding transaction.
For more details, refer to this whitepaper:
Whitepaper: Optimizing Performance for an ARM Mobile Memory Subsystem
Mobile devices are getting ever thinner while compute requirements keep increasing, which means the whole system must deliver improved power efficiency. The CoreLink CCI-550 and DMC-500 play an important role here as they are central to memory system power. The snoop filter technology allows the number of coherent devices to scale without negatively impacting system power consumption. In fact, the snoop filter saves power in two ways:
On-chip power savings - by resolving coherency in one central location instead of broadcasting snoops to every processor.
DRAM + PHY power savings - by reducing the number of expensive external memory accesses, whenever data is found in on-chip caches.
As the chart below demonstrates, we see more power savings as the number of coherent ACE interfaces increases and as the proportion of sharable data increases. In this example, “30% sharable” might represent a system where only the big.LITTLE CPU accesses are coherent, and “100% sharable” might represent a future GPU compute use case where all CPU and multimedia traffic is coherent.
While this example shows a system with four ACE interfaces, the CoreLink CCI-550 can scale to a total of six ACE interfaces to support systems with the highest-performance 32-core Mali-G71.
Cost, including die area, is always important to the silicon partner and OEM, and reducing the area of silicon gates also helps to reduce power. For these reasons CoreLink CCI-550 has been designed to scale from low-cost mobile up to high-resolution, high-performance tablets and clamshell devices. This scalability also allows the system integrator to tune the design to meet their exact system needs. In terms of peak system bandwidth, CoreLink CCI-550 can offer up to 60% more than the CoreLink CCI-500.
To summarize, the interconnect and memory controller play an important role in delivering the performance expected from the latest Cortex and Mali processors. As noted above, CoreLink CCI-550 and DMC-500 can give a 28% increase in Geekbench score, a 38% reduction in memory latency, and save potentially hundreds of milliwatts of memory system power. This is fundamental to delivering the best possible user experience within a strict power envelope.
ARM’s coherent interconnect products are silicon proven, have been implemented across a range of applications, and have been licensed over 60 times by silicon partners including AMD, HiSilicon, NXP, Samsung and Xilinx, to name a few.
I look forward to seeing CoreLink CCI-550 in the latest devices!
Please feel free to comment below if you have any questions.