By now you will have read the news about the latest ARM® Cortex®-A73 processor and Mali™-G71 GPU. These new processors deliver more performance in ever-thinner mobile devices and accelerate new use cases such as Virtual Reality (VR), Augmented Reality (AR) and the playback and capture of rich 4K content. However, these applications place increased demands on the system and require more data to be moved between processors, cameras, displays and memory. This is the job of the memory system.
To get the best user experience, the memory system must balance the demands of peak performance, low latency and high efficiency. The ARM CoreLink™ interconnect and memory controller IP provide the solution. ARM develops processor, multimedia and system IP together, including design, verification and performance optimization, to get the best overall system performance and to help our silicon partners get to market faster with reduced integration cost.
The memory system must deliver on three key fronts: peak performance, low latency and high efficiency. This blog describes how the latest CoreLink System IP delivers on these requirements.
The ARM CoreLink CCI-550 Cache Coherent Interconnect and DMC-500 Dynamic Memory Controller have been optimized to get the best from Cortex-A73 and Mali-G71. ARM big.LITTLE™ processing has relied on CCI products for full cache coherency between Cortex processors for a number of years. Now, for the first time, Mali-G71 offers a fully coherent memory interface with AMBA® 4 ACE. This means that sharing data between the CPU and GPU is easier to develop for, has lower latency and consumes less power.
GPU compute exists today, but with software or IO coherency it can be difficult to use. Here’s a quote from a middleware developer regarding the cost:
“30% of our development effort was spent on the design, implementation and debugging of complex software coherency.”
Mukund Srinivasan, VP Media Client Business, Ittiam Systems
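To give a sense of what that complexity looks like in practice, here is a rough sketch (my own illustration, not taken from the middleware in question) of the coarse-grained, software/IO-coherent OpenCL flow that a fully coherent memory system removes: every CPU access to a shared buffer has to be bracketed by map and unmap calls so the driver can perform cache maintenance or copies on the programmer's behalf, and the CPU and GPU can never work on the buffer at the same time.

```c
/* Illustrative only: a coarse-grained, software/IO-coherent flow.
 * Every CPU access to the shared buffer is bracketed by map/unmap calls
 * so the driver can perform cache maintenance or copies; the CPU and GPU
 * never touch the buffer at once. Error checking is omitted and the
 * kernel is assumed to take one __global float* argument. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

void run_coarse_grained(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernel, size_t n)
{
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);

    /* CPU write access requires a map... */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = (float)i;

    /* ...and an unmap (implying a cache clean or a copy) before the GPU
     * is allowed to see the data. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Reading the result on the CPU needs another blocking map (implying a
     * cache invalidate) after the kernel has fully completed. */
    p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                    0, n * sizeof(float),
                                    0, NULL, NULL, NULL);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```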
A fully coherent CPU and GPU memory system offers a simplified programming model and improved performance efficiency. This is enabled by two fundamental technologies:
Hardware Coherency - which ensures all coherent processors see the same shared data and removes the need to clean and invalidate caches.
Shared Virtual Memory (SVM) - which lets the CPU and GPU use the same virtual addresses, so they can pass pointers to a shared buffer rather than copying data between separate allocations.
The following chart summarizes the benefit of these technologies and highlights how a fully coherent memory system can provide a ‘fine-grained’ shared virtual memory where the CPU and GPU can work on a shared buffer at the same time.
For a more detailed explanation see this blog:
Exploring How Cache Coherency Accelerates Heterogeneous Compute
OpenCL 2.0 is one API that enables programming with fine-grained SVM, and initial benchmarking at ARM is showing promising results. We have created a simple test called “Workload Balancing” that is designed to stress the processing and movement of data between the CPU and GPU. As you can see from the chart below, moving from software coherency to a fine-grained, fully coherent memory system can reduce overheads by as much as 90%.
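For reference, here is a minimal sketch of what the fine-grained SVM path looks like from the host side, assuming an OpenCL 2.0 platform and a device that reports fine-grained buffer SVM support; the "scale" kernel and buffer size are made up for illustration.

```c
/* A minimal sketch of fine-grained SVM with OpenCL 2.0.
 * Assumes the device reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER support;
 * error checking is omitted for brevity. */
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    const char *src =
        "__kernel void scale(__global float *buf) {"
        "    size_t i = get_global_id(0);"
        "    buf[i] *= 2.0f;"
        "}";
    size_t n = 1024;

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(ctx, device, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* One allocation in fine-grained shared virtual memory: the same
     * pointer is valid on both the CPU and the GPU, with hardware
     * coherency keeping their caches in step. */
    float *data = (float *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
        n * sizeof(float), 0);

    /* The CPU writes directly through the shared pointer: no map/unmap
     * and no explicit cache maintenance. */
    for (size_t i = 0; i < n; i++)
        data[i] = (float)i;

    /* The GPU works on the very same allocation. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    printf("data[1] = %f\n", data[1]);   /* CPU reads the result in place */

    clSVMFree(ctx, data);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```

Compared with the coarse-grained sketch earlier, the key difference is that there is no map/unmap around the CPU accesses and no copy: both sides simply dereference the same pointer.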
A high performance and low latency path to memory for the Cortex processors is fundamental to providing a fluid and responsive experience for all applications. The snoop filter technology integrated into the CoreLink CCI-550 enables a higher peak performance and offers system power savings which are discussed later in the blog.
The following example shows how the snoop filter can improve the memory performance of Cortex-A73 in a system where the LITTLE Cortex-A53 cluster is idle and running at a low frequency. Under these conditions, without a snoop filter, any big-core memory access must snoop the slow-running LITTLE core and therefore sees a higher latency. This can slow down any application that accesses memory and may make the device feel sluggish and less responsive.
With the snoop filter enabled, memory requests are resolved by the snoop filter in the interconnect and see a consistently low latency, even when the LITTLE core is in a lower power state and running at a low clock frequency.
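Conceptually, the snoop filter is a directory in the interconnect that tracks which clusters may hold a copy of each cache line, so a request only triggers a snoop when another cluster might actually have the data. The sketch below is a deliberately simplified C model of that lookup, for illustration only; it does not describe the CCI-550 micro-architecture or its real sizes and policies.

```c
/* Toy model of a snoop-filter lookup (illustration only, not the CCI-550
 * design). The filter records which clusters may hold each cache line;
 * when the requesting cluster is the only possible holder, the access
 * goes straight to memory instead of snooping an idle, slow-clocked
 * cluster. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CLUSTERS   2       /* e.g. big and LITTLE (assumed)      */
#define FILTER_ENTRIES 4096    /* tracked lines - arbitrary size     */
#define LINE_SHIFT     6       /* 64-byte cache lines                */

typedef struct {
    uint64_t tag;                      /* which line this entry tracks  */
    bool     present[NUM_CLUSTERS];    /* clusters that may hold a copy */
    bool     valid;
} filter_entry_t;

static filter_entry_t snoop_filter[FILTER_ENTRIES];

/* Returns true if another cluster must be snooped (and which one),
 * false if the request can be sent directly to the memory controller. */
static bool needs_snoop(uint64_t addr, int requester, int *snoop_target)
{
    uint64_t line = addr >> LINE_SHIFT;
    filter_entry_t *e = &snoop_filter[line % FILTER_ENTRIES];

    if (e->valid && e->tag == line) {
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (c != requester && e->present[c]) {
                *snoop_target = c;
                return true;           /* line may be cached elsewhere */
            }
        }
    }
    return false;                      /* no other holder: skip the snoop */
}

int main(void)
{
    int target;
    /* With nothing recorded for this line, a big-cluster (cluster 0) read
     * avoids snooping the LITTLE cluster and sees only the DRAM latency. */
    if (!needs_snoop(0x80001040, 0, &target))
        printf("no snoop needed - fetch directly from DRAM\n");
    return 0;
}
```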
As can be seen from the chart below, when the snoop filter is enabled the memory tests in the Geekbench benchmark see a significant improvement, as much as 241%. Other tests, such as integer and floating point, run largely within the processor caches and do not access memory, so they see less of a benefit. Overall, the improvement in Geekbench score is as much as 28%. In terms of real-world applications, this would deliver a more fluid user experience.
Reducing latency can give a boost to any application that works with memory, especially gaming, VR, productivity and web browsing tasks. CoreLink CCI-550, NIC-450 and DMC-500 introduce a new interface called ‘QoSAccept’ which is designed to minimize the latency of important memory requests.
Benchmarking within ARM has shown a 38% reduction in latency through the interconnect for worst-case traffic; in this example a CPU workload is limited to one outstanding transaction.
For more details, refer to this whitepaper:
Whitepaper: Optimizing Performance for an ARM Mobile Memory Subsystem
Mobile devices are getting ever thinner while compute requirements keep increasing, which means the whole system must deliver improved power efficiency. The CoreLink CCI-550 and DMC-500 play an important role here as they are central to memory system power. The snoop filter technology allows the number of coherent devices to scale without negatively impacting system power consumption. In fact, the snoop filter saves power in two ways:
On-chip power savings - by resolving coherency in one central location instead of broadcasting snoops to every processor.
DRAM + PHY power savings - by reducing the number of expensive external memory accesses, whenever data is found in on-chip caches.
As the chart below demonstrates, we see more power savings as the number of coherent ACE interfaces increases and as the proportion of sharable data increases. In this example, “30% sharable” might represent a system where only the big.LITTLE CPU accesses are coherent, and “100% sharable” might represent a future GPU compute use case where all CPU and multimedia traffic is coherent.
While this example shows a system with four ACE interfaces, the CoreLink CCI-550 can scale to a total of six ACE interfaces to support systems with the highest-performance 32-core Mali-G71.
Cost, including die area, is always important to the silicon partner and OEM, and reducing the area of silicon gates also helps to reduce power. For these reasons CoreLink CCI-550 has been designed to scale from low-cost mobile up to high-resolution, high-performance tablets and clamshell devices. This scalability also allows the system integrator to tune the design to meet their exact system needs. In terms of peak system bandwidth, CoreLink CCI-550 can offer up to 60% more than the CoreLink CCI-500.
To summarize, the interconnect and memory controller play an important role in delivering the performance expected from the latest Cortex and Mali processors. As noted above, CoreLink CCI-550 and DMC-500 can give a 28% increase in Geekbench score, a 38% reduction in memory latency, and save potentially hundreds of milliwatts of memory system power. This is fundamental to delivering the best possible user experience within a strict power envelope.
ARM’s coherent interconnect products are silicon proven, have been implemented across a range of applications, and have been licensed over 60 times by silicon partners including AMD, HiSilicon, NXP, Samsung and Xilinx, to name a few.
I look forward to seeing CoreLink CCI-550 in the latest devices!
Please feel free to comment below if you have any questions.