This is the second part of a series of blogs about hardware coherency. In the first blog I introduced the fundamentals of cache coherency; this part covers implementations of hardware cache coherency and their use cases.
ARM’s first implementations of AMBA 4 ACE include the ARM CoreLink CCI-400 Cache Coherent Interconnect and the ARM Cortex-A15 and Cortex-A7 processors. These products were first released to our silicon partners in 2011, and we saw the first ARM big.LITTLE products come to market in 2013.
CoreLink CCI-400 has been licensed by over 24 partners to date for mobile and enterprise applications such as networking and microservers. CoreLink CCI-400 supports up to two AMBA 4 ACE processor clusters, allowing up to eight processor cores to see the same view of memory and run an SMP OS.
CoreLink CCI-400 supports all big.LITTLE combinations, including Cortex-A15 + Cortex-A7, Cortex-A17 + Cortex-A7, and Cortex-A57 + Cortex-A53 with full support for ARMv8-A including 64-bit. big.LITTLE processing is a power optimization technology from ARM in which high-performance ‘big’ cores and efficiency-tuned ‘LITTLE’ cores are combined with software to dynamically transition applications to the right processor at the right time.
Hardware coherency is fundamental to big.LITTLE processing as it allows the big and LITTLE processor clusters to see the same view of memory and run the same operating system. big.LITTLE software such as Global Task Scheduling (GTS) places tasks on the appropriate core at a given time. For moderate workloads, all processing may be performed on the LITTLE cores while the big cores are powered down. If a workload requires higher performance, a big core is powered up and the task migrated, while other moderate workloads continue to run on the LITTLE cores. big.LITTLE GTS allows all the cores on an SoC to run simultaneously; for example, a device with four big and four LITTLE cores will appear to the operating system as an octa-core processor.
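To make the idea concrete, here is a minimal user-space sketch of moving work between clusters. Note that GTS does this automatically inside the kernel scheduler, transparently to applications; the explicit sched_setaffinity calls and the core numbering (0-3 = LITTLE, 4-7 = big) are assumptions made purely for this illustration.

```c
/*
 * Illustrative only: GTS migrates tasks inside the kernel scheduler.
 * This sketch just demonstrates the underlying idea of moving work
 * between core clusters. The core numbering is an assumption; real
 * SoCs expose their topology via /sys/devices/system/cpu/.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int run_on_cores(int first, int last)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = this thread */
}

int main(void)
{
    run_on_cores(0, 3);   /* moderate workload: LITTLE cluster */
    /* ... light processing ... */

    run_on_cores(4, 7);   /* demanding phase: big cluster */
    /* ... heavy processing: hardware coherency means any cached data
     * written on the LITTLE cores is visible here without software
     * cache maintenance ... */

    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}
```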
GPU compute, with APIs such as OpenCL 1.1 Full Profile and Google RenderScript compute, unlocks the combined processing power of the CPU and GPU.
The ARM Mali-T600 series and Mali-T760 GPUs support AMBA 4 ACE-Lite for IO coherency with the CPU. This means that the GPU can read any shared data directly from the CPU caches, and its writes to shared memory will automatically invalidate the relevant lines in the CPU caches. Hardware coherency reduces the cost of sharing data between CPU and GPU, and allows tighter coupling.
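As a hedged illustration of what this buys you in software, the OpenCL 1.1 sketch below allocates a buffer with CL_MEM_ALLOC_HOST_PTR so the driver can share it zero-copy, and the CPU fills it through a simple map/unmap; on an IO coherent system no explicit cache cleaning is required around the GPU's reads. Error handling is omitted and the setup is the minimal single-device case.

```c
/* A minimal sketch of CPU/GPU data sharing over OpenCL 1.1. */
#include <CL/cl.h>
#include <string.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id   dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context       ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q   = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Let the driver allocate the memory so it can be shared zero-copy. */
    size_t bytes = 1024 * sizeof(float);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, NULL);

    /* Map into the CPU address space and fill the data. */
    float *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                  0, NULL, NULL, NULL);
    memset(p, 0, bytes);
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);

    /* ... enqueue kernels consuming 'buf' here; with IO coherency the
     * GPU snoops the CPU caches, so no manual cache flush is needed
     * beyond the normal map/unmap protocol ... */

    clReleaseMemObject(buf);
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return 0;
}
```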
GPU Compute applications include: computational photography, computer vision, modern multimedia codecs targeting Ultra HD resolutions such as HEVC and VP9, complex image processing and gesture recognition.
ARM is one of the founding members of the Heterogeneous System Architecture (HSA) Foundation. This foundation aims to provide a royalty-free specification that makes it easier to take advantage of the heterogeneous CPU, GPU and DSP hardware in an SoC. This includes shared virtual memory and a roadmap to a fully coherent GPU. These techniques will further reduce the cost of sharing data between processing engines.
See the HSA website for more information.
Enterprise applications such as networking and servers have high-performance serial interfaces such as PCI Express, Serial ATA and Ethernet. In most applications all of this data will be marked as shared, as there will be many cases where the CPU needs to access data from these serial interfaces. The picture below shows a simplified example system.
Example: network interface
There is a trend in networking applications to move functionality to software to allow an SoC to support multiple applications. This means that the SoC needs more processing nodes.
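To show where that shareability surfaces in driver software, here is a hedged fragment of a hypothetical Linux network driver; the my_nic names are invented for this example. On an SoC with hardware IO coherency the device is marked 'dma-coherent' in the device tree, so the streaming DMA cache-maintenance calls effectively become no-ops because the interface's ACE-Lite traffic snoops the CPU caches directly.

```c
/* A sketch, not a real driver: shows where IO coherency pays off. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/skbuff.h>

struct my_nic {
    struct device *dev;
    void          *rx_ring;
    dma_addr_t     rx_ring_dma;
};

static int my_nic_setup_rx(struct my_nic *nic, size_t ring_bytes)
{
    /* Coherent allocation: CPU and NIC always see the same data. */
    nic->rx_ring = dma_alloc_coherent(nic->dev, ring_bytes,
                                      &nic->rx_ring_dma, GFP_KERNEL);
    return nic->rx_ring ? 0 : -ENOMEM;
}

static void my_nic_rx_packet(struct my_nic *nic, struct sk_buff *skb,
                             dma_addr_t buf_dma, size_t len)
{
    /* On a non-coherent system this invalidates CPU cache lines for
     * the packet buffer; with hardware IO coherency it costs almost
     * nothing, since the NIC's writes already snooped the caches. */
    dma_sync_single_for_cpu(nic->dev, buf_dma, len, DMA_FROM_DEVICE);
    /* ... hand skb up to the network stack ... */
}
```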
The CCI-400 Cache Coherent Interconnect is being designed into a range of smaller enterprise applications including residential gateways, security appliances, WLAN enterprise access points, industrial communications and micro servers. These applications use a range of ARM processors, from Cortex-A7 to Cortex-A57 depending on the performance requirements, with up to eight cores in total and no L3 cache.
ARM has a range of interconnect products to extend performance across a range of core counts:
Ian Forsyth talks more about the CoreLink CCN products in this blog post.
The following table details key features of the CoreLink CCI-400:
Two of the most commonly asked questions are: how big is it, and how fast does it run? CoreLink CCI-400 has many configuration options, including register stages and transaction tracker sizes, which allow the interconnect area and performance to be optimized for a given application. At the low end the gate count gets down towards 100k gates. In terms of clock speed, our baseline implementation trials started at 533MHz on a CMOS 32LP process, but we see a number of partners implementing at higher speeds on smaller silicon geometries and with faster implementation techniques.
The following diagram demonstrates an example mobile applications processor with Cortex-A50 series processors, CoreLink MMU-500 System MMU and a range of CoreLink 400 system IP.
In this system the Cortex-A57 and Cortex-A53 provide the big.LITTLE processor combination and are connected to CCI-400 with AMBA 4 ACE to provide full hardware coherency. The Mali-T628 and IO Coherent masters connect to CCI-400 via AMBA 4 ACE-Lite interfaces. As described in the first blog, this IO coherency allows the IO coherent agents to read from processor caches.
The other components in the system include:
So how do you optimize for the best performance and power efficiency around CCI-400? One solution is to use the Streamline Performance Analyzer, part of the ARM DS-5 Development Studio. It brings together system performance metrics, software tracing, statistical profiling and power measurement, presenting them in a system dashboard to help you optimize the system.
The CCI-400 includes a Performance Monitoring Unit (PMU) which allows events to be counted to measure items like bandwidth, transaction stalls and cache hit rates. These counters can be visualized with the Streamline Performance Analyzer as shown in the screenshot above. This data can be shown alongside SoC power and processor activity to understand what is happening at a system level.
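If you are not using Streamline, the PMU counters can in principle be sampled directly from user space. The sketch below is only illustrative: the base address is SoC-specific and the register offset shown is a placeholder, so consult the CCI-400 TRM for the real PMU register map and event encodings.

```c
/*
 * Hedged sketch: CCI_BASE and PMU_CNT0_VALUE below are placeholders
 * (the base address is SoC-specific; see the CCI-400 TRM, DDI 0470,
 * for the actual PMU registers). Requires root for /dev/mem.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CCI_BASE       0x2C090000UL  /* hypothetical SoC-specific base */
#define CCI_MAP_SIZE   0x10000
#define PMU_CNT0_VALUE 0x9004        /* placeholder offset: see the TRM */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *cci = mmap(NULL, CCI_MAP_SIZE, PROT_READ,
                                  MAP_SHARED, fd, CCI_BASE);
    if (cci == MAP_FAILED) { perror("mmap"); return 1; }

    /* Sample a counter twice to estimate an event rate; what it means
     * (bandwidth, stalls, snoop hits) depends on how the counter was
     * programmed. */
    uint32_t before = cci[PMU_CNT0_VALUE / 4];
    sleep(1);
    uint32_t after = cci[PMU_CNT0_VALUE / 4];
    printf("events/sec: %u\n", after - before);

    munmap((void *)cci, CCI_MAP_SIZE);
    close(fd);
    return 0;
}
```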
In the first blog I described how the AMBA 4 ACE bus interface extends hardware cache coherency outside of the processor cluster and into the system. In this blog we looked at implementations of hardware coherency and applications from mobile, like big.LITTLE processing, and enterprise. At the heart of all these applications is a cache coherent interconnect like the CoreLink CCI-400. ARM as an IP provider is in a unique position to offer the complete solution of Cortex processor, Mali graphics and CoreLink cache coherent interconnect as well as tools and physical IP. I personally look forward to seeing more products come to market in 2014 taking full advantage of hardware cache coherency and AMBA 4 ACE, and I'd be interested in your plans or views on how this technology is helping you!
Thanks a lot!
Yes exactly, same address space. e.g. if you had a DMC-400 with 4 slave ports, 2 might be connected to the CCI (one port even, e.g. 0x0000, 0x2000, 0x4000..., one port odd 0x1000, 0x3000, 0x5000...), the other ports might be connected to subsystems like display which supports the full address range, 0x0000, 0x1000, 0x2000.... etc. The address 0x2000 is the same DRAM chip/bank/row no matter what slave port of the DMC the request arrived on.
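A small worked example of the even/odd mapping described above: the address bits immediately above the stripe size select the channel. The stripe size is a parameter here, since (as noted below) recent CCI-400 revisions support striping from 128B up to 4KB.

```c
#include <stdint.h>
#include <stdio.h>

/* Which memory channel services this physical address? */
static unsigned channel_for(uint64_t addr, uint64_t stripe, unsigned nchan)
{
    return (addr / stripe) % nchan;
}

int main(void)
{
    uint64_t stripe = 0x1000; /* 4KB, as in the DMC-400 example */
    for (uint64_t a = 0; a <= 0x5000; a += 0x1000)
        printf("0x%04llx -> channel %u\n",
               (unsigned long long)a, channel_for(a, stripe, 2));
    /* Output: 0x0000, 0x2000, 0x4000 -> channel 0; 0x1000, 0x3000,
     * 0x5000 -> channel 1, matching the 'even'/'odd' port split,
     * while every master still sees one flat address space. */
    return 0;
}
```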
Hi Neil,
Thanks a lot. So it depends on a dual-channel memory controller. If separate memory controllers are used, the 'even' addresses and the 'odd' addresses access different memory controllers and DRAMs. With the DMC-400, the 'even' addresses and the 'odd' addresses access the same memory channel and DRAMs, and the non-striped connection from the display also accesses the same memory channel, so they are all accessing the same address space. Right?
Best regards.
Hi Wangyong - lots of great questions here, I'll answer them one by one.
Regarding striping it's worth noting that the most recent versions of CoreLink CCI-400 also support finer grain striping, this is configurable in powers of 2 from 128B up to 4KB. The optimal stripe size may depend on properties of your memory controller, memory type used and traffic patterns. I'd expect the most likely stripe size may be above 256B and at or below 2KB.
In terms of connectivity, many mobile designs will connect the real time traffic from display controllers and video direct to the DMC as none of this data is "sharable" in the sense of hardware cache coherency. The connectivity to the memory controller will depend on the properties of that memory controller. For example the ARM DMC-400 can support up to 4 slave ports, and could support a striping connection from the CCI and non-striped connection from the display. If you were to look at the interfaces from CCI to DMC one port would have the 'even' addresses while the other had the 'odd', but they are all accessing the same address space.
If instead you had separate memory controllers for each memory channel then you would need an interconnect interleaving block to connect from the real-time & display masters to the multiple memory controllers.
The CoreLink MMU-500 serves a different purpose: it is there to allow translation from virtual address (VA) to intermediate physical address (IPA), or to physical address (PA). For example a display controller may want to work with a contiguous region of memory; this could be contiguous in VA or IPA space and scattered in PA memory. It could also help with virtualization, for example multiple virtual OSs each with their own intermediate physical address space.
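A toy model of that remapping (the four-entry page table below is invented for illustration): the buffer is contiguous in VA but scattered in PA, and the system MMU resolves each access.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 0x1000ULL

/* VA page index -> PA page base: contiguous VA, scattered PA. */
static const uint64_t page_table[4] = {
    0x80400000, 0x80A00000, 0x80150000, 0x80008000
};

static uint64_t translate(uint64_t va)
{
    return page_table[va / PAGE_SIZE] + (va % PAGE_SIZE);
}

int main(void)
{
    for (uint64_t va = 0; va < 4 * PAGE_SIZE; va += PAGE_SIZE)
        printf("VA 0x%05llx -> PA 0x%08llx\n",
               (unsigned long long)va,
               (unsigned long long)translate(va));
    return 0;
}
```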
Regarding CoreLink CCN-504, this supports up to 2 memory channels, and yes these memory channels are interleaved with striping.
Hopefully this answers your questions! Thanks, Neil.
I find that CCI-400 supports "M1 and M2, striped in 4KB regions, used to load-balance between two memory controllers when ADDRMAPx[1:0] = 0b11" in DDI0470F_cci400_r1p0_trm. The Display and Video Subsystem in this article accesses DDR directly, without the address decode of CCI-400. So if M1 and M2 striped in 4KB regions is enabled, is the MMU-500 in the path from the Display and Video Subsystem to DDR used to ensure that the Display and Video Subsystem accesses the same address regions as the Cortex-A57/A53 and Mali-T628?
Does CCN-504 also support this feature? I didn't find it in ccn504_r1p0_trm.