The memory sub-system is one of the most complex systems in a SoC, critical for overall performance of the chip. Recent years have witnessed explosive growth in the memory market with high-speed parts (DDR4/3 with/without DIMM support, LPDDR4/3) gaining momentum in mobile, consumer and enterprise systems. This has not only resulted in increasingly complex memory controllers (MC) but also PHYs that connect the memory sub-system to the external DRAM. Due to the high-speed transfer of data between the SoC and DRAM, it is necessary to perform complex training of memory interface signals for best operation.
Traditionally, MC and PHY integration was considered a significant challenge, especially if the two IP blocks originated from different vendors. The key reasons were the rapid evolution of memory protocols and a DFI boundary between controller and PHY that was incompletely specified, or in some cases ambiguous, with respect to requirements for MC-PHY training.
I’ll try to shed some light on this topic. Recently, with the release of the DFI 4.0 draft specification for MC-PHY interface, things certainly seem to be heading in the right direction. For folks unfamiliar with DFI, this is an industry standard that defines the boundary signals and protocol between any generic MC and PHY. Since the inception of DFI 1.0 back in 2006, the specification has steadily advanced to cover all aspects of MC-PHY operation encompassing all relevant DRAM technology requirements. The DFI 4.0 specification is more mature compared to previous releases and specifically focuses on backwards compatibility and MC-PHY interoperability.
But that’s not the only reason why MC-PHY integration has gotten easier. To understand this better, we need to examine how MC and PHY interact during training. There are 2 fundamental ways that training of memory signals can happen:
Interestingly, PHY IP providers have decided to take ownership of training by implementing support for PHY independent mode in their IP, thereby keeping the reins to optimize the PHY training algorithms based on their PHY architecture. With PHY complexity growing and challenges with closing timing at high DDR speeds, the support for PHY independent mode training adds a valuable differentiator for PHY IP providers.
With the PHY doing most of the heavy lifting during the training, the MC only needs to focus on two questions:
The MC thus deals with the PHY’s request for independent-mode training as an interrupt, something it needs to schedule along with the multitude of other things that it does for best memory operation. Training thus becomes a Quality-of-Service (QoS) exercise for the controller, with a different set of parameters to optimize. The positive in all this is that QoS is essentially what a good MC does very well.
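One way to picture this scheduling problem is a small toy model. The sketch below is not taken from the DFI specification or from any ARM controller: the function name `schedule_training`, the two priority labels and the cycle budget are all hypothetical. It assumes the controller may keep issuing latency-critical requests for a bounded number of cycles and must then hand the interface over to the PHY.

```python
from collections import deque

def schedule_training(pending, max_wait_cycles):
    """Toy model: decide what the MC issues before granting a PHY training request.

    pending: list of (name, priority) tuples, priority is "high" or "low".
    max_wait_cycles: hypothetical budget of cycles the MC may spend before it
    must grant the PHY's request (one memory request issued per cycle).
    Returns (issued_before_grant, deferred_until_after_training).
    """
    # Latency-critical traffic goes first, in arrival order within its class.
    high = deque(r for r in pending if r[1] == "high")
    low = deque(r for r in pending if r[1] == "low")
    issued = []
    for _ in range(max_wait_cycles):
        if not high:
            break  # only low-priority work is left, so grant the PHY early
        issued.append(high.popleft())
    # Everything not issued waits until training has finished.
    deferred = list(high) + list(low)
    return issued, deferred

requests = [("cpu_read", "high"), ("dma_write", "low"),
            ("display_read", "high"), ("gpu_read", "high")]
issued, deferred = schedule_training(requests, max_wait_cycles=2)
print("issued before training:", issued)
print("deferred:", deferred)
```

A real controller would weigh many more parameters (bank state, refresh deadlines, per-master bandwidth guarantees), but the shape of the decision is the same: training is one more request competing for the memory interface.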
With the clarity at the DFI interface, silicon proof is really a burden on the PHY because it has to train correctly at high speeds and provide a good data eye. The risk of critical bugs in the MC that can only be found through silicon proof is low, and it is something that a strong architecture and design/verification methodology can help eliminate. So the demands on the MC have become less about MC-PHY interoperability and more about performance (memory bandwidth and latency).
I am leaving that as the topic of my next blog.
ARM is building state-of-the-art memory controllers with emphasis on CPU-to-memory performance, supporting the DFI-based PHY solutions available in the market today. We have set up partnerships with third-party PHY providers to ensure that integration at the DFI boundary is seamless for the end customer. ARM’s controllers support all the different training modes used by different PHYs, thereby giving customers flexibility in choosing the best overall solution for their memory sub-system deployment.
Thanks for reading my blog. I welcome your feedback.
Join the ARM Server Segment next week at the Open Compute Project’s yearly main event in San Jose, CA. Our team will be on hand
in support of our ARM partners and also to support OCP’s mission of providing ever more efficient data center server solutions.
OCP Summit is one of the largest attended data center oriented shows each year, bringing together all tiers of the server supply chain and
showcasing a wide variety of compute technologies for Fortune 500 customers. We are excited to be a sponsor of this event.
We will be located in Booth #D14, tucked in along the back wall of the exhibition floor, as shown below.
ARM plays an important role in driving an ever broader, more competitive, more efficient, and more robust ecosystem from
which data center customers can choose. That said, it is never just about ARM, but about our partners, their designs, and their
inspirations that leverage the underlying ARM architecture and ecosystem to deliver optimized and differentiated products into
Our booth will feature server products from our partners SoftIron, Wiwynn, and Prodrive as well as SoC partners AMD and TI.
2015 is an exciting year for ARM in the datacenter as we expect to see AMD and Cavium join the ranks of Applied Micro
in shipping production ARM-based server and networking solutions.
Highlighting our ecosystem enablement work of the past few years and the emergence of 64-bit ARMv8 chips in the marketplace,
we will be raffling off 96Boards.org developer reference boards which sport a 64-bit ARM-based Quad A53 SoC, as shown below.
These boards are the size of a business card, or perhaps a bit smaller, and we are excited to have low-cost 64-bit solutions on hand for our
software ecosystem partners in 2015.
Again, please stop by our booth at OCP Summit 2015, or track down one of the ARM server segment attendees (Lakshmi Mandyam,
Jeff Underhill, or myself). We’d love to hear your ideas and feedback on how we can partner to deliver efficient ARM-based solutions into
the data center in 2015. Safe Travels!
We will be tracking any and all ARM partner related announcements and activity and posting the associated links here in this blog space for
your consumption as OCP week unfolds.
* Applied Micro & Gigabyte: https://www.***.com/news/gigabyte-and-appliedmicro-announce-commercial-availability-of-gigabyte-mp30/
* DataCentred & Codethink host OpenStack-powered cloud: http://datacentred.co.uk/datacentred-world-first-openstack-public-cloud-on-64-bit-arm-servers/
Given the enormous amount of resources compiled on the Community over the past few years, I thought it might be useful to some users to collect all the 'technical' resources in one document. The links below focus on resources for software engineers and developers (although not strictly!) rather than industry news, product releases and other non-technical content. I hope this is of use to some of you. If it is, bookmark it so you can come back to it! If you feel I have missed anything, please leave a comment below with a link to the blog or document. Cheers!
Getting Started- General
We all know that to get anywhere in life you need to have connections. A good connection will open the right doors for you and ensure that you reach your potential with the minimum of wasted energy. The same is true in the microworld of an SoC! Massive growth in system integration places on-chip communication and interconnect at the centre of system performance. System interconnect fabric is the infrastructure that provides cache coherency, system optimization and power savings. Traffic interactions have become complex and, if left unchecked, can cause poor, unpredictable system performance. These days we are seeing ARM technology appear in a wide variety of end applications, from enterprise servers to tiny IoT or wearable devices. In every case, the interconnect fabric is central to ensuring the power and performance requirements for each chip are met. ARM has a range of interconnect solutions that are designed for different purposes across the SoC. The CoreLink interconnect family is the lowest risk solution for on-chip communication. Designed and tested with ARM Cortex® and Mali processors, CoreLink interconnect from ARM provides balanced service for both low latency and high bandwidth data streams. The CoreLink Interconnect family is made up of three product families:
ARM CoreLink Interconnect provides the components and the methodology for designers to build SoCs based on the AMBA specs, maximising the efficiency of data movement and storage and delivering the performance needed at the lowest power and cost. I’ll take you through the options and highlight the use case that each interconnect family is best suited for.
There is a recurring theme at the moment; there is a need for more efficient, optimized solutions from edge to core. ARM Cortex A-Series Processors with CoreLink interconnect IP provide a common architecture across the spectrum, scaling from cost efficient home gateways to high performance core networking and server applications. As networking applications continue to evolve in both throughput and services, we can see their workloads are very different from compute-based ones. That requires a different approach to SoC interconnect, for example in the need for scalability and end to end Quality of Service. The Cache Coherent Network family offers maximum performance with integrated snoop filters and AMBA 5 CHI.
Some of the use cases the CCN family is designed for
On the high end of the performance spectrum, macro base station and cloud applications require dense, efficient compute platforms with the right-sized cores to match the appropriate workload. High performance cores are required for server compute and control plane processing, efficient small cores are required to maximize packet throughput and customized accelerators are needed for Layer-1, security and content delivery processing. Ranging from the largest CCN-512, which supports macro base station and cloud applications, to the smallest CCN-502, which supports small cell base stations and WiFi access points, the CCN family is optimised for all infrastructure applications. In fact, it is estimated that about 80% of network energy consumption is attributed to base stations, so there is a real need for the hardware to be as efficient as possible. Every cloud has a silver lining: the massive growth in global data has forced people to reassess the infrastructure that manages it. CCN is part of a dedicated enterprise server solution that ARM is providing, offering a scalable solution that delivers optimal performance depending on the system PPA requirements.
The Cache Coherent Interconnect offers the smallest and lowest-power multi-cluster interconnect, perfectly suited for mobile SoCs where PPA restrictions are greater. It represents a step down from the CCN in terms of the size of end-use applications, moving from networking to mobile-based SoCs. Mobile systems designers need to support high resolution screens, complex applications and console quality graphics. The Cache Coherent Interconnect is a critical part of the mobile SoC: it provides full cache coherency between big.LITTLE processor clusters and provides IO coherency for other agents such as Mali GPU, network interfaces or accelerators. The CoreLink CCI-500 was released at the beginning of February and offers a scalable and configurable interconnect which enables SoC designers to meet their performance goals with the smallest possible area and power. CoreLink CCI-500 builds on the market-leading success of the previous generation interconnect in three key areas: reduced system power by integrating a snoop filter, increased CPU memory performance and a massive uplift in system bandwidth. The increase in peak system bandwidth, supporting speeds of up to 34GB/s, paves the way for console-quality gaming and seamless 4K content in next-generation mobile devices. It is optimized for mobile but its configurability means it is also suitable for set top boxes, small enterprise and automotive applications.
Example premium mobile system containing CoreLink CCI-500
Finally, the Network Interconnect provides fully configurable, hierarchical, low-latency, low-power connectivity for AMBA. The NIC-400 works with the other CCN and CCI products, making a lot of the microconnections to extend I/O coherency to larger numbers of masters. Additionally, it is used in a number of embedded applications and wearable devices where low power and cost are important considerations. It is a simple crossbar switch that can be configured from 32 to 256 bits wide, a must for small geometries and increasing numbers of IP cores. The real beauty of the NIC-400 is its configurability; it can be optimised to suit the requirements of a complex SoC using the AMBA protocols. Interconnect performance in terms of raw clock speed depends on many factors, including the configuration, size and the system components it is connected to, and of course the silicon technology it is implemented with. An important feature of the NIC is the ability to configure and enable pipeline register stages at various points in the design. This allows fine-grained control of the trade-off between clock speed and latency.
CoreLink NIC-400 extends I/O coherency to large numbers of masters
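The clock-speed versus latency trade-off that pipeline register stages introduce can be illustrated with some simple arithmetic. The model below is a textbook simplification, not NIC-400 data: it assumes a path with a fixed total combinational delay that is cut into equal stages, with a fixed register overhead added per stage. The function name and the delay figures are hypothetical.

```python
def pipeline_tradeoff(path_delay_ns, reg_overhead_ns, stages):
    """Return (max_clock_mhz, latency_ns) for a path cut into `stages` equal stages.

    path_delay_ns: total combinational delay of the unpipelined path (assumed).
    reg_overhead_ns: setup plus clock-to-output cost of one register (assumed).
    """
    if stages < 1:
        raise ValueError("stages must be >= 1")
    # Each stage must fit its share of the logic plus one register overhead.
    cycle_ns = path_delay_ns / stages + reg_overhead_ns
    max_clock_mhz = 1000.0 / cycle_ns
    # A transfer now takes one clock cycle per stage to cross the path.
    latency_ns = stages * cycle_ns
    return max_clock_mhz, latency_ns

for n in (1, 2, 4):
    mhz, lat = pipeline_tradeoff(path_delay_ns=2.0, reg_overhead_ns=0.2, stages=n)
    print(f"{n} stage(s): up to {mhz:.0f} MHz, {lat:.2f} ns end to end")
```

Under these assumptions every extra stage raises the achievable clock frequency but also adds a cycle, and a little register overhead, to the end-to-end latency, which is why being able to choose where the stages go matters.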
As the IP components in a chip become more specialized one of the jobs of the interconnect fabric will be to suit system design requirements and enable rapid on-chip communication between processors, memory and I/O agents. When developing a modern SoC it is so important to choose IP that is fit for requirements. Whether you are looking to build a large server chip, small WiFi access point, premium mobile system or even extend the I/O coherency across the chip, ARM’s CoreLink interconnect portfolio has something tailored to that purpose.
First announced in November 2011, the ARMv8-A architecture and Cortex-A57 and Cortex-A53 processors have cumulatively amassed over 50 licensing agreements. This momentum is particularly strong in manufacturers of application processors targeted at smartphones, with all top-10 players having adopted ARMv8-A. This adoption is set to continue throughout 2015 as premium mobile makers seek to harness the increased potential that the upgrade in system architecture offers. What that means for consumers is devices that are fluid and responsive when handling all of the complex tasks demanded of modern smartphones and tablets.
With this blog I will go through some aspects of the system that make a significant contribution to the increase in performance associated with the shift from 32 to 64-bit.
I don’t think I am alone in thinking that the main drivers in the premium mobile device market are human experiences and expectations. The demand for better user experiences on higher resolution displays whilst retaining fluid responses, with more device-to-device connectivity means that consumers are looking for the next great thing every year.
Recent history has shown that mobile devices are the preferred compute devices of choice. As we move forward this is not going to change.
So why 64-bit in mobile? For the marketing folks it makes perfect sense: 64 is double 32, so it must be twice as good, right? However, there are also a number of technical merits supporting 64-bit designs going forward. The main reason is that it’s the architecture and instruction-set-architecture (ISA) that make the difference. The ISA allows compilers to work smarter and the microarchitecture implementation to be more efficient. Here are a few more benefits of 64-bit that have come off the top of my head:
In short, there are more reasons for designers to move to 64-bit now than ever before. If you think I’ve missed any of the important benefits that 64-bit brings, please mention them in the comments below.
Bandwidth requirements for premium mobile devices are expected to soar over the next few years and there are several key use cases supporting this trend.
Screen sizes and resolutions have increased across a wide range of devices, and frames per second have increased, not only for consuming content but also for capturing it. As more people capture content via their mobile’s camera, there is greater demand for higher resolution in stills and video capture:
One of the largest users of memory bandwidth in a SoC is the media subsystem components – GPU, video and display. Nobody wants the annoyance of having his or her screen freeze when capturing that crucial moment on camera, so it is vital that the bandwidth efficiency is optimal here.
Whilst we are making advances in frame-buffer compression technology such as AFBC, peak bandwidth requirements continue to grow.
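To get a feel for the numbers involved, here is a back-of-the-envelope calculation of the raw bandwidth of a single uncompressed frame stream. The assumptions are mine, not figures from this blog: 4 bytes per pixel, 60 frames per second, and each frame crossing the memory interface exactly once. In a real media pipeline a frame is typically written and read several times (by the GPU, video engine and display), so actual traffic is a multiple of this.

```python
def stream_bandwidth_gbps(width, height, bytes_per_pixel, fps):
    """Uncompressed bandwidth in GB/s (10^9 bytes/s) for one full-frame stream."""
    return width * height * bytes_per_pixel * fps / 1e9

# Assumed formats: 32-bit pixels at 60 frames per second.
full_hd = stream_bandwidth_gbps(1920, 1080, 4, 60)
uhd_4k = stream_bandwidth_gbps(3840, 2160, 4, 60)
print(f"1080p60: {full_hd:.2f} GB/s, 4K60: {uhd_4k:.2f} GB/s")
# prints "1080p60: 0.50 GB/s, 4K60: 1.99 GB/s"
```

Moving from Full HD to 4K quadruples the pixel count and therefore the raw bandwidth of every pass over the frame, which is why compression alone does not stop peak requirements from growing.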
As our mobile devices become central to our digital lives, those capabilities must be paired with the power efficiency required to work through a full day of heavy use on a single charge. Modern mobile design requires a commitment to getting the most out of every milliwatt and every millimetre of silicon.
As engineers and technologists in this market, we have the challenge of delivering this mobile experience within tight energy and thermal constraints.
Thankfully, there are some form factors that the market has gravitated towards, which gives us enough stability to define some SoC power budgets and therefore a clearer target to hit.
At ARM we develop our cores, GPUs and system IP with the aim of delivering the maximum performance within an energy or power consumption envelope.
The shift from traditional mobile phones to smartphones and tablets has resulted in a change in user behaviour. The phone in our pocket is now the primary computing device, and we make increasingly complex demands of it. According to a survey of the amount of time mobile users spend on mobile applications, we see more than 85% of time being spent on three types of applications.
A high percentage of time is spent on web-based applications such as web browsing and Facebook, closely followed by gaming, with a good part also spent on audio and video playback and on utility apps such as cloud storage, notes and calendars.
It’s interesting to note that the three most common tasks all consume power in vastly different ways. Clearly we have to bear these different power profiles in mind when designing a SoC that can deliver optimal performance for all use cases.
big.LITTLE™ Technology is ARM’s silicon-proven energy optimization solution. It consists of two or more sets of architecturally identical but different capability CPUs:
The big processors (in BLUE) are designed for high performance and the LITTLE processors (in GREEN) are designed for maximum efficiency. Each CPU cluster has its own L2 cache that has been designed and sized for high performance in the case of the big cluster and high efficiency in the case of the LITTLE.
big.LITTLE supports ARMv7 processors (Cortex-A7, Cortex-A15 and Cortex-A17) as well as ARMv8-A processors (Cortex-A53, Cortex-A57 and the recently announced Cortex-A72). big.LITTLE uses heterogeneous computing to bring you 40-60% additional energy savings, when measured across common mobile use cases on ARMv8 based devices.
Combined with the hardware benefits of moving to the 64-bit architecture on Cortex-A72 and Cortex-A53, the big.LITTLE software model allows multi-processing across all cores.
The first System IP component I get to introduce at this point is our recently announced CoreLink™ CCI-500 Cache Coherent Interconnect that makes big.LITTLE compute possible. Neil Parris wrote an excellent in-depth blog on how CoreLink CCI-500’s snoop filter improves system performance.
CoreLink CCI-500 allows both sets of clusters to see the same block of memory, which enables a flexible, seamless and fast migration of data from the big cluster to the LITTLE cluster and vice versa. It also allows each cluster to snoop into the caches of the other cluster, reducing the time CPUs spend stalling and hence improving performance and saving power. CCI-500 also doubles peak system bandwidth over CCI-400 which is the semiconductor equivalent of upgrading a highway from two lanes to four, easing congestion when traffic gets busy and saving people time.
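The idea behind a snoop filter can be sketched in a few lines. The class below is purely illustrative and is not how CCI-500 is implemented: a real snoop filter is a fixed-capacity hardware structure, whereas this toy uses an unbounded dictionary, and all names (`SnoopFilter`, `record_fill`, `clusters_to_snoop`) are hypothetical. It shows only the principle of tracking which cluster may hold a line so that snoops are sent where they are needed rather than broadcast.

```python
class SnoopFilter:
    """Toy directory recording which clusters may hold each cache line."""

    def __init__(self):
        self.owners = {}  # line address -> set of cluster names

    def record_fill(self, cluster, addr):
        # A cluster has allocated the line in its cache.
        self.owners.setdefault(addr, set()).add(cluster)

    def record_evict(self, cluster, addr):
        # A cluster has dropped the line; forget empty entries.
        holders = self.owners.get(addr)
        if holders:
            holders.discard(cluster)
            if not holders:
                del self.owners[addr]

    def clusters_to_snoop(self, requester, addr):
        """Clusters that may hold the line, excluding the requester itself."""
        return sorted(self.owners.get(addr, set()) - {requester})

sf = SnoopFilter()
sf.record_fill("big", 0x1000)
print(sf.clusters_to_snoop("LITTLE", 0x1000))  # the big cluster must be snooped
print(sf.clusters_to_snoop("LITTLE", 0x2000))  # empty list: go straight to memory
```

When the filter reports that no other cluster holds a line, the interconnect can skip the snoop entirely, which saves both the lookup power in the other cluster's cache and the time spent waiting for its response.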
Given that we have CCI-500 at the core of our system, we can now look at the other System IP components that work in concert with the CCI to help ARM partners build 64-bit systems. When you look at this example representation of a Premium Mobile SoC you can see there is a significant amount of System IP performing multiple tasks.
CoreLink GIC-500 Generic Interrupt Controller manages migration of interrupts between CPUs and allows for virtualization of interrupts in a hypervisor controlled system. Compared with the previous generation GIC-400, the GIC-500 supports more than eight CPUs and also supports message-based interrupts as well as directly connecting to ARMv8 Cortex-A72 and Cortex-A53 system register interfaces instead of ARMv7 IRQ and FIQ inputs.
CoreLink MMU-500 System Memory Management Unit supports a common physical memory view for IO devices by sharing the same page tables as the CPUs.
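What sharing page tables buys you can be shown with a deliberately simplified model. The sketch below uses a single-level table held in a dictionary and a 4 KB page size; real ARMv8-A translation uses multi-level tables with permissions and attributes, and none of the names here come from the MMU-500 documentation. The point is only that a CPU and an I/O device walking the same table resolve the same virtual address to the same physical address.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

def translate(page_table, virtual_addr):
    """Toy single-level lookup: virtual page number -> physical frame number."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError(f"translation fault for virtual page {vpn:#x}")
    return page_table[vpn] * PAGE_SIZE + offset

# One table, set up once by the OS (hypothetical mappings).
shared_table = {0x10: 0x80, 0x11: 0x93}

cpu_pa = translate(shared_table, 0x100A4)     # access issued by a CPU
device_pa = translate(shared_table, 0x100A4)  # same buffer seen by an I/O device
print(hex(cpu_pa), hex(device_pa))            # both print 0x800a4
```

Because both agents use one set of mappings, software can hand a device an ordinary virtual buffer pointer instead of maintaining a separate, device-specific view of memory.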
Rest of SoC connectivity is serviced by CoreLink NIC-400 which provides a fully configurable interconnect solution to connect sub-systems such as video, display and peripherals. NIC-400 configurability enables partners to build hierarchical, low latency, low power connectivity for AMBA® 4 AXI4™, AMBA 3 AXI3™, AHB™-Lite and APB™ components.
The fact that all of these System IP components are designed, implemented and validated with ARM Cortex processors and the Mali Media library reduces overall system latency. These enhancements play a key role in the performance uplift that 64-bit computing brings to mobile.
The increased processing throughput in 64-bit systems impacts debugging solutions as well, particularly through the increase in output bandwidth from the trace macrocell. Debug and trace System IP is also critical for helping ARM partners to debug and optimise software for 64-bit systems, comprising:
CoreSight SoC-400 currently provides the most complete on-chip debug and real-time trace solution for the entire system-on-chip (SoC), making ARM processor-based SoCs the easiest to debug and optimize. Mayank Sharma has explained how to build customised debug and trace solutions for multi-core SoCs using CoreSight SoC-400, showing the value that a well-thought out debug & trace system can offer to all stages of SoC development.
We’ve discussed how, for 64-bit mobile devices, consumers expect something new every year with better and better performance. What I’ve done in this blog is introduce some of the key IP components that all contribute to premium devices that are faster and more power-efficient each year. 2015 will be a year where we see the 64-bit mobile device reach a wide audience thanks to the outstanding work of our ARM partners! Building a 64-bit SoC has never been easier owing to all of the IP that has been designed and optimized for the purpose.
As system performance increases, so does the need to tightly control the thermal and energy envelope of the system. Whether it is lowest latency or highest bandwidth demanded by the processors, ARM System IP delivers outstanding efficiency to achieve the performance required with the lowest power and smallest area.
For more information on the System IP portfolio please visit: System IP - ARM
Chinese version (中文版): 高端移动用户体验优选，ARM优质IP组合 (in English: "The premium choice for high-end mobile user experience: ARM’s premium IP suite")
Smartphone innovation is never-ending and will continue to evolve over time. Five years ago, when people talked about smartphones, the touch interface was new to many end users and was just about to take off, displacing the QWERTY keyboard. Today, nobody remarks on touch UI because it is nearly impossible to find a smartphone without a touch interface. What we see today is that voice-driven input and gesture recognition applications are being adopted by more and more people, thanks to higher processor performance, higher cellular data rates, and better voice recognition algorithms. All of those technological breakthroughs make it possible to carry out complicated data acquisition and analysis on the device or in the cloud. A few years ago, when mobile displays were less than HD (high-definition), people could only perform simple UI operations or chat online with text messages. Today, as HD or Full HD displays have become the basic smartphone specification, HD content streaming and messaging with text, images, voice, and video are things people do when they take the bus to work. And as more and more high-end mobile devices are equipped with 4K or Retina displays, we can expect to see UHD (ultra-HD) content streamed or broadcast over cellular networks from studios. Apps, which are really one of the catalysts that ignited the smartphone revolution, will evolve from gimmicky branded apps and games into unique personalized mobile services leveraging location, context, usage behavior, and data. Buzzwords such as context awareness, console gaming on mobile devices, and mobile payment are within reach.
The redefined premium user experience is not possible without technological breakthroughs. The heart of smartphone innovation includes the CPU, GPU, modem and connectivity, video codec and peripherals, to name just a few. Over the past five years, CPU performance has improved 36-fold, GPU performance has improved 40-fold, and video codec complexity has increased 34-fold. With LTE deployment and new techniques in Wi-Fi, cellular and connectivity data rates have increased 40-fold. The demand for memory bandwidth and capacity has also increased 16-fold.
ARM, as the leader in the mobile industry, provides high-performance, high-efficiency IP to silicon partners to meet ever-increasing user demand and deliver an unprecedented premium user experience. On February 4th in Beijing, ARM announced the following IP suite for the premium user experience:
Cortex-A72 is ARM’s highest performance Cortex-A 64-bit processor as of today. Together with advanced process manufacturing technologies moving from 28nm in 2014, to 20nm in 2015, and to 16FinFET in 2016, Cortex-A72 in 2016 provides a 3.5x performance uplift over Cortex-A15 in 2014 within the same smartphone power envelope. On the energy efficiency side, a Cortex-A72 device in 2016 is expected to consume 75% less energy for the same workload than Cortex-A15 devices in 2014. With ARM’s big.LITTLE technology, Cortex-A72 as the big core paired with the power-efficient Cortex-A53 as the LITTLE core can provide an additional 40%-60% power reduction on average across multiple workloads. The big.LITTLE solution with Cortex-A72 also provides compelling scalable solutions for high-end, larger-screen mobile devices that require more compute in a smaller footprint with less power.
As slimmer and cooler form factors are critical in consumers’ eyes for high-end smartphones, Cortex-A72’s better energy efficiency helps device makers design sleeker and cooler devices, so innovative, non-conventional, fashionable devices can be expected in the near term.
While a CPU such as Cortex-A72 provides high computing capability with high efficiency, consumers will not be able to enjoy a superior visual experience without a Mali GPU. The newly introduced Mali-T880, compared to its predecessor Mali-T760 announced in 2014, provides 1.8x better performance and 40% less energy consumption for the same workloads. Complemented by the Mali-V550 video codec and Mali-DP550 display processor, ARM Mali enables superior system-level benefits and brings advanced gaming and a console-like user experience to mobile consumers.
Besides Cortex-A72 and Mali-T880, the premium user experience cannot be achieved without a high-performance, high-efficiency highway system that connects multiple engines and transports data within the system promptly and efficiently. ARM CoreLink CCI-500 delivers as much as twice the peak system bandwidth of CoreLink CCI-400. CCI-500 also delivers an additional 35% memory performance uplift relative to Cortex-A72 with CoreLink CCI-400. With the peak bandwidth delivered by CoreLink CCI-500, 4K display and beyond can be realized. As mobile systems become more complicated and can vary a lot with respect to CPU cores, GPU cores, display processor capabilities and so on, CoreLink CCI-500 is highly configurable to match the system configuration and is scalable to support coherent accelerators and enhance big.LITTLE integration.
Without doubt, process manufacturing technology is one of the key factors that drive mobile technologies forward and keep setting new bars. In ARM’s Premium IP suite announcement, ARM’s POP IP for TSMC 16FinFET was also announced alongside Cortex-A72, Mali-T880, and CoreLink CCI-500. As advanced process manufacturing technology keeps shrinking at the nanometer scale, the cost and technology barriers become higher, and some silicon partners with limited resources may face more challenges when moving to more sophisticated designs. ARM’s POP IP for TSMC 16FinFET greatly reduces the barrier for silicon partners so they can focus on their system innovation. ARM’s POP IP for TSMC 16FinFET can achieve 2x the power efficiency of a 28nm implementation, and enables Cortex-A72 implementations at up to 2.5 GHz, maximizing performance and efficiency.
Pete Hutton, EVP and President Product Groups
At the Premium IP suite announcement on February 4th in Beijing, ARM’s executives were joined by partners, HiSilicon’s Daniel Diao, MediaTek’s Andrew Chang, and TSMC’s Peter Chen, to jointly announce the premium IP suite for the next generation of premium user experience.
Pete Hutton, EVP and President Product Groups, and Noel Hurley, ARM General Manager of CPU, first welcomed the journalists, reporters and partners.
Besides the technical aspects of the announced IP suite, Pete and Noel also shared ARM’s view on the premium mobile user experience. In ARM’s view, the massive increase in CPU and GPU performance and boosts in connectivity, combined with software innovations in usability and new business models, have changed our mobile experience. For example, a console gaming experience on mobile devices will soon be realized, and in a few years people may be asking for mobile gaming quality in console gaming. Multicasting and UHD content streaming anywhere, anytime are becoming reality. Beyond entertainment, people can now edit office documents such as PowerPoint or Word files, do professional tasks such as mechanical CAD design on their mobile devices, and share the results with colleagues over wireless connections. People can take photos, transform a photo into a 3D model on a mobile device, and then use a 3D printer to make a physical model of a toddler or a figure from a favorite Disney movie. This is a new user experience in content creation, no longer limited to creating digital documents. Delivering these new user experiences depends upon continued innovation in the mobile device and the SoC that powers it.
Partners such as HiSilicon, MediaTek, and TSMC also shared their views on the premium mobile user experience. The sales figures recorded on Singles’ Day (光棍節, November 11th) showed how red-hot the internet boom in China’s economy is. They also indicate that, with huge mobile traffic and online shopping activity, the mass market requires higher-performance, more fashionable, slimmer, sleeker and longer-lasting mobile devices for an unprecedented user experience. ARM’s new premium IP suite will be the engine that drives the premium user experience in 2016 and 2017, and consumers will be able to enjoy it in the near future.
At the end of the announcement, ARM executives and partners together held signage boards for Cortex-A72, Mali-T880, CoreLink CCI-500, and POP IP for TSMC 16FinFET to jointly announce the premium IP suite.
From left: ARM’s Mark Dickinson, HiSilicon’s Daniel Diao, ARM’s Pete Hutton, ARM’s Allen Wu, TSMC’s Peter Chen, MediaTek’s Andrew Chang, ARM’s Noel Hurley
It’s been a busy start to the year for ARM as the year kicked off with CES 2015 showcasing some of the amazing experiences the consumer electronics of today (and tomorrow!) can provide. We saw a lot of incredible ARM-based products on display highlighting the range of capabilities, from the increasing amount of electronics inside automobiles to premium mobile devices to all manner of wearables and connected devices attempting to shape the future. Personally I think CES is a fantastic way to begin the year as it sets the scene for how the consumer electronics industry is evolving. Brad Nemire compiled a list of some of the more unusual tech gadgets he saw at the show.
Just as CES provides an insight into the latest and greatest consumer technology, ARM's Premium Mobile announcement last week will shape mobile experiences in the years to come. Three new products were announced: the Cortex®-A72 processor, the Mali-T880 GPU and the CoreLink CCI-500 cache coherent interconnect. This IP suite will enable the next generation of premium experiences for mobile devices, supporting use cases such as 4K video and console-quality gaming. Equally important, it delivers a more fluid interface when dealing with the complex tasks that we as users increasingly ask of our mobile devices.
We have created an infographic to go into detail on how CoreLink CCI-500 is central to this next generation of ARM mobile systems.
CoreLink CCI-500 is part of a complete suite of System IP from ARM which also includes:
Each System IP component plays a key role in maximizing the performance of ARM-based systems. Find out more about the ARM System IP portfolio by visiting http://www.arm.com/products/system-ip/index.php
In summary, the CoreLink CCI-500 cache coherent interconnect provides system-wide improvements in processor performance, system power and peak system bandwidth. Its configurability allows it to be optimised for various PPA requirements, meaning it can be suitable for multiple applications including digital TVs, mobile, industrial and automotive infotainment. Find out more about CoreLink CCI-500 by visiting http://www.arm.com/products/system-ip/interconnect/corelink-cci-500.php
Chinese version (中文版): ARM Cortex-A72与全新高端移动体验 (ARM Cortex-A72 and the New Premium Mobile Experience)
When I look back at September 2010, when we announced the ARM Cortex-A15 processor, the smartphones shipping then were using a single-core ARM Cortex-A8 processor. Even then, these early smartphones were beginning to change the way we thought of connected experiences. I recall that the advent of integrated email boosted productivity, and devices could now offer instant messaging and multimedia capabilities. By the beginning of 2014, with phones based on later Cortex-A15-based octa-core designs, CPU performance had increased to a spectacular 17x that of devices from five years before, and your smartphone was now an inseparable part of your daily existence and, more often than not, your primary compute platform.
What's more, devices now boast larger and higher resolution displays with vivid colours, sophisticated always-on multi-sensor context awareness, high-quality still and video camera capability and responsive gaming. Smartphones are setting the standard for the connected experiences the user expects in ever-slimmer devices, placing a significant demand on the SoC to operate within strict thermal and power budgets. The discerning consumer expects smartphones to be the hub of our always-connected lifestyles and will need more capability in the future.
It is with this need of the future consumer in mind that last week we announced the first ARM IP Suite for premium mobile experiences, at the heart of which is the new ARM Cortex-A72 processor, our most advanced high-performance CPU. Aimed at devices shipping in 2016, the Cortex-A72 processor will deliver a staggering 50x CPU performance boost compared to devices from five years ago! The IP Suite also includes the Mali-T880 GPU and CoreLink CCI-500, along with premium configurations of the Mali-V550 video processor and Mali-DP550 display processor, and is complemented by the high-efficiency ARMv8-A Cortex-A53 processor (in big.LITTLE combinations). When used together in a mobile SoC, these products will ensure that 2016 smartphones deliver unprecedented performance and efficiency, and provide users with an unrivalled premium mobile experience.
We believe that the Cortex-A72 processor and the Premium Mobile IP suite will redefine mobile experiences. The Cortex-A72 processor, in conjunction with POP IP for 16nm FinFET+ technology, unleashes a level of performance that can truly enable highly interactive productivity applications. Combined with the CoreLink CCI-500 interconnect, which offers twice the peak system bandwidth, it can deliver highly responsive and seamless content creation, including movie editing, and stunning console-quality gaming experiences, along with virtual reality applications that add another dimension to making your primary compute platform your only compute platform.
On an architectural level, the Cortex-A72 benefits from enhancements in all aspects of CPU performance: integer and floating-point performance and especially memory streaming, which will vastly improve the handling of large datasets and interactive workloads, all within the constrained energy and power budgets of the smartphone. To summarize key benefits, the Cortex-A72 will deliver
2015 marks a point at which device performance is matched by the release of software that brings closer integration of productivity suites, with Microsoft Office and Outlook now available for ARM-based smartphones and tablets. Industry-standard tools such as AutoCAD 360, accepted across several industries as requisites for modelling, manufacturing and production, are starting to really take off. We are also seeing tools widely used by artists, creators and makers take further advantage of increased CPU and GPU performance. With this in mind, what lies ahead is a very interesting question. We can only hint at the general direction of that opportunity, but there are several instances in which the combination of sensor data, sophisticated image capture and integration with the aforementioned types of applications may even redefine the way we view content creation.
One can easily imagine a scenario in which you capture a series of still images of an object or person from various angles with the smartphone camera, while the position of the device is accurately plotted either from references in the background of the captured images or through accurate mapping of sensor data on the device. The series of photos can then be synthesized into a 3D model. This rendering, with an associated physics model and mesh geometry, can be manipulated, used as an avatar in virtual reality or even exported to a 3D printer to instantly create the subject, in this case a wedding topper. Some of these use cases exist today, but only in part: they are cumbersome, and processing is often constrained to the cloud and subject to its availability. The future user experience is mobile-first, seamless, instant productivity, something that is no longer relegated to a desk environment but ushers in smarter interactions at all levels. To deliver this, all parts of the SoC must grow. Of course, given the exciting directions we see several ARM partners taking, the example I chose here will look quite rudimentary one day in the near future!
The Cortex-A72 processor builds on the success of the Cortex-A57, which is the state-of-the-art, high-performance CPU of choice for 2015 devices. On the hardware front too, we see our partners driving an amazing transformation at an unprecedented pace. In just two years, we have seen Cortex-A53 processor-based SoCs designed into hundreds of devices, supporting the huge smartphone adoption in Asia-Pacific markets. We are seeing a big wave of these Cortex-A57 and Cortex-A53 processor-based devices coming to market, with devices such as the Samsung Galaxy Note 4 already shipping, achieving consistently high benchmark scores and leading the mobile experience. We have also recently seen several Cortex-A57 processor-based devices announced, such as the LG G Flex2, LG G4 and Xiaomi Mi Note Pro, to name a few. This is the time of year when technology media look forward to new devices being announced at the key industry expos, and we expect several to feature ARMv8-A cores. Stay tuned!
Besides this, devices based on the 32-bit ARMv7-A architecture including processors like Cortex-A15, Cortex-A17 and Cortex-A7 continue to ship in very large numbers in hundreds of end devices. It is excellent to see the range and quality of the devices that these cores are going into, and I am confident that this trend will continue as our product range widens.
The Cortex-A72-based products will build on this very solid foundation next year. It is an exciting time for the ARM partnership and I hope you think so too! To learn more about the products launched, check out our announcement and let us know what you think.
An article analyzing the Samsung Exynos 5433 and the Cortex-A53, Cortex-A57 and Mali-T760 has been posted at AnandTech: "ARM A53/A57/T760 investigated - Samsung Galaxy Note 4 Exynos Review".
The article is outlined as follows:
Exynos 5433 - The First Mobile ARM A57 SoC
20nm Manufacturing Process
Cortex-A53 - Architecture
Cortex-A53 - Synthetic Performance
Cortex-A57 - Architecture
Cortex-A57 - Synthetic Performance
Mali-T760 - Architecture
CPU and System Performance
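Chinese version (中文版): 扩展系统一致性 - 第 3 部分 - 性能提升和 CoreLink CCI-500 简介 (Extending System Coherency - Part 3 - Performance Gains and an Introduction to CoreLink CCI-500)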
This week we announced the launch of a new suite of IP designed to enhance the premium mobile experience. A central part of this suite is the ARM CoreLink CCI-500 Cache Coherent Interconnect, which builds on the market-leading success of the previous generation interconnect, extending the performance and low power leadership of ARM systems.
One year on, and with over 47,000 views since my first blog on the subject, we can see that system coherency remains an important factor for SoC design starts. CoreLink CCI-400 has seen great success, with over 35 licensees across multiple applications from mobile big.LITTLE to network infrastructure, digital TV and automotive infotainment. In all these applications there is a need for full coherency for multiple processor clusters, and IO coherency for accelerators and interfaces such as networking and PCIe.
Compared to CoreLink CCI-400, the CoreLink CCI-500 offers up to double the peak system bandwidth, a 30 percent processor memory performance increase, reduced system power, and high scalability and configurability to suit the needs of diverse applications. This blog will go into more detail on these benefits, but first I’ll give a quick recap of cache coherency and shared data.
Cache coherency means that all components have the same view of shared memory. The first two parts of this blog series introduced the fundamentals of hardware cache coherency:
In my first blog I discussed the different methods of achieving cache coherency. Historically cache maintenance has required a lot of software overhead to clean and invalidate caches when moving any shared data between on-chip processing engines. AMBA® 4 ACE and the CoreLink CCI-400 introduced hardware cache coherency which happens automatically without the need for software intervention. The second blog talks about applications such as big.LITTLE processing, where all big and LITTLE cores can be active at the same time, or a mix to meet the performance requirements.
The simplest implementation of cache coherency is to broadcast a snoop to all processor caches to locate shared data on-demand. When a cache receives a snoop request, it performs a tag array lookup to determine whether it has the data, and sends a reply accordingly.
For example in the image above we can see arrows showing snoops between big and LITTLE processor clusters, and from IO interfaces into both processor clusters. These snoops are required for accessing any shared data to ensure their caches are hardware cache coherent. In other words, to ensure that all processors and IO see the same consistent view of memory.
For most workloads the majority of lookups performed as a result of snoop requests will miss, that is they fail to find copies of the requested data in cache. This means that many snoop-induced lookups may be an unnecessary use of bandwidth and energy. Of course we have removed the much higher cost of software cache maintenance, but maybe we can optimize this further?
This is where a snoop filter comes in. By integrating a snoop filter into the interconnect we can maintain a directory of processor cache contents and remove the need to broadcast snoops.
The principle of the snoop filter is as follows:
The CoreLink CCI-500 provides a memory system power saving compared to the previous generation interconnect thanks to its integrated snoop filter. This saving comes from doing one central snoop lookup instead of many, and from avoiding an external memory access for every snoop that hits in a cache. Furthermore, it may enable processor clusters to stay in a low power sleep state for longer while the snoop filter responds to coherency requests.
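To make the "one central lookup instead of many" idea concrete, here is a minimal software sketch of a directory-style snoop filter next to a broadcast baseline. It is only an illustration under my own assumptions: the class name `SnoopFilter`, the method names, and the unbounded per-line sharer sets are all hypothetical, and none of this describes how CoreLink CCI-500 is actually implemented in hardware.

```python
from collections import defaultdict


class SnoopFilter:
    """Toy directory tracking which clusters may hold each cache line.

    Hypothetical model for illustration only; a real filter has finite
    capacity and is implemented in hardware inside the interconnect.
    """

    def __init__(self, num_clusters):
        self.num_clusters = num_clusters
        self.sharers = defaultdict(set)  # line address -> set of cluster ids
        self.snoops_sent = 0

    def record_allocate(self, cluster, addr):
        """Note that a cluster has brought a line into its cache."""
        self.sharers[addr].add(cluster)

    def record_evict(self, cluster, addr):
        """Note that a cluster has dropped a line from its cache."""
        self.sharers[addr].discard(cluster)
        if not self.sharers[addr]:
            del self.sharers[addr]

    def snoop_targets(self, requester, addr):
        """One central lookup: return only clusters that may hold the line."""
        targets = sorted(self.sharers.get(addr, set()) - {requester})
        self.snoops_sent += len(targets)
        return targets


def broadcast_targets(num_clusters, requester):
    """Baseline without a filter: every other cluster must be snooped."""
    return [c for c in range(num_clusters) if c != requester]


sf = SnoopFilter(num_clusters=4)
sf.record_allocate(cluster=1, addr=0x1000)
print(broadcast_targets(4, requester=0))            # [1, 2, 3]
print(sf.snoop_targets(requester=0, addr=0x1000))   # [1]
print(sf.snoop_targets(requester=0, addr=0x2000))   # []
```

In this toy model, a request for a line that nobody else caches generates no snoops at all, which is the common case described earlier where most broadcast lookups would have missed.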
Mobile systems normally include asynchronous clock bridges for each processor cluster, and communicating across these bridges costs latency. It’s quicker, easier and lower power to communicate with the interconnect snoop filter instead. This reduced snoop latency can benefit processor performance, and benchmarking has shown a 30% improvement in memory intensive processor workloads. This can help make your mobile device faster, more responsive and accelerate productivity applications like video editing.
Also by reducing snoops, the processors in the system can focus their resources on processing performance and less on responding to snoops. In real terms it means that users will have an SoC that can deliver higher performance while requiring less power to do so.
There is a consistent trend towards multi-cluster SoCs across a number of markets as design teams seek to unleash even more computing performance. Scaling to higher bandwidth systems with more processor clusters will show even greater benefits for the snoop filter. In fact it becomes essential when scaling performance beyond two processor clusters. CoreLink CCI-500 is highly scalable and supports configurations from 1 to 4 ACE interfaces (e.g. 1 to 4 processor clusters). Two-cluster big.LITTLE will remain the standard in mobile, but for other applications there is an opportunity for more processors or indeed coherent accelerators.
Infrastructure networking and server applications already have a high proportion of shared memory accesses between processors and IO; the ARM CoreLink CCN Cache Coherent Network family of products already includes integrated snoop filters to ensure the high performance and low latency expected by these applications. The CoreLink CCN family remains the highest performance coherent interconnect IP, supporting up to 12 clusters (48 cores), an integrated level 3 system cache and clock speeds in excess of 1GHz. CoreLink CCI-500 is optimized for the performance and power envelope required for mobile and other power constrained applications. The complementary CoreLink NIC-400 Network Interconnect provides the low power, low latency 'rest of SoC' connectivity required for IO coherent requesters and the many tens or hundreds of peripherals and interfaces.
There is no 'one size fits all' interconnect; instead, ARM has a range of products optimized for the needs of each application.
The performance of mobile devices, including smartphones and tablets, is increasing with every generation; in fact, tablets are replacing many laptop purchases. A key dimension of SoC performance is memory bandwidth, and this is being driven upwards by screen resolution, 3D gaming, multiple higher resolution cameras and very high resolution external displays. 'Retina' class display resolution is already commonplace on mobile devices and Ultra-HD 4K has been available on high end TVs for a couple of years. It is only a matter of time before we see 4K content appear in mobile devices.
To support this increase in memory bandwidth SoC vendors are looking to the latest low power double data rate (LPDDR) dynamic RAM (DRAM) technology. LPDDR3 is an established technology that was in 2013 consumer devices while LPDDR4 appeared in some 2014 devices and will continue to grow its rate of adoption in 2015 in both mobile and non-mobile applications. Each generation of LPDDR lowers the voltage but increases the interface frequency, net result: more bandwidth and lower energy per bit. A single 32 bit LPDDR4-3200 interface will offer 12.8GB/s which is typical on today’s premium smartphones.
For mobile devices, 32 bit memory channels are common, ranging from single channel for lower cost, entry-level smartphones, through dual channel for high end smartphones, to quad channel for the highest performance tablets.
The CoreLink CCI-500 offers up to double the peak system bandwidth of CoreLink CCI-400 by supporting up to 4 memory channels. This could allow partners to build memory systems supporting 34GB per second and beyond which enables high performance, high resolution tablet computing. Of course scalability for multiple applications is important, and CoreLink CCI-500 can be configured from 1 to 4 memory channels to suit performance requirements.
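The bandwidth figures quoted above follow from simple arithmetic: transfers per second, times bytes per transfer, times the number of channels. The sketch below works through it. The function name is my own, and the 2133 MT/s data rate in the last line is an assumption chosen because it lands near the 34GB/s figure; the text does not say which memory type that number is based on.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s, bus_width_bits, channels=1):
    """Peak theoretical bandwidth in GB/s (1 GB = 10**9 bytes).

    transfer_rate_mt_s: data rate in mega-transfers per second,
                        e.g. 3200 for LPDDR4-3200.
    bus_width_bits:     width of one memory channel in bits.
    channels:           number of identical channels.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


# Single 32 bit LPDDR4-3200 channel, as quoted in the text.
print(peak_bandwidth_gb_s(3200, 32))        # 12.8

# Four such channels (peak, assuming all four run at the same rate).
print(peak_bandwidth_gb_s(3200, 32, 4))     # 51.2

# Assumed example: four 32 bit channels at 2133 MT/s.
print(peak_bandwidth_gb_s(2133, 32, 4))     # 34.128
```

These are theoretical peaks; sustained bandwidth in a real system will be lower because of refresh, bank conflicts and read/write turnaround.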
One of the biggest benefits of the ARM CoreLink Interconnect is that it has been developed, validated and optimized alongside our Cortex® and Mali™ processor products with the high quality levels that our partners expect. This week's launch also announced the Cortex-A72, ARM's highest performance Cortex processor, the Mali-T880 GPU, high-end configurations of our latest Mali-V550 video and Mali-DP550 display IP, and Artisan physical IP for 16nm FinFET.
To complete the SoC ARM also offers a complete suite of system IP including CoreLink NIC-400 network interconnect for low power, low latency, end to end connectivity to the rest of the SoC, CoreLink MMU-500 system MMU for virtualization of IO and the CoreLink GIC-500 for management of interrupts across multiple clusters, not to mention CoreSight for debug and trace. Central to all of this is the CoreLink CCI-500 Cache Coherent Interconnect.
As we have seen with many other computing features that began in enterprise applications, mobile SoCs are rapidly catching up on the amount of shared traffic across the chip. It is proof that mobile computing power is still advancing steadily and incorporating many features that were only introduced to premium laptops a few years ago. The fact that mobile devices are now high performance devices in their own right should come as no surprise.
I for one look forward to seeing how a 2020 device will compare with today’s premium mobiles, and am looking forward to the challenge of making ARM technology that provides the infrastructure for the premium devices of tomorrow. What do you think devices will look like 5 years from now?
Links for further information:
For many, Tetris is simply a tile-matching video game originally designed and programmed by Alexey Pajitnov in 1984. For others, however, it inspires endless possibilities for Maker projects. Most recently, AdaCore's Tristan Gingold and Yannick Moy have implemented the highly popular puzzle game on an Atmel | SMART SAM4S ARM Cortex-M4 microcontroller.
“There are even versions of Tetris written in Ada. But there was no version of Tetris written in SPARK, so we’ve repaired that injustice. Also, there was no version of Tetris for the Atmel SAM4S ARM processor, another injustice we’ve repaired,” the duo writes.
The concept first stemmed from their colleague Quentin Ochem, who had been searching for a flashy demo for GNAT using SPARK on ARM, to run on the SAM4S Xplained Pro Evaluation Kit. Luckily, this kit features an OLED1 extension with a small rectangular display, which surely enough, immediately ‘SPARKed’ the idea of Tetris. Now, throw in the five buttons overall between the main card and the extension, and the team had all the necessary hardware to bring the project to life.
In total, the entire build took approximately five days to complete. Both Gingold and Moy advise, “Count two days for designing, coding and proving the logic of the game in SPARK, another two days for developing the BSP for the board, and a half day for putting it all together.”
For those unfamiliar with SPARK, it is a subset of Ada that can be analyzed very precisely for checking global data usage, data initialization, program integrity and functional correctness. Mostly, it excludes pointers and tasking, which proved not to be a problem for Tetris.
While we’ve seen the retro game played on everything from t-shirts to bracelets, we’ve never experienced the game literally on an MCU. As the team notes, all of the necessary sources can be downloaded in the tetris.tgz archive, while those interested in designing one of their own can find a detailed breakdown of the entire build here.
In an effort to make FPGA-based prototyping available to any engineer, S2C is offering its popular ProtoBridge AXI FPGA-accelerated verification tool along with its Virtex 7 SingleE and Kintex 7 Logic Modules for a limited time at an accessible price entry point.
ProtoBridge AXI enables designers to read and write data from computers to AXI-based designs mapped to FPGA-based prototypes. By utilizing a rich set of C subroutine calls, ProtoBridge AXI users can easily implement algorithm validation, block-level prototyping, full-chip simulation acceleration, corner case testing and early SoC software development.
ProtoBridge AXI consists of a computer software component and an FPGA design component. The computer software component contains Linux/Windows drivers and a set of C-API/DPI routines to perform AXI transactions. The FPGA design component contains a PCIe interface, an interconnection module and AXI transactors to be instantiated in the user's design-under-test (DUT). With these enhanced product features, users can read and write at speeds of up to 500 megabytes per second through the PCIe interface, connect 16 Master devices and 16 Slave devices on the AXI bus, and take advantage of the patent pending Shared Memory technology that links the FPGA prototype with third party design tools.
Virtex 7 SingleE Logic Module
The S2C SingleE V7 Logic Module is the industry’s smallest form-factor (260mm X 170mm), all-purpose, stand-alone prototyping system based on Xilinx’s Virtex-7 2000T FPGA. The system utilizes S2C’s 5th generation technology, can handle up to 20M gate designs and features:
Kintex 7 Logic Module
The K7 Logic Module features the largest number of user I/Os in its class, with 432 I/Os on four Dedicated I/O connectors and 16 channels of GTX transceivers on two Differential I/O connectors. The GTX transceivers are capable of running at up to 10Gbps with -2 grade FPGA devices. Users can easily download to FPGAs, generate programmable clocks, adjust I/O voltages and run self-tests on hardware from S2C's TAI Player Runtime software via a straightforward USB2.0 interface. With the S2C K7 TAI Logic Module's affordable pricing, project managers can deploy a large number of FPGA-based prototypes to accelerate hardware verification and software development in parallel.
Ideal Solution for Block-Level and Algorithm Development
Coupled with S2C’s ProtoBridge™ software that accelerates FPGA verification using co-modeling technology, the SingleE V7, and Kintex 7 Logic Modules are the perfect platform for IP and algorithm creation. Engineers are able to leverage the strengths of system-level simulation and RTL-level design accuracy, shorten design and verification time, and ensure higher product quality through improved test coverage.
Designers can achieve these goals with the ability to
Frustrating isn't it? You're using your new smartphone or tablet to view pages on the Internet, watch a video or get the latest traffic information and the mobile communications just can't handle it. You look at your screen and see a little symbol showing that the signal is dropping in and out of 2G, 3G or HSPA 3.5G connections, and then the device gives up altogether. Unfortunately this scenario is still all too common because despite having the latest applications processor, graphics and software in your phone, we still often have to rely on patchy, low data-rate wireless coverage.
But all this is changing with the advent of new 4G LTE and LTE-Advanced communications, which cellular operators are now busy deploying, and our ARM® Cortex®-R real-time processors are powering the latest wireless modem chips in your handset to deliver data faster and more reliably. Take, for example, the new Samsung Exynos Modem which has just been announced. The Exynos Modem 300 series uses a Cortex-R processor to run the 4G LTE software protocols and manage signal processing for transmitting and receiving data. In fact, the Exynos Modems aren't the only ones using a Cortex-R for this task; there are hundreds of millions of similar chips in phones and tablets already in use throughout the world.
Cortex-R processors are often hidden from view in applications like this, running underlying communications and control tasks in applications ranging from flash memory or hard disc storage to automotive braking, steering or instrument clusters. Designers choose a Cortex-R processor because its microarchitecture and memory system are specifically designed for these tasks, where many hard real-time events must be serviced within microseconds to maintain accurate control and signal processing.
However, technology marches on and the next generation of wireless modems will soon deliver even higher data rates of 300 Mbits per second or more and support so-called ‘carrier aggregation’ which lets wireless operators use a mix of different frequencies to reach all the devices connecting to a cell. This will provide even more reliable communications and it enables operators to make best use of their precious wireless spectrum allocation. Of course this requires yet more real-time processing throughput and the latest Cortex-R7 real-time processor fits the bill here, without increasing the energy consumption for battery-powered devices. Modems for this have been developed and are currently in silicon and going through the testing and approvals process which they must pass before they're allowed to connect to the cellular network. I’m looking forward to getting my next 4G phone in 2015 that will have one inside.
Thanks for reading. Chris Turner
DAC IP Track Submission Deadline January 20th
Don't miss your opportunity to deliver a compelling technical paper at the
Design Automation Conference, June 7-11, 2015.
Watch this short video by the DAC IP Track Committee Chair to learn more.
Click here to submit your paper abstract -- 100 words is all you need!