You may have noticed the ARM announcement last week of a group of Premium Mobile products (if not you can find it here ARM Sets New Standard for the Premium Mobile Experience - ARM) covering a new core processor IP, new GPU IP and a new Interconnect IP. While the headlines may belong to the Core and GPU announcements I want to focus on the new Cache Coherent Interconnect, the ARM CoreLink CCI-500. The real-world system performance of any mobile SoC is really determined by the choice of DDR technology and how effectively the SoC architecture can squeeze every last drop of performance out of that DDR.
In this multi-part blog I want to explore how early performance exploration enables users of the CCI-500 to gain valuable insight into the configurability that CCI-500 brings, how the IP behaves under different loading conditions and therefore enable architects, implementers and verification engineers to be better prepared for projects which plan to include CCI-500 in their SoCs.
One of the key differences between the previous generation of CCI and the latest is in the configurability of the IP. CCI-500 allows users to more effectively tune the interconnect to match the needs of a broad range of mobile SoCs. For example the number of coherent clusters supported by the AMBA ACE protocol has increased to a maximum of 4 but can also be reduced to 1 for smaller applications. The following table provides a few example CCI-500 configurations that might be commonly used.
This configurability obviously allows better matching of CCI configuration with target SoC requirements, however it poses a number of questions. For example what memory bandwidth can a given configuration support? Will adding more ports get me the performance I need? What impact does changing the configuration have?
Exploring these configuration options in a meaningful way requires accurate measurements of the RTL performance of the IP. This is exactly the kind of challenge that the Cadence Interconnect Workbench was architected to address.
Taking the Large example from the previous table the following diagram illustrates the UVM testbench features which are required to start cycle-accurate performance exploration.
As can be seen the testbench needed comprises a number of instances of AMBA VIP to drive each of the 14 interfaces as well as a system scoreboard called Interconnect Validator which tracks transactions through their life-cycle. In addition, test sequences are required to define the shape of AMBA traffic to be injected into the CCI-500 configuration.
The power of Interconnect Workbench is that this potentially tedious, time-consuming and error-prone testbench creation task can be completely automated through a simple spreadsheet.
Below is shown a spreadsheet for a simple configuration (the "small" example in the table) from which a fully working, automatically generated, UVM testbench can be created in a matter of minutes. This simple example is chosen simply to make viewing it in this blog less of an eye test. We have created and tested templates for all the examples listed in the table.
As can be seen, creating this spreadsheet is a much simpler task than writing the 10’s of 1000’s lines of SystemVerilog code by hand.
In the next part of the blog I will present some of the performance results than can be easily extracted using the automated testbench and how different CCI setups can be readily compared to ensure the correct configuration is identified early in your project.
Exploring the ARM CoreLink CCI-500 performance envelope – Part 2