In Part 1 of this blog series (found here ) we introduced the ARM CoreLinkTM CCI-500 Cache Coherent Interconnect and described some of the new configurable features which are available over and above what was available with the previous generation CoreLinkTM CCI-400. We described how a UVM testbench is needed in order to start exploring the enhanced performance capabilities that are on offer, and we introduced an automation tool, Interconnect Workbench, which removes the need for manual testbench creation.
In Part 2 of the blog we start by exploring how CoreLink CCI-500 performs in a CoreLink CCI-400 like configuration and follow that by showing the full performance potential of CoreLink CCI-500 when configured for maximum performance.
As a first experiment we have configured CoreLink CCI-500 to have 2x ACE input ports, 3x ACELite input ports, 2x memory ports and 1 system port, this matches the fixed configuration of the previous generation of Cache Coherent Interconnect, the CoreLink CCI-400. We have then created a scenario which drives saturating transactions into all of the input ports targeting the two memory ports. The transactions are all defined as Non-shareable so we eliminate the effect of L2 Cache Snoops and see just the raw throughput. Running the testbench at 500MHz also provides a useful point of reference as many CoreLink CCI-400 designs are run at this speed.
As in Part 1, we can easily generate a testbench for the CoreLink CCI-500 configurations, a diagram of the generated UVM testbench is shown below.
The generated testbench contains all the necessary instances of fully configured AMBA VIP needed along with an instance of the Interconnect Validator VIP connected to all of the interface VIP. This additional VIP provides full system scoreboard functionality to support transaction tracking from entry point to its exit point including all the necessary coherency modelling need for ACE. In addition it also captures all necessary timing details needed for performance analysis, the graphs shown in this blog all come from this source.
It is also important to note that all the Slave VIP instances which model the memory in the system are configured to have zero delay in order that we only see effective delays and bandwidth of the CoreLink CCI-500.
As can be seen from the chart the CCI-500 delivers around 14GB/s of both READ and WRTE bandwidth.
The big benefit of CoreLink CCI-500 over its predecessor is the capability to support two additional memory ports over and above the two supported by CoreLink CCI-400, in addition the design can be targeted to run at 667Mhz in appropriate technologies. The figure below shows the same non-shareable saturating test running on the CoreLink CCI-500 configuration described previously, i.e. with two memory ports at 500Mhz (the red lines) with a CoreLink CCI-500 configuration with four memory ports running at 667Mhz (the blue lines). The graphs show both read and write bandwidth for the two implementations and the improvement in bandwidth is clearly shown, around 14GB/s vs 33 GB/s of bandwidth for both read and write traffic.
One of the interesting phenomena we see in the simulations is that there is a delay between simulation startup and high bandwidth levels being achieved despite the fact that all masters are trying to make saturating memory accesses from the start. This is caused by the need for the Snoop Filter (more blogs are coming to explain the Snoop Filter) to initialize its RAM, the configuration we have chosen for this test is the “Large” configuration of CoreLink CCI-500 with four memory ports (blue graphs) and also 4x ACE ports (compared to 2x in the 500Mhz case). To support more ACE ports the Snoop Filter RAM is larger and hence takes longer to initialize.
In order to understand how well CoreLink CCI-500 handles demanding scenarios it is useful to visualize how much it is stalling the requesting masters, sometimes this is called “back pressure”. A useful proxy for “back-pressure” in AMBA infrastructure is the concept of Outstanding Transactions. The AMBA ACE, ACE-Lite, AXI4 and AXI3 protocols all support multiple transaction issuing assuming that the receiving interface can support it.
As multiple transactions get issued into the system the number of Outstanding Transactions (OT), i.e. transactions that have been initiated but are incomplete, increases. The OT level will increase until the receiving interface (in this case the CoreLink CCI-500) throttles it, this is generally called the read_acceptance or write_acceptance limit. If we look at the chart below it shows WRITE bandwidth in red (we saw this on the last chart) plotted with WRITE OT Level, in blue, for all of the initiating masters combined. As the system is stalled waiting for the Snoop Filter RAM to be initialized we can see the OT level is flat, but once the RAM initialization is complete the CoreLink CCI-500 starts trying to balance the requesting masters.
After a brief peak of nearly 100 Outstanding Transactions the OT level starts to generally decrease to settle at around the 65 OT level plus or minus around 5 OT. We could run the simulation for longer to understand if this is steady state.
If you happen to be attending DVCon in a week's time I will be jointly presenting alongside Simon Rance from ARM at Session 6.3 on Tuesday 3rd March 2015, we would be very happy to chat about this or other ARM system performance topics.
Event Details | DVCon
Watch out for more parts of this blog in which we will further explore key features of CoreLink CCI-500.
Exploring the ARM CoreLinkTM CCI-500 performance envelope - Part 1