Exploring the ARM CoreLink™ CCI-500 performance envelope – Part 2

February 22, 2015

Introduction

In Part 1 of this blog series we introduced the ARM CoreLink™ CCI-500 Cache Coherent Interconnect and described some of the new configurable features it offers over and above the previous-generation CoreLink™ CCI-400. We described how a UVM testbench is needed in order to start exploring the enhanced performance capabilities on offer, and we introduced an automation tool, Interconnect Workbench, which removes the need for manual testbench creation.

In Part 2 we start by exploring how CoreLink CCI-500 performs in a CoreLink CCI-400-like configuration, and follow that by showing the full performance potential of CoreLink CCI-500 when configured for maximum performance.

CoreLink CCI-500 as a CoreLink CCI-400 replacement

As a first experiment we configured CoreLink CCI-500 with 2x ACE input ports, 3x ACE-Lite input ports, 2x memory ports and 1x system port. This matches the fixed configuration of the previous generation of Cache Coherent Interconnect, the CoreLink CCI-400. We then created a scenario which drives saturating transactions into all of the input ports, targeting the two memory ports. The transactions are all defined as Non-shareable, so we eliminate the effect of L2 cache snoops and see just the raw throughput. Running the testbench at 500MHz also provides a useful point of reference, as many CoreLink CCI-400 designs run at this speed.

As in Part 1, we can easily generate a testbench for the CoreLink CCI-500 configurations. A diagram of the generated UVM testbench is shown below.

The generated testbench contains all the necessary instances of fully configured AMBA VIP, along with an instance of the Interconnect Validator VIP connected to all of the interface VIP. This additional VIP provides full system scoreboard functionality, tracking each transaction from its entry point to its exit point and including all the coherency modelling needed for ACE. It also captures all the timing details needed for performance analysis. The graphs shown in this blog all come from this source.

It is also important to note that all the Slave VIP instances which model the memory in the system are configured to have zero delay, so that we see only the delays and bandwidth of the CoreLink CCI-500 itself.

As can be seen from the chart, the CCI-500 delivers around 14GB/s of both READ and WRITE bandwidth.
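As a rough illustration of how a bandwidth curve like this can be derived from captured timing data, the sketch below bins completed transactions into fixed time windows and converts each window's byte count into GB/s. It is only an assumption about the general approach, not a description of how Interconnect Workbench or the Interconnect Validator VIP compute their graphs. The function name `windowed_bandwidth`, the record format and the sample log are all hypothetical.

```python
from collections import defaultdict


def windowed_bandwidth(completions, window_ns):
    """Return {window_start_ns: bandwidth_in_GB_per_s}.

    `completions` is an iterable of (time_ns, num_bytes) records
    (a hypothetical log format). Each record is attributed to the
    window in which the transaction completed.
    """
    bytes_per_window = defaultdict(int)
    for time_ns, num_bytes in completions:
        window_start = (time_ns // window_ns) * window_ns
        bytes_per_window[window_start] += num_bytes
    # Bytes per nanosecond is numerically equal to GB/s (1 GB = 1e9 bytes).
    return {
        start: total / window_ns
        for start, total in sorted(bytes_per_window.items())
    }


# Hypothetical log: one 64-byte transaction completes every 4 ns.
log = [(t, 64) for t in range(0, 2000, 4)]
profile = windowed_bandwidth(log, window_ns=1000)
for start, gbps in profile.items():
    print(f"window starting at {start} ns: {gbps:.1f} GB/s")
```

The window size is a trade-off: short windows show start-up effects clearly but give a noisier curve, while long windows smooth the curve and can hide brief stalls.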

Unleashing the full CoreLink CCI-500 performance

The big benefit of CoreLink CCI-500 over its predecessor is the capability to support two additional memory ports over and above the two supported by CoreLink CCI-400. In addition, the design can be targeted to run at 667MHz in appropriate technologies. The figure below compares the same non-shareable saturating test running on the CoreLink CCI-500 configuration described previously, i.e. with two memory ports at 500MHz (the red lines), with a CoreLink CCI-500 configuration with four memory ports running at 667MHz (the blue lines). The graphs show both read and write bandwidth for the two implementations, and the improvement in bandwidth is clear: around 14GB/s vs 33GB/s for both read and write traffic.
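To put these figures in context, a back-of-the-envelope calculation of the theoretical peak bandwidth per direction can help. The sketch below assumes each memory port has a 128-bit (16-byte) data bus and transfers one data beat per clock cycle. The port width is an assumption, as it is not stated in this post, and the helper name `peak_bandwidth_gbps` is hypothetical.

```python
def peak_bandwidth_gbps(num_ports, clock_mhz, bus_bytes=16):
    """Theoretical peak bandwidth per direction in GB/s.

    Assumes one data beat per port per clock cycle and a data bus of
    `bus_bytes` bytes (16 bytes = 128 bits is an assumption, not a
    figure taken from the post).
    """
    return num_ports * bus_bytes * clock_mhz * 1e6 / 1e9


# Configurations from the text: 2 ports at 500MHz and 4 ports at 667MHz.
small_peak = peak_bandwidth_gbps(num_ports=2, clock_mhz=500)
large_peak = peak_bandwidth_gbps(num_ports=4, clock_mhz=667)

# Approximate measured bandwidths quoted in the text.
small_measured = 14.0
large_measured = 33.0

print(f"2 ports @ 500MHz: peak {small_peak:.1f} GB/s, "
      f"measured is {small_measured / small_peak:.0%} of peak")
print(f"4 ports @ 667MHz: peak {large_peak:.1f} GB/s, "
      f"measured is {large_measured / large_peak:.0%} of peak")
```

Under that assumption, the measured 14GB/s and 33GB/s both sit below their respective theoretical limits, which is what we would expect from a saturating test.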

One of the interesting phenomena we see in the simulations is a delay between simulation startup and high bandwidth levels being achieved, despite the fact that all masters are trying to make saturating memory accesses from the start. This is caused by the need for the Snoop Filter (more blogs are coming to explain the Snoop Filter) to initialize its RAM. The configuration we have chosen for this test is the “Large” configuration of CoreLink CCI-500, with four memory ports (blue graphs) and 4x ACE ports (compared to 2x in the 500MHz case). To support more ACE ports the Snoop Filter RAM is larger, and hence takes longer to initialize.

Managing Bandwidth Requests

In order to understand how well CoreLink CCI-500 handles demanding scenarios it is useful to visualize how much it is stalling the requesting masters. This is sometimes called “back-pressure”. A useful proxy for back-pressure in AMBA infrastructure is the concept of Outstanding Transactions. The AMBA ACE, ACE-Lite, AXI4 and AXI3 protocols all allow a master to issue multiple transactions, provided that the receiving interface can support it.

As multiple transactions are issued into the system, the number of Outstanding Transactions (OT), i.e. transactions that have been initiated but are not yet complete, increases. The OT level will increase until the receiving interface (in this case the CoreLink CCI-500) throttles it. This limit is generally called the read_acceptance or write_acceptance limit. The chart below shows WRITE bandwidth in red (we saw this on the last chart) plotted with the WRITE OT level, in blue, for all of the initiating masters combined. While the system is stalled waiting for the Snoop Filter RAM to be initialized the OT level is flat, but once the RAM initialization is complete the CoreLink CCI-500 starts trying to balance the requesting masters.

After a brief peak of nearly 100 Outstanding Transactions, the OT level gradually decreases and settles at around 65, plus or minus about 5. We could run the simulation for longer to confirm whether this is a steady state.
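The OT level itself is simple to derive once each transaction's issue and completion times are known. The sketch below is a minimal illustration of the idea, assuming a hypothetical log of (start, end) timestamps per transaction. It is not taken from the Interconnect Validator VIP, and the function name `outstanding_levels` is made up for this example.

```python
def outstanding_levels(transactions):
    """Return [(time, level)] steps from (start, end) transaction records.

    `level` is the number of transactions that have been issued but
    not yet completed at `time`.
    """
    events = []
    for start, end in transactions:
        events.append((start, +1))  # transaction issued
        events.append((end, -1))    # transaction completed
    # Sort by time; at equal times, completions (-1) sort before issues (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    level = 0
    steps = []
    for time, delta in events:
        level += delta
        if steps and steps[-1][0] == time:
            steps[-1] = (time, level)  # merge events with the same timestamp
        else:
            steps.append((time, level))
    return steps


# Hypothetical transactions as (issue_time, completion_time) pairs.
txns = [(0, 10), (2, 12), (4, 8), (10, 15)]
steps = outstanding_levels(txns)
peak = max(level for _, level in steps)
print(steps)
print("peak OT level:", peak)
```

Summing the events from all masters, as here, gives the combined OT level. Keeping a separate list per master would show which masters are being throttled most.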

If you happen to be attending DVCon in a week's time, I will be jointly presenting alongside Simon Rance from ARM at Session 6.3 on Tuesday 3rd March 2015. We would be very happy to chat about this or other ARM system performance topics.


Watch out for more parts of this blog in which we will further explore key features of CoreLink CCI-500.


