At this year's ARM® TechCon™, Carbon Design Systems did a joint presentation with ARM entitled "Getting the Most out of the ARM CoreLink™ NIC-400." In this blog, I'll give a high level overview of what we presented and also give an opportunity to download the whitepaper which gets into much more detail.
The ARM CoreLink NIC-400 Network Interconnect is an extremely versatile piece of IP capable of maximizing performance for high-throughput applications, minimizing power consumption for mobile devices, and guaranteeing a consistent quality of service. To achieve this versatility, the IP is highly configurable. To help you find the right configuration for your particular application, we introduce the concept of virtual prototyping: modeling the NIC-400 and driving and analyzing traffic through it. This paper highlights the key features of the NIC-400 as well as a methodology, based on real system traffic and software, for making the best design decisions.
The CoreLink NIC-400 Network Interconnect provides a highly configurable network of interconnect switches and bridges to connect up to 128 AXI or AHB-Lite masters to up to 64 AXI, AHB-Lite or APB slaves in a single NIC-400 instantiation.
The NIC-400 is the 4th generation AXI interconnect from ARM and is delivered as a base product of AMBA AXI3 and/or AXI4 interconnect with three optional, license-managed advanced features: QoS-400 Advanced Quality of Service to dynamically regulate traffic entering the network; QVN-400 QoS Virtual Networks to prevent blocking at arbitration points; and TLX-400 Thin Links to reduce routing congestion and ease timing closure for long paths.
The designer can select the topology of the network of switches to increase the efficiency of the interconnect in many ways. Traffic streams from multiple masters can be combined to increase wire utilization, reducing congestion. Grouping of masters by location can shorten paths between IP blocks and switches. Dividing large switches into multiple smaller switches can allow increased frequencies and provide low latencies for critical paths such as between CPU and DDR main memory.
Each switch can be given the appropriate data width and clock frequency to meet performance targets while minimizing power and area. Different switches can be placed in different domains, allowing hierarchical clock gating to reduce power whenever a domain is idle.
The NIC-400 switches and bridges support both AMBA AXI3 and the new AMBA 4 AXI4, with fewer wires (no WIDs) and enhanced streaming performance (longer burst support). All bridging between AXI4 and AXI3 is handled seamlessly by the NIC-400. APB support is extended to AMBA 4 APB4 with new write strobes and TrustZone signalling.
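The width versus frequency trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the simplest possible bound (one full-width data beat per clock cycle); it ignores arbitration, protocol overhead and idle cycles, and the switch names and numbers are hypothetical rather than taken from any real NIC-400 configuration.

```python
def peak_bandwidth_bytes_per_s(data_width_bits: int, clock_hz: float) -> float:
    """Upper bound for one data channel: one full-width beat per clock cycle."""
    return data_width_bits / 8 * clock_hz


# Hypothetical candidate settings for a single switch: (name, width in bits, clock in Hz).
CANDIDATES = [
    ("narrow_fast", 64, 400e6),
    ("wide_slow", 128, 200e6),
    ("wide_fast", 128, 400e6),
]


def candidates_meeting(target_bytes_per_s: float):
    """Return the names of the candidates whose peak bandwidth reaches the target."""
    return [
        name
        for name, width, clock in CANDIDATES
        if peak_bandwidth_bytes_per_s(width, clock) >= target_bytes_per_s
    ]


if __name__ == "__main__":
    for name, width, clock in CANDIDATES:
        gb_per_s = peak_bandwidth_bytes_per_s(width, clock) / 1e9
        print(f"{name}: {gb_per_s:.1f} GB/s peak")
```

In this sketch the 64-bit switch at 400 MHz and the 128-bit switch at 200 MHz reach the same peak figure, so the choice between them would rest on area, power and timing closure rather than on bandwidth alone.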
Designers use the easy-to-use AMBA Designer tool to configure the NIC-400. The user starts by defining the masters and slaves in the system, specifying the bus protocol, bus width and clock each uses, and filling in a matrix of the required master-slave connectivity. The designer then sets up the address maps for each master and a global address map as required, along with any remap options.
Finally, the designer establishes the topology of the connectivity between all masters and all slaves in the implementation view. This allows the designer to minimize latency between critical masters and slaves, e.g. between the CPU and main memory, and to group components together to save wires and gates. The GUI then allows the designer to select configuration options for buffer depths, registering options, QoS regulators and other characteristics. The tool will then automatically instantiate the required master/slave interfaces, switch matrices and bridges in RTL and generate an accompanying IP-XACT description.
One of the challenges in creating an accurate model of the NIC-400 is managing its configurability. Given the possible matrix of connections, connection types, protocols, QoS settings, etc., there are billions of possible unique models. Once you’ve configured the interconnect using AMBA Designer, you can upload the IP-XACT file to Carbon IP Exchange, which will then compile the model for you and make it available for download. Once compiled, the model may be managed using the same portal, and the IP-XACT file is even stored there in case you want to come back and modify the model later or assign it to a team member.
Once created, the NIC-400 model can easily be executed in tandem with traffic generators to quickly start gathering data. This approach, using traffic generators to both produce and consume traffic, is not targeted at verifying the NIC-400, since that is assumed to be correct. Instead, it is targeted at measuring the performance characteristics of the model as it is exercised. Traffic can be parameterized to mimic the behavior of system IP and to sweep across a range of options including burst length, priority, address, etc.
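The information entered at this stage can be pictured as two small tables: a connectivity matrix and a per-master address map. The sketch below is a conceptual model only. It is not AMBA Designer's file format or API, and every name and address in it (`cpu`, `dma`, `ddr`, `ethernet`, the base addresses and sizes) is a hypothetical value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One address range in a master's address map, routed to a single slave."""
    base: int
    size: int
    slave: str

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Hypothetical connectivity matrix: which slaves each master may reach.
CONNECTIVITY = {
    "cpu": {"ddr", "ethernet"},
    "dma": {"ddr"},
}

# Hypothetical per-master address maps.
ADDRESS_MAP = {
    "cpu": [
        Region(0x0000_0000, 0x4000_0000, "ddr"),
        Region(0x4000_0000, 0x0001_0000, "ethernet"),
    ],
    "dma": [
        Region(0x0000_0000, 0x4000_0000, "ddr"),
    ],
}


def check_address_map(connectivity, address_map):
    """Return a list of problems: unconnected targets or overlapping regions."""
    problems = []
    for master, regions in address_map.items():
        for region in regions:
            if region.slave not in connectivity.get(master, set()):
                problems.append(f"{master}: region targets unconnected slave {region.slave}")
        ordered = sorted(regions, key=lambda r: r.base)
        for first, second in zip(ordered, ordered[1:]):
            if first.base + first.size > second.base:
                problems.append(f"{master}: regions for {first.slave} and {second.slave} overlap")
    return problems


def decode(master: str, addr: int):
    """Return the slave a given master's address falls in, or None if unmapped."""
    for region in ADDRESS_MAP[master]:
        if region.contains(addr):
            return region.slave
    return None
```

A consistency check of this kind, run before any model is built, catches a region that points at a slave the master is not connected to, or two regions in one master's map that overlap.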
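A parameter sweep of this kind amounts to enumerating the Cartesian product of the option lists. The following is a minimal sketch of that enumeration; the parameter names and values are assumptions made for illustration and do not correspond to the configuration interface of any particular traffic generator.

```python
from itertools import product

# Hypothetical sweep space for one traffic generator.
SWEEP = {
    "burst_length": [1, 4, 8, 16],
    "priority": [0, 8, 15],
    "base_address": [0x0000_0000, 0x4000_0000],
}


def sweep_points(space):
    """Yield one dict per combination of the parameter values in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    points = list(sweep_points(SWEEP))
    print(f"{len(points)} traffic configurations to simulate")
    print(points[0])
```

Each generated dictionary would describe one simulation run. The number of runs grows multiplicatively with every added parameter, so it helps to keep each list short.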
In the example shown here, a simple NIC-400 is configured with two masters and two slaves. The masters are set up to mimic the data loads from a CPU and a DMA controller, and the dummy targets are an Ethernet MAC and a DDR3 memory controller. Of course, since the traffic generators are quite configurable, it’s possible to model any number of different sources or targets, and we’ll get more into that in a bit. Note, though, that traffic can be analyzed on any of the connections. The graphs shown here track the latency on the CPU interface and the queues in the DDR controller. The exact metrics of interest will of course vary with the design targets of the system in question. It’s also beneficial to correlate data across multiple analysis windows and indeed even across multiple runs.
The important thing we’ve done here is establish a framework for gathering quantitative data on the performance of the NIC-400 so we can track how well it meets the requirements. Analyzing the results will likely lead to reconfiguration, recompilation and re-simulation. It’s not unheard of to iterate through hundreds of design possibilities with only slight changes in parameters. It’s important to vary the traffic parameters as well as the NIC parameters, however, since, as we’ll see later, the true measure of the NIC-400, and really of all interconnect IP, is how it impacts the behavior of the entire system.
The traffic generator approach can do a great job of getting fast, accurate results even when there is nothing more than the NIC-400 and a few sources and consumers. It can be set up quickly and enables easy testing of a wide range of system possibilities, including corner cases which might be difficult to set up with real IP. It does have drawbacks, however. Ultimately, no matter how much time you spend assembling your traffic generation schemes, they don’t reflect the actual behavior of the system running real software, and this behavior can vary greatly depending upon the system software and IP configuration. Even a slight reordering of system software calls can have a big impact on overall system performance. Therefore, although it’s great to use traffic generators to get a good approximation of the system performance, the best way to validate that the system will meet your performance targets is to actually assemble the system. Thankfully, this is also possible using virtual prototypes.
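Tracking how well each iteration meets the requirements comes down to reducing every run to a few comparable numbers. The sketch below assumes that each simulation run yields a list of per-transaction latencies in cycles. The run names, the sample values and the 25-cycle target are all invented for illustration and are not measurements of a NIC-400.

```python
import math
from statistics import mean


def summarize(latencies_cycles):
    """Reduce one run's latency samples to mean, 95th percentile and maximum."""
    ordered = sorted(latencies_cycles)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile
    return {"mean": mean(ordered), "p95": ordered[rank - 1], "max": ordered[-1]}


# Hypothetical latency samples (in cycles) from two configurations.
RUNS = {
    "baseline": [12, 14, 13, 40, 12, 15, 13, 12, 14, 13],
    "deeper_buffers": [12, 13, 13, 22, 12, 14, 13, 12, 13, 13],
}

TARGET_P95_CYCLES = 25  # hypothetical requirement


def passing_runs(runs, target_p95):
    """Return the names of the runs whose 95th-percentile latency meets the target."""
    return sorted(
        name for name, samples in runs.items()
        if summarize(samples)["p95"] <= target_p95
    )


if __name__ == "__main__":
    for name, samples in RUNS.items():
        print(name, summarize(samples))
    print("meets target:", passing_runs(RUNS, TARGET_P95_CYCLES))
```

Keeping a summary like this for every run makes it straightforward to compare hundreds of configurations and to see which parameter changes actually move the metric that matters.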
A system level virtual prototype gives you a much more realistic view of what’s going on inside the system. This is obviously extremely important to handle cases which your traffic generator may not model as correctly as the real IP such as ordering, arbitration and number of outstanding transactions. Another item which is extremely important is coherency. While most of the coherent traffic in the system will be handled by the CCI/CCN IP it will still have an impact on system performance and a few software calls can greatly impact the system traffic in order to maintain this coherency.
Fundamentally, software can cause a dramatic impact on the performance of the overall system and getting this software up and running with the real hardware will enable both of these to be optimized in advance of actual silicon. A prime example of this is with system level benchmarks. These benchmarks are often used to market the IP once it has been finalized. Leading edge design teams will use these benchmarks during the design process of the SoC to drive traffic in the system but also to tweak the settings of the IP to maximize performance. This helps ensure that the actual silicon will meet the marketing specifications.
This is an example system level virtual prototype. In this case, the ARM Cortex-A57 is featured as the main processor in the system. There’s a CCI-400 and multiple DMC-400s to handle most memory accesses, and the NIC-400 hangs off the CCI to manage accesses to the rest of the system. This system is fully capable of booting Linux and then running a variety of system level benchmarks. Obviously we’d need to add a few more components, most notably a GPU, to model the capabilities of a leading edge SoC, but even this simple system model is sufficient to optimize the performance of compute oriented benchmarks and maximize the performance of the processor/memory subsystem. We’re using a parameterizable memory here on the NIC-400 to give some system level flexibility for additional components. Most systems will use multiple memories, one for each modeled component, or, if desired, the actual IP models can be used. Since it can boot an OS, this system level virtual prototype is valuable not only for validating the performance of the various system components but also for enabling pre-silicon firmware development.
Using the system model described in the CPAK pictured above, it is possible to execute real system software, boot an OS and execute system level benchmarks to enable optimization of the NIC-400 as well as the other components in the system.
The ARM CoreLink NIC-400 raises the bar on system level performance while also providing the mechanisms to reduce area and power consumption. These new features, coupled with the extensive configurability of any interconnect block, provide a wide array of design choices which can be made as the IP is incorporated into an SoC design. A two-step approach using accurate virtual prototypes can give designers confidence in their ability to make design decisions using approximated traffic models and then validate their results by running actual system software.
The whitepaper from which this blog post was derived is 13 pages long and goes into substantially more detail on the topics presented here. It's available for download and I'd be happy to answer any questions which you may still have after reading it.
Is the whitepaper still available? The link does not seem to work.
How does the NIC-400 direct traffic from a master to slaves?
Master A, Slave B, and Slave C are connected to NIC400.
Master A can send traffic to Slave B and Slave C.
The addresses used for these transactions are overlapping.
How does NIC400 distinguish and route the traffic from Master A to Slave B or Slave C?
Is there any other AXI attribute which can help to distinguish?