Written by Antonio Pacheco, ASIC Digital Design Engineer, Synopsys
To support today’s high volume of data, SoC designers of high-performance computing and networking applications must leverage a scalable chip-to-chip interface that delivers high data throughput while minimizing latency and managing power efficiently.
PCI Express (PCIe) is the de facto chip-to-chip connectivity standard for a wide range of applications, from high-performance CPUs, networking, and storage devices to battery-powered mobile devices. PCIe was first known as a board-level bus system in personal computers, but today, with its wider links, distributed computing capabilities, and higher data rates, PCIe enables external connectivity in SoCs for high-performance servers. This article explains the PCIe architecture and how PCIe can be used to provide external connectivity in Arm-based SoCs.
PCIe is a layered protocol consisting of a physical layer, a data link layer, and a transaction layer, as shown in Figure 1.
Figure 1: PCI Express protocol layers
The connection between two PCIe devices is referred to as a “link” and within that link are individual “lanes” – each comprised of two differential pairs moving data between the devices. The example link shown in Figure 1 has a single lane – one differential pair moving data from the transmitter (TX) output on the left device to the receiver (RX) input on the right device, and the other pair moving data using the TX from the right device to the RX of the left device.
Examining the layers from the bottom, the physical layer transmitting data converts outbound data packets into a serialized bit stream across all lanes of the link. Additional functions include:
The physical layer on the receiving side performs the reverse of those functions, with one crucial addition. Before the unscrambling function, a clock and data recovery (CDR) module searches for known symbols in the received data stream to reconstruct the clock signal. The receive path must also compensate for slight differences between the receiver’s local clock and the clock recovered from the transmitter (TX); the “elasticity buffer” absorbs these differences.
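To give a feel for how slight the clock differences handled by the elasticity buffer are, the following minimal sketch estimates how many symbols arrive before two clocks drift apart by one full symbol. The function name and the ±300 ppm offsets are illustrative assumptions, not values taken from this article.

```python
def symbols_per_slip(tx_ppm_offset: float, rx_ppm_offset: float) -> float:
    """Symbols received before the TX and RX clocks drift apart by one symbol.

    Offsets are each clock's deviation from nominal, in parts per million.
    """
    ppm_difference = abs(tx_ppm_offset - rx_ppm_offset)
    if ppm_difference == 0:
        return float("inf")  # identical clocks never drift apart
    return 1_000_000 / ppm_difference


if __name__ == "__main__":
    # Assumed worst case: one clock 300 ppm fast, the other 300 ppm slow.
    print(f"{symbols_per_slip(300, -300):.0f} symbols per one-symbol slip")
```

Under these assumptions the buffer gains or loses roughly one symbol every couple of thousand symbols, which is why a shallow buffer is enough to absorb the difference.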
The PCIe physical layer scales both in width, from one lane to as many as 32 lanes, and in speed from “Gen1” 2.5GT/s up to “Gen4” at 16GT/s, for bandwidth scaling from 250MB/s up to 64GB/s.
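The bandwidth range quoted above can be approximated from the per-lane signaling rate, the line-encoding efficiency, and the lane count. The sketch below is a rough calculation, not a specification-accurate model; the table and function names are illustrative, and the figures are per direction and ignore packet-level protocol overhead.

```python
# Per-lane signaling rate (GT/s) and line-encoding efficiency per generation.
# Gen1/Gen2 use 8b/10b encoding; Gen3/Gen4 use 128b/130b encoding.
GENERATIONS = {
    "Gen1": (2.5, 8 / 10),
    "Gen2": (5.0, 8 / 10),
    "Gen3": (8.0, 128 / 130),
    "Gen4": (16.0, 128 / 130),
}


def link_bandwidth_gbytes(gen: str, lanes: int) -> float:
    """Approximate bandwidth per direction in GB/s for a given link."""
    rate_gt, efficiency = GENERATIONS[gen]
    return rate_gt * efficiency * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen, lanes in [("Gen1", 1), ("Gen4", 32)]:
        print(f"{gen} x{lanes}: {link_bandwidth_gbytes(gen, lanes):.2f} GB/s")
```

A Gen1 x1 link works out to 0.25 GB/s (250MB/s). A Gen4 x32 link comes to about 63 GB/s, which is commonly rounded to 64GB/s by treating each Gen4 lane as 2GB/s.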
The next higher layer is the data link layer, which provides mechanisms that ensure a reliable data channel between the two linked devices. The data link layer offers many features including:
The uppermost layer in the PCIe interface is the transaction layer, where application data travels using the various transaction types shown below in Table 1. This layer extends across the entire PCIe hierarchy and, unlike the two lower layers, communicates beyond directly linked devices. The features of the transaction layer include:
Table 1: Definition of transaction types that are transported by the transaction layer
Building an Arm-based SoC with a PCI Express interface requires deep knowledge of the PCIe protocol, the Arm AMBA® protocol, ordering issues, different clocking domains, error mapping, tag management, and more. This complexity lengthens SoC development time, which designers can offset by using third-party PCIe IP. By using compliant PCIe IP that is proven in millions of devices, designers can enable external connectivity in Arm-based SoCs, reduce their time to market, and focus their attention on the rest of their SoC design. Integrating proven PCIe IP helps overcome design challenges such as:
A PCIe AXI Master must comply with the same ordering rules, so it needs ordering logic very similar to that described above. Some paths can be simpler. For example, the inbound read path does not require ordering logic as long as it does not reorder inbound reads, since a compliant AXI slave ensures Read-after-Read ordering by ordering the read data completions. To ensure compliance with the Read-after-Write rule, the Master logic could simply wait for the write response before issuing the read.
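The Read-after-Write approach described here can be sketched as a small behavioral model. This is a toy illustration, not the logic of any particular IP: the class and method names are hypothetical, and real hardware would track this per ID or per address range with finite queues.

```python
from collections import deque


class AxiMasterModel:
    """Toy model: a read is held until every earlier write has been acknowledged."""

    def __init__(self):
        self.outstanding_writes = set()  # write IDs still awaiting a response
        self.pending_reads = deque()     # (read ID, writes it is waiting on)
        self.issued = []                 # order in which requests reach the AXI bus

    def inbound_write(self, wid):
        # Writes are issued immediately and tracked until their response returns.
        self.outstanding_writes.add(wid)
        self.issued.append(("W", wid))

    def inbound_read(self, rid):
        # Snapshot the writes that arrived before this read.
        waiting_on = set(self.outstanding_writes)
        if waiting_on:
            self.pending_reads.append((rid, waiting_on))
        else:
            self.issued.append(("R", rid))

    def write_response(self, wid):
        self.outstanding_writes.discard(wid)
        for _, waiting_on in self.pending_reads:
            waiting_on.discard(wid)
        # Release reads strictly from the front so reads are never reordered.
        while self.pending_reads and not self.pending_reads[0][1]:
            rid, _ = self.pending_reads.popleft()
            self.issued.append(("R", rid))


if __name__ == "__main__":
    m = AxiMasterModel()
    m.inbound_write("w0")
    m.inbound_read("r0")      # held: w0 has not been acknowledged yet
    m.write_response("w0")    # r0 is released here
    print(m.issued)
```

In this sketch a read never reaches the bus ahead of a write that preceded it, and reads keep their arrival order, which is what lets the read path rely on the AXI slave for Read-after-Read ordering.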
Another consideration for SoC designers is where to place their DMA (Direct Memory Access) engine(s). While it’s possible to use an off-the-shelf DMA engine communicating solely over the AMBA interconnect, there are limitations to such an architecture. To get the maximum performance, the DMA engine needs to understand both AMBA and PCIe. Consider a system where the AMBA burst size is smaller than the PCIe maximum payload size. AMBA bursts generated by a DMA engine on the AMBA interconnect will translate to smaller-than-optimal PCIe packets. Placing the DMA engine inside the PCIe controller allows for aggregation, where the DMA engine collects several AMBA bursts into a single PCIe packet to optimize PCIe bandwidth and utilization. The resulting reduction in the overall number of transactions can also pay dividends in power consumption and efficiency per byte.
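The benefit of aggregation can be illustrated with some simple arithmetic. The sketch below compares packet counts and payload efficiency for a transfer sent as one PCIe packet per AMBA burst versus packets filled to the maximum payload size. The 64-byte burst, 256-byte maximum payload, and 24-byte per-packet overhead are illustrative assumptions; the real overhead depends on header size and PCIe generation.

```python
import math

# Assumed per-packet overhead in bytes (header, sequence number, LCRC, framing).
TLP_OVERHEAD_BYTES = 24


def packet_count(total_bytes: int, payload_bytes: int) -> int:
    """Number of PCIe packets needed to move total_bytes at a given payload size."""
    return math.ceil(total_bytes / payload_bytes)


def link_efficiency(payload_bytes: int, overhead_bytes: int = TLP_OVERHEAD_BYTES) -> float:
    """Fraction of bytes on the link that carry payload rather than overhead."""
    return payload_bytes / (payload_bytes + overhead_bytes)


if __name__ == "__main__":
    transfer = 4096     # bytes to move (assumed)
    amba_burst = 64     # AMBA burst size in bytes (assumed)
    max_payload = 256   # PCIe maximum payload size in bytes (assumed)

    for label, payload in [("per-burst", amba_burst), ("aggregated", max_payload)]:
        print(f"{label}: {packet_count(transfer, payload)} packets, "
              f"{link_efficiency(payload):.0%} payload efficiency")
```

With these assumed numbers, aggregation cuts the packet count from 64 to 16 and raises payload efficiency from roughly 73% to roughly 91%.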
PCIe has emerged as the standard of choice for chip-to-chip connectivity between high-performance processors like Arm’s and other devices. However, integrating the PCIe interface into an SoC can be challenging without deep knowledge of the PCIe and AMBA interface protocols. Designers can overcome these challenges by leveraging optimized PCIe IP that is designed to deal with the nuances of bridging between PCIe and AMBA while also including the latest features of the PCIe protocol. Synopsys’ DesignWare IP for PCI Express to Arm AMBA Bridge is a configurable and scalable solution that meets the needs of a wide range of high-bandwidth, low-latency, and low-power applications. It has been proven in over 1500 designs and production-proven in millions of units, allowing designers to integrate the IP into their SoCs with confidence. The IP offers numerous advantages including:
For further information, visit the DesignWare IP for PCI Express websites below.
Visit DesignWare IP for PCI Express website
Just a curious question from a SW engineer: can current SoCs (e.g. imx6) be directly interconnected via PCIe?
What would the CPU have to do to send/receive transactions?
I'm looking for an interconnect for NUMA / asymmetric multiprocessing with a bundle of SoCs on a board.