Leveraging PCI Express to Enable External Connectivity in Arm-Based SoCs

August 21, 2017

Written by Antonio Pacheco, ASIC Digital Design Engineer, Synopsys

To support today’s high volume of data, SoC designers of high-performance computing and networking applications must leverage a scalable chip-to-chip interface that delivers high data throughput while minimizing latency and managing power efficiently.

PCI Express (PCIe) is the de facto chip-to-chip connectivity standard for a wide range of applications, from high-performance CPUs, networking, and storage devices to battery-powered mobile devices. PCIe was first known as a board-level bus system in personal computers, but today, with its wider links, distributed computing capabilities, and higher data rates, PCIe enables external connectivity in SoCs for high-performance servers. This article explains the PCIe architecture and how PCIe can be used to provide external connectivity in Arm-based SoCs.

PCIe Architecture

PCIe is a layered protocol consisting of a physical layer, a data link layer, and a transaction layer, as shown in Figure 1.

Figure 1: PCI Express protocol layers

The connection between two PCIe devices is referred to as a “link” and within that link are individual “lanes” – each comprised of two differential pairs moving data between the devices. The example link shown in Figure 1 has a single lane – one differential pair moving data from the transmitter (TX) output on the left device to the receiver (RX) input on the right device, and the other pair moving data using the TX from the right device to the RX of the left device.

Examining the layers from the bottom, the physical layer on the transmitting side converts outbound data packets into a serialized bit stream across all lanes of the link. Additional functions include:

Buffering and adding physical layer control information such as headers and/or ordered sets to identify the data type
Byte striping to distribute the individual transmitted bytes across all the lanes in the link
Scrambling to randomize data patterns to keep repetitive data from causing the link to emit frequency tones or cross-talk
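
The byte striping step can be illustrated with a minimal Python sketch. The function names `stripe` and `destripe` are hypothetical, and the model ignores framing, ordered sets, and padding; it only shows how consecutive bytes are spread across the lanes and recombined at the receiver.

```python
def stripe(data: bytes, num_lanes: int) -> list[bytes]:
    """Distribute bytes round-robin: byte i goes to lane i % num_lanes."""
    return [data[lane::num_lanes] for lane in range(num_lanes)]


def destripe(lanes: list[bytes]) -> bytes:
    """Receiver side: interleave the per-lane streams back into one stream."""
    out = bytearray()
    for i in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return bytes(out)


packet = bytes(range(8))
lanes = stripe(packet, num_lanes=4)   # lane 0 carries bytes 0 and 4, lane 1 carries 1 and 5, ...
restored = destripe(lanes)            # equal to the original packet
```

On a four-lane link, each lane in this sketch carries every fourth byte, which is why wider links raise throughput without raising the per-lane signaling rate.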

The physical layer on the receiving side performs the reverse of those functions, with one crucial addition. Before the unscrambling function, a clock and data recovery (CDR) module searches for known symbols in the received data stream to reconstruct the clock signal. The receive path must also compensate for slight differences between the clock recovered from the transmitter (TX) and the receiver's own clock; the “elasticity buffer” absorbs these differences.

The PCIe physical layer scales both in width, from one lane to as many as 32 lanes, and in speed, from “Gen1” at 2.5GT/s up to “Gen4” at 16GT/s, for bandwidth scaling from about 250MB/s on a single Gen1 lane up to roughly 64GB/s on 32 Gen4 lanes, in each direction.
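
The bandwidth range quoted above can be checked with a short calculation. This is a rough sketch: the function name `link_bandwidth_gbytes` is hypothetical, and the only overhead modeled is line encoding (8b/10b for Gen1 and Gen2, 128b/130b for Gen3 and Gen4), not packet headers or flow-control traffic.

```python
# Per-lane signaling rate in GT/s and line-encoding efficiency per generation.
GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Approximate usable bandwidth in GB/s, per direction."""
    rate_gt, efficiency = GENERATIONS[gen]
    return rate_gt * efficiency * lanes / 8  # 8 bits per byte


low = link_bandwidth_gbytes(gen=1, lanes=1)    # 0.25 GB/s, i.e. 250MB/s
high = link_bandwidth_gbytes(gen=4, lanes=32)  # about 63 GB/s after encoding
```

The upper figure lands slightly below 64GB/s because 128b/130b encoding spends 2 bits in every 130 on synchronization.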

The next higher layer is the data link layer, which provides mechanisms that ensure a reliable data channel between the two linked devices. The data link layer offers many features including:

Unique data link layer packets (DLLPs) which are local to the link and are used for communicating information specific to this layer
An ACK/NAK (acknowledge/negative acknowledge) protocol to handshake groups of packets, so that corrupted packets are automatically retransmitted
A link power state protocol allowing the link to enter various lower power consumption states when not transferring data
A flow-control credit protocol, which advertises receive buffer availability so that the transmitter never overflows the receiver, while keeping the link efficiently utilized to maximize bandwidth
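
The credit mechanism in the last item can be sketched as follows. This is a deliberately simplified model with one credit per packet and a hypothetical `CreditedLink` class; the actual protocol tracks separate header and data credits for each transaction category.

```python
from collections import deque


class CreditedLink:
    """Toy model: the receiver advertises one credit per free buffer slot."""

    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots  # credits advertised by the receiver
        self.rx_buffer = deque()

    def try_send(self, packet: str) -> bool:
        """Transmit only if a credit is available; otherwise hold the packet."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_drain(self) -> str:
        """Receiver consumes one packet and returns one credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet


link = CreditedLink(rx_buffer_slots=2)
results = [link.try_send(p) for p in ("A", "B", "C")]  # "C" is held back
link.receiver_drain()                                   # frees one slot
retry = link.try_send("C")                              # now succeeds
```

Because the transmitter never sends without a credit, the receive buffer cannot overflow, and no packet has to be dropped and retried for lack of space.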

The uppermost layer in the PCIe interface is the transaction layer where application data travels using various transaction types shown below in Table 1. This layer extends across the entire PCIe hierarchy, and, unlike the two lower layers, communicates beyond directly linked devices. The features of the Transaction Layer include:

Processing of Transaction Layer Packets (TLPs) including the header which carries transaction type, addressing (physical or translated), ordering information, and data when appropriate
Creation and checking of ECRC – End-to-end Cyclic Redundancy Code which confirms that each TLP remained uncorrupted even after having potentially traversed multiple levels of a PCIe hierarchy
Transaction ordering, which ensures memory consistency by:
- Preserving the programmer's “Memory Ordering Model”, ensuring correct operation of the producer-consumer model, where data reads must return the most recently written data
- Allowing certain transactions to overtake others to avoid deadlock scenarios
- Optionally relaxing transaction ordering, when neither condition above applies, to maximize performance for applications that do not follow the producer-consumer model
Transaction types: TLPs target the memory, I/O, or configuration address space, or use messages to encapsulate other protocols over PCIe; Table 1 defines the transaction types.

Address Space	Transaction Types	Basic Usage	Use Cases
Memory	Read/write	Transfer data to/from a memory-mapped location that can be cached or not	Regular memory read/write transactions. Some memory write transactions can carry "message" interrupt events (remember: everything is a packet!)
I/O	Read/write	Transfer data to/from an I/O-mapped location	Mainly for PCI legacy support
Configuration	Read/write	Device function configuration/setup	Special read/write transactions for PCIe subsystem configuration
Message	Baseline (including vendor defined)	From event signaling mechanism to general purpose messaging	This is not specific to memory, but allows a TLP to target other agents. It's an in-band signaling scheme and can be used to encapsulate other protocols

Table 1: Definition of transaction types that are transported by the transaction layer

External Connectivity in Arm-Based SoCs

Building an Arm-based SoC with a PCI Express interface requires deep knowledge of the PCIe protocol, the Arm AMBA® protocol, ordering issues, different clocking domains, error mapping, tag management, and more, all of which lengthens SoC development time. Designers can overcome this by using third-party PCIe IP. A compliant PCIe IP that is proven in millions of devices enables external connectivity in Arm-based SoCs and reduces time to market, allowing designers to focus their attention on the rest of their SoC design. Integration of a proven PCIe IP helps overcome design challenges such as:

Clock domain crossing: PCIe IP generally runs on a clock source derived from the PCIe data rate, so frequencies of 125, 250, 500, or 1000MHz are common, depending on the PCIe signaling rate (2.5GT/s, 5GT/s, 8GT/s, or 16GT/s), the PHY interface width, and the number of lanes being implemented. The internal clock of an SoC may run at a different rate than the PCIe interface, such as 400, 800, or 1600MHz, and may change drastically with application load. A well-designed PCIe interface IP frees the SoC designer from having to manage the actual clock speed, link power state, and transaction buffering, while ensuring gap-free transmission and reception on both the PCIe and SoC interfaces.
Translating Arm AMBA transactions into PCIe TLPs: SoC designers must either carefully match their AMBA and PCIe bandwidth and transfer sizes, or choose a PCIe interface IP that takes care of this automatically. For example, an AMBA burst write might be larger than the PCIe TLP maximum payload size (MPS), forcing the transfer to be decomposed into multiple smaller PCIe write TLPs. Similarly, a read request that exceeds the PCIe maximum read request size (MRRS) must be decomposed into several smaller TLP reads. This has the added complication of requiring the collection, potential re-ordering, and reassembly of multiple PCIe read responses to provide an AMBA response that matches the original request. In addition, some PCIe TLP header attributes do not map directly onto the AMBA interface protocols and need to be dealt with separately. However, designers can map some of these header attributes to AMBA sideband signals, using AxUSER signals, and thereby simplify the process of TLP creation.
Translating between the AMBA memory transaction ordering model and the PCIe ordering model: Traffic outbound from the SoC runs through AXI slave transactions. Similarly, traffic inbound to the SoC from PCIe runs through AXI master transactions. For the PCIe slave interface to meet the Arm ordering model, it must properly handle:
- Read-after-Read: since PCIe does not guarantee ordering between reads, this function must be handled by the ordering logic in the slave.
- Write-after-Write: can generally be achieved by mapping to PCIe non-relaxed posted transactions, except for PCIe configuration writes, where ordering is not guaranteed.
- Read-after-Write: guaranteed by PCIe ordering rules, which ensures the producer/consumer model, with some options for relaxation if performance in the target application requires it and the data is known to be unrelated.
- Write-after-Read: PCIe allows writes to pass reads to avoid deadlock scenarios. However, the CPU mainly controls this function, so applications needing to comply with this rule naturally do so.
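
The burst decomposition described under "Translating Arm AMBA transactions into PCIe TLPs" can be sketched in a few lines of Python. The function names `split_request` and `reassemble` are hypothetical, `max_size` stands in for MPS (writes) or MRRS (reads), and the sketch additionally assumes the PCIe rule that a memory request must not cross a 4KB address boundary.

```python
FOUR_KB = 4096  # a PCIe memory request must not cross a 4KB boundary


def split_request(address: int, length: int, max_size: int) -> list[tuple[int, int]]:
    """Split one AMBA burst into (address, length) pieces of at most max_size bytes."""
    pieces = []
    while length > 0:
        room_in_page = FOUR_KB - (address % FOUR_KB)
        size = min(length, max_size, room_in_page)
        pieces.append((address, size))
        address += size
        length -= size
    return pieces


def reassemble(completions: dict[int, bytes]) -> bytes:
    """Join read completions, keyed by start address, back into address order."""
    return b"".join(completions[addr] for addr in sorted(completions))


# A 1KB burst with a 256-byte limit becomes four TLPs.
writes = split_request(address=0x1000, length=1024, max_size=256)

# A 256-byte burst starting 128 bytes below a 4KB boundary is split in two.
crossing = split_request(address=0xF80, length=256, max_size=256)
```

A bridge IP performs this splitting and the matching reassembly in hardware, so the AMBA side sees a single response for its original request.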

A PCIe AXI Master needs to comply with the same ordering rules, so its ordering logic must be very similar to that described above. Some paths can be simpler. For example, the inbound read path does not require ordering logic as long as it does not reorder inbound reads, since a compliant AXI slave ensures Read-after-Read by ordering the read data completions. To ensure compliance with the Read-after-Write rule, the Master logic could simply wait for the write response before issuing the read.

Another consideration for SoC designers is where to place their DMA (Direct Memory Access) engine(s). While it’s possible to use an off-the-shelf DMA engine communicating solely over the AMBA interconnect, such an architecture has limitations. To get the maximum performance, the DMA engine needs to understand both AMBA and PCIe. Consider a system where the AMBA burst size is smaller than the PCIe maximum payload size. AMBA bursts generated by a DMA engine on the AMBA interconnect will translate to smaller-than-optimal PCIe packets. Placing the DMA engine inside the PCIe controller allows for aggregation, where the DMA engine collects several AMBA bursts into a single PCIe packet to optimize PCIe bandwidth and utilization. The resulting reduction in the overall number of transactions can also pay dividends in power consumption and efficiency per byte.
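
The benefit of aggregation can be estimated with a back-of-the-envelope calculation. In the sketch below, the 24-byte per-packet overhead, the 64-byte AMBA burst, and the 256-byte maximum payload size are all illustrative assumptions, and `tlp_stats` is a hypothetical helper rather than part of any IP.

```python
import math

TLP_OVERHEAD_BYTES = 24  # assumed framing + header + CRC cost per packet


def tlp_stats(total_bytes: int, payload_per_tlp: int) -> tuple[int, float]:
    """Return (number of TLPs, fraction of link bytes that are payload)."""
    tlps = math.ceil(total_bytes / payload_per_tlp)
    efficiency = total_bytes / (total_bytes + tlps * TLP_OVERHEAD_BYTES)
    return tlps, efficiency


# Moving 4KB with one TLP per 64-byte AMBA burst (no aggregation).
unaggregated = tlp_stats(4096, payload_per_tlp=64)

# Moving 4KB with bursts aggregated up to an assumed 256-byte MPS.
aggregated = tlp_stats(4096, payload_per_tlp=256)
```

Under these assumptions the unaggregated transfer needs 64 packets at roughly 73% payload efficiency, while the aggregated transfer needs 16 packets at roughly 91%.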

Summary

PCIe has emerged as the standard of choice for chip-to-chip connectivity between high-performance processors, such as Arm’s, and other devices. However, integrating the PCIe interface into an SoC can be challenging without deep knowledge of the PCIe and AMBA interface protocols. Designers can overcome these challenges by leveraging optimized PCIe IP that is designed to deal with the nuances of bridging between PCIe and AMBA while also including the latest features of the PCIe protocol. Synopsys’ DesignWare IP for PCI Express to Arm AMBA Bridge is a configurable and scalable solution that meets the needs of a wide range of high-bandwidth, low-latency, and low-power applications. It has been proven in over 1500 designs and production-proven in millions of units, allowing designers to integrate the IP into their SoCs with confidence. The IP offers numerous advantages including:

Reliability, a key requirement for applications like storage and automotive: the ASIL Ready ISO 26262 PCIe IP detects, and in some cases corrects, transient or permanent faults
Support for virtualization technologies such as address translation services and Single Root I/O Virtualization, which enables advanced server features to be available in Arm SoCs
Advanced power and clock gating for seamless entry to and exit from the lowest possible power states without requiring software intervention

For further information, visit the DesignWare IP for PCI Express website.

metux over 6 years ago

Just a curious question from a SW engineer: can current SoCs (e.g. imx6) be directly interconnected via PCIe?

What would the CPU have to do to send/receive transactions?

I'm looking for an interconnect for NUMA / asymmetric multiprocessing with a bundle of SoCs on a board.