This is a two-part blog on smart acceleration for distributed cloud computing. Part 1 will focus on the technical and commercial drivers along with an overview of the solutions available today. Part 2 will focus on future trends and will take a closer look at building smart acceleration systems.
The increasing number of connected devices and the data they generate are providing an opportunity for new applications and business models, which in turn is driving a new set of infrastructure solution requirements from the edge back to the datacenter.
Figure 1 highlights the transition to distributed edge computing with IoT devices that have dramatically different requirements, from real-time latency for industrial/automotive use-cases to high bandwidth for AR/VR. This shift forces more compute to move to the data source.
Containers also promise to accelerate edge application deployment cycles and increase cloud agility in fundamental ways. Scaling applications and moving the distributed compute to the source requires the same deployment methodologies and container isolation for guest services that run in the cloud today.
Fig. 1 Moving to a distributed compute model
Combining many different types of processors and accelerators is needed to maximize efficiency and meet the real-time demands at the network edge. In a heterogeneous system, the processors and the accelerators must share data, and moving it about can be a headache (especially in the virtualized / container world). Having the memory across these devices operate in a coherent manner – meaning that all devices can address all memory attached to those devices in a single, consistent way – is one of the holy grails of heterogeneous computing.
But before discussing future technology solutions, it is useful to look back at traditional Host-Accelerator systems and level-set on the terminology. A traditional accelerator system includes:
The ubiquitous connectivity between Host and Accelerator is PCIe with block transfers typically in 512B to 4KB chunks.
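To get a feel for what block transfers in 512B to 4KB chunks mean in practice, here is a minimal sketch that splits a buffer into fixed-size transfers. The function name `split_into_transfers` and the 1 MiB buffer size are hypothetical and chosen only for illustration; this is not a model of any particular PCIe device or driver.

```python
def split_into_transfers(total_bytes: int, chunk_bytes: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover total_bytes in chunk_bytes pieces.

    Hypothetical helper for illustration only; the final transfer may be
    shorter than chunk_bytes when the buffer is not an exact multiple.
    """
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes must be > 0")
    return [
        (offset, min(chunk_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, chunk_bytes)
    ]


if __name__ == "__main__":
    buffer_size = 1 << 20  # assumed 1 MiB buffer, purely illustrative
    for chunk in (512, 4096):
        transfers = split_into_transfers(buffer_size, chunk)
        print(f"{chunk} B chunks -> {len(transfers)} transfers")
```

With these assumed numbers, the same buffer needs 2048 transfers at 512B but only 256 at 4KB, which is one reason chunk size matters when moving data between Host and Accelerator.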
The increased use of virtualized guest OSes and containers drove new virtualized network and storage stacks, which initially ran on the Host Processor. As network speeds increased (10GbE, 25GbE, 100GbE) and NVM storage performance rose from thousands to millions of I/O operations per second (IOPS), these software stacks came to consume 50 percent or more of the highly valuable host processing capacity, leaving less for the actual application. Open vSwitch (OVS), a multi-layer, open source virtual switch, is an example of a stack that runs on the host processor.
To address this issue, Smart-Accelerators such as Smart-NICs and NVMe storage devices have been deployed. These devices combine efficient compute with accelerators and IO virtualization techniques (VirtIO, SR-IOV, etc) to reduce the host processing overhead to less than 5 percent.
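The effect of the overhead figures quoted above (50 percent or more before off-load, less than 5 percent after) can be sketched with some simple arithmetic. The 32-core host and the helper name `cores_for_applications` are assumptions made up for this example, and the model deliberately ignores everything except the overhead fraction.

```python
def cores_for_applications(total_cores: int, overhead_fraction: float) -> float:
    """Estimate the host cores left for applications.

    Hypothetical, simplified model: overhead_fraction is the share of host
    processing consumed by virtualized network and storage stacks.
    """
    if not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("overhead_fraction must be between 0 and 1")
    return total_cores * (1.0 - overhead_fraction)


if __name__ == "__main__":
    host_cores = 32  # assumed host size, not taken from any product
    before = cores_for_applications(host_cores, 0.50)  # stacks run on the host
    after = cores_for_applications(host_cores, 0.05)   # stacks off-loaded
    print(f"before off-load: {before:.1f} cores for applications")
    print(f"after off-load:  {after:.1f} cores for applications")
```

Under these assumptions, off-load takes the host from roughly 16 cores of application capacity to a little over 30, without adding a single host core.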
Smart acceleration did not stop evolving at off-load. Today, Smart-NICs have become self-hosted, which means that they have enough computing power and functionality to run full OS stacks with hypervisors, virtual machines and containers. Connecting other acceleration elements to the self-hosted NIC is another growing trend. These accelerators include security off-load, distributed memory and ML, to name a few. As a result, the traditional host is now completely freed up to run high-value applications and services.
A good overview of the path to self-hosted acceleration is covered by the following AWS Nitro introduction blog which describes the AWS EC2 instance evolution over time.
The ability to develop heterogeneous solutions with the right levels of compute and acceleration to balance performance and power efficiency has been a hallmark of the Arm architecture and partnership activity. An example is the cell phone in your pocket, which has an integrated Arm processor balanced with GPU, video, and display to provide a rich user experience while maintaining energy efficiency to significantly extend battery life. Within the infrastructure space, the performance, scale and variety of devices are obviously very different from a cell phone, but the ultimate benefits of implementing designs using a heterogeneous architecture remain the same.
Today, Arm smart acceleration technology is being deployed across the distributed computing infrastructure for a wide range of solutions, including cellular 4G/5G baseband and radio, edge computing, NFV, core network, security appliance and datacenter network designs. Example product families include Broadcom NetXtreme, Cavium OCTEON TX, Mellanox BlueField, Marvell ARMADA, NXP Layerscape, Socionext SynQuacer and Xilinx Zynq UltraScale+ MPSoC.
To be continued… stay tuned for part two where we will take a closer look at future architecture trends and dive into how these smart acceleration devices and systems are built. Some of the topics that will be covered include: