Reduce TCO with Arm Based SmartNICs

November 14, 2019


What is a SmartNIC

A SmartNIC is a Network Interface Card that includes general-purpose CPUs. The CPUs are used to offload processing that would otherwise be done by server CPUs. Arm CPUs are being selected for SmartNIC SoCs because of their efficiency, performance, and well-supported software ecosystem. Two examples of Arm-based SmartNIC platforms are the Broadcom Stingray and the Mellanox Bluefield, both built around Arm Cortex-A72 CPUs. SmartNICs also include DRAM and storage, and they boot standard operating systems like Linux. In fact, a SmartNIC can appear as a host on the network. For this reason, SmartNICs enable use cases that go far beyond networking. Ultimately, the advantage of a SmartNIC is that it reduces operating costs by efficiently offloading processing from servers.

A host machine with a SmartNIC attached to its PCIe bus.

SmartNIC Use Cases

The base use case for a SmartNIC is networking offload. Although a "dumb" NIC can accelerate network processing, it is limited to fixed functions like TCP and VXLAN offload. SmartNICs have these fixed-function accelerators as well, but the addition of CPUs makes them more flexible and powerful. For example, with a SmartNIC, we could offload VXLAN through an on-board accelerator and then use the CPUs to run an eBPF program that does request rate limiting and IP blacklisting for DDoS attack prevention. If the SmartNIC is attached to a server running NGINX, we could then turn off the NGINX layer 7 rate limiter. This would free up server resources that can be reallocated towards handling more user requests.
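To make the rate limiting and blacklisting idea concrete, here is a minimal sketch of the per-packet decision such a program might make. It is written in Python for readability and is not eBPF code; the class name, the fixed-window policy, and the example addresses are all assumptions made for illustration.

```python
class PacketFilter:
    """Hypothetical fixed-window, per-source rate limiter with a static blocklist."""

    def __init__(self, blocked_ips, max_requests, window_seconds):
        self.blocked_ips = set(blocked_ips)
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # source IP -> (window start time, requests counted in that window)
        self._windows = {}

    def allow(self, src_ip, now):
        """Return True if a request from src_ip at time `now` should pass."""
        if src_ip in self.blocked_ips:
            return False
        start, count = self._windows.get(src_ip, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired, so start a fresh one.
            start, count = now, 0
        if count >= self.max_requests:
            self._windows[src_ip] = (start, count)
            return False
        self._windows[src_ip] = (start, count + 1)
        return True


# Example addresses come from the documentation ranges, not from a real setup.
packet_filter = PacketFilter({"203.0.113.7"}, max_requests=2, window_seconds=1.0)
decisions = [
    packet_filter.allow("198.51.100.1", 0.0),  # first request in the window
    packet_filter.allow("198.51.100.1", 0.1),  # second request, still allowed
    packet_filter.allow("198.51.100.1", 0.2),  # third request, over the limit
    packet_filter.allow("198.51.100.1", 1.0),  # new window, allowed again
    packet_filter.allow("203.0.113.7", 1.0),   # blocklisted source
]
print(decisions)
```

A real eBPF program would keep this state in BPF maps and return a drop or pass verdict for each packet, but the decision logic would be broadly similar.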

Another use case for SmartNICs is disaggregated storage management. A SmartNIC can be used to virtualize storage arrays. This is so that they appear as local storage to an operating system that needs to access the array. They can also be used as an NVMe-oF controller. Overall, the idea is to move the software overhead of implementing a disaggregated storage solution onto an efficient SmartNIC.

The last SmartNIC use case that we highlight in this post is general compute offload. As mentioned earlier, a SmartNIC can run an operating system. If we run Linux, this means we can run any Linux applications we would like on the SmartNIC CPUs. For example, if we have an NGINX server, we could run an NGINX static file cache on the SmartNIC attached to the server. When a request for a static file is received, and if the file is cached in the SmartNIC DRAM, the file can be served from the SmartNIC without the request ever making it onto the PCIe bus. Another example of compute offload would be video encoding which we explore next.

SmartNIC Video Encoding Motivation

Video encoding is done to make video files smaller, and to better serve the variety of devices that are used for streaming. This encoding work is a back-end operation that does not impact a user's streaming experience. The encoding job does not need to happen as fast as possible since it is not part of an active video stream. This creates the opportunity to save on power and energy by running the encode job on more efficient hardware like a SmartNIC. We decided to test this idea by encoding some video on the Stingray PS225 to get a sense for the potential power and energy savings it could yield.

Video Encode Offload Experiment

We measured the power and energy that is used to encode the video on both the host machine and the SmartNIC that is attached to the host machine. Below is a description of the hardware.

Broadcom Stingray PS225 SmartNIC

8x Arm Cortex-A72 CPUs @ 3.0 GHz
8GB RAM (2400 MT/s)
- Note: There are 4GB and 16GB versions available
PCIe 3.0 (x8)
2x25GbE SFP interfaces
Ubuntu 16.04.6 LTS
Kernel 4.14.79

More information can be found on the Broadcom Stingray webpage.

SmartNIC overhead processing examples

SmartNIC Host Machine - Lenovo M920

The host machine is a standard desktop. A SmartNIC would normally be attached to a server-class system, but because this is a demo system that we transport around the globe in a hard case, we opted for a smaller Lenovo M920 desktop. Here are the top-level specs of the system.

6x CPUs (12x Threads) i7-8700 (Coffee Lake) @ 3.2 GHz - 4.6 GHz
16GB RAM (2666 MT/s)
Ubuntu 18.04.2 LTS
Kernel 4.18.0

The Workload

The workload was 2048 frames of uncompressed I420 video stored locally on each system. The compression standard that we used was H.264 High Profile, which is the profile used for HD video on Blu-ray Discs. We used Docker to containerize this workload to make it easier to deploy. Details on how we containerized and deployed the workload are covered later in this post.
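To get a feel for how large the uncompressed input is, the sketch below computes the size of 2048 I420 frames. The post does not state the resolution, so the 1920x1080 figure is purely an assumption, as are the function and variable names. The sketch also assumes 8-bit samples and even dimensions.

```python
def i420_frame_bytes(width, height):
    """Bytes in one 8-bit I420 frame.

    I420 stores a full-resolution luma (Y) plane followed by two chroma
    planes (U and V), each subsampled by 2 horizontally and vertically.
    """
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even width and height")
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


FRAMES = 2048               # from the post
WIDTH, HEIGHT = 1920, 1080  # assumption: the post does not give a resolution

frame_bytes = i420_frame_bytes(WIDTH, HEIGHT)
total_bytes = FRAMES * frame_bytes
print(frame_bytes)  # 3110400
print(total_bytes)  # 6370099200
```

At the assumed resolution the raw input is roughly 6.4 GB, which is one reason to keep the video in a mounted directory rather than inside the container image.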

Execution Time

Oddly enough, the execution time differed by a factor of 2. That is, the SmartNIC took twice as long as the host to complete the job.

Execution time:
- Host:
  - Approx. 105 seconds
- SmartNIC:
  - Approx. 210 seconds

Power & Energy Results

Video Encode Power (Left) & Video Encode Energy (Right)

The graph on the left shows the average power that is consumed during the execution of the job. Here we see that the SmartNIC reduces average power consumption by a factor of 6. When the time it takes to encode the video does not affect the end-user SLO and SLA (that is, Iso-SLA), we can consider average power to be the performance metric of interest. In this case, we see a performance improvement by a factor of 6.

The graph on the right shows the energy that is consumed during execution of the job. Here we see that the energy used to complete the job is reduced by a factor of 3. The reduction is smaller than the power reduction because energy is power multiplied by time, and the SmartNIC takes twice as long as the host to complete the job: one sixth of the power for twice the time gives one third of the energy. When the work to be done is fixed (that is, Iso-work), we can consider energy consumed to be the performance metric of interest. In this case, we see a performance improvement by a factor of 3.
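The relationship between the two graphs can be checked with a few lines of arithmetic. The sketch below uses the approximate execution times reported above and normalizes host power to 1 unit, since only the ratio between the systems is given here; the variable names are illustrative.

```python
from fractions import Fraction

HOST_SECONDS = 105      # approximate, from the post
SMARTNIC_SECONDS = 210  # approximate, from the post

# Assumption: host power is normalized to 1 unit because only the 6x ratio
# between the two systems is stated, not absolute watts.
HOST_POWER = Fraction(1)
SMARTNIC_POWER = HOST_POWER / 6


def energy(power, seconds):
    """Energy is average power multiplied by execution time."""
    return power * seconds


host_energy = energy(HOST_POWER, HOST_SECONDS)
smartnic_energy = energy(SMARTNIC_POWER, SMARTNIC_SECONDS)
energy_reduction = host_energy / smartnic_energy
print(energy_reduction)  # 3
```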

Overall, these results indicate that it is worth exploring compute offloading use cases on SmartNICs.

Other Setup Details

We think it is worth going into a little more detail on how we set up the host and SmartNIC because it highlights the versatility of a SmartNIC.

SmartNIC/Host Virtual Network with Docker Swarm

The SmartNIC and host expose virtual network interfaces which can be used for private TCP/IP based communication. Any communication that is done through this virtual interface goes through the PCIe bus and not the external network. To make the deployment of the workload easier, we used Docker Swarm to cluster the SmartNIC and host through these private interfaces. Having the Docker Swarm virtual network allowed us to easily deploy containerized workloads and monitoring tools.

Also note that this cluster is heterogeneous. That is, the cluster is made up of nodes with different hardware architectures. In our case, the host is x86 and the SmartNIC is Arm. This is evidence of the maturity of containerization and virtual network technologies. The industry is at a point where cloud native tools transcend hardware architectures, so it is easy to mix and match architectures in the same cluster. If you are interested in learning more about multi-architecture support in containers and virtual networks, see two other blog posts: Architecture Agnostic Container Build Systems and Deploying a Multi-Arch Docker Registry.

Workload Container

The Dockerfile below shows how the encode job container was built. The 2048 frames of uncompressed I420 video were kept in a volume mounted from a directory called /videos on the host/SmartNIC. This allowed us to swap out different uncompressed video files without having to rebuild the container. The VideoLAN encoder x264 and a helper script were also packaged in the container. From this single Dockerfile, we built two versions of the container image: one built for Arm, which runs on the SmartNIC, and another built for x86, which runs on the host. These container images were loaded into the SmartNIC and host respectively. Alternatively, we could have uploaded the images to a multi-architecture registry so that the SmartNIC and host can pull the images as needed. We talk about how to deploy multi-architecture Docker registries in a previous post. The last thing to note is that the Dockerfile below assumes that the x264 encoder is present on the build machine. This can be ensured by setting up makefiles that build x264 before building the Docker image, or by using Docker's multi-stage build system. Both of these methods were discussed in Cloud Management Tools on Arm.

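# Ubuntu 16.04 base image.
FROM ubuntu:16.04

# Copy the prebuilt x264 encoder and the helper script into the image.
# Both files must already exist in the build context.
COPY ./x264 /usr/bin/x264
COPY ./encode.sh /usr/bin/encode.sh

# Mount point for the uncompressed input video.
VOLUME ["/videos"]

CMD ["/bin/bash"]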

Monitoring Containers

We wanted to monitor metrics like CPU and memory utilization. To do this, we deployed Prometheus, Grafana, and a few Prometheus metric exporters. This allowed us to produce visually pleasing graphs that we show at conferences. Since the SmartNIC is just a standard Linux system, all of these tools simply worked without issue.

Grafana Showing SmartNIC/Host Cluster Metrics

Power and Energy Measurement

The SmartNIC did not have power consumption counters, so we measured power by using a wall outlet power meter. To obtain the power consumption for just the encode job, we measured the average idle power of the system and then subtracted it from the average power measured during the encode job. To measure energy, we multiplied the encode job power by the execution time. For the wall power graph shown above, we mapped in the average power we saw on the wall outlet meter. This is acceptable because the power readings are stable, and the graph is only there to give a nice visual when we demo the system. On expo show floors, we always show the wall outlet power as well to confirm that the graph is right.
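The measurement procedure described above amounts to a subtraction and a multiplication. A minimal sketch follows; the function names and the wattage readings are hypothetical and are not measurements from this experiment.

```python
def average(samples):
    """Arithmetic mean of a non-empty list of power readings in watts."""
    return sum(samples) / len(samples)


def job_power_watts(load_samples, idle_samples):
    """Power attributed to the job: average under load minus average at idle."""
    return average(load_samples) - average(idle_samples)


def job_energy_joules(load_samples, idle_samples, seconds):
    """Energy attributed to the job: job power multiplied by execution time."""
    return job_power_watts(load_samples, idle_samples) * seconds


# Hypothetical wall-meter readings in watts, chosen only to keep the math simple.
idle_readings = [20.0, 20.0, 20.0]
load_readings = [29.0, 30.0, 31.0]

print(job_power_watts(load_readings, idle_readings))         # 10.0
print(job_energy_joules(load_readings, idle_readings, 210))  # 2100.0
```

Subtracting idle power isolates the cost of the encode job itself, which matters here because a wall meter on the host also sees everything else plugged into that system.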

SmartNIC Kernel Config

The default kernel on the PS225 was lightweight. This is likely because there is limited storage (16GB) available on the SmartNIC. The kernel had to be recompiled with a more fully featured config. The features that were missing were the usual things that are required to run containers and virtual networks, such as VXLAN, cgroups, and namespaces. We used the kernel config that was generated in an earlier blog post called Configuring The MacchiatoBin For Kubernetes and Swarm. In that blog, we explain how to create a kernel config from scratch that will function properly with containers and virtual networks.
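Before recompiling, it can help to check which of the needed options a given kernel config already enables. The sketch below parses config text and reports what is missing. The three option names are the standard kernel symbols for the features named above, but the list is only a small illustrative subset, and the helper names and sample text are hypothetical.

```python
# Illustrative subset only; a real container setup needs more options.
REQUIRED_OPTIONS = ["CONFIG_VXLAN", "CONFIG_CGROUPS", "CONFIG_NAMESPACES"]


def enabled_options(config_text):
    """Return the set of options that are built in (=y) or modules (=m)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that the config does not enable."""
    enabled = enabled_options(config_text)
    return [option for option in required if option not in enabled]


# Hypothetical config fragment, not the actual PS225 default config.
sample_config = """\
CONFIG_CGROUPS=y
# CONFIG_VXLAN is not set
CONFIG_NAMESPACES=y
"""
print(missing_options(sample_config))  # ['CONFIG_VXLAN']
```

On a running system the same text can usually be read from a file such as /boot/config-<version> or, when the kernel exposes it, /proc/config.gz.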

Closing Remarks

A SmartNIC is a powerful and efficient device that can be used in various scenarios. We have shown that a SmartNIC can offer significant power savings by offloading compute from a power-hungry server. We would like to encourage readers to try experimenting with either a Broadcom Stingray or Mellanox Bluefield SmartNIC.
