A SmartNIC is a Network Interface Card that includes general-purpose CPUs. These CPUs offload processing that would otherwise be done by the server CPUs. Arm CPUs are being selected for SmartNIC SoCs because of their efficiency, performance, and well-supported software ecosystem. Two examples of Arm-based SmartNIC platforms are the Broadcom Stingray and the Mellanox BlueField, both built around Arm Cortex-A72 CPUs. SmartNICs also include DRAM and storage, and boot standard operating systems like Linux. In fact, a SmartNIC can appear as a host on the network. For this reason, SmartNICs enable use cases that go far beyond networking. Ultimately, the advantage of a SmartNIC is that it reduces operating costs by efficiently offloading processing from servers.
A host machine with a SmartNIC attached to its PCIe bus.
The base use case for a SmartNIC is networking offload. Although a "dumb" NIC is capable of accelerating network processing, it is limited to fixed functions like TCP and VXLAN offload. SmartNICs have these fixed-function accelerators as well, but the addition of CPUs makes them more flexible and powerful. For example, with a SmartNIC, we could offload VXLAN through an on-board accelerator and then use the CPUs to run an eBPF program that does request rate limiting and IP blacklisting for DDoS attack prevention. If the SmartNIC is attached to a server running NGINX, we could then turn off the NGINX layer 7 rate limiter. This would free up server resources that can be reallocated towards handling more user requests.
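To make the rate limiting and blacklisting idea concrete, here is a minimal sketch of the decision logic such a filter might apply to each request. It is written in Python for readability only; a real eBPF program would be written in restricted C and loaded into the kernel, and this is not the code from any actual deployment. The class names, the token-bucket algorithm, and the example addresses (taken from documentation-reserved ranges) are all our assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Per-source state: remaining tokens and the time of the last update."""
    tokens: float
    last: float


class PacketFilter:
    """Hypothetical filter: drop blacklisted sources, rate limit the rest."""

    def __init__(self, rate: float, burst: float, blocked):
        self.rate = rate            # tokens (requests) refilled per second
        self.burst = burst          # maximum tokens a source can accumulate
        self.blocked = set(blocked)
        self.buckets = {}           # source IP -> TokenBucket

    def allow(self, src_ip: str, now: float) -> bool:
        """Return True if a request from src_ip at time `now` may pass."""
        if src_ip in self.blocked:
            return False
        bucket = self.buckets.get(src_ip)
        if bucket is None:
            # A new source starts with a full bucket.
            bucket = TokenBucket(tokens=self.burst, last=now)
            self.buckets[src_ip] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - bucket.last)
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.last = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False


# Assumed policy for illustration: 1 request/second, bursts of 2.
flt = PacketFilter(rate=1.0, burst=2.0, blocked=["203.0.113.7"])
decisions = [flt.allow("198.51.100.1", t) for t in (0.0, 0.0, 0.0, 1.0)]
print(decisions)  # [True, True, False, True]
```

The third request is dropped because the source has exhausted its burst allowance, and the fourth passes because one second of refill has restored a token. Running this kind of check on the SmartNIC means rejected requests never reach the server.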
Another use case for SmartNICs is disaggregated storage management. A SmartNIC can be used to virtualize storage arrays so that they appear as local storage to an operating system that needs to access them. A SmartNIC can also be used as an NVMe-oF controller. Overall, the idea is to move the software overhead of implementing a disaggregated storage solution onto an efficient SmartNIC.
The last SmartNIC use case that we highlight in this post is general compute offload. As mentioned earlier, a SmartNIC can run an operating system. If we run Linux, we can run any Linux application we like on the SmartNIC CPUs. For example, if we have an NGINX server, we could run an NGINX static file cache on the SmartNIC attached to the server. When a request for a static file arrives and the file is cached in the SmartNIC DRAM, the file can be served from the SmartNIC without the request ever making it onto the PCIe bus. Another example of compute offload is video encoding, which we explore next.
Video encoding is done to make video files smaller, and to better serve the variety of devices that are used for streaming. This encoding work is a back-end operation that does not impact a user's streaming experience. The encoding job does not need to happen as fast as possible since it is not part of an active video stream. This creates the opportunity to save on power and energy by running the encode job on more efficient hardware like a SmartNIC. We decided to test this idea by encoding some video on the Stingray PS225 to get a sense for the potential power and energy savings it could yield.
We measured the power and energy that is used to encode the video on both the host machine and the SmartNIC that is attached to the host machine. Below is a description of the hardware.
More information can be found on the Broadcom Stingray webpage.
SmartNIC overhead processing examples
The host machine is a standard desktop. A SmartNIC would normally be attached to a server-class system, but because this is a demo system that we transport around the globe in a hard case, we opted for a smaller Lenovo M920 desktop. Here are the top-level specs of the system.
The workload was 2048 frames of uncompressed I420 video stored locally on each system. The compression standard that we used was the H.264 High Profile. This is the profile that is used for HD video on Blu-ray Discs. We used Docker to containerize this workload to make it easier to deploy. Details on how we containerized and deployed the workload are covered later in this post.
The execution time differed by a factor of 2. That is, the SmartNIC took twice as long as the host to complete the job.
Video Encode Power (Left) & Video Encode Energy (Right)
The graph on the left shows the average power that is consumed during the execution of the job. Here we see that the SmartNIC reduces average power consumption by a factor of 6. When the time it takes to encode the video does not affect the end-user SLO and SLA (i.e. Iso-SLA), we can consider average power to be the performance metric of interest. In this case, we see a performance improvement by a factor of 6.
The graph on the right shows the energy that is consumed during execution of the job. Here we see that the energy used to complete the job is reduced by a factor of 3. This is because energy accounts for the fact that the SmartNIC takes twice as long as the host to complete the job. When the work to be done is fixed (that is, Iso-work), we can consider energy consumed to be the performance metric of interest. In this case, we see a performance improvement by a factor of 3.
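The relationship between the two graphs can be checked with a few lines of arithmetic. The sketch below uses only the rounded ratios quoted in this post (a factor of 6 in power and a factor of 2 in time); the function name is ours, introduced for illustration.

```python
def energy_ratio(power_ratio: float, time_ratio: float) -> float:
    """Return host energy divided by SmartNIC energy.

    power_ratio: host average power / SmartNIC average power
    time_ratio:  SmartNIC execution time / host execution time

    Energy is power multiplied by time, so the energy advantage is the
    power advantage divided by the slowdown.
    """
    if power_ratio <= 0 or time_ratio <= 0:
        raise ValueError("ratios must be positive")
    return power_ratio / time_ratio


# Rounded figures from the post: 6x lower power, 2x longer runtime.
print(energy_ratio(6.0, 2.0))  # 3.0
```

In other words, the SmartNIC gives back part of its power advantage by running longer, but still comes out ahead on energy.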
Overall, these results indicate that it is worth exploring compute offloading use cases on SmartNICs.
We think it is worth going into a little more detail on how we set up the host and SmartNIC because it highlights the versatility of a SmartNIC.
The SmartNIC and host expose virtual network interfaces which can be used for private TCP/IP based communication. Any communication that is done through this virtual interface goes through the PCIe bus and not the external network. To make the deployment of the workload easier, we used Docker Swarm to cluster the SmartNIC and host through these private interfaces. Having the Docker Swarm virtual network allowed us to easily deploy containerized workloads and monitoring tools.
Also note that this cluster is heterogeneous. That is, the cluster is made up of nodes with different hardware architectures. In our case, the host is x86 and the SmartNIC is Arm. This is evidence of the maturity of containerization and virtual network technologies. The industry is at a point where cloud native tools transcend hardware architectures, so it is easy to mix and match architectures in the same cluster. If you are interested in learning more about multi-architecture support in containers and virtual networks, see two other blog posts: Architecture Agnostic Container Build Systems and Deploying a Multi-Arch Docker Registry.
The Dockerfile below shows how the encode job container was built. The 2048 frames of uncompressed I420 video were supplied through a volume mounted from a directory called /videos on the host/SmartNIC. This allowed us to swap out different uncompressed video files without having to rebuild the container. The VideoLAN encoder x264 and a helper script were also packaged in the container. From this single Dockerfile, we built two versions of the container image: one built for Arm, which runs on the SmartNIC, and another built for x86, which runs on the host. These container images were loaded onto the SmartNIC and host respectively. Alternatively, we could have uploaded the images to a multi-architecture registry so that the SmartNIC and host could pull the images as needed. We talk about how to deploy multi-architecture Docker registries in a previous post. The last thing to note is that the Dockerfile below assumes that the x264 encoder is present on the build machine. This can be ensured by setting up makefiles that build x264 before building the Docker image, or by using Docker's multi-stage build system. Both of these methods were discussed in Cloud Management Tools on Arm.
ADD ./x264 /usr/bin/x264
ADD ./encode.sh /usr/bin/encode.sh
We wanted to monitor metrics like CPU and memory utilization. To do this, we deployed Prometheus, Grafana, and a few Prometheus metric exporters. This allows us to produce visually pleasing graphs that we show at conferences. Since the SmartNIC is just a standard Linux system, all of these tools simply worked without issue.
Grafana Showing SmartNIC/Host Cluster Metrics
The SmartNIC did not have power consumption counters, so we measured power with a wall outlet power meter. To obtain the power consumption for just the encode job, we measured the average idle power of the system and then subtracted it from the average power measured during the encode job. To measure energy, we multiplied the encode job power by the execution time. For the wall power graph shown above, we mapped in the average power we saw on the wall outlet meter. This is acceptable since the power readings are stable, and it is just to give a nice visual when we demo the system. On expo show floors, we always show the wall outlet power to make sure that the graph is right.
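The measurement method described above can be sketched as two small functions. This is a minimal illustration, not the tooling we used; the function names are ours, and the wattage and duration values in the example are placeholders chosen for easy arithmetic, not our measured results.

```python
def job_power_watts(avg_load_watts: float, avg_idle_watts: float) -> float:
    """Power attributable to the job: average reading under load minus
    the average idle reading of the same system."""
    if avg_load_watts < avg_idle_watts:
        raise ValueError("load power should not be below idle power")
    return avg_load_watts - avg_idle_watts


def job_energy_joules(job_watts: float, execution_seconds: float) -> float:
    """Energy for the job: job power multiplied by execution time."""
    if execution_seconds < 0:
        raise ValueError("execution time must be non-negative")
    return job_watts * execution_seconds


# Placeholder readings (NOT measurements from this post):
# 20 W idle, 32 W during the encode, 100 s to finish.
watts = job_power_watts(avg_load_watts=32.0, avg_idle_watts=20.0)
joules = job_energy_joules(watts, execution_seconds=100.0)
print(watts, joules)  # 12.0 1200.0
```

Subtracting the idle baseline matters because a wall meter reports the power of the whole system, not just the power added by the encode job.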
The default kernel on the PS225 was lightweight. This is likely because there is limited storage (16GB) available on the SmartNIC. The kernel had to be recompiled with a more fully featured config. The missing features were the usual things that are required to run containers and virtual networks, for example VXLAN, cgroups, and namespaces. We used the kernel config that was generated in an earlier blog post called Configuring The MacchiatoBin For Kubernetes and Swarm. In that blog, we explain how to create a kernel config from scratch that will function properly with containers and virtual networks.
A SmartNIC is a powerful and efficient device that can be used in various scenarios. We have shown that a SmartNIC can offer significant power savings by offloading compute from a power-hungry server. We would like to encourage readers to try experimenting with either a Broadcom Stingray or Mellanox BlueField SmartNIC.
Hi, a decrease in performance can be acceptable when the metric you want to optimize for is your energy bill. There can be lower-priority operations that do not have a strict time-to-completion requirement (e.g. video transcoding that isn't an active customer stream). In these cases, there will be a cost saving if you trade off performance for better energy efficiency. It may take more time to complete the work, but the overall energy consumed is lower, thus lowering your energy bill. In short, if your workload has the luxury of time, then you might want to trade away time to gain in other areas, like energy consumption.
As for other kinds of workloads, it depends on what the workload requirements are. Given that the cost of energy is a significant factor in data centers these days, we shouldn't assume that raw compute performance is the top metric to optimize for.
Regarding the workload example you present in this article, the performance decreased by a factor of 2 compared to the "traditional" setup. Why would one want to use a SmartNIC at this cost in performance? What did I miss? (Is there a typo?) What about the performance of other kinds of workloads, e.g. cryptographic calculations? Would it be the same factor?