Understanding Overlay Networks In Cloud Deployments

In a previous post on Mesos/Marathon on Arm, we deployed a few containers on a WeaveNet overlay network. However, we didn't discuss any details about overlay networks in cloud deployments. In this post, we will discuss what overlay networks are and why they are useful, and explore some of the core technologies used to create them.

What is an Overlay?

An overlay abstracts a physical network to create a virtual network. The physical network is the underlay, and the virtual network on top of the physical network is the overlay. Overlays are useful because we can hide the complexities of the underlay and create a simplified network for our service. An example of a service that operates on an overlay is Voice-Over-IP (VoIP). VoIP uses the infrastructure of the Internet as the underlay, while the overlay is the virtual network of phone numbers used to address each phone.

Overlays in the Cloud

Cloud environments are dynamic. It's not uncommon to see containers deployed in the hundreds and thousands. Some of these containers are ephemeral, some are long-running, some will crash unexpectedly, and they will need to scale. Hosts can also crash, which would require hundreds or thousands of containers to be rescheduled at once. The overlays on which these containers communicate need to cope with such a dynamic environment. There are several overlay solutions that can handle this environment, including WeaveNet, SwarmKit (via Docker Swarm), Flannel, and Calico, all of which work on aarch64 (Arm64) platforms.

Below are some of the features that these overlay solutions provide:

  • An implementation of a core technology that facilitates the overlay. For example, there are Virtual Extensible LAN (VXLAN) based overlays, and Border Gateway Protocol (BGP) based overlays. We will discuss these later.
  • A distributed key-value (KV) store like etcd, consul, or zookeeper. The KV store is used to store configuration information about the overlay. This information is used to maintain a consistent and reliable overlay.
  • DNS, which allows containers to reference each other by service name rather than IP address. Recall that in this dynamic environment, we should expect IP addresses to change often.
  • Deployment of multiple overlays on a single cluster. Multiple overlays can enhance security by isolating containers into different virtual networks. For example, if we're running an e-commerce site, we can have the user-facing containers on one virtual network, and the back end containers on a separate virtual network.
  • Support for encryption and security profiles.
  • Easy addition of more services, such as authentication, debuggers, monitors, and load balancers.
  • Generally, container software does not have to change if the underlay topology changes.

Similar to how a VoIP virtual network facilitates virtual point-to-point connections between phones, a container overlay network facilitates virtual point-to-point connections between containers. The image below shows the Microservice architecture described in Cloud Management Tools on Arm. The virtual connections between the Microservices are made reliable by the overlay. This frees the developer to focus on the business logic of their service.

Overlays in the cloud

VXLAN Based Overlays

VXLAN is a network tunneling technology that is supported by the Linux Kernel. Network tunneling means that we're hiding one protocol (VXLAN) within another protocol (the underlay's TCP/IP). VXLAN tunnels layer 2 (L2) frames inside layer 4 (L4) UDP datagrams. This creates the illusion that containers on the same VXLAN are on the same L2 network. We'll use an example to better understand how this works.

The image below shows two hosts, labeled Host 1 and Host 2, which are on different networks. These hosts are running containers labeled Container A and Container B. Connected to each container is a pipe labeled Veth. A Veth is a Linux networking construct called a Virtual Ethernet Pair; it's used to connect network devices in one network namespace to network devices in another network namespace. The other end of the Veth pair is connected to a bridge, which is shown in yellow. Connected to the bridge is another pipe labeled Vtep1/Vtep2. A VTEP (VXLAN Tunnel Endpoint) is another Linux networking construct. It's the entry/exit point for VXLAN tunnels. This is where the frames from the container get encapsulated inside a UDP datagram. VTEPs get their own MAC and IP addresses and show up as network interfaces (assuming you're checking the right network namespace).

Last, notice that the VTEP is outside of the container. This means the containers have no knowledge of the existence of the tunnel. This is why VXLANs are able to fool the containers into thinking they are connected to the same network segment. Let's walk through an Ethernet frame delivery between Container A and Container B.

VXLAN Based Overlays

To send data to Container B, Container A will construct a standard TCP/IP frame like the one shown in the image below. If we inspect the network header (i.e. packet header), we see that the source and destination IP addresses are those of Container A and Container B respectively. If we look at the link header (i.e. frame header), we see that the source and destination MAC addresses are those of Container A and Container B respectively. This shows us that Container A really does think it's connected to the same L2 network as Container B. Later, we'll explain how Container A obtains the MAC address of Container B even though they are on different networks.

VXLAN Based Overlays: TCP/IP Frame

The frame that is created by Container A will be sent to the bridge via the Veth pair (labeled 1 above). Once we're at the bridge, if the network header destination IP address is for a host outside of the VXLAN, the frame is sent to the physical network interface. However, if it's for a host inside the VXLAN (like Container B), the frame will be sent to Vtep1 (labeled 2 above). When the frame enters the tunnel at Vtep1, the VXLAN headers are added.

The image below shows what the encapsulated VXLAN frame looks like. We see the original frame from Container A with some additional headers. First, we see the transport header, which contains a VXLAN header and a UDP header. Next is the outer (VXLAN) network header. Here we see that the source and destination IP addresses are the VTEPs of Host 1 and Host 2 respectively. The important thing to understand is that the routing will be done with these outer IP addresses. Thus, the underlay routing is happening with Host VTEP IPs, not with Container A and Container B IPs. Last, we have the link header. Its source and destination MACs will change as the packet gets routed through different networks (hence the 'XXXX' and 'YYYY').

After entering the tunnel at Vtep1, the frame will exit Host 1's eth0 interface (labeled 3 above). The packet will be routed across different networks (labeled 4 above) using the VTEP IPs (outer VXLAN network header). Once the packet makes it to Host 2 (labeled 5 above), the process is reversed. The VXLAN frame will enter Vtep2 and get decapsulated (labeled 6 above), then the original frame will move from the bridge to Container B via the Veth pair (labeled 7 above).

VXLAN Based Overlays: VXLAN Encapsulation
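To make the encapsulation step a little more concrete, here is a minimal Python sketch of what a VTEP conceptually does with the VXLAN header itself. It assumes the 8-byte header layout from the VXLAN specification (RFC 7348): one flags byte, three reserved bytes, a 24-bit VXLAN Network Identifier (VNI), and one more reserved byte. The function names are our own, and the outer UDP, IP, and link headers are left out because the host's network stack adds those.

```python
import struct
from typing import Tuple

VXLAN_UDP_PORT = 4789        # IANA-assigned UDP destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field carries a valid value
VXLAN_HEADER_LEN = 8


def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits).
    # Word 1: VNI (24 bits) + reserved (8 bits).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(udp_payload: bytes) -> Tuple[int, bytes]:
    """Strip the VXLAN header and return (vni, inner Ethernet frame)."""
    if len(udp_payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload is too short to hold a VXLAN header")
    word0, word1 = struct.unpack("!II", udp_payload[:VXLAN_HEADER_LEN])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word1 >> 8, udp_payload[VXLAN_HEADER_LEN:]


if __name__ == "__main__":
    # Placeholder bytes standing in for Container A's original frame.
    original = b"\x02" * 14 + b"hello from Container A"
    tunneled = vxlan_encapsulate(42, original)
    vni, recovered = vxlan_decapsulate(tunneled)
    print(vni, recovered == original)
```

The point of the sketch is that the original frame is carried through untouched: decapsulation on the far VTEP returns exactly the bytes Container A sent.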

Now that we understand the VXLAN tunneling process, we can answer the earlier question: how does Container A learn Container B's MAC address? It's done with a standard ARP message, except the ARP message gets tunneled through the VTEPs. When the VTEP sees the ARP message from the container, it sends it to the VXLAN multicast group so that all VTEPs participating in the VXLAN will see the ARP. The underlay does not see this ARP message since it's being tunneled. Thus, the underlay's hardware will not broadcast any ARP messages.
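The ARP exchange can be modeled with a small toy simulation. This is a simplified sketch, not how any particular overlay implements it: the `Vtep` class, the `arp_resolve` function, and all addresses are hypothetical, and the multicast group is just a Python list. We also assume each VTEP remembers which remote VTEP a MAC address was seen behind (often called flood-and-learn).

```python
from typing import Dict, List, Optional


class Vtep:
    """Toy model of a VTEP and the containers attached to its bridge."""

    def __init__(self, ip: str, containers: Dict[str, str]) -> None:
        self.ip = ip                    # underlay IP of this VTEP
        self.containers = containers    # container IP -> container MAC
        self.fdb: Dict[str, str] = {}   # learned: remote MAC -> remote VTEP IP


def arp_resolve(sender: Vtep, src_mac: str, target_ip: str,
                group: List[Vtep]) -> Optional[str]:
    """Flood an ARP request to the group and return the answering MAC, if any."""
    answer = None
    for vtep in group:
        if vtep is sender:
            continue
        # Every VTEP that receives the tunneled request learns where the
        # requesting container lives.
        vtep.fdb[src_mac] = sender.ip
        target_mac = vtep.containers.get(target_ip)
        if target_mac is not None:
            # The tunneled reply lets the sender learn the target's VTEP.
            sender.fdb[target_mac] = vtep.ip
            answer = target_mac
    return answer


if __name__ == "__main__":
    vtep1 = Vtep("10.0.1.10", {"172.16.0.2": "02:00:00:00:00:0a"})
    vtep2 = Vtep("10.0.2.20", {"172.16.0.3": "02:00:00:00:00:0b"})
    mac = arp_resolve(vtep1, "02:00:00:00:00:0a", "172.16.0.3", [vtep1, vtep2])
    print(mac, vtep1.fdb, vtep2.fdb)
```

After one request and reply, each side knows which VTEP to tunnel to for the other container's MAC, and the underlay never saw an ARP broadcast.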

A final note about VXLAN is that the encap/decap of frames adds overhead to the network stack. The overhead can be reduced with HW acceleration, but it can't be eliminated. This overhead can be avoided with overlay solutions like Calico. Calico updates a container host's routing table with the routes of the containers participating in the overlay. It uses BGP to enable sharing of this routing information between hosts in the cluster. We will take a look at BGP in the next section.
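One part of the VXLAN overhead mentioned above is easy to quantify: the extra bytes added to every frame. The arithmetic below is a rough sketch that assumes an IPv4 underlay, no VLAN tags, and no IP options; the function names are our own. It does not capture the CPU cost of encap/decap.

```python
# Header sizes in bytes, assuming an IPv4 underlay without VLAN tags or IP options.
ETHERNET_HEADER = 14
IPV4_HEADER = 20
UDP_HEADER = 8
VXLAN_HEADER = 8


def vxlan_wire_overhead() -> int:
    """Extra bytes on the wire per tunneled frame (all of the outer headers)."""
    return ETHERNET_HEADER + IPV4_HEADER + UDP_HEADER + VXLAN_HEADER


def inner_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits in one underlay packet.

    The underlay MTU has to hold the outer IPv4, UDP, and VXLAN headers plus
    the inner Ethernet header before any of the container's IP packet.
    """
    return underlay_mtu - (IPV4_HEADER + UDP_HEADER + VXLAN_HEADER + ETHERNET_HEADER)


if __name__ == "__main__":
    print(vxlan_wire_overhead())  # 50
    print(inner_mtu(1500))        # 1450
```

Under these assumptions, a 1500-byte underlay MTU leaves 1450 bytes for each container packet.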

BGP Based Overlays

The Border Gateway Protocol is the routing protocol of the Internet. It's used for routing between Autonomous Systems. An Autonomous System can be an ISP, an Internet Exchange Point, or a Transit Provider (to name a few). It's called the Border Gateway Protocol because traditionally it's used at the borders/edges of Autonomous Systems. Calico uses BGP to deploy overlays. Calico sets up a mesh of BGP peers, where the peers are the hosts that make up the cluster. Each BGP peer will advertise container routes to all other peers. When peers receive the route information, they will update their routing tables. Since we are not tunneling, the network performance should be close to that of a native network stack. That said, this approach doesn't work in all situations; for example, in public clouds where we have no control of the underlay network. In these cases, Calico can use IP-in-IP or VXLAN tunnels. Let's take a look at what happens to a host's routing table when running a Calico overlay.
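Before looking at real routing tables, the mesh idea can be sketched in a few lines of Python. This is a toy model of the outcome only, not of the BGP protocol or of Calico's implementation; the host addresses, prefixes, and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def full_mesh_sessions(peers: List[str]) -> List[Tuple[str, str]]:
    """Every pair of hosts peers with each other: n * (n - 1) / 2 sessions."""
    return list(combinations(peers, 2))


def exchange_routes(local_prefixes: Dict[str, List[str]]) -> Dict[str, Dict[str, str]]:
    """Return each host's learned routes as {prefix: next-hop host}.

    local_prefixes maps a host IP to the container prefixes it hosts.
    """
    tables: Dict[str, Dict[str, str]] = {host: {} for host in local_prefixes}
    for a, b in full_mesh_sessions(list(local_prefixes)):
        # Each side of a session advertises its own container prefixes.
        for prefix in local_prefixes[a]:
            tables[b][prefix] = a
        for prefix in local_prefixes[b]:
            tables[a][prefix] = b
    return tables


if __name__ == "__main__":
    cluster = {
        "10.0.0.1": ["172.16.1.0/26"],
        "10.0.0.2": ["172.16.2.0/26"],
        "10.0.0.3": ["172.16.3.0/26"],
    }
    for host, routes in exchange_routes(cluster).items():
        print(host, routes)
```

Each host ends up with one route per remote container prefix, pointing at the host that advertised it. A full mesh grows quadratically in the number of sessions, which is one reason this model fits clusters better than it fits very large fleets.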

We have two machines, Machiato-1 and Machiato-2, with IP addresses 192.168.45.11 and 192.168.45.12 respectively. Below is what their routing tables look like before they become BGP peers. As expected, we see a few different routes as well as a default route. eth2 is the only physical interface on each of these machines.

BGP Based Overlays

BGP Based Overlays

Below is what each machine's BGP status looks like after peering. As we can see, machiato-1 is peered with machiato-2 (192.168.45.12), and in the second image, we see that machiato-2 is peered with machiato-1 (IP 192.168.45.11).

BGP Based Overlays - Machiato 1

Machiato-1

BGP Based Overlays - Machiato 2

Machiato-2

After peering and deploying a few containers to the cluster, let's look at the routing tables again. In machiato-1's routing table, we see 9 veths with names that start with cali (as in Calico). This tells us there are 9 containers running on this host. We also see that the IP addresses for these containers are in the 192.168.170.192/26 subnet. Next, notice that traffic destined for 192.168.60.128/24 will be sent to machiato-2 (192.168.45.12). Based on what we see in machiato-1's table, we should expect to see IP addresses in the 192.168.60.128/24 subnet for containers running on machiato-2.

BGP Based Overlays: Machiato-1's routing table

Taking a look at machiato-2's routing table, we see there are 5 containers running, and as expected, the IP addresses for these containers are in the 192.168.60.128/24 subnet.

BGP Based Overlays: Machiato-2 Routing Table
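To see how a host picks a route from a table like machiato-1's, here is a small longest-prefix-match sketch using Python's standard `ipaddress` module. Only the remote prefix and the machiato-2 next hop come from the text above. The default route label, the 192.168.45.0/24 on-link route, the container address 192.168.170.193, and the interface name `cali-example` are assumptions made for illustration. The remote prefix is copied as written; since it has host bits set, we pass `strict=False` so Python normalizes it instead of rejecting it.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[ipaddress.IPv4Network, str]

# Hypothetical approximation of machiato-1's table after peering.
MACHIATO_1_ROUTES = [
    ("0.0.0.0/0", "default via eth2"),                # assumed default route
    ("192.168.45.0/24", "on-link via eth2"),          # assumed underlay subnet
    ("192.168.60.128/24", "via 192.168.45.12"),       # containers on machiato-2
    ("192.168.170.193/32", "local via cali-example"), # assumed local container
]


def build_table(entries: List[Tuple[str, str]]) -> List[Route]:
    """Parse (prefix, description) pairs into network objects."""
    return [(ipaddress.ip_network(prefix, strict=False), target)
            for prefix, target in entries]


def lookup(table: List[Route], destination: str) -> Optional[str]:
    """Return the route with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, target) for net, target in table if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    table = build_table(MACHIATO_1_ROUTES)
    for dest in ("192.168.170.193", "192.168.60.130", "192.168.45.12", "8.8.8.8"):
        print(dest, "->", lookup(table, dest))
```

A packet for a container on machiato-2 matches the advertised prefix and is handed to 192.168.45.12, while a packet for a local container matches the more specific /32 route and goes straight to that container's veth.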

The last thing to notice about the routing tables above is that the routes to the BGP peers seem to use a Linux Tun device (labeled tunl0) instead of the eth2 physical interface. After looking into this a bit further (not shown above), it appears that an IP-in-IP tunnel is being used. It's not clear to us why this is the case, as IP-in-IP tunneling should be off by default when following the Project Calico getting started with Kubernetes instructions. We will have to follow up on this as we do further experiments with Calico. If anyone has an explanation, please leave a comment below.

Closing Remarks

We should now have an appreciation for overlays and the various cloud overlay projects that exist. We mentioned that WeaveNet, Flannel, SwarmKit, and Calico all work on aarch64 systems. If you are exploring overlay networks on aarch64 platforms for the first time, we suggest starting out with WeaveNet, Flannel, or SwarmKit. This is because Calico is still not officially supported on Arm, so you would have to build Calico from source, which can be a bit tedious. There has been progress on Arm support in Calico, but more work is needed.

If you are experienced with cloud overlay solutions and would like to learn more, we suggest getting involved in Project Calico. Take a look at the Cross build docker images GitHub issue and post a message expressing interest in helping. Even though that work is about setting up cross compiling, the code is written in Go, so solving the cross compiling problem will also solve the native aarch64 compile problem (Note: we did a native aarch64 compile for the Calico example above).

Get involved with Project Calico
