Heterogeneity is a natural property of edge computing, which uses different hardware solutions to better address specific requirements. The Evolving Edge Computing and Harnessing Heterogeneity blog post discusses many other aspects of heterogeneity. This blog post addresses heterogeneity in the context of Kubernetes-managed edge computing.
Arm-based edge solutions enhance the design space by providing various levels of heterogeneity in compute capabilities: from clusters of heterogeneous nodes where each node addresses a different design point (cost, size, power, energy), to big.LITTLE designs where heterogeneity is intrinsic to each node, to dedicated accelerators (Cortex-R, Cortex-M, NPUs and others). Another source of heterogeneity addressed in this blog post originates from dynamic changes in compute capability due to physical factors such as energy, power, and temperature.
Management of Quality of Service (QoS) is a core requirement of cloud computing in general, and it is especially important for edge computing. Multi-application systems and multi-component applications need resources allocated to each component so that the expected QoS is delivered under all conditions the system is designed to operate in.
Containers in Linux use cgroups for resource management. All processes that belong to a container are placed either in the main cgroup for that container or in one of its sub-groups.
CPU as a resource is managed in multiple ways. In Docker (and similarly in Podman), `--cpuset-cpus` sets CPU affinity, and `--cpus` sets how much time all the processes in the group can use within a defined period; the default period is 100ms, so 0.2 CPU means 20ms every 100ms. Cgroups can be oversubscribed, so it is the responsibility of higher-level software to guarantee that enough resources exist for all running containers.
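As a concrete illustration, a fractional CPU allocation maps to the cgroup CFS quota/period pair. The sketch below (the function name is ours, not Docker's) shows the arithmetic:

```python
def cfs_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """Convert a fractional CPU allocation (e.g. Docker's --cpus=0.2)
    into a CFS quota in microseconds for the given period."""
    return round(cpus * period_us)

# 0.2 CPU with the default 100ms period allows 20ms of runtime per period.
print(cfs_quota_us(0.2))
```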
Kubernetes' current model uses a fraction of CPU time as its CPU resource allocation metric, where each container is allocated a guaranteed and a maximum amount of CPU time. CPU core heterogeneity is not accounted for in Kubernetes, so on heterogeneous nodes (different core types and operating frequencies) containers will deliver different performance even though the same allocation is used. Edge applications therefore require allocations specific to each node type if a QoS specification is to be met.
Performance characterization of applications is a very complex subject, even more so when multiple hardware configurations are used. This proposal is not intended to address the general question of predicting application performance on heterogeneous hardware, but to determine whether a simple model based on CPU performance can be used to set compute requirements for CPU-bound applications.
Allocation based on CPU capacity in heterogeneous clusters (a diverse set of nodes) and heterogeneous nodes (nodes with different sets of cores) can provide two major benefits:
* CPU bound applications can provide similar performance independent of node or core type since sufficient CPU resources will be allocated. This improves portability of these applications across different core or node types.
* Environmental characteristics like temperature, energy and power constraints can affect core performance and may require resource reallocation to maintain required system QoS.
The Dhrystone benchmark was used to estimate the compute capacity of a core. This benchmark was chosen for its simplicity: it estimates only the raw integer performance of a core, with little impact from other elements such as branch prediction, vector operations, and the memory hierarchy. It is expected that most simple compute-intensive applications will behave similarly to this benchmark. The benchmark was run on multiple boards, each with different core types or core frequencies. The following figure shows that benchmark results per core correlate extremely well with frequency. It also shows large performance differences between core types, even when cores are scaled to the same frequency. The Cortex-A53 (Pi3/Odroidc2), for example, is equivalent to 0.3 of an A76 (Pi5) when both run at the same frequency. The results also show that performance of the same core type can differ across SoCs: the Cortex-A72 shows a 6% difference between the Pi4 and the NanoPi M4v2, while for others, such as the Cortex-A53, there is no difference. The linear model shows that a very simple frequency-based scaling factor can be used as a proxy for CPU capacity changes on the same core.
Using the results above, a model is derived to estimate the compute capacity of a node under the following assumptions:
This model estimates compute capacity by comparing benchmark results from a reference core (the A76 of a Raspberry Pi5 running at 2.4GHz) with the results for the cores in the current system. In this report, Dhrystone is used as the benchmark, as mentioned in the section Node Characterization. The model requires the following parameters:
The compute capacity of a core is defined as C = (Rcur/Rref) * (Fberef/Fbecur), which implies that the core capacity of the reference core is 1. The current core capacity is defined as Ccur = C * (Fcur/Fberef); that is, the capacity scaled to the current operating frequency of the core.
This model requires two parameters from the reference core (benchmark result and frequency) and three for each core present on the current system. A Cortex-A53 running at the same frequency as a Cortex-A76 has a core capacity of 0.3. Cortex-A53s normally run at 1.2GHz and the reference core runs at 2.4GHz, so each Cortex-A53 (1.2GHz) has a core capacity of 0.15 compared to the Cortex-A76 (2.4GHz).
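The model can be written directly in code. The Dhrystone scores below are illustrative placeholders chosen to reproduce the 0.3 ratio from the figure, not measured values:

```python
def core_capacity(r_cur: float, f_becur: float, r_ref: float, f_beref: float) -> float:
    """C = (Rcur/Rref) * (Fberef/Fbecur): performance relative to the
    reference core, normalized to the same frequency."""
    return (r_cur / r_ref) * (f_beref / f_becur)

def current_capacity(c: float, f_cur: float, f_beref: float) -> float:
    """Ccur = C * (Fcur/Fberef): capacity at the core's current frequency."""
    return c * (f_cur / f_beref)

# Illustrative numbers: the reference A76 scores 100 at 2.4GHz, and an
# A53 scores 15 when benchmarked at 1.2GHz.
c_a53 = core_capacity(15.0, 1.2, 100.0, 2.4)   # ~0.3
ccur_a53 = current_capacity(c_a53, 1.2, 2.4)   # ~0.15
```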
A compute capacity lower than 1.0 implies that the measured core has lower performance than the reference core at the same operating frequency. This is expected to be a lower bound, assuming the benchmark takes advantage of all capabilities of the reference core. If resource allocation is determined on the reference core and scaled for use on the measured core, a capacity lower than 1.0 makes the reservation pessimistic: more resources are reserved than are probably needed. Conversely, a compute capacity higher than 1.0 implies that the measured core has higher performance than the reference core at the same operating frequency. This is expected to be an upper bound, assuming the benchmark takes advantage of all capabilities of both the reference and the measured core. In this case a scaled allocation is optimistic: fewer resources are reserved than are probably needed. The latter case is undesirable, since it can prevent the system from operating according to expected behavior, such as performance or latency targets.
Resource allocation for containers is used by orchestrators at admission control: new containers are allocated to a node only if enough resources are available. The current CPU resource uses core count as the metric; a node that has 4 cores available will list 4 cores as CPU resources, without taking the core type into consideration. The model described in the previous section (Compute Capacity Model) can be used to scale the CPU resources to better describe the expected compute capacity of the node: it can account for core types and core frequency, where the previous model only accounts for the number of cores.
Kubernetes is the most prominent open-source container orchestration software, designed to provide users with cluster-scale automation of software deployment, scaling, and management. The unit of software managed by Kubernetes is the container, in the form of a “pod” which describes one or more containers.
At its core, Kubernetes consists of a collection of tools and databases that run in the cluster and form the backbone infrastructure, and an endpoint agent called a “kubelet” which runs on each node in the cluster. The “kubelet” communicates with the backbone infrastructure to determine which containers it should be managing, and to send back runtime information for orchestration users to observe. K3S, which was used for our testing, does not change this overall design.
As shown in 'Figure 2: Kubernetes software components', there is a boundary on the node between the “kubelet” and the underlying container software. In our testing, each node was using containerd. We decided that the best insertion point for our changes would be the “kubelet” itself, at the terminal edge of Kubernetes before the handover to containerd.
Kubernetes provides resource management for its “pods”, both for limiting and requesting compute and memory. However, Kubernetes does not rigorously define its compute resources (hereafter referred to as “CPU”).
Without any changes, a pod definition can contain a CPU limit and request in dimensionless units of CPU time (for example, “100m”, meaning 10% of the time on a single CPU). When such limits and requests on a pod are given to a “kubelet” to interpret, the “kubelet” will do so naively and take that percentage of CPU time from whatever computing hardware is present on the node.
This is insufficient for heterogeneous clusters. In a homogeneous cluster, CPU time is consistent between all nodes and a percentage of any given node’s CPU is equivalent to the same percentage elsewhere, but in a heterogeneous cluster this will result in either under- or over-provisioning if the node running the container is smaller or larger than expected.
The goal for proper heterogeneous cluster support is to first define these resources, and then to ensure that the Kubernetes software respects those definitions. Our working model for the changes is diagrammed in 'Figure 3: Heterogeneous cluster showing scaling factors', expecting a cluster of heterogeneous nodes of varying CPU strengths.
We select a baseline core as our “unit core”: specifically, a single Arm Cortex-A76 core running at 2.4GHz. All pod definitions can remain in the same format as before, giving a percentage of CPU time, but that is now a percentage of our specific “unit core” instead of a dimensionless core. Hereafter these units will be referred to as “DhryUnits”. Because this change is definitional, existing pod definitions will continue to function without error.
The first change in code is to introduce what is called a “scaling factor” (hereafter, SF). The section 'Kubernetes in detail' introduced our methodology, and the model described in the section Compute Capacity Model is now put to use. Given a list of Dhrystone measurements scaled to 2.4GHz for each make and model of core that we want to convert between, we calculate the difference between this node’s core and our “unit core”. This calculation can optimally be done once, after reading machine information from cadvisor, one of the backbone infrastructure tools included with Kubernetes and available for use at run time.
If we do not have measurements for a given node, a warning is sent, and we leave the SF at 1.0. In other words, we fall back to normal Kubernetes behavior whenever we encounter an uncharacterized node, operating no worse than what Kubernetes did before our changes.
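A sketch of the SF lookup, assuming SF is defined as the unit-core score divided by this core's frequency-scaled score (so weaker cores get SF > 1). The table values and names are illustrative, not measured:

```python
# Illustrative frequency-scaled (to 2.4GHz) Dhrystone scores per core model.
DHRYSTONE_AT_2_4GHZ = {
    "cortex-a76": 100.0,  # the "unit core"
    "cortex-a53": 30.0,
}
UNIT_CORE = "cortex-a76"

def scaling_factor(core_model: str) -> float:
    """SF = unit-core score / this core's score. Unknown cores fall back
    to SF = 1.0, i.e. stock Kubernetes behavior."""
    score = DHRYSTONE_AT_2_4GHZ.get(core_model)
    if score is None:
        print(f"warning: no measurement for {core_model}, using SF = 1.0")
        return 1.0
    return DHRYSTONE_AT_2_4GHZ[UNIT_CORE] / score
```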
With the SF in hand, the “kubelet” primarily needs to update two locations: capacity calculation, and container creation and resizing.
Capacity calculation is done once during “kubelet” initialization. Previously, the “kubelet” would report 1000m of CPU for each physical core on the machine; we scale this by the scaling factor so that it reports 1000m/SF DhryUnits for each physical core on the machine.
For container creation and resizing, the terminal end of Kubernetes creates a container resource configuration for each container it manages. Inside that configuration are the CpuQuota and CpuShares that the container should be assigned. Both of those numbers should be scaled by the SF.
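The two scaling sites can be sketched as follows (the function names are ours; the real change lives inside the “kubelet” code):

```python
def node_capacity_m(num_cores: int, sf: float) -> int:
    """Capacity reported at kubelet initialization: 1000m/SF DhryUnits
    per physical core."""
    return round(num_cores * 1000 / sf)

def scale_cpu_config(cpu_quota: int, cpu_shares: int, sf: float) -> tuple[int, int]:
    """Scale a container's CpuQuota and CpuShares by SF so a DhryUnits
    request maps to the right amount of real CPU time."""
    return round(cpu_quota * sf), round(cpu_shares * sf)
```

On a node whose cores are half as fast as the unit core (SF = 2.0), a 4-core machine reports 2000 DhryUnits, and a container receives twice the real quota and shares it would on the reference node.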
With the changes described, we then tested a single pod definition requiring 100m CPU across a cluster of differently sized nodes, shown in the following diagram:
The capacity was calculated correctly for each node (the amount shown is in DhryUnits), and the pod was given the appropriate amount of real CPU time on each node as well.
Since the server components already respect capacity for admission control, no changes are needed beyond these. In 'Figure 2: Kubernetes software components', we have split the operation such that, from Kubernetes’ point of view, all nodes and pods use DhryUnits, and containerd only ever uses real CPU shares. In this way, a cluster can be made of many kinds of nodes and there is one consistent way to correctly provision compute resources, avoiding under- and over-provisioning.
There is one other kind of heterogeneity which is worth exploring. We must also account for cases where a single node contains multiple different kinds of cores with different relative compute power and clock frequency.
This style of compute architecture often goes by the moniker “big.LITTLE” and has been used in several Arm multi-core chips. Even with the changes above to support heterogeneous clusters, such heterogeneous nodes would fail to correctly run pods.
The core change needed here is to further make SF not a singular property of a node, but to instead calculate the SF for each core on the node or each domain of cores. The same Dhrystone measurements are used again, and the calculation now emits a mapping of coreid to SF for that core on this node. For ease of use, a second mapping is created which is an ordered mapping of domain to sets of coreids (hereafter, “CpuSets”).
Capacity is changed from `1000m/SF * numcores` to `1000m/SF` per core, summed across all cores with their various SFs. This now accurately reflects the real capacity of all nodes, including heterogeneous ones.
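A sketch of the per-core bookkeeping, using illustrative SF values for a hypothetical big.LITTLE node (four big cores at SF = 1.0 and four little cores at SF = 4.0):

```python
def cpu_domains(core_sf: dict[int, float]) -> dict[float, list[int]]:
    """Group core ids into domains of identical SF (the "CpuSets")."""
    domains: dict[float, list[int]] = {}
    for core_id, sf in sorted(core_sf.items()):
        domains.setdefault(sf, []).append(core_id)
    return domains

def node_capacity_m(core_sf: dict[int, float]) -> int:
    """Sum 1000m/SF over every core so heterogeneous nodes report their
    true aggregate capacity in DhryUnits."""
    return round(sum(1000 / sf for sf in core_sf.values()))

# Hypothetical node: cores 0-3 are big (SF=1.0), cores 4-7 little (SF=4.0).
core_sf = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 4.0, 5: 4.0, 6: 4.0, 7: 4.0}
```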
Container creation and resizing is more complicated. First, we require new pod definition metadata. If unspecified, we default to using the 'little' cores on a node since they are at a finer granularity and more power efficient. Otherwise, we look for a pod metadata field to specify “big” or “little” and we select either the largest or smallest domain of cores to proceed.
Knowing the correct CpuSet, we scale the container’s compute requirements with the SF of one of the cores of that domain and make one more change to the container resource configuration. As we specify CpuQuota and CpuShares, we can also specify CpuSets to ensure that the container runs only on specific cores.
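The domain selection and pinning can be sketched as follows, with “big” meaning the domain with the highest per-core capacity and “little” the lowest. The function name and the quota unit convention (microseconds per 100ms period) are our assumptions:

```python
def build_cpu_config(request_m: int,
                     cpusets: dict[str, list[int]],
                     core_sf: dict[int, float],
                     preference: str = "little") -> dict[str, object]:
    """Select a core domain, scale the DhryUnits request by that
    domain's SF, and pin the container to the domain via a cpuset."""
    pick = max if preference == "big" else min
    # Rank domains by the per-core capacity (1000/SF) of their first core.
    domain = pick(cpusets, key=lambda d: 1000 / core_sf[cpusets[d][0]])
    sf = core_sf[cpusets[domain][0]]
    return {
        "CpuQuota": round(request_m * sf * 100),  # µs per 100ms period
        "CpuSet": ",".join(str(c) for c in cpusets[domain]),
    }
```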
There is one gap left before full heterogeneous node support: admission control still sees the node as having a singular capacity, but our pods can now be targeted to one or another domain of cores on the node. For completeness, heterogeneous nodes should have separate capacities, and admission control should be aware that a pod definition requiring one domain of cores cannot be satisfied by using the capacity of another domain. This work was not completed.
One of the goals of this proposal is enabling management of QoS when operating under non-ideal conditions; under ideal conditions the system can provide its stated compute capacity. Currently, orchestrators assume that the system is operating under ideal conditions, so any degradation will cause adverse effects, from unmet QoS metrics to application failures.
Edge systems are exposed to different environmental conditions than cloud computing. Edge nodes with similar hardware configurations may need to adapt to different environments, such as being powered by batteries or limited energy sources, or being exposed to high temperatures and reduced cooling capability.
The proposed solution divides the problem into three parts:
The current implementation is a Python program, external to the kubelet, that uses the CRI interface to containerd. The application runs the following algorithm every time a change is made to a core's operating frequency:
This algorithm preserves the capacity set at the pod level and scales shares appropriately. Shares are set according to current operating conditions, such as operating frequency and number of cores. Only when system capacity drops below what the priority workloads running on the system require are the priority pods affected. Priority is currently set as a label on the pod.
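The allocation policy implied by this description can be sketched as follows: priority pods keep their requested capacity whenever possible, and only when capacity drops below the total priority demand are they scaled back. This is a simplification of the actual program, which translates allocations into cgroup shares over CRI; the pod structure here is illustrative:

```python
def rebalance(pods: list[dict], capacity_m: float) -> dict[str, float]:
    """Distribute the node's current capacity (in millicores) across pods.
    Priority pods are served first at their full request; non-priority
    pods share whatever remains, proportional to their requests."""
    prio = [p for p in pods if p.get("priority")]
    rest = [p for p in pods if not p.get("priority")]
    prio_demand = sum(p["request"] for p in prio)
    alloc: dict[str, float] = {}
    if capacity_m < prio_demand:
        # Degraded: even priority pods scale down proportionally.
        for p in prio:
            alloc[p["name"]] = p["request"] * capacity_m / prio_demand
        for p in rest:
            alloc[p["name"]] = 0.0
        return alloc
    for p in prio:
        alloc[p["name"]] = float(p["request"])
    leftover = capacity_m - prio_demand
    rest_demand = sum(p["request"] for p in rest)
    for p in rest:
        share = leftover * p["request"] / rest_demand if rest_demand else 0.0
        alloc[p["name"]] = min(float(p["request"]), share)
    return alloc
```

For example, when frequency scaling halves a node's capacity, the priority pod keeps its full allocation while the non-priority pods absorb the loss.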
The following figure shows how the Dynamic capacity and priority manager interfaces with Kubernetes.
The current implementation does not start or stop pods and containers, but changes resource allocations towards preserving QoS. A few gaps are present in the current implementation:
The objective of this proposal is to describe a possible solution for addressing heterogeneity at the edge and, more importantly, to serve as a starting point from which better solutions can be discussed.
Even though this work is oriented towards edge computing, it can also be applied to the cloud as heterogeneity becomes more and more prevalent. Even in current cloud infrastructure, multiple generations of systems coexist, yet Kubernetes and container clusters are treated as homogeneous. Mobile is another area where this proposal can be applied, since container-based environments are being used to deploy applications.
Find out more about how Arm is transforming Edge Computing: Enabling Edge Computing