Heterogeneity is a natural property of edge computing, which uses different hardware solutions to better address specific requirements. The Evolving Edge Computing and Harnessing Heterogeneity blog post discusses many other aspects of heterogeneity. This blog post addresses heterogeneity in the context of Kubernetes-managed edge computing.
Arm-based edge solutions enhance the design space by providing various levels of heterogeneity in compute capabilities: from clusters of heterogeneous nodes where each node addresses a different design point (cost, size, power, energy), to big.LITTLE designs where heterogeneity is intrinsic to each node, to dedicated accelerators (Cortex-R, Cortex-M, NPUs and others). Another source of heterogeneity addressed in this blog post originates from dynamic changes in compute capability due to physical factors such as energy, power, and temperature.
Management of Quality of Service (QoS) is a core requirement of cloud computing in general, and it is especially important for edge computing. Multi-application systems and multi-component applications need resources allocated to each component so that the expected QoS is delivered under all conditions the system is designed to operate in.
Containers in Linux use cgroups for resource management. All processes that belong to a container are placed either in the main cgroup for that container or in one of its sub-groups.
CPU as a resource is managed in multiple ways. In Docker (and similarly in Podman), `--cpuset-cpus` sets CPU affinity, and `--cpus` sets how much time all the processes in the group can use within a defined period; the default period is 100ms, so 0.2 CPU means 20ms every 100ms. Cgroups can be oversubscribed, so it is the responsibility of higher-level software to guarantee that enough resources exist for all running containers.
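As a concrete illustration, a fractional CPU allocation maps to the cgroup CFS quota/period pair. The sketch below (the function name is ours, not Docker's) shows the arithmetic:

```python
def cfs_quota_us(cpus: float, period_us: int = 100_000) -> int:
    """Convert a fractional CPU allocation (e.g. Docker's --cpus=0.2)
    into a CFS quota in microseconds for the given period."""
    return round(cpus * period_us)

# 0.2 CPU with the default 100ms period allows 20ms of runtime per period.
print(cfs_quota_us(0.2))
```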
Kubernetes' current model uses a fraction of CPU time as its CPU resource allocation metric, where each container is allocated a guaranteed and a maximum amount of CPU time. CPU core heterogeneity is not accounted for in Kubernetes, so on heterogeneous nodes (different core types and operating frequencies) containers will deliver different performance even though the same allocation is used. Edge applications therefore require allocations specific to each node type if a QoS specification is to be met.
Performance characterization of applications is a very complex subject, even more so when multiple hardware configurations are used. This proposal is not intended to address the general question of predicting application performance on heterogeneous hardware, but to determine whether a simple model based on CPU performance can be used to set compute requirements for CPU-bound applications.
Allocation based on CPU capacity in heterogeneous clusters (a diverse set of nodes) and heterogeneous nodes (nodes with different sets of cores) can provide two major benefits:
* CPU bound applications can provide similar performance independent of node or core type since sufficient CPU resources will be allocated. This improves portability of these applications across different core or node types.
* Environmental characteristics like temperature, energy and power constraints can affect core performance and may require resource reallocation to maintain required system QoS.
The Dhrystone benchmark was used to estimate the compute capacity of a core. This benchmark was chosen for its simplicity: it estimates only the raw integer performance of a core, with little impact from other elements such as branch prediction, vector operations, and the memory hierarchy. It is expected that most simple compute-intensive applications will behave similarly to this benchmark. The benchmark was run on multiple boards, each with different core types or core frequencies. The following figure shows that benchmark results per core correlate extremely well with frequency. It also shows large performance differences between core types, even when cores are scaled to the same frequency. The Cortex-A53 (Pi3/Odroidc2), for example, is equivalent to 0.3 of an A76 (Pi5) when both run at the same frequency. The results also show that performance of the same core type can differ across SoCs: the Cortex-A72 shows a 6% difference between the Pi4 and the NanoPi M4v2, while for others, such as the Cortex-A53, there is no difference. The linear model shows that a very simple frequency-based scaling factor can be used as a proxy for CPU capacity changes on the same core.
Using the results above, a model is derived to estimate the compute capacity of a node under the following assumptions:
This model estimates compute capacity by comparing benchmark results from a reference core (the A76 of a Raspberry Pi5 running at 2.4GHz) with the results for the cores in the current system. In this report, Dhrystone is used as the benchmark, as mentioned in the section Node Characterization. The model requires the following parameters:
The compute capacity of a core is defined as C = (Rcur/Rref) * (Fberef/Fbecur), which implies that the core capacity of the reference core is 1. The current core capacity is defined as Ccur = C * (Fcur/Fberef); that is, the capacity scaled to the current operating frequency of the core.
This model requires two parameters from the reference core (benchmark result and frequency) and three for each core present on the current system. A Cortex-A53 running at the same frequency as a Cortex-A76 has a core capacity of 0.3. Cortex-A53s normally run at 1.2GHz and the reference core runs at 2.4GHz, so each Cortex-A53 (1.2GHz) has a core capacity of 0.15 compared to the Cortex-A76 (2.4GHz).
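The model can be written directly in code. The Dhrystone scores below are illustrative placeholders chosen to reproduce the 0.3 ratio from the figure, not measured values:

```python
def core_capacity(r_cur: float, f_becur: float, r_ref: float, f_beref: float) -> float:
    """C = (Rcur/Rref) * (Fberef/Fbecur): performance relative to the
    reference core, normalized to the same frequency."""
    return (r_cur / r_ref) * (f_beref / f_becur)

def current_capacity(c: float, f_cur: float, f_beref: float) -> float:
    """Ccur = C * (Fcur/Fberef): capacity at the core's current frequency."""
    return c * (f_cur / f_beref)

# Illustrative numbers: the reference A76 scores 100 at 2.4GHz, and an
# A53 scores 15 when benchmarked at 1.2GHz.
c_a53 = core_capacity(15.0, 1.2, 100.0, 2.4)   # ~0.3
ccur_a53 = current_capacity(c_a53, 1.2, 2.4)   # ~0.15
```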
A compute capacity lower than 1.0 implies that the measured core has lower performance than the reference core at the same operating frequency. This is expected to be a lower bound, assuming the benchmark takes advantage of all capabilities of the reference core. If resource allocation is determined on the reference core and scaled for use on the measured core, a capacity lower than 1.0 makes the reservation pessimistic: more resources are reserved than are probably needed. Conversely, a compute capacity higher than 1.0 implies that the measured core has higher performance than the reference core at the same operating frequency. This is expected to be an upper bound, assuming the benchmark takes advantage of all capabilities of both the reference and the measured core. In this case a scaled allocation is optimistic: fewer resources are reserved than are probably needed. The latter case is undesirable, since it can prevent the system from operating according to expected behavior, such as performance or latency targets.
Resource allocation for containers is used by orchestrators at admission control: new containers are allocated to a node only if enough resources are available. The current CPU resource uses core count as the metric; a node that has 4 cores available will list 4 cores as CPU resources, without taking the core type into consideration. The model described in the previous section (Compute Capacity Model) can be used to scale the CPU resources to better describe the expected compute capacity of the node: it can account for core types and core frequency, where the previous model only accounts for the number of cores.
Kubernetes is the most prominent open-source container orchestration software, designed to provide users with cluster-scale automation of software deployment, scaling, and management. The unit of software managed by Kubernetes is the container, in the form of a “pod” which describes one or more containers.
At its core, Kubernetes consists of a collection of tools and databases that run in the cluster and form the backbone infrastructure, and an endpoint agent called a “kubelet” which runs on each node in the cluster. The “kubelet” communicates with the backbone infrastructure to determine which containers it should be managing, and to send back runtime information for orchestration users to observe. K3S, which was used for our testing, does not change this overall design.
As shown in 'Figure 2: Kubernetes software components', there is a boundary on the node between the “kubelet” and the underlying container software. In our testing, each node was using containerd. We decided that the best insertion point for our changes would be the “kubelet” itself, at the terminal edge of Kubernetes before the handover to containerd.
Kubernetes provides resource management for its “pods”, both for limiting and requesting compute and memory. However, Kubernetes does not rigorously define its compute resources (hereafter referred to as “CPU”).
Without any changes, a pod definition can contain a CPU limit and request in dimensionless units of CPU time (for example, “100m”, meaning 10% of the time on a single CPU). When such limits and requests on a pod are given to a “kubelet” to interpret, the “kubelet” will do so naively and take that percentage of CPU time from whatever computing hardware is present on the node.
This is insufficient for heterogeneous clusters. In a homogeneous cluster, CPU time is consistent between all nodes and a percentage of any given node’s CPU is equivalent to the same percentage elsewhere, but in a heterogeneous cluster this will result in either under- or over-provisioning if the node running the container is smaller or larger than expected.
The goal for proper heterogeneous cluster support is to first define these resources, and then to ensure that the Kubernetes software respects those definitions. Our working model for the changes is diagrammed in 'Figure 3: Heterogeneous cluster showing scaling factors', expecting a cluster of heterogeneous nodes of varying CPU strengths.
We select a baseline core as our “unit core”: specifically, a single Arm Cortex-A76 core running at 2.4GHz. All pod definitions can remain in the same format as before, giving a percentage of CPU time, but that is now a percentage of our specific “unit core” instead of a dimensionless core. Hereafter these units will be referred to as “DhryUnits”. Because this change is definitional, existing pod definitions will continue to function without error.
The first change in code is to introduce what is called a “scaling factor” (hereafter, SF). The section 'Kubernetes in detail' introduced our methodology, and the model described in the section Compute Capacity Model is now put to use. Given a list of Dhrystone measurements scaled to 2.4GHz for each make and model of core that we want to convert between, we calculate the difference between this node’s core and our “unit core”. This calculation can optimally be done once, after reading machine information from cadvisor, one of the backbone infrastructure tools included with Kubernetes and available for use at run time.
If we do not have measurements for a given node, a warning is sent, and we leave the SF at 1.0. In other words, we fall back to normal Kubernetes behavior whenever we encounter an uncharacterized node, operating no worse than what Kubernetes did before our changes.
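A sketch of the SF lookup, assuming SF is defined as the unit-core score divided by this core's frequency-scaled score (so weaker cores get SF > 1). The table values and names are illustrative, not measured:

```python
# Illustrative frequency-scaled (to 2.4GHz) Dhrystone scores per core model.
DHRYSTONE_AT_2_4GHZ = {
    "cortex-a76": 100.0,  # the "unit core"
    "cortex-a53": 30.0,
}
UNIT_CORE = "cortex-a76"

def scaling_factor(core_model: str) -> float:
    """SF = unit-core score / this core's score. Unknown cores fall back
    to SF = 1.0, i.e. stock Kubernetes behavior."""
    score = DHRYSTONE_AT_2_4GHZ.get(core_model)
    if score is None:
        print(f"warning: no measurement for {core_model}, using SF = 1.0")
        return 1.0
    return DHRYSTONE_AT_2_4GHZ[UNIT_CORE] / score
```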
With the SF in hand, the “kubelet” primarily needs to update two locations: capacity calculation, and container creation and resizing.
Capacity calculation is done once during “kubelet” initialization. Previously, the “kubelet” would report 1000m of CPU for each physical core on the machine; we scale this by the scaling factor so that it reports 1000m/SF DhryUnits for each physical core on the machine.
For container creation and resizing, the terminal end of Kubernetes creates a container resource configuration for each container it manages. Inside that configuration are the CpuQuota and CpuShares that the container should be assigned. Both of those numbers should be scaled by the SF.
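The two scaling sites can be sketched as follows (the function names are ours; the real change lives inside the “kubelet” code):

```python
def node_capacity_m(num_cores: int, sf: float) -> int:
    """Capacity reported at kubelet initialization: 1000m/SF DhryUnits
    per physical core."""
    return round(num_cores * 1000 / sf)

def scale_cpu_config(cpu_quota: int, cpu_shares: int, sf: float) -> tuple[int, int]:
    """Scale a container's CpuQuota and CpuShares by SF so a DhryUnits
    request maps to the right amount of real CPU time."""
    return round(cpu_quota * sf), round(cpu_shares * sf)
```

On a node whose cores are half as fast as the unit core (SF = 2.0), a 4-core machine reports 2000 DhryUnits, and a container receives twice the real quota and shares it would on the reference node.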
With the changes described, we then tested a single pod definition requiring 100m CPU across a cluster of differently sized nodes, shown in the following diagram:
The capacity was calculated correctly for each node (the amount shown is in DhryUnits), and the pod was given the appropriate amount of real CPU time on each node as well.
Since the server components already respect capacity for admission control, no changes are needed beyond these. In 'Figure 2: Kubernetes software components', we have split the operation such that, from Kubernetes’ point of view, all nodes and pods use DhryUnits, and containerd only ever uses real CPU shares. In this way, a cluster can be made of many kinds of nodes and there is one consistent way to correctly provision compute resources, avoiding under- and over-provisioning.
There is one other kind of heterogeneity which is worth exploring. We must also account for cases where a single node contains multiple different kinds of cores with different relative compute power and clock frequency.
This style of compute architecture often goes by the moniker “big.LITTLE” and has been used in several Arm multi-core chips. Even with the changes above to support heterogeneous clusters, such heterogeneous nodes would fail to correctly run pods.
The core change needed here is to further make SF not a singular property of a node, but to instead calculate the SF for each core on the node or each domain of cores. The same Dhrystone measurements are used again, and the calculation now emits a mapping of coreid to SF for that core on this node. For ease of use, a second mapping is created which is an ordered mapping of domain to sets of coreids (hereafter, “CpuSets”).
Capacity is changed from `1000m/SF * numcores` to `1000m/SF` per core, summed across all cores with their various SFs. This now accurately reflects the real capacity of all nodes, including heterogeneous ones.
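A sketch of the per-core bookkeeping, using illustrative SF values for a hypothetical big.LITTLE node (four big cores at SF = 1.0 and four little cores at SF = 4.0):

```python
def cpu_domains(core_sf: dict[int, float]) -> dict[float, list[int]]:
    """Group core ids into domains of identical SF (the "CpuSets")."""
    domains: dict[float, list[int]] = {}
    for core_id, sf in sorted(core_sf.items()):
        domains.setdefault(sf, []).append(core_id)
    return domains

def node_capacity_m(core_sf: dict[int, float]) -> int:
    """Sum 1000m/SF over every core so heterogeneous nodes report their
    true aggregate capacity in DhryUnits."""
    return round(sum(1000 / sf for sf in core_sf.values()))

# Hypothetical node: cores 0-3 are big (SF=1.0), cores 4-7 little (SF=4.0).
core_sf = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 4.0, 5: 4.0, 6: 4.0, 7: 4.0}
```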
Container creation and resizing is more complicated. First, we require new pod definition metadata. If unspecified, we default to using the 'little' cores on a node since they are at a finer granularity and more power efficient. Otherwise, we look for a pod metadata field to specify “big” or “little” and we select either the largest or smallest domain of cores to proceed.
Knowing the correct CpuSet, we scale the container’s compute requirements with the SF of one of the cores of that domain and make one more change to the container resource configuration. As we specify CpuQuota and CpuShares, we can also specify CpuSets to ensure that the container runs only on specific cores.
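The domain selection and pinning can be sketched as follows, with “big” meaning the domain with the highest per-core capacity and “little” the lowest. The function name and the quota unit convention (microseconds per 100ms period) are our assumptions:

```python
def build_cpu_config(request_m: int,
                     cpusets: dict[str, list[int]],
                     core_sf: dict[int, float],
                     preference: str = "little") -> dict[str, object]:
    """Select a core domain, scale the DhryUnits request by that
    domain's SF, and pin the container to the domain via a cpuset."""
    pick = max if preference == "big" else min
    # Rank domains by the per-core capacity (1000/SF) of their first core.
    domain = pick(cpusets, key=lambda d: 1000 / core_sf[cpusets[d][0]])
    sf = core_sf[cpusets[domain][0]]
    return {
        "CpuQuota": round(request_m * sf * 100),  # µs per 100ms period
        "CpuSet": ",".join(str(c) for c in cpusets[domain]),
    }
```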
There is one gap left before full heterogeneous node support: admission control still sees the node as having a singular capacity, but our pods can now be targeted to one or another domain of cores on the node. For completeness, heterogeneous nodes should have separate capacities, and admission control should be aware that a pod definition requiring one domain of cores cannot be satisfied by using the capacity of another domain. This work was not completed.
One of the goals of this proposal is enabling management of QoS when operating under non-ideal conditions; under ideal conditions the system can provide its stated compute capacity. Currently, orchestrators assume that the system is operating under ideal conditions, so any degradation will cause adverse effects, from unmet QoS metrics to application failures.
Edge systems are exposed to different environmental conditions than cloud computing. Edge nodes with similar hardware configurations may need to adapt to different environments, such as being powered by batteries or limited energy sources, or being exposed to high temperatures and reduced cooling capability.
The proposed solution divides the problem into three parts:
The current implementation is a Python program, external to the kubelet, that uses the CRI interface to containerd. The application runs the following algorithm every time a change is made to a core's operating frequency:
This algorithm preserves the capacity set at the pod level and scales shares appropriately. Shares are set according to current operating conditions, such as operating frequency and number of cores. Only when system capacity drops below what the priority workloads running on the system require are the priority pods affected. Priority is currently set as a label on the pod.
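The allocation policy implied by this description can be sketched as follows: priority pods keep their requested capacity whenever possible, and only when capacity drops below the total priority demand are they scaled back. This is a simplification of the actual program, which translates allocations into cgroup shares over CRI; the pod structure here is illustrative:

```python
def rebalance(pods: list[dict], capacity_m: float) -> dict[str, float]:
    """Distribute the node's current capacity (in millicores) across pods.
    Priority pods are served first at their full request; non-priority
    pods share whatever remains, proportional to their requests."""
    prio = [p for p in pods if p.get("priority")]
    rest = [p for p in pods if not p.get("priority")]
    prio_demand = sum(p["request"] for p in prio)
    alloc: dict[str, float] = {}
    if capacity_m < prio_demand:
        # Degraded: even priority pods scale down proportionally.
        for p in prio:
            alloc[p["name"]] = p["request"] * capacity_m / prio_demand
        for p in rest:
            alloc[p["name"]] = 0.0
        return alloc
    for p in prio:
        alloc[p["name"]] = float(p["request"])
    leftover = capacity_m - prio_demand
    rest_demand = sum(p["request"] for p in rest)
    for p in rest:
        share = leftover * p["request"] / rest_demand if rest_demand else 0.0
        alloc[p["name"]] = min(float(p["request"]), share)
    return alloc
```

For example, when frequency scaling halves a node's capacity, the priority pod keeps its full allocation while the non-priority pods absorb the loss.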
The following figure shows how the Dynamic capacity and priority manager interfaces with Kubernetes.
The current implementation does not start or stop pods and containers, but changes resource allocations towards preserving QoS. A few gaps are present in the current implementation:
The objective of this proposal is to describe a possible solution for addressing heterogeneity at the edge and, more importantly, to serve as a starting point from which better solutions can be discussed.
Even though this work is oriented towards edge computing, it can also be applied to the cloud as heterogeneity becomes more and more prevalent. Even in current cloud infrastructure, multiple generations of systems coexist, yet Kubernetes and container clusters are treated as homogeneous. Mobile is another area where this proposal can be applied, since container-based environments are being used to deploy applications.
Find out more about how Arm is transforming Edge Computing: Enabling Edge Computing