SMARTER: An Approach to Edge Compute Observability and Performance Monitoring

Josh Minor
April 16, 2020
25 minute read time.

The decreasing cost and power consumption of intelligent, interconnected, and interactive devices at the edge of the Internet are creating massive opportunities to instrument our cities, factories, farms, and environment to improve efficiency, safety, and productivity. Developing, debugging, deploying, and securing software for the estimated trillion connected devices presents substantial challenges. As part of the SMARTER (Secure Municipal, Agricultural, Rural, and Telco Edge Research) project, Arm has been exploring the use of cloud-native technology and methodologies in edge environments to evaluate their effectiveness at addressing these problems at scale. This blog is part of a series; read the previous blog to find out more about SMARTER.

Read the previous SMARTER blogs 

Motivation

In the past few years, decentralizing applications from data centers to machines closer to where valuable data is collected has become a catalyst for rethinking the way we manage application life cycles. One logical approach to enabling a seamless transition from cloud to edge is to take existing application orchestration models popular in the cloud, and tweak them such that they work transparently for the edge. Two of the biggest players in the cloud space, Docker and Kubernetes, make the development and deployment of highly distributed applications much less of a headache for the common developer. Given the success of these tools in the cloud, there now exists a push to use this same model to also manage applications running at the edge, making for an even more challenging distributed system problem.

In the cloud space, it can seem like everyone and their brother has a solution for APM (Application Performance Monitoring) and observability. For some perspective on the number of existing solutions, I found an interesting site, OpenAPM, which gives a nice overview of popular open-source tools used for APM within the community. Given the saturation of this market, I set out to select a stack which maps well to the edge, where we make the following assumptions about the machines:

  • Nodes in our edge cluster see abrupt power outages and connectivity loss more frequently than their data center counterparts.
  • Edge nodes may live on networks behind Firewalls/NATs, meaning nodes in the cloud will not necessarily be able to initiate connections to peers at the edge.
  • Edge nodes are much more diverse than nodes in the cloud, ranging from Raspberry Pis up to Jetson AGX Xaviers at the higher end of the compute spectrum. I assume they are server-class devices running a flavor of the Linux operating system.
  • Available bandwidth to the cloud may be limited, so we must be careful with how much of it is allocated to APM metrics and trace data versus the "real" data our applications want to generate and ship to the cloud.

This post describes how to set up your own edge computing playground with APM and observability built in from the ground up. Currently, no single cluster environment manages both the cloud and edge portions of your system transparently, so for now we manage our cloud and edge using two logically independent control planes. We perform the following to set up our infrastructure:

  • Create our own bare-metal Kubernetes cluster using K3s, manage it with Helm, and bring up all of our data aggregation and storage mechanisms in the cluster. K3s is a project out of Rancher Labs seeking to deliver a lighter-weight Kubernetes package.
  • Create a separate k3s cluster for our edge and bring up all of our data collection daemons on each of the nodes in this cluster.

System Architecture Overview

Cloud Setup

Before you begin setting up the infrastructure in this guide, go ahead and clone the repository by running the following:

git clone https://gitlab.com/arm-research/smarter/edge-observability-apm

The instructions in this guide assume you issue commands from the base of the cloned repository.

Create Kubernetes cluster

Create a bare-metal, single-node Kubernetes 1.17 cluster on x86 using the k3s installation convenience script, with Flannel as the cluster CNI (Container Networking Interface) and RBAC enabled. Setting up your own bare-metal cluster avoids having to spend money on managed Kubernetes services like Amazon EKS or Google Kubernetes Engine.

To install k3s simply run:

export THIS_HOST_IP=$(hostname -I | awk '{print $1;}')
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.17.2+k3s1 sh -s - --write-kubeconfig-mode 664 --bind-address $THIS_HOST_IP --advertise-address $THIS_HOST_IP --no-deploy servicelb --no-deploy traefik

Here I ask that you set up a dev machine holding the kubeconfig file generated during cluster bring-up, so that you can run kubectl commands. To do a quick check that you have done everything properly, run kubectl get all and make sure that you get a valid response back from your new cluster's API server. For k3s, you can fetch your cluster kubeconfig from /etc/rancher/k3s/k3s.yaml on your master.
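For example, copying the kubeconfig down to a dev machine might look like the following (the user, host, and destination path are placeholders for your own setup):

scp <master-user>@<master-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/cloud-config
# If the server field still points at 127.0.0.1, replace it with your master's IP
export KUBECONFIG=~/.kube/cloud-config
kubectl get all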

Install Helm 2 on your preferred Linux dev machine

The "package manager" for Kubernetes, Helm makes the deployment of complex applications composed of many Kubernetes objects easier. For many of the APM and observability tools used in this guide, I opted to use Helm 2. To install Helm 2, you should follow the following instructions on your Linux dev machine:

curl -fsSL https://raw.githubusercontent.com/helm/helm/master/scripts/get -o get-helm2.sh
sudo bash get-helm2.sh
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
helm init --service-account tiller --wait

Setup a Load Balancer and Ingress Controller for your cluster

Our load balancer sits in front of our cluster and balances incoming traffic across our internal cluster services. If you use a managed Kubernetes service like Amazon EKS, load balancing and the assignment of static IPs are generally handled for you; in the case of our bare-metal cluster, however, we must have control over the network the nodes live in so that we can reserve a range of IPs for our load balancer. The tool we use to set up load balancing is MetalLB.

To install the load-balancer, run the following from your dev machine:

kubectl apply -f https://raw.githubusercontent.com/google/metallb/v0.8.3/manifests/metallb.yaml

Now you must create a config map for MetalLB to give it control over a specific set of internal IPs. Export $HOST_IP in your dev machine environment and apply the config to your cluster by running:

export HOST_IP=<YOUR_MASTER_IP>
envsubst '${HOST_IP}' < cloud/metal-lb/metalconfig.yaml > cloud/metal-lb/metalconfig-custom.yaml
kubectl apply -f cloud/metal-lb/metalconfig-custom.yaml

With MetalLB installed, we now need to configure a reverse proxy server responsible for handling the ingress traffic into our cluster, whether from our edge devices or from any authorized user who wishes to view collected data through a web UI, for instance. The responsibility of this component is to configure our HTTP load balancer (MetalLB) according to the Ingress API objects created by users of the cluster. To do this we install nginx-ingress.

From the root of the repository run:

helm repo update
helm install stable/nginx-ingress --name my-nginx -f cloud/nginx-ingress/nginx-values.yaml --set rbac.create=true

Install Cert Manager

To encourage best practices when working with exposed cluster ingress endpoints, I have opted to include cert-manager in this example project. cert-manager makes TLS security very easy through the custom resource definitions it provides for certificate generation. We generate self-signed certificates in this tutorial to secure our endpoints. To install it into your cluster run:

kubectl apply --validate=false -f https://raw.githubusercontent.com/jetstack/cert-manager/v0.13.0/deploy/manifests/00-crds.yaml
kubectl create namespace cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
      --wait \
      --timeout 500 \
      -f cloud/cert-manager/cert-manager-values-local.yaml \
      --name cert-manager \
      --namespace cert-manager \
      --version v0.13.0 \
      jetstack/cert-manager
kubectl apply -f cloud/cert-manager/selfsigned-issuer.yaml

Export environment variables for installation

To configure Helm chart values for your environment before deploying our apps, export the following variable on your dev machine:

export SMARTER_DATA_DOMAIN=<YOUR_MASTER_IP(dash separated)>.nip.io
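For example, if your cloud master's IP were 18.34.90.214 (the illustrative address used again later for the Jaeger UI), you would run:

export SMARTER_DATA_DOMAIN=18-34-90-214.nip.io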

Install Elasticsearch and Kibana

The Elastic Stack (ELK) is a very popular set of tools within the APM and observability space, providing a data ingestion and visualization solution for the cloud. For the edge, we send our node and application performance metrics up to our cluster, where they are stored in our distributed Elasticsearch instance and visualized using Kibana.

Make sure you have Docker installed on your dev machine, then run the following to create our Elasticsearch and Kibana credentials:

docker rm -f elastic-helm-charts-certs || true
rm -f elastic-certificates.p12 elastic-certificate.pem elastic-stack-ca.p12 || true
password=$([ ! -z "$ELASTIC_PASSWORD" ] && echo $ELASTIC_PASSWORD || echo $(docker run --rm docker.elastic.co/elasticsearch/elasticsearch:7.6.1 /bin/sh -c "< /dev/urandom tr -cd '[:alnum:]' | head -c20")) && \
docker run --name elastic-helm-charts-certs -i -w /app \
  docker.elastic.co/elasticsearch/elasticsearch:7.6.1 \
  /bin/sh -c " \
    elasticsearch-certutil ca --out /app/elastic-stack-ca.p12 --pass '' && \
    elasticsearch-certutil cert --name security-master --dns security-master --ca /app/elastic-stack-ca.p12 --pass '' --ca-pass '' --out /app/elastic-certificates.p12" && \
docker cp elastic-helm-charts-certs:/app/elastic-certificates.p12 ./ && \
docker rm -f elastic-helm-charts-certs && \
openssl pkcs12 -nodes -passin pass:'' -in elastic-certificates.p12 -out elastic-certificate.pem && \
kubectl create secret generic elastic-certificates --from-file=elastic-certificates.p12 && \
kubectl create secret generic elastic-certificate-pem --from-file=elastic-certificate.pem && \
kubectl create secret generic elastic-credentials  --from-literal=password=$password --from-literal=username=elastic && \
rm -f elastic-certificates.p12 elastic-certificate.pem elastic-stack-ca.p12

encryptionkey=$(echo $(docker run --rm docker.elastic.co/elasticsearch/elasticsearch:7.6.1 /bin/sh -c "< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c50"))
kubectl create secret generic kibana --from-literal=encryptionkey=$encryptionkey

To install Elasticsearch and Kibana, run the following from the root of the repository on your dev machine:

helm repo add elastic https://helm.elastic.co
helm repo update
envsubst '${SMARTER_DATA_DOMAIN}' < cloud/elasticsearch/elasticsearch-values.yaml > cloud/elasticsearch/elasticsearch-custom.yaml
helm install --wait --timeout 500 -f cloud/elasticsearch/elasticsearch-custom.yaml --name elasticsearch elastic/elasticsearch
envsubst '${SMARTER_DATA_DOMAIN}' < cloud/kibana/kibana-values.yaml > cloud/kibana/kibana-custom.yaml
helm install -f cloud/kibana/kibana-custom.yaml --name kibana elastic/kibana

Install InfluxDB

InfluxDB is a fantastic database for efficiently storing and querying time-series data at scale. Hence it is perfect for storing edge node performance data in our system.

Install it by running the following from the root of the repository on your dev machine:

helm repo add influxdata https://helm.influxdata.com/
helm repo update
helm install -f cloud/influxdb/influxdb-values.yaml --name influxdb influxdata/influxdb

Install Grafana

Grafana is another visualization tool widely used among the APM community to view and analyze time-series data stored in the cloud. For our use case, we will be using Grafana to view the node and application metrics data stored in our InfluxDB instance installed in the previous step.

Install it by running the following from the root of the repository on your dev machine:

envsubst '${SMARTER_DATA_DOMAIN}' < cloud/grafana/grafana-values.yaml > cloud/grafana/grafana-custom.yaml
helm install --name grafana -f cloud/grafana/grafana-custom.yaml stable/grafana

Install Netdata cloud components

To track the health of our nodes running in both the cloud and at the edge, we install Netdata, a massively popular open-source monitoring agent. Even more popular than Netdata, however, is Prometheus. While Prometheus serves a very similar purpose, it employs a pull model for metrics from all its nodes, meaning the master process running in the cloud would try to initiate a request for new metrics data at the edge, where it may be blocked by firewalls/NATs. Netdata, however, employs a push model for metrics, meaning the nodes produce performance data and attempt to send it to a master living in the cloud, making it a better choice for the edge.

The Netdata master process aggregates all the information it receives and forwards it to the InfluxDB instance installed previously for long-term storage. The Netdata UI provided by the master only displays about an hour of real-time data from the nodes, so if you would like to keep historical performance data for later analysis, you must write it out to permanent storage. We can then leverage Grafana to view and analyze this historical data.

To install the cloud components of Netdata into your cluster, run the following from the root of the repository on your dev machine:

envsubst '${SMARTER_DATA_DOMAIN}' < cloud/netdata/netdata-values.yaml > cloud/netdata/netdata-custom.yaml
git clone https://github.com/netdata/helmchart.git ~/netdata
helm install --name netdata -f cloud/netdata/netdata-custom.yaml ~/netdata/

Install Jaeger

Monitoring node health by viewing high-level performance characteristics of an application or the node itself is only one piece of the puzzle. Say we have identified that one of our applications is stalling on disk I/O unexpectedly on one of the edge nodes we manage. While knowing the source of the performance bottleneck is nice, we need to delve deeper into the application itself to find which code paths are creating the disk stalls. Further, we may not even be able to remote into the node to dig around, given the firewall/NAT configurations at the time. Jaeger provides a minimally intrusive application tracing framework which conforms to the OpenTracing standard. With Jaeger, your application collects trace data at the function level as it runs; each record, called a span, indicates what arguments were passed to the function as well as its execution time. From this granular span data, we can bundle correlated spans to construct execution traces, not only on a per-service basis, but also across service boundaries. Our cloud allows us to collect and store trace data for each of our nodes, and to view and analyze it using web UIs.

To install the cloud components of Jaeger, run the following from the root of the repository on your dev machine:

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
envsubst '${SMARTER_DATA_DOMAIN}' < cloud/jaeger-cloud/jaeger-values.yaml > cloud/jaeger-cloud/jaeger-values-custom.yaml
helm install --name jaeger -f cloud/jaeger-cloud/jaeger-values-custom.yaml jaegertracing/jaeger

Edge Setup

Create k3s Cluster

At this point in the tutorial, we have set up a bare-metal k3s cluster with data ingestion pipelines and web UIs which eagerly await interesting APM data to be produced by our edge nodes. To manage these nodes, we opt to use k3s once more. The beauty of k3s in many ways is that Arm devices are first-class citizens. Many popular cloud-native open-source tools today focus on x86, creating headaches for developers who would like to use these tools on their own Arm clusters.

To install k3s, provision an x86 or Arm node to serve as your master. You do not even have to worry about installing Docker, as the k3s master runs as a single binary directly against the host. To install and run the k3s master as a systemd service, run:

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server" sh -s - --disable-agent --write-kubeconfig-mode 664 --no-deploy servicelb --no-deploy traefik --no-flannel --no-deploy coredns

Note: if you are running your master on a machine in the public cloud (i.e. an EC2 instance), pass the flag --advertise-address <PUBLIC_IP> to the above command.

For your edge nodes, let's assume they are all running 64-bit Arm Linux. If you have a Raspberry Pi, you can try running Ubuntu Server, which has images available here. Ensure that your edge node has Docker installed before continuing on to the next steps.
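If Docker is not already present, one common way to install it on Ubuntu is Docker's convenience script, sketched below; review the script before piping it to a shell, and note that the exact steps for your distribution may differ:

curl -fsSL https://get.docker.com | sh
# Optionally allow the current user to run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER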

Before we install the k3s agent on our node, we will install a CNI built for edge computing use cases, rather than relying on Flannel, which k3s attempts to install by default. For information on what makes this CNI tailored for edge computing, refer to the previous blog post. To install the CNI on your node run:

git clone https://gitlab.com/arm-research/smarter/smarter-cni.git
cd smarter-cni
sudo ./install.sh

With our CNI installed, we are now ready to install the k3s agent. In the following command, K3S_URL is the address where your master node can be reached, and K3S_TOKEN can be obtained by running sudo cat /var/lib/rancher/k3s/server/node-token on your edge master node.

Now, on each edge node which you wish to include in your cluster, run (filling in the variables appropriately):

curl -sfL https://get.k3s.io | K3S_URL=https://<myserver>:6443 K3S_TOKEN=XXX sh -s - --docker --no-flannel

Fetch the kubeconfig for your k3s edge cluster by copying the file /etc/rancher/k3s/k3s.yaml from your edge master machine to your dev machine. You may have to open the file and replace 127.0.0.1 in the server spec with the hostname of your server.
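For example (the user, host, and destination path are placeholders):

scp <edge-master-user>@<edge-master-ip>:/etc/rancher/k3s/k3s.yaml ~/.kube/k3s-edge.yaml
# Point the server field at the edge master instead of the loopback address
sed -i 's/127.0.0.1/<edge-master-hostname-or-ip>/' ~/.kube/k3s-edge.yaml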

Now that we are using two clusters, we need a way to manage which cluster we target when running kubectl commands. Fortunately, Kubernetes provides a simple way of doing this. On your command line, set the variable KUBECONFIG as follows:

export KUBECONFIG=<path to cloud kubeconfig>:<path to k3s kubeconfig>

You can open up each of these kubeconfigs respectively and modify the fields as per the following markup:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: <redacted>
    server: https://<your-master-ip>:2520
  name: k3s-edge
contexts:
- context:
    cluster: k3s-edge
    user: default
  name: k3s-edge
current-context: k3s-edge
kind: Config
preferences: {}
users:
- name: default
  user:
    password: <redacted>
    username: admin

You may do the same for your cloud cluster's kubeconfig, changing k3s-edge to cloud in all fields besides current-context. This field determines what cluster your kubectl commands will target. Now to switch between clusters, you can simply run kubectl config use-context <k3s-edge or cloud>.

To view your current configuration run kubectl config view; you should see all the information from each of your two kubeconfigs displayed, with current-context set to the cluster you are currently targeting. For more information on multi-cluster configuration, you can read here.
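As a quick sanity check, you can switch between the two contexts defined above and confirm that each points at the cluster you expect:

kubectl config use-context cloud
kubectl get nodes   # should list your cloud master
kubectl config use-context k3s-edge
kubectl get nodes   # should list your edge nodes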

Register your cloud elasticsearch credentials in your edge cluster by running:

kubectl create ns observability
kubectl create secret generic elastic-credentials  --namespace=observability --from-literal=password=<YOUR ELASTIC PASSWORD> --from-literal=username=elastic

Recall that you can obtain your elastic credentials by running the following command against your cloud k3s instance:

kubectl get secrets/elastic-credentials --template={{.data.password}} | base64 -d
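Putting these steps together, a minimal scripted sketch (using the context names from the kubeconfigs above, and assuming the observability namespace was created as shown earlier) might look like:

# Fetch the password from the cloud cluster...
kubectl config use-context cloud
ELASTIC_PASSWORD=$(kubectl get secrets/elastic-credentials --template={{.data.password}} | base64 -d)
# ...then register it as a secret in the edge cluster
kubectl config use-context k3s-edge
kubectl create secret generic elastic-credentials --namespace=observability --from-literal=password=$ELASTIC_PASSWORD --from-literal=username=elastic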

Run Netdata Collector on each Edge Node

Now that we have our k3s cluster up and running, let's deploy the edge side of our APM/observability infrastructure. At the moment, you have an instance of the Netdata master running in the cloud, awaiting information to be streamed up from the edge. To run a single copy of the Netdata collector on each of our nodes, we use a Kubernetes DaemonSet.

The Netdata collector can be configured to act as a headless collector of data, forwarding all metrics directly to the master living in our cloud via a TCP connection. For the edge use case, this is exactly what we want. In my own rough inspection, I found that the headless collector running on a Raspberry Pi 3B+ consumed about 2% CPU, 29 MB RSS, and 700 Kb/s of network bandwidth, all while the device was running close to 20 containers whose metrics were collected at 1 s intervals.

Ensure you have the variable SMARTER_DATA_DOMAIN set as before, and in addition export the following variables:

export SMARTER_EDGE_DOMAIN=<YOUR_EDGE_MASTER_IP>
export SMARTER_CLOUD_IP=<YOUR_CLOUD_MASTER_IP>

To deploy this app we apply the yaml to our edge cluster by doing the following (ensure your kubectl targets k3s):

envsubst < edge/netdata/netdata-configMap.yaml > edge/netdata/custom/netdata-configMap-custom.yaml
envsubst < edge/netdata/netdata-daemonSet.yaml > edge/netdata/custom/netdata-daemonSet-custom.yaml
kubectl apply -f edge/netdata/custom

This will create the Netdata collector DaemonSet as well as a ConfigMap, which is used to store key-value pairs that we can share with our entire cluster. There are a couple of things that must be done here to configure our headless collectors appropriately when running with Kubernetes. If you inspect the folder edge/netdata/custom, you will find a few interesting features:

  • To get information about the host node, we must run the Netdata container in privileged mode, and mount many directories from our host into the container such that it can read information about system state appropriately.
  • We use a ServiceAccount associated with a particular cluster role, such that our Netdata pod is authorized to query our k3s API server. This is done by specifying the serviceAccountName field under spec->template->spec; a rough sketch of this pattern appears after this list.
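The manifests in the repository define the exact role and rules, but the general ServiceAccount-plus-binding pattern is the same one used for Tiller earlier. A minimal, illustrative sketch follows; the account name and the chosen ClusterRole here are assumptions for illustration, not what the repository ships:

kubectl create serviceaccount netdata
kubectl create clusterrolebinding netdata-read --clusterrole=view --serviceaccount=default:netdata
# The DaemonSet then references the account via spec.template.spec.serviceAccountName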

If you open our ConfigMap for Netdata at edge/netdata/custom/netdata-configMap-custom.yaml, you will find the contents of the Netdata config file which is ultimately used by the Netdata collector when running inside its container. If you wish to reconfigure Netdata, simply modify this ConfigMap and reapply the file, then remove and reapply the DaemonSet for the changes to propagate through the cluster.
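A sketch of that reload cycle, using the generated file names from above:

kubectl apply -f edge/netdata/custom/netdata-configMap-custom.yaml
kubectl delete -f edge/netdata/custom/netdata-daemonSet-custom.yaml
kubectl apply -f edge/netdata/custom/netdata-daemonSet-custom.yaml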

Run Fluent Bit on each Edge Node

Fluent Bit is a lightweight stream processing engine developed by Treasure Data (now part of Arm), who also authored the popular Fluentd log collector/processor/aggregator. Fluent Bit is the lighter-weight sibling of Fluentd, making it a fantastic choice for running on resource-constrained devices at the edge. As an example application, we use Fluent Bit to collect and stream all the logs in our cluster back to our Elasticsearch instance in the cloud, where we can then use the Kibana UI to filter and analyze them.

For the same reasons as the Netdata DaemonSet, we also create a ServiceAccount for Fluent Bit, such that it can query our API server and append Kubernetes pod metadata to the logs it collects from the docker daemon. When you view the logs in Kibana, you can filter them based on their Kubernetes metadata making them very easy to digest.

To begin running Fluent Bit on each node run (ensure your kubectl targets k3s):

envsubst < edge/fluent-bit/fluent-bit-ds.yaml > edge/fluent-bit/custom/fluent-bit-ds-custom.yaml
kubectl apply -f edge/fluent-bit/custom

Run Jaeger Agent on each Edge Node

The Jaeger Agent is a headless collector that runs on each of our edge nodes and collects information about the spans and traces produced by each application instrumented with OpenTracing clients. As an application runs, the OpenTracing client bundles up span and trace data and sends it to our agent via UDP, from where it is forwarded to the Jaeger Collector in our cloud cluster. As of January 2020, Jaeger does not explicitly support Arm devices, so I have taken the time to port the Jaeger Agent to Arm64 and Arm. To see the Dockerfile recipes required to build the Jaeger Agent for Arm, you can reference this repository. You may use this repository to build the Jaeger Agent yourself, or you may use the images I have prebuilt for convenience. Before deploying the Jaeger Agent, export the env variable JAEGER_AGENT_IMAGE with the value registry.gitlab.com/arm-research/smarter/jaeger-agent-arm:latest to use my image, or with the tag of the image you built yourself.

To start the Jaeger Agent on each node with my image run (ensure your kubectl targets k3s):

export JAEGER_AGENT_IMAGE=registry.gitlab.com/arm-research/smarter/jaeger-agent-arm:latest
envsubst < edge/jaeger/jaeger-agent-ds.yaml > edge/jaeger/custom/jaeger-agent-ds-custom.yaml
envsubst < edge/jaeger/jaeger-agent-configMap.yaml > edge/jaeger/custom/jaeger-agent-configMap-custom.yaml
kubectl apply -f edge/jaeger/custom

Run an Example Workload

As a demonstrative example of the infrastructure we have set up in this tutorial, we will run a modified example application employing Jaeger tracing, from a tutorial originally found here. I have forked the tutorial from GitHub, made modifications, and built Docker images for each of the three sample services. The source for the apps and their corresponding Dockerfiles in the forked repository can be found here.

Before deploying this sample application, export the env variables CLIENT_IMAGE, FORMATTER_IMAGE, and PUBLISHER_IMAGE with the proper image names. You may build your own images by referencing the forked repository, or, to save time, use the images I have prebuilt for all three services: registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:client, registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:formatter, and registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:publisher respectively.

To deploy the application with my images set, simply apply the example DaemonSets I have created by running (ensure your kubectl targets k3s):

export CLIENT_IMAGE=registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:client
export FORMATTER_IMAGE=registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:formatter
export PUBLISHER_IMAGE=registry.gitlab.com/arm-research/smarter/edge-jaeger-tutorial:publisher
envsubst < edge/application/client/client-ds.yaml | kubectl apply -f -
envsubst < edge/application/formatter/formatter-ds.yaml | kubectl apply -f -
envsubst < edge/application/publisher/publisher-ds.yaml | kubectl apply -f -
kubectl label node <your node name> formatter=yes publisher=yes client=yes

Tracing

Tracing Architecture Overview

To trigger service events on your edge nodes, run the following command from a machine on the same network as your edge node:

curl "http://<your edge node ip>:8080/hello?helloTo=josh"

Running this command makes an HTTP request to the node, which ultimately responds with "Hello, josh!". On the backend, the request goes to a formatting microservice to create the string, and to a publisher service which logs the data returned to the user to stdout.

You may run that command targeting any one of your edge nodes as many times as you like, with any name set as the value to the "helloTo" key.

If you navigate to http://jaeger-query-<CLOUD_MASTER_IP(dash separated)>.nip.io you will be able to navigate through the generated trace data in an intuitive UI. For example, if your cloud IP is 18.34.90.214, your URL would be http://jaeger-query-18-34-90-214.nip.io. Here you will notice that our services are tagged with the node names prepended to the service name itself, so you can distinguish spans based on the node.

This tutorial will give you more context on what valuable application information you can extract using OpenTracing and Jaeger.

An example trace captured by the Jaeger tracing infrastructure

Logging

Logging Architecture Overview

We can also take a look at the logs being generated by each of our services by navigating to http://kibana-<CLOUD_MASTER_IP(dash separated)>.nip.io. To log in, the username is elastic, and the password is the value you queried from your cloud cluster at the beginning of the edge setup instructions. In Kibana, to configure your logging index, go to Management->Kibana->Index Patterns->Create Index Pattern, enter the index pattern logstash*, then select the time field from the next prompt and continue. In the Discover tab you will then be able to filter and view the logs in any manner you'd like. As a simple example, if you fetch the pod name for our publisher service by running kubectl get pods | grep publisher, you can filter the logs down to only those generated by this publisher pod. If you do so, you should see the "Hello, josh!" message along with a timestamp.

Filtered logs for a pod displayed by Kibana+Elasticsearch

Performance

Performance Architecture Overview

Using the Netdata dashboard, we can also view real-time performance data at the node and pod level by navigating to http://netdata-<CLOUD_MASTER_IP(dash separated)>.nip.io. Here we can see real-time metrics for all of our nodes, as well as any alarms that have been generated given a set of rules which we can configure. If you navigate to the 'nodes' tab, you can see the real-time status of all nodes in your cluster sorted by node health; if you click on an unhealthy node, you can go into its dashboard and perform further inspection.

An example of pod metrics displayed for the past hour in Netdata

Finally, for long-term metric storage, we can navigate to http://grafana-<CLOUD_MASTER_IP(dash separated)>.nip.io, where we can configure an example dashboard and view historical performance data from each of our nodes. To set up an example dashboard, perform the following steps:

  • Log in with the default credentials (username: admin, password: admin), then update the password upon login.
  • Press Add data source and select InfluxDB.
  • Enter the URL http://influxdb:8086 and the database opentsdb. Press Save & Test to save the new data source.
  • On the left menu, select Create (the plus symbol), then Import, and enter the dashboard ID 2701.
  • Under the data source, select InfluxDB and press Create.

This dashboard serves as a great entry-point for node metrics, but will require a few modifications to display information specific to your pods. It can be customized at a later time to fit your needs.

If you run the curl request repeatedly, you will be able to see the spikes in activity in the Netdata or Grafana dashboards.
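For instance, a simple loop like the following (the target IP is a placeholder) generates enough traffic to make the spikes visible:

# Fire a request every second; stop with Ctrl-C
while true; do
  curl -s "http://<your edge node ip>:8080/hello?helloTo=josh" > /dev/null
  sleep 1
done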

An example of node metrics queried for the last 7 days in Grafana

Final Thoughts

To summarize, we have brought up two independent clusters from scratch to manage the cloud and edge sides of our sample system, and deployed data aggregators in the cloud along with data collectors on each node at the edge. This setup gives us the ability to keep track of the three pillars we usually consider when building APM/observability systems:

  • Performance Monitoring
  • Log Streaming/Storage/Filtering
  • Application Tracing

Each one of the collectors running on our edge nodes is designed to minimize introduced overhead, such that more compute resources can be spent extracting value from the quintillion bytes of data produced every day.

All the tools in this tutorial were designed to be used in the cloud and do not map perfectly to the edge use case. Moving forward, there are a few areas where a system like this could be better tailored for edge computing:

  • Improve Fault Tolerance: Our current architecture is not built from the ground up for the edge. If a node loses connectivity back to our cloud cluster, we will lose all logs/metrics/trace data for the duration of the outage. As a solution, metrics could be written to disk for the duration of a connectivity outage and sent up once the connection is restored.
  • Centralize More Information in a Single Dashboard: Our current solution requires a bit of "dashboard hopping" to build a larger picture of the source of any issue you may face. It would be nice to have a single interface which could pull in and display all relevant aspects of an issue implicitly. This capability could theoretically be configured using ELK and could be explored at a later date.
  • FaaS Monitoring: If we could deploy a serverless architecture to the edge, it would be nice to collect performance metrics on a per-function invocation basis, and display this data to interested parties.
  • Intelligent Log Filtering: As of right now, Fluent Bit is configured to stream up all logs from all containers, which is not the most scalable solution when dealing with thousands of edge nodes in a system. Here we could investigate processing logs using pattern recognition such that only abnormal sequences of logs are sent up.

If you have any questions or comments, please feel free to contact me.

Contact Josh Minor 

This post is the second in a five-part series. Read the other parts of the series using the links below:

Part one: SMARTER: A smarter-cni for Kubernetes on the Edge
