In the previous posts in this series (OpenShift Scale-CI, part 1: Evolution, OpenShift Scale-CI, part 2: Deep Dive, and OpenShift Scale-CI, part 3: OCP 4.1 and 4.2 Scale Run Highlights), we looked at the evolution of our automation and tooling, how they are used to measure performance and push the limits of OpenShift at large scale, and highlights from the recent scale test runs. In this post, we will look at the problems with that automation and tooling and our journey towards making them intelligent enough to overcome those problems.
Problem with Current Automation and Tooling
The automation and tooling are not intelligent enough to handle failures that lead to system degradation. When we use them to run tests, especially ones focusing on performance, scalability, and chaos against distributed systems like Kubernetes/OpenShift clusters, the system or application components might start degrading. (For example, nodes might fail, and so might system components such as the API server, etcd, and SDN.) When this happens, the CI pipeline/automation orchestrating the workloads/test cases cannot stop the execution and signal the cluster admin, and it cannot skip ahead to later test cases either. That ability is needed even when the cluster is still functioning, because the high availability and self-healing design can keep a degraded cluster running. This leads to:
- Inaccurate results.
- Loss of time for both clusters and engineers, which gets very expensive for large-scale clusters ranging from 250 to 2000 nodes.
- Clusters ending up in an unrecoverable state under the test load, because the automation did not stop when cluster health started to degrade.
Today, human monitoring is necessary to understand the situation, stop the test automation/CI pipeline, and fix the issue, which is not really feasible when running tests against multiple clusters with different parameters. The scale test runs put a lot of stress on the control plane, so our team took shifts monitoring the cluster to make sure all was in order. One of the goals behind building the CI pipeline/automation was to let engineers focus on writing new tools and new test cases while the pipeline keeps the hardware busy, churning out data from performance and scalability test runs. So, how can we solve this problem of automation and tooling not taking system degradation into account?
Cerberus to the Rescue
We built a tool called Cerberus to address the problem of automation and tooling not being able to react to system degradation. Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures and exposes a go or no-go signal. Workload generators like Ripsaw and Scale-CI, CI pipelines/automation like Scale-CI Pipeline, or any application in the cluster can consume that signal and act accordingly.
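To make the idea of consuming the signal concrete, here is a minimal sketch of a test driver that checks for go/no-go before each workload. It is not taken from Cerberus or Scale-CI: the function names are hypothetical, and it assumes the signal is served over HTTP as the plain text True or False at an address you pass in. Any error while fetching the signal is treated as no-go, so a monitor that has gone away also halts the run.

```python
import http.client
import urllib.request


def parse_signal(body: str) -> bool:
    """Interpret the signal body: only the literal text 'True' means go."""
    return body.strip() == "True"


def cluster_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Fetch the signal from `url` (assumed endpoint); any error counts as no-go."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_signal(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        return False


def run_workloads(workloads, is_healthy):
    """Run (name, callable) workloads in order, checking the signal before each.

    Stops at the first no-go and returns the names of the workloads that ran.
    """
    completed = []
    for name, workload in workloads:
        if not is_healthy():
            break
        workload()
        completed.append(name)
    return completed
```

A pipeline could call run_workloads(tests, lambda: cluster_is_healthy(signal_url)) and notify the cluster admin whenever the returned list is shorter than the list of tests it was given.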
What Components Can Cerberus Monitor?
It supports watching/monitoring:
- Node health
- System components and pods deployed in any namespace specified in the config
System components are watched by default, as they are critical for running operations on Kubernetes/OpenShift clusters. Cerberus can be used to monitor application pods as well.
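The way node and pod checks could roll up into a single go/no-go signal can be sketched as follows. This is an illustration, not Cerberus's actual code: the function and class names are hypothetical, the inputs are plain dictionaries rather than live API responses, and treating any pod phase other than Running or Succeeded as a failure is an assumption about what "healthy" means.

```python
from dataclasses import dataclass, field


@dataclass
class WatchResult:
    """Failures found in one pass over the cluster."""
    failed_nodes: list = field(default_factory=list)
    failed_pods: dict = field(default_factory=dict)  # namespace -> pod names


def nodes_not_ready(node_ready_status):
    """Return nodes whose Ready condition is not 'True'.

    `node_ready_status` maps node name -> 'True', 'False', or 'Unknown'.
    """
    return sorted(name for name, ready in node_ready_status.items() if ready != "True")


def pods_failing(pod_phases, watched_namespaces):
    """Return failing pods per watched namespace.

    `pod_phases` maps namespace -> {pod name: phase}. Namespaces that are
    not in `watched_namespaces` are ignored.
    """
    failed = {}
    for namespace in watched_namespaces:
        bad = sorted(
            pod for pod, phase in pod_phases.get(namespace, {}).items()
            if phase not in ("Running", "Succeeded")  # assumed definition of healthy
        )
        if bad:
            failed[namespace] = bad
    return failed


def go_signal(result: WatchResult) -> bool:
    """Go only when no node and no watched pod has failed."""
    return not result.failed_nodes and not result.failed_pods
```

Keeping the list of watched namespaces as an input mirrors the config-driven behavior described above: adding an application namespace to the list is enough to bring its pods into the signal.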
Daemon Mode vs Iterations
Cerberus can be run in two modes: daemon and iterations. In daemon mode, which is the default, it keeps monitoring the cluster until the user interrupts it. A tunable sets how long to wait before starting each watch/iteration. This is key, as setting it too low increases the number of requests to the API server and can overload it. It is important to tune it appropriately, especially on a large-scale cluster.
In iterations mode, it runs for the specified number of iterations and exits when done.
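The difference between the two modes comes down to how the watch loop terminates. The sketch below shows one way to structure such a loop; the function and its parameters (iterations, sleep_time) are hypothetical names chosen for illustration and are not a copy of Cerberus's implementation.

```python
import itertools
import time


def watch(check, report, iterations=None, sleep_time=0.0, sleep=time.sleep):
    """Run `check()` repeatedly and pass each result to `report`.

    iterations=None models daemon mode: loop until the user interrupts
    (KeyboardInterrupt). An integer models iterations mode: run that
    many passes and return. Returns the number of completed passes.
    """
    numbers = itertools.count(1) if iterations is None else range(1, iterations + 1)
    completed = 0
    try:
        for number in numbers:
            # Wait before each pass so the API server is not flooded with requests.
            sleep(sleep_time)
            report(number, check())
            completed = number
    except KeyboardInterrupt:
        pass  # the user stopped daemon mode
    return completed
```

Passing the sleep function in as a parameter is only there to make the loop easy to exercise without real delays. On a large cluster each pass issues many API requests, so the wait between passes is the main knob for limiting that load.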
Cerberus can be run using Python or as a container on a host with access to the Kubernetes/OpenShift cluster, as described in the Cerberus documentation.
Notifications on Failures
Automation and tools consuming the signal exposed by Cerberus and acting on it is just one side of the story. It is also important to notify the team or cluster admin so that someone can look at the cluster, analyze it, and fix the issue when something goes wrong. Cerberus supports Slack integration; when it is enabled, Cerberus posts a message on Slack with the cluster API, which identifies the cluster, and the failures it found.
Tagging everyone, or sending a message without tagging anyone in particular, can leave it unclear who takes charge of fixing the issue and restarting the workload. The Cerberus cop feature addresses this. A cop can be assigned for each day of the week in the config; Cerberus reads the assignment and tags only the person who holds the cop role in the channel.
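The cop lookup can be pictured as a small mapping from weekday to Slack user ID. The sketch below is illustrative only: the function names, the shape of the mapping, and the message wording are assumptions rather than Cerberus's real config or output. The <@ID> form is Slack's syntax for mentioning a user by ID.

```python
import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def cop_for(day: datetime.date, cops_by_weekday: dict):
    """Return the Slack user ID assigned as cop on `day`, or None if unassigned."""
    return cops_by_weekday.get(WEEKDAYS[day.weekday()])


def failure_message(cluster_api: str, failures: list, cop_id=None) -> str:
    """Build the notification text, tagging only the cop when one is assigned."""
    mention = f"<@{cop_id}> " if cop_id else ""
    return f"{mention}Cerberus found failures on {cluster_api}: {', '.join(failures)}"
```

When no cop is assigned for the day, the message is sent without a mention instead of tagging the whole channel.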
We have looked at how Cerberus aggregates failures and exposes a go/no-go signal, as well as how it notifies the user. Does it also generate a report or collect data on failures? Yes, it does. It generates a report with details about each watch per iteration. When inspect component mode is enabled in the config, it also provides more information on a failure by inspecting the failed component and collecting its logs, events, and so on.
There are a number of potential use cases. Here are a couple for which we are using Cerberus:
- We run tools that push the limits of Kubernetes/OpenShift to look at performance and scalability. There are a number of instances where system components or nodes start to degrade. This invalidates the results, yet the workload generator continues to push the cluster until it is unrecoverable. Here, the go/no-go signal exposed by Cerberus is used to stop the test run and alert us to the system degradation.
- Chaos experiments on a Kubernetes/OpenShift cluster can break components unrelated to the ones they target, and the experiment itself won't detect that kind of breakage. Here, the go/no-go signal is used to decide whether the cluster recovered from the failure injection and whether to continue with the next chaos scenario.
We, in the OpenShift group at Red Hat, are planning to enable Cerberus in the upcoming scale test runs on clusters ranging from 250 to 2000 nodes. It’s going to be interesting to see how well it scales.
Stay tuned for updates on more tooling and automation enhancements as well as highlights from OpenShift 4.x large-scale test runs. Any feedback is appreciated and, as always, feel free to create issues and enhancement requests on GitHub or reach out to us on the sig-scalability channel on Kubernetes Slack.