Everyone is into Kubernetes these days, and everyone wants their applications to run in a highly available manner, but what happens when the underlying infrastructure is failing? What if there’s a network outage, hardware failure or even software bugs?
While stateless workloads are usually immune to these kinds of failures (the cluster can always schedule more instances somewhere else), the real problem begins with stateful workloads.
When there’s a node failure we want to achieve two goals:
- Minimize application downtime — we want to reschedule affected applications on other healthy node(s) to minimize their downtime.
- Recover compute capacity — the failing node is malfunctioning, thus its compute power can’t be used to run any workloads. We want to recover it as fast as possible to recover compute capacity for the cluster.
Stateful workloads with run-once semantics can’t be immediately rescheduled on other nodes. The cluster doesn’t really know if that workload is still running on the unhealthy node (perhaps it’s just a network outage), and scheduling the workload into other nodes introduces a risk that two instances would run in parallel, which would violate the at-most-one semantics and might result in data corruption or diverging datasets.
Kubernetes won’t run that workload anywhere else until it can make sure the workload is no longer running on the unhealthy node. However, if we can’t communicate with the node, there is no way to know that.
We need to bring the node into a “safe” state (this process is called fencing) and one of the most common forms of fencing is to power the node off using IPMI. Once the host running the node is powered off we can definitely assume that there are no running workloads there. To signal that to the cluster, we delete the node.
To power if off we’ve got two options:
- Using the server's management interface, but this requires BMC credentials, and obviously network connectivity to the management interface.
- Causing the unhealthy node to reboot itself in case of a failure.
The first one was implemented as an opt-in feature in OpenShift and is out of scope for this post.
Taking the poison pill at the right time
To achieve this, each node will run an agent that will constantly monitor its health status.
An external controller, machine-healthcheck-controller (MHC), determines if nodes are unhealthy, after which it applies an annotation to the node/machine object.
The poison-pill agent periodically looks for the unhealthy annotation. Once detected, the node takes the poison pill and reboots itself.
But there are some challenges:
- What happens if the unhealthy node can’t reach the api-server, and can’t see if the unhealthy annotation is there?
- Deleting the node only when no workload is running on it
- Resource starvation, or bugs might prevent the poison pill agent to detect its health status
Peers to the rescue
If the unhealthy node can’t reach the api-server, we could assume that it’s unhealthy, but this will result in a false positive and cause a reboot storm in the cluster, if the failure exists only in the api-server nodes.
To distinguish between a real node failure and an api-server failure, the poison-pill agent will contact its peers. If all peers can’t access the api-server, it is safe to assume there is an api-server failure and doesn’t reboot itself.
If other peers can reach the API server and report that the node is unhealthy, or if none of the peers respond, the node will reboot itself.
Respecting Singleton-like semantics
Some Kubernetes objects such as StatefulSets and exclusive Volume mounts are designed around the idea that they are only ever active in at most one location. In order to respect these semantics, we must only delete the node when no workload is running there.
The existing workloads will stop running when the system is rebooting, so we need to remove the node object only after reboot has completed the ‘off’ phase .
How can other peers tell if the unhealthy node has been rebooted if it’s unresponsive?
The premise of kubernetes poison pill fencing is that within a known and finite period of time from having declared the node as unhealthy, either the node (directly or a peer) will see the unhealthy annotation in etcd, or it will terminate.
Having calculated and configured the period (based on heartbeat and retry intervals), peers that have waited for that amount of time can safely assume that either the node was already dead, or is so now.
To minimise this timeout, as well as make sure that it is bounded, we make use of a watchdog device to panic the node rather than perform a normal shutdown or reboot. This avoids the need to factor in every possible service on the machine and their worst case shutdown times, as well as protecting against bugs, hung services, and resource starvation.
Since the timeout is an upper bound for the reboot to commence, the node could come up by then and take new workloads before the Node object is deleted, which would result in us actively creating the exact conditions we’re trying to prevent - starting multiple copies of particular workloads.
This can happen if the reboot fixed the failure, or if the criteria for “healthy node” is different from MHC and the cluster scheduler perspective. (e.g. disk pressure could be defined as unhealthy in MHC but the scheduler can still assign workloads).
One possibility is for the watchdog to trigger a shutdown rather than a reboot, but then we lose compute capacity and require manual admin intervention to power on the host.
Instead, the poison pill agents (on all nodes) will try to mark the unhealthy node as unschedulable. This is a safety mechanism such that even if the node reboots, it won’t be assigned to run any workloads.
The poison pill agents will only delete the node if it’s marked as unschedulable.
For that matter, we need to handle 3 cases.
- Unhealthy node is able reach the api-server - the node can just verify the unschedulable taint is there before reboot
- Unhealthy node can’t reach the api-server but can contact other peers which will instruct a reboot only if the unschedulable taint exists
- Unhealthy node can’t reach the api-server, nor any of its peers - we assume that the healthy nodes will be able to add the unschedulable taint before the unhealthy machine contacts all of its peers and reboots.
After node deletion, the poison pill agents will re-create the node object without the unschedulable taint, allowing the scheduler to re-assign workloads to that node.
Since we’re relying on timeouts to assume that the machine has been rebooted, we need to make sure that the reboot takes place even if there’s resource starvation. The poison pill is using a watchdog timer, which should be fed once in a while. If there’s a resource starvation, the watchdog will be starved as well, and this will trigger a reboot.
Minimizing downtime is an important goal for everyone. Distributed computing and stateful applications introduce some challenges and difficulties. The poison pill controller is trying to solve some of these.
The implementation is at a PoC level and the source code is available on the poison pill repo.