Microservices architecture is rapidly gaining popularity, and OpenShift is the go-to platform to run it. Here are a few questions every engineer has in mind after writing a piece of software: Will it scale? Will the API be able to handle millions of requests? How many concurrent requests can the service handle? How does it perform under load? This is exactly where we, the Performance and Scalability team, spend our time. We try to hit all the problems that might occur in a production environment so that users and customers can run their applications reliably on top of OpenShift.

One of the many perks of being part of the Performance and Scalability team at Red Hat is that we get access to lots of hardware, both in data centers and public clouds, to scale and push the limits of OpenShift. We work on building tools, workloads, and automation to simulate real-world production environments and monitor the cluster's health.

More often than expected, we find ourselves rescuing clusters that are unresponsive and on fire, with alerts firing at an alarming rate. For starters, here is a glimpse of one such cluster that recently caught fire:

[Screenshot: alerts firing across the cluster]

One might ask: What was the state of the cluster, and what does it take to fix it? In this blog post, we will go over a couple of scenarios, including what happened, how we debugged it, and how we can avoid ending up in that situation. No one likes to experience downtime in their environment! Let's jump into the scenarios.

Scenario 1: Rogue DaemonSet Took Down a 2,000-Node Cluster

We deployed an application that runs as a DaemonSet, with one replica running on each of the nodes. The application was issuing heavy requests to the API server at regular intervals. We did not notice that there was a rogue process running on our cluster at smaller scale, as the cluster was stable with no alerts firing. However, once we started scaling the cluster from lower node counts toward 2,000 nodes, it took the cluster down in no time. The control plane, especially the API servers, was effectively denied service as in-flight requests peaked while the nodes and applications hammered it. API server replicas' memory and CPU usage spiked to 87.5 GB and 14 cores, respectively:

[Screenshots: API server memory and CPU usage spiking]

As you can see, it is easy for Kubernetes workloads to accidentally DoS the API servers, causing other important traffic, such as system controllers or leader elections, to fail intermittently. In the worst cases, a few broken nodes or controllers can push a busy cluster over the edge, turning a local problem into a control plane outage.

How can we ensure that system component requests do not get crowded out by application requests? Enter API priority and fairness! The API priority and fairness feature, available starting with Kubernetes 1.18, prioritizes system requests over application requests, leading to more stable clusters. We also need to ensure that there are alerts in place that monitor the API request rates (read and write), in-flight requests, and priority and fairness queue lengths to prevent overwhelming the API server. When using Prometheus as the monitoring solution, which is the default in OpenShift, apiserver_request_total, apiserver_current_inflight_requests, and apiserver_flowcontrol_request_queue_length_after_enqueue_bucket are the metrics to monitor.
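To make this concrete, here is a minimal sketch of how those metrics can be pulled from the cluster monitoring stack with a shell. It assumes an OpenShift 4.x cluster where the Thanos Querier route in the openshift-monitoring namespace is reachable and your logged-in user is allowed to query it; the endpoint discovery and authentication shown here are assumptions, so adapt them to your environment.

$ TOKEN=$(oc whoami -t)
$ THANOS=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')

# Overall API request rate by verb over the last 5 minutes
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=sum(rate(apiserver_request_total[5m])) by (verb)'

# Requests currently in flight, split into mutating and read-only
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=sum(apiserver_current_inflight_requests) by (request_kind)'

# p99 queue length per priority level (API priority and fairness)
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_queue_length_after_enqueue_bucket[5m])) by (le, priority_level))'

Alerting on sustained spikes in these values gives early warning before application traffic starts crowding out system components.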

Scenario 2: Too Many Objects in Etcd Led to Writes Being Blocked

We load the cluster with thousands of namespaces, deployments, pods, services, secrets, and other objects to determine where the cluster starts experiencing performance degradation and what its limits are. While running a workload that created 10k namespaces containing approximately 60k pods, nearly 190k secrets, 10k deployments, image streams, environment variables, and other objects, etcd suffered from poor performance as the keyspace grew excessively, exceeding the default space quota, which was 2 GB at that time. This put the cluster into a maintenance mode in which only reads and deletes are allowed, meaning new applications cannot be deployed.

[Screenshots: etcd database size exceeding the space quota]

How can we avoid hitting this problem? The default etcd backend space quota in OpenShift is now set to 7 GiB, which is sufficient for large and dense clusters. Periodic maintenance of etcd, including defragmentation, must be performed to free up space in the data store. It is highly recommended that you monitor the etcd metrics in Prometheus and defragment when required, before etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode accepting only key reads and deletes. Some of the key metrics to monitor are: etcd_server_quota_backend_bytes, which is the current quota limit; etcd_mvcc_db_total_size_in_use_in_bytes, which indicates the actual database usage after a history compaction; and etcd_debugging_mvcc_db_total_size_in_bytes, which shows the database size, including free space waiting for defragmentation.
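For reference, here is a rough sketch of those checks and the defragmentation step on an OpenShift 4.x cluster. The pod name is a placeholder, and the exact etcdctl flags and endpoints depend on how the etcd pods are set up in your version, so treat this as an outline rather than an exact recipe.

$ oc -n openshift-etcd get pods -l app=etcd
$ oc -n openshift-etcd rsh etcd-<control-plane-node>   # pod name is a placeholder

# Inside the pod: check database size, in-use size, and any active alarms
$ etcdctl endpoint status --cluster -w table
$ etcdctl alarm list

# Defragment the local member only; repeat on each member, one at a time,
# since defragmentation briefly blocks the member it runs against
$ etcdctl --endpoints=https://localhost:2379 --command-timeout=30s defrag

# Once space has been reclaimed, clear the NOSPACE alarm
$ etcdctl alarm disarm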

Scenario 3: Hosting Etcd on Slower Disks Created Havoc on the Cluster

With the increase in adoption of public clouds, it has never been easier to run applications on top of them. It is important to understand the infrastructure requirements and make sure the disk, network, CPU, and memory are appropriate for the various components of OpenShift/Kubernetes, especially etcd. Etcd persists proposals to disk, so its performance strongly depends on disk performance. Slow disks and disk activity from other processes can cause long fsync latencies, causing etcd to miss heartbeats and fail to commit new proposals to disk on time, which results in request timeouts and temporary leader loss. Here is a glimpse of the cluster when etcd was hosted on machines with higher-latency disks:

[Screenshot: cluster health with etcd hosted on a high-latency disk]

It is highly recommended to run etcd on machines backed by SSD/NVMe disks with low latency. Some of the key metrics to monitor on a deployed OpenShift cluster are the p99 of the etcd disk write-ahead log (WAL) fsync duration and the number of etcd leader changes. Use Prometheus to track these metrics: etcd_disk_wal_fsync_duration_seconds_bucket reports the etcd disk fsync duration, and etcd_server_leader_changes_seen_total reports the leader changes. To rule out a slow disk and confirm that the disk is reasonably fast, the 99th percentile of etcd_disk_wal_fsync_duration_seconds_bucket should be less than 10ms.
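Reusing the $TOKEN and $THANOS variables from the sketch in the first scenario, these are the two queries we typically graph and alert on; again, the endpoint and authentication details are assumptions about your monitoring setup, while the PromQL itself is standard.

# p99 of the WAL fsync duration; should stay below roughly 10ms on healthy disks
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))'

# etcd leader changes over the last hour; frequent changes point to disk or network trouble
$ curl -sk -H "Authorization: Bearer $TOKEN" "https://$THANOS/api/v1/query" \
      --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])'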

Fio, an I/O benchmarking tool, can be used to validate the hardware for etcd before or after creating the OpenShift cluster. Run fio and analyze the results. Assuming a container runtime such as podman or docker is installed on the machine under test and the path where etcd writes its data (/var/lib/etcd) exists, follow the procedure below:

Procedure

Run the following if using podman:

$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf

Alternatively, run the following if using docker:

$ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf

The output reports whether the disk is fast enough to host etcd by checking whether the 99th percentile of the fsync durations captured during the run is less than 10ms.
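If you prefer to run fio directly rather than through the container image, an invocation along the following lines exercises the same kind of small, fdatasync-ed sequential writes that etcd's write-ahead log performs. The size and block-size values below are illustrative assumptions, not necessarily what the etcd-perf image uses; make sure the target directory has enough free space.

$ sudo fio --name=etcd-wal-test --directory=/var/lib/etcd \
      --rw=write --ioengine=sync --fdatasync=1 --size=100m --bs=2300

In the output, look at the fsync/fdatasync latency percentiles that fio reports and compare the 99th percentile against the same 10ms threshold.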

What’s Next?

Refer to the Scalability and Performance Guide for more information to help with planning OpenShift environments to be more scalable and performant. We would love to hear your thoughts and stories from your experience with running OpenShift/Kubernetes at scale. Feel free to reach out to us on GitHub, https://github.com/cloud-bulldozer and https://github.com/openshift-scale, or the sig-scalability channel on Kubernetes Slack. Stay tuned for more stories around OpenShift performance, scalability, and reliability!


About the author

Naga Ravi Chaitanya Elluri leads the Chaos Engineering efforts at Red Hat with a focus on improving the resilience, performance, and scalability of Kubernetes and making sure the platform and the applications running on it perform well under turbulent conditions. His interest lies in the cloud and distributed computing space, and he has contributed to various open source projects.
