
Understanding Availability for OpenShift Dedicated

Availability and disaster avoidance are extremely important aspects of any application platform. OpenShift Dedicated provides many protections against failures at several levels, but customer-deployed applications must be appropriately configured for high availability. In addition, to account for possible cloud provider outages, other options are available, such as deploying a cluster across multiple availability zones or maintaining multiple clusters with failover mechanisms.

Potential points of failure in an OpenShift Dedicated cluster

OpenShift Container Platform provides many features and options for protecting your workloads against downtime, but applications must be architected appropriately to take advantage of these features. OpenShift Dedicated can further protect you against many common Kubernetes issues by adding Red Hat SRE support and the option to deploy a multi-zone cluster, but there are still a number of ways in which a container or the underlying infrastructure can fail. By understanding potential points of failure, you can understand the risks and appropriately architect both your applications and your clusters to be as resilient as necessary at each specific level.

An outage can occur at several different levels of infrastructure and cluster components.

Container or pod failure

By design, pods are meant to be ephemeral. Appropriately scaling services so that multiple instances of your application pods are running will protect against issues with any individual pod or container. OpenShift’s node scheduler can also make sure these workloads are distributed across different worker nodes to further improve resiliency.
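
For illustration, a Deployment similar to the following sketch runs multiple replicas and asks the scheduler to prefer placing them on different worker nodes. The names, labels, and image are placeholders for this example, not values taken from this document.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                               # hypothetical name
spec:
  replicas: 3                                     # multiple instances protect against a single pod or container failure
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: example-app
              topologyKey: kubernetes.io/hostname # prefer spreading replicas across worker nodes
      containers:
      - name: app
        image: quay.io/example/app:latest         # placeholder image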

When accounting for possible pod failures, it is also important to understand how storage is attached to your applications. Single persistent volumes attached to single pods will not be able to leverage the full benefits of pod scaling, whereas replicated databases, database services, or shared storage will. Please see the Storage failure section for more information.

Worker node failure

Worker nodes are the virtual machines that contain your application pods. By default, an OpenShift Dedicated cluster will have a minimum of four worker nodes for a single availability-zone cluster. In the event of a worker node failure, pods will be relocated to functioning worker nodes, as long as there is enough capacity, until any issue with the existing node is resolved or the node is replaced. Having more worker nodes provides more protection against single-node outages and ensures proper cluster capacity for rescheduled pods in the event of a node failure.
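
If your application cannot tolerate dropping below a certain number of replicas while a node is being drained or replaced, a PodDisruptionBudget similar to the following sketch can express that requirement. The name, label selector, and minimum count are placeholders for this example.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb        # hypothetical name
spec:
  minAvailable: 2              # keep at least two pods running during voluntary disruptions such as node drains
  selector:
    matchLabels:
      app: example-app         # must match the labels on your application pods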

When accounting for possible node failures, it is also important to understand how storage is affected. Please see the Storage failure section for more information.

Cluster failure

OpenShift Dedicated clusters have at least three master nodes and three infrastructure nodes that are preconfigured for high availability, either in a single zone or across multiple zones, depending on the type of cluster you have selected. This means that master and infrastructure nodes have the same resiliency as worker nodes, with the added benefit of being managed completely by Red Hat.

In the event of a complete master outage, the OpenShift APIs will not function, but existing worker node pods will continue to run unaffected. However, if there is also a pod or node outage at the same time, the masters will have to recover before new pods can be scheduled or new nodes can be added.

All services running on infrastructure nodes are configured by Red Hat to be highly available and are distributed across the infrastructure nodes. In the event of a complete outage of the infrastructure nodes, these services will be unavailable until the nodes have been recovered.

Zone failure

A zone failure from a public cloud provider will affect all virtual components, such as worker nodes, block or shared storage, and load balancers that are specific to a single availability zone. To protect against a zone failure, OpenShift Dedicated provides the option for clusters that are distributed across three availability zones, called multi-AZ clusters. Existing stateless workloads will be redistributed to unaffected zones in the event of an outage, as long as there is enough capacity.
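
On a multi-AZ cluster, you can also ask the scheduler to spread your replicas across zones so that a single zone outage affects only a fraction of them. The following sketch uses a topology spread constraint; the names, labels, and image are placeholders for this example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                # keep per-zone replica counts within one of each other
        topologyKey: topology.kubernetes.io/zone  # spread across availability zones
        whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but still schedule if a zone is unavailable
        labelSelector:
          matchLabels:
            app: example-app
      containers:
      - name: app
        image: quay.io/example/app:latest         # placeholder image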

Region failure

A full region failure from a public cloud provider is exceedingly rare, but would make all zones in that region unavailable. A multi-AZ cluster is not enough to protect against outages at the region level. The best way to protect against a full region failure is to run multiple clusters in different regions, connected by an external global load balancer. Such an architecture can protect against both zone and region failures, and can even span different cloud providers.

Storage failure

If you have deployed a stateful application, then storage is a critical component and must be accounted for when thinking about high availability. A single block storage PV is unable to withstand outages even at the pod level. The best ways to maintain availability of storage are to use replicated storage solutions, shared storage that is unaffected by outages, or a database service that is independent of the cluster.
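
As one illustration, shared storage that can be mounted by pods running on different nodes is typically requested with the ReadWriteMany access mode, as in the sketch below. The claim name, size, and storage class are placeholders for this example; the storage classes and access modes actually available depend on your cluster and cloud provider.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-shared-data                 # hypothetical name
spec:
  accessModes:
  - ReadWriteMany                           # shared storage that multiple pods on different nodes can mount
  resources:
    requests:
      storage: 10Gi                         # placeholder size
  storageClassName: example-shared-storage  # placeholder; use a storage class available in your cluster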