There is increasing pressure to deploy stateful applications in Red Hat OpenShift. These applications require a more sophisticated disaster recovery (DR) strategy than stateless applications: state must be taken into account in addition to traffic redirection.
Disaster recovery strategies become less generic and more application-specific as applications increase in complexity. That said, this document will attempt to illustrate high-level disaster recovery strategies that can be applied to common stateful applications.
It is also important to acknowledge that disaster recovery for the OpenShift platform itself is a separate topic from disaster recovery for the applications running within OpenShift.
For the purpose of this discussion, we can make the assumption that OpenShift is deployed in one of the topologies depicted below:
One could argue that “two independent clusters” and “three or more independent clusters” are the same architectural pattern. However, as we will see, when we have more than two datacenters at our disposal, additional options become available in terms of disaster recovery and so it behooves us to distinguish those two architectures.
Notably, the option of an OpenShift cluster stretching across two datacenters is being excluded from this discussion. With this type of OpenShift deployment, when a disaster strikes, there are some operations that are needed to recover OpenShift (exactly what depends on the details of the deployment). We don’t want to find ourselves in a situation where we need to recover OpenShift’s control plane and at the same time the applications running on it.
Disaster recovery strategies can be grouped into two main categories:
- Active/passive: with this approach, under normal circumstances, all the traffic goes to one datacenter. The second datacenter is in a standby mode in case of a disaster. If a disaster occurs, it is assumed that there will be downtime in order to perform tasks needed to recover the service in the other datacenter.
- Active/active: with this approach, load is spread across all available datacenters. If a datacenter is lost due to a disaster, there can be an expectation that there is no impact to the services.
For the remainder of the document, a sample application will be used to illustrate the alternative disaster recovery strategies. While this approach will make the discussion more opinionated, it will make for a more realistic and easy to understand use case.
A Sample Stateful Application
Our application can be depicted as the following:
A stateless front-end receives customer requests through a route and communicates to a stateful workload. In our example, this is a relational database. The database pod mounts a persistent volume for its data.
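The sample application can be sketched with a minimal pair of manifests: a Deployment for the stateless front-end and a StatefulSet for the database, whose pods each claim a persistent volume. All names and images below are illustrative assumptions, not part of a real product:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                # stateless front-end, exposed via a Route
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: quay.io/example/frontend:latest   # assumed image
        env:
        - name: DB_HOST
          value: orders-db      # talks to the stateful workload below
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders-db               # stateful relational database
spec:
  serviceName: orders-db
  replicas: 1
  selector:
    matchLabels:
      app: orders-db
  template:
    metadata:
      labels:
        app: orders-db
    spec:
      containers:
      - name: postgresql
        image: registry.redhat.io/rhel8/postgresql-13
        volumeMounts:
        - name: data
          mountPath: /var/lib/pgsql/data
  volumeClaimTemplates:         # each pod mounts its own persistent volume
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```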
Active / Passive Strategies
Active / passive strategies are suitable for those scenarios where only two datacenters are available. For more insight on why two-datacenter deployments only lend themselves to active / passive strategies, and to understand what compromises are possible to overcome this limitation, please see the following blog post.
In an active/passive scenario, the overall architecture is depicted below:
In the preceding diagram, a global load balancer (referred to in the diagram as a Global Traffic Manager [GTM]) directs traffic to one of the datacenters.
The application is configured to replicate its state to the passive site.
When a disaster strikes, the following needs to occur:
- The application is activated (either started, or configured to be the master) in the passive site.
- The global load balancer needs to be switched to the passive site.
These actions can be automated and performed in a relatively timely fashion. However, triggering that automation depends on declaring a disaster on the primary site, a task that typically involves human interaction. As a result, the application typically experiences some downtime.
Once the disaster has been resolved, we have to switch back to the primary site. The easiest way to accomplish this is usually to perform the failover procedure in the opposite direction. Again, this procedure can be automated, but it will likely imply some downtime.
State Synchronization Options
Previously, we have described a very generic process to design an active/passive disaster recovery scenario. The entire process hinges on the ability to replicate state from the active site to the passive site. The following are some ways to accomplish this task. Each workload is different, so these various approaches should be chosen according to their applicability to your environments.
Volume replication

With volume replication, state is replicated at the storage level. Volume replication can be synchronous (typically used in low-latency scenarios) or asynchronous. In the latter case, the application must be designed to work in a way that guarantees the storage is always consistent, or at least recoverable.
Most storage products support volume replication. However, Kubernetes does not offer a standard primitive to set up volume replication between two different clusters, so, at least for now, we need to rely on non-standard extensions for this capability. For example, Portworx supports it.
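As an illustration of such an extension, a sketch of a Portworx Stork `Migration` resource is shown below. The cluster pair and namespace names are assumptions, and the exact fields may vary by product version:

```yaml
apiVersion: stork.libopenstorage.org/v1alpha1
kind: Migration
metadata:
  name: db-migration
  namespace: db
spec:
  clusterPair: dr-cluster-pair   # a ClusterPair created beforehand, linking the two clusters
  includeResources: true         # migrate the Kubernetes objects along with the volumes
  startApplications: false       # keep the workload stopped on the passive site
  namespaces:
  - db
```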
Configuring volume replication outside of the Kubernetes abstraction is always a possibility. However, since the static nature of this kind of configuration usually conflicts with dynamic volume provisioning, this type of configuration must be carefully designed.
Backups and restores
While backups provide invaluable protection against application misconfiguration, bugs, or human error, they are not a recommended approach for a DR strategy.
In the context of DR, backups and restores can be seen as a form of rudimentary asynchronous volume replication. The following are examples of issues found with backups and restore operations:
- Full backup and restore exercises are done too infrequently (or never). The primary risk is not being able to recover data exactly when it is actually needed.
- There is no abstraction layer in Kubernetes to issue or schedule a backup, or a restore. Proprietary extensions (such as Velero or Fossul) must be used, or configuration must occur directly at the storage product layer.
- For very large datastores, the restore process can take a long time, potentially longer than the acceptable downtime for an application.
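If Velero is used, backups can be scheduled declaratively. A minimal sketch, assuming the stateful workload lives in a `db` namespace:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: db-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron expression: every night at 02:00
  template:
    includedNamespaces:
    - db
    ttl: 720h0m0s                # retain each backup for 30 days
```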
Application-level replication

With application-level replication, the stateful application takes care of replicating the state. Again, the replication can be synchronous or asynchronous. Because the replication is application-driven, in this case we can be certain that the storage will always be in a consistent state. Most traditional databases can be configured in this fashion, with a master running in the active site and a slave running in the passive site.
In order for the master to synchronize with the slave, it must be possible to establish a connection from the master to the slave (and vice-versa when recovering after a disaster). One way to establish the connection can be to expose the stateful workload via a Route or a LoadBalancer service and have the master connect to that endpoint.
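A minimal sketch of such an exposure, assuming the database from the sample application runs in a `db` namespace under the label `app: orders-db`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db-external
  namespace: db
spec:
  type: LoadBalancer             # exposes the database endpoint outside the cluster
  selector:
    app: orders-db
  ports:
  - name: postgresql
    port: 5432
    targetPort: 5432
```

The master in the other cluster would then be configured to replicate to the address the load balancer publishes.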
While this is a possible approach, it has the drawback that our stateful application is now exposed outside of the cluster. It can also be complicated to configure egress and ingress paths while maintaining each pod's identity when there is more than one pod per cluster (horizontal scaling). This is because stateful application instances typically need to contact peer instances individually (not via a load balancer) in order to form a cluster, so when there is more than one instance per cluster, the usual ingress solutions (LoadBalancer services, ingresses, routes) cannot be used to load balance across the instances of each cluster.
A solution to this issue is to establish a network tunnel between the clusters in such a way that pods in one cluster can communicate directly with pods in the other clusters.
Unfortunately, Kubernetes does not offer a standard abstraction to create network tunnels between clusters. However, there are community projects that offer this functionality including Submariner and Cilium.
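With Submariner, for example, a Service can be made reachable from the other clusters by exporting it through the Multi-Cluster Services API (implemented by Submariner's Lighthouse component). A sketch, with assumed names:

```yaml
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: orders-db        # must match the name of an existing Service
  namespace: db
```

Once exported, peers in the other connected clusters can resolve the service at a clusterset-scoped DNS name such as `orders-db.db.svc.clusterset.local`.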
Proxy-level replication

A third option to achieve replication is to place a proxy in front of the stateful workload and have the proxy be responsible for maintaining the state replication.
Such a proxy would have to be written for the specific network protocol used by the stateful workload, making this approach not always an option.
As with application-level replication, we need the ability to establish inter-cluster pod-to-pod communication (from the proxy to the stateful workload).
Examples of this approach include Vitess (MySQL) and Citus (PostgreSQL). These are relatively complex applications that were originally created to allow scale-out solutions for databases by intelligently sharding tables. So, while these solutions can be used as a disaster recovery strategy, they should be adopted only if they are also needed to meet other requirements (e.g., large-scale deployments).
Active / Active Strategies
For active / active strategies, we assume we have the requirement of consistency and availability for our stateful workloads. As described in this article, in order to meet these requirements, we will need at least three datacenters and an application with a consensus protocol that allows it to determine which instances of the cluster are active and healthy.
In this approach, the application is responsible for synchronizing the state across the various instances.
When OpenShift is introduced, we can deploy this type of architecture on one single cluster stretched across multiple datacenters (or availability zones [AZs] if in the cloud) or multiple independent clusters on multiple datacenters (or AZs).
Single Stretched OpenShift Cluster across Multiple Datacenters
In order to implement this strategy, the latency between datacenters must be relatively small (etcd requires latency between peers to be at most around 10 ms). Since organizations typically do not have three datacenters in the same metropolitan area (where low latency is achievable), this approach is more likely to be set up in the cloud. Cloud regions are composed of multiple datacenters, called Availability Zones (AZs), with very low latency between them. It is therefore possible to stretch an OpenShift cluster across three or more AZs, and in fact this is the recommended method of installing OpenShift in the cloud.
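In practice, this topology can be requested at installation time. An abridged `install-config.yaml` sketch for AWS is shown below; the region, zone names, and cluster name are assumptions, and `pullSecret` and `sshKey` are omitted:

```yaml
apiVersion: v1
baseDomain: example.com
metadata:
  name: prod
controlPlane:
  name: master
  replicas: 3
  platform:
    aws:
      zones:               # one control-plane node per AZ
      - us-east-1a
      - us-east-1b
      - us-east-1c
compute:
- name: worker
  replicas: 3
  platform:
    aws:
      zones:               # workers also spread across the AZs
      - us-east-1a
      - us-east-1b
      - us-east-1c
platform:
  aws:
    region: us-east-1
```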
The resulting architecture is depicted as the following:
When a disaster hits one of the AZs, no action needs to occur as both OpenShift and the stateful workload will autonomously react to the situation. In particular, the stateful workload will sense the loss of one of the instances and will continue using the remaining instances.
The same is true when the affected AZ is recovered. When the stateful instance in the recovered AZ comes back online, before the instance is allowed to join the cluster, it will need to resync its state. Again, this is handled autonomously and is part of the clustering features of these kinds of stateful workloads.
Examples of databases that have these features include:
- CockroachDB (binary compatible with PostgreSQL)
- YugabyteDB (binary compatible with PostgreSQL)
- TiDB (binary compatible with MySQL)
This new generation of databases (descendants of Google Spanner) is slowly gaining popularity. As you can see, they are binary compatible with existing major open source databases, so, in theory, you will not need to change your client applications when migrating to them.
At the same time though, since these are relatively new products, they may present some operational risk (lack of skills, low product maturity, lack of management tools).
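For the consensus protocol to actually survive the loss of an AZ, the database instances must be spread across zones. A sketch for a CockroachDB-style StatefulSet follows; the image, names, and flags are illustrative, and a production deployment would normally come from the vendor's operator or Helm chart:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cockroachdb
spec:
  serviceName: cockroachdb
  replicas: 3
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: cockroachdb
            topologyKey: topology.kubernetes.io/zone   # force one replica per AZ
      containers:
      - name: cockroachdb
        image: cockroachdb/cockroach:latest
        # each node joins its peers by their stable StatefulSet DNS names
        command: ["cockroach", "start", "--insecure",
                  "--join=cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb"]
```

With three replicas pinned to three different zones, the database retains a quorum (two of three) when any single AZ is lost.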
Multiple OpenShift Clusters in Multiple Datacenters
This deployment topology can be depicted as the following:
In this case, we have multiple datacenters (at least three) that are potentially geographically distributed. In each datacenter, we have independent OpenShift clusters. A global load balancer balances traffic between the datacenters (for design advice on how to configure a global load balancer using this approach, see this article). The stateful workload is deployed across the OpenShift clusters. This approach is more suitable than the previous one for geographically distributed, on-premises, and hybrid deployments.
Furthermore, it provides better availability: while we have the same assurances in terms of reaction to a disaster, in this configuration OpenShift does not act as a single failure domain.
When a disaster does occur, our global load balancer must be able to sense the unavailability of one of the datacenters and redirect all traffic to the remaining active datacenters. No action needs to occur on the stateful workload as it will self-reorganize to manage the loss of a cluster member.
In this configuration, the members of the stateful workload cluster need to be able to communicate with each other (pod to pod communication must be established). The same consideration as for the active/passive scenario with application-level replication applies.
Finally, deploying a stateful workload across multiple OpenShift clusters is not trivial. The process of standing up a complex stateful workload in a single OpenShift cluster is made simple by using operators. But nearly all operators today are cluster-bound and are not capable of controlling a configuration spread across multiple clusters. Multi-CassKop is one (rare) example of a multi-cluster operator, for Cassandra; this project also showcases a possible framework for creating multi-cluster controllers.
In this article, a set of disaster recovery strategies for applications running on OpenShift has been introduced. This list is certainly not complete; however, it is hoped that these alternatives will help OpenShift practitioners get the disaster recovery design for their applications started in the right direction. It should also be noted that the options presented are valid for generic stateful workloads. In a real-world use case, knowing the specific features of a given stateful product may uncover additional options.
During the discussion, we have also identified a set of limitations that are found in the current Kubernetes feature set which make these types of deployments relatively complex. These limitations were centered around the following items:
- A standard way to create volume replication across two Kubernetes clusters
- A standard way to create network tunnels across two or more Kubernetes clusters
- A standard way to create multi-cluster operators
It is our hope that, as awareness of these limitations grows, they will be addressed by the Kubernetes community in the future.