At the conclusion of the previous post (part 5) in this series (part 1, part 2, part 3, and part 4), we presented a dashboard for making “manual” capacity management decisions. We reflected on the fact that the automation available to us is reactive and not always suitable to all of our needs. In this article, we will introduce an approach that attempts to mitigate this issue.
Elastic infrastructure is one of the promises and main characteristics of cloud computing. In Kubernetes, it has initially and predominantly taken the form of autoscaling pods based on several metrics.
It has also always been possible to scale the nodes that make up a Kubernetes cluster, thus increasing the cluster's total capacity. The cluster autoscaler is a long-standing, battle-tested Kubernetes satellite project that can add new nodes by interacting directly with the underlying infrastructure. More recently, it has been combined with the cluster-api project.
The cluster-api project is now responsible for interacting with the underlying infrastructure. It surfaces infrastructure concepts inside Kubernetes as resources such as Machines and MachineSets, which represent virtual or physical machines and groups of such machines, respectively.
So, at least in OpenShift, the job of the cluster autoscaler is to detect a need to scale up or scale down the number of nodes and change the requested number of replicas in the MachineSet that it is observing. This resembles very closely the relationship between the HorizontalPodAutoscaler controller and a pod Deployment.
The method by which the cluster autoscaler makes the decision to trigger a scaling event is based on detecting pods in pending state that cannot be scheduled to any of the available nodes due to a lack of capacity. As a result, any cluster-scaling event driven by the cluster autoscaler is reactive, waiting for a problem to present itself and then resolving it. This introduces a delay for the user workload being scheduled as it waits for new nodes to be added.
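To make the reactive nature of this decision concrete, here is a minimal Python sketch of the rule described above. It is a deliberately simplified model, not the cluster autoscaler's actual implementation: the `Node` and `Pod` types, the function names, and the CPU-only view of capacity are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int  # CPU not yet requested by any pod on this node


@dataclass
class Pod:
    name: str
    cpu_request_millis: int


def unschedulable_pods(pending: list[Pod], nodes: list[Node]) -> list[Pod]:
    """Return the pending pods whose CPU request fits on no existing node."""
    return [
        pod
        for pod in pending
        if all(pod.cpu_request_millis > node.free_cpu_millis for node in nodes)
    ]


def should_scale_up(pending: list[Pod], nodes: list[Node]) -> bool:
    """Reactive rule: add capacity only once some pod is already stuck."""
    return len(unschedulable_pods(pending, nodes)) > 0


nodes = [Node("node-a", 500), Node("node-b", 250)]
print(should_scale_up([Pod("small", 200)], nodes))   # False: fits on node-a
print(should_scale_up([Pod("large", 1000)], nodes))  # True: fits nowhere
```

Note that the second call only returns `True` after the large pod is already pending, which is exactly the delay the rest of this article tries to avoid.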
The ability to autoscale the cluster in this fashion is generally acceptable, but as more advanced user workloads are being brought into Kubernetes, there is a greater need for the cluster to scale up quickly, with minimal wait. These use cases often stem from batch scheduling operations, machine learning scenarios, or from teams who create large ephemeral environments to perform automated tests. In all of these cases, large amounts of capacity need to be made available in a timely fashion while also acknowledging that once the capacity is no longer needed, the cluster size can be reduced to contain costs.
While the goals previously described await enhancements in the upstream Kubernetes community, here we will illustrate a technique that enables proactive cluster autoscaling in a manner that allows you to trade spare resources for faster response times.
Proactive Cluster Autoscaling Design
The approach to enable autoscaling in a proactive fashion involves using low-priority pods to make the cluster autoscaler believe that the cluster is more utilized than it actually is. This idea is not new and has been explored in the past by the Kubernetes community (for example: here and here). The following sequence of diagrams illustrates an autoscaling event:
In the first step, there are large, high-priority pods (represented by the dark blue circles) mixed with small, low-priority pods (represented by the light blue circles). In this scenario, the dark blue circles represent the user-scheduled workload, while the light blue circles represent pods running the pause container needed to trick the cluster autoscaler. The pause container is ideal in this situation as it has a very low resource footprint. In the center diagram, a new user workload pod is added. Since the workload pod has a higher priority, the scheduler evicts some of the low-priority pods to free up the needed resources. The new user pod is immediately scheduled and the user does not have to wait. At this point, the evicted low-priority pods return to the scheduler queue and enter the pending state because the cluster lacks capacity. This, in turn, triggers the cluster autoscaler to add capacity.
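The sequence above can be sketched in a few lines of Python. This is a toy model of a single node, under the assumption that only CPU requests matter; the real scheduler's preemption logic considers far more (multiple nodes, disruption budgets, graceful termination). The `Pod` type, the `admit` function, and the priority values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    cpu_millis: int
    priority: int  # a higher value wins


def admit(
    new_pod: Pod, running: list[Pod], capacity_millis: int
) -> tuple[list[Pod], list[Pod], bool]:
    """Try to place new_pod, evicting lower-priority pods if needed.

    Returns (running pods afterwards, evicted pods, whether new_pod was placed).
    """
    free = capacity_millis - sum(p.cpu_millis for p in running)
    # Only pods with a strictly lower priority may be evicted, lowest first.
    victims = sorted(
        (p for p in running if p.priority < new_pod.priority),
        key=lambda p: p.priority,
    )
    evicted: list[Pod] = []
    for victim in victims:
        if free >= new_pod.cpu_millis:
            break
        evicted.append(victim)
        free += victim.cpu_millis
    if free < new_pod.cpu_millis:
        return running, [], False  # even evicting every candidate is not enough
    remaining = [p for p in running if p not in evicted]
    return remaining + [new_pod], evicted, True


running = [Pod("user-1", 2000, 1000)] + [
    Pod(f"pause-{i}", 500, 0) for i in range(1, 5)
]
after, evicted, placed = admit(Pod("user-2", 1000, 1000), running, 4000)
print([p.name for p in after])    # ['user-1', 'pause-3', 'pause-4', 'user-2']
print([p.name for p in evicted])  # ['pause-1', 'pause-2']
```

In this model the new user pod is placed immediately, and the two evicted pause pods are the ones that would go pending and prompt the cluster autoscaler to add a node.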
To implement this idea, we need to perform the following steps:
- Define MachineSets with the desired machine types. Each cloud provider offers different instance types. All the desired instance types must be made available via MachineSets. Initially these MachineSets can be scaled to zero (tested on AWS, Azure, and GCP). Also note that for AI/ML workloads, these machines must be able to run GPU workloads.
- The autoscaler must be enabled for these MachineSets. Also, if you want your users to be able to select the machines on which their workloads run, proper labels should be set. Additionally, taints can be configured if normal cluster workloads should not be allowed to land on these nodes.
- Pod priorities must be enabled in the cluster. One needs to prepare at least two priority levels: high priority for the user’s workload and low priority for the low-priority pods needed for proactive autoscaling.
- Some low-priority pods need to be scheduled on nodes controlled by the MachineSets that can proactively autoscale.
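As a rough illustration of the last two steps, the following Python sketch builds the two priority classes and a Deployment of pause pods as plain dictionaries and prints them as JSON. The resource names, priority values, pause image tag, request sizes, and the shape of the toleration are all assumptions chosen for the example; adapt them to your cluster.

```python
import json

# Assumed node label and taint key; replace with the ones set on your MachineSets.
NODE_SELECTOR = {"machine.openshift.io/cluster-api-machine-type": "ai-ml"}
TOLERATION = {"key": "workload", "operator": "Exists", "effect": "NoSchedule"}


def priority_class(name: str, value: int, description: str) -> dict:
    """Build a PriorityClass manifest (scheduling.k8s.io/v1)."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def pause_deployment(replicas: int, cpu: str, memory: str) -> dict:
    """Build a Deployment of low-priority pause pods (apps/v1)."""
    labels = {"app": "proactive-pause"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "proactive-pause"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "proactive-low-priority",
                    "nodeSelector": NODE_SELECTOR,
                    "tolerations": [TOLERATION],
                    "containers": [
                        {
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",  # assumed tag
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                },
            },
        },
    }


# The values only need to order the two classes; the numbers are assumptions.
HIGH = priority_class("user-workload", 1000, "User workload")
LOW = priority_class("proactive-low-priority", -1, "Placeholder pods")

if __name__ == "__main__":
    for manifest in (HIGH, LOW, pause_deployment(4, "500m", "64Mi")):
        print(json.dumps(manifest, indent=2))
```

Generating the manifests from code rather than writing them by hand makes the replica count and request size easy to change, which matters because the number of low-priority pods is not meant to stay fixed.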
Ideally, the aggregate requested capacity of the low-priority pods should be an adjustable percentage of the aggregate requested capacity of the user workload. This allows us to tune the tradeoff between spare capacity and speed of response.
Therefore, the number of the low-priority pods is dynamic: It can change because the aggregate capacity of the user workload changes or because we need to tune the tradeoff. Some automation needs to be created to maintain the low-priority pods to user workload ratio at the right value.
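The arithmetic behind this automation is simple enough to sketch. The following Python snippet is a minimal illustration, assuming CPU requests expressed in millicores and a fixed request size per low-priority pod; the function names and the numbers are hypothetical.

```python
def desired_low_priority_pods(
    user_request_millis: int, percentage: int, pod_request_millis: int
) -> int:
    """Number of low-priority pods needed to reach the target ratio.

    Rounds up, so the spare capacity is never below the requested percentage.
    """
    if pod_request_millis <= 0:
        raise ValueError("pod_request_millis must be positive")
    numerator = user_request_millis * percentage
    denominator = 100 * pod_request_millis
    return -(-numerator // denominator)  # integer ceiling division


def scaling_delta(current_pods: int, desired_pods: int) -> int:
    """Positive: create that many pods. Negative: delete that many."""
    return desired_pods - current_pods


# 40 CPUs of user workload, 20 percent ratio, 0.5 CPU per low-priority pod.
print(desired_low_priority_pods(40_000, 20, 500))  # 16

# The user workload grows by one CPU: the target is recomputed.
desired = desired_low_priority_pods(41_000, 20, 500)
print(desired)                     # 17
print(scaling_delta(16, desired))  # 1
```

Both inputs named in the paragraph above appear as parameters: a change in the user workload alters `user_request_millis`, while tuning the tradeoff alters `percentage`.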
Proactive Node Scaling Operator
As always, when there is an automation opportunity in and around Kubernetes, operators should be the first option to consider, and thus the proactive-node-scaling-operator project was created.
With this operator, you can define a constant ratio between low-priority pods and the user workload on a portion of nodes selected via a node selector. The following is an example of how this can be configured using the NodeScalingWatermark CRD:
- key: "workload"
This configuration implies that on nodes selected by the label machine.openshift.io/cluster-api-machine-type: ai-ml, the aggregate capacity requested by low-priority pods should be 20 percent of the aggregate capacity requested by the user workload. It also instructs the operator to create the low-priority pods with the specified toleration.
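To show how a node selector and a percentage combine, here is a small Python model of the calculation. It is not the operator's code and the `Watermark` fields are hypothetical names, not the CRD's schema; the node names and request figures are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Watermark:
    # Hypothetical field names, not the operator's CRD schema.
    node_selector: dict[str, str]
    percentage: int


@dataclass
class Node:
    name: str
    labels: dict[str, str]
    user_request_millis: int  # CPU requested by user pods on this node


def matches(node: Node, selector: dict[str, str]) -> bool:
    """A node matches when it carries every label in the selector."""
    return all(node.labels.get(key) == value for key, value in selector.items())


def low_priority_target_millis(watermark: Watermark, nodes: list[Node]) -> int:
    """Aggregate CPU the low-priority pods should request on selected nodes."""
    selected = sum(
        node.user_request_millis
        for node in nodes
        if matches(node, watermark.node_selector)
    )
    return selected * watermark.percentage // 100


LABEL = "machine.openshift.io/cluster-api-machine-type"
nodes = [
    Node("gpu-1", {LABEL: "ai-ml"}, 6000),
    Node("gpu-2", {LABEL: "ai-ml"}, 4000),
    Node("worker-1", {LABEL: "general"}, 8000),
]
watermark = Watermark(node_selector={LABEL: "ai-ml"}, percentage=20)
print(low_priority_target_millis(watermark, nodes))  # 2000
```

Only the two ai-ml nodes count toward the total, so the general-purpose node's 8 CPUs of requests do not inflate the spare capacity reserved on the GPU machines.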
In this article, we introduced an approach for configuring and setting up proactive autoscaling in an OpenShift cluster. It addresses situations where cluster capacity needs to be elastic so that it stays optimized for the current workload, and it minimizes the time user workloads wait for resources to become available.