One of the strengths of Kubernetes as a container orchestrator lies in its ability to manage and respond to dynamic environments. One example is Kubernetes’ native capability to perform effective autoscaling of resources. However, Kubernetes does not support just a single autoscaler or autoscaling approach. In this post, we discuss the three forms of Kubernetes capacity autoscaling.

1. Pod Replica Count

For many applications with usage that varies over time, you may want to add or remove pod replicas in response to changes in demand for those applications. The Horizontal Pod Autoscaler (HPA) can manage scaling these workloads for you automatically.

Use Cases

The HPA is ideal for scaling stateless applications, although it can also support scaling stateful sets. Using HPA in combination with cluster autoscaling (see below) can help you achieve cost savings for workloads that see regular changes in demand by reducing the number of active nodes as the number of pods decreases.

How It Works

For workloads with HPA configured, the HPA controller monitors the workload’s pods to determine whether it needs to change the number of pod replicas. In most cases, the controller takes the mean of a per-pod metric value and calculates the replica count that would bring that mean back to the target value. For example, a deployment might have a target CPU utilization of 50%. If five pods are currently running and the mean CPU utilization is 75%, the controller will add three replicas, for a total of eight, to move the pod average closer to 50%.
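
The arithmetic in that example follows the replica formula given in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric value / target metric value). The Python sketch below reproduces it. The function name and parameters are illustrative assumptions, and the real controller applies further rules (pod readiness, missing metrics, stabilization windows) that are left out here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA replica calculation (illustrative, not the controller's code)."""
    ratio = current_value / target_value
    # The controller skips scaling when the ratio is close enough to 1.0
    # (the documented default tolerance is 0.1).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Keep the result inside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Five pods at a mean of 75% CPU against a 50% target.
print(desired_replicas(5, 75, 50))  # 8
```

Rounding up means the controller leans toward slightly more capacity than the target strictly requires, rather than slightly less.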

HPA scaling calculations can also use custom or external metrics. Custom metrics target a marker of pod usage other than CPU usage, such as network traffic, memory, or a value relating to the pod’s application. External metrics measure values that do not correlate to a pod. For example, an external metric could track the number of pending tasks in a queue.
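
For the queue example, one common pattern is to target an average number of pending tasks per pod. Under that assumption, the calculation reduces to dividing the queue length by the per-pod target and rounding up. The sketch below is only illustrative: the function name and the figure of 30 tasks per pod are hypothetical.

```python
import math


def replicas_for_queue(pending_tasks, target_tasks_per_pod,
                       min_replicas=1, max_replicas=10):
    """Replica count needed so each pod handles roughly the target number of tasks."""
    needed = math.ceil(pending_tasks / target_tasks_per_pod)
    # Stay within the configured replica bounds, even when the queue is empty.
    return max(min_replicas, min(max_replicas, needed))


# 100 pending tasks with a hypothetical target of 30 tasks per pod.
print(replicas_for_queue(100, 30))  # 4
```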

How to Use It

To configure the HPA controller to manage a workload, create a HorizontalPodAutoscaler object. HPA can also be configured with the kubectl autoscale subcommand.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hello-world
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello-world
  targetCPUUtilizationPercentage: 50

This manifest would configure the HPA controller to monitor the CPU utilization of pods in the hello-world deployment. The controller could add or remove pods, keeping a minimum of one and a maximum of ten, to target a mean CPU utilization of 50%.


The HPA controller runs, by default, as part of the standard kube-controller-manager daemon. It can manage only those pods created by a replication controller, such as deployments, replica sets, or stateful sets.

The controller requires a metrics source. For scaling based on CPU usage, it depends on the metrics-server. Scaling based on custom or external metrics requires deploying a service that implements the custom.metrics.k8s.io or external.metrics.k8s.io API to provide an interface with the monitoring service or alternate metrics source.

For workloads using the standard CPU metric, containers must have CPU resource requests configured in the pod spec, because the controller calculates utilization as a percentage of the requested CPU.

2. Cluster Autoscaler

While HPA scales the number of running pods in a cluster, the cluster autoscaler can change the number of nodes in a cluster.

Use Cases

Dynamically scaling the number of nodes to match current cluster utilization can help manage the costs of running Kubernetes clusters on a cloud provider platform, especially with workloads that are designed to scale to meet current demand.

How It Works

The cluster autoscaler loops through two main tasks: watching for unschedulable pods and computing if it could consolidate all currently deployed pods on a smaller number of nodes.

The autoscaler checks the cluster for pods that cannot be scheduled on any existing nodes because of inadequate CPU or memory resources or because the pod’s node affinity rules or taint tolerations do not match an existing node. If the cluster has unschedulable pods, the autoscaler will check its managed node pools to decide if adding a node would unblock the pod. If so, it will add a node if the node pool can be increased in size.
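
A much-simplified sketch of that scale-up check, in Python, might look like the following. The `Pod` and `NodePool` types, their field names, and the resource figures are hypothetical, and only CPU and memory are modeled. The real autoscaler simulates the scheduler and also accounts for affinity rules, taints, and other constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    name: str
    cpu_millis: int
    memory_mib: int


@dataclass
class NodePool:
    name: str
    node_cpu_millis: int   # allocatable CPU of a single node in this pool
    node_memory_mib: int   # allocatable memory of a single node in this pool
    current_size: int
    max_size: int


def pool_to_expand(pod: Pod, pools: List[NodePool]) -> Optional[NodePool]:
    """Return the first pool whose new node would fit the pod and that can still grow."""
    for pool in pools:
        fits = (pod.cpu_millis <= pool.node_cpu_millis
                and pod.memory_mib <= pool.node_memory_mib)
        if fits and pool.current_size < pool.max_size:
            return pool
    # No managed pool can help, so the pod stays pending.
    return None


pools = [
    NodePool("small", 2000, 4096, current_size=3, max_size=3),
    NodePool("large", 8000, 16384, current_size=1, max_size=5),
]
pending = Pod("worker", cpu_millis=1500, memory_mib=2048)
chosen = pool_to_expand(pending, pools)
print(chosen.name if chosen else "no pool can unblock this pod")
```

Here the pod would fit on a node from the "small" pool, but that pool is already at its maximum size, so the sketch falls through to the "large" pool.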

The autoscaler also scans the nodes in the node pools that it manages. If a node has pods that could be rescheduled on other available nodes in the cluster, the autoscaler will evict them and then remove the node. When deciding if a pod can be moved, the autoscaler takes into account pod priority and PodDisruptionBudgets.
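
The scale-down side can be pictured as a bin-packing question: would every pod on a candidate node fit into the spare capacity of the remaining nodes? The Python sketch below uses a simple first-fit check to illustrate the idea. The types and numbers are hypothetical, and it ignores pod priority, PodDisruptionBudgets, and scheduling constraints, all of which the real autoscaler considers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pod:
    name: str
    cpu_millis: int
    memory_mib: int


@dataclass
class Node:
    name: str
    cpu_millis: int
    memory_mib: int
    pods: List[Pod] = field(default_factory=list)

    def free_cpu(self) -> int:
        return self.cpu_millis - sum(p.cpu_millis for p in self.pods)

    def free_memory(self) -> int:
        return self.memory_mib - sum(p.memory_mib for p in self.pods)


def can_drain(candidate: Node, others: List[Node]) -> bool:
    """First-fit check: can every pod on the candidate be placed on the other nodes?"""
    free = [[n.free_cpu(), n.free_memory()] for n in others]
    # Try the largest pods first, since they are the hardest to place.
    for pod in sorted(candidate.pods, key=lambda p: p.cpu_millis, reverse=True):
        for slot in free:
            if pod.cpu_millis <= slot[0] and pod.memory_mib <= slot[1]:
                slot[0] -= pod.cpu_millis
                slot[1] -= pod.memory_mib
                break
        else:
            # No node had room for this pod, so the candidate must stay.
            return False
    return True


a = Node("a", 4000, 8192, [Pod("x", 500, 512), Pod("y", 500, 512)])
b = Node("b", 4000, 8192, [Pod("z", 2000, 2048)])
c = Node("c", 4000, 8192, [Pod("w", 3800, 1024)])
print(can_drain(a, [b, c]))  # True
```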

How to Use It

The cluster autoscaler supports only a limited number of platforms. If you use a managed Kubernetes service, check if it provides the cluster autoscaler as an option. Otherwise, you will need to install and configure the autoscaler yourself.

Exact configuration steps vary by cloud provider, but generally, the cluster autoscaler must have the permissions and credentials to create and terminate virtual machines. Infrastructure tags or labels mark which node pools the autoscaler should manage.


The cluster autoscaler can manage nodes only on supported platforms, all of which are cloud providers, with the exception of OpenStack. Different platforms may have their own specific requirements or limitations.

Because the autoscaler controller requires permissions to add and delete infrastructure, the necessary credentials need to be managed securely, following the principle of least privilege. This requirement poses less of a risk in managed Kubernetes platforms, which run the controller on a secure control plane.

3. Vertical Pod Autoscaling

The default Kubernetes scheduler overcommits CPU and memory reservations on a node, with the philosophy that most containers will stick closer to their initial requests than to their requested upper limit. The Vertical Pod Autoscaler (VPA) can increase and decrease the CPU and memory resource requests of pod containers so that the resources reserved for a workload better match its actual usage.

Use Cases

The VPA service can set container resource limits based on live data, rather than human guesswork or benchmarks that are only occasionally run.

Alternatively, some workloads may be prone to occasional periods of very high utilization, but permanently increasing their request limits would waste mostly unused CPU or memory resources and limit the nodes that can run them. While Horizontal Pod Autoscaling can help in many of those cases, sometimes the workload cannot easily be spread across multiple instances of an application.

How It Works

A VPA deployment has three components: the Recommender, which monitors resource utilization and computes target values; the Updater, which evicts pods that need to be updated with new resource limits; and the Admission Controller, which uses a mutating admission webhook to overwrite the resource requests of pods at creation time.

Because Kubernetes does not support dynamically changing the resource limits of a running pod, the VPA cannot update existing pods with new limits. It terminates pods that are using outdated limits, and when the pod’s controller requests the replacement from the Kubernetes API service, the VPA admission controller injects the updated resource request and limit values into the new pod’s specification.

VPA can also be run in recommendation mode only. In this mode, the VPA Recommender will update the status field of the workload’s VerticalPodAutoscaler resource with its suggested values but will not terminate pods or alter pod API requests.

How to Use It

If your Kubernetes provider does not support VPA as an add-on, you can install it in your cluster directly.

VPA uses the custom resource VerticalPodAutoscaler to configure the scaling for a deployment or replica set. If I wanted VPA to manage resource requests for my hello-world deployment, I might apply this manifest:

apiVersion: ""
kind: VerticalPodAutoscaler
metadata:
  name: hello-world
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: hello-world
  resourcePolicy:
    containerPolicies:
      - containerName: '*'
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: 200m
          memory: 50Mi
        maxAllowed:
          cpu: 500m
          memory: 500Mi
  updatePolicy:
    updateMode: "Auto"

Now VPA will watch the pods in my deployment and try to set appropriate resource values, between 200 and 500 milliCPUs and between 50 and 500 mebibytes of memory. If VPA computes new resource values for the deployment’s pods, it will evict existing pods sequentially and update the replacements with the new values.
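
The effect of those bounds can be sketched in a few lines of Python. The helper names are hypothetical, the parsers handle only the quantity suffixes used in this manifest (`m` for milliCPUs and `Mi` for mebibytes), and the recommendation values passed in are made up for illustration.

```python
MIN_ALLOWED = {"cpu": "200m", "memory": "50Mi"}
MAX_ALLOWED = {"cpu": "500m", "memory": "500Mi"}


def parse_cpu_millis(quantity: str) -> int:
    """Parse a CPU quantity such as '200m' or '0.5' into milliCPUs."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_mib(quantity: str) -> int:
    """Parse a memory quantity such as '50Mi' into mebibytes (Mi suffix only)."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return int(quantity[:-2])


def clamp(value: int, lowest: int, highest: int) -> int:
    return max(lowest, min(highest, value))


def bounded_request(cpu_millis: int, memory_mib: int) -> dict:
    """Clamp a raw recommendation to the minAllowed/maxAllowed range."""
    cpu = clamp(cpu_millis,
                parse_cpu_millis(MIN_ALLOWED["cpu"]),
                parse_cpu_millis(MAX_ALLOWED["cpu"]))
    memory = clamp(memory_mib,
                   parse_memory_mib(MIN_ALLOWED["memory"]),
                   parse_memory_mib(MAX_ALLOWED["memory"]))
    return {"cpu": f"{cpu}m", "memory": f"{memory}Mi"}


# A recommendation below the CPU floor is raised to 200m; memory is within range.
print(bounded_request(120, 300))  # {'cpu': '200m', 'memory': '300Mi'}
```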


VPA can replace only pods managed by a replication controller, such as deployments. It requires the Kubernetes metrics-server.

VPA and HPA should only be used simultaneously to manage a given workload if the HPA configuration does not use CPU or memory to determine scaling targets.

VPA also has some other limitations and caveats.

These autoscaling options demonstrate a small but powerful piece of the flexibility of Kubernetes. Using one or more forms of autoscaling removes a good deal of the overhead in managing dynamic production environments and improves efficiency of infrastructure utilization.

