Learning Kubernetes - Episode 30 - Introduction and Explanation of Horizontal Pod Autoscaler

In this episode, we'll discuss Kubernetes Horizontal Pod Autoscaler (HPA) for automatic scaling. We'll learn how HPA works, how to configure autoscaling based on CPU, memory, and custom metrics, and best practices for production autoscaling.

Arman Dwi Pangestu
April 5, 2026
9 min read

Introduction

Note

If you want to read the previous episode, you can click the Episode 29 thumbnail below

Episode 29

In the previous episode, we learned about computational resources - CPU and memory requests and limits. In episode 30, we'll discuss Horizontal Pod Autoscaler (HPA), which automatically scales the number of Pods based on observed metrics like CPU utilization, memory usage, or custom metrics.

Note: Here I'll be using a Kubernetes Cluster installed through K3s.

Manual scaling works for predictable workloads, but real-world applications face variable traffic patterns. HPA automatically adjusts replica counts to match demand, ensuring optimal resource utilization and application performance without manual intervention.

What Is Horizontal Pod Autoscaler?

Horizontal Pod Autoscaler (HPA) automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics.

Think of HPA like an automatic thermostat - it monitors temperature (metrics) and adjusts heating/cooling (Pod count) to maintain your desired comfort level (target utilization). You set the target, HPA handles the adjustments.

Key characteristics of HPA:

  • Automatic scaling - Adjusts replicas without manual intervention
  • Metric-based - Scales based on CPU, memory, or custom metrics
  • Configurable targets - Set desired utilization thresholds
  • Min/max bounds - Define scaling limits
  • Cool-down periods - Prevents rapid scaling oscillations
  • Multiple metrics - Scale based on multiple conditions
  • Works with Deployments - Compatible with standard workloads

Why Use HPA?

HPA solves critical scaling challenges:

  • Handle traffic spikes - Automatically add Pods during high load
  • Cost optimization - Scale down during low traffic periods
  • Maintain performance - Keep response times consistent
  • Reduce manual work - No need to watch metrics constantly
  • Predictable behavior - Consistent scaling logic
  • Resource efficiency - Right-size capacity automatically
  • 24/7 operation - Scales even when you're sleeping
  • Business continuity - Handles unexpected load gracefully

Without HPA, you either over-provision (wasting money) or under-provision (risking outages).

How HPA Works

HPA Control Loop

HPA runs a control loop every 15 seconds (default):

  1. Query metrics - Fetch current resource usage from Metrics Server
  2. Calculate desired replicas - Compare current vs target utilization
  3. Scale if needed - Adjust replica count if threshold exceeded
  4. Wait for stabilization - Cool-down before next scaling decision
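The four steps above can be sketched as a loop. This is an illustrative Python sketch, not the actual controller code; `get_metrics`, `compute_desired`, and `update_replicas` are hypothetical stand-ins for the Metrics Server query, the replica calculation, and the scale-subresource update:

```python
import time

def run_hpa_loop(get_metrics, compute_desired, update_replicas,
                 interval_seconds=15, iterations=1):
    """Simplified sketch of the HPA reconciliation loop (default period: 15s)."""
    for i in range(iterations):
        current_replicas, utilization = get_metrics()              # 1. query metrics
        desired = compute_desired(current_replicas, utilization)   # 2. calculate
        if desired != current_replicas:                            # 3. scale if needed
            update_replicas(desired)
        if i < iterations - 1:
            time.sleep(interval_seconds)                           # 4. wait for next cycle

# Example wiring with stub functions (target CPU: 50%):
scaled = []
run_hpa_loop(lambda: (3, 80),                      # 3 replicas at 80% CPU
             lambda cur, util: -(-cur * util // 50),  # ceil(cur * util / 50)
             scaled.append)
print(scaled)  # [5]
```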

Scaling Algorithm

HPA uses this formula to calculate desired replicas:

plaintext
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]

Example:

  • Current replicas: 3
  • Current CPU usage: 80%
  • Target CPU usage: 50%
plaintext
desiredReplicas = ceil[3 * (80 / 50)] = ceil[4.8] = 5

HPA scales from 3 to 5 replicas.
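The same calculation, including the clamping to minReplicas/maxReplicas and the roughly 10% default tolerance band that suppresses tiny adjustments, can be sketched in Python. This illustrates the documented formula, not the controller's actual code:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band (default ~10%), HPA skips scaling.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(3, 80, 50, min_replicas=2, max_replicas=10))  # 5
```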

Metrics Server Requirement

HPA requires Metrics Server to collect resource metrics:

bash
# Check if Metrics Server is installed
kubectl get deployment metrics-server -n kube-system

If not installed:

bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Important

HPA cannot function without Metrics Server. Install it before creating HPA resources.

Creating HPA

Prerequisites

Before creating HPA, ensure:

  1. Metrics Server installed - For resource metrics
  2. Resource requests defined - HPA needs baseline for percentage calculations
  3. Deployment exists - HPA targets existing workloads

Basic HPA with CPU

Create a Deployment first:

web-deployment.yml
apiVersion: apps/v1
kind: Deployment
metadata:
    name: web-app
spec:
    replicas: 2
    selector:
        matchLabels:
            app: web
    template:
        metadata:
            labels:
                app: web
        spec:
            containers:
                - name: nginx
                  image: nginx:1.25
                  ports:
                      - containerPort: 80
                  resources:
                      requests:
                          cpu: "100m"      # Required for HPA
                          memory: "128Mi"
                      limits:
                          cpu: "200m"
                          memory: "256Mi"

Create HPA:

web-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: web-app-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70

Apply:

bash
kubectl apply -f web-deployment.yml
kubectl apply -f web-hpa.yml

Behavior:

  • Maintains 2-10 replicas
  • Scales up when CPU > 70%
  • Scales down when CPU drops back below 70% (after the stabilization window)

HPA with kubectl

Create HPA using kubectl:

bash
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

This creates an HPA equivalent to the YAML above.

Checking HPA Status

bash
kubectl get hpa

Output:

plaintext
NAME          REFERENCE            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
web-app-hpa   Deployment/web-app   45%/70%   2         10        2          5m

Columns:

  • TARGETS: Current/Target utilization
  • REPLICAS: Current number of Pods
  • MINPODS/MAXPODS: Scaling boundaries

Detailed view:

bash
kubectl describe hpa web-app-hpa

HPA with Memory

Scale based on memory utilization:

memory-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: memory-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
        - type: Resource
          resource:
              name: memory
              target:
                  type: Utilization
                  averageUtilization: 80

Behavior:

  • Scales when memory usage > 80%
  • Based on memory requests in Pod spec

HPA with Multiple Metrics

Scale based on CPU or memory; the metric that yields the most replicas wins:

multi-metric-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: multi-metric-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70
        - type: Resource
          resource:
              name: memory
              target:
                  type: Utilization
                  averageUtilization: 80

Behavior:

  • Scales up if CPU > 70% OR memory > 80%
  • Uses the metric requiring most replicas
  • Scales down when both metrics below target
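With multiple metrics, each metric proposes its own replica count and the highest proposal wins. A sketch, under the same simplifications as the single-metric formula (illustrative, not the controller's code):

```python
import math

def desired_for_metric(current_replicas, current, target):
    """Single-metric proposal: ceil(currentReplicas * current / target)."""
    return math.ceil(current_replicas * current / target)

def desired_replicas_multi(current_replicas, metrics, min_replicas, max_replicas):
    """metrics: list of (current_value, target_value) pairs.
    Each metric proposes a replica count; the highest proposal wins."""
    proposals = [desired_for_metric(current_replicas, c, t) for c, t in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))

# CPU at 90% of a 70% target, memory at 60% of an 80% target, 4 replicas:
print(desired_replicas_multi(4, [(90, 70), (60, 80)], 2, 10))  # 6 (CPU dominates)
```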

HPA with Custom Metrics

Scale based on application-specific metrics (requires custom metrics adapter):

custom-metric-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: custom-metric-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
        - type: Pods
          pods:
              metric:
                  name: http_requests_per_second
              target:
                  type: AverageValue
                  averageValue: "1000"

Use cases:

  • HTTP requests per second
  • Queue length
  • Database connections
  • Custom business metrics

Note

Custom metrics require additional setup like Prometheus Adapter or custom metrics API server.

HPA Behavior Configuration

Control scaling behavior with v2 API:

behavior-hpa.yml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: behavior-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 2
    maxReplicas: 10
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70
    behavior:
        scaleDown:
            stabilizationWindowSeconds: 300  # Wait 5 min before scale down
            policies:
                - type: Percent
                  value: 50              # Max 50% of current replicas
                  periodSeconds: 60      # Per minute
                - type: Pods
                  value: 2               # Max 2 Pods
                  periodSeconds: 60      # Per minute
            selectPolicy: Min            # Use most conservative policy
        scaleUp:
            stabilizationWindowSeconds: 0    # Scale up immediately
            policies:
                - type: Percent
                  value: 100             # Max 100% of current replicas
                  periodSeconds: 15      # Per 15 seconds
                - type: Pods
                  value: 4               # Max 4 Pods
                  periodSeconds: 15      # Per 15 seconds
            selectPolicy: Max            # Use most aggressive policy

Scale Down Behavior:

  • Wait 5 minutes before scaling down (stabilization)
  • Remove max 50% of Pods OR 2 Pods per minute
  • Use most conservative policy (Min)

Scale Up Behavior:

  • Scale up immediately (no stabilization)
  • Add max 100% of Pods OR 4 Pods per 15 seconds
  • Use most aggressive policy (Max)
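The policy arithmetic can be sketched as follows: each scale-up policy caps how many replicas may be added within its period, and selectPolicy chooses among the caps. A simplified illustration that assumes a single elapsed period (the real HPA also tracks change history across periods):

```python
import math

def max_scale_up(current_replicas, policies, select_policy="Max"):
    """policies: list of ('Percent' | 'Pods', value) pairs.
    Returns the replica ceiling the scale-up policies allow for one period."""
    limits = []
    for kind, value in policies:
        if kind == "Percent":
            limits.append(current_replicas + math.ceil(current_replicas * value / 100))
        else:  # 'Pods'
            limits.append(current_replicas + value)
    # selectPolicy Max permits the most change; Min the least.
    return max(limits) if select_policy == "Max" else min(limits)

# With 6 replicas and the scaleUp policies above (100% or 4 Pods per 15s):
print(max_scale_up(6, [("Percent", 100), ("Pods", 4)]))  # 12
```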

Practical Examples

Example 1: Web Application

web-app-complete.yml
apiVersion: apps/v1
kind: Deployment
metadata:
    name: web-app
spec:
    replicas: 3
    selector:
        matchLabels:
            app: web
    template:
        metadata:
            labels:
                app: web
        spec:
            containers:
                - name: nginx
                  image: nginx:1.25
                  ports:
                      - containerPort: 80
                  resources:
                      requests:
                          cpu: "100m"
                          memory: "128Mi"
                      limits:
                          cpu: "200m"
                          memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
    name: web-service
spec:
    selector:
        app: web
    ports:
        - port: 80
          targetPort: 80
    type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: web-app-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
    minReplicas: 3
    maxReplicas: 20
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 70

Example 2: API Service with Memory Scaling

api-service-hpa.yml
apiVersion: apps/v1
kind: Deployment
metadata:
    name: api-service
spec:
    replicas: 2
    selector:
        matchLabels:
            app: api
    template:
        metadata:
            labels:
                app: api
        spec:
            containers:
                - name: api
                  image: myapi:latest
                  ports:
                      - containerPort: 8080
                  resources:
                      requests:
                          cpu: "250m"
                          memory: "256Mi"
                      limits:
                          cpu: "500m"
                          memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: api-service-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-service
    minReplicas: 2
    maxReplicas: 15
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 60
        - type: Resource
          resource:
              name: memory
              target:
                  type: Utilization
                  averageUtilization: 75

Example 3: Background Worker

worker-hpa.yml
apiVersion: apps/v1
kind: Deployment
metadata:
    name: worker
spec:
    replicas: 5
    selector:
        matchLabels:
            app: worker
    template:
        metadata:
            labels:
                app: worker
        spec:
            containers:
                - name: worker
                  image: myworker:latest
                  resources:
                      requests:
                          cpu: "200m"
                          memory: "256Mi"
                      limits:
                          cpu: "400m"
                          memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
    name: worker-hpa
spec:
    scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: worker
    minReplicas: 5
    maxReplicas: 50
    metrics:
        - type: Resource
          resource:
              name: cpu
              target:
                  type: Utilization
                  averageUtilization: 80
    behavior:
        scaleDown:
            stabilizationWindowSeconds: 600  # 10 minutes
            policies:
                - type: Pods
                  value: 5
                  periodSeconds: 60
        scaleUp:
            stabilizationWindowSeconds: 0
            policies:
                - type: Pods
                  value: 10
                  periodSeconds: 30

Testing HPA

Generate Load

Create a load generator to test HPA:

bash
# Run load generator
kubectl run -it --rm load-generator --image=busybox:1.36 --restart=Never -- /bin/sh
 
# Inside the container, generate load
while true; do wget -q -O- http://web-service; done

Watch HPA in Action

In another terminal, watch HPA:

bash
kubectl get hpa web-app-hpa --watch

Output shows scaling:

plaintext
NAME          REFERENCE            TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
web-app-hpa   Deployment/web-app   45%/70%    2         10        2          1m
web-app-hpa   Deployment/web-app   85%/70%    2         10        2          2m
web-app-hpa   Deployment/web-app   85%/70%    2         10        3          2m
web-app-hpa   Deployment/web-app   78%/70%    2         10        3          3m
web-app-hpa   Deployment/web-app   72%/70%    2         10        4          3m
web-app-hpa   Deployment/web-app   68%/70%    2         10        4          4m

Watch Pods scaling:

bash
kubectl get pods -l app=web --watch

Stop Load

Stop the load generator (Ctrl+C), watch HPA scale down:

plaintext
NAME          REFERENCE            TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
web-app-hpa   Deployment/web-app   68%/70%    2         10        4          5m
web-app-hpa   Deployment/web-app   35%/70%    2         10        4          6m
web-app-hpa   Deployment/web-app   35%/70%    2         10        4          11m
web-app-hpa   Deployment/web-app   35%/70%    2         10        3          11m
web-app-hpa   Deployment/web-app   35%/70%    2         10        2          12m

Note the 5-minute stabilization window before scale down.

HPA Scaling Policies

Default Behavior

Without explicit behavior configuration:

Scale Up:

  • Immediate (no stabilization)
  • Max 100% increase or 4 Pods per 15 seconds

Scale Down:

  • 5-minute stabilization window
  • Max 100% decrease per 15 seconds (in practice, the stabilization window is the limiter)

Conservative Scaling

Slow, steady scaling:

yaml
behavior:
    scaleDown:
        stabilizationWindowSeconds: 600
        policies:
            - type: Pods
              value: 1
              periodSeconds: 120
    scaleUp:
        stabilizationWindowSeconds: 60
        policies:
            - type: Pods
              value: 2
              periodSeconds: 60

Aggressive Scaling

Fast response to load:

yaml
behavior:
    scaleDown:
        stabilizationWindowSeconds: 60
        policies:
            - type: Percent
              value: 50
              periodSeconds: 30
    scaleUp:
        stabilizationWindowSeconds: 0
        policies:
            - type: Percent
              value: 200
              periodSeconds: 15

Disable Scale Down

Only scale up, never down:

yaml
behavior:
    scaleDown:
        selectPolicy: Disabled

Common Mistakes and Pitfalls

Mistake 1: No Resource Requests

Problem: HPA cannot calculate utilization without requests.

yaml
# Bad: No requests
containers:
    - name: app
      image: myapp:latest
      # No resources specified

Solution: Always define requests:

yaml
# Good: Requests defined
containers:
    - name: app
      image: myapp:latest
      resources:
          requests:
              cpu: "100m"
              memory: "128Mi"

Mistake 2: Metrics Server Not Installed

Problem: HPA shows "unknown" for metrics.

bash
kubectl get hpa
# TARGETS: <unknown>/70%

Solution: Install Metrics Server:

bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Mistake 3: Target Too Low

Problem: Constant scaling up and down (flapping).

yaml
# Bad: 30% is too low
target:
    type: Utilization
    averageUtilization: 30

Solution: Use reasonable targets (60-80%):

yaml
# Good: 70% allows headroom
target:
    type: Utilization
    averageUtilization: 70

Mistake 4: Min = Max Replicas

Problem: HPA cannot scale.

yaml
# Bad: No scaling range
minReplicas: 5
maxReplicas: 5

Solution: Provide scaling range:

yaml
# Good: Can scale 2-10
minReplicas: 2
maxReplicas: 10

Mistake 5: Conflicting Manual Scaling

Problem: Manually scaling Deployment conflicts with HPA.

bash
# Bad: Manual scale conflicts with HPA
kubectl scale deployment web-app --replicas=20

Solution: Let HPA manage replicas, or delete HPA:

bash
# Either delete HPA
kubectl delete hpa web-app-hpa
 
# Or let HPA manage scaling
# (Don't manually scale)

Best Practices

Set Appropriate Targets

CPU targets:

  • Web apps: 60-70%
  • APIs: 70-80%
  • Batch jobs: 80-90%

Memory targets:

  • Generally: 75-85%
  • Leave headroom for spikes

Define Min/Max Carefully

yaml
# Consider:
# - Minimum for availability (min >= 2)
# - Maximum for cost control
# - Expected traffic patterns
minReplicas: 3    # High availability
maxReplicas: 20   # Cost limit

Use Stabilization Windows

Prevent rapid scaling oscillations:

yaml
behavior:
    scaleDown:
        stabilizationWindowSeconds: 300  # 5 minutes
    scaleUp:
        stabilizationWindowSeconds: 0    # Immediate
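Behind the scenes, the scale-down stabilization window keeps recent replica recommendations and uses the highest one, so a brief dip in load does not immediately remove Pods. A minimal sketch of that rolling maximum:

```python
def stabilized_scale_down(recent_recommendations):
    """Scale-down decision = highest desired replica count recorded
    within the stabilization window (here, the whole list)."""
    return max(recent_recommendations)

# Load dipped, but one sample inside the window still wanted 6 replicas,
# so HPA holds at 6 until that sample ages out of the window:
print(stabilized_scale_down([6, 4, 3, 3]))  # 6
```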

Monitor HPA Decisions

Track HPA behavior:

bash
# Watch HPA
kubectl get hpa --watch
 
# View events
kubectl describe hpa web-app-hpa
 
# Check metrics
kubectl top pods

Combine with Resource Quotas

Prevent runaway scaling:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
    name: compute-quota
spec:
    hard:
        requests.cpu: "50"
        requests.memory: "100Gi"
        pods: "100"

Test Scaling Behavior

Always test before production:

  1. Deploy with HPA
  2. Generate realistic load
  3. Observe scaling behavior
  4. Adjust targets and policies
  5. Repeat until satisfied

Document Scaling Decisions

Add annotations:

yaml
metadata:
    annotations:
        hpa.note: "Target 70% based on load testing, scales 2-20 replicas"

Troubleshooting HPA

HPA Shows Unknown Metrics

bash
kubectl get hpa
# TARGETS: <unknown>/70%

Causes:

  1. Metrics Server not installed
  2. Metrics Server not ready
  3. No resource requests defined
  4. Pods not ready

Solutions:

bash
# Check Metrics Server
kubectl get deployment metrics-server -n kube-system
 
# Check Pod requests
kubectl get deployment web-app -o yaml | grep -A 5 resources
 
# Check Pod status
kubectl get pods -l app=web

HPA Not Scaling

bash
kubectl describe hpa web-app-hpa
# Events show: "unable to get metrics"

Solutions:

bash
# Verify target exists
kubectl get deployment web-app
 
# Check HPA configuration
kubectl get hpa web-app-hpa -o yaml
 
# View HPA events
kubectl describe hpa web-app-hpa

Rapid Scaling (Flapping)

HPA scales up and down repeatedly.

Solutions:

  1. Increase stabilization window
  2. Adjust target utilization
  3. Review scaling policies
  4. Check application behavior

yaml
behavior:
    scaleDown:
        stabilizationWindowSeconds: 600  # Increase from 300

HPA Reaches Max Replicas

bash
kubectl get hpa
# REPLICAS: 10 (at maxReplicas)
# TARGETS: 95%/70% (still high)

Solutions:

  1. Increase maxReplicas
  2. Optimize application performance
  3. Lower target utilization
  4. Add more node capacity

HPA vs VPA vs Cluster Autoscaler

Horizontal Pod Autoscaler (HPA)

  • Scales number of Pods
  • Based on metrics
  • Fast response
  • For stateless workloads

Vertical Pod Autoscaler (VPA)

  • Adjusts Pod resource requests/limits
  • Based on historical usage
  • Requires Pod restart
  • For right-sizing resources

Cluster Autoscaler

  • Adds/removes nodes
  • Based on pending Pods
  • Slower response
  • For cluster capacity

Use together:

  • HPA: Scale Pods for traffic
  • VPA: Right-size Pod resources
  • Cluster Autoscaler: Adjust cluster size

Viewing HPA Details

Get HPA

bash
kubectl get hpa
kubectl get hpa -o wide
kubectl get hpa --all-namespaces

Describe HPA

bash
kubectl describe hpa web-app-hpa

View HPA YAML

bash
kubectl get hpa web-app-hpa -o yaml

Watch HPA

bash
kubectl get hpa web-app-hpa --watch

Check Current Metrics

bash
kubectl top pods -l app=web
kubectl top nodes

Deleting HPA

bash
kubectl delete hpa web-app-hpa

Deployment continues running with current replica count.

Conclusion

In episode 30, we've explored Horizontal Pod Autoscaler (HPA) in Kubernetes in depth. We've learned how HPA automatically scales Pods based on metrics, how to configure autoscaling policies, and best practices for production use.

Key takeaways:

  • HPA automatically scales Pod replicas based on metrics
  • Requires Metrics Server for resource metrics
  • Resource requests required for percentage-based scaling
  • Scales based on CPU, memory, or custom metrics
  • minReplicas/maxReplicas define scaling boundaries
  • Target utilization determines when to scale
  • Stabilization windows prevent rapid oscillations
  • Scale up typically immediate, scale down delayed
  • Behavior policies control scaling rate and timing
  • Multiple metrics use highest replica requirement
  • Default scale down: 5-minute stabilization
  • Default scale up: immediate response
  • Test thoroughly before production deployment
  • Monitor HPA decisions and adjust as needed
  • Combine with Resource Quotas for safety

Horizontal Pod Autoscaler is essential for running dynamic, cost-efficient workloads in Kubernetes. By understanding HPA configuration and behavior, you can ensure your applications automatically scale to meet demand while optimizing resource utilization and costs.

Are you getting a clearer understanding of Horizontal Pod Autoscaler in Kubernetes? Keep your learning momentum going and look forward to the next episode!

Note

If you want to continue to the next episode, you can click the Episode 31 thumbnail below

Episode 31
