Scaling Infrastructure with Kubernetes and RKE2: A Production Deep Dive
A production engineer's guide to RKE2 cluster architecture, horizontal and vertical autoscaling, high availability, resource governance, and Prometheus/Grafana observability
Running a Kubernetes cluster in production is one thing. Running one that can absorb unpredictable traffic spikes, survive control-plane failures, enforce tenant isolation, and give your operations team clear visibility into every layer of the system — that is an entirely different challenge. RKE2, Rancher's next-generation Kubernetes distribution, is built specifically for environments where those requirements are non-negotiable.
This article works through the full lifecycle of a production-grade RKE2 deployment: initial cluster architecture, autoscaling at the pod and node level, high-availability control planes, resource governance, and observability with Prometheus and Grafana.
Why RKE2?
RKE2 distinguishes itself from upstream Kubernetes and from its predecessor RKE1 in three key areas. First, it ships with a CIS Kubernetes Benchmark-hardened configuration out of the box — admission control, audit logging, pod security, and TLS settings are pre-configured so that a cluster can pass a CIS Level 1 scan with minimal manual intervention (enabling the cis profile plus a few OS-level prerequisites such as the etcd user and kernel parameters). Second, it offers FIPS 140-2 validated cryptography, making it suitable for government and regulated-industry deployments. Third, it embeds containerd directly and ships with its own CNI (Canal by default, with Cilium and Calico as alternatives), reducing the surface area of external dependencies you need to manage.
RKE2 is also air-gap friendly. The installation bundle includes all required container images, which matters enormously in on-premises and edge deployments where internet access from cluster nodes is restricted or impossible.
Cluster Architecture
A production RKE2 cluster is divided into server nodes (which run the control plane and etcd) and agent nodes (which run workloads). The recommended topology for high availability is three or five server nodes and a variable number of agent nodes organised into node pools by workload class.
# /etc/rancher/rke2/config.yaml (server node)
token: <shared-cluster-token>
tls-san:
- 10.0.0.10 # VIP or load balancer address
- k8s.internal.example.com
cni: cilium
cluster-cidr: 10.42.0.0/16
service-cidr: 10.43.0.0/16
etcd-expose-metrics: true
kube-apiserver-arg:
- "audit-log-path=/var/log/kubernetes/audit.log"
- "audit-log-maxage=30"
- "audit-log-maxsize=100"
# /etc/rancher/rke2/config.yaml (agent node)
server: https://10.0.0.10:9345
token: <shared-cluster-token>
node-label:
- "workload-class=general"
- "topology.kubernetes.io/zone=eu-west-1a"
Install the server on your first control-plane node, then join the remaining server nodes and all agent nodes using the same token and the VIP address. RKE2 automatically elects etcd leaders and manages quorum.
# Install and start RKE2 server
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=server sh -
systemctl enable --now rke2-server.service
# Retrieve the node token for joining additional nodes
cat /var/lib/rancher/rke2/server/node-token
# Install and start RKE2 agent (on worker nodes)
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent sh -
systemctl enable --now rke2-agent.service
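On the second and third server nodes, the config file additionally points at the existing cluster before rke2-server is started. A minimal sketch, reusing the VIP and token from the earlier server config:

```yaml
# /etc/rancher/rke2/config.yaml (additional server nodes)
server: https://10.0.0.10:9345   # VIP — registration endpoint of the existing cluster
token: <shared-cluster-token>
tls-san:
- 10.0.0.10
- k8s.internal.example.com
cni: cilium                      # must match the first server's CNI choice
```

With this file in place, the same rke2-server install and systemctl commands as on the first node bring the new member into the etcd cluster.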
Node Pools and Workload Placement
Not all workloads have the same resource profile. Stateless web services have different requirements from GPU inference jobs, memory-intensive analytics workloads, or latency-sensitive databases. Organising agent nodes into pools with distinct labels and taints lets Kubernetes schedule each workload class onto appropriately sized hardware.
# Label a node pool for memory-intensive workloads
kubectl label nodes worker-mem-{1..4} workload-class=memory-optimised
kubectl taint nodes worker-mem-{1..4} workload-class=memory-optimised:NoSchedule
# Label a separate pool for general compute
kubectl label nodes worker-gen-{1..8} workload-class=general
# Deployment targeting the memory-optimised pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-engine
spec:
  selector:
    matchLabels:
      app: analytics-engine
  template:
    metadata:
      labels:
        app: analytics-engine
    spec:
      nodeSelector:
        workload-class: memory-optimised
      tolerations:
      - key: workload-class
        operator: Equal
        value: memory-optimised
        effect: NoSchedule
      containers:
      - name: analytics
        image: registry.internal/analytics:v2.3.1
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
          limits:
            memory: "16Gi"
            cpu: "4"
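Where strict placement is not required — workloads that should prefer the memory-optimised pool but may land elsewhere under capacity pressure — a preferred node affinity is a softer alternative to nodeSelector. A sketch of the relevant pod-spec fragment:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                 # scheduler favours matching nodes but will not fail placement
      preference:
        matchExpressions:
        - key: workload-class
          operator: In
          values: ["memory-optimised"]
```

Unlike nodeSelector, this never leaves a pod Pending when the preferred pool is full.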
Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) adjusts the replica count of a Deployment or StatefulSet based on observed metrics. CPU utilisation is the classic trigger, but modern HPA configurations can also scale on custom metrics exposed by your application or on external metrics from sources like a message queue depth.
First, ensure the Metrics Server is running. RKE2 bundles it as a packaged component (rke2-metrics-server), so it is usually present already; if it has been disabled in your configuration, install the upstream release:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
The behavior block is critical for stability. Without a scale-down stabilisation window, a brief traffic drop will remove pods prematurely, leaving you under-provisioned when load returns. The asymmetric policy — aggressive scale-up, conservative scale-down — is the right default for most production workloads.
Vertical Pod Autoscaler
The Vertical Pod Autoscaler (VPA) right-sizes the CPU and memory requests on individual pods based on observed usage. It addresses a common problem: developers set initial resource requests based on guesswork, and those values never get updated, leading either to wasteful over-provisioning or to OOMKilled pods under load. Note that the VPA controllers are not part of core Kubernetes — install them from the kubernetes/autoscaler project before creating VerticalPodAutoscaler objects.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: background-worker
  updatePolicy:
    updateMode: "Auto"  # or "Off" to only view recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
Note that VPA in Auto mode will evict and restart pods to apply new resource values. For services where in-flight requests cannot be interrupted, run VPA in Off mode to generate recommendations that you apply manually or through a GitOps workflow during maintenance windows.
Important: HPA and VPA should not manage the same resource (CPU or memory) on the same deployment simultaneously. Use HPA for CPU-driven horizontal scaling and VPA in Off mode for memory right-sizing, or use KEDA for event-driven scaling where fine-grained control is needed.
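As an illustration of the event-driven alternative, a KEDA ScaledObject can scale a consumer on queue depth rather than CPU. The following is a sketch only — it assumes KEDA is installed, and the Deployment name, queue name, and TriggerAuthentication reference are all placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler        # hypothetical name
  namespace: production
spec:
  scaleTargetRef:
    name: order-consumer             # hypothetical Deployment consuming the queue
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders              # placeholder queue
      mode: QueueLength              # scale on message backlog, not CPU
      value: "20"                    # target messages per replica
    authenticationRef:
      name: rabbitmq-conn            # TriggerAuthentication holding the connection string
```

KEDA manages an HPA under the hood, so it composes cleanly with the scale-up/scale-down behavior semantics described above.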
Cluster Autoscaler
Pod autoscalers work within the existing node capacity. When that capacity is exhausted — pods are stuck in Pending because no node has sufficient resources — you need the Cluster Autoscaler to provision new nodes. Conversely, when nodes are significantly under-utilised, the Cluster Autoscaler can drain and decommission them to reduce infrastructure cost.
On bare-metal or on-premises deployments, the Cluster Autoscaler integrates with your infrastructure provisioning layer. For cloud deployments, providers such as AWS, GCP, and Azure offer native node group integrations. The following example shows the core configuration for an AWS Auto Scaling Group.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=2:10:k8s-general-worker-asg
        - --nodes=1:4:k8s-memory-worker-asg
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        env:
        - name: AWS_REGION
          value: eu-west-1
The --expander=least-waste option tells the autoscaler to prefer the node group that would have the smallest amount of unused resource after accommodating the pending pod, which minimises cost. Alternative expanders include random, most-pods, and priority.
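The priority expander in particular needs an accompanying ConfigMap that maps priority values to node-group name patterns — higher values win. A sketch using the ASG names from the Deployment above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander  # name the priority expander looks for
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*memory-worker.*     # expand the expensive pool only as a last resort
    50:
      - .*general-worker.*    # prefer the cheaper general pool
```

This gives you cost-aware expansion without least-waste's bin-packing heuristics.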
High Availability Control Plane
A three-node control plane with embedded etcd is the minimum viable HA topology. etcd requires quorum — a majority of members must be healthy for the cluster to accept writes. With three members you can tolerate one failure; with five members you can tolerate two.
The control-plane nodes must sit behind a load balancer. For cloud deployments, a TCP load balancer targeting port 6443 (kube-apiserver) and 9345 (RKE2 registration) works well. On-premises deployments commonly use keepalived with a virtual IP address.
# keepalived.conf on control-plane nodes
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the other two nodes
    interface eth0
    virtual_router_id 51
    priority 100            # 90 and 80 on the other two nodes
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass securepassword
    }
    virtual_ipaddress {
        10.0.0.10/24        # VIP used in tls-san and agent server address
    }
}
Validate that etcd is healthy after any control-plane operation. RKE2 bundles etcdctl at /var/lib/rancher/rke2/bin/etcdctl.
ETCDCTL_API=3 /var/lib/rancher/rke2/bin/etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/client.key \
endpoint health --cluster
Resource Quotas and Limit Ranges
In multi-tenant clusters — where different teams or applications share the same physical infrastructure — ResourceQuotas and LimitRanges are essential guardrails. ResourceQuotas set hard caps on total resource consumption within a namespace. LimitRanges set default and maximum values for individual containers, preventing a misconfigured deployment from requesting unbounded resources.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    count/deployments.apps: "20"
    count/services: "15"
    persistentvolumeclaims: "10"
    requests.storage: 500Gi
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-limits
  namespace: team-alpha
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: "8"
      memory: 16Gi
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi
Applying LimitRanges ensures that developers who forget to specify resource requests still get sensible defaults. Without them, a container with no requests is scheduled as if it needs zero CPU, so the scheduler can place it anywhere and it may starve other workloads on the same node.
Monitoring with Prometheus and Grafana
Observability in a Kubernetes cluster has three pillars: metrics, logs, and traces. Prometheus handles metrics collection; Grafana handles visualisation. The kube-prometheus-stack Helm chart deploys the entire stack — Prometheus Operator, Alertmanager, Grafana, node exporters, and a comprehensive set of pre-built dashboards — in a single command.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.storageClassName=longhorn \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi \
--set grafana.adminPassword=<secure-password> \
--set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi
RKE2 exposes etcd metrics when etcd-expose-metrics: true is set in the server config. Package the etcd client certificate and key into a Secret (here named etcd-client-cert) and list it under prometheus.prometheusSpec.secrets so Prometheus mounts it, then add a ServiceMonitor so Prometheus scrapes the endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rke2-etcd
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames: [kube-system]
  selector:
    matchLabels:
      app.kubernetes.io/name: rke2-etcd
  endpoints:
  - port: metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-client-cert/ca.crt
      certFile: /etc/prometheus/secrets/etcd-client-cert/client.crt
      keyFile: /etc/prometheus/secrets/etcd-client-cert/client.key
Essential Alerting Rules
Pre-built dashboards are a starting point, but custom alerting rules tuned to your environment are what allow on-call engineers to act before users notice a problem.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: pod-health
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[10m]) > 0.5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash-looping"
    - alert: NodeMemoryPressure
      expr: |
        (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} available memory below 10%"
    - alert: HPAMaxedOut
      expr: |
        kube_horizontalpodautoscaler_status_current_replicas
        == kube_horizontalpodautoscaler_spec_max_replicas
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} at maximum replicas"
The HPAMaxedOut alert is particularly valuable in practice. When an HPA is pinned at its maximum for an extended period it means traffic has outgrown your current ceiling. You either need to raise the maximum or add capacity to the node pool — and you want to know about that before the next spike, not during it.
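Severities only matter if they are routed differently. A minimal Alertmanager routing sketch — the receiver names are placeholders, and the actual notification integrations (Slack webhook, PagerDuty key) are omitted:

```yaml
route:
  receiver: slack-default            # hypothetical catch-all receiver
  group_by: ["alertname", "namespace"]
  routes:
  - matchers:
    - severity="critical"
    receiver: pagerduty-oncall       # page a human for critical alerts
    repeat_interval: 1h
  - matchers:
    - severity="warning"
    receiver: slack-default          # warnings go to chat, not the pager
receivers:
- name: slack-default
- name: pagerduty-oncall
```

The point of the split is discipline: critical pages someone, warning accumulates context for working hours.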
Production Best Practices
Pod Disruption Budgets
A PodDisruptionBudget (PDB) constrains how many pods in a deployment can be simultaneously unavailable during voluntary disruptions like node drains. Without PDBs, draining a node for maintenance can take an entire deployment offline.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 2  # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: api-server
Topology Spread Constraints
By default, the scheduler spreads replicas across nodes using a best-effort algorithm. Topology spread constraints give you hard guarantees that replicas are distributed across availability zones or racks.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: api-server
Upgrade Strategy
RKE2 supports rolling upgrades via the System Upgrade Controller. You define a Plan that targets server or agent nodes and specifies the target version; the controller drains, upgrades, and uncordons nodes sequentially.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - { key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"] }
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.29.4+rke2r1
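A companion Plan handles the agent nodes; a prepare step is commonly used to block until the server Plan has completed, so the control plane and workers never upgrade simultaneously. A sketch under the same assumptions as the server Plan:

```yaml
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-agent-upgrade
  namespace: system-upgrade
spec:
  concurrency: 2                     # upgrade two workers at a time
  cordon: true
  nodeSelector:
    matchExpressions:
    - { key: node-role.kubernetes.io/control-plane, operator: DoesNotExist }
  serviceAccountName: system-upgrade
  prepare:
    image: rancher/rke2-upgrade
    args: ["prepare", "rke2-server-upgrade"]  # wait for the server Plan to finish
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.29.4+rke2r1
```

Combined with the PodDisruptionBudgets above, this keeps workloads available throughout the rolling upgrade.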
etcd Backup and Restore
RKE2 can take scheduled etcd snapshots automatically. Ensure they are written to durable storage outside the cluster — an S3 bucket or a remote NFS mount — rather than local disk on the control-plane nodes.
# /etc/rancher/rke2/config.yaml additions for automated snapshots
etcd-snapshot-schedule-cron: "0 */6 * * *" # every 6 hours
etcd-snapshot-retention: 10
etcd-snapshot-dir: /mnt/nfs/etcd-snapshots
# Manual snapshot
rke2 etcd-snapshot save --name pre-upgrade-$(date +%Y%m%d)
# Restore from snapshot (run on a single server node with cluster stopped)
rke2 server --cluster-reset --cluster-reset-restore-path=/path/to/snapshot.db
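To satisfy the durable-storage requirement without an NFS mount, RKE2 can also upload snapshots to S3-compatible storage directly. A sketch — the endpoint, bucket, and credentials below are placeholders:

```yaml
# /etc/rancher/rke2/config.yaml — ship etcd snapshots to S3-compatible storage
etcd-s3: true
etcd-s3-endpoint: s3.eu-west-1.amazonaws.com   # placeholder endpoint
etcd-s3-region: eu-west-1
etcd-s3-bucket: rke2-etcd-snapshots            # placeholder bucket
etcd-s3-folder: production-cluster
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>
```

Whichever destination you choose, periodically restore a snapshot into a scratch cluster — a backup you have never restored is not a backup.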
Conclusion
Scaling infrastructure with RKE2 is not a single configuration change — it is a system of interlocking capabilities that must be designed and operated together. Horizontal pod autoscaling handles short-lived traffic bursts at the workload level. Vertical pod autoscaling keeps resource requests honest over time. The Cluster Autoscaler ensures the underlying node capacity tracks the aggregate demand of your pod autoscalers. Node pools and topology constraints ensure workloads land on the right hardware. Resource quotas and limit ranges protect tenants from each other. PodDisruptionBudgets and topology spread constraints harden availability. And Prometheus with Grafana gives your team the visibility to detect degradation before it becomes an outage.
RKE2 earns its place in production precisely because it ships a substantial portion of this stack pre-hardened and pre-integrated. Your responsibility is to understand the knobs, tune them to your workload characteristics, and build the operational discipline — runbooks, alert routing, upgrade cadence, backup validation — that turns a well-configured cluster into a genuinely reliable platform.