PostgreSQL High Availability with Zalando Postgres Operator: Multi-Cloud Kubernetes Deployment Guide
Deploy production-grade PostgreSQL HA with Zalando Operator on any Kubernetes platform
Introduction: Why PostgreSQL High Availability on Kubernetes Matters
Running PostgreSQL in production demands high availability (HA). Downtime measured in minutes can cost enterprises millions of dollars in lost revenue, eroded customer trust, and violated service-level agreements. Kubernetes has become the de facto platform for orchestrating containerized workloads, but running stateful services like PostgreSQL on Kubernetes introduces unique challenges: persistent storage management, leader election, automatic failover, backup orchestration, and connection pooling.
The Zalando Postgres Operator is an open-source, battle-tested solution that Zalando—Europe's largest online fashion retailer—built to manage hundreds of PostgreSQL clusters in production. It leverages Patroni for consensus-based leader election, Spilo as the PostgreSQL container image, WAL-G for continuous archiving and point-in-time recovery, and PgBouncer for connection pooling. Together, these components deliver a fully automated, self-healing PostgreSQL deployment that works across any Kubernetes distribution—from managed cloud services like AWS EKS, Azure AKS, and Google GKE to bare-metal clusters running k3s with Rancher.
In this comprehensive guide, we will explore the architecture of the Zalando Postgres Operator, walk through installation and configuration on multiple Kubernetes platforms, deep-dive into replication, backups, disaster recovery, monitoring, and production tuning. By the end, you will have the knowledge to deploy and operate production-grade PostgreSQL HA clusters on any Kubernetes infrastructure.
Zalando Postgres Operator Architecture
Understanding the architecture is essential before deploying. The Zalando Postgres Operator follows the Kubernetes operator pattern: it watches for Custom Resource Definitions (CRDs) of type postgresql and reconciles the desired state into actual Kubernetes resources. Here is how the components fit together:
Spilo: The PostgreSQL Container Image
Spilo is Zalando's Docker image that bundles PostgreSQL with Patroni, WAL-G, and essential extensions. Each pod in the StatefulSet runs a Spilo container. Spilo handles:
- PostgreSQL server — the database engine itself, supporting versions 13 through 16
- Patroni — the HA agent that manages leader election, replication, and failover
- WAL-G — continuous WAL archiving and base backup tool
- pg_cron, pg_stat_statements, PostGIS — commonly needed extensions pre-installed
Patroni: Leader Election and Automatic Failover
Patroni is the heart of the HA mechanism. It uses a Distributed Configuration Store (DCS) to maintain cluster state and perform leader election. In the Zalando operator context, Patroni uses the Kubernetes API itself as the DCS (via Endpoints or ConfigMaps), eliminating the need for an external etcd or ZooKeeper cluster.
Here is how Patroni's failover process works:
- Health checks — Each Patroni agent continuously monitors its local PostgreSQL instance and reports health to the DCS.
- Leader lock — The primary holds a leader lock in the DCS (a Kubernetes Endpoint object). The lock has a TTL (default 30 seconds).
- Failure detection — If the primary fails to renew its lock within the TTL, replicas detect the absence.
- Election — Eligible replicas compete for the leader lock. The replica with the least replication lag wins.
- Promotion — The winning replica promotes itself to primary, updates the DCS, and the Kubernetes master Service endpoint updates automatically.
- Fencing — The old primary is fenced (stopped or demoted to replica) to prevent split-brain.
This entire failover process typically completes in 15-30 seconds, ensuring minimal downtime for your applications.
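These timing parameters are exposed through the patroni section of the postgresql CRD, covered later in this guide. A sketch of how they interact, using the operator's default values:

```yaml
# Patroni timing knobs from the postgresql CRD (defaults shown).
# Worst-case failure detection is roughly ttl + loop_wait: the leader lock
# expires after ttl seconds, and replicas act on their next loop_wait cycle.
spec:
  patroni:
    ttl: 30            # leader lock time-to-live in the DCS
    loop_wait: 10      # seconds between Patroni reconciliation loops
    retry_timeout: 10  # retry window for DCS and PostgreSQL operations
```

Lowering ttl and loop_wait speeds up failover detection at the cost of more DCS traffic and a higher risk of false-positive failovers under transient API-server latency.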
WAL-G: Continuous Archiving and Backup
WAL-G is a next-generation archival tool for PostgreSQL that supports backup to S3, Google Cloud Storage (GCS), and Azure Blob Storage. It provides:
- Base backups — Full physical backups using pg_basebackup
- WAL archiving — Continuous Write-Ahead Log shipping for point-in-time recovery
- Delta backups — Incremental backups that only store changed pages
- Encryption — AES-256 encryption of backups at rest
- Compression — LZ4 or ZSTD compression for reduced storage costs
Kubernetes Resource Hierarchy
When you create a postgresql Custom Resource, the operator creates a comprehensive set of Kubernetes resources to manage the cluster. Understanding this hierarchy is important for troubleshooting and monitoring:
Resources Created by the Operator
- StatefulSet — Manages the PostgreSQL pods with stable network identities and ordered deployment
- Services — Two ClusterIP services: <cluster-name> for the primary and <cluster-name>-repl for read replicas
- Endpoints — Patroni updates endpoints to point to the current leader for seamless failover
- PodDisruptionBudgets (PDB) — Ensures at least one instance remains available during voluntary disruptions
- Secrets — PostgreSQL superuser, replication, and application credentials stored as Kubernetes Secrets
- PersistentVolumeClaims (PVCs) — One PVC per pod for PostgreSQL data storage
- PgBouncer Deployment — Optional connection pooler deployed as a separate Deployment with its own Service
Installation on Multiple Kubernetes Platforms
Prerequisites
Before installing the Zalando Postgres Operator, ensure you have:
- A running Kubernetes cluster (v1.25+)
- kubectl configured with cluster admin access
- helm v3 installed
- A default StorageClass configured
Operator Installation via Helm
The recommended installation method uses Helm. This works consistently across all Kubernetes platforms:
# Add the Zalando Postgres Operator Helm repository
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm repo add postgres-operator-ui-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator-ui
helm repo update
# Create a dedicated namespace
kubectl create namespace postgres-operator
# Install the operator with custom values
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator \
--set configKubernetes.enable_pod_antiaffinity=true \
--set configKubernetes.pod_environment_configmap=postgres-pod-config \
--set configAwsOrGcp.aws_region=us-east-1 \
--set configLoadBalancer.db_hosted_zone=db.example.com \
--set configConnectionPooler.connection_pooler_default_cpu_request=500m \
--set configConnectionPooler.connection_pooler_default_memory_request=100Mi
# Optionally install the operator UI for visual management
helm install postgres-operator-ui postgres-operator-ui-charts/postgres-operator-ui \
--namespace postgres-operator
Verify the operator is running:
kubectl get pods -n postgres-operator
# Expected output:
# NAME READY STATUS RESTARTS AGE
# postgres-operator-7f8b9c6d4-x2k9j 1/1 Running 0 2m
AWS EKS Deployment Specifics
Amazon EKS requires specific configuration for optimal PostgreSQL performance:
# EKS-specific Helm values (eks-values.yaml)
configAwsOrGcp:
aws_region: us-east-1
enable_ebs_gp3_migration: true
additional_secret_mount: "aws-iam-token"
configKubernetes:
enable_pod_antiaffinity: true
pod_environment_configmap: "postgres-pod-config"
spilo_privileged: false
storage_resize_mode: pvc
# Use EBS gp3 StorageClass for better performance
# Create the StorageClass first:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ebs-gp3-postgres
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "6000"
throughput: "400"
encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
For WAL-G backups on EKS, configure IAM roles for Service Accounts (IRSA):
# Create IAM policy for WAL-G S3 access
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::my-pg-backups",
"arn:aws:s3:::my-pg-backups/*"
]
}
]
}
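To bind that policy to the Spilo pods without static access keys, attach it to an IAM role and annotate the pods' ServiceAccount. This is a sketch: the role ARN and account ID are placeholders, and the ServiceAccount name must match the operator's pod_service_account_name setting (postgres-pod is the default).

```yaml
# Hypothetical IRSA binding: the ServiceAccount used by Spilo pods assumes
# an IAM role that carries the S3 backup policy above.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: postgres-pod            # must match the operator's pod_service_account_name
  namespace: databases
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/pg-walg-backup
```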
Azure AKS Deployment
Azure AKS uses Azure Disk for persistent storage and Managed Identity for backup authentication:
# AKS-specific StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: managed-premium-postgres
provisioner: disk.csi.azure.com
parameters:
skuName: Premium_LRS
cachingMode: ReadOnly
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# WAL-G backup to Azure Blob Storage environment variables
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: default
data:
AZURE_STORAGE_ACCOUNT: "pgbackupsstorage"
AZURE_STORAGE_ACCESS_KEY: "" # Use Managed Identity instead
WALG_AZ_PREFIX: "azure://pg-wal-backups/$(SCOPE)"
USE_WALG_BACKUP: "true"
USE_WALG_RESTORE: "true"
BACKUP_SCHEDULE: "0 2 * * *"
BACKUP_NUM_TO_RETAIN: "7"
Google GKE Deployment
Google Kubernetes Engine uses Persistent Disk and Workload Identity for GCS backup access:
# GKE StorageClass for SSD Persistent Disks
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ssd-postgres
provisioner: pd.csi.storage.gke.io
parameters:
type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GCS backup configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: default
data:
WALG_GS_PREFIX: "gs://my-pg-backups/$(SCOPE)"
USE_WALG_BACKUP: "true"
USE_WALG_RESTORE: "true"
GOOGLE_APPLICATION_CREDENTIALS: "/var/secrets/google/key.json"
BACKUP_SCHEDULE: "0 2 * * *"
BACKUP_NUM_TO_RETAIN: "7"
Bare Metal k3s/Rancher with Longhorn Storage
For on-premises deployments, k3s provides a lightweight Kubernetes distribution and Longhorn offers distributed block storage. This combination is ideal when you need full control over your infrastructure without cloud vendor lock-in.
# Install k3s on all nodes
# Master node:
curl -sfL https://get.k3s.io | sh -s - server \
--cluster-init \
--disable traefik \
--write-kubeconfig-mode 644
# Worker nodes:
curl -sfL https://get.k3s.io | sh -s - agent \
--server https://master-ip:6443 \
--token $(cat /var/lib/rancher/k3s/server/node-token)
# Install Longhorn for persistent storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--set defaultSettings.defaultReplicaCount=3 \
--set defaultSettings.defaultDataPath=/mnt/longhorn
# Install MetalLB for LoadBalancer services
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.5/config/manifests/metallb-native.yaml
# Configure MetalLB IP pool
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: postgres-pool
namespace: metallb-system
spec:
addresses:
- 192.168.1.200-192.168.1.210
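In MetalLB's default layer-2 mode, the address pool is not announced on the network until an L2Advertisement references it, so a minimal companion resource is needed alongside the pool above:

```yaml
# Announce the postgres-pool addresses via ARP/NDP (MetalLB layer-2 mode)
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: postgres-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - postgres-pool   # the pool defined above
```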
PostgreSQL Cluster CRD Specification
The core of deploying a PostgreSQL cluster with the Zalando operator is the postgresql Custom Resource. This YAML manifest declares the desired state of your cluster, and the operator reconciles it into reality. Below is a comprehensive, production-ready CRD specification:
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
name: pg-production-cluster
namespace: databases
labels:
team: platform
environment: production
spec:
teamId: "platform"
volume:
size: 100Gi
storageClass: ebs-gp3-postgres
numberOfInstances: 3
enableConnectionPooler: true
enableReplicaConnectionPooler: true
connectionPooler:
numberOfInstances: 2
mode: transaction
schema: pooler
user: pooler
resources:
requests:
cpu: 500m
memory: 100Mi
limits:
cpu: "1"
memory: 256Mi
users:
app_user:
- superuser
- createdb
readonly_user: []
databases:
app_database: app_user
postgresql:
version: "16"
parameters:
shared_buffers: "2GB"
max_connections: "200"
work_mem: "64MB"
maintenance_work_mem: "512MB"
effective_cache_size: "6GB"
random_page_cost: "1.1"
effective_io_concurrency: "200"
wal_buffers: "64MB"
max_wal_size: "4GB"
min_wal_size: "1GB"
checkpoint_completion_target: "0.9"
default_statistics_target: "100"
log_statement: "ddl"
log_min_duration_statement: "1000"
idle_in_transaction_session_timeout: "600000"
lock_timeout: "30000"
statement_timeout: "60000"
patroni:
initdb:
encoding: "UTF8"
locale: "en_US.UTF-8"
data-checksums: "true"
pg_hba:
- hostssl all all 0.0.0.0/0 md5
- host all all 0.0.0.0/0 md5
ttl: 30
loop_wait: 10
retry_timeout: 10
synchronous_mode: false
synchronous_mode_strict: false
maximum_lag_on_failover: 33554432
resources:
requests:
cpu: "2"
memory: 8Gi
limits:
cpu: "4"
memory: 16Gi
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9187"
tolerations:
- key: "database"
operator: "Equal"
value: "postgres"
effect: "NoSchedule"
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload-type
operator: In
values:
- database
enableShmVolume: true
spiloRunAsUser: 101
spiloRunAsGroup: 103
spiloFSGroup: 103
Key CRD Fields Explained
- numberOfInstances — Total number of pods. The operator automatically designates one as primary and the rest as streaming replicas.
- enableConnectionPooler — Deploys a PgBouncer Deployment in front of the primary service, reducing connection overhead.
- enableReplicaConnectionPooler — Deploys a separate PgBouncer for the replica service, essential for read-heavy workloads.
- postgresql.parameters — Direct PostgreSQL configuration parameters passed to postgresql.conf.
- patroni — Configures Patroni behavior including TTL, loop wait, retry timeout, and synchronous replication mode.
- volume.storageClass — Maps to the platform-specific StorageClass (EBS gp3 on AWS, Premium SSD on Azure, SSD PD on GCP, Longhorn on k3s).
- enableShmVolume — Mounts a tmpfs at /dev/shm for PostgreSQL shared memory, critical for performance.
Connection Pooling with PgBouncer
PostgreSQL's process-per-connection model makes it expensive to handle large numbers of client connections. Each connection consumes approximately 10MB of RAM. PgBouncer solves this by multiplexing thousands of client connections over a small pool of actual PostgreSQL connections.
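The savings are easy to quantify with the ~10MB-per-backend figure above; the client count here is an arbitrary example, not a recommendation:

```shell
clients=2000      # client connections multiplexed by PgBouncer (example value)
pool_size=25      # actual PostgreSQL backends in the pool
per_conn_mb=10    # approximate memory per PostgreSQL backend
direct_mb=$((clients * per_conn_mb))   # every client gets its own backend
pooled_mb=$((pool_size * per_conn_mb)) # clients share the pool
echo "direct: ${direct_mb}MB, pooled: ${pooled_mb}MB"
# prints: direct: 20000MB, pooled: 250MB
```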
The Zalando operator natively supports PgBouncer deployment. When you set enableConnectionPooler: true in the CRD, the operator creates:
- A PgBouncer Deployment with configurable replica count
- A dedicated Service (<cluster-name>-pooler) for pooled connections
- Automatic credential synchronization with PostgreSQL
PgBouncer Configuration Modes
# PgBouncer connection pooler modes:
#
# transaction (recommended for most workloads)
# - Connection returned to pool after each transaction
# - Best balance of efficiency and compatibility
# - Cannot use session-level features (prepared statements, temp tables)
#
# session
# - Connection held for entire client session
# - Full PostgreSQL compatibility
# - Lower pooling efficiency
#
# statement
# - Connection returned after each statement
# - Most efficient but most restrictive
# - Only works with autocommit queries
# Custom PgBouncer configuration via operator
spec:
connectionPooler:
numberOfInstances: 3
mode: transaction
schema: pooler
user: pooler
defaultPoolSize: 25
maxDBConnections: 100
resources:
requests:
cpu: 250m
memory: 128Mi
limits:
cpu: "1"
memory: 256Mi
For applications that need prepared statements or session-level features, connect directly to the PostgreSQL services, bypassing PgBouncer, or use session pooling mode and accept the lower connection efficiency.
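In practice that choice is just a different Service DNS name. A sketch with hypothetical names, assuming the cluster pg-production-cluster in namespace databases:

```shell
# Pooled entry point (transaction mode: no prepared statements or temp tables)
pooled_host="pg-production-cluster-pooler.databases.svc.cluster.local"
# Direct entry point (full session-level features, one backend per client)
direct_host="pg-production-cluster.databases.svc.cluster.local"
# Example connections (not run here):
#   psql "host=$pooled_host port=5432 dbname=app_database user=app_user"
#   psql "host=$direct_host port=5432 dbname=app_database user=app_user"
```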
Multi-Region PostgreSQL Replication
For global applications that require low-latency reads from multiple geographic locations or disaster recovery across regions, multi-region replication is essential. The Zalando operator supports this through standby clusters that replicate from a primary cluster via streaming replication or WAL-G archives.
Streaming Replication Configuration
PostgreSQL streaming replication is the foundation of HA in the Zalando operator. It works by shipping Write-Ahead Log (WAL) records from the primary to replicas in near-real-time. The operator configures this automatically, but understanding the details helps with tuning and troubleshooting.
- Synchronous replication — The primary waits for at least one replica to confirm WAL receipt before committing a transaction. This guarantees zero data loss (RPO=0) but adds latency. Enable with patroni.synchronous_mode: true.
- Asynchronous replication — The primary commits immediately and ships WAL asynchronously. Slightly lower latency but potential data loss during failover. This is the default.
- Cascading replication — Replicas can replicate from other replicas instead of the primary, reducing load on the primary in large clusters.
# Enable synchronous replication for zero data loss
spec:
patroni:
synchronous_mode: true
synchronous_mode_strict: false # Allow async if no sync replica available
synchronous_node_count: 1 # Number of sync replicas required
postgresql:
parameters:
synchronous_commit: "on" # Matches Patroni synchronous_mode
max_wal_senders: "10" # Maximum WAL sender processes
wal_keep_size: "1GB" # WAL retention for replica catch-up
hot_standby: "on" # Allow queries on replicas
hot_standby_feedback: "on" # Reduce query conflicts on replicas
Backup and Recovery with WAL-G
WAL-G Backup Configuration
Proper backup configuration is critical for disaster recovery. The Zalando operator integrates WAL-G for continuous backup to object storage. Here is a complete configuration for S3-compatible storage:
# ConfigMap for WAL-G backup configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: databases
data:
# S3 backup configuration
AWS_ENDPOINT: "https://s3.us-east-1.amazonaws.com"
AWS_S3_FORCE_PATH_STYLE: "false"
AWS_REGION: "us-east-1"
WALG_S3_PREFIX: "s3://my-pg-backups/$(SCOPE)"
WALG_DISABLE_S3_SSE: "false"
WALG_S3_SSE: "aws:kms"
WALG_S3_SSE_KMS_ID: "arn:aws:kms:us-east-1:123456789:key/mrk-abcdef"
# Backup scheduling and retention
USE_WALG_BACKUP: "true"
USE_WALG_RESTORE: "true"
BACKUP_SCHEDULE: "0 1 * * *" # Daily at 1 AM UTC
BACKUP_NUM_TO_RETAIN: "14" # Keep 14 daily backups
# WAL archiving
WALG_COMPRESSION_METHOD: "zstd" # Better compression than lz4
WALG_DELTA_MAX_STEPS: "6" # Delta backups between full backups
WALG_UPLOAD_CONCURRENCY: "4" # Parallel upload streams
WALG_DOWNLOAD_CONCURRENCY: "4" # Parallel download for restore
WALG_UPLOAD_DISK_CONCURRENCY: "4" # Disk read concurrency
# Clone configuration
CLONE_AWS_ENDPOINT: "https://s3.us-east-1.amazonaws.com"
CLONE_AWS_REGION: "us-east-1"
CLONE_WALG_S3_PREFIX: "s3://my-pg-backups/$(CLONE_SCOPE)"
CLONE_METHOD: "CLONE_WITH_WALG"
CLONE_USE_WALG_RESTORE: "true"
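One subtlety worth internalizing: WALG_DELTA_MAX_STEPS caps the delta chain length, so a restore replays at most one full backup, a bounded run of deltas, and then WAL. A quick sketch of the resulting cadence under the daily schedule above:

```shell
delta_max_steps=6                    # from WALG_DELTA_MAX_STEPS above
cycle_days=$((delta_max_steps + 1))  # 1 full backup, then 6 deltas, then the next full
echo "full backup every ${cycle_days} days on the daily schedule"
# prints: full backup every 7 days on the daily schedule
```

Longer chains save storage and upload time but lengthen restores, since every delta in the chain must be fetched and applied.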
Point-in-Time Recovery (PITR)
PITR allows you to restore your database to any specific moment in time—critical for recovering from accidental data deletion or corruption. The Zalando operator supports PITR through the clone mechanism:
# Clone a cluster with PITR
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
name: pg-restored-cluster
namespace: databases
spec:
teamId: "platform"
volume:
size: 100Gi
storageClass: ebs-gp3-postgres
numberOfInstances: 3
postgresql:
version: "16"
clone:
cluster: "pg-production-cluster"
timestamp: "2026-04-11T14:30:00+00:00" # Restore to this exact moment
s3_wal_path: "s3://my-pg-backups/wal/16/pg-production-cluster"
s3_endpoint: "https://s3.us-east-1.amazonaws.com"
s3_access_key_id: "" # Use IAM role instead
s3_secret_access_key: "" # Use IAM role instead
When this CRD is applied, the operator performs the following steps:
- Finds the latest base backup before the target timestamp
- Restores the base backup to the new StatefulSet's primary pod
- Replays WAL segments up to the specified timestamp
- Opens the database for read-write operations
- Sets up streaming replication to the replica pods
Standby Cluster for Disaster Recovery
A standby cluster continuously replicates from a primary cluster, providing a warm standby that can be promoted during a disaster. This is different from replicas within a cluster—a standby cluster is a completely independent Kubernetes resource that can run in a different namespace, cluster, or even region.
# Standby cluster in a different region
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
name: pg-standby-eu
namespace: databases
spec:
teamId: "platform"
volume:
size: 100Gi
storageClass: managed-premium-postgres
numberOfInstances: 2
postgresql:
version: "16"
standby:
standby_host: "pg-production-cluster.databases.svc.cluster.local"
standby_port: "5432"
# Alternative: replicate from S3 WAL archive
# s3_wal_path: "s3://my-pg-backups/wal/16/pg-production-cluster"
enableConnectionPooler: true
enableReplicaConnectionPooler: true
To promote a standby cluster to an independent primary (during disaster recovery), simply remove the standby section from the CRD and apply:
# Edit the standby cluster CRD to remove standby section
kubectl patch postgresql pg-standby-eu -n databases --type json \
-p '[{"op": "remove", "path": "/spec/standby"}]'
# The operator will promote the standby to primary
# Update your application DNS/service mesh to point to the new primary
Monitoring with Prometheus and Grafana
Comprehensive monitoring is non-negotiable for production PostgreSQL deployments. The Zalando operator supports Prometheus metrics export via the postgres_exporter sidecar. Here is how to set up a complete monitoring stack:
ServiceMonitor for Prometheus
# ServiceMonitor for PostgreSQL metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: postgres-monitor
namespace: databases
labels:
team: platform
release: prometheus
spec:
selector:
matchLabels:
team: platform
namespaceSelector:
matchNames:
- databases
endpoints:
- port: exporter
interval: 15s
scrapeTimeout: 10s
path: /metrics
relabelings:
- sourceLabels: [__meta_kubernetes_pod_label_spilo_role]
targetLabel: role
- sourceLabels: [__meta_kubernetes_pod_label_cluster_name]
targetLabel: cluster
---
# PodMonitor alternative (scrapes pods directly)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: postgres-pod-monitor
namespace: databases
spec:
selector:
matchLabels:
application: spilo
podMetricsEndpoints:
- port: exporter
interval: 15s
path: /metrics
Key Metrics to Monitor
These are the most critical PostgreSQL metrics to track in your Grafana dashboards:
- pg_stat_replication_lag — Replication lag in bytes and seconds. Alert if lag exceeds your RPO threshold.
- pg_stat_activity_count — Active connections by state. Alert on connection pool exhaustion.
- pg_stat_database_tup_fetched/returned/inserted/updated/deleted — Query throughput metrics.
- pg_stat_bgwriter_buffers_checkpoint/clean/backend — Buffer management efficiency.
- pg_database_size_bytes — Database size growth over time for capacity planning.
- pg_locks_count — Lock contention. Alert on excessive waiting locks.
- pg_stat_statements_calls/mean_time — Query performance statistics for optimization.
- patroni_postgres_running — Patroni health status (1 = running, 0 = down).
- patroni_master — Which pod is the current primary (1 = master, 0 = replica).
- pg_up — Basic PostgreSQL availability probe.
Alerting Rules
# PrometheusRule for PostgreSQL alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: postgres-alerts
namespace: databases
spec:
groups:
- name: postgresql.rules
rules:
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL instance {{ $labels.instance }} is down"
- alert: PostgreSQLReplicationLag
expr: pg_stat_replication_pg_wal_lsn_diff > 100000000
for: 5m
labels:
severity: warning
annotations:
summary: "Replication lag is {{ $value }} bytes on {{ $labels.instance }}"
- alert: PostgreSQLHighConnections
expr: sum(pg_stat_activity_count) by (instance) > 180
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $value }} active connections on {{ $labels.instance }}"
- alert: PostgreSQLDeadlocks
expr: rate(pg_stat_database_deadlocks[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Deadlocks detected on {{ $labels.datname }}"
- alert: PatroniFailover
expr: changes(patroni_master[5m]) > 0
labels:
severity: critical
annotations:
summary: "Patroni failover occurred in cluster {{ $labels.cluster }}"
Production Tuning Guide
Proper tuning is essential to extract maximum performance from PostgreSQL on Kubernetes. The following parameters should be adjusted based on your pod resource limits and workload characteristics.
Memory Configuration
# For a pod with 16GB memory limit:
postgresql:
parameters:
# shared_buffers: 25% of total memory
shared_buffers: "4GB"
# effective_cache_size: 75% of total memory
# (tells planner how much OS cache to expect)
effective_cache_size: "12GB"
# work_mem: shared_buffers / (max_connections * 2)
# Conservative to prevent OOM
work_mem: "10MB"
# maintenance_work_mem: 5-10% of total memory
# Used for VACUUM, CREATE INDEX, ALTER TABLE
maintenance_work_mem: "1GB"
# temp_buffers: memory for temp tables per session
temp_buffers: "32MB"
# huge_pages: try to use huge pages (requires OS config)
huge_pages: "try"
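The rules of thumb in the comments can be checked mechanically. A sketch that derives the values above from the 16GB limit and max_connections of 200:

```shell
mem_gb=16            # pod memory limit in GB
max_connections=200
shared_buffers_gb=$((mem_gb / 4))        # 25% of total memory
effective_cache_gb=$((mem_gb * 3 / 4))   # 75% of total memory
# work_mem = shared_buffers / (max_connections * 2), converted to MB
work_mem_mb=$((shared_buffers_gb * 1024 / (max_connections * 2)))
echo "shared_buffers=${shared_buffers_gb}GB effective_cache_size=${effective_cache_gb}GB work_mem=${work_mem_mb}MB"
# prints: shared_buffers=4GB effective_cache_size=12GB work_mem=10MB
```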
WAL and Checkpoint Configuration
postgresql:
parameters:
# WAL settings
wal_buffers: "64MB" # 1/32 of shared_buffers, max 64MB
wal_compression: "zstd" # Compress WAL (PG 15+)
max_wal_size: "8GB" # Before forced checkpoint
min_wal_size: "2GB" # WAL disk reservation
wal_level: "replica" # Required for replication
# Checkpoint settings
checkpoint_completion_target: "0.9" # Spread I/O over 90% of interval
checkpoint_timeout: "15min" # Max time between checkpoints
Query Planner and I/O
postgresql:
parameters:
# Cost parameters for SSD storage
random_page_cost: "1.1" # SSD: close to seq_page_cost
seq_page_cost: "1.0" # Sequential I/O baseline
effective_io_concurrency: "200" # Concurrent I/O for SSD
# Planner behavior
default_statistics_target: "200" # More accurate statistics
from_collapse_limit: "12"            # JOIN planning threshold
join_collapse_limit: "12"            # JOIN planning threshold
# Parallel queries
max_parallel_workers_per_gather: "4"
max_parallel_workers: "8"
max_parallel_maintenance_workers: "4"
parallel_tuple_cost: "0.01"
parallel_setup_cost: "1000"
Connection and Logging
postgresql:
parameters:
# Connection limits
max_connections: "200" # Keep low, use PgBouncer
superuser_reserved_connections: "5" # Reserve for admin access
# Logging for troubleshooting
log_statement: "ddl" # Log DDL statements
log_min_duration_statement: "500" # Log queries > 500ms
log_checkpoints: "on" # Log checkpoint activity
log_connections: "off" # Too noisy in production
log_disconnections: "off" # Too noisy in production
log_lock_waits: "on" # Log lock waits
log_temp_files: "0" # Log all temp file usage
log_autovacuum_min_duration: "1000" # Log slow autovacuum
# Statement timeouts
statement_timeout: "60000" # 60 second query timeout
lock_timeout: "30000" # 30 second lock timeout
idle_in_transaction_session_timeout: "600000" # 10 min idle txn timeout
Rolling Updates and Version Upgrades
Minor Version Upgrades
Minor version upgrades (e.g., 16.2 to 16.3) are handled automatically by the operator when you update the Spilo image tag. The operator performs a rolling restart:
# Update the operator configuration to use a new Spilo image
helm upgrade postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator \
--set configGeneral.docker_image=ghcr.io/zalando/spilo-16:3.1-p1 \
--reuse-values
# The operator will perform rolling updates:
# 1. Restart replicas one at a time
# 2. Failover the primary to a freshly updated replica
# 3. Restart the old primary (now a replica)
Major Version Upgrades
Major version upgrades (e.g., PostgreSQL 15 to 16) require more careful planning. The Zalando operator supports in-place major upgrades using pg_upgrade:
# Step 1: Update the CRD to the new major version
spec:
postgresql:
version: "16" # Changed from "15"
# Step 2: The operator detects the version change and:
# a) Scales down the StatefulSet to 1 replica
# b) Runs pg_upgrade on the primary pod
# c) Scales back up to the desired numberOfInstances
# d) Replicas are rebuilt from the upgraded primary
# Step 3: Verify the upgrade
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "psql -c 'SELECT version()'"
# Step 4: Run ANALYZE to update statistics after upgrade
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "psql -c 'ANALYZE VERBOSE'"
Important considerations for major upgrades:
- Always take a fresh backup before upgrading
- Test the upgrade on a cloned cluster first
- Major upgrades require downtime (typically 5-30 minutes depending on database size)
- Review the PostgreSQL release notes for breaking changes
- Monitor replication lag closely after replica rebuild
- Run ANALYZE on all databases to regenerate query planner statistics
Advanced Operational Patterns
Logical Backups
In addition to physical WAL-G backups, the operator supports logical backups using pg_dump. Logical backups are useful for cross-version migrations and selective table restores:
# Enable logical backups in the operator configuration
helm upgrade postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator \
--set configLogicalBackup.logical_backup_schedule="0 3 * * *" \
--set configLogicalBackup.logical_backup_s3_bucket="my-pg-logical-backups" \
--set configLogicalBackup.logical_backup_s3_region="us-east-1" \
--set configLogicalBackup.logical_backup_s3_sse="AES256" \
--reuse-values
# The operator creates a CronJob for each cluster that:
# 1. Connects to the primary PostgreSQL instance
# 2. Runs pg_dumpall (or pg_dump per database)
# 3. Compresses and uploads to S3
Custom Pod Environment Variables
You can inject environment variables into Spilo pods using a ConfigMap. This is useful for configuring WAL-G, custom scripts, or tuning OS-level parameters:
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: databases
data:
# Custom Spilo configurations
SPILO_CONFIGURATION: |
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 33554432
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
archive_mode: "on"
archive_timeout: 1800s
# Enable pg_stat_statements
POSTGRESQL_SHARED_PRELOAD_LIBRARIES: "bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache"
# Cron jobs inside PostgreSQL
ENABLE_PG_CRON: "true"
Network Policies for Security
In production, restrict network access to PostgreSQL pods using Kubernetes NetworkPolicies:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: postgres-network-policy
namespace: databases
spec:
podSelector:
matchLabels:
application: spilo
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: app-namespace
- podSelector:
matchLabels:
app: backend
ports:
- protocol: TCP
port: 5432
- protocol: TCP
port: 8008 # Patroni REST API
- from:
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- protocol: TCP
port: 9187 # Prometheus exporter
egress:
- to:
- ipBlock:
cidr: 0.0.0.0/0
ports:
- protocol: TCP
port: 443 # S3/GCS/Azure for backups
Troubleshooting Common Issues
Even with automation, issues arise. Here are the most common problems and their solutions when running PostgreSQL with the Zalando operator:
1. Pod Stuck in Pending State
# Check events for the pod
kubectl describe pod pg-production-cluster-0 -n databases
# Common causes:
# - No nodes with matching tolerations/affinity
# - Insufficient CPU or memory on nodes
# - PVC cannot be provisioned (check StorageClass)
# Fix: Check node resources and storage availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
kubectl get pvc -n databases
2. Replication Lag Growing
# Check replication status on the primary
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "psql -c 'SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag FROM pg_stat_replication'"
# Common causes:
# - Replica CPU/IO saturation
# - Network bandwidth limits
# - Long-running queries on replica blocking WAL replay
# - Insufficient wal_keep_size
# Fix: Check replica resources and cancel blocking queries
kubectl exec -it pg-production-cluster-1 -n databases -- \
su postgres -c "psql -c \"SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '5 minutes'\""
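The LSN columns reported by pg_stat_replication can be turned into a byte figure with pg_wal_lsn_diff. As an illustration of the arithmetic, here is a small Python sketch (a hypothetical helper, not part of any tooling mentioned in this guide):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute byte position.

    The hex part before the slash is the high 32 bits of the WAL position,
    the part after the slash is the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Byte lag between the primary's current LSN and a replica's replayed LSN,
    mirroring SELECT pg_wal_lsn_diff(primary_lsn, replica_lsn)."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)

# 33554432 bytes (32 MiB) is the maximum_lag_on_failover value used earlier:
# a replica further behind than this is not eligible for promotion.
print(replication_lag_bytes("0/3000000", "0/1000000"))  # 33554432
```

Comparing this figure against your maximum_lag_on_failover setting tells you whether a lagging replica would still be promotable.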
3. Failover Not Triggering
# Check Patroni cluster status
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "patronictl list"
# Check Patroni logs
kubectl logs pg-production-cluster-0 -n databases -c postgres | grep -i patroni
# Manual failover (if auto-failover is stuck)
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "patronictl failover --candidate pg-production-cluster-1 --force"
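Patroni only promotes a replica once the leader key in the DCS has expired and the candidate's lag is within maximum_lag_on_failover (33554432 bytes in the configuration shown earlier). A conceptual Python sketch of that promotion gate (a simplification, not Patroni's actual implementation):

```python
def failover_candidate_eligible(leader_ttl_expired: bool,
                                replica_lag_bytes: int,
                                max_lag_bytes: int = 33554432) -> bool:
    """Conceptual model of Patroni's promotion gate (not the real code):
    a replica is promoted only after the leader lease has expired in the DCS
    AND its replication lag is within maximum_lag_on_failover."""
    if not leader_ttl_expired:
        return False  # a healthy leader still holds the lease
    return replica_lag_bytes <= max_lag_bytes

# A replica 64 MiB behind will NOT be promoted even with the leader gone,
# which is a common reason "failover is not triggering".
print(failover_candidate_eligible(True, 64 * 1024 * 1024))  # False
print(failover_candidate_eligible(True, 16 * 1024 * 1024))  # True
```

If patronictl list shows every replica beyond the lag threshold, fix the replication lag first; forcing a failover to a badly lagging candidate risks data loss.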
4. Backup Failures
# Check WAL-G backup status
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "envdir /run/etc/wal-e.d/env wal-g backup-list"
# Check backup CronJob logs
kubectl logs -l application=spilo,cluster-name=pg-production-cluster -n databases | grep -i wal-g
# Verify S3/GCS credentials
kubectl exec -it pg-production-cluster-0 -n databases -- \
su postgres -c "envdir /run/etc/wal-e.d/env aws s3 ls s3://my-pg-backups/"
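Beyond inspecting backup-list output by hand, it is worth alerting when the newest base backup is too old. A hypothetical Python helper for that check (the timestamps would come from parsing your wal-g output, whose exact column layout varies by version):

```python
from datetime import datetime, timedelta, timezone

def latest_backup_age(backup_times: list, now: datetime) -> timedelta:
    """Age of the most recent base backup."""
    return now - max(backup_times)

def backups_stale(backup_times: list, now: datetime,
                  max_age: timedelta = timedelta(hours=26)) -> bool:
    """True if no base backup completed within max_age.
    26h gives a daily '0 1 * * *' schedule a two-hour grace window."""
    if not backup_times:
        return True
    return latest_backup_age(backup_times, now) > max_age

now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
times = [datetime(2024, 5, 1, 1, 10, tzinfo=timezone.utc),
         datetime(2024, 5, 2, 1, 10, tzinfo=timezone.utc)]
print(backups_stale(times, now))  # False: latest backup is ~11h old
```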
Security Best Practices
Securing PostgreSQL on Kubernetes requires a defense-in-depth approach:
- TLS encryption — Enable SSL for all client connections. The operator can automatically provision certificates using cert-manager.
- Secret management — Use Kubernetes Secrets (or external secret managers like Vault) for database credentials. Never store passwords in ConfigMaps.
- RBAC — Limit operator ServiceAccount permissions. Use least-privilege access for application database users.
- NetworkPolicies — Restrict pod-to-pod communication as shown in the previous section.
- pg_hba.conf — Configure host-based authentication to restrict which IPs and users can connect.
- Audit logging — Enable the pgaudit extension for SQL audit logging in regulated environments.
- Encrypted storage — Use encrypted StorageClasses (EBS encryption, Azure Disk encryption, etc.).
- Pod Security Standards — Run Spilo pods as non-root with restricted security contexts.
# Security-hardened CRD settings
spec:
  spiloRunAsUser: 101
  spiloRunAsGroup: 103
  spiloFSGroup: 103
  enableShmVolume: true
  patroni:
    pg_hba:
      - hostssl all all 0.0.0.0/0 scram-sha-256
      - host replication standby all scram-sha-256
      - hostssl replication standby all scram-sha-256
  postgresql:
    parameters:
      ssl: "on"
      ssl_min_protocol_version: "TLSv1.3"
      password_encryption: "scram-sha-256"
  additionalVolumes:
    - name: postgres-tls
      mountPath: /tls
      volumeSource:
        secret:
          secretName: pg-tls-cert
          defaultMode: 0640
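Note that pg_hba rules are evaluated top-down and the first matching line wins, so rule order matters: a broad rule placed above a stricter one effectively hides it. A toy Python matcher illustrating first-match-wins (real pg_hba matching also considers connection type and client address, which this sketch ignores):

```python
def first_matching_rule(rules, database, user):
    """Return the auth method of the first rule matching database and user.
    Each rule is (type, database, user, address, method); 'all' matches anything.
    Real pg_hba.conf matching also checks connection type and client CIDR."""
    for _conn_type, rule_db, rule_user, _addr, method in rules:
        if rule_db in ("all", database) and rule_user in ("all", user):
            return method
    return None  # no rule matched: PostgreSQL rejects the connection

rules = [
    ("hostssl", "all", "all", "0.0.0.0/0", "scram-sha-256"),
    ("host", "replication", "standby", "all", "scram-sha-256"),
]
print(first_matching_rule(rules, "app_db", "app_user"))  # scram-sha-256
```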
Capacity Planning and Sizing
Proper sizing ensures stable performance and cost efficiency. Use these guidelines as a starting point and adjust based on your workload monitoring data:
| Workload | CPU | Memory | Storage | Instances |
|---|---|---|---|---|
| Development | 500m | 1Gi | 10Gi | 1 |
| Small Production | 2 cores | 8Gi | 50Gi SSD | 3 |
| Medium Production | 4 cores | 16Gi | 200Gi SSD | 3 |
| Large Production | 8 cores | 32Gi | 500Gi SSD | 5 |
| Enterprise / Analytics | 16+ cores | 64Gi+ | 1Ti+ SSD | 5+ |
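The memory column maps onto PostgreSQL parameters via common rules of thumb: roughly 25% of pod memory for shared_buffers and about 75% for effective_cache_size. A small Python sketch of that heuristic (starting values only; tune from your monitoring data):

```python
def suggest_memory_params(pod_memory_gib: int) -> dict:
    """Rule-of-thumb starting values for a dedicated PostgreSQL pod:
    shared_buffers ~25% of pod memory, effective_cache_size ~75%."""
    return {
        "shared_buffers": f"{pod_memory_gib // 4}GB",
        "effective_cache_size": f"{pod_memory_gib * 3 // 4}GB",
    }

# The "Small Production" row (8Gi memory) yields shared_buffers 2GB and
# effective_cache_size 6GB, matching the CRD examples in this guide.
print(suggest_memory_params(8))  # {'shared_buffers': '2GB', 'effective_cache_size': '6GB'}
```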
Complete End-to-End Deployment Example
Let us put everything together with a complete deployment from scratch on a fresh Kubernetes cluster:
# Step 1: Create namespace and configure storage
kubectl create namespace databases
kubectl create namespace postgres-operator
# Step 2: Install the Zalando Postgres Operator
helm repo add postgres-operator-charts \
https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm repo update
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-operator \
--set configKubernetes.enable_pod_antiaffinity=true \
--set configKubernetes.pod_environment_configmap=databases/postgres-pod-config
# Step 3: Create backup configuration
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: databases
data:
  USE_WALG_BACKUP: "true"
  USE_WALG_RESTORE: "true"
  WALG_S3_PREFIX: "s3://my-pg-backups/\$(SCOPE)"
  BACKUP_SCHEDULE: "0 1 * * *"
  BACKUP_NUM_TO_RETAIN: "14"
EOF
# Step 4: Deploy the PostgreSQL cluster
kubectl apply -f - <<EOF
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  # The operator requires the cluster name to start with the teamId prefix
  name: pg-app-cluster
  namespace: databases
  labels:
    team: pg
spec:
  teamId: "pg"
  volume:
    size: 50Gi
  numberOfInstances: 3
  enableConnectionPooler: true
  enableReplicaConnectionPooler: true
  users:
    app_user:
      - superuser
      - createdb
  databases:
    app_db: app_user
  postgresql:
    version: "16"
    parameters:
      shared_buffers: "2GB"
      work_mem: "64MB"
      effective_cache_size: "6GB"
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 16Gi
EOF
# Step 5: Wait for cluster readiness
# The postgresql CRD reports its state in .status.PostgresClusterStatus
kubectl wait --for=jsonpath='{.status.PostgresClusterStatus}'=Running \
  postgresql/pg-app-cluster -n databases --timeout=300s
# Step 6: Verify cluster status
kubectl get postgresql -n databases
kubectl get pods -n databases -l cluster-name=pg-app-cluster
kubectl get svc -n databases -l cluster-name=pg-app-cluster
# Step 7: Get connection credentials
export PGPASSWORD=$(kubectl get secret app-user.pg-app-cluster.credentials.postgresql.acid.zalan.do \
-n databases -o jsonpath='{.data.password}' | base64 -d)
# Step 8: Connect and verify
kubectl run pg-client --rm -it --image=postgres:16 -n databases \
  --env="PGPASSWORD=$PGPASSWORD" -- \
  psql -h pg-app-cluster-pooler -U app_user -d app_db -c "SELECT version();"
Conclusion
The Zalando Postgres Operator transforms PostgreSQL on Kubernetes from a complex operational challenge into a manageable, automated deployment. By leveraging Patroni for consensus-based failover, Spilo for a batteries-included container image, WAL-G for continuous backup and recovery, and PgBouncer for efficient connection pooling, you get a production-grade database platform that works consistently across AWS EKS, Azure AKS, Google GKE, and bare-metal k3s clusters.
The key takeaways from this guide are:
- Automate everything — Let the operator handle StatefulSet management, failover, and backup scheduling. Manual intervention should be the exception.
- Monitor aggressively — Deploy Prometheus and Grafana from day one. Replication lag, connection counts, and lock contention are your early warning signals.
- Plan for disaster — Configure WAL-G backups, test PITR regularly, and maintain a standby cluster for critical workloads.
- Tune for your workload — Default PostgreSQL parameters are conservative. Adjust shared_buffers, work_mem, and checkpoint settings based on your resource allocation and query patterns.
- Secure by default — Enable TLS, use SCRAM-SHA-256 authentication, restrict network access, and encrypt storage at rest.
- Test upgrades — Always clone your cluster and test major version upgrades before applying them to production.
With this comprehensive foundation, you are well-equipped to deploy and operate highly available PostgreSQL clusters on any Kubernetes platform, serving applications that demand reliability, performance, and data integrity at scale.