PostgreSQL Native High Availability: Patroni, Streaming Replication, and Production Failover Strategies
Running a single PostgreSQL instance is straightforward until the first unplanned outage reminds you that a database is only as valuable as its availability. Disk failures, kernel panics, network partitions, and botched upgrades are not theoretical risks — they are operational certainties on a long enough timeline. PostgreSQL does not ship with built-in automatic failover, but it does provide all the replication primitives needed to build a highly available cluster. Patroni, an open-source HA framework maintained by Zalando, orchestrates those primitives into a production-grade failover system that has been battle-tested at scale across thousands of PostgreSQL clusters worldwide.
This article is a deep-dive engineering guide. We will cover PostgreSQL streaming replication (synchronous and asynchronous), Patroni architecture and configuration, etcd as a distributed configuration store, HAProxy for connection routing with read-write splitting, PgBouncer for connection pooling, WAL archiving and point-in-time recovery, logical replication for selective data synchronisation, pg_basebackup for initial standby provisioning, repmgr as an alternative to Patroni, cloud-specific deployment patterns for AWS, Azure, and GCP, bare metal k3s deployment with Rancher and Longhorn, monitoring with pg_stat_replication and Prometheus/Grafana, switchover versus failover procedures, split-brain prevention, production tuning, and chaos engineering for failover validation.
PostgreSQL Streaming Replication Fundamentals
Streaming replication is the backbone of PostgreSQL high availability. It works by continuously shipping Write-Ahead Log (WAL) records from a primary server to one or more standby servers. The standby applies those WAL records in real time, maintaining a near-identical copy of the primary's data. This mechanism was introduced in PostgreSQL 9.0 and has been refined in every subsequent release.
There are two modes of streaming replication: asynchronous and synchronous. In asynchronous mode, the primary does not wait for standbys to confirm receipt of WAL records before committing a transaction. This gives maximum write performance but introduces a window of potential data loss — if the primary fails before a standby has received the most recent WAL, those transactions are lost. In synchronous mode, the primary waits for at least one standby to confirm that WAL records have been written to durable storage before reporting a transaction as committed. This eliminates data loss at the cost of increased commit latency, since every write must round-trip to a standby.
The choice between synchronous and asynchronous replication is not binary. PostgreSQL supports synchronous_commit at the session level, so latency-sensitive workloads can opt into asynchronous commits while critical financial transactions use synchronous commits within the same cluster.
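The data-loss window under asynchronous replication is simply the number of WAL bytes the standby has not yet replayed. A minimal sketch of the arithmetic behind PostgreSQL's pg_wal_lsn_diff, using the textual LSN format ('XXXXXXXX/YYYYYYYY', two hex words forming a 64-bit position):

```python
# Sketch: quantify the async data-loss window as the byte distance
# between two WAL positions (what pg_wal_lsn_diff computes in SQL).
# Function names here are illustrative, not a PostgreSQL API.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN like '0/3000060' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """WAL bytes the standby has not yet replayed: the data at risk if
    the primary is lost right now under asynchronous replication."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_replay_lsn)

# Primary at 1/0, standby replayed up to 0/FFFFF000
print(replication_lag_bytes("1/0", "0/FFFFF000"))  # 4096
```

The same arithmetic underpins the monitoring queries and the maximum_lag_on_failover threshold discussed later.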
Configuring the Primary for Replication
The primary server must be configured to generate WAL records at a level sufficient for replication and to allow standby connections. The following postgresql.conf settings are essential.
# postgresql.conf on the primary
wal_level = replica # minimum for streaming replication
max_wal_senders = 10 # max concurrent replication connections
max_replication_slots = 10 # prevent WAL removal before standby consumption
wal_keep_size = 2GB # retain WAL as fallback if slots are unused
hot_standby = on # allow read queries on standbys
synchronous_commit = on # 'on' waits for sync standbys; 'local' waits only for the local flush
synchronous_standby_names = 'ANY 1 (standby1, standby2)' # sync replication targets
archive_mode = on # enable WAL archiving for PITR
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
listen_addresses = '*'
port = 5432
Authentication for replication connections is handled in pg_hba.conf. Replication connections use a dedicated connection type.
# pg_hba.conf — replication entries
# TYPE DATABASE USER ADDRESS METHOD
host replication replicator 10.0.1.0/24 scram-sha-256
host replication replicator 10.0.2.0/24 scram-sha-256
host all all 10.0.0.0/16 scram-sha-256
Create the replication user on the primary.
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'strong_replication_password';
Provisioning a Standby with pg_basebackup
The pg_basebackup utility creates a physical copy of the primary's data directory, which becomes the starting point for a new standby. It handles the base backup and WAL streaming atomically, so the resulting copy is consistent.
# On the standby server
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/16/main \
-Fp -Xs -P -R -S standby1_slot
# -Fp: plain format
# -Xs: stream WAL during backup
# -P: show progress
# -R: create standby.signal and configure primary_conninfo in postgresql.auto.conf
# -S: use the named replication slot (written as primary_slot_name when -R is given)
The -R flag is critical — it writes primary_conninfo into postgresql.auto.conf and creates standby.signal, which tells PostgreSQL to start in standby mode. In PostgreSQL 12 and later, recovery.conf is replaced by these two mechanisms.
# postgresql.auto.conf (generated by pg_basebackup -R)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=strong_replication_password application_name=standby1'
primary_slot_name = 'standby1_slot'
Create the replication slots on the primary before running pg_basebackup, to prevent WAL from being cleaned up before the standby can consume it.
SELECT pg_create_physical_replication_slot('standby1_slot');
SELECT pg_create_physical_replication_slot('standby2_slot');
Patroni: Automated HA Orchestration
Streaming replication gives you data redundancy, but it does not give you automatic failover. If the primary crashes, someone — a human operator or an automation system — must promote a standby to primary, reconfigure the remaining standbys to follow the new primary, and update connection routing. Patroni automates all of this.
Patroni is a Python daemon that runs alongside each PostgreSQL instance. It uses a Distributed Configuration Store (DCS) — typically etcd, but also ZooKeeper or Consul — to coordinate leader election and cluster state. Every Patroni node continuously writes its health status to the DCS. When the leader (primary) fails to renew its DCS key within the configured TTL, Patroni initiates a leader election among the healthy standbys. The winner is promoted to primary, and the remaining nodes reconfigure themselves as standbys of the new primary — all automatically, typically within 10-30 seconds.
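Patroni's maximum_lag_on_failover setting constrains which standbys may be promoted. A simplified model of that filter (an illustration, not Patroni's actual election code, and the function name is hypothetical):

```python
# Simplified model of the promotion filter: among surviving standbys,
# only those within maximum_lag_on_failover bytes of the failed
# primary are eligible, and the least-lagged eligible node is preferred.
# Patroni's real election also involves DCS races and node tags.

def pick_promotion_candidate(standby_lag_bytes, maximum_lag_on_failover=1_048_576):
    eligible = {name: lag for name, lag in standby_lag_bytes.items()
                if lag <= maximum_lag_on_failover}
    if not eligible:
        return None  # no safe candidate: refuse to fail over rather than lose data
    return min(eligible, key=eligible.get)

# node3 is excluded because its lag exceeds the 1MB threshold
print(pick_promotion_candidate({"node2": 2048, "node3": 5_000_000}))  # node2
```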
Patroni YAML Configuration
Patroni is configured via a YAML file that defines the DCS connection, PostgreSQL parameters, replication behaviour, and bootstrap settings. The following is a production-grade configuration for the primary node.
# /etc/patroni/patroni.yml — Node 1 (Primary)
scope: pg-ha-cluster
namespace: /postgresql-ha/
name: node1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.1.10:8008
etcd3:
hosts:
- 10.0.2.10:2379
- 10.0.2.11:2379
- 10.0.2.12:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB — only promote standbys within this lag
synchronous_mode: true
synchronous_mode_strict: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: 'on'
max_connections: 200
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 2GB
synchronous_commit: 'on'
archive_mode: 'on'
archive_command: 'test ! -f /archive/%f && cp %p /archive/%f'
archive_timeout: 60
wal_log_hints: 'on'
shared_preload_libraries: 'pg_stat_statements'
track_commit_timestamp: 'on'
pg_hba:
- host replication replicator 10.0.0.0/16 scram-sha-256
- host all all 10.0.0.0/16 scram-sha-256
- host all all 0.0.0.0/0 scram-sha-256 # tighten or remove this catch-all in production
initdb:
- encoding: UTF8
- data-checksums
users:
admin:
password: 'admin_secure_password'
options:
- createrole
- createdb
replicator:
password: 'repl_secure_password'
options:
- replication
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.1.10:5432
data_dir: /var/lib/postgresql/16/main
bin_dir: /usr/lib/postgresql/16/bin
config_dir: /var/lib/postgresql/16/main
pgpass: /tmp/pgpass0
authentication:
superuser:
username: postgres
password: 'postgres_secure_password'
replication:
username: replicator
password: 'repl_secure_password'
rewind:
username: postgres
password: 'postgres_secure_password'
parameters:
unix_socket_directories: '/var/run/postgresql'
create_replica_methods:
- basebackup
basebackup:
max-rate: 100M
checkpoint: fast
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
The standby nodes use an identical configuration with their own name, connect_address, and listen values. Patroni handles the rest — it detects whether a node should be the leader or a replica based on the DCS state and configures PostgreSQL accordingly.
etcd as Distributed Configuration Store
etcd is the nervous system of a Patroni cluster. It stores the current leader identity, the cluster topology, the desired configuration, and the health status of each node. A three-node etcd cluster is the minimum for production, tolerating one node failure while maintaining quorum.
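The three-node minimum follows directly from quorum arithmetic: a majority of voting members must agree, so adding a fourth node buys no extra fault tolerance. A quick sketch of that rule (function names are illustrative):

```python
# Quorum arithmetic for an etcd cluster of n voting members:
# a majority (n // 2 + 1) must agree for writes to proceed,
# so the cluster survives n - quorum member failures.

def quorum(n: int) -> int:
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return n - quorum(n)

for n in (1, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
# 3 nodes tolerate 1 failure; 4 nodes still only tolerate 1; 5 tolerate 2
```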
# Install and configure etcd on three dedicated nodes
# /etc/etcd/etcd.conf.yml — Node etcd1 (10.0.2.10)
name: etcd1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
listen-peer-urls: http://0.0.0.0:2380
advertise-client-urls: http://10.0.2.10:2379
initial-advertise-peer-urls: http://10.0.2.10:2380
initial-cluster: etcd1=http://10.0.2.10:2380,etcd2=http://10.0.2.11:2380,etcd3=http://10.0.2.12:2380
initial-cluster-state: new
initial-cluster-token: patroni-etcd-cluster
# Start etcd
systemctl enable --now etcd
# Verify cluster health
etcdctl endpoint health --cluster \
--endpoints=http://10.0.2.10:2379,http://10.0.2.11:2379,http://10.0.2.12:2379
For production deployments, enable TLS between etcd peers and between etcd and Patroni clients. Unencrypted etcd traffic exposes cluster credentials and configuration to network-level attackers.
HAProxy for Connection Routing
Patroni exposes a REST API on each node (port 8008 by default) that reports whether the node is the current leader or a replica. HAProxy uses these health check endpoints to route traffic: writes go to the leader, reads go to healthy replicas. This gives you automatic read-write splitting without any application-level changes.
# /etc/haproxy/haproxy.cfg
global
maxconn 2000
log /dev/log local0
stats socket /var/run/haproxy.sock mode 660 level admin
defaults
mode tcp
log global
retries 3
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
maxconn 1000
listen pg_write
bind *:5000
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
listen pg_read
bind *:5001
balance roundrobin
option httpchk GET /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
listen stats
bind *:7000
mode http
stats enable
stats uri /
stats refresh 10s
The /primary endpoint returns HTTP 200 only on the current Patroni leader. The /replica endpoint returns 200 on healthy standbys. When a failover occurs, the new primary starts returning 200 on /primary, and HAProxy automatically redirects write traffic — typically within a single health check interval (3 seconds). The on-marked-down shutdown-sessions directive immediately terminates existing connections to a failed primary, forcing clients to reconnect to the new leader.
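The routing decision HAProxy derives from these checks reduces to a few lines. A sketch of that logic under the stated endpoint semantics (the function name and data shape are illustrative):

```python
# Routing decision driven purely by the HTTP status each node returns
# on the Patroni REST endpoints: /primary answers 200 only on the
# leader, /replica answers 200 only on healthy standbys.

def route(health):
    """health maps node -> {'/primary': status, '/replica': status}.
    Returns (write_targets, read_targets)."""
    writers = [n for n, h in health.items() if h.get("/primary") == 200]
    readers = [n for n, h in health.items() if h.get("/replica") == 200]
    return writers, readers

health = {
    "node1": {"/primary": 200, "/replica": 503},
    "node2": {"/primary": 503, "/replica": 200},
    "node3": {"/primary": 503, "/replica": 200},
}
print(route(health))  # (['node1'], ['node2', 'node3'])
```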
PgBouncer for Connection Pooling
PostgreSQL creates a new backend process for each client connection. At scale — hundreds or thousands of microservices each maintaining connection pools — the overhead of process creation and memory consumption becomes significant. PgBouncer sits between the application and PostgreSQL, maintaining a pool of server-side connections and multiplexing client connections onto them.
# /etc/pgbouncer/pgbouncer.ini
[databases]
* = host=127.0.0.1 port=5432 dbname=appdb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 3
server_lifetime = 3600
server_idle_timeout = 600
server_connect_timeout = 5
server_login_retry = 3
log_connections = 1
log_disconnections = 1
stats_period = 60
When used with Patroni, PgBouncer is typically co-located on each PostgreSQL node or on the HAProxy nodes. The transaction pool mode is the best choice for most workloads — it assigns a server connection for the duration of a transaction and returns it to the pool between transactions. This is far more efficient than session mode, which holds a connection for the entire client session.
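The sizing intuition behind default_pool_size can be made concrete with a Little's-law style estimate: under transaction pooling, required server connections scale with the fraction of time each client actually spends inside a transaction, not with raw client count. A back-of-envelope sketch (the workload numbers are assumptions, not measurements):

```python
# Back-of-envelope sizing for transaction pooling: a server connection
# is held only while a transaction runs, so server connections needed
# ~= clients x (time fraction each client spends in a transaction).

import math

def server_connections_needed(clients, avg_txn_ms, txns_per_client_per_s):
    busy_fraction = (avg_txn_ms / 1000.0) * txns_per_client_per_s
    return math.ceil(clients * busy_fraction)

# 1000 clients, 5 ms transactions, 10 txn/s each -> ~50 connections,
# which is the order of magnitude behind default_pool_size = 50 above.
print(server_connections_needed(1000, 5.0, 10.0))  # 50
```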
WAL Archiving and Point-in-Time Recovery
Streaming replication protects against server failure, but it does not protect against logical errors — an accidental DROP TABLE or a bad application migration is replicated to all standbys immediately. WAL archiving combined with point-in-time recovery (PITR) lets you restore to any moment before the error occurred.
WAL archiving copies completed WAL segments to a durable archive — typically an S3 bucket, an NFS mount, or a dedicated backup server. Tools like pgBackRest and WAL-G handle archiving efficiently with compression, encryption, and parallel transfer.
# pgBackRest configuration — /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=pg-wal-archive
repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
repo1-s3-region=eu-west-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=strong_encryption_passphrase
compress-type=zst
compress-level=3
process-max=4
[pg-ha-cluster]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
# In postgresql.conf
archive_command = 'pgbackrest --stanza=pg-ha-cluster archive-push %p'
restore_command = 'pgbackrest --stanza=pg-ha-cluster archive-get %f "%p"'
To perform a point-in-time recovery, specify a target timestamp.
# Restore to a specific point in time
pgbackrest --stanza=pg-ha-cluster --type=time \
--target="2026-04-12 11:25:00" \
--target-action=promote \
restore
Logical Replication for Selective Data Sync
While streaming replication creates an exact physical copy of the entire database cluster, logical replication operates at the table level, replicating individual tables or subsets of data between independent PostgreSQL instances. This is useful for zero-downtime major version upgrades, cross-region read replicas that only need specific tables, data warehouse feeding, and multi-tenant data distribution.
# On the publisher (source database)
wal_level = logical # must be 'logical' — higher than 'replica'
CREATE PUBLICATION app_pub FOR TABLE orders, customers, products;
# On the subscriber (target database)
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=publisher-host port=5432 dbname=appdb user=replicator password=pass'
PUBLICATION app_pub;
Logical replication can run alongside streaming replication. A common pattern is to use streaming replication for HA (fast physical failover) and logical replication for cross-region analytics replicas that only need a subset of tables.
Multi-Region Streaming Replication
For disaster recovery and global read performance, PostgreSQL clusters can span multiple regions. The standard pattern is synchronous replication within a region (for zero data loss on local failover) and asynchronous replication across regions (to avoid cross-region latency penalties on every write). Each region has its own HAProxy for local read routing.
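The latency reasoning behind this hybrid topology can be sketched with a simple model: with synchronous_standby_names = 'ANY k (...)', a commit waits for the k-th fastest acknowledgement among the sync standbys, while async cross-region replicas add nothing to commit time. This ignores WAL transfer and flush overhead, so treat it as an approximation:

```python
# Rough commit-latency model for 'ANY k' synchronous replication:
# the commit waits for the k-th smallest RTT among sync standbys.
# Cross-region standbys kept asynchronous contribute nothing here.

def commit_latency_ms(sync_standby_rtts_ms, k):
    """Added commit latency: the k-th fastest sync-standby acknowledgement."""
    if k > len(sync_standby_rtts_ms):
        raise ValueError("not enough sync standbys to satisfy ANY k")
    return sorted(sync_standby_rtts_ms)[k - 1]

# Two in-region sync standbys (~1 ms); an 80 ms cross-region replica is
# deliberately excluded from the sync set, so it never gates commits.
print(commit_latency_ms([0.8, 1.2], k=1))  # 0.8
```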
Failover and Switchover Procedures
Understanding the difference between failover and switchover is critical. A failover is an unplanned promotion triggered by the failure of the current primary. A switchover is a planned, graceful role change — typically performed before maintenance. Patroni supports both.
Planned Switchover
# List cluster members
patronictl -c /etc/patroni/patroni.yml list
# Perform switchover to a specific node
patronictl -c /etc/patroni/patroni.yml switchover \
--master node1 --candidate node2 --force
# Or use the Patroni REST API
curl -s http://10.0.1.10:8008/switchover -XPOST \
-d '{"leader": "node1", "candidate": "node2"}'
During a switchover, Patroni demotes the current primary to a standby, promotes the target candidate, and reconfigures all other standbys to follow the new primary. The process takes 5-15 seconds. HAProxy detects the change through health checks and redirects traffic automatically.
Automatic Failover Sequence
When the primary fails unexpectedly, Patroni follows a precise sequence to restore service: the leader key in the DCS expires once the failed primary stops renewing it, the surviving standbys detect the expiry, each eligible standby compares its WAL position against the others, the best candidate acquires the leader lock and promotes itself, and the remaining standbys reconfigure to follow the new primary.
Split-Brain Prevention
Split-brain — where two nodes simultaneously believe they are the primary — is the most dangerous failure mode in any HA system. Patroni prevents split-brain through several mechanisms:
- DCS-based leader lock: Only one node can hold the leader key in etcd at any time. The key has a TTL, and the leader must renew it continuously. If a network partition isolates the leader from etcd, the key expires, and the leader demotes itself.
- Watchdog: Patroni can configure a Linux watchdog device (/dev/watchdog). If Patroni loses access to the DCS and cannot confirm it should remain leader, the watchdog will reboot or power-off the node — a hard fencing mechanism that guarantees the old primary does not continue accepting writes.
- pg_rewind: When a former primary comes back online, it may have WAL records that were never replicated. pg_rewind rewinds the timeline to the point of divergence, allowing the node to rejoin as a standby without a full base backup. Patroni's use_pg_rewind: true setting automates this.
# Enable watchdog in Patroni config
bootstrap:
dcs:
postgresql:
use_pg_rewind: true
parameters:
wal_log_hints: 'on' # required for pg_rewind
# Watchdog configuration
watchdog:
mode: required # 'off', 'automatic', or 'required'
device: /dev/watchdog
safety_margin: 5 # seconds before TTL expiry to trigger watchdog
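The DCS leader lock behaves like a TTL-bound lease: the leader must keep renewing a key, and once renewal stops (crash, or partition from etcd) the key expires and another node may take over. A minimal sketch of that lifecycle (a toy model, not etcd's actual lease API):

```python
# Toy model of the DCS leader lock: a lease that must be renewed
# within its TTL, otherwise any node may acquire it.

class LeaderLease:
    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node, now):
        expired = now >= self.expires_at
        if self.holder in (None, node) or expired:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds an unexpired lease

lease = LeaderLease(ttl=30)
assert lease.acquire_or_renew("node1", now=0)       # node1 becomes leader
assert not lease.acquire_or_renew("node2", now=10)  # lease still held by node1
assert lease.acquire_or_renew("node2", now=31)      # TTL expired: takeover
```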
repmgr as an Alternative to Patroni
repmgr is another popular HA tool for PostgreSQL. It provides standby management, automatic failover, and switchover capabilities. However, it takes a fundamentally different approach from Patroni. repmgr uses a witness node and a daemon (repmgrd) for failure detection rather than a distributed consensus store. This makes it simpler to deploy but more susceptible to split-brain in complex network partition scenarios.
# repmgr.conf on the primary
node_id=1
node_name='node1'
conninfo='host=10.0.1.10 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/16/main'
failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=5
reconnect_attempts=6
reconnect_interval=10
For new deployments, Patroni is the recommended choice due to its stronger split-brain prevention guarantees and more active development community. repmgr remains a reasonable option for simpler setups or organisations already invested in the tool.
Cloud Deployment Patterns
AWS Deployment: EC2, EBS, and Route53
On AWS, deploy each PostgreSQL + Patroni node on an EC2 instance with EBS gp3 or io2 volumes. Use separate instances across multiple Availability Zones for HA. etcd nodes should also span AZs.
# Terraform sketch for PostgreSQL HA on AWS
resource "aws_instance" "pg_node" {
count = 3
ami = "ami-0abcdef1234567890" # Ubuntu 22.04
instance_type = "r6g.2xlarge" # 8 vCPU, 64GB RAM
subnet_id = aws_subnet.private[count.index].id
vpc_security_group_ids = [aws_security_group.pg_sg.id]
availability_zone = element(["eu-west-1a", "eu-west-1b", "eu-west-1c"], count.index)
root_block_device {
volume_size = 50
volume_type = "gp3"
}
tags = {
Name = "pg-node-${count.index + 1}"
Role = "patroni"
}
}
resource "aws_ebs_volume" "pg_data" {
count = 3
availability_zone = element(["eu-west-1a", "eu-west-1b", "eu-west-1c"], count.index)
size = 500
type = "gp3"
iops = 6000
throughput = 250
encrypted = true
tags = {
Name = "pg-data-${count.index + 1}"
}
}
resource "aws_route53_health_check" "pg_primary" {
count = 3
ip_address = aws_instance.pg_node[count.index].private_ip
port = 8008
type = "HTTP"
resource_path = "/primary"
failure_threshold = 3
request_interval = 10
}
Use a Network Load Balancer (NLB) instead of HAProxy if you prefer an AWS-managed solution. The NLB can use target group health checks against the Patroni REST API to route traffic to the current primary.
Azure Deployment: VMs, Managed Disks, and Azure LB
On Azure, use Standard_E8s_v5 VMs (memory-optimised) with Premium SSD Managed Disks for data volumes. Deploy across Availability Zones. Azure Load Balancer provides the equivalent of HAProxy with health probes against the Patroni REST API.
# Azure CLI — create PostgreSQL VM with Managed Disk
az vm create \
--resource-group pg-ha-rg \
--name pg-node-1 \
--image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
--size Standard_E8s_v5 \
--zone 1 \
--vnet-name pg-vnet \
--subnet pg-subnet \
--nsg pg-nsg \
--admin-username pgadmin \
--ssh-key-value ~/.ssh/id_rsa.pub
az disk create \
--resource-group pg-ha-rg \
--name pg-data-1 \
--size-gb 512 \
--sku Premium_LRS \
--zone 1
az vm disk attach \
--resource-group pg-ha-rg \
--vm-name pg-node-1 \
--name pg-data-1
# Azure Load Balancer health probe for Patroni
az network lb probe create \
--resource-group pg-ha-rg \
--lb-name pg-lb \
--name patroni-primary-probe \
--protocol Http \
--port 8008 \
--path /primary \
--interval 5 \
--threshold 3
GCP Deployment: Compute Engine and Cloud Load Balancing
On GCP, use n2-highmem-8 instances (8 vCPU, 64GB RAM) with SSD Persistent Disks. Distribute across zones within a region. Use an internal TCP/UDP load balancer with Patroni health checks.
# GCP — create instance and persistent disk
gcloud compute instances create pg-node-1 \
--zone=europe-west1-b \
--machine-type=n2-highmem-8 \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--boot-disk-size=50GB \
--network=pg-network \
--subnet=pg-subnet
gcloud compute disks create pg-data-1 \
--zone=europe-west1-b \
--size=500GB \
--type=pd-ssd
gcloud compute instances attach-disk pg-node-1 \
--disk=pg-data-1 \
--zone=europe-west1-b
# Health check for Patroni primary endpoint
gcloud compute health-checks create http patroni-primary-check \
--port=8008 \
--request-path=/primary \
--check-interval=5s \
--timeout=5s \
--unhealthy-threshold=3 \
--healthy-threshold=2
Bare Metal k3s Deployment with Rancher and Longhorn
For organisations that operate their own hardware, deploying PostgreSQL HA on bare metal with k3s, Rancher, and Longhorn provides a fully open-source, cloud-independent stack. k3s is a lightweight Kubernetes distribution that runs efficiently on bare metal servers without the overhead of a full upstream Kubernetes installation.
k3s and Longhorn Setup
# Install k3s on the first server node
curl -sfL https://get.k3s.io | K3S_TOKEN=my-cluster-token \
INSTALL_K3S_EXEC="server --cluster-init --disable traefik --disable servicelb" sh -
# Join additional server nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=my-cluster-token \
K3S_URL=https://10.0.0.1:6443 \
INSTALL_K3S_EXEC="server" sh -
# Install Longhorn for distributed block storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system --create-namespace \
--set defaultSettings.defaultDataPath=/mnt/longhorn \
--set defaultSettings.replicaCount=3 \
--set defaultSettings.storageMinimalAvailablePercentage=15
# Deploy PostgreSQL with Patroni using the Zalando Postgres Operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-system --create-namespace
# PostgreSQL cluster manifest for the Zalando Postgres Operator
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
name: pg-ha-cluster
namespace: production
spec:
teamId: "platform"
numberOfInstances: 3
volume:
size: 500Gi
storageClass: longhorn
users:
appuser:
- superuser
- createdb
replicator: []
databases:
appdb: appuser
postgresql:
version: "16"
parameters:
shared_buffers: "16GB"
effective_cache_size: "48GB"
work_mem: "256MB"
maintenance_work_mem: "2GB"
max_connections: "200"
max_wal_senders: "10"
wal_level: replica
synchronous_commit: "on"
wal_keep_size: "2GB"
archive_mode: "on"
track_commit_timestamp: "on"
patroni:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
synchronous_mode: true
resources:
requests:
cpu: "4"
memory: 32Gi
limits:
cpu: "8"
memory: 64Gi
Keepalived for HAProxy VIP
# /etc/keepalived/keepalived.conf on lb1
vrrp_script chk_haproxy {
script "killall -0 haproxy"
interval 2
weight 2
}
vrrp_instance VI_PG {
state MASTER
interface eth0
virtual_router_id 52
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass pgha_vip_pass
}
virtual_ipaddress {
10.0.0.100/24
}
track_script {
chk_haproxy
}
}
Monitoring PostgreSQL Replication
Monitoring replication health is non-negotiable in production. PostgreSQL provides several built-in views for this purpose, and Prometheus with Grafana provides the long-term visibility and alerting you need.
Built-in Monitoring Queries
-- Check replication status on the primary
SELECT
client_addr,
application_name,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
-- Check replication slot status
SELECT
slot_name,
slot_type,
active,
wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS slot_lag
FROM pg_replication_slots;
-- Check standby recovery status (run on standby)
SELECT
pg_is_in_recovery() AS is_standby,
pg_last_wal_receive_lsn() AS last_received,
pg_last_wal_replay_lsn() AS last_replayed,
pg_last_xact_replay_timestamp() AS last_replayed_timestamp,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int AS replay_lag_seconds;
-- Monitor WAL generation rate
SELECT
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal_generated,
pg_size_pretty(sum(size)) AS wal_directory_size
FROM pg_ls_waldir();
-- Check for long-running queries that could block replication
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle'
ORDER BY duration DESC;
Prometheus and Grafana Stack
The postgres_exporter exposes PostgreSQL metrics in Prometheus format. Combined with the patroni_exporter, you get full visibility into both database performance and HA cluster state.
# Deploy postgres_exporter as a sidecar or standalone
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Custom queries for postgres_exporter
# /etc/postgres_exporter/queries.yaml
pg_replication_lag:
query: |
SELECT
CASE WHEN pg_is_in_recovery() THEN
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float
ELSE 0 END AS lag_seconds
master: true
metrics:
- lag_seconds:
usage: "GAUGE"
description: "Replication lag in seconds"
pg_replication_slots:
query: |
SELECT
slot_name,
active::int AS active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)::float AS slot_lag_bytes
FROM pg_replication_slots
master: true
metrics:
- slot_name:
usage: "LABEL"
- active:
usage: "GAUGE"
description: "Whether the slot is active"
- slot_lag_bytes:
usage: "GAUGE"
description: "Slot lag in bytes"
# PrometheusRule for PostgreSQL HA alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: postgresql-ha-alerts
namespace: monitoring
spec:
groups:
- name: postgresql-replication
rules:
- alert: PostgreSQLReplicationLagHigh
expr: pg_replication_lag_seconds > 30
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL replication lag exceeds 30s on {{ $labels.instance }}"
- alert: PostgreSQLReplicationSlotInactive
expr: pg_replication_slots_active == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Replication slot {{ $labels.slot_name }} is inactive"
- alert: PostgreSQLReplicationSlotLagHigh
expr: pg_replication_slots_slot_lag_bytes > 1073741824
for: 10m
labels:
severity: warning
annotations:
summary: "Replication slot lag exceeds 1GB on {{ $labels.slot_name }}"
- alert: PatroniClusterUnhealthy
expr: patroni_cluster_members_count < 3
for: 2m
labels:
severity: critical
annotations:
summary: "Patroni cluster has fewer than 3 members"
Production Tuning Parameters
PostgreSQL's default configuration is conservative, tuned for a small shared hosting environment. Production HA clusters require careful tuning of replication, memory, and WAL parameters. The following configuration summarises the most important settings for a 64GB RAM server with NVMe storage.
# postgresql.conf — Production HA tuning
# === Replication ===
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 4GB
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
track_commit_timestamp = on
wal_log_hints = on
# === WAL ===
min_wal_size = 1GB
max_wal_size = 8GB
wal_buffers = 64MB
wal_compression = zstd
archive_mode = on
archive_timeout = 300
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
# === Memory ===
shared_buffers = 16GB # 25% of RAM
effective_cache_size = 48GB # 75% of RAM
work_mem = 256MB # per-operation sort/hash memory
maintenance_work_mem = 2GB # for VACUUM, CREATE INDEX
huge_pages = try
# === Connections ===
max_connections = 200 # use PgBouncer for higher client counts
superuser_reserved_connections = 5
# === Query Performance ===
random_page_cost = 1.1 # SSD storage
effective_io_concurrency = 200 # NVMe SSD
default_statistics_target = 500
jit = on
# === Logging ===
log_min_duration_statement = 500 # log queries > 500ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
# === Autovacuum ===
autovacuum_max_workers = 4
autovacuum_naptime = 30s
autovacuum_vacuum_cost_limit = 2000
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
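The memory settings above follow simple ratios of physical RAM. A sketch of that sizing rule (these ratios are common starting points, not hard rules, so always validate against your workload):

```python
# Common starting ratios for PostgreSQL memory tuning:
# shared_buffers ~25% of RAM, effective_cache_size ~75% of RAM.

def memory_settings_gb(ram_gb):
    return {
        "shared_buffers": ram_gb // 4,            # ~25% of RAM
        "effective_cache_size": ram_gb * 3 // 4,  # ~75% of RAM
    }

print(memory_settings_gb(64))
# {'shared_buffers': 16, 'effective_cache_size': 48} -> matches the config above
```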
Tuning Patroni DCS Parameters
The relationship between Patroni's ttl, loop_wait, and retry_timeout parameters directly impacts failover speed and false-positive risk. A shorter TTL means faster failover detection but increases the risk of unnecessary failovers during brief network blips.
# Conservative (production default)
ttl: 30
loop_wait: 10
retry_timeout: 10
# Failover detection: ~30-40 seconds
# Aggressive (low-latency failover)
ttl: 15
loop_wait: 5
retry_timeout: 5
# Failover detection: ~15-20 seconds
# Warning: Higher risk of false failovers in unstable networks
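The detection windows quoted above follow from a rough model: the leader key survives up to ttl seconds after its last renewal, and standbys act on the expiry within about one loop_wait cycle. This is an approximation of Patroni's timing, not its exact internals:

```python
# Rough model of failover detection time: the leader key can live up
# to ttl seconds past the last renewal, plus up to one loop_wait cycle
# before standbys react to the expiry.

def detection_window_s(ttl, loop_wait):
    return ttl, ttl + loop_wait  # (best case, rough worst case)

print(detection_window_s(30, 10))  # (30, 40)  conservative profile
print(detection_window_s(15, 5))   # (15, 20)  aggressive profile
```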
Chaos Engineering and Failover Testing
A failover system that has never been tested is a system that does not work. Chaos engineering applies controlled failures to validate that your HA setup behaves correctly under real failure conditions. Every Patroni cluster should be subjected to regular failover drills.
Failover Test Playbook
# 1. Verify cluster health before testing
patronictl -c /etc/patroni/patroni.yml list
+--------+-----------+---------+----+-----------+
| Member | Host      | Role    | TL | Lag in MB |
+--------+-----------+---------+----+-----------+
| node1  | 10.0.1.10 | Leader  |  5 |           |
| node2  | 10.0.1.11 | Replica |  5 |         0 |
| node3  | 10.0.1.12 | Replica |  5 |         0 |
+--------+-----------+---------+----+-----------+
# 2. Simulate primary crash (on node1)
sudo systemctl stop patroni
# Or more aggressive: sudo kill -9 $(pgrep -f patroni)
# 3. Monitor failover (from any node with patronictl)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# 4. Verify new leader is elected (within 30-45 seconds)
# Expected: node2 or node3 promoted to Leader
# 5. Test write availability through HAProxy
PGPASSWORD=app_password psql -h haproxy-host -p 5000 -U appuser -d appdb \
-c "INSERT INTO health_check (ts) VALUES (now()) RETURNING *;"
# 6. Restart the former primary
sudo systemctl start patroni
# Patroni will use pg_rewind (if use_pg_rewind is enabled) to rejoin it as a replica
# 7. Verify the former primary rejoins as replica
patronictl -c /etc/patroni/patroni.yml list
Network Partition Testing
# Simulate network partition on the primary using iptables
# Block all traffic to etcd from the primary
sudo iptables -A OUTPUT -d 10.0.2.10 -j DROP
sudo iptables -A OUTPUT -d 10.0.2.11 -j DROP
sudo iptables -A OUTPUT -d 10.0.2.12 -j DROP
# Expected behaviour:
# 1. Primary loses DCS access
# 2. Leader key TTL expires
# 3. Primary demotes itself (with watchdog, node may reboot)
# 4. Standby acquires leader lock and promotes
# 5. After clearing iptables rules, former primary rejoins as replica
# Clean up
sudo iptables -D OUTPUT -d 10.0.2.10 -j DROP
sudo iptables -D OUTPUT -d 10.0.2.11 -j DROP
sudo iptables -D OUTPUT -d 10.0.2.12 -j DROP
Automated Chaos Testing with Toxiproxy
# Run Toxiproxy alongside your Patroni cluster
# Create proxies for etcd and replication connections
toxiproxy-cli create etcd_proxy -l 0.0.0.0:12379 -u 10.0.2.10:2379
toxiproxy-cli create pg_repl_proxy -l 0.0.0.0:15432 -u 10.0.1.10:5432
# Add latency to etcd connections (simulates degraded network)
toxiproxy-cli toxic add etcd_proxy -t latency -a latency=500 -a jitter=200
# Add bandwidth limit to replication (simulates WAN replication)
toxiproxy-cli toxic add pg_repl_proxy -t bandwidth -a rate=1024
# Completely sever the connection (simulates network partition)
toxiproxy-cli toxic add etcd_proxy -t timeout -a timeout=0
# Monitor Patroni behaviour and verify correct failover
watch -n 2 'curl -s http://10.0.1.10:8008/patroni | python3 -m json.tool'
Continuous Validation Script
#!/bin/bash
# continuous_ha_check.sh — Run during chaos tests to measure availability
HAPROXY_HOST="10.0.0.100"
WRITE_PORT=5000
READ_PORT=5001
DATABASE="appdb"
USER="appuser"
LOGFILE="/var/log/ha_test_$(date +%Y%m%d_%H%M%S).log"
write_count=0
write_fail=0
read_count=0
read_fail=0
while true; do
ts=$(date '+%Y-%m-%d %H:%M:%S.%3N')
# Test write path
if PGPASSWORD=app_password psql -h $HAPROXY_HOST -p $WRITE_PORT \
-U $USER -d $DATABASE -c "SELECT 1" &>/dev/null; then
((write_count++))
else
((write_fail++))
echo "$ts WRITE_FAIL total_fails=$write_fail" >> $LOGFILE
fi
# Test read path
if PGPASSWORD=app_password psql -h $HAPROXY_HOST -p $READ_PORT \
-U $USER -d $DATABASE -c "SELECT 1" &>/dev/null; then
((read_count++))
else
((read_fail++))
echo "$ts READ_FAIL total_fails=$read_fail" >> $LOGFILE
fi
total=$((write_count + write_fail))
if (( total % 100 == 0 )); then
write_avail=$(echo "scale=2; $write_count * 100 / $total" | bc)
read_total=$((read_count + read_fail))
read_avail=$(echo "scale=2; $read_count * 100 / $read_total" | bc)
echo "$ts Writes: ${write_avail}% ($write_count/$total) Reads: ${read_avail}% ($read_count/$read_total)"
fi
sleep 0.5
done
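After a chaos run, the log produced by the script above can be reduced to an availability figure. This post-processing sketch (a hypothetical helper, assuming the WRITE_FAIL/READ_FAIL line format above and the 0.5 s probe interval) estimates approximate downtime per path:

```python
def summarise_ha_log(lines, probe_interval=0.5):
    """Count failed probes per path in a continuous_ha_check.sh log and
    convert them to approximate downtime in seconds (each failed probe
    represents roughly one probe interval of unavailability)."""
    write_fails = sum(1 for line in lines if " WRITE_FAIL " in line)
    read_fails = sum(1 for line in lines if " READ_FAIL " in line)
    return {
        "write_downtime_s": write_fails * probe_interval,
        "read_downtime_s": read_fails * probe_interval,
    }

# Illustrative log lines in the format emitted by the script above
sample = [
    "2026-04-12 10:00:01.123 WRITE_FAIL total_fails=1",
    "2026-04-12 10:00:01.654 WRITE_FAIL total_fails=2",
    "2026-04-12 10:00:02.201 READ_FAIL total_fails=1",
]
print(summarise_ha_log(sample))
# {'write_downtime_s': 1.0, 'read_downtime_s': 0.5}
```

Comparing these numbers before and after a tuning change (for example, the aggressive DCS profile) quantifies what you actually bought.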
Advanced: Cascading Replication and Delayed Standbys
For large clusters, cascading replication reduces the load on the primary. Instead of all standbys replicating directly from the primary, some standbys replicate from other standbys. This creates a tree topology where the primary feeds two standbys, and those standbys feed additional downstream standbys.
# postgresql.auto.conf on a cascading standby
primary_conninfo = 'host=standby1-host port=5432 user=replicator application_name=cascade1'
primary_slot_name = 'cascade1_slot'
A delayed standby intentionally applies WAL records with a time delay — typically 1-4 hours. This provides a defence against logical errors (accidental deletes, bad migrations) that are immediately replicated to synchronous standbys. If a disaster occurs, you can stop WAL replay on the delayed standby and recover data from before the error.
# postgresql.conf on delayed standby
recovery_min_apply_delay = '1h'
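The operational value of a delayed standby depends on reacting in time: a change made at time T is replayed on the standby at roughly T plus the configured delay, and you must run `SELECT pg_wal_replay_pause();` on the standby before that moment. A small illustrative helper (hypothetical, timestamps are examples) computes the response deadline:

```python
from datetime import datetime, timedelta

def replay_pause_deadline(error_time: datetime, apply_delay: timedelta) -> datetime:
    """Latest moment at which `SELECT pg_wal_replay_pause();` can still
    stop a change made at `error_time` from being applied on a standby
    configured with recovery_min_apply_delay = apply_delay."""
    return error_time + apply_delay

mistake = datetime(2026, 4, 12, 14, 30)  # accidental DELETE ran at 14:30
deadline = replay_pause_deadline(mistake, timedelta(hours=1))
print(deadline.isoformat())  # 2026-04-12T15:30:00
```

Once replay is paused, you can extract the lost rows with pg_dump from the delayed standby, or resume replay with `SELECT pg_wal_replay_resume();` after the investigation.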
Connection String Strategies
Applications connecting to a Patroni-managed cluster should either connect through HAProxy or use PostgreSQL's built-in multi-host connection string with target_session_attrs. The latter provides client-side failover without depending on a load balancer.
# Multi-host connection string with target_session_attrs
# The client tries each host in order and connects to the one matching the target attribute
postgresql://appuser:password@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=read-write&sslmode=require
# For read-only connections
postgresql://appuser:password@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=prefer-standby&sslmode=require
This approach works well for applications that cannot easily be reconfigured to point to an HAProxy VIP. The PostgreSQL client library (libpq) handles the failover transparently.
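When the host list comes from inventory or service discovery, building the multi-host URI programmatically avoids copy-paste errors. A minimal sketch (hypothetical helper; host names are placeholders and credentials are supplied elsewhere, e.g. via a password file):

```python
def multihost_dsn(hosts, dbname, user, attrs="read-write", port=5432):
    """Build a libpq multi-host URI. libpq tries hosts left to right and
    keeps the first connection whose session matches target_session_attrs."""
    hostlist = ",".join(f"{h}:{port}" for h in hosts)
    return (f"postgresql://{user}@{hostlist}/{dbname}"
            f"?target_session_attrs={attrs}&sslmode=require")

print(multihost_dsn(["node1", "node2", "node3"], "appdb", "appuser"))
# postgresql://appuser@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=read-write&sslmode=require
```

Any libpq-based driver (psql, psycopg, the Go and Rust wrappers around libpq) accepts the resulting string unchanged.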
Security Hardening
A production PostgreSQL HA cluster must enforce encryption in transit and at rest, use strong authentication, and limit network exposure.
# Enable TLS in postgresql.conf
ssl = on
ssl_cert_file = '/etc/postgresql/certs/server.crt'
ssl_key_file = '/etc/postgresql/certs/server.key'
ssl_ca_file = '/etc/postgresql/certs/ca.crt'
ssl_min_protocol_version = 'TLSv1.3'
# Require TLS for all connections in pg_hba.conf
hostssl replication replicator 10.0.0.0/16 scram-sha-256
hostssl all all 10.0.0.0/16 scram-sha-256
# etcd TLS
# In Patroni config
etcd3:
hosts:
- 10.0.2.10:2379
- 10.0.2.11:2379
- 10.0.2.12:2379
protocol: https
cacert: /etc/patroni/certs/etcd-ca.crt
cert: /etc/patroni/certs/etcd-client.crt
key: /etc/patroni/certs/etcd-client.key
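Provisioning the certificates referenced above is often the fiddly part. The following sketch generates a throwaway private CA and a CA-signed server certificate with openssl; the paths and the CN are placeholder assumptions, and in production you would use your organisation's PKI rather than a self-signed CA:

```shell
set -e
CERTDIR=$(mktemp -d)

# 1. Self-signed CA (key + certificate)
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=pg-ha-ca" \
  -keyout "$CERTDIR/ca.key" -out "$CERTDIR/ca.crt"

# 2. Server key and certificate signing request
openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=node1.example.internal" \
  -keyout "$CERTDIR/server.key" -out "$CERTDIR/server.csr"

# 3. Sign the server certificate with the CA
openssl x509 -req -days 825 -in "$CERTDIR/server.csr" \
  -CA "$CERTDIR/ca.crt" -CAkey "$CERTDIR/ca.key" -CAcreateserial \
  -out "$CERTDIR/server.crt"

# PostgreSQL refuses to start with a key readable by group/other
chmod 600 "$CERTDIR/server.key" "$CERTDIR/ca.key"
ls "$CERTDIR"
```

Copy server.crt, server.key, and ca.crt into the paths configured in postgresql.conf, and repeat the process (or reuse the CA) for the etcd client certificates in the Patroni config.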
Backup Strategy for HA Clusters
A comprehensive backup strategy for Patroni clusters should include continuous WAL archiving, regular full backups, and differential or incremental backups between full backups. pgBackRest is the recommended tool for production PostgreSQL backup management.
# Schedule backups via cron
# Full backup weekly (Sunday 2 AM)
0 2 * * 0 pgbackrest --stanza=pg-ha-cluster --type=full backup
# Differential backup daily (2 AM, Mon-Sat)
0 2 * * 1-6 pgbackrest --stanza=pg-ha-cluster --type=diff backup
# Show detailed information for the most recent backup
pgbackrest --stanza=pg-ha-cluster --set=latest info
# Verify repository and backup file integrity
pgbackrest --stanza=pg-ha-cluster verify
# List all backups
pgbackrest --stanza=pg-ha-cluster info
full backup: 20260412-020000F
timestamp: 2026-04-12 02:00:00 +0000
wal start/stop: 000000050000000000000040 / 000000050000000000000042
database size: 150GB, backup size: 150GB
repository size: 45GB (compressed)
diff backup: 20260412-020000F_20260413-020000D
timestamp: 2026-04-13 02:00:00 +0000
database size: 151GB, backup size: 2.1GB
repository size: 650MB (compressed)
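Backups are only useful if old ones are expired predictably. A sketch of the matching retention settings in pgbackrest.conf, following the weekly-full/daily-diff schedule above (the retention counts, cipher choice, and data path are illustrative assumptions):

```ini
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-retention-full=4          # keep 4 full backups (~1 month at weekly fulls)
repo1-retention-diff=7          # keep 7 differential backups
repo1-cipher-type=aes-256-cbc   # encrypt the repository (requires repo1-cipher-pass)
compress-type=zst               # zstd compression

[pg-ha-cluster]
pg1-path=/var/lib/postgresql/data
```

Expiration runs automatically at the end of each backup, so with this schedule the repository size stays bounded without a separate cleanup job.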
Operational Runbook Summary
Every team running a Patroni cluster should maintain a runbook covering the following scenarios. Having documented, tested procedures transforms a stressful outage into a routine operation.
# === Quick Reference Commands ===
# Cluster status
patronictl -c /etc/patroni/patroni.yml list
patronictl -c /etc/patroni/patroni.yml history
# Planned switchover
patronictl -c /etc/patroni/patroni.yml switchover --master node1 --candidate node2
# Restart PostgreSQL on a specific node (rolling restart)
patronictl -c /etc/patroni/patroni.yml restart pg-ha-cluster node2
# Reload PostgreSQL configuration without restart
patronictl -c /etc/patroni/patroni.yml reload pg-ha-cluster
# Pause automatic failover (during maintenance)
patronictl -c /etc/patroni/patroni.yml pause
# Resume automatic failover
patronictl -c /etc/patroni/patroni.yml resume
# Edit DCS configuration (applies to all nodes)
patronictl -c /etc/patroni/patroni.yml edit-config
# Reinitialise a failed replica
patronictl -c /etc/patroni/patroni.yml reinit pg-ha-cluster node3
# Check Patroni REST API directly
curl -s http://10.0.1.10:8008/patroni | python3 -m json.tool
curl -s http://10.0.1.10:8008/cluster | python3 -m json.tool
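The /cluster endpoint returns a JSON document with a members list, which is easier to script against than patronictl's table output. A sketch of extracting the current leader (the sample payload is illustrative; the field names match the endpoint's format):

```python
import json

def cluster_leader(payload: str):
    """Return the name of the member whose role is 'leader' in a
    Patroni /cluster response, or None if no leader is present."""
    for member in json.loads(payload).get("members", []):
        if member.get("role") == "leader":
            return member["name"]
    return None

# Illustrative /cluster response after a failover to node2
sample = json.dumps({"members": [
    {"name": "node1", "role": "replica", "state": "streaming", "lag": 0},
    {"name": "node2", "role": "leader",  "state": "running"},
    {"name": "node3", "role": "replica", "state": "streaming", "lag": 0},
]})
print(cluster_leader(sample))  # node2
```

The same pattern works for alerting: if `cluster_leader` returns None for longer than the configured ttl, the cluster is leaderless and a page should fire.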
Performance Benchmarking
Before going to production, benchmark your HA cluster to establish baseline performance and verify that synchronous replication latency is acceptable for your workload.
# Benchmark with pgbench — initialise test data
pgbench -i -s 100 -h haproxy-host -p 5000 -U appuser appdb
# Run write-heavy benchmark (measures sync replication impact)
pgbench -h haproxy-host -p 5000 -U appuser -c 32 -j 8 -T 300 appdb
# Compare with async: temporarily set synchronous_commit = off
# Run read-only benchmark through read replica port
pgbench -h haproxy-host -p 5001 -U appuser -c 64 -j 16 -T 300 -S appdb
# Measure failover impact on transactions
# Run pgbench in background, then trigger a failover
pgbench -h haproxy-host -p 5000 -U appuser -c 8 -j 4 -T 600 appdb &
sleep 60 && patronictl -c /etc/patroni/patroni.yml switchover --master node1 --candidate node2 --force
Conclusion
PostgreSQL native high availability with Patroni is not a single-tool solution — it is an integrated system of streaming replication, distributed consensus, connection routing, connection pooling, WAL archiving, monitoring, and operational discipline. Each layer addresses a specific failure mode: streaming replication handles data redundancy, Patroni handles automatic failover coordination, etcd provides the distributed consensus needed for leader election without split-brain, HAProxy routes connections to the correct leader, PgBouncer manages connection overhead at scale, and WAL archiving with pgBackRest provides the last line of defence against logical errors and disaster recovery.
The deployment patterns vary across environments — AWS with NLB and Route53, Azure with Availability Zones and Azure Load Balancer, GCP with regional managed instance groups, or bare metal with k3s, Longhorn, and Keepalived — but the core architecture remains the same. Three or more PostgreSQL nodes managed by Patroni, backed by a three-node etcd cluster, fronted by a load balancer that follows Patroni's health check endpoints.
The most important investment you can make is not in configuration — it is in testing. Run failover drills monthly. Inject network partitions. Kill processes unexpectedly. Measure recovery time and data loss. Build dashboards that show replication lag, WAL generation rate, connection pool saturation, and DCS health in real time. The confidence you gain from systematic testing is what separates a cluster that survives its first real outage from one that turns a server failure into a business-impacting incident.
PostgreSQL gives you all the replication primitives. Patroni gives you the orchestration. etcd gives you the consensus. Your job is to wire them together correctly, tune them for your workload, and validate them continuously. This guide has given you the blueprints — now build, test, and operate with confidence.