PostgreSQL Native High Availability: Patroni, Streaming Replication, and Production Failover Strategies
Running a single PostgreSQL instance is straightforward until the first unplanned outage reminds you that a database is only as valuable as its availability. Disk failures, kernel panics, network partitions, and botched upgrades are not theoretical risks — they are operational certainties on a long enough timeline. PostgreSQL does not ship with built-in automatic failover, but it does provide all the replication primitives needed to build a highly available cluster. Patroni, an open-source HA framework maintained by Zalando, orchestrates those primitives into a production-grade failover system that has been battle-tested at scale across thousands of PostgreSQL clusters worldwide.
This article is a deep-dive engineering guide. We will cover PostgreSQL streaming replication (synchronous and asynchronous), Patroni architecture and configuration, etcd as a distributed configuration store, HAProxy for connection routing with read-write splitting, PgBouncer for connection pooling, WAL archiving and point-in-time recovery, logical replication for selective data synchronisation, pg_basebackup for initial standby provisioning, repmgr as an alternative to Patroni, cloud-specific deployment patterns for AWS, Azure, and GCP, bare metal k3s deployment with Rancher and Longhorn, monitoring with pg_stat_replication and Prometheus/Grafana, switchover versus failover procedures, split-brain prevention, production tuning, and chaos engineering for failover validation.
PostgreSQL Streaming Replication Fundamentals
Streaming replication is the backbone of PostgreSQL high availability. It works by continuously shipping Write-Ahead Log (WAL) records from a primary server to one or more standby servers. The standby applies those WAL records in real time, maintaining a near-identical copy of the primary's data. This mechanism was introduced in PostgreSQL 9.0 and has been refined in every subsequent release.
There are two modes of streaming replication: asynchronous and synchronous. In asynchronous mode, the primary does not wait for standbys to confirm receipt of WAL records before committing a transaction. This gives maximum write performance but introduces a window of potential data loss — if the primary fails before a standby has received the most recent WAL, those transactions are lost. In synchronous mode, the primary waits for at least one standby to confirm that WAL records have been written to durable storage before reporting a transaction as committed. This eliminates data loss at the cost of increased commit latency, since every write must round-trip to a standby.
The choice between synchronous and asynchronous replication is not binary. PostgreSQL supports synchronous_commit at the session level, so latency-sensitive workloads can opt into asynchronous commits while critical financial transactions use synchronous commits within the same cluster.
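The data-loss window under asynchronous replication is simply the number of WAL bytes the standby has not yet replayed. A minimal sketch of the arithmetic behind PostgreSQL's pg_wal_lsn_diff, using the textual LSN format ('XXXXXXXX/YYYYYYYY', two hex words forming a 64-bit position):

```python
# Sketch: quantify the async data-loss window as the byte distance
# between two WAL positions (what pg_wal_lsn_diff computes in SQL).
# Function names here are illustrative, not a PostgreSQL API.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN like '0/3000060' to an absolute byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """WAL bytes the standby has not yet replayed: the data at risk if
    the primary is lost right now under asynchronous replication."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_replay_lsn)

# Primary at 1/0, standby replayed up to 0/FFFFF000
print(replication_lag_bytes("1/0", "0/FFFFF000"))  # 4096
```

The same arithmetic underpins the monitoring queries and the maximum_lag_on_failover threshold discussed later.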
Configuring the Primary for Replication
The primary server must be configured to generate WAL records at a level sufficient for replication and to allow standby connections. The following postgresql.conf settings are essential.
# postgresql.conf on the primary
wal_level = replica # minimum for streaming replication
max_wal_senders = 10 # max concurrent replication connections
max_replication_slots = 10 # prevent WAL removal before standby consumption
wal_keep_size = 2GB # retain WAL as fallback if slots are unused
hot_standby = on # allow read queries on standbys
synchronous_commit = on # 'on' waits for sync standbys; 'local' waits only for the local flush
synchronous_standby_names = 'ANY 1 (standby1, standby2)' # sync replication targets
archive_mode = on # enable WAL archiving for PITR
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
listen_addresses = '*'
port = 5432
Authentication for replication connections is handled in pg_hba.conf. Replication connections use a dedicated connection type.
# pg_hba.conf — replication entries
# TYPE DATABASE USER ADDRESS METHOD
host replication replicator 10.0.1.0/24 scram-sha-256
host replication replicator 10.0.2.0/24 scram-sha-256
host all all 10.0.0.0/16 scram-sha-256
Create the replication user on the primary.
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'strong_replication_password';
Provisioning a Standby with pg_basebackup
The pg_basebackup utility creates a physical copy of the primary's data directory, which becomes the starting point for a new standby. It handles the base backup and WAL streaming atomically, so the resulting copy is consistent.
# On the standby server
pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/16/main \
-Fp -Xs -P -R -S standby1_slot
# -Fp: plain format
# -Xs: stream WAL during backup
# -P: show progress
# -R: create standby.signal and configure primary_conninfo in postgresql.auto.conf
# -S: use the named replication slot (written as primary_slot_name when -R is given)
The -R flag is critical — it writes primary_conninfo into postgresql.auto.conf and creates standby.signal, which tells PostgreSQL to start in standby mode. In PostgreSQL 12 and later, recovery.conf is replaced by these two mechanisms.
# postgresql.auto.conf (generated by pg_basebackup -R)
primary_conninfo = 'host=primary-host port=5432 user=replicator password=strong_replication_password application_name=standby1'
primary_slot_name = 'standby1_slot'
Create the replication slots on the primary before running pg_basebackup, to prevent WAL from being cleaned up before the standby can consume it.
SELECT pg_create_physical_replication_slot('standby1_slot');
SELECT pg_create_physical_replication_slot('standby2_slot');
Patroni: Automated HA Orchestration
Streaming replication gives you data redundancy, but it does not give you automatic failover. If the primary crashes, someone — a human operator or an automation system — must promote a standby to primary, reconfigure the remaining standbys to follow the new primary, and update connection routing. Patroni automates all of this.
Patroni is a Python daemon that runs alongside each PostgreSQL instance. It uses a Distributed Configuration Store (DCS) — typically etcd, but also ZooKeeper or Consul — to coordinate leader election and cluster state. Every Patroni node continuously writes its health status to the DCS. When the leader (primary) fails to renew its DCS key within the configured TTL, Patroni initiates a leader election among the healthy standbys. The winner is promoted to primary, and the remaining nodes reconfigure themselves as standbys of the new primary — all automatically, typically within 10-30 seconds.
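Patroni's maximum_lag_on_failover setting constrains which standbys may be promoted. A simplified model of that filter (an illustration, not Patroni's actual election code, and the function name is hypothetical):

```python
# Simplified model of the promotion filter: among surviving standbys,
# only those within maximum_lag_on_failover bytes of the failed
# primary are eligible, and the least-lagged eligible node is preferred.
# Patroni's real election also involves DCS races and node tags.

def pick_promotion_candidate(standby_lag_bytes, maximum_lag_on_failover=1_048_576):
    eligible = {name: lag for name, lag in standby_lag_bytes.items()
                if lag <= maximum_lag_on_failover}
    if not eligible:
        return None  # no safe candidate: refuse to fail over rather than lose data
    return min(eligible, key=eligible.get)

# node3 is excluded because its lag exceeds the 1MB threshold
print(pick_promotion_candidate({"node2": 2048, "node3": 5_000_000}))  # node2
```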
Patroni YAML Configuration
Patroni is configured via a YAML file that defines the DCS connection, PostgreSQL parameters, replication behaviour, and bootstrap settings. The following is a production-grade configuration for the primary node.
# /etc/patroni/patroni.yml — Node 1 (Primary)
scope: pg-ha-cluster
namespace: /postgresql-ha/
name: node1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.1.10:8008
etcd3:
hosts:
- 10.0.2.10:2379
- 10.0.2.11:2379
- 10.0.2.12:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB — only promote standbys within this lag
synchronous_mode: true
synchronous_mode_strict: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: 'on'
max_connections: 200
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 2GB
synchronous_commit: 'on'
archive_mode: 'on'
archive_command: 'test ! -f /archive/%f && cp %p /archive/%f'
archive_timeout: 60
wal_log_hints: 'on'
shared_preload_libraries: 'pg_stat_statements'
track_commit_timestamp: 'on'
pg_hba:
- host replication replicator 10.0.0.0/16 scram-sha-256
- host all all 10.0.0.0/16 scram-sha-256
- host all all 0.0.0.0/0 scram-sha-256 # tighten or remove this catch-all in production
initdb:
- encoding: UTF8
- data-checksums
users:
admin:
password: 'admin_secure_password'
options:
- createrole
- createdb
replicator:
password: 'repl_secure_password'
options:
- replication
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.1.10:5432
data_dir: /var/lib/postgresql/16/main
bin_dir: /usr/lib/postgresql/16/bin
config_dir: /var/lib/postgresql/16/main
pgpass: /tmp/pgpass0
authentication:
superuser:
username: postgres
password: 'postgres_secure_password'
replication:
username: replicator
password: 'repl_secure_password'
rewind:
username: postgres
password: 'postgres_secure_password'
parameters:
unix_socket_directories: '/var/run/postgresql'
create_replica_methods:
- basebackup
basebackup:
max-rate: 100M
checkpoint: fast
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
The standby nodes use an identical configuration with their own name, connect_address, and listen values. Patroni handles the rest — it detects whether a node should be the leader or a replica based on the DCS state and configures PostgreSQL accordingly.
etcd as Distributed Configuration Store
etcd is the nervous system of a Patroni cluster. It stores the current leader identity, the cluster topology, the desired configuration, and the health status of each node. A three-node etcd cluster is the minimum for production, tolerating one node failure while maintaining quorum.
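The three-node minimum follows directly from quorum arithmetic: a majority of voting members must agree, so adding a fourth node buys no extra fault tolerance. A quick sketch of that rule (function names are illustrative):

```python
# Quorum arithmetic for an etcd cluster of n voting members:
# a majority (n // 2 + 1) must agree for writes to proceed,
# so the cluster survives n - quorum member failures.

def quorum(n: int) -> int:
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return n - quorum(n)

for n in (1, 3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
# 3 nodes tolerate 1 failure; 4 nodes still only tolerate 1; 5 tolerate 2
```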
# Install and configure etcd on three dedicated nodes
# /etc/etcd/etcd.conf.yml — Node etcd1 (10.0.2.10)
name: etcd1
data-dir: /var/lib/etcd
listen-client-urls: http://0.0.0.0:2379
listen-peer-urls: http://0.0.0.0:2380
advertise-client-urls: http://10.0.2.10:2379
initial-advertise-peer-urls: http://10.0.2.10:2380
initial-cluster: etcd1=http://10.0.2.10:2380,etcd2=http://10.0.2.11:2380,etcd3=http://10.0.2.12:2380
initial-cluster-state: new
initial-cluster-token: patroni-etcd-cluster
# Start etcd
systemctl enable --now etcd
# Verify cluster health
etcdctl endpoint health --cluster \
--endpoints=http://10.0.2.10:2379,http://10.0.2.11:2379,http://10.0.2.12:2379
For production deployments, enable TLS between etcd peers and between etcd and Patroni clients. Unencrypted etcd traffic exposes cluster credentials and configuration to network-level attackers.
HAProxy for Connection Routing
Patroni exposes a REST API on each node (port 8008 by default) that reports whether the node is the current leader or a replica. HAProxy uses these health check endpoints to route traffic: writes go to the leader, reads go to healthy replicas. This gives you automatic read-write splitting without any application-level changes.
# /etc/haproxy/haproxy.cfg
global
maxconn 2000
log /dev/log local0
stats socket /var/run/haproxy.sock mode 660 level admin
defaults
mode tcp
log global
retries 3
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
maxconn 1000
listen pg_write
bind *:5000
option httpchk GET /primary
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
listen pg_read
bind *:5001
balance roundrobin
option httpchk GET /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server node1 10.0.1.10:5432 check port 8008
server node2 10.0.1.11:5432 check port 8008
server node3 10.0.1.12:5432 check port 8008
listen stats
bind *:7000
mode http
stats enable
stats uri /
stats refresh 10s
The /primary endpoint returns HTTP 200 only on the current Patroni leader. The /replica endpoint returns 200 on healthy standbys. When a failover occurs, the new primary starts returning 200 on /primary, and HAProxy automatically redirects write traffic — typically within a single health check interval (3 seconds). The on-marked-down shutdown-sessions directive immediately terminates existing connections to a failed primary, forcing clients to reconnect to the new leader.
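The routing decision HAProxy derives from these checks reduces to a few lines. A sketch of that logic under the stated endpoint semantics (the function name and data shape are illustrative):

```python
# Routing decision driven purely by the HTTP status each node returns
# on the Patroni REST endpoints: /primary answers 200 only on the
# leader, /replica answers 200 only on healthy standbys.

def route(health):
    """health maps node -> {'/primary': status, '/replica': status}.
    Returns (write_targets, read_targets)."""
    writers = [n for n, h in health.items() if h.get("/primary") == 200]
    readers = [n for n, h in health.items() if h.get("/replica") == 200]
    return writers, readers

health = {
    "node1": {"/primary": 200, "/replica": 503},
    "node2": {"/primary": 503, "/replica": 200},
    "node3": {"/primary": 503, "/replica": 200},
}
print(route(health))  # (['node1'], ['node2', 'node3'])
```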
PgBouncer for Connection Pooling
PostgreSQL creates a new backend process for each client connection. At scale — hundreds or thousands of microservices each maintaining connection pools — the overhead of process creation and memory consumption becomes significant. PgBouncer sits between the application and PostgreSQL, maintaining a pool of server-side connections and multiplexing client connections onto them.
# /etc/pgbouncer/pgbouncer.ini
[databases]
* = host=127.0.0.1 port=5432 dbname=appdb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 3
server_lifetime = 3600
server_idle_timeout = 600
server_connect_timeout = 5
server_login_retry = 3
log_connections = 1
log_disconnections = 1
stats_period = 60
When used with Patroni, PgBouncer is typically co-located on each PostgreSQL node or on the HAProxy nodes. The transaction pool mode is the best choice for most workloads — it assigns a server connection for the duration of a transaction and returns it to the pool between transactions. This is far more efficient than session mode, which holds a connection for the entire client session.
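The sizing intuition behind default_pool_size can be made concrete with a Little's-law style estimate: under transaction pooling, required server connections scale with the fraction of time each client actually spends inside a transaction, not with raw client count. A back-of-envelope sketch (the workload numbers are assumptions, not measurements):

```python
# Back-of-envelope sizing for transaction pooling: a server connection
# is held only while a transaction runs, so server connections needed
# ~= clients x (time fraction each client spends in a transaction).

import math

def server_connections_needed(clients, avg_txn_ms, txns_per_client_per_s):
    busy_fraction = (avg_txn_ms / 1000.0) * txns_per_client_per_s
    return math.ceil(clients * busy_fraction)

# 1000 clients, 5 ms transactions, 10 txn/s each -> ~50 connections,
# which is the order of magnitude behind default_pool_size = 50 above.
print(server_connections_needed(1000, 5.0, 10.0))  # 50
```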
WAL Archiving and Point-in-Time Recovery
Streaming replication protects against server failure, but it does not protect against logical errors — an accidental DROP TABLE or a bad application migration is replicated to all standbys immediately. WAL archiving combined with point-in-time recovery (PITR) lets you restore to any moment before the error occurred.
WAL archiving copies completed WAL segments to a durable archive — typically an S3 bucket, an NFS mount, or a dedicated backup server. Tools like pgBackRest and WAL-G handle archiving efficiently with compression, encryption, and parallel transfer.
# pgBackRest configuration — /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=pg-wal-archive
repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
repo1-s3-region=eu-west-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=7
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=strong_encryption_passphrase
compress-type=zst
compress-level=3
process-max=4
[pg-ha-cluster]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
# In postgresql.conf
archive_command = 'pgbackrest --stanza=pg-ha-cluster archive-push %p'
restore_command = 'pgbackrest --stanza=pg-ha-cluster archive-get %f "%p"'
To perform a point-in-time recovery, specify a target timestamp.
# Restore to a specific point in time
pgbackrest --stanza=pg-ha-cluster --type=time \
--target="2026-04-12 11:25:00" \
--target-action=promote \
restore
Logical Replication for Selective Data Sync
While streaming replication creates an exact physical copy of the entire database cluster, logical replication operates at the table level, replicating individual tables or subsets of data between independent PostgreSQL instances. This is useful for zero-downtime major version upgrades, cross-region read replicas that only need specific tables, data warehouse feeding, and multi-tenant data distribution.
# On the publisher (source database)
wal_level = logical # must be 'logical' — higher than 'replica'
CREATE PUBLICATION app_pub FOR TABLE orders, customers, products;
# On the subscriber (target database)
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=publisher-host port=5432 dbname=appdb user=replicator password=pass'
PUBLICATION app_pub;
Logical replication can run alongside streaming replication. A common pattern is to use streaming replication for HA (fast physical failover) and logical replication for cross-region analytics replicas that only need a subset of tables.
Multi-Region Streaming Replication
For disaster recovery and global read performance, PostgreSQL clusters can span multiple regions. The standard pattern is synchronous replication within a region (for zero data loss on local failover) and asynchronous replication across regions (to avoid cross-region latency penalties on every write). Each region has its own HAProxy for local read routing.
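The latency reasoning behind this hybrid topology can be sketched with a simple model: with synchronous_standby_names = 'ANY k (...)', a commit waits for the k-th fastest acknowledgement among the sync standbys, while async cross-region replicas add nothing to commit time. This ignores WAL transfer and flush overhead, so treat it as an approximation:

```python
# Rough commit-latency model for 'ANY k' synchronous replication:
# the commit waits for the k-th smallest RTT among sync standbys.
# Cross-region standbys kept asynchronous contribute nothing here.

def commit_latency_ms(sync_standby_rtts_ms, k):
    """Added commit latency: the k-th fastest sync-standby acknowledgement."""
    if k > len(sync_standby_rtts_ms):
        raise ValueError("not enough sync standbys to satisfy ANY k")
    return sorted(sync_standby_rtts_ms)[k - 1]

# Two in-region sync standbys (~1 ms); an 80 ms cross-region replica is
# deliberately excluded from the sync set, so it never gates commits.
print(commit_latency_ms([0.8, 1.2], k=1))  # 0.8
```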
Failover and Switchover Procedures
Understanding the difference between failover and switchover is critical. A failover is an unplanned promotion triggered by the failure of the current primary. A switchover is a planned, graceful role change — typically performed before maintenance. Patroni supports both.
Planned Switchover
# List cluster members
patronictl -c /etc/patroni/patroni.yml list
# Perform switchover to a specific node
patronictl -c /etc/patroni/patroni.yml switchover \
--master node1 --candidate node2 --force
# Or use the Patroni REST API
curl -s http://10.0.1.10:8008/switchover -XPOST \
-d '{"leader": "node1", "candidate": "node2"}'
During a switchover, Patroni demotes the current primary to a standby, promotes the target candidate, and reconfigures all other standbys to follow the new primary. The process takes 5-15 seconds. HAProxy detects the change through health checks and redirects traffic automatically.
Automatic Failover Sequence
When the primary fails unexpectedly, Patroni follows a precise sequence to restore service: the leader key in the DCS expires once the failed primary stops renewing it, the surviving standbys detect the expiry, each eligible standby compares its WAL position against the others, the best candidate acquires the leader lock and promotes itself, and the remaining standbys reconfigure to follow the new primary.
Split-Brain Prevention
Split-brain — where two nodes simultaneously believe they are the primary — is the most dangerous failure mode in any HA system. Patroni prevents split-brain through several mechanisms:
- DCS-based leader lock: Only one node can hold the leader key in etcd at any time. The key has a TTL, and the leader must renew it continuously. If a network partition isolates the leader from etcd, the key expires, and the leader demotes itself.
- Watchdog: Patroni can configure a Linux watchdog device (/dev/watchdog). If Patroni loses access to the DCS and cannot confirm it should remain leader, the watchdog will reboot or power-off the node — a hard fencing mechanism that guarantees the old primary does not continue accepting writes.
- pg_rewind: When a former primary comes back online, it may have WAL records that were never replicated. pg_rewind rewinds the timeline to the point of divergence, allowing the node to rejoin as a standby without a full base backup. Patroni's use_pg_rewind: true setting automates this.
# Enable watchdog in Patroni config
bootstrap:
dcs:
postgresql:
use_pg_rewind: true
parameters:
wal_log_hints: 'on' # required for pg_rewind
# Watchdog configuration
watchdog:
mode: required # 'off', 'automatic', or 'required'
device: /dev/watchdog
safety_margin: 5 # seconds before TTL expiry to trigger watchdog
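The DCS leader lock behaves like a TTL-bound lease: the leader must keep renewing a key, and once renewal stops (crash, or partition from etcd) the key expires and another node may take over. A minimal sketch of that lifecycle (a toy model, not etcd's actual lease API):

```python
# Toy model of the DCS leader lock: a lease that must be renewed
# within its TTL, otherwise any node may acquire it.

class LeaderLease:
    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire_or_renew(self, node, now):
        expired = now >= self.expires_at
        if self.holder in (None, node) or expired:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds an unexpired lease

lease = LeaderLease(ttl=30)
assert lease.acquire_or_renew("node1", now=0)       # node1 becomes leader
assert not lease.acquire_or_renew("node2", now=10)  # lease still held by node1
assert lease.acquire_or_renew("node2", now=31)      # TTL expired: takeover
```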
repmgr as an Alternative to Patroni
repmgr is another popular HA tool for PostgreSQL. It provides standby management, automatic failover, and switchover capabilities. However, it takes a fundamentally different approach from Patroni. repmgr uses a witness node and a daemon (repmgrd) for failure detection rather than a distributed consensus store. This makes it simpler to deploy but more susceptible to split-brain in complex network partition scenarios.
# repmgr.conf on the primary
node_id=1
node_name='node1'
conninfo='host=10.0.1.10 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/16/main'
failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=5
reconnect_attempts=6
reconnect_interval=10
For new deployments, Patroni is the recommended choice due to its stronger split-brain prevention guarantees and more active development community. repmgr remains a reasonable option for simpler setups or organisations already invested in the tool.
Cloud Deployment Patterns
AWS Deployment: EC2, EBS, and Route53
On AWS, deploy each PostgreSQL + Patroni node on an EC2 instance with EBS gp3 or io2 volumes. Use separate instances across multiple Availability Zones for HA. etcd nodes should also span AZs.
# Terraform sketch for PostgreSQL HA on AWS
resource "aws_instance" "pg_node" {
count = 3
ami = "ami-0abcdef1234567890" # Ubuntu 22.04
instance_type = "r6g.2xlarge" # 8 vCPU, 64GB RAM
subnet_id = aws_subnet.private[count.index].id
vpc_security_group_ids = [aws_security_group.pg_sg.id]
availability_zone = element(["eu-west-1a", "eu-west-1b", "eu-west-1c"], count.index)
root_block_device {
volume_size = 50
volume_type = "gp3"
}
tags = {
Name = "pg-node-${count.index + 1}"
Role = "patroni"
}
}
resource "aws_ebs_volume" "pg_data" {
count = 3
availability_zone = element(["eu-west-1a", "eu-west-1b", "eu-west-1c"], count.index)
size = 500
type = "gp3"
iops = 6000
throughput = 250
encrypted = true
tags = {
Name = "pg-data-${count.index + 1}"
}
}
resource "aws_route53_health_check" "pg_primary" {
count = 3
ip_address = aws_instance.pg_node[count.index].private_ip
port = 8008
type = "HTTP"
resource_path = "/primary"
failure_threshold = 3
request_interval = 10
}
Use a Network Load Balancer (NLB) instead of HAProxy if you prefer an AWS-managed solution. The NLB can use target group health checks against the Patroni REST API to route traffic to the current primary.
Azure Deployment: VMs, Managed Disks, and Azure LB
On Azure, use Standard_E8s_v5 VMs (memory-optimised) with Premium SSD Managed Disks for data volumes. Deploy across Availability Zones. Azure Load Balancer provides the equivalent of HAProxy with health probes against the Patroni REST API.
# Azure CLI — create PostgreSQL VM with Managed Disk
az vm create \
--resource-group pg-ha-rg \
--name pg-node-1 \
--image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
--size Standard_E8s_v5 \
--zone 1 \
--vnet-name pg-vnet \
--subnet pg-subnet \
--nsg pg-nsg \
--admin-username pgadmin \
--ssh-key-value ~/.ssh/id_rsa.pub
az disk create \
--resource-group pg-ha-rg \
--name pg-data-1 \
--size-gb 512 \
--sku Premium_LRS \
--zone 1
az vm disk attach \
--resource-group pg-ha-rg \
--vm-name pg-node-1 \
--name pg-data-1
# Azure Load Balancer health probe for Patroni
az network lb probe create \
--resource-group pg-ha-rg \
--lb-name pg-lb \
--name patroni-primary-probe \
--protocol Http \
--port 8008 \
--path /primary \
--interval 5 \
--threshold 3
GCP Deployment: Compute Engine and Cloud Load Balancing
On GCP, use n2-highmem-8 instances (8 vCPU, 64GB RAM) with SSD Persistent Disks. Distribute across zones within a region. Use an internal TCP/UDP load balancer with Patroni health checks.
# GCP — create instance and persistent disk
gcloud compute instances create pg-node-1 \
--zone=europe-west1-b \
--machine-type=n2-highmem-8 \
--image-family=ubuntu-2204-lts \
--image-project=ubuntu-os-cloud \
--boot-disk-size=50GB \
--network=pg-network \
--subnet=pg-subnet
gcloud compute disks create pg-data-1 \
--zone=europe-west1-b \
--size=500GB \
--type=pd-ssd
gcloud compute instances attach-disk pg-node-1 \
--disk=pg-data-1 \
--zone=europe-west1-b
# Health check for Patroni primary endpoint
gcloud compute health-checks create http patroni-primary-check \
--port=8008 \
--request-path=/primary \
--check-interval=5s \
--timeout=5s \
--unhealthy-threshold=3 \
--healthy-threshold=2
Bare Metal k3s Deployment with Rancher and Longhorn
For organisations that operate their own hardware, deploying PostgreSQL HA on bare metal with k3s, Rancher, and Longhorn provides a fully open-source, cloud-independent stack. k3s is a lightweight Kubernetes distribution that runs efficiently on bare metal servers without the overhead of a full upstream Kubernetes installation.
k3s and Longhorn Setup
# Install k3s on the first server node
curl -sfL https://get.k3s.io | K3S_TOKEN=my-cluster-token \
INSTALL_K3S_EXEC="server --cluster-init --disable traefik --disable servicelb" sh -
# Join additional server nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=my-cluster-token \
K3S_URL=https://10.0.0.1:6443 \
INSTALL_K3S_EXEC="server" sh -
# Install Longhorn for distributed block storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system --create-namespace \
--set defaultSettings.defaultDataPath=/mnt/longhorn \
--set defaultSettings.replicaCount=3 \
--set defaultSettings.storageMinimalAvailablePercentage=15
# Deploy PostgreSQL with Patroni using the Zalando Postgres Operator
helm repo add postgres-operator-charts https://opensource.zalando.com/postgres-operator/charts/postgres-operator
helm install postgres-operator postgres-operator-charts/postgres-operator \
--namespace postgres-system --create-namespace
# PostgreSQL cluster manifest for the Zalando Postgres Operator
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
name: pg-ha-cluster
namespace: production
spec:
teamId: "platform"
numberOfInstances: 3
volume:
size: 500Gi
storageClass: longhorn
users:
appuser:
- superuser
- createdb
replicator: []
databases:
appdb: appuser
postgresql:
version: "16"
parameters:
shared_buffers: "16GB"
effective_cache_size: "48GB"
work_mem: "256MB"
maintenance_work_mem: "2GB"
max_connections: "200"
max_wal_senders: "10"
wal_level: replica
synchronous_commit: "on"
wal_keep_size: "2GB"
archive_mode: "on"
track_commit_timestamp: "on"
patroni:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
synchronous_mode: true
resources:
requests:
cpu: "4"
memory: 32Gi
limits:
cpu: "8"
memory: 64Gi
Keepalived for HAProxy VIP
# /etc/keepalived/keepalived.conf on lb1
vrrp_script chk_haproxy {
script "killall -0 haproxy"
interval 2
weight 2
}
vrrp_instance VI_PG {
state MASTER
interface eth0
virtual_router_id 52
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass pgha_vip_pass
}
virtual_ipaddress {
10.0.0.100/24
}
track_script {
chk_haproxy
}
}
Monitoring PostgreSQL Replication
Monitoring replication health is non-negotiable in production. PostgreSQL provides several built-in views for this purpose, and Prometheus with Grafana provides the long-term visibility and alerting you need.
Built-in Monitoring Queries
-- Check replication status on the primary
SELECT
client_addr,
application_name,
state,
sync_state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replication_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
-- Check replication slot status
SELECT
slot_name,
slot_type,
active,
wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS slot_lag
FROM pg_replication_slots;
-- Check standby recovery status (run on standby)
SELECT
pg_is_in_recovery() AS is_standby,
pg_last_wal_receive_lsn() AS last_received,
pg_last_wal_replay_lsn() AS last_replayed,
pg_last_xact_replay_timestamp() AS last_replayed_timestamp,
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int AS replay_lag_seconds;
-- Monitor WAL generation rate
SELECT
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal_generated,
pg_size_pretty(sum(size)) AS wal_directory_size
FROM pg_ls_waldir();
-- Check for long-running queries that could block replication
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle'
ORDER BY duration DESC;
Prometheus and Grafana Stack
The postgres_exporter exposes PostgreSQL metrics in Prometheus format. Combined with the patroni_exporter, you get full visibility into both database performance and HA cluster state.
# Deploy postgres_exporter as a sidecar or standalone
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Custom queries for postgres_exporter
# /etc/postgres_exporter/queries.yaml
pg_replication_lag:
query: |
SELECT
CASE WHEN pg_is_in_recovery() THEN
EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float
ELSE 0 END AS lag_seconds
master: true
metrics:
- lag_seconds:
usage: "GAUGE"
description: "Replication lag in seconds"
pg_replication_slots:
query: |
SELECT
slot_name,
active::int AS active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)::float AS slot_lag_bytes
FROM pg_replication_slots
master: true
metrics:
- slot_name:
usage: "LABEL"
- active:
usage: "GAUGE"
description: "Whether the slot is active"
- slot_lag_bytes:
usage: "GAUGE"
description: "Slot lag in bytes"
# PrometheusRule for PostgreSQL HA alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: postgresql-ha-alerts
namespace: monitoring
spec:
groups:
- name: postgresql-replication
rules:
- alert: PostgreSQLReplicationLagHigh
expr: pg_replication_lag_seconds > 30
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL replication lag exceeds 30s on {{ $labels.instance }}"
- alert: PostgreSQLReplicationSlotInactive
expr: pg_replication_slots_active == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Replication slot {{ $labels.slot_name }} is inactive"
- alert: PostgreSQLReplicationSlotLagHigh
expr: pg_replication_slots_slot_lag_bytes > 1073741824
for: 10m
labels:
severity: warning
annotations:
summary: "Replication slot lag exceeds 1GB on {{ $labels.slot_name }}"
- alert: PatroniClusterUnhealthy
expr: patroni_cluster_members_count < 3
for: 2m
labels:
severity: critical
annotations:
summary: "Patroni cluster has fewer than 3 members"
Production Tuning Parameters
PostgreSQL's default configuration is conservative, tuned for a small shared hosting environment. Production HA clusters require careful tuning of replication, memory, and WAL parameters. The following configuration summarises the most important settings for a 64GB RAM server with NVMe storage.
# postgresql.conf — Production HA tuning
# === Replication ===
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 4GB
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
track_commit_timestamp = on
wal_log_hints = on
# === WAL ===
min_wal_size = 1GB
max_wal_size = 8GB
wal_buffers = 64MB
wal_compression = zstd
archive_mode = on
archive_timeout = 300
checkpoint_completion_target = 0.9
checkpoint_timeout = 15min
# === Memory ===
shared_buffers = 16GB # 25% of RAM
effective_cache_size = 48GB # 75% of RAM
work_mem = 256MB # per-operation sort/hash memory
maintenance_work_mem = 2GB # for VACUUM, CREATE INDEX
huge_pages = try
# === Connections ===
max_connections = 200 # use PgBouncer for higher client counts
superuser_reserved_connections = 5
# === Query Performance ===
random_page_cost = 1.1 # SSD storage
effective_io_concurrency = 200 # NVMe SSD
default_statistics_target = 500
jit = on
# === Logging ===
log_min_duration_statement = 500 # log queries > 500ms
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
# === Autovacuum ===
autovacuum_max_workers = 4
autovacuum_naptime = 30s
autovacuum_vacuum_cost_limit = 2000
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.02
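The memory settings above follow simple ratios of physical RAM. A sketch of that sizing rule (these ratios are common starting points, not hard rules, so always validate against your workload):

```python
# Common starting ratios for PostgreSQL memory tuning:
# shared_buffers ~25% of RAM, effective_cache_size ~75% of RAM.

def memory_settings_gb(ram_gb):
    return {
        "shared_buffers": ram_gb // 4,            # ~25% of RAM
        "effective_cache_size": ram_gb * 3 // 4,  # ~75% of RAM
    }

print(memory_settings_gb(64))
# {'shared_buffers': 16, 'effective_cache_size': 48} -> matches the config above
```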
Tuning Patroni DCS Parameters
The relationship between Patroni's ttl, loop_wait, and retry_timeout parameters directly impacts failover speed and false-positive risk. A shorter TTL means faster failover detection but increases the risk of unnecessary failovers during brief network blips.
# Conservative (production default)
ttl: 30
loop_wait: 10
retry_timeout: 10
# Failover detection: ~30-40 seconds
# Aggressive (low-latency failover)
ttl: 15
loop_wait: 5
retry_timeout: 5
# Failover detection: ~15-20 seconds
# Warning: Higher risk of false failovers in unstable networks
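The detection windows quoted above follow from a rough model: the leader key survives up to ttl seconds after its last renewal, and standbys act on the expiry within about one loop_wait cycle. This is an approximation of Patroni's timing, not its exact internals:

```python
# Rough model of failover detection time: the leader key can live up
# to ttl seconds past the last renewal, plus up to one loop_wait cycle
# before standbys react to the expiry.

def detection_window_s(ttl, loop_wait):
    return ttl, ttl + loop_wait  # (best case, rough worst case)

print(detection_window_s(30, 10))  # (30, 40)  conservative profile
print(detection_window_s(15, 5))   # (15, 20)  aggressive profile
```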
Chaos Engineering and Failover Testing
A failover system that has never been tested is a system that does not work. Chaos engineering applies controlled failures to validate that your HA setup behaves correctly under real failure conditions. Every Patroni cluster should be subjected to regular failover drills.
Failover Test Playbook
# 1. Verify cluster health before testing
patronictl -c /etc/patroni/patroni.yml list
+--------+-----------+---------+----+-----------+
| Member | Host      | Role    | TL | Lag in MB |
+--------+-----------+---------+----+-----------+
| node1  | 10.0.1.10 | Leader  |  5 |           |
| node2  | 10.0.1.11 | Replica |  5 |         0 |
| node3  | 10.0.1.12 | Replica |  5 |         0 |
+--------+-----------+---------+----+-----------+
# 2. Simulate primary crash (on node1)
sudo systemctl stop patroni
# Or more aggressive: sudo kill -9 $(pgrep -f patroni)
# 3. Monitor failover (from any node with patronictl)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# 4. Verify new leader is elected (within 30-45 seconds)
# Expected: node2 or node3 promoted to Leader
# 5. Test write availability through HAProxy
PGPASSWORD=app_password psql -h haproxy-host -p 5000 -U appuser -d appdb \
-c "INSERT INTO health_check (ts) VALUES (now()) RETURNING *;"
# 6. Restart the former primary
sudo systemctl start patroni
# Patroni will use pg_rewind (if use_pg_rewind is enabled) to rejoin it as a replica
# 7. Verify the former primary rejoins as replica
patronictl -c /etc/patroni/patroni.yml list
Network Partition Testing
# Simulate network partition on the primary using iptables
# Block all traffic to etcd from the primary
sudo iptables -A OUTPUT -d 10.0.2.10 -j DROP
sudo iptables -A OUTPUT -d 10.0.2.11 -j DROP
sudo iptables -A OUTPUT -d 10.0.2.12 -j DROP
# Expected behaviour:
# 1. Primary loses DCS access
# 2. Leader key TTL expires
# 3. Primary demotes itself (with watchdog, node may reboot)
# 4. Standby acquires leader lock and promotes
# 5. After clearing iptables rules, former primary rejoins as replica
# Clean up
sudo iptables -D OUTPUT -d 10.0.2.10 -j DROP
sudo iptables -D OUTPUT -d 10.0.2.11 -j DROP
sudo iptables -D OUTPUT -d 10.0.2.12 -j DROP
Automated Chaos Testing with Toxiproxy
# Run Toxiproxy alongside your Patroni cluster
# Create proxies for etcd and replication connections
toxiproxy-cli create etcd_proxy -l 0.0.0.0:12379 -u 10.0.2.10:2379
toxiproxy-cli create pg_repl_proxy -l 0.0.0.0:15432 -u 10.0.1.10:5432
# Add latency to etcd connections (simulates degraded network)
toxiproxy-cli toxic add etcd_proxy -t latency -a latency=500 -a jitter=200
# Add bandwidth limit to replication (simulates WAN replication)
toxiproxy-cli toxic add pg_repl_proxy -t bandwidth -a rate=1024
# Completely sever the connection (simulates network partition)
toxiproxy-cli toxic add etcd_proxy -t timeout -a timeout=0
# Monitor Patroni behaviour and verify correct failover
watch -n 2 'curl -s http://10.0.1.10:8008/patroni | python3 -m json.tool'
Continuous Validation Script
#!/bin/bash
# continuous_ha_check.sh — Run during chaos tests to measure availability
HAPROXY_HOST="10.0.0.100"
WRITE_PORT=5000
READ_PORT=5001
DATABASE="appdb"
USER="appuser"
LOGFILE="/var/log/ha_test_$(date +%Y%m%d_%H%M%S).log"
write_count=0
write_fail=0
read_count=0
read_fail=0
while true; do
ts=$(date '+%Y-%m-%d %H:%M:%S.%3N')
# Test write path
if PGPASSWORD=app_password psql -h $HAPROXY_HOST -p $WRITE_PORT \
-U $USER -d $DATABASE -c "SELECT 1" &>/dev/null; then
((write_count++))
else
((write_fail++))
echo "$ts WRITE_FAIL total_fails=$write_fail" >> $LOGFILE
fi
# Test read path
if PGPASSWORD=app_password psql -h $HAPROXY_HOST -p $READ_PORT \
-U $USER -d $DATABASE -c "SELECT 1" &>/dev/null; then
((read_count++))
else
((read_fail++))
echo "$ts READ_FAIL total_fails=$read_fail" >> $LOGFILE
fi
total=$((write_count + write_fail))
if (( total % 100 == 0 )); then
write_avail=$(echo "scale=2; $write_count * 100 / $total" | bc)
read_total=$((read_count + read_fail))
read_avail=$(echo "scale=2; $read_count * 100 / $read_total" | bc)
echo "$ts Writes: ${write_avail}% ($write_count/$total) Reads: ${read_avail}% ($read_count/$read_total)"
fi
sleep 0.5
done
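After a chaos run, the log produced by the script above can be reduced to an availability figure. This post-processing sketch (a hypothetical helper, assuming the WRITE_FAIL/READ_FAIL line format above and the 0.5 s probe interval) estimates approximate downtime per path:

```python
def summarise_ha_log(lines, probe_interval=0.5):
    """Count failed probes per path in a continuous_ha_check.sh log and
    convert them to approximate downtime in seconds (each failed probe
    represents roughly one probe interval of unavailability)."""
    write_fails = sum(1 for line in lines if " WRITE_FAIL " in line)
    read_fails = sum(1 for line in lines if " READ_FAIL " in line)
    return {
        "write_downtime_s": write_fails * probe_interval,
        "read_downtime_s": read_fails * probe_interval,
    }

# Illustrative log lines in the format emitted by the script above
sample = [
    "2026-04-12 10:00:01.123 WRITE_FAIL total_fails=1",
    "2026-04-12 10:00:01.654 WRITE_FAIL total_fails=2",
    "2026-04-12 10:00:02.201 READ_FAIL total_fails=1",
]
print(summarise_ha_log(sample))
# {'write_downtime_s': 1.0, 'read_downtime_s': 0.5}
```

Comparing these numbers before and after a tuning change (for example, the aggressive DCS profile) quantifies what you actually bought.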
Advanced: Cascading Replication and Delayed Standbys
For large clusters, cascading replication reduces the load on the primary. Instead of all standbys replicating directly from the primary, some standbys replicate from other standbys. This creates a tree topology where the primary feeds two standbys, and those standbys feed additional downstream standbys.
# postgresql.auto.conf on a cascading standby
primary_conninfo = 'host=standby1-host port=5432 user=replicator application_name=cascade1'
primary_slot_name = 'cascade1_slot'
A delayed standby intentionally applies WAL records with a time delay — typically 1-4 hours. This provides a defence against logical errors (accidental deletes, bad migrations) that are immediately replicated to synchronous standbys. If a disaster occurs, you can stop WAL replay on the delayed standby and recover data from before the error.
# postgresql.conf on delayed standby
recovery_min_apply_delay = '1h'
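The operational value of a delayed standby depends on reacting in time: a change made at time T is replayed on the standby at roughly T plus the configured delay, and you must run `SELECT pg_wal_replay_pause();` on the standby before that moment. A small illustrative helper (hypothetical, timestamps are examples) computes the response deadline:

```python
from datetime import datetime, timedelta

def replay_pause_deadline(error_time: datetime, apply_delay: timedelta) -> datetime:
    """Latest moment at which `SELECT pg_wal_replay_pause();` can still
    stop a change made at `error_time` from being applied on a standby
    configured with recovery_min_apply_delay = apply_delay."""
    return error_time + apply_delay

mistake = datetime(2026, 4, 12, 14, 30)  # accidental DELETE ran at 14:30
deadline = replay_pause_deadline(mistake, timedelta(hours=1))
print(deadline.isoformat())  # 2026-04-12T15:30:00
```

Once replay is paused, you can extract the lost rows with pg_dump from the delayed standby, or resume replay with `SELECT pg_wal_replay_resume();` after the investigation.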
Connection String Strategies
Applications connecting to a Patroni-managed cluster should either connect through HAProxy or use PostgreSQL's built-in multi-host connection string with target_session_attrs. The latter provides client-side failover without depending on a load balancer.
# Multi-host connection string with target_session_attrs
# The client tries each host in order and connects to the one matching the target attribute
postgresql://appuser:password@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=read-write&sslmode=require
# For read-only connections
postgresql://appuser:password@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=prefer-standby&sslmode=require
This approach works well for applications that cannot easily be reconfigured to point to an HAProxy VIP. The PostgreSQL client library (libpq) handles the failover transparently.
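When the host list comes from inventory or service discovery, building the multi-host URI programmatically avoids copy-paste errors. A minimal sketch (hypothetical helper; host names are placeholders and credentials are supplied elsewhere, e.g. via a password file):

```python
def multihost_dsn(hosts, dbname, user, attrs="read-write", port=5432):
    """Build a libpq multi-host URI. libpq tries hosts left to right and
    keeps the first connection whose session matches target_session_attrs."""
    hostlist = ",".join(f"{h}:{port}" for h in hosts)
    return (f"postgresql://{user}@{hostlist}/{dbname}"
            f"?target_session_attrs={attrs}&sslmode=require")

print(multihost_dsn(["node1", "node2", "node3"], "appdb", "appuser"))
# postgresql://appuser@node1:5432,node2:5432,node3:5432/appdb?target_session_attrs=read-write&sslmode=require
```

Any libpq-based driver (psql, psycopg, the Go and Rust wrappers around libpq) accepts the resulting string unchanged.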
Security Hardening
A production PostgreSQL HA cluster must enforce encryption in transit and at rest, use strong authentication, and limit network exposure.
# Enable TLS in postgresql.conf
ssl = on
ssl_cert_file = '/etc/postgresql/certs/server.crt'
ssl_key_file = '/etc/postgresql/certs/server.key'
ssl_ca_file = '/etc/postgresql/certs/ca.crt'
ssl_min_protocol_version = 'TLSv1.3'
# Require TLS for all connections in pg_hba.conf
hostssl replication replicator 10.0.0.0/16 scram-sha-256
hostssl all all 10.0.0.0/16 scram-sha-256
# etcd TLS
# In Patroni config
etcd3:
hosts:
- 10.0.2.10:2379
- 10.0.2.11:2379
- 10.0.2.12:2379
protocol: https
cacert: /etc/patroni/certs/etcd-ca.crt
cert: /etc/patroni/certs/etcd-client.crt
key: /etc/patroni/certs/etcd-client.key
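Provisioning the certificates referenced above is often the fiddly part. The following sketch generates a throwaway private CA and a CA-signed server certificate with openssl; the paths and the CN are placeholder assumptions, and in production you would use your organisation's PKI rather than a self-signed CA:

```shell
set -e
CERTDIR=$(mktemp -d)

# 1. Self-signed CA (key + certificate)
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=pg-ha-ca" \
  -keyout "$CERTDIR/ca.key" -out "$CERTDIR/ca.crt"

# 2. Server key and certificate signing request
openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=node1.example.internal" \
  -keyout "$CERTDIR/server.key" -out "$CERTDIR/server.csr"

# 3. Sign the server certificate with the CA
openssl x509 -req -days 825 -in "$CERTDIR/server.csr" \
  -CA "$CERTDIR/ca.crt" -CAkey "$CERTDIR/ca.key" -CAcreateserial \
  -out "$CERTDIR/server.crt"

# PostgreSQL refuses to start with a key readable by group/other
chmod 600 "$CERTDIR/server.key" "$CERTDIR/ca.key"
ls "$CERTDIR"
```

Copy server.crt, server.key, and ca.crt into the paths configured in postgresql.conf, and repeat the process (or reuse the CA) for the etcd client certificates in the Patroni config.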
Backup Strategy for HA Clusters
A comprehensive backup strategy for Patroni clusters should include continuous WAL archiving, regular full backups, and differential or incremental backups between full backups. pgBackRest is the recommended tool for production PostgreSQL backup management.
# Schedule backups via cron
# Full backup weekly (Sunday 2 AM)
0 2 * * 0 pgbackrest --stanza=pg-ha-cluster --type=full backup
# Differential backup daily (2 AM, Mon-Sat)
0 2 * * 1-6 pgbackrest --stanza=pg-ha-cluster --type=diff backup
# Show detailed information for the most recent backup
pgbackrest --stanza=pg-ha-cluster --set=latest info
# Verify repository and backup file integrity
pgbackrest --stanza=pg-ha-cluster verify
# List all backups
pgbackrest --stanza=pg-ha-cluster info
full backup: 20260412-020000F
timestamp: 2026-04-12 02:00:00 +0000
wal start/stop: 000000050000000000000040 / 000000050000000000000042
database size: 150GB, backup size: 150GB
repository size: 45GB (compressed)
diff backup: 20260412-020000F_20260413-020000D
timestamp: 2026-04-13 02:00:00 +0000
database size: 151GB, backup size: 2.1GB
repository size: 650MB (compressed)
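Backups are only useful if old ones are expired predictably. A sketch of the matching retention settings in pgbackrest.conf, following the weekly-full/daily-diff schedule above (the retention counts, cipher choice, and data path are illustrative assumptions):

```ini
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-retention-full=4          # keep 4 full backups (~1 month at weekly fulls)
repo1-retention-diff=7          # keep 7 differential backups
repo1-cipher-type=aes-256-cbc   # encrypt the repository (requires repo1-cipher-pass)
compress-type=zst               # zstd compression

[pg-ha-cluster]
pg1-path=/var/lib/postgresql/data
```

Expiration runs automatically at the end of each backup, so with this schedule the repository size stays bounded without a separate cleanup job.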
Operational Runbook Summary
Every team running a Patroni cluster should maintain a runbook covering the following scenarios. Having documented, tested procedures transforms a stressful outage into a routine operation.
# === Quick Reference Commands ===
# Cluster status
patronictl -c /etc/patroni/patroni.yml list
patronictl -c /etc/patroni/patroni.yml history
# Planned switchover
patronictl -c /etc/patroni/patroni.yml switchover --master node1 --candidate node2
# Restart PostgreSQL on a specific node (rolling restart)
patronictl -c /etc/patroni/patroni.yml restart pg-ha-cluster node2
# Reload PostgreSQL configuration without restart
patronictl -c /etc/patroni/patroni.yml reload pg-ha-cluster
# Pause automatic failover (during maintenance)
patronictl -c /etc/patroni/patroni.yml pause
# Resume automatic failover
patronictl -c /etc/patroni/patroni.yml resume
# Edit DCS configuration (applies to all nodes)
patronictl -c /etc/patroni/patroni.yml edit-config
# Reinitialise a failed replica
patronictl -c /etc/patroni/patroni.yml reinit pg-ha-cluster node3
# Check Patroni REST API directly
curl -s http://10.0.1.10:8008/patroni | python3 -m json.tool
curl -s http://10.0.1.10:8008/cluster | python3 -m json.tool
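The /cluster endpoint returns a JSON document with a members list, which is easier to script against than patronictl's table output. A sketch of extracting the current leader (the sample payload is illustrative; the field names match the endpoint's format):

```python
import json

def cluster_leader(payload: str):
    """Return the name of the member whose role is 'leader' in a
    Patroni /cluster response, or None if no leader is present."""
    for member in json.loads(payload).get("members", []):
        if member.get("role") == "leader":
            return member["name"]
    return None

# Illustrative /cluster response after a failover to node2
sample = json.dumps({"members": [
    {"name": "node1", "role": "replica", "state": "streaming", "lag": 0},
    {"name": "node2", "role": "leader",  "state": "running"},
    {"name": "node3", "role": "replica", "state": "streaming", "lag": 0},
]})
print(cluster_leader(sample))  # node2
```

The same pattern works for alerting: if `cluster_leader` returns None for longer than the configured ttl, the cluster is leaderless and a page should fire.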
Performance Benchmarking
Before going to production, benchmark your HA cluster to establish baseline performance and verify that synchronous replication latency is acceptable for your workload.
# Benchmark with pgbench — initialise test data
pgbench -i -s 100 -h haproxy-host -p 5000 -U appuser appdb
# Run write-heavy benchmark (measures sync replication impact)
pgbench -h haproxy-host -p 5000 -U appuser -c 32 -j 8 -T 300 appdb
# Compare with async: temporarily set synchronous_commit = off
# Run read-only benchmark through read replica port
pgbench -h haproxy-host -p 5001 -U appuser -c 64 -j 16 -T 300 -S appdb
# Measure failover impact on transactions
# Run pgbench in background, then trigger a failover
pgbench -h haproxy-host -p 5000 -U appuser -c 8 -j 4 -T 600 appdb &
sleep 60 && patronictl -c /etc/patroni/patroni.yml switchover --master node1 --candidate node2 --force
Conclusion
PostgreSQL native high availability with Patroni is not a single-tool solution — it is an integrated system of streaming replication, distributed consensus, connection routing, connection pooling, WAL archiving, monitoring, and operational discipline. Each layer addresses a specific failure mode: streaming replication handles data redundancy, Patroni handles automatic failover coordination, etcd provides the distributed consensus needed for leader election without split-brain, HAProxy routes connections to the correct leader, PgBouncer manages connection overhead at scale, and WAL archiving with pgBackRest provides the last line of defence against logical errors and disaster recovery.
The deployment patterns vary across environments — AWS with NLB and Route53, Azure with Availability Zones and Azure Load Balancer, GCP with regional managed instance groups, or bare metal with k3s, Longhorn, and Keepalived — but the core architecture remains the same. Three or more PostgreSQL nodes managed by Patroni, backed by a three-node etcd cluster, fronted by a load balancer that follows Patroni's health check endpoints.
The most important investment you can make is not in configuration — it is in testing. Run failover drills monthly. Inject network partitions. Kill processes unexpectedly. Measure recovery time and data loss. Build dashboards that show replication lag, WAL generation rate, connection pool saturation, and DCS health in real time. The confidence you gain from systematic testing is what separates a cluster that survives its first real outage from one that turns a server failure into a business-impacting incident.
PostgreSQL gives you all the replication primitives. Patroni gives you the orchestration. etcd gives you the consensus. Your job is to wire them together correctly, tune them for your workload, and validate them continuously. This guide has given you the blueprints — now build, test, and operate with confidence.