Production Database Backup Strategies That Actually Work: MySQL and PostgreSQL
Losing production data is the kind of incident that ends careers, shutters companies, and makes the front page of Hacker News for all the wrong reasons. Yet most teams treat database backups as an afterthought — a cron job somebody set up two years ago that nobody has verified since. This article lays out battle-tested backup strategies for MySQL and PostgreSQL, proven in production environments ranging from single-node k3s clusters to multi-region cloud deployments processing millions of transactions daily.
We will cover the full spectrum: logical and physical backup methods, point-in-time recovery, automated scheduling, cloud storage targets, encryption, verification, Kubernetes-native approaches, and the disaster recovery planning that ties it all together. Every recommendation comes with concrete configuration and code you can adapt to your environment.
Understanding Backup Types
Before diving into specific tools, you need a clear mental model of the four fundamental backup strategies and how they interact with recovery objectives. Each strategy makes a different trade-off between backup speed, storage consumption, and recovery time.
A full backup captures the entire database at a single point in time. It is the simplest to reason about and the fastest to restore, but also the most expensive in storage and the slowest to create. An incremental backup captures only the data that has changed since the last backup of any type. It is fast to create and compact in storage, but restoration requires replaying the full chain: last full backup plus every subsequent incremental. A differential backup captures everything that has changed since the last full backup. It occupies a middle ground — larger than an incremental but simpler to restore because you only need the last full plus the latest differential. Finally, continuous archiving (WAL archiving in PostgreSQL, binary log streaming in MySQL) captures every individual transaction as it happens, enabling point-in-time recovery to any moment between backups.
The optimal strategy for most production systems is a combination: weekly full backups, daily incrementals or differentials, and continuous WAL/binlog archiving. This gives you both fast recovery from recent full backups and the ability to recover to any point in time when you need it.
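To make the restore-chain trade-off concrete, here is a small illustrative shell sketch (the file names are hypothetical) showing which backups a Thursday restore needs under each strategy:

```shell
# Hypothetical week of backups: a Sunday full, then daily incrementals
# (changes since the previous backup) or differentials (changes since the full)
FULL="full-sun"
INCREMENTALS="incr-mon incr-tue incr-wed incr-thu"
DIFFERENTIALS="diff-mon diff-tue diff-wed diff-thu"

# Incremental restore: the full plus EVERY incremental, replayed in order
incr_chain="$FULL $INCREMENTALS"
echo "incremental restore chain: $incr_chain"

# Differential restore: the full plus ONLY the latest differential
latest_diff="${DIFFERENTIALS##* }"
echo "differential restore chain: $FULL $latest_diff"
```

The incremental chain is shorter to create each night but longer to replay; the differential chain is the opposite, which is why mixing both with continuous archiving covers most recovery scenarios.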
MySQL Backup Methods
MySQL offers several backup tools, each suited to different database sizes and recovery requirements. The right choice depends on your data volume, acceptable backup window, and RTO/RPO targets.
mysqldump — The Universal Logical Backup
mysqldump produces SQL statements that recreate the schema and data. It works on every MySQL version and storage engine, making it the universal fallback. However, it locks tables during the dump (unless using --single-transaction with InnoDB), and restoration speed degrades significantly for databases beyond 50–100 GB because it replays individual INSERT statements.
# Full logical backup with consistent snapshot for InnoDB
mysqldump --single-transaction --routines --triggers --events \
--set-gtid-purged=ON --all-databases \
| gzip > /backups/mysql-full-$(date +%Y%m%d-%H%M%S).sql.gz
# Single database backup with compression
mysqldump --single-transaction --routines --triggers \
--databases production_db \
| pigz -p4 > /backups/production_db-$(date +%Y%m%d).sql.gz
# Schema-only backup for migration planning
mysqldump --no-data --routines --triggers --events \
--all-databases > /backups/schema-only-$(date +%Y%m%d).sql
mysqlpump — Parallel Logical Backup
mysqlpump improves on mysqldump with parallel table dumping and built-in compression, which can significantly reduce backup time for databases with many independent tables. Note, however, that mysqlpump is deprecated as of MySQL 8.0.34 and slated for removal in a future release, so treat it as a legacy option for new deployments.
# Parallel logical backup with 4 threads and zstd compression
mysqlpump --default-parallelism=4 --compress-output=ZSTD \
--include-databases=production_db,analytics_db \
--set-gtid-purged=ON \
> /backups/mysql-pump-$(date +%Y%m%d).sql.zst
Percona XtraBackup — Physical Backups for InnoDB
For databases larger than 50 GB, logical backups become impractical — both the backup window and the restore time grow linearly with data size. Percona XtraBackup takes physical-level copies of InnoDB data files without locking the database, making it suitable for multi-terabyte deployments.
# Full physical backup with streaming to compressed archive
xtrabackup --backup --target-dir=/backups/full-$(date +%Y%m%d) \
--user=backup_user --password=secure_pass \
--parallel=4 --compress --compress-threads=4
# Incremental backup based on last full
xtrabackup --backup --target-dir=/backups/incr-$(date +%Y%m%d) \
--incremental-basedir=/backups/full-20260412 \
--user=backup_user --password=secure_pass \
--parallel=4
# Stream full backup directly to S3 via xbstream
xtrabackup --backup --stream=xbstream --compress \
--user=backup_user --password=secure_pass | \
aws s3 cp - s3://db-backups/mysql/full-$(date +%Y%m%d).xbstream
# Decompress first if the backup was taken with --compress
xtrabackup --decompress --target-dir=/backups/full-20260412
# Prepare a standalone full backup for restore (apply redo log)
xtrabackup --prepare --target-dir=/backups/full-20260412
# Prepare incremental on top of full: the full must first be prepared with
# --apply-log-only, otherwise incrementals can no longer be applied to it
xtrabackup --prepare --apply-log-only --target-dir=/backups/full-20260412
xtrabackup --prepare --target-dir=/backups/full-20260412 \
--incremental-dir=/backups/incr-20260412
# Restore
systemctl stop mysqld
xtrabackup --copy-back --target-dir=/backups/full-20260412
chown -R mysql:mysql /var/lib/mysql
systemctl start mysqld
MySQL Binary Log — Point-in-Time Recovery
MySQL binary logs record every data-modifying statement or row change. When combined with a full or physical backup, they enable point-in-time recovery to any moment after the backup was taken.
# Enable binary logging in my.cnf
[mysqld]
server-id = 1
log-bin = /var/log/mysql/mysql-bin
binlog_format = ROW
binlog_expire_logs_seconds = 604800 # 7 days
sync_binlog = 1
gtid_mode = ON
enforce_gtid_consistency = ON
# Flush and archive binary logs
mysqladmin flush-logs
mysqlbinlog --read-from-remote-server --host=db-primary \
--raw --stop-never --result-file=/backup/binlogs/ \
mysql-bin.000042
# Point-in-time recovery: replay binlog up to specific timestamp
mysqlbinlog --stop-datetime="2026-04-12 14:30:00" \
/backup/binlogs/mysql-bin.000042 \
/backup/binlogs/mysql-bin.000043 | mysql -u root
PostgreSQL Backup Methods
PostgreSQL's backup ecosystem is arguably richer than MySQL's, with first-party tools and a mature set of community solutions purpose-built for large-scale production environments.
pg_dump and pg_dumpall — Logical Backups
Like mysqldump, pg_dump produces logical backups. The custom format (-Fc) is the recommended default because it supports parallel restore, selective table restoration, and built-in compression.
# Custom format (single-process dump; parallel dump requires directory format)
pg_dump -Fc -f /backups/production-$(date +%Y%m%d).dump production_db
# Directory format for maximum parallelism on large databases
pg_dump -Fd -j8 -f /backups/production-$(date +%Y%m%d)/ production_db
# All databases including globals (roles, tablespaces)
pg_dumpall > /backups/pg-all-$(date +%Y%m%d).sql
# Parallel restore from custom format
pg_restore -j4 -d production_db_restored /backups/production-20260412.dump
# Selective restore: single table
pg_restore -j4 -d production_db -t orders /backups/production-20260412.dump
pg_basebackup — Physical Backup Foundation
pg_basebackup takes a physical copy of the entire PostgreSQL data directory. It is the foundation for both standalone recovery and streaming replication setup. Combined with WAL archiving, it enables point-in-time recovery.
# Physical backup with WAL files included
pg_basebackup -D /backups/base-$(date +%Y%m%d) \
-Ft -z -Xs -P -c fast \
-U replication_user -h db-primary
# Stream backup directly to a compressed tar archive
# (WAL streaming is not possible when writing to stdout, so fetch WAL instead)
pg_basebackup -D - -Ft -Xf -c fast \
-U replication_user -h db-primary | \
gzip > /backups/pg-base-$(date +%Y%m%d).tar.gz
pgBackRest — Enterprise-Grade Backup Management
pgBackRest is the gold standard for PostgreSQL backup management. It supports full, incremental, and differential backups, parallel backup and restore, encryption, multi-repository targets (local disk, S3, GCS, Azure Blob), and automated WAL archiving — all through a single, cohesive configuration.
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=pg-backups-production
repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
repo1-s3-region=eu-west-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=a_very_secure_encryption_passphrase
# Second repository for geographic redundancy
repo2-type=azure
repo2-azure-container=pg-backups-dr
repo2-azure-account=prodbackupstorage
repo2-azure-key=base64encodedkeyhere
repo2-path=/pgbackrest
repo2-retention-full=2
process-max=4
compress-type=zst
compress-level=6
log-level-console=info
log-level-file=detail
start-fast=y
[production]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
# PostgreSQL WAL archive configuration (postgresql.conf)
# archive_mode = on
# archive_command = 'pgbackrest --stanza=production archive-push %p'
# Create the stanza
pgbackrest --stanza=production stanza-create
# Verify the stanza configuration
pgbackrest --stanza=production check
# Full backup
pgbackrest --stanza=production --type=full backup
# Differential backup
pgbackrest --stanza=production --type=diff backup
# Incremental backup
pgbackrest --stanza=production --type=incr backup
# List backups
pgbackrest --stanza=production info
WAL-G — Lightweight WAL Archiving to Cloud Storage
WAL-G is a simpler alternative to pgBackRest that focuses on streaming WAL segments and base backups to cloud object storage. It is popular in containerised environments where you want a single binary with minimal configuration.
# Environment variables for WAL-G with S3
export WALG_S3_PREFIX=s3://pg-wal-archive/production
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_REGION=eu-west-1
export WALG_COMPRESSION_METHOD=lz4
export PGHOST=/var/run/postgresql
# Configure WAL archiving in postgresql.conf
# archive_mode = on
# archive_command = 'wal-g wal-push %p'
# At recovery time, set: restore_command = 'wal-g wal-fetch %f %p'
# Take a base backup
wal-g backup-push /var/lib/postgresql/16/main
# List backups
wal-g backup-list
# Restore from latest backup
wal-g backup-fetch /var/lib/postgresql/16/main LATEST
# Delete old backups (retain last 4)
wal-g delete retain FULL 4 --confirm
Barman — Centralised Backup Server
Barman (Backup and Recovery Manager) is designed for environments where a dedicated backup server manages backups for multiple PostgreSQL instances. It supports both rsync/SSH and streaming replication protocols for backup transport.
# /etc/barman.d/production.conf
[production]
description = "Production PostgreSQL 16"
ssh_command = ssh postgres@db-primary
conninfo = host=db-primary user=barman dbname=postgres
streaming_conninfo = host=db-primary user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 14 DAYS
# Create replication slot and start streaming
barman receive-wal --create-slot production
barman switch-wal --force --archive production
# Take a backup
barman backup production
# List backups
barman list-backup production
# Restore to point in time
barman recover --target-time "2026-04-12 14:30:00" \
production 20260412T120000 /var/lib/postgresql/16/main
Point-in-Time Recovery (PITR)
Point-in-time recovery is the most important capability in your backup toolkit. It allows you to restore a database to the exact state it was in at any specific moment — not just when backups were taken, but any second between backups. This is critical for recovering from accidental data deletion, application bugs that corrupt data, and security incidents where you need to identify the exact moment of compromise.
PITR for PostgreSQL
PostgreSQL PITR works by restoring a base backup and replaying WAL segments up to the target timestamp. With pgBackRest, the entire process is a single command.
# Restore to specific point in time with pgBackRest
pgbackrest --stanza=production \
--type=time --target="2026-04-12 16:42:30" \
--target-action=promote \
restore
# Manual PITR using pg_basebackup + WAL archive
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Clear the data directory and restore base backup
rm -rf /var/lib/postgresql/16/main/*
tar xzf /backups/pg-base-20260412.tar.gz -C /var/lib/postgresql/16/main/
# 3. Create recovery signal and configure restore
cat > /var/lib/postgresql/16/main/postgresql.auto.conf << 'CONF'
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-12 16:42:30'
recovery_target_action = 'promote'
CONF
touch /var/lib/postgresql/16/main/recovery.signal
# 4. Start PostgreSQL — it will replay WAL and stop at target
systemctl start postgresql
PITR for MySQL
MySQL PITR combines an XtraBackup physical restore with binary log replay.
# 1. Restore the XtraBackup
systemctl stop mysqld
rm -rf /var/lib/mysql/*
xtrabackup --prepare --target-dir=/backups/full-20260412
xtrabackup --copy-back --target-dir=/backups/full-20260412
chown -R mysql:mysql /var/lib/mysql
systemctl start mysqld
# 2. Identify the binlog position from XtraBackup metadata
cat /backups/full-20260412/xtrabackup_binlog_info
# Output: mysql-bin.000042 154 3E11FA47-1111-1111-1111-AAAAAAAAAAAA:1-1234
# 3. Replay binlogs up to target time
mysqlbinlog --start-position=154 \
--stop-datetime="2026-04-12 16:42:30" \
/backup/binlogs/mysql-bin.000042 \
/backup/binlogs/mysql-bin.000043 | mysql -u root
Multi-Cloud Backup Architecture
Production backup strategies must account for the failure of any single cloud provider or region. Storing backups exclusively in the same cloud account and region as your production database means a single account compromise, billing issue, or regional outage can take out both your data and your backups simultaneously. A robust architecture sends backups to at least two independent storage targets with cross-region replication.
Cloud Storage Configuration with Lifecycle Policies
# AWS S3 lifecycle policy (Terraform)
resource "aws_s3_bucket_lifecycle_configuration" "backup_lifecycle" {
bucket = aws_s3_bucket.db_backups.id
rule {
id = "backup-tiering"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
expiration {
days = 2555 # 7 years for compliance
}
}
rule {
id = "wal-segments"
status = "Enabled"
filter {
prefix = "wal-archive/"
}
expiration {
days = 30
}
}
}
# Enable cross-region replication
resource "aws_s3_bucket_replication_configuration" "backup_replication" {
bucket = aws_s3_bucket.db_backups.id
role = aws_iam_role.replication.arn
rule {
id = "cross-region-dr"
status = "Enabled"
destination {
bucket = aws_s3_bucket.db_backups_dr.arn
storage_class = "STANDARD_IA"
encryption_configuration {
replica_kms_key_id = aws_kms_key.dr_key.arn
}
}
source_selection_criteria {
sse_kms_encrypted_objects {
status = "Enabled"
}
}
}
}
# Azure Blob immutability policy (Azure CLI)
az storage container immutability-policy create \
--account-name prodbackupstorage \
--container-name pg-backups \
--period 365 \
--allow-protected-append-writes true
# GCS lifecycle with nearline transition
gsutil lifecycle set /dev/stdin gs://pg-backups-production << 'JSON'
{
"rule": [
{"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
{"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"}, "condition": {"age": 90}},
{"action": {"type": "Delete"}, "condition": {"age": 2555}}
]
}
JSON
Backup Encryption and Security
Backups are a high-value target for attackers. A stolen unencrypted backup gives an attacker a complete copy of your production data without needing to breach your running systems. Every backup, in transit and at rest, must be encrypted.
Encryption in transit means using TLS for all connections between the database server and the backup destination. Both pgBackRest and XtraBackup stream over encrypted channels when configured with TLS. When shipping to S3, Azure Blob, or GCS, the client SDKs use HTTPS by default.
Encryption at rest has two layers. Server-side encryption (SSE) encrypts data after it arrives at the storage target — S3 SSE-KMS, Azure Storage Service Encryption, or GCS default encryption. Client-side encryption encrypts data before it leaves the database server, ensuring the cloud provider never sees plaintext data. pgBackRest and WAL-G both support client-side encryption natively.
# pgBackRest client-side encryption config
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=long_random_passphrase_stored_in_vault
# WAL-G client-side encryption
export WALG_LIBSODIUM_KEY=$(cat /etc/wal-g/encryption.key)
# Or GPG-based
export WALG_GPG_KEY_ID=backup@example.com
# MySQL XtraBackup with encryption
xtrabackup --backup --encrypt=AES256 \
--encrypt-key-file=/etc/mysql/backup-encryption.key \
--target-dir=/backups/full-encrypted-$(date +%Y%m%d)
Store encryption keys separately from the backups themselves. A secrets manager like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault is the recommended approach. If you store the key alongside the backup, an attacker who gains access to your backup storage has everything they need.
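As a sketch of that pattern (the Vault path and field name are assumptions for this example), the passphrase can be fetched at runtime and handed to pgBackRest through its environment-variable configuration, so it never needs to appear in pgbackrest.conf:

```shell
# Fetch the repo cipher passphrase from Vault at backup time.
# pgBackRest reads any option from a PGBACKREST_<OPTION> environment variable,
# so repo1-cipher-pass can stay out of the on-disk config entirely.
PGBACKREST_REPO1_CIPHER_PASS="$(vault kv get -field=cipher_pass secret/database/backup)"
export PGBACKREST_REPO1_CIPHER_PASS
pgbackrest --stanza=production --type=diff backup
```

The same approach works for the S3 credentials in the earlier config: any option shown in pgbackrest.conf can be supplied as an environment variable instead.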
Automated Backup Scheduling
Manual backups are not backups — they are aspirations. Backup schedules must be automated, monitored, and wired to alert on failure.
System Cron for Bare Metal and VMs
# /etc/cron.d/database-backups
# PostgreSQL — pgBackRest
# Full backup every Sunday at 01:00
0 1 * * 0 postgres pgbackrest --stanza=production --type=full backup 2>&1 | logger -t pgbackrest-full
# Differential backup every day at 01:00 (except Sunday)
0 1 * * 1-6 postgres pgbackrest --stanza=production --type=diff backup 2>&1 | logger -t pgbackrest-diff
# MySQL — XtraBackup
# Full backup every Sunday at 02:00
0 2 * * 0 root /usr/local/bin/mysql-backup.sh full 2>&1 | logger -t xtrabackup-full
# Incremental backup every day at 02:00 (except Sunday)
0 2 * * 1-6 root /usr/local/bin/mysql-backup.sh incremental 2>&1 | logger -t xtrabackup-incr
# Backup verification — restore test every Wednesday at 04:00
0 4 * * 3 root /usr/local/bin/verify-backup.sh 2>&1 | logger -t backup-verify
Kubernetes CronJobs
In Kubernetes environments, CronJobs replace system cron. They offer built-in retry logic, concurrency control, and integration with the cluster's RBAC and secret management.
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-full
namespace: database
spec:
schedule: "0 1 * * 0" # Sunday 01:00 UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 4
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 7200 # 2 hour timeout
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: pgbackrest
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
pgbackrest --stanza=production --type=full backup
RESULT=$?
if [ $RESULT -eq 0 ]; then
curl -s -X POST "$SLACK_WEBHOOK" \
-d '{"text":"PostgreSQL full backup completed successfully"}'
else
curl -s -X POST "$SLACK_WEBHOOK" \
-d '{"text":"ALERT: PostgreSQL full backup FAILED"}'
fi
exit $RESULT
envFrom:
- secretRef:
name: pgbackrest-credentials
- secretRef:
name: slack-webhook
resources:
requests:
memory: 512Mi
cpu: 500m
limits:
memory: 2Gi
cpu: "2"
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-conf
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-diff
namespace: database
spec:
schedule: "0 1 * * 1-6" # Mon-Sat 01:00 UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 3600
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: pgbackrest
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
pgbackrest --stanza=production --type=diff backup
envFrom:
- secretRef:
name: pgbackrest-credentials
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 1Gi
cpu: "1"
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-conf
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: mysql-backup-full
namespace: database
spec:
schedule: "0 2 * * 0"
concurrencyPolicy: Forbid
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 7200
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: xtrabackup
image: percona/percona-xtrabackup:8.0
command:
- /bin/bash
- -c
- |
xtrabackup --backup --stream=xbstream --compress \
--user=$MYSQL_BACKUP_USER \
--password=$MYSQL_BACKUP_PASS \
--host=mysql-primary.database.svc | \
aws s3 cp - s3://$S3_BUCKET/mysql/full-$(date +%Y%m%d).xbstream
envFrom:
- secretRef:
name: mysql-backup-credentials
- secretRef:
name: aws-credentials
resources:
requests:
memory: 512Mi
cpu: 500m
limits:
memory: 2Gi
cpu: "2"
Kubernetes-Native Backup Approaches
Running databases in Kubernetes introduces a new dimension to backup strategy. In addition to application-level database backups, you need to consider cluster-level backup (etcd), persistent volume snapshots, and operator-managed backups.
Velero for Cluster-Level Backup
Velero backs up Kubernetes resources (deployments, services, configmaps, secrets) and persistent volumes. It is not a replacement for application-level database backups — it captures a crash-consistent snapshot of PVs, which may not be transactionally consistent for databases. Use Velero for cluster recovery and application-level tools (pgBackRest, XtraBackup) for database recovery.
# Install Velero with S3 backend
velero install \
--provider aws \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config region=eu-west-1 \
--snapshot-location-config region=eu-west-1 \
--use-volume-snapshots=true \
--plugins velero/velero-plugin-for-aws:v1.9.0
# Create a scheduled backup of the database namespace
velero schedule create db-namespace-backup \
--schedule="0 3 * * *" \
--include-namespaces database \
--ttl 720h \
--storage-location default \
--volume-snapshot-locations default
# On-demand backup before maintenance
velero backup create pre-maintenance-$(date +%Y%m%d) \
--include-namespaces database,monitoring \
--wait
# Restore a namespace from backup
velero restore create --from-backup pre-maintenance-20260412 \
--include-namespaces database
Longhorn Snapshots on k3s
k3s deployments commonly use Longhorn as their CSI storage provider. Longhorn provides volume-level snapshots and the ability to replicate snapshots to S3-compatible object storage. For databases, combine Longhorn snapshots with application-level backups for defence in depth.
# Longhorn recurring snapshot job via CRD
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: pg-data-snapshot
namespace: longhorn-system
spec:
cron: "0 */4 * * *" # every 4 hours
task: snapshot
retain: 6
concurrency: 1
groups:
- pg-data
labels:
app: postgresql
---
# Longhorn recurring backup to S3
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: pg-data-s3-backup
namespace: longhorn-system
spec:
cron: "0 2 * * *" # daily at 02:00
task: backup
retain: 14
concurrency: 1
groups:
- pg-data
labels:
app: postgresql
---
# VolumeSnapshot using Longhorn CSI
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: pg-data-snap-20260412 # shell substitution does not work in manifests; generate unique names via envsubst or kubectl create with generateName
namespace: database
spec:
volumeSnapshotClassName: longhorn-snapshot-vsc
source:
persistentVolumeClaimName: pg-data-postgresql-0
Operator-Managed Backups
Database operators like CloudNativePG (for PostgreSQL) and Percona Operator (for MySQL) integrate backup management directly into the database lifecycle. The operator handles scheduling, WAL archiving, and retention automatically through custom resources.
# CloudNativePG cluster with integrated backup
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: production-pg
namespace: database
spec:
instances: 3
storage:
size: 100Gi
storageClass: longhorn
backup:
barmanObjectStore:
destinationPath: s3://pg-backups/cnpg/
endpointURL: https://s3.eu-west-1.amazonaws.com
s3Credentials:
accessKeyId:
name: aws-creds
key: ACCESS_KEY_ID
secretAccessKey:
name: aws-creds
key: SECRET_ACCESS_KEY
wal:
compression: gzip
maxParallel: 4
data:
compression: gzip
retentionPolicy: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: production-pg-weekly
namespace: database
spec:
schedule: "0 1 * * 0"
backupOwnerReference: self
cluster:
name: production-pg
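For an on-demand backup before risky maintenance, CloudNativePG also exposes a Backup custom resource; this sketch targets the cluster defined above:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pre-maintenance-backup
  namespace: database
spec:
  cluster:
    name: production-pg
```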
Cloud Provider-Specific Strategies
AWS — RDS and Aurora
AWS RDS provides automated daily snapshots with configurable retention (up to 35 days) and continuous backup via transaction log archiving for point-in-time recovery. Aurora adds Backtrack, which can rewind the cluster to any point within the backtrack window without requiring a restore — the storage layer is rewound in place.
# Enable automated backups with maximum retention (Terraform)
resource "aws_db_instance" "production" {
identifier = "production-pg"
engine = "postgres"
engine_version = "16.2"
instance_class = "db.r6g.xlarge"
allocated_storage = 500
backup_retention_period = 35 # maximum
backup_window = "03:00-04:00"
copy_tags_to_snapshot = true
deletion_protection = true
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
# PITR is enabled automatically whenever backup_retention_period > 0;
# the log exports below are for audit and troubleshooting
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
}
# Cross-region automated backup replication
resource "aws_db_instance_automated_backups_replication" "dr" {
source_db_instance_arn = aws_db_instance.production.arn
kms_key_id = aws_kms_key.dr_rds.arn
retention_period = 14
}
# Aurora Backtrack (MySQL-compatible Aurora only)
resource "aws_rds_cluster" "aurora_production" {
cluster_identifier = "aurora-production"
engine = "aurora-mysql"
engine_version = "8.0.mysql_aurora.3.05.2"
backtrack_window = 86400 # 24 hours of backtrack
backup_retention_period = 35
preferred_backup_window = "03:00-04:00"
storage_encrypted = true
}
# Manual snapshot with cross-region copy
aws rds create-db-snapshot \
--db-instance-identifier production-pg \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:eu-west-1:123456789:snapshot:pre-migration-20260412 \
--target-db-snapshot-identifier pre-migration-20260412-dr \
--source-region eu-west-1 \
--region us-east-1 \
--kms-key-id arn:aws:kms:us-east-1:123456789:key/dr-key-id
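Once a backtrack window is configured, the rewind itself is a single CLI call; the cluster identifier and timestamp below are illustrative:

```shell
CLUSTER_ID="aurora-production"        # illustrative cluster identifier
TARGET_TS="2026-04-12T16:42:30Z"      # must fall within the backtrack window
# Rewind the cluster's storage in place (no snapshot restore, no new cluster);
# the cluster is briefly unavailable while the rewind completes
aws rds backtrack-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --backtrack-to "$TARGET_TS"
```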
Azure — Flexible Server
Azure Database for PostgreSQL and MySQL Flexible Server includes automatic backups with locally redundant or geo-redundant storage. Geo-redundant backup allows cross-region restore for disaster recovery.
# Azure Flexible Server with geo-redundant backup (Terraform)
resource "azurerm_postgresql_flexible_server" "production" {
name = "production-pg"
location = "westeurope"
resource_group_name = azurerm_resource_group.db.name
sku_name = "GP_Standard_D4s_v3"
version = "16"
storage_mb = 524288 # 512 GB
backup_retention_days = 35
geo_redundant_backup_enabled = true
authentication {
active_directory_auth_enabled = true
password_auth_enabled = false
}
}
# Azure Blob immutability for self-managed backups
resource "azurerm_storage_management_policy" "backup_lifecycle" {
storage_account_id = azurerm_storage_account.backups.id
rule {
name = "backup-tiering"
enabled = true
filters {
blob_types = ["blockBlob"]
prefix_match = ["pg-backups/"]
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 90
delete_after_days_since_modification_greater_than = 2555
}
}
}
}
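The cross-region restore itself is driven from the CLI; the resource group, server names, and target region below are illustrative:

```shell
DR_REGION="northeurope"               # illustrative paired DR region
# Geo-restore a new flexible server from the geo-redundant backup of production-pg
az postgres flexible-server geo-restore \
  --resource-group db-dr \
  --name production-pg-dr \
  --source-server production-pg \
  --location "$DR_REGION"
```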
GCP — Cloud SQL
Cloud SQL provides automated backups and point-in-time recovery out of the box. For self-managed databases on GCE or GKE, GCS with nearline and coldline storage classes provides cost-effective long-term backup storage.
# Cloud SQL with automated backups and PITR (Terraform)
resource "google_sql_database_instance" "production" {
name = "production-pg"
database_version = "POSTGRES_16"
region = "europe-west1"
settings {
tier = "db-custom-4-16384"
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
transaction_log_retention_days = 7
backup_retention_settings {
retained_backups = 30
retention_unit = "COUNT"
}
}
ip_configuration {
ssl_mode = "ENCRYPTED_ONLY"
}
}
}
# Export to GCS for long-term retention
gcloud sql export sql production-pg \
gs://pg-backups-longterm/export-$(date +%Y%m%d).sql.gz \
--database=production_db \
--offload
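Point-in-time recovery on Cloud SQL works by cloning the instance to a new one at the target timestamp, which leaves production untouched; the instance names and timestamp here are illustrative:

```shell
RECOVERY_TS="2026-04-12T16:42:30.000Z"  # illustrative PITR target
# Clone production-pg to a new instance at its state as of RECOVERY_TS
gcloud sql instances clone production-pg production-pg-recovered \
  --point-in-time "$RECOVERY_TS"
```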
Backup Verification and Restore Testing
A backup that has never been tested is not a backup — it is a hope. Automated restore testing should run on a regular schedule, ideally weekly, in an isolated environment. The test must validate not just that the restore completes without errors, but that the restored data is consistent and the application can connect and query it.
#!/bin/bash
# verify-backup.sh — Automated backup verification script
set -euo pipefail
RESTORE_DIR="/tmp/backup-verify-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/var/log/backup-verify.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"
DB_TYPE="${1:-postgresql}" # postgresql or mysql
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"; }
alert() {
log "ALERT: $*"
curl -s -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"BACKUP VERIFY FAILED: $*\"}"
}
cleanup() {
log "Cleaning up $RESTORE_DIR"
rm -rf "$RESTORE_DIR"
if [ "$DB_TYPE" = "postgresql" ]; then
sudo -u postgres pg_ctl -D "$RESTORE_DIR/pgdata" stop -m fast 2>/dev/null || true
else
mysqladmin -S /tmp/mysql-verify.sock shutdown 2>/dev/null || true
fi
}
trap cleanup EXIT
mkdir -p "$RESTORE_DIR"
if [ "$DB_TYPE" = "postgresql" ]; then
log "Starting PostgreSQL backup verification"
# Restore latest pgBackRest backup to temporary directory
# Guard with if: under set -e a bare failing pipeline would abort the script
# before the status check ever ran
if ! pgbackrest --stanza=production \
    --pg1-path="$RESTORE_DIR/pgdata" \
    --type=immediate \
    --target-action=promote \
    restore 2>&1 | tee -a "$LOG_FILE"; then
  alert "pgBackRest restore failed"
  exit 1
fi
# Start PostgreSQL on a different port (pg_ctl refuses to run as root)
chown -R postgres:postgres "$RESTORE_DIR/pgdata"
chmod 700 "$RESTORE_DIR/pgdata"
sudo -u postgres pg_ctl -D "$RESTORE_DIR/pgdata" \
  -o "-p 5433 -c listen_addresses=127.0.0.1" \
  start
sleep 5
# Verify data integrity
TABLES=$(sudo -u postgres psql -p 5433 -d production_db -t -c \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema='public';")
log "Verified $TABLES tables exist"
ROW_CHECK=$(sudo -u postgres psql -p 5433 -d production_db -t -c \
  "SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days';")
log "Recent orders count: $ROW_CHECK"
if [ "$ROW_CHECK" -lt 1 ]; then
alert "PostgreSQL restore has no recent data — possible stale backup"
exit 1
fi
# Run pg_amcheck for corruption detection (PostgreSQL 14+)
sudo -u postgres pg_amcheck -p 5433 -d production_db --heapallindexed 2>&1 | tee -a "$LOG_FILE"
log "PostgreSQL backup verification PASSED"
else
log "Starting MySQL backup verification"
# Prepare and restore latest XtraBackup
LATEST_FULL=$(ls -td /backups/full-* | head -1)
cp -r "$LATEST_FULL" "$RESTORE_DIR/mysql-data"
xtrabackup --prepare --target-dir="$RESTORE_DIR/mysql-data"
chown -R mysql:mysql "$RESTORE_DIR/mysql-data"
# Start MySQL on a different socket and port (mysqld refuses to run as root)
mysqld --user=mysql --datadir="$RESTORE_DIR/mysql-data" \
  --socket=/tmp/mysql-verify.sock \
  --port=3307 \
  --bind-address=127.0.0.1 &
sleep 10
# Verify data
TABLES=$(mysql -S /tmp/mysql-verify.sock -e \
"SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='production_db';" -sN)
log "Verified $TABLES tables exist"
mysqlcheck -S /tmp/mysql-verify.sock --all-databases --check 2>&1 | tee -a "$LOG_FILE"
log "MySQL backup verification PASSED"
fi
curl -s -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"Backup verification PASSED for $DB_TYPE at $(date)\"}"
log "Verification complete"
Disaster Recovery Planning: RTO and RPO
Every backup strategy must be designed around two metrics: Recovery Time Objective (RTO) — how long you can afford to be down — and Recovery Point Objective (RPO) — how much data you can afford to lose. These numbers drive every decision about backup frequency, method, and architecture.
| Scenario | RTO | RPO | Strategy |
|---|---|---|---|
| E-commerce checkout | < 5 min | 0 (zero data loss) | Synchronous replication + continuous WAL archiving + hot standby failover |
| SaaS application | < 30 min | < 1 min | Streaming replication + WAL-G continuous archiving + automated failover |
| Internal tools | < 4 hours | < 1 hour | Daily differential + hourly incremental + WAL archiving |
| Analytics / data warehouse | < 24 hours | < 24 hours | Daily full backup to cloud storage |
| Dev / staging | < 48 hours | < 1 week | Weekly full backup |
For zero RPO, you need synchronous replication to at least one standby. This adds latency to every write transaction but guarantees no committed transaction is lost. Most production systems accept near-zero RPO (seconds of potential loss) by using asynchronous streaming replication with continuous WAL/binlog archiving, which avoids the write-latency penalty.
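As a concrete reference point, PostgreSQL enables synchronous replication with two settings on the primary — sketched here with placeholder standby names. MySQL's nearest equivalent is semi-synchronous replication, which confirms only receipt (not application) of each transaction on a replica.

```ini
# postgresql.conf on the primary — standby names are placeholders
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on   # commit returns only after a standby has flushed the WAL
```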
Disaster Recovery Runbook Template
# DR Runbook: Database Recovery
## Severity Levels
- P1: Complete data loss / corruption — all hands, CEO notified
- P2: Partial data loss / single region down — on-call team + escalation
- P3: Replica failure / backup failure — on-call investigation
## Recovery Procedures
### Scenario A: Primary DB failure, replicas healthy
1. Promote replica to primary (automated via Patroni / orchestrator)
2. Verify application connectivity
3. Re-establish replication from new primary
4. Investigate root cause
### Scenario B: Complete cluster failure, backups intact
1. Provision new database infrastructure
2. Restore latest full backup
3. Apply WAL/binlog to reach latest consistent point
4. Verify data integrity with checksums
5. Update DNS / connection strings
6. Resume application traffic
7. Re-establish backup schedule immediately
### Scenario C: Data corruption (bad migration / SQL injection)
1. Identify exact timestamp of corruption
2. Restore to point-in-time just before corruption
3. Export affected tables from restored copy
4. Merge clean data into production
5. OR: full PITR restore if corruption is widespread
## Contacts
- DBA on-call: [PagerDuty rotation]
- Infrastructure: [PagerDuty rotation]
- VP Engineering: [phone number]
## Validation Checklist
- [ ] Application health checks pass
- [ ] Row counts match expected ranges
- [ ] Recent transactions are present
- [ ] Replication re-established
- [ ] Backup schedule resumed
- [ ] Post-incident review scheduled
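Scenario C hinges on knowing the corruption timestamp and restoring to just before it. With the tools covered earlier, the point-in-time step looks roughly like this — stanza name, timestamp, and binlog filename are all illustrative:

```bash
# PostgreSQL: pgBackRest restore to one second before the bad migration
pgbackrest --stanza=production \
  --delta \
  --type=time \
  --target="2024-05-14 09:29:59+00" \
  --target-action=promote \
  restore

# MySQL: restore the last full backup, then replay binlogs up to the cutoff
mysqlbinlog --stop-datetime="2024-05-14 09:29:59" binlog.000042 | mysql
```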
Backup Monitoring and Alerting
Backup systems fail silently. The database keeps running, the application keeps serving traffic, and nobody notices that backups stopped working three weeks ago — until they need one. Proactive monitoring is essential.
# Prometheus alerting rules for backup monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backup-alerts
namespace: monitoring
spec:
groups:
- name: database-backups
rules:
- alert: BackupTooOld
expr: |
(time() - backup_last_successful_timestamp_seconds) > 90000
for: 10m
labels:
severity: critical
annotations:
summary: "Database backup is older than 25 hours"
description: "Last successful backup for {{ $labels.database }} was {{ $value | humanizeDuration }} ago"
- alert: BackupJobFailed
expr: |
kube_job_status_failed{job_name=~".*backup.*"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup CronJob failed: {{ $labels.job_name }}"
- alert: WALArchivingLagging
expr: |
pg_stat_archiver_failed_count > 0
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL WAL archiving has failures"
- alert: BackupStorageQuotaNearing
expr: |
(backup_storage_used_bytes / backup_storage_quota_bytes) > 0.85
for: 30m
labels:
severity: warning
annotations:
summary: "Backup storage at {{ $value | humanizePercentage }} capacity"
- alert: BinlogSpaceCritical
expr: |
mysql_binlog_size_bytes > 53687091200
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL binlog space exceeds 50GB — check archiving"
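The BackupTooOld rule above presumes every backup job publishes a backup_last_successful_timestamp_seconds metric. One common pattern — sketched here with a hypothetical Pushgateway address and label values — is to push the timestamp as the job's final step:

```bash
# Final step of any backup script: record success in Prometheus
# (Pushgateway address and labels are assumptions — adjust to your environment)
cat <<EOF | curl -s --data-binary @- \
  http://pushgateway.monitoring.svc:9091/metrics/job/db_backup/instance/production_db
# TYPE backup_last_successful_timestamp_seconds gauge
backup_last_successful_timestamp_seconds $(date +%s)
EOF
```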
# Custom Prometheus exporter for pgBackRest metrics
#!/usr/bin/env python3
"""pgBackRest Prometheus exporter — exposes backup age and size metrics."""
import json
import subprocess
import time
from prometheus_client import start_http_server, Gauge
BACKUP_AGE = Gauge('pgbackrest_last_backup_age_seconds', 'Seconds since last backup', ['stanza', 'type'])
BACKUP_SIZE = Gauge('pgbackrest_last_backup_size_bytes', 'Size of last backup', ['stanza', 'type'])
BACKUP_REPO_SIZE = Gauge('pgbackrest_repo_size_bytes', 'Total repository size', ['stanza'])
def collect():
    result = subprocess.run(
        ['pgbackrest', '--output=json', 'info'],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return  # keep last gauge values; the BackupTooOld alert catches staleness
    info = json.loads(result.stdout)
    for stanza_info in info:
        stanza = stanza_info['name']
        for backup in stanza_info.get('backup', []):
            backup_type = backup['type']
            stop_time = backup['timestamp']['stop']
            age = time.time() - stop_time
            BACKUP_AGE.labels(stanza=stanza, type=backup_type).set(age)
            size = backup['info']['size']
            BACKUP_SIZE.labels(stanza=stanza, type=backup_type).set(size)
if __name__ == '__main__':
    start_http_server(9854)
    while True:
        collect()
        time.sleep(300)
Common Mistakes and Anti-Patterns
After managing database backups across dozens of production environments, these are the mistakes that cause the most damage.
1. Backing up to the same disk as the database. If the disk fails, you lose both the database and the backup. Always write backups to a separate storage target — ideally off-host and off-region.
2. Never testing restores. A backup that cannot be restored is not a backup. Schedule automated restore tests weekly and have engineers perform manual restore drills quarterly.
3. Relying solely on replication as a backup. Replication is not backup. A DROP TABLE on the primary is instantly replicated to all replicas. Replication protects against hardware failure, not logical errors.
4. Not monitoring backup jobs. Cron jobs fail silently. Kubernetes CronJobs get suspended. S3 credentials expire. Every backup job must report success or failure to a monitoring system, and alert if the most recent successful backup is older than your RPO.
5. Storing encryption keys alongside backups. If an attacker gains access to your backup storage, they should not also have the decryption key. Store keys in a dedicated secrets manager.
6. No retention policy. Without lifecycle policies, backup storage grows unbounded. Define clear retention windows: 7 days for WAL segments, 30 days for daily backups, 12 months for monthly backups, and automate their deletion.
7. Ignoring backup performance impact. Running a full backup on a primary database during peak hours degrades application performance. Schedule backups during low-traffic windows, or back up from a dedicated replica.
8. Using mysqldump for large databases without --single-transaction. Without this flag, mysqldump locks tables, blocking writes for the duration of the dump. For large databases this can mean minutes or hours of downtime. (Note that --single-transaction provides a consistent, lock-free snapshot only for InnoDB tables; MyISAM tables are still locked.)
9. Forgetting to back up database configuration. Restoring data is only half the battle. If you lose postgresql.conf, pg_hba.conf, my.cnf, replication settings, and user grants, you cannot bring the database back to a usable state. Include configuration files in your backup process.
10. Not documenting the restore procedure. During an outage, the person restoring the database may not be the person who set up the backups. A written, tested runbook is critical.
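Several of these mistakes yield to small amounts of automation. For mistake #6, retention can be reduced to a pure function that decides which backups survive — a sketch of grandfather-father-son selection, where the 7/4/12 windows are illustrative defaults:

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Grandfather-father-son selection: keep the last `daily` days,
    the newest backup in each of the last `weekly` ISO weeks, and the
    newest backup in each of the last `monthly` calendar months."""
    keep = {d for d in backup_dates if (today - d).days < daily}
    weeks, months = {}, {}
    for d in sorted(backup_dates):           # ascending: last write wins
        weeks[d.isocalendar()[:2]] = d       # newest backup per ISO week
        months[(d.year, d.month)] = d        # newest backup per month
    keep.update(sorted(weeks.values(), reverse=True)[:weekly])
    keep.update(sorted(months.values(), reverse=True)[:monthly])
    return keep

# Example: 120 days of daily backups — decide what to prune
today = date(2024, 6, 30)
have = [today - timedelta(days=i) for i in range(120)]
kept = backups_to_keep(have, today)
print(f"keep {len(kept)}, prune {len(have) - len(kept)}")
```

Everything outside the returned set is safe to delete, which makes the deletion step a trivial loop over your backup listing.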
Cost Optimization for Backup Storage
Backup storage costs can grow rapidly, especially with frequent full backups of large databases. These strategies keep costs under control without compromising recoverability.
Use incremental/differential backups. A weekly full plus daily differentials uses a fraction of the storage compared to daily full backups. pgBackRest's delta restore capability means incremental backups restore almost as fast as full backups.
Enable compression. Modern compression algorithms like zstd offer excellent compression ratios (5:1 to 10:1 for typical database data) with minimal CPU overhead. Both pgBackRest and XtraBackup support zstd natively.
Implement storage tiering. Move backups to progressively cheaper storage tiers as they age. The lifecycle policies shown earlier (S3 Standard → IA → Glacier, GCS Standard → Nearline → Coldline) can reduce long-term storage costs by 70–90%.
Deduplicate where possible. pgBackRest uses block-level deduplication in its repository, storing only changed blocks across backups. This dramatically reduces storage for databases where the majority of data is static.
Right-size retention windows. Many teams default to keeping backups forever out of caution. Analyze your actual recovery patterns and compliance requirements, then set retention that matches. A 7-year retention requirement can use deep archive storage at less than $1/TB/month on most cloud providers.
# Cost comparison: Full vs Incremental backup storage
# Assuming 500 GB database, 5% daily change rate, 30-day retention
# Strategy A: Daily full backups
# 500 GB × 30 days = 15,000 GB = 15 TB
# S3 Standard: 15 TB × $0.023/GB = $345/month
# Strategy B: Weekly full + daily incremental
# 4 full × 500 GB = 2,000 GB
# 26 incremental × 25 GB = 650 GB
# Total: 2,650 GB ≈ 2.6 TB
# S3 Standard: 2,650 GB × $0.023/GB ≈ $60.95/month
# Strategy C: Strategy B + lifecycle tiering
# Current week: 525 GB Standard = $12.08
# Weeks 2-4: 2,125 GB Standard-IA = $26.56
# Total: $38.64/month
# Savings: Strategy C is 89% cheaper than Strategy A
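The arithmetic above can be sanity-checked in a few lines of Python; the prices mirror the S3 rates assumed in this article, so verify current pricing before planning:

```python
# Verify the backup storage cost comparison ($/GB-month rates as assumed above)
STD, IA = 0.023, 0.0125
DB_GB, CHANGE = 500, 0.05

# Strategy A: 30 daily full backups
a_gb = 30 * DB_GB
a_cost = a_gb * STD

# Strategy B: 4 weekly fulls + 26 daily incrementals (5% of 500 GB each)
b_gb = 4 * DB_GB + 26 * DB_GB * CHANGE
b_cost = b_gb * STD

# Strategy C: current week (1 full + 1 incremental) on Standard, rest on IA
current_gb = DB_GB + DB_GB * CHANGE
c_cost = current_gb * STD + (b_gb - current_gb) * IA

print(f"A: {a_gb:,.0f} GB  ${a_cost:,.2f}/month")
print(f"B: {b_gb:,.0f} GB  ${b_cost:,.2f}/month")
print(f"C: ${c_cost:,.2f}/month — {1 - c_cost / a_cost:.0%} cheaper than A")
```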
Putting It All Together: A Complete Backup Architecture
Here is the recommended backup architecture for a production environment running both MySQL and PostgreSQL, deployed on Kubernetes with multi-cloud disaster recovery.
For PostgreSQL:
- pgBackRest as the primary backup tool with two repositories (S3 + Azure Blob)
- Continuous WAL archiving to both repositories
- Weekly full backup + daily differential (via K8s CronJob)
- Longhorn volume snapshots every 4 hours for fast rollback
- Automated restore verification every Wednesday
- Prometheus monitoring with alerts on backup age, failure, and storage
For MySQL:
- Percona XtraBackup for physical backups streamed to S3
- Continuous binary log shipping to separate storage
- Weekly full + daily incremental (via K8s CronJob)
- Daily mysqldump of critical schemas for portable logical backups
- Longhorn volume snapshots every 4 hours
- Automated restore verification every Thursday
For the Kubernetes cluster:
- Velero daily backup of database namespaces with PV snapshots
- RKE2/k3s etcd snapshots every 6 hours to off-cluster storage
- GitOps repository as the source of truth for all manifests
For disaster recovery:
- S3 cross-region replication to DR region
- Azure GRS for geo-redundant backup copies
- Monthly DR drill: full cluster rebuild from backups in DR region
- Documented runbook with decision tree for each failure scenario
Conclusion
Database backups are the last line of defense between your business and catastrophic data loss. The strategies outlined in this article — physical and logical backups, continuous WAL and binlog archiving, multi-cloud storage with lifecycle policies, encryption at every layer, automated scheduling via Kubernetes CronJobs, and systematic restore verification — represent the current state of the art for production MySQL and PostgreSQL environments.
The most important takeaway is this: a backup strategy is only as good as its last successful restore test. Every tool, script, and architecture pattern in this article exists to serve one purpose — ensuring that when the worst happens, you can recover your data, meet your RTO and RPO commitments, and keep your business running. Build your backup system, automate it, monitor it, test it, and then test it again. Your future self will thank you.