Production Database Backup Strategies That Actually Work: MySQL and PostgreSQL
Losing production data is the kind of incident that ends careers, shutters companies, and makes the front page of Hacker News for all the wrong reasons. Yet most teams treat database backups as an afterthought — a cron job somebody set up two years ago that nobody has verified since. This article lays out battle-tested backup strategies for MySQL and PostgreSQL, proven in production environments ranging from single-node k3s clusters to multi-region cloud deployments processing millions of transactions daily.
We will cover the full spectrum: logical and physical backup methods, point-in-time recovery, automated scheduling, cloud storage targets, encryption, verification, Kubernetes-native approaches, and the disaster recovery planning that ties it all together. Every recommendation comes with concrete configuration and code you can adapt to your environment.
Understanding Backup Types
Before diving into specific tools, you need a clear mental model of the four fundamental backup strategies and how they interact with recovery objectives. Each strategy makes a different trade-off between backup speed, storage consumption, and recovery time.
A full backup captures the entire database at a single point in time. It is the simplest to reason about and the fastest to restore, but also the most expensive in storage and the slowest to create. An incremental backup captures only the data that has changed since the last backup of any type. It is fast to create and compact in storage, but restoration requires replaying the full chain: last full backup plus every subsequent incremental. A differential backup captures everything that has changed since the last full backup. It occupies a middle ground — larger than an incremental but simpler to restore because you only need the last full plus the latest differential. Finally, continuous archiving (WAL archiving in PostgreSQL, binary log streaming in MySQL) captures every individual transaction as it happens, enabling point-in-time recovery to any moment between backups.
The optimal strategy for most production systems is a combination: weekly full backups, daily incrementals or differentials, and continuous WAL/binlog archiving. This gives you both fast recovery from recent full backups and the ability to recover to any point in time when you need it.
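To make the restore-chain trade-off concrete, here is a small illustrative shell sketch (the file names are hypothetical) showing which backups a Thursday restore needs under each strategy:

```shell
# Hypothetical week of backups: a Sunday full, then daily incrementals
# (changes since the previous backup) or differentials (changes since the full)
FULL="full-sun"
INCREMENTALS="incr-mon incr-tue incr-wed incr-thu"
DIFFERENTIALS="diff-mon diff-tue diff-wed diff-thu"

# Incremental restore: the full plus EVERY incremental, replayed in order
incr_chain="$FULL $INCREMENTALS"
echo "incremental restore chain: $incr_chain"

# Differential restore: the full plus ONLY the latest differential
latest_diff="${DIFFERENTIALS##* }"
echo "differential restore chain: $FULL $latest_diff"
```

The incremental chain is shorter to create each night but longer to replay; the differential chain is the opposite, which is why mixing both with continuous archiving covers most recovery scenarios.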
MySQL Backup Methods
MySQL offers several backup tools, each suited to different database sizes and recovery requirements. The right choice depends on your data volume, acceptable backup window, and RTO/RPO targets.
mysqldump — The Universal Logical Backup
mysqldump produces SQL statements that recreate the schema and data. It works on every MySQL version and storage engine, making it the universal fallback. However, it locks tables during the dump (unless using --single-transaction with InnoDB), and restoration speed degrades significantly for databases beyond 50–100 GB because it replays individual INSERT statements.
# Full logical backup with consistent snapshot for InnoDB
mysqldump --single-transaction --routines --triggers --events \
--set-gtid-purged=ON --all-databases \
| gzip > /backups/mysql-full-$(date +%Y%m%d-%H%M%S).sql.gz
# Single database backup with compression
mysqldump --single-transaction --routines --triggers \
--databases production_db \
| pigz -p4 > /backups/production_db-$(date +%Y%m%d).sql.gz
# Schema-only backup for migration planning
mysqldump --no-data --routines --triggers --events \
--all-databases > /backups/schema-only-$(date +%Y%m%d).sql
mysqlpump — Parallel Logical Backup
mysqlpump improves on mysqldump with parallel table dumping and built-in compression, which can significantly reduce backup time for databases with many independent tables. Note, however, that mysqlpump is deprecated as of MySQL 8.0.34 and slated for removal in a future release, so treat it as a legacy option for new deployments.
# Parallel logical backup with 4 threads and zstd compression
mysqlpump --default-parallelism=4 --compress-output=ZSTD \
--include-databases=production_db,analytics_db \
--set-gtid-purged=ON \
> /backups/mysql-pump-$(date +%Y%m%d).sql.zst
Percona XtraBackup — Physical Backups for InnoDB
For databases larger than 50 GB, logical backups become impractical — both the backup window and the restore time grow linearly with data size. Percona XtraBackup takes physical-level copies of InnoDB data files without locking the database, making it suitable for multi-terabyte deployments.
# Full physical backup with streaming to compressed archive
xtrabackup --backup --target-dir=/backups/full-$(date +%Y%m%d) \
--user=backup_user --password=secure_pass \
--parallel=4 --compress --compress-threads=4
# Incremental backup based on last full
xtrabackup --backup --target-dir=/backups/incr-$(date +%Y%m%d) \
--incremental-basedir=/backups/full-20260412 \
--user=backup_user --password=secure_pass \
--parallel=4
# Stream full backup directly to S3 via xbstream
xtrabackup --backup --stream=xbstream --compress \
--user=backup_user --password=secure_pass | \
aws s3 cp - s3://db-backups/mysql/full-$(date +%Y%m%d).xbstream
# Decompress first if the backup was taken with --compress
xtrabackup --decompress --target-dir=/backups/full-20260412
# Prepare a standalone full backup for restore (apply redo log)
xtrabackup --prepare --target-dir=/backups/full-20260412
# Prepare incremental on top of full: the full must first be prepared with
# --apply-log-only, otherwise incrementals can no longer be applied to it
xtrabackup --prepare --apply-log-only --target-dir=/backups/full-20260412
xtrabackup --prepare --target-dir=/backups/full-20260412 \
--incremental-dir=/backups/incr-20260412
# Restore
systemctl stop mysqld
xtrabackup --copy-back --target-dir=/backups/full-20260412
chown -R mysql:mysql /var/lib/mysql
systemctl start mysqld
MySQL Binary Log — Point-in-Time Recovery
MySQL binary logs record every data-modifying statement or row change. When combined with a full or physical backup, they enable point-in-time recovery to any moment after the backup was taken.
# Enable binary logging in my.cnf
[mysqld]
server-id = 1
log-bin = /var/log/mysql/mysql-bin
binlog_format = ROW
binlog_expire_logs_seconds = 604800 # 7 days
sync_binlog = 1
gtid_mode = ON
enforce_gtid_consistency = ON
# Flush and archive binary logs
mysqladmin flush-logs
mysqlbinlog --read-from-remote-server --host=db-primary \
--raw --stop-never --result-file=/backup/binlogs/ \
mysql-bin.000042
# Point-in-time recovery: replay binlog up to specific timestamp
mysqlbinlog --stop-datetime="2026-04-12 14:30:00" \
/backup/binlogs/mysql-bin.000042 \
/backup/binlogs/mysql-bin.000043 | mysql -u root
PostgreSQL Backup Methods
PostgreSQL's backup ecosystem is arguably richer than MySQL's, with first-party tools and a mature set of community solutions purpose-built for large-scale production environments.
pg_dump and pg_dumpall — Logical Backups
Like mysqldump, pg_dump produces logical backups. The custom format (-Fc) is the recommended default because it supports parallel restore, selective table restoration, and built-in compression.
# Custom format (single-process dump; parallel dump requires directory format)
pg_dump -Fc -f /backups/production-$(date +%Y%m%d).dump production_db
# Directory format for maximum parallelism on large databases
pg_dump -Fd -j8 -f /backups/production-$(date +%Y%m%d)/ production_db
# All databases including globals (roles, tablespaces)
pg_dumpall > /backups/pg-all-$(date +%Y%m%d).sql
# Parallel restore from custom format
pg_restore -j4 -d production_db_restored /backups/production-20260412.dump
# Selective restore: single table
pg_restore -j4 -d production_db -t orders /backups/production-20260412.dump
pg_basebackup — Physical Backup Foundation
pg_basebackup takes a physical copy of the entire PostgreSQL data directory. It is the foundation for both standalone recovery and streaming replication setup. Combined with WAL archiving, it enables point-in-time recovery.
# Physical backup with WAL files included
pg_basebackup -D /backups/base-$(date +%Y%m%d) \
-Ft -z -Xs -P -c fast \
-U replication_user -h db-primary
# Stream backup directly to a compressed tar archive
# (WAL streaming is not possible when writing to stdout, so fetch WAL instead)
pg_basebackup -D - -Ft -Xf -c fast \
-U replication_user -h db-primary | \
gzip > /backups/pg-base-$(date +%Y%m%d).tar.gz
pgBackRest — Enterprise-Grade Backup Management
pgBackRest is the gold standard for PostgreSQL backup management. It supports full, incremental, and differential backups, parallel backup and restore, encryption, multi-repository targets (local disk, S3, GCS, Azure Blob), and automated WAL archiving — all through a single, cohesive configuration.
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=pg-backups-production
repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
repo1-s3-region=eu-west-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=a_very_secure_encryption_passphrase
# Second repository for geographic redundancy
repo2-type=azure
repo2-azure-container=pg-backups-dr
repo2-azure-account=prodbackupstorage
repo2-azure-key=base64encodedkeyhere
repo2-path=/pgbackrest
repo2-retention-full=2
process-max=4
compress-type=zst
compress-level=6
log-level-console=info
log-level-file=detail
start-fast=y
[production]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432
# PostgreSQL WAL archive configuration (postgresql.conf)
# archive_mode = on
# archive_command = 'pgbackrest --stanza=production archive-push %p'
# Create the stanza
pgbackrest --stanza=production stanza-create
# Verify the stanza configuration
pgbackrest --stanza=production check
# Full backup
pgbackrest --stanza=production --type=full backup
# Differential backup
pgbackrest --stanza=production --type=diff backup
# Incremental backup
pgbackrest --stanza=production --type=incr backup
# List backups
pgbackrest --stanza=production info
WAL-G — Lightweight WAL Archiving to Cloud Storage
WAL-G is a simpler alternative to pgBackRest that focuses on streaming WAL segments and base backups to cloud object storage. It is popular in containerised environments where you want a single binary with minimal configuration.
# Environment variables for WAL-G with S3
export WALG_S3_PREFIX=s3://pg-wal-archive/production
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export AWS_REGION=eu-west-1
export WALG_COMPRESSION_METHOD=lz4
export PGHOST=/var/run/postgresql
# Configure WAL archiving in postgresql.conf
# archive_mode = on
# archive_command = 'wal-g wal-push %p'
# At recovery time, set: restore_command = 'wal-g wal-fetch %f %p'
# Take a base backup
wal-g backup-push /var/lib/postgresql/16/main
# List backups
wal-g backup-list
# Restore from latest backup
wal-g backup-fetch /var/lib/postgresql/16/main LATEST
# Delete old backups (retain last 4)
wal-g delete retain FULL 4 --confirm
Barman — Centralised Backup Server
Barman (Backup and Recovery Manager) is designed for environments where a dedicated backup server manages backups for multiple PostgreSQL instances. It supports both rsync/SSH and streaming replication protocols for backup transport.
# /etc/barman.d/production.conf
[production]
description = "Production PostgreSQL 16"
ssh_command = ssh postgres@db-primary
conninfo = host=db-primary user=barman dbname=postgres
streaming_conninfo = host=db-primary user=streaming_barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 14 DAYS
# Create replication slot and start streaming
barman receive-wal --create-slot production
barman switch-wal --force --archive production
# Take a backup
barman backup production
# List backups
barman list-backup production
# Restore to point in time
barman recover --target-time "2026-04-12 14:30:00" \
production 20260412T120000 /var/lib/postgresql/16/main
Point-in-Time Recovery (PITR)
Point-in-time recovery is the most important capability in your backup toolkit. It allows you to restore a database to the exact state it was in at any specific moment — not just when backups were taken, but any second between backups. This is critical for recovering from accidental data deletion, application bugs that corrupt data, and security incidents where you need to identify the exact moment of compromise.
PITR for PostgreSQL
PostgreSQL PITR works by restoring a base backup and replaying WAL segments up to the target timestamp. With pgBackRest, the entire process is a single command.
# Restore to specific point in time with pgBackRest
pgbackrest --stanza=production \
--type=time --target="2026-04-12 16:42:30" \
--target-action=promote \
restore
# Manual PITR using pg_basebackup + WAL archive
# 1. Stop PostgreSQL
systemctl stop postgresql
# 2. Clear the data directory and restore base backup
rm -rf /var/lib/postgresql/16/main/*
tar xzf /backups/pg-base-20260412.tar.gz -C /var/lib/postgresql/16/main/
# 3. Create recovery signal and configure restore
cat > /var/lib/postgresql/16/main/postgresql.auto.conf << 'CONF'
restore_command = 'cp /backup/wal_archive/%f %p'
recovery_target_time = '2026-04-12 16:42:30'
recovery_target_action = 'promote'
CONF
touch /var/lib/postgresql/16/main/recovery.signal
# 4. Start PostgreSQL — it will replay WAL and stop at target
systemctl start postgresql
PITR for MySQL
MySQL PITR combines an XtraBackup physical restore with binary log replay.
# 1. Restore the XtraBackup
systemctl stop mysqld
rm -rf /var/lib/mysql/*
xtrabackup --prepare --target-dir=/backups/full-20260412
xtrabackup --copy-back --target-dir=/backups/full-20260412
chown -R mysql:mysql /var/lib/mysql
systemctl start mysqld
# 2. Identify the binlog position from XtraBackup metadata
cat /backups/full-20260412/xtrabackup_binlog_info
# Output: mysql-bin.000042 154 3E11FA47-1111-1111-1111-AAAAAAAAAAAA:1-1234
# 3. Replay binlogs up to target time
mysqlbinlog --start-position=154 \
--stop-datetime="2026-04-12 16:42:30" \
/backup/binlogs/mysql-bin.000042 \
/backup/binlogs/mysql-bin.000043 | mysql -u root
Multi-Cloud Backup Architecture
Production backup strategies must account for the failure of any single cloud provider or region. Storing backups exclusively in the same cloud account and region as your production database means a single account compromise, billing issue, or regional outage can take out both your data and your backups simultaneously. A robust architecture sends backups to at least two independent storage targets with cross-region replication.
Cloud Storage Configuration with Lifecycle Policies
# AWS S3 lifecycle policy (Terraform)
resource "aws_s3_bucket_lifecycle_configuration" "backup_lifecycle" {
bucket = aws_s3_bucket.db_backups.id
rule {
id = "backup-tiering"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
expiration {
days = 2555 # 7 years for compliance
}
}
rule {
id = "wal-segments"
status = "Enabled"
filter {
prefix = "wal-archive/"
}
expiration {
days = 30
}
}
}
# Enable cross-region replication
resource "aws_s3_bucket_replication_configuration" "backup_replication" {
bucket = aws_s3_bucket.db_backups.id
role = aws_iam_role.replication.arn
rule {
id = "cross-region-dr"
status = "Enabled"
destination {
bucket = aws_s3_bucket.db_backups_dr.arn
storage_class = "STANDARD_IA"
encryption_configuration {
replica_kms_key_id = aws_kms_key.dr_key.arn
}
}
source_selection_criteria {
sse_kms_encrypted_objects {
status = "Enabled"
}
}
}
}
# Azure Blob immutability policy (Azure CLI)
az storage container immutability-policy create \
--account-name prodbackupstorage \
--container-name pg-backups \
--period 365 \
--allow-protected-append-writes true
# GCS lifecycle with nearline transition
gsutil lifecycle set /dev/stdin gs://pg-backups-production << 'JSON'
{
"rule": [
{"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"}, "condition": {"age": 30}},
{"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"}, "condition": {"age": 90}},
{"action": {"type": "Delete"}, "condition": {"age": 2555}}
]
}
JSON
Backup Encryption and Security
Backups are a high-value target for attackers. A stolen unencrypted backup gives an attacker a complete copy of your production data without needing to breach your running systems. Every backup, in transit and at rest, must be encrypted.
Encryption in transit means using TLS for all connections between the database server and the backup destination. Both pgBackRest and XtraBackup stream over encrypted channels when configured with TLS. When shipping to S3, Azure Blob, or GCS, the client SDKs use HTTPS by default.
Encryption at rest has two layers. Server-side encryption (SSE) encrypts data after it arrives at the storage target — S3 SSE-KMS, Azure Storage Service Encryption, or GCS default encryption. Client-side encryption encrypts data before it leaves the database server, ensuring the cloud provider never sees plaintext data. pgBackRest and WAL-G both support client-side encryption natively.
# pgBackRest client-side encryption config
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=long_random_passphrase_stored_in_vault
# WAL-G client-side encryption
export WALG_LIBSODIUM_KEY=$(cat /etc/wal-g/encryption.key)
# Or GPG-based
export WALG_GPG_KEY_ID=backup@example.com
# MySQL XtraBackup with encryption
xtrabackup --backup --encrypt=AES256 \
--encrypt-key-file=/etc/mysql/backup-encryption.key \
--target-dir=/backups/full-encrypted-$(date +%Y%m%d)
Store encryption keys separately from the backups themselves. A secrets manager like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault is the recommended approach. If you store the key alongside the backup, an attacker who gains access to your backup storage has everything they need.
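As a sketch of that pattern (the Vault path and field name are assumptions for this example), the passphrase can be fetched at runtime and handed to pgBackRest through its environment-variable configuration, so it never needs to appear in pgbackrest.conf:

```shell
# Fetch the repo cipher passphrase from Vault at backup time.
# pgBackRest reads any option from a PGBACKREST_<OPTION> environment variable,
# so repo1-cipher-pass can stay out of the on-disk config entirely.
PGBACKREST_REPO1_CIPHER_PASS="$(vault kv get -field=cipher_pass secret/database/backup)"
export PGBACKREST_REPO1_CIPHER_PASS
pgbackrest --stanza=production --type=diff backup
```

The same approach works for the S3 credentials in the earlier config: any option shown in pgbackrest.conf can be supplied as an environment variable instead.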
Automated Backup Scheduling
Manual backups are not backups — they are aspirations. Backup schedules must be automated, monitored, and wired to alert on failure.
System Cron for Bare Metal and VMs
# /etc/cron.d/database-backups
# PostgreSQL — pgBackRest
# Full backup every Sunday at 01:00
0 1 * * 0 postgres pgbackrest --stanza=production --type=full backup 2>&1 | logger -t pgbackrest-full
# Differential backup every day at 01:00 (except Sunday)
0 1 * * 1-6 postgres pgbackrest --stanza=production --type=diff backup 2>&1 | logger -t pgbackrest-diff
# MySQL — XtraBackup
# Full backup every Sunday at 02:00
0 2 * * 0 root /usr/local/bin/mysql-backup.sh full 2>&1 | logger -t xtrabackup-full
# Incremental backup every day at 02:00 (except Sunday)
0 2 * * 1-6 root /usr/local/bin/mysql-backup.sh incremental 2>&1 | logger -t xtrabackup-incr
# Backup verification — restore test every Wednesday at 04:00
0 4 * * 3 root /usr/local/bin/verify-backup.sh 2>&1 | logger -t backup-verify
Kubernetes CronJobs
In Kubernetes environments, CronJobs replace system cron. They offer built-in retry logic, concurrency control, and integration with the cluster's RBAC and secret management.
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-full
namespace: database
spec:
schedule: "0 1 * * 0" # Sunday 01:00 UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 4
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 7200 # 2 hour timeout
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: pgbackrest
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
pgbackrest --stanza=production --type=full backup
RESULT=$?
if [ $RESULT -eq 0 ]; then
curl -s -X POST "$SLACK_WEBHOOK" \
-d '{"text":"PostgreSQL full backup completed successfully"}'
else
curl -s -X POST "$SLACK_WEBHOOK" \
-d '{"text":"ALERT: PostgreSQL full backup FAILED"}'
fi
exit $RESULT
envFrom:
- secretRef:
name: pgbackrest-credentials
- secretRef:
name: slack-webhook
resources:
requests:
memory: 512Mi
cpu: 500m
limits:
memory: 2Gi
cpu: "2"
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-conf
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: pg-backup-diff
namespace: database
spec:
schedule: "0 1 * * 1-6" # Mon-Sat 01:00 UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 3600
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: pgbackrest
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
pgbackrest --stanza=production --type=diff backup
envFrom:
- secretRef:
name: pgbackrest-credentials
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 1Gi
cpu: "1"
volumeMounts:
- name: pgbackrest-config
mountPath: /etc/pgbackrest
volumes:
- name: pgbackrest-config
configMap:
name: pgbackrest-conf
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: mysql-backup-full
namespace: database
spec:
schedule: "0 2 * * 0"
concurrencyPolicy: Forbid
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 7200
template:
spec:
serviceAccountName: db-backup
restartPolicy: OnFailure
containers:
- name: xtrabackup
image: percona/percona-xtrabackup:8.0
command:
- /bin/bash
- -c
- |
xtrabackup --backup --stream=xbstream --compress \
--user=$MYSQL_BACKUP_USER \
--password=$MYSQL_BACKUP_PASS \
--host=mysql-primary.database.svc | \
aws s3 cp - s3://$S3_BUCKET/mysql/full-$(date +%Y%m%d).xbstream
envFrom:
- secretRef:
name: mysql-backup-credentials
- secretRef:
name: aws-credentials
resources:
requests:
memory: 512Mi
cpu: 500m
limits:
memory: 2Gi
cpu: "2"
Kubernetes-Native Backup Approaches
Running databases in Kubernetes introduces a new dimension to backup strategy. In addition to application-level database backups, you need to consider cluster-level backup (etcd), persistent volume snapshots, and operator-managed backups.
Velero for Cluster-Level Backup
Velero backs up Kubernetes resources (deployments, services, configmaps, secrets) and persistent volumes. It is not a replacement for application-level database backups — it captures a crash-consistent snapshot of PVs, which may not be transactionally consistent for databases. Use Velero for cluster recovery and application-level tools (pgBackRest, XtraBackup) for database recovery.
# Install Velero with S3 backend
velero install \
--provider aws \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config region=eu-west-1 \
--snapshot-location-config region=eu-west-1 \
--use-volume-snapshots=true \
--plugins velero/velero-plugin-for-aws:v1.9.0
# Create a scheduled backup of the database namespace
velero schedule create db-namespace-backup \
--schedule="0 3 * * *" \
--include-namespaces database \
--ttl 720h \
--storage-location default \
--volume-snapshot-locations default
# On-demand backup before maintenance
velero backup create pre-maintenance-$(date +%Y%m%d) \
--include-namespaces database,monitoring \
--wait
# Restore a namespace from backup
velero restore create --from-backup pre-maintenance-20260412 \
--include-namespaces database
Longhorn Snapshots on k3s
k3s deployments commonly use Longhorn as their CSI storage provider. Longhorn provides volume-level snapshots and the ability to replicate snapshots to S3-compatible object storage. For databases, combine Longhorn snapshots with application-level backups for defence in depth.
# Longhorn recurring snapshot job via CRD
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: pg-data-snapshot
namespace: longhorn-system
spec:
cron: "0 */4 * * *" # every 4 hours
task: snapshot
retain: 6
concurrency: 1
groups:
- pg-data
labels:
app: postgresql
---
# Longhorn recurring backup to S3
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: pg-data-s3-backup
namespace: longhorn-system
spec:
cron: "0 2 * * *" # daily at 02:00
task: backup
retain: 14
concurrency: 1
groups:
- pg-data
labels:
app: postgresql
---
# VolumeSnapshot using Longhorn CSI
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: pg-data-snap-20260412 # shell substitution does not work in manifests; generate unique names via envsubst or kubectl create with generateName
namespace: database
spec:
volumeSnapshotClassName: longhorn-snapshot-vsc
source:
persistentVolumeClaimName: pg-data-postgresql-0
Operator-Managed Backups
Database operators like CloudNativePG (for PostgreSQL) and Percona Operator (for MySQL) integrate backup management directly into the database lifecycle. The operator handles scheduling, WAL archiving, and retention automatically through custom resources.
# CloudNativePG cluster with integrated backup
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: production-pg
namespace: database
spec:
instances: 3
storage:
size: 100Gi
storageClass: longhorn
backup:
barmanObjectStore:
destinationPath: s3://pg-backups/cnpg/
endpointURL: https://s3.eu-west-1.amazonaws.com
s3Credentials:
accessKeyId:
name: aws-creds
key: ACCESS_KEY_ID
secretAccessKey:
name: aws-creds
key: SECRET_ACCESS_KEY
wal:
compression: gzip
maxParallel: 4
data:
compression: gzip
retentionPolicy: "30d"
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
name: production-pg-weekly
namespace: database
spec:
schedule: "0 1 * * 0"
backupOwnerReference: self
cluster:
name: production-pg
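For an on-demand backup before risky maintenance, CloudNativePG also exposes a Backup custom resource; this sketch targets the cluster defined above:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pre-maintenance-backup
  namespace: database
spec:
  cluster:
    name: production-pg
```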
Cloud Provider-Specific Strategies
AWS — RDS and Aurora
AWS RDS provides automated daily snapshots with configurable retention (up to 35 days) and continuous backup via transaction log archiving for point-in-time recovery. Aurora adds Backtrack, which can rewind the cluster to any point within the backtrack window without requiring a restore — the storage layer is rewound in place.
# Enable automated backups with maximum retention (Terraform)
resource "aws_db_instance" "production" {
identifier = "production-pg"
engine = "postgres"
engine_version = "16.2"
instance_class = "db.r6g.xlarge"
allocated_storage = 500
backup_retention_period = 35 # maximum
backup_window = "03:00-04:00"
copy_tags_to_snapshot = true
deletion_protection = true
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
# PITR is enabled automatically whenever backup_retention_period > 0;
# the log exports below are for audit and troubleshooting
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
}
# Cross-region automated backup replication
resource "aws_db_instance_automated_backups_replication" "dr" {
source_db_instance_arn = aws_db_instance.production.arn
kms_key_id = aws_kms_key.dr_rds.arn
retention_period = 14
}
# Aurora Backtrack (MySQL-compatible Aurora only)
resource "aws_rds_cluster" "aurora_production" {
cluster_identifier = "aurora-production"
engine = "aurora-mysql"
engine_version = "8.0.mysql_aurora.3.05.2"
backtrack_window = 86400 # 24 hours of backtrack
backup_retention_period = 35
preferred_backup_window = "03:00-04:00"
storage_encrypted = true
}
# Manual snapshot with cross-region copy
aws rds create-db-snapshot \
--db-instance-identifier production-pg \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:eu-west-1:123456789:snapshot:pre-migration-20260412 \
--target-db-snapshot-identifier pre-migration-20260412-dr \
--source-region eu-west-1 \
--region us-east-1 \
--kms-key-id arn:aws:kms:us-east-1:123456789:key/dr-key-id
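Once a backtrack window is configured, the rewind itself is a single CLI call; the cluster identifier and timestamp below are illustrative:

```shell
CLUSTER_ID="aurora-production"        # illustrative cluster identifier
TARGET_TS="2026-04-12T16:42:30Z"      # must fall within the backtrack window
# Rewind the cluster's storage in place (no snapshot restore, no new cluster);
# the cluster is briefly unavailable while the rewind completes
aws rds backtrack-db-cluster \
  --db-cluster-identifier "$CLUSTER_ID" \
  --backtrack-to "$TARGET_TS"
```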
Azure — Flexible Server
Azure Database for PostgreSQL and MySQL Flexible Server includes automatic backups with locally redundant or geo-redundant storage. Geo-redundant backup allows cross-region restore for disaster recovery.
# Azure Flexible Server with geo-redundant backup (Terraform)
resource "azurerm_postgresql_flexible_server" "production" {
name = "production-pg"
location = "westeurope"
resource_group_name = azurerm_resource_group.db.name
sku_name = "GP_Standard_D4s_v3"
version = "16"
storage_mb = 524288 # 512 GB
backup_retention_days = 35
geo_redundant_backup_enabled = true
authentication {
active_directory_auth_enabled = true
password_auth_enabled = false
}
}
# Azure Blob immutability for self-managed backups
resource "azurerm_storage_management_policy" "backup_lifecycle" {
storage_account_id = azurerm_storage_account.backups.id
rule {
name = "backup-tiering"
enabled = true
filters {
blob_types = ["blockBlob"]
prefix_match = ["pg-backups/"]
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = 30
tier_to_archive_after_days_since_modification_greater_than = 90
delete_after_days_since_modification_greater_than = 2555
}
}
}
}
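The cross-region restore itself is driven from the CLI; the resource group, server names, and target region below are illustrative:

```shell
DR_REGION="northeurope"               # illustrative paired DR region
# Geo-restore a new flexible server from the geo-redundant backup of production-pg
az postgres flexible-server geo-restore \
  --resource-group db-dr \
  --name production-pg-dr \
  --source-server production-pg \
  --location "$DR_REGION"
```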
GCP — Cloud SQL
Cloud SQL provides automated backups and point-in-time recovery out of the box. For self-managed databases on GCE or GKE, GCS with nearline and coldline storage classes provides cost-effective long-term backup storage.
# Cloud SQL with automated backups and PITR (Terraform)
resource "google_sql_database_instance" "production" {
name = "production-pg"
database_version = "POSTGRES_16"
region = "europe-west1"
settings {
tier = "db-custom-4-16384"
backup_configuration {
enabled = true
start_time = "03:00"
point_in_time_recovery_enabled = true
transaction_log_retention_days = 7
backup_retention_settings {
retained_backups = 30
retention_unit = "COUNT"
}
}
ip_configuration {
ssl_mode = "ENCRYPTED_ONLY"
}
}
}
# Export to GCS for long-term retention
gcloud sql export sql production-pg \
gs://pg-backups-longterm/export-$(date +%Y%m%d).sql.gz \
--database=production_db \
--offload
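Point-in-time recovery on Cloud SQL works by cloning the instance to a new one at the target timestamp, which leaves production untouched; the instance names and timestamp here are illustrative:

```shell
RECOVERY_TS="2026-04-12T16:42:30.000Z"  # illustrative PITR target
# Clone production-pg to a new instance at its state as of RECOVERY_TS
gcloud sql instances clone production-pg production-pg-recovered \
  --point-in-time "$RECOVERY_TS"
```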
Backup Verification and Restore Testing
A backup that has never been tested is not a backup — it is a hope. Automated restore testing should run on a regular schedule, ideally weekly, in an isolated environment. The test must validate not just that the restore completes without errors, but that the restored data is consistent and the application can connect and query it.
#!/bin/bash
# verify-backup.sh — Automated backup verification script
set -euo pipefail
RESTORE_DIR="/tmp/backup-verify-$(date +%Y%m%d-%H%M%S)"
LOG_FILE="/var/log/backup-verify.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"
DB_TYPE="${1:-postgresql}" # postgresql or mysql
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG_FILE"; }
alert() {
log "ALERT: $*"
curl -s -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"BACKUP VERIFY FAILED: $*\"}"
}
cleanup() {
log "Cleaning up $RESTORE_DIR"
rm -rf "$RESTORE_DIR"
if [ "$DB_TYPE" = "postgresql" ]; then
sudo -u postgres pg_ctl -D "$RESTORE_DIR/pgdata" stop -m fast 2>/dev/null || true
else
mysqladmin -S /tmp/mysql-verify.sock shutdown 2>/dev/null || true
fi
}
trap cleanup EXIT
mkdir -p "$RESTORE_DIR"
if [ "$DB_TYPE" = "postgresql" ]; then
log "Starting PostgreSQL backup verification"
# Restore latest pgBackRest backup to temporary directory
# Guard with if: under set -e a bare failing pipeline would abort the script
# before the status check ever ran
if ! pgbackrest --stanza=production \
    --pg1-path="$RESTORE_DIR/pgdata" \
    --type=immediate \
    --target-action=promote \
    restore 2>&1 | tee -a "$LOG_FILE"; then
  alert "pgBackRest restore failed"
  exit 1
fi
# Start PostgreSQL on a different port (pg_ctl refuses to run as root)
chown -R postgres:postgres "$RESTORE_DIR/pgdata"
chmod 700 "$RESTORE_DIR/pgdata"
sudo -u postgres pg_ctl -D "$RESTORE_DIR/pgdata" \
  -o "-p 5433 -c listen_addresses=127.0.0.1" \
  start
sleep 5
# Verify data integrity
TABLES=$(sudo -u postgres psql -p 5433 -d production_db -t -c \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema='public';")
log "Verified $TABLES tables exist"
ROW_CHECK=$(sudo -u postgres psql -p 5433 -d production_db -t -c \
  "SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days';")
log "Recent orders count: $ROW_CHECK"
if [ "$ROW_CHECK" -lt 1 ]; then
alert "PostgreSQL restore has no recent data — possible stale backup"
exit 1
fi
# Run pg_amcheck for corruption detection (PostgreSQL 14+)
sudo -u postgres pg_amcheck -p 5433 -d production_db --heapallindexed 2>&1 | tee -a "$LOG_FILE"
log "PostgreSQL backup verification PASSED"
else
log "Starting MySQL backup verification"
# Prepare and restore latest XtraBackup
LATEST_FULL=$(ls -td /backups/full-* | head -1)
cp -r "$LATEST_FULL" "$RESTORE_DIR/mysql-data"
xtrabackup --prepare --target-dir="$RESTORE_DIR/mysql-data"
chown -R mysql:mysql "$RESTORE_DIR/mysql-data"
# Start MySQL on a different socket and port (mysqld refuses to run as root)
mysqld --user=mysql --datadir="$RESTORE_DIR/mysql-data" \
  --socket=/tmp/mysql-verify.sock \
  --port=3307 \
  --bind-address=127.0.0.1 &
sleep 10
# Verify data
TABLES=$(mysql -S /tmp/mysql-verify.sock -e \
"SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='production_db';" -sN)
log "Verified $TABLES tables exist"
mysqlcheck -S /tmp/mysql-verify.sock --all-databases --check 2>&1 | tee -a "$LOG_FILE"
log "MySQL backup verification PASSED"
fi
curl -s -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\":\"Backup verification PASSED for $DB_TYPE at $(date)\"}"
log "Verification complete"
Disaster Recovery Planning: RTO and RPO
Every backup strategy must be designed around two metrics: Recovery Time Objective (RTO) — how long you can afford to be down — and Recovery Point Objective (RPO) — how much data you can afford to lose. These numbers drive every decision about backup frequency, method, and architecture.
| Scenario | RTO | RPO | Strategy |
|---|---|---|---|
| E-commerce checkout | < 5 min | 0 (zero data loss) | Synchronous replication + continuous WAL archiving + hot standby failover |
| SaaS application | < 30 min | < 1 min | Streaming replication + WAL-G continuous archiving + automated failover |
| Internal tools | < 4 hours | < 1 hour | Daily differential + hourly incremental + WAL archiving |
| Analytics / data warehouse | < 24 hours | < 24 hours | Daily full backup to cloud storage |
| Dev / staging | < 48 hours | < 1 week | Weekly full backup |
For zero RPO, you need synchronous replication to at least one standby. This adds latency to every write transaction but guarantees no committed transaction is lost. Most production systems accept near-zero RPO (seconds of potential loss) by using asynchronous streaming replication with continuous WAL/binlog archiving, which avoids the write-latency penalty.
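As a concrete reference point, PostgreSQL enables synchronous replication with two settings on the primary — sketched here with placeholder standby names. MySQL's nearest equivalent is semi-synchronous replication, which confirms only receipt (not application) of each transaction on a replica.

```ini
# postgresql.conf on the primary — standby names are placeholders
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on   # commit returns only after a standby has flushed the WAL
```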
Disaster Recovery Runbook Template
# DR Runbook: Database Recovery
## Severity Levels
- P1: Complete data loss / corruption — all hands, CEO notified
- P2: Partial data loss / single region down — on-call team + escalation
- P3: Replica failure / backup failure — on-call investigation
## Recovery Procedures
### Scenario A: Primary DB failure, replicas healthy
1. Promote replica to primary (automated via Patroni / orchestrator)
2. Verify application connectivity
3. Re-establish replication from new primary
4. Investigate root cause
### Scenario B: Complete cluster failure, backups intact
1. Provision new database infrastructure
2. Restore latest full backup
3. Apply WAL/binlog to reach latest consistent point
4. Verify data integrity with checksums
5. Update DNS / connection strings
6. Resume application traffic
7. Re-establish backup schedule immediately
### Scenario C: Data corruption (bad migration / SQL injection)
1. Identify exact timestamp of corruption
2. Restore to point-in-time just before corruption
3. Export affected tables from restored copy
4. Merge clean data into production
5. OR: full PITR restore if corruption is widespread
## Contacts
- DBA on-call: [PagerDuty rotation]
- Infrastructure: [PagerDuty rotation]
- VP Engineering: [phone number]
## Validation Checklist
- [ ] Application health checks pass
- [ ] Row counts match expected ranges
- [ ] Recent transactions are present
- [ ] Replication re-established
- [ ] Backup schedule resumed
- [ ] Post-incident review scheduled
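Scenario C hinges on knowing the corruption timestamp and restoring to just before it. With the tools covered earlier, the point-in-time step looks roughly like this — stanza name, timestamp, and binlog filename are all illustrative:

```bash
# PostgreSQL: pgBackRest restore to one second before the bad migration
pgbackrest --stanza=production \
  --delta \
  --type=time \
  --target="2024-05-14 09:29:59+00" \
  --target-action=promote \
  restore

# MySQL: restore the last full backup, then replay binlogs up to the cutoff
mysqlbinlog --stop-datetime="2024-05-14 09:29:59" binlog.000042 | mysql
```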
Backup Monitoring and Alerting
Backup systems fail silently. The database keeps running, the application keeps serving traffic, and nobody notices that backups stopped working three weeks ago — until they need one. Proactive monitoring is essential.
# Prometheus alerting rules for backup monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backup-alerts
namespace: monitoring
spec:
groups:
- name: database-backups
rules:
- alert: BackupTooOld
expr: |
(time() - backup_last_successful_timestamp_seconds) > 90000
for: 10m
labels:
severity: critical
annotations:
summary: "Database backup is older than 25 hours"
description: "Last successful backup for {{ $labels.database }} was {{ $value | humanizeDuration }} ago"
- alert: BackupJobFailed
expr: |
kube_job_status_failed{job_name=~".*backup.*"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup CronJob failed: {{ $labels.job_name }}"
- alert: WALArchivingLagging
expr: |
pg_stat_archiver_failed_count > 0
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL WAL archiving has failures"
- alert: BackupStorageQuotaNearing
expr: |
(backup_storage_used_bytes / backup_storage_quota_bytes) > 0.85
for: 30m
labels:
severity: warning
annotations:
summary: "Backup storage at {{ $value | humanizePercentage }} capacity"
- alert: BinlogSpaceCritical
expr: |
mysql_binlog_size_bytes > 53687091200
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL binlog space exceeds 50GB — check archiving"
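The BackupTooOld rule above presumes every backup job publishes a backup_last_successful_timestamp_seconds metric. One common pattern — sketched here with a hypothetical Pushgateway address and label values — is to push the timestamp as the job's final step:

```bash
# Final step of any backup script: record success in Prometheus
# (Pushgateway address and labels are assumptions — adjust to your environment)
cat <<EOF | curl -s --data-binary @- \
  http://pushgateway.monitoring.svc:9091/metrics/job/db_backup/instance/production_db
# TYPE backup_last_successful_timestamp_seconds gauge
backup_last_successful_timestamp_seconds $(date +%s)
EOF
```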
# Custom Prometheus exporter for pgBackRest metrics
#!/usr/bin/env python3
"""pgBackRest Prometheus exporter — exposes backup age and size metrics."""
import json
import subprocess
import time
from prometheus_client import start_http_server, Gauge
BACKUP_AGE = Gauge('pgbackrest_last_backup_age_seconds', 'Seconds since last backup', ['stanza', 'type'])
BACKUP_SIZE = Gauge('pgbackrest_last_backup_size_bytes', 'Size of last backup', ['stanza', 'type'])
BACKUP_REPO_SIZE = Gauge('pgbackrest_repo_size_bytes', 'Total repository size', ['stanza'])
def collect():
    result = subprocess.run(
        ['pgbackrest', '--output=json', 'info'],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return  # keep last gauge values; the BackupTooOld alert catches staleness
    info = json.loads(result.stdout)
    for stanza_info in info:
        stanza = stanza_info['name']
        for backup in stanza_info.get('backup', []):
            backup_type = backup['type']
            stop_time = backup['timestamp']['stop']
            age = time.time() - stop_time
            BACKUP_AGE.labels(stanza=stanza, type=backup_type).set(age)
            size = backup['info']['size']
            BACKUP_SIZE.labels(stanza=stanza, type=backup_type).set(size)
if __name__ == '__main__':
    start_http_server(9854)
    while True:
        collect()
        time.sleep(300)
Common Mistakes and Anti-Patterns
After managing database backups across dozens of production environments, these are the mistakes that cause the most damage.
1. Backing up to the same disk as the database. If the disk fails, you lose both the database and the backup. Always write backups to a separate storage target — ideally off-host and off-region.
2. Never testing restores. A backup that cannot be restored is not a backup. Schedule automated restore tests weekly and have engineers perform manual restore drills quarterly.
3. Relying solely on replication as a backup. Replication is not backup. A DROP TABLE on the primary is instantly replicated to all replicas. Replication protects against hardware failure, not logical errors.
4. Not monitoring backup jobs. Cron jobs fail silently. Kubernetes CronJobs get suspended. S3 credentials expire. Every backup job must report success or failure to a monitoring system, and alert if the most recent successful backup is older than your RPO.
5. Storing encryption keys alongside backups. If an attacker gains access to your backup storage, they should not also have the decryption key. Store keys in a dedicated secrets manager.
6. No retention policy. Without lifecycle policies, backup storage grows unbounded. Define clear retention windows: 7 days for WAL segments, 30 days for daily backups, 12 months for monthly backups, and automate their deletion.
7. Ignoring backup performance impact. Running a full backup on a primary database during peak hours degrades application performance. Schedule backups during low-traffic windows, or back up from a dedicated replica.
8. Using mysqldump for large databases without --single-transaction. Without this flag, mysqldump locks tables, blocking writes for the duration of the dump. For large databases this can mean minutes or hours of downtime. (Note that --single-transaction provides a consistent, lock-free snapshot only for InnoDB tables; MyISAM tables are still locked.)
9. Forgetting to back up database configuration. Restoring data is only half the battle. If you lose postgresql.conf, pg_hba.conf, my.cnf, replication settings, and user grants, you cannot bring the database back to a usable state. Include configuration files in your backup process.
10. Not documenting the restore procedure. During an outage, the person restoring the database may not be the person who set up the backups. A written, tested runbook is critical.
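Several of these mistakes yield to small amounts of automation. For mistake #6, retention can be reduced to a pure function that decides which backups survive — a sketch of grandfather-father-son selection, where the 7/4/12 windows are illustrative defaults:

```python
from datetime import date, timedelta

def backups_to_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Grandfather-father-son selection: keep the last `daily` days,
    the newest backup in each of the last `weekly` ISO weeks, and the
    newest backup in each of the last `monthly` calendar months."""
    keep = {d for d in backup_dates if (today - d).days < daily}
    weeks, months = {}, {}
    for d in sorted(backup_dates):           # ascending: last write wins
        weeks[d.isocalendar()[:2]] = d       # newest backup per ISO week
        months[(d.year, d.month)] = d        # newest backup per month
    keep.update(sorted(weeks.values(), reverse=True)[:weekly])
    keep.update(sorted(months.values(), reverse=True)[:monthly])
    return keep

# Example: 120 days of daily backups — decide what to prune
today = date(2024, 6, 30)
have = [today - timedelta(days=i) for i in range(120)]
kept = backups_to_keep(have, today)
print(f"keep {len(kept)}, prune {len(have) - len(kept)}")
```

Everything outside the returned set is safe to delete, which makes the deletion step a trivial loop over your backup listing.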
Cost Optimization for Backup Storage
Backup storage costs can grow rapidly, especially with frequent full backups of large databases. These strategies keep costs under control without compromising recoverability.
Use incremental/differential backups. A weekly full plus daily differentials uses a fraction of the storage compared to daily full backups. pgBackRest's delta restore capability means incremental backups restore almost as fast as full backups.
Enable compression. Modern compression algorithms like zstd offer excellent compression ratios (5:1 to 10:1 for typical database data) with minimal CPU overhead. Both pgBackRest and XtraBackup support zstd natively.
Implement storage tiering. Move backups to progressively cheaper storage tiers as they age. The lifecycle policies shown earlier (S3 Standard → IA → Glacier, GCS Standard → Nearline → Coldline) can reduce long-term storage costs by 70–90%.
Deduplicate where possible. pgBackRest uses block-level deduplication in its repository, storing only changed blocks across backups. This dramatically reduces storage for databases where the majority of data is static.
Right-size retention windows. Many teams default to keeping backups forever out of caution. Analyze your actual recovery patterns and compliance requirements, then set retention that matches. A 7-year retention requirement can use deep archive storage at less than $1/TB/month on most cloud providers.
# Cost comparison: Full vs Incremental backup storage
# Assuming 500 GB database, 5% daily change rate, 30-day retention
# Strategy A: Daily full backups
# 500 GB × 30 days = 15,000 GB = 15 TB
# S3 Standard: 15 TB × $0.023/GB = $345/month
# Strategy B: Weekly full + daily incremental
# 4 full × 500 GB = 2,000 GB
# 26 incremental × 25 GB = 650 GB
# Total: 2,650 GB ≈ 2.6 TB
# S3 Standard: 2,650 GB × $0.023/GB ≈ $60.95/month
# Strategy C: Strategy B + lifecycle tiering
# Current week: 525 GB Standard = $12.08
# Weeks 2-4: 2,125 GB Standard-IA = $26.56
# Total: $38.64/month
# Savings: Strategy C is 89% cheaper than Strategy A
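The arithmetic above can be sanity-checked in a few lines of Python; the prices mirror the S3 rates assumed in this article, so verify current pricing before planning:

```python
# Verify the backup storage cost comparison ($/GB-month rates as assumed above)
STD, IA = 0.023, 0.0125
DB_GB, CHANGE = 500, 0.05

# Strategy A: 30 daily full backups
a_gb = 30 * DB_GB
a_cost = a_gb * STD

# Strategy B: 4 weekly fulls + 26 daily incrementals (5% of 500 GB each)
b_gb = 4 * DB_GB + 26 * DB_GB * CHANGE
b_cost = b_gb * STD

# Strategy C: current week (1 full + 1 incremental) on Standard, rest on IA
current_gb = DB_GB + DB_GB * CHANGE
c_cost = current_gb * STD + (b_gb - current_gb) * IA

print(f"A: {a_gb:,.0f} GB  ${a_cost:,.2f}/month")
print(f"B: {b_gb:,.0f} GB  ${b_cost:,.2f}/month")
print(f"C: ${c_cost:,.2f}/month — {1 - c_cost / a_cost:.0%} cheaper than A")
```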
Putting It All Together: A Complete Backup Architecture
Here is the recommended backup architecture for a production environment running both MySQL and PostgreSQL, deployed on Kubernetes with multi-cloud disaster recovery.
For PostgreSQL:
- pgBackRest as the primary backup tool with two repositories (S3 + Azure Blob)
- Continuous WAL archiving to both repositories
- Weekly full backup + daily differential (via K8s CronJob)
- Longhorn volume snapshots every 4 hours for fast rollback
- Automated restore verification every Wednesday
- Prometheus monitoring with alerts on backup age, failure, and storage
For MySQL:
- Percona XtraBackup for physical backups streamed to S3
- Continuous binary log shipping to separate storage
- Weekly full + daily incremental (via K8s CronJob)
- Daily mysqldump of critical schemas for portable logical backups
- Longhorn volume snapshots every 4 hours
- Automated restore verification every Thursday
For the Kubernetes cluster:
- Velero daily backup of database namespaces with PV snapshots
- RKE2/k3s etcd snapshots every 6 hours to off-cluster storage
- GitOps repository as the source of truth for all manifests
For disaster recovery:
- S3 cross-region replication to DR region
- Azure GRS for geo-redundant backup copies
- Monthly DR drill: full cluster rebuild from backups in DR region
- Documented runbook with decision tree for each failure scenario
Conclusion
Database backups are the last line of defense between your business and catastrophic data loss. The strategies outlined in this article — physical and logical backups, continuous WAL and binlog archiving, multi-cloud storage with lifecycle policies, encryption at every layer, automated scheduling via Kubernetes CronJobs, and systematic restore verification — represent the current state of the art for production MySQL and PostgreSQL environments.
The most important takeaway is this: a backup strategy is only as good as its last successful restore test. Every tool, script, and architecture pattern in this article exists to serve one purpose — ensuring that when the worst happens, you can recover your data, meet your RTO and RPO commitments, and keep your business running. Build your backup system, automate it, monitor it, test it, and then test it again. Your future self will thank you.