MongoDB High Availability in Production: Replica Sets, Sharding, and Kubernetes Operators
MongoDB occupies a unique position in the database landscape. Its document model maps naturally to application objects, its flexible schema accommodates evolving data structures without migrations, and its built-in replication and sharding primitives provide a foundation for high availability and horizontal scaling that relational databases require external tooling to achieve. But a foundation is not a finished building. Running MongoDB in production with genuine high availability — where a node failure, a network partition, or an entire region going offline does not result in downtime or data loss — demands deliberate architecture, careful tuning, and rigorous operational discipline.
This guide walks through every layer of MongoDB HA: from the replica set consensus protocol and oplog replication that power automatic failover, through the sharded cluster architecture that enables horizontal scaling, to the Kubernetes operators that automate lifecycle management, and across the deployment options on AWS, Azure, GCP, and bare metal k3s with Rancher. Every section includes concrete configuration, YAML manifests, and operational procedures you can adapt to your environment.
MongoDB Replica Set Architecture
A replica set is MongoDB's fundamental unit of high availability. It is a group of mongod processes that maintain the same data set. One member is the primary, which receives all write operations. The remaining members are secondaries, which replicate data from the primary by tailing its operation log (oplog). If the primary becomes unavailable, the replica set holds an election to choose a new primary from the eligible secondaries — typically within 10 to 12 seconds.
A production replica set should have at least three data-bearing members, ideally spread across different failure domains (availability zones, racks, or data centres). This ensures that the replica set can survive the loss of any single member and still maintain a majority for election purposes. An optional arbiter participates in elections but holds no data — it exists solely to break ties when you have an even number of data-bearing members, though MongoDB best practice is to use an odd number of data-bearing members instead.
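The majority arithmetic behind these sizing rules can be sketched in a few lines (Python here purely for illustration):

```python
# Majority and fault tolerance for a replica set with n voting members.
# A candidate needs floor(n/2) + 1 votes, so the set tolerates
# n - majority simultaneous member failures.

def majority(voting_members: int) -> int:
    return voting_members // 2 + 1

def fault_tolerance(voting_members: int) -> int:
    return voting_members - majority(voting_members)

for n in (3, 4, 5, 7):
    print(n, majority(n), fault_tolerance(n))
```

Note that 4 voting members tolerate only 1 failure, the same as 3, which is why an odd member count is the recommended practice.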
Oplog and Replication Mechanics
The oplog is a capped collection (local.oplog.rs) that records every data-modifying operation on the primary in idempotent form. Secondaries continuously tail the primary's oplog and apply operations locally. The oplog size determines how far behind a secondary can fall before it needs a full resync — for production workloads, size the oplog to hold at least 24 to 72 hours of write activity. The oplog can be resized at runtime with the replSetResizeOplog command, and MongoDB 4.4+ additionally supports a minimum oplog retention period via storage.oplogMinRetentionHours.
# Check current oplog size and window
rs.printReplicationInfo()
# Resize the oplog to 50 GB
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })
# Check replication lag on secondaries
rs.printSecondaryReplicationInfo()
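As a back-of-the-envelope check on the 24-to-72-hour guidance, the required oplog size follows directly from your sustained oplog churn rate (a sketch; the real number comes from rs.printReplicationInfo() measured under production load):

```python
def required_oplog_mb(oplog_mb_per_hour: float, window_hours: float) -> int:
    """Oplog size needed to retain `window_hours` of write activity, in MB."""
    return int(oplog_mb_per_hour * window_hours)

# e.g. ~1 GB/hour of oplog churn with a 48-hour safety window
print(required_oplog_mb(1024, 48))  # roughly the 50 GB (51200 MB) used in this guide
```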
Elections and the Raft-Based Protocol
Replica set elections use a Raft-inspired consensus protocol (replication protocol version 1, the default since MongoDB 3.2 and the only protocol from 4.0 onward). When a secondary detects that the primary is unreachable (default electionTimeoutMillis of 10,000 ms), it may call an election. To win, a candidate must receive votes from a majority of the voting members. Among eligible candidates, the member with the most recent oplog entry and the highest priority wins. You can influence election outcomes by setting member priorities — a member with priority: 0 can never become primary, which is useful for analytics replicas or members in remote regions.
# Initiate a 3-member replica set
rs.initiate({
_id: "rs-production",
members: [
{ _id: 0, host: "mongo-0.mongo-svc:27017", priority: 10 },
{ _id: 1, host: "mongo-1.mongo-svc:27017", priority: 5 },
{ _id: 2, host: "mongo-2.mongo-svc:27017", priority: 5 }
],
settings: {
electionTimeoutMillis: 10000,
heartbeatTimeoutSecs: 10,
chainingAllowed: true
}
})
# Check replica set status
rs.status()
# Step down the primary (for maintenance)
rs.stepDown(60) // step down for 60 seconds
# Force reconfiguration (emergency)
rs.reconfig(newConfig, { force: true })
Read Preference and Write Concern
Read Preference controls where the driver sends read operations. The options are:
- primary — All reads go to the primary. Strongest consistency but no read scaling.
- primaryPreferred — Reads go to the primary unless it is unavailable, then to a secondary.
- secondary — All reads go to secondaries. Provides read scaling but may return stale data.
- secondaryPreferred — Reads go to secondaries unless none are available.
- nearest — Reads go to the member with the lowest network latency regardless of role. Best for geo-distributed deployments.
Write Concern controls how many replica set members must acknowledge a write before the operation returns to the client.
- w: 1 — Only the primary must acknowledge. Fastest but risks data loss if the primary fails before replication.
- w: "majority" — A majority of data-bearing members must acknowledge. This is the recommended production default. It guarantees the write survives a primary election.
- w: <number> — A specific number of members must acknowledge.
- j: true — The write must be committed to the on-disk journal before acknowledgement. Combined with w: "majority", this provides the strongest durability guarantee.
# Connection string with write concern and read preference
mongodb://mongo-0:27017,mongo-1:27017,mongo-2:27017/appdb?replicaSet=rs-production&w=majority&j=true&readPreference=secondaryPreferred&readPreferenceTags=region:us-east
# SRV connection string (DNS-based discovery)
mongodb+srv://appuser:password@cluster.example.com/appdb?w=majority&retryWrites=true&readPreference=nearest
MongoDB Sharded Cluster Architecture
A replica set provides high availability but not horizontal write scaling — all writes go to a single primary. When your data set exceeds the capacity of a single server or your write throughput exceeds what one primary can handle, you need sharding. A sharded cluster distributes data across multiple replica sets (shards) using a shard key, enabling horizontal scaling of both storage and write throughput.
A sharded cluster has three component types. mongos routers are stateless query routers that direct client operations to the appropriate shard(s). Deploy at least two for redundancy. Config servers form a replica set that stores the cluster metadata — which chunks live on which shards, the shard key ranges, and balancer state. Shard servers are replica sets that each hold a subset of the sharded data.
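Wiring these components together looks roughly like the following (a sketch; hostnames and replica set names are placeholders):

```shell
# Start a mongos router pointed at the config server replica set
mongos --configdb cfgRS/cfg-0:27019,cfg-1:27019,cfg-2:27019 --bind_ip_all
# From mongosh connected to the mongos, register each shard replica set
sh.addShard("rs-shard-0/shard0-a:27018,shard0-b:27018,shard0-c:27018")
sh.addShard("rs-shard-1/shard1-a:27018,shard1-b:27018,shard1-c:27018")
# Confirm both shards are visible and the balancer is running
sh.status()
```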
Shard Key Selection
The shard key is the most consequential decision in a sharded cluster. It determines how data is distributed across shards and directly impacts query performance, write distribution, and the ability to scale. A good shard key has high cardinality (many distinct values), distributes writes evenly across shards, and supports the most common query patterns with targeted operations rather than scatter-gather.
# Enable sharding on a database
sh.enableSharding("appdb")
# Shard a collection with a hashed shard key (even distribution)
sh.shardCollection("appdb.events", { "event_id": "hashed" })
# Shard with a ranged shard key (supports range queries)
sh.shardCollection("appdb.orders", { "customer_id": 1, "order_date": 1 })
# Check shard distribution
db.orders.getShardDistribution()
# View chunk distribution across shards
use config
db.chunks.aggregate([
{ $group: { _id: "$shard", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
])
Common shard key strategies include: hashed keys for uniform write distribution (best when you don't need range queries on the shard key), compound keys that combine a coarse grouping field with a high-cardinality field (e.g., { tenant_id: 1, _id: 1 } for multi-tenant applications), and zone-based keys that align data placement with geographic regions.
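For instance, the multi-tenant compound key mentioned above keeps each tenant's documents on a bounded set of chunks while _id preserves cardinality (collection and field names here are illustrative):

```javascript
sh.shardCollection("appdb.documents", { "tenant_id": 1, "_id": 1 })
// Queries that include tenant_id are targeted to the owning shard(s)...
db.documents.find({ tenant_id: "acme", status: "open" })
// ...while queries without it fan out to every shard (scatter-gather)
db.documents.find({ status: "open" })
```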
Zone Sharding for Multi-Region
Zone sharding constrains specific ranges of the shard key to specific shards, enabling data locality. For example, you can ensure that European customer data lives on shards in the EU region while US customer data lives on shards in the US region.
# Add shards to zones
sh.addShardTag("shard-us-east", "US")
sh.addShardTag("shard-eu-west", "EU")
sh.addShardTag("shard-ap-south", "APAC")
# Define zone ranges
sh.addTagRange("appdb.customers",
{ "region": "US", "customer_id": MinKey },
{ "region": "US", "customer_id": MaxKey },
"US"
)
sh.addTagRange("appdb.customers",
{ "region": "EU", "customer_id": MinKey },
{ "region": "EU", "customer_id": MaxKey },
"EU"
)
sh.addTagRange("appdb.customers",
{ "region": "APAC", "customer_id": MinKey },
{ "region": "APAC", "customer_id": MaxKey },
"APAC"
)
# Verify zone configuration
sh.status()
Multi-Region MongoDB Deployment
Distributing MongoDB across multiple regions serves two purposes: disaster recovery (surviving the loss of an entire region) and latency optimization (serving reads from the nearest replica). MongoDB supports multi-region deployments through replica set members distributed across regions, zone sharding for data locality, and read preference configuration that routes reads to the nearest member.
In a five-member replica set distributed across three regions (2 in primary region, 2 in DR region, 1 in read region), the loss of the primary region still leaves three members available — enough for a majority to elect a new primary. The member in the read region should have priority: 0 to prevent it from becoming primary (high cross-region latency would degrade write performance). Use hidden: true for analytics-dedicated members that should not receive regular application reads.
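Such a topology can be expressed directly in the replica set configuration (a sketch with placeholder hostnames):

```javascript
rs.initiate({
  _id: "rs-global",
  members: [
    { _id: 0, host: "mongo-use1-a:27017", priority: 10 },  // primary region
    { _id: 1, host: "mongo-use1-b:27017", priority: 10 },
    { _id: 2, host: "mongo-usw2-a:27017", priority: 5 },   // DR region
    { _id: 3, host: "mongo-usw2-b:27017", priority: 5 },
    // read region: serves nearest/secondary reads but is never elected primary;
    // add hidden: true (with priority: 0) for an analytics-only member instead
    { _id: 4, host: "mongo-euw1-a:27017", priority: 0 }
  ]
})
```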
MongoDB Community Kubernetes Operator
The MongoDB Community Kubernetes Operator deploys and manages MongoDB replica sets on Kubernetes. It is the open-source operator from MongoDB Inc. that handles StatefulSet management, automated replica set configuration, TLS certificate rotation, user management, and rolling upgrades.
# Install the MongoDB Community Operator via Helm
helm repo add mongodb https://mongodb.github.io/helm-charts
helm repo update
helm install community-operator mongodb/community-operator \
--namespace mongodb \
--create-namespace \
--set operator.watchNamespace="*"
MongoDBCommunity CRD Specification
The MongoDBCommunity custom resource defines the desired state of a MongoDB replica set. Below is a production-ready specification.
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
name: production-mongodb
namespace: databases
spec:
members: 3
type: ReplicaSet
version: "7.0.12"
security:
authentication:
modes: ["SCRAM"]
tls:
enabled: true
certificateKeySecretRef:
name: mongodb-tls-cert
caCertificateSecretRef:
name: mongodb-ca-cert
users:
- name: appuser
db: admin
passwordSecretRef:
name: mongodb-appuser-password
roles:
- name: readWrite
db: appdb
- name: clusterMonitor
db: admin
scramCredentialsSecretName: appuser-scram
- name: backup-user
db: admin
passwordSecretRef:
name: mongodb-backup-password
roles:
- name: backup
db: admin
- name: restore
db: admin
scramCredentialsSecretName: backup-scram
- name: monitoring
db: admin
passwordSecretRef:
name: mongodb-monitoring-password
roles:
- name: clusterMonitor
db: admin
scramCredentialsSecretName: monitoring-scram
additionalMongodConfig:
storage.wiredTiger.engineConfig.cacheSizeGB: 4
storage.wiredTiger.engineConfig.journalCompressor: snappy
storage.wiredTiger.collectionConfig.blockCompressor: snappy
net.maxIncomingConnections: 10000
operationProfiling.mode: slowOp
operationProfiling.slowOpThresholdMs: 100
replication.oplogSizeMB: 51200
setParameter.cursorTimeoutMillis: 600000
statefulSet:
spec:
template:
spec:
containers:
- name: mongod
resources:
requests:
cpu: "2"
memory: 8Gi
limits:
cpu: "4"
memory: 16Gi
- name: mongodb-agent
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: topology.kubernetes.io/zone
labelSelector:
matchLabels:
app: production-mongodb-svc
tolerations:
- key: "workload"
operator: "Equal"
value: "database"
effect: "NoSchedule"
volumeClaimTemplates:
- metadata:
name: data-volume
spec:
storageClassName: gp3-csi
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
- metadata:
name: logs-volume
spec:
storageClassName: gp3-csi
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
This specification creates a three-member replica set running MongoDB 7.0 with SCRAM authentication, TLS encryption, separate data and log volumes, pod anti-affinity across availability zones, and WiredTiger tuning appropriate for a node with 16 GB RAM. The operator handles replica set initialization, member configuration, and rolling upgrades when you change the version field.
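Applying and observing the resource is an ordinary kubectl workflow (a sketch; the manifest filename is illustrative, the resource and label names match the spec above):

```shell
kubectl apply -f production-mongodb.yaml
# The operator reports reconciliation progress in the resource status
kubectl get mongodbcommunity production-mongodb -n databases -o jsonpath='{.status.phase}'
# Watch the StatefulSet pods come up one at a time
kubectl get pods -n databases -l app=production-mongodb-svc -w
```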
Percona Server for MongoDB Operator
The Percona Operator for MongoDB (PSMDB Operator) provides a more feature-rich alternative to the Community Operator. It deploys Percona Server for MongoDB (a drop-in replacement for MongoDB with additional enterprise features), manages sharded clusters as well as replica sets, integrates backup via Percona Backup for MongoDB (PBM), and supports point-in-time recovery.
# Install Percona Operator
helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
helm install psmdb-operator percona/psmdb-operator \
--namespace psmdb \
--create-namespace
# Percona Server for MongoDB Cluster CRD
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
name: production-psmdb
namespace: databases
spec:
crVersion: "1.16.0"
image: percona/percona-server-mongodb:7.0.12-7
imagePullPolicy: IfNotPresent
replsets:
- name: rs0
size: 3
resources:
requests:
cpu: "2"
memory: 8Gi
limits:
cpu: "4"
memory: 16Gi
volumeSpec:
persistentVolumeClaim:
storageClassName: gp3-csi
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 100Gi
nonvoting:
enabled: false
arbiter:
enabled: false
configuration: |
storage:
wiredTiger:
engineConfig:
cacheSizeGB: 4
journalCompressor: snappy
collectionConfig:
blockCompressor: snappy
operationProfiling:
mode: slowOp
slowOpThresholdMs: 100
replication:
oplogSizeMB: 51200
affinity:
antiAffinityTopologyKey: topology.kubernetes.io/zone
sharding:
enabled: false
mongos: {}
configsrv: {}
backup:
enabled: true
image: percona/percona-backup-mongodb:2.5.0
storages:
s3-backup:
type: s3
s3:
bucket: company-mongodb-backups
region: us-east-1
credentialsSecret: aws-s3-credentials
prefix: production
insecureSkipTLSVerify: false
pitr:
enabled: true
oplogOnly: false
compressionType: gzip
tasks:
- name: daily-full
enabled: true
schedule: "0 3 * * *"
keep: 7
storageName: s3-backup
compressionType: gzip
secrets:
users: mongodb-users-secret
pmm:
enabled: true
image: percona/pmm-client:2
serverHost: pmm-server.monitoring
The Percona Operator's key advantage is its integrated backup management with Percona Backup for MongoDB (PBM). PBM supports logical and physical backups, incremental backups, and point-in-time recovery from the oplog — all configured declaratively through the CRD.
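Beyond the scheduled tasks in the cluster CRD, on-demand backups and restores are themselves custom resources (a sketch; the cluster and storage names reference the spec above, and the backup/restore names are illustrative):

```yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  name: manual-backup-20260412
  namespace: databases
spec:
  clusterName: production-psmdb
  storageName: s3-backup
---
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBRestore
metadata:
  name: restore-to-0930
  namespace: databases
spec:
  clusterName: production-psmdb
  backupName: manual-backup-20260412
  pitr:
    type: date
    date: "2026-04-12 09:30:00"
```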
AWS Deployment: DocumentDB vs Atlas vs Self-Managed on EKS
Amazon DocumentDB is a MongoDB-compatible document database service. It is not MongoDB — it is a proprietary engine that implements the MongoDB wire protocol (with 3.6, 4.0, and 5.0 API compatibility levels). DocumentDB separates compute from storage using a distributed storage layer similar to Aurora. It provides automatic failover within a region, up to 15 read replicas, and point-in-time recovery. However, it lacks many MongoDB features: change streams have limitations, transactions work differently, and many aggregation pipeline stages are unsupported. Use DocumentDB only if your application uses a subset of MongoDB's API and you value the operational simplicity of a fully managed service.
MongoDB Atlas on AWS is MongoDB's own managed service running on AWS infrastructure. It provides genuine MongoDB with all features, automated HA, continuous backups, point-in-time recovery, auto-scaling, and multi-region clusters. Atlas is the easiest path to production MongoDB but the most expensive option at scale.
Self-Managed on EKS gives you full control over MongoDB version, configuration, and cost. Use the MongoDB Community Operator or Percona Operator with EBS gp3 storage and IAM Roles for Service Accounts (IRSA) for secure S3 backup access.
# EBS StorageClass optimised for MongoDB
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: gp3-csi
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "6000"
throughput: "250"
encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
# IRSA for backup S3 access
eksctl create iamserviceaccount \
--name mongodb-backup-sa \
--namespace databases \
--cluster my-eks-cluster \
--attach-policy-arn arn:aws:iam::111122223333:policy/MongoDBBackupS3Policy \
--approve
Azure Deployment: Cosmos DB vs Atlas vs Self-Managed on AKS
Azure Cosmos DB for MongoDB (vCore) is the closest Azure native offering to real MongoDB. Unlike the older RU-based Cosmos DB API for MongoDB, the vCore model runs actual MongoDB engine instances on dedicated compute, providing high compatibility with MongoDB 6.0+ features including full aggregation pipeline, change streams, and transactions. It offers zone-redundant HA, point-in-time recovery, and automatic backups.
MongoDB Atlas on Azure provides the same fully managed MongoDB experience as on AWS, running on Azure infrastructure with VNET peering, Azure Private Link, and Azure AD integration.
Self-Managed on AKS uses Azure Managed Disks (Premium SSD v2 recommended) with the MongoDB or Percona operator.
# Azure Premium SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: azure-premium-mongodb
provisioner: disk.csi.azure.com
parameters:
skuName: Premium_LRS
cachingMode: None
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
GCP Deployment: Atlas on GCP vs Self-Managed on GKE
MongoDB Atlas on GCP runs on Google Cloud infrastructure with GCP-native integrations: VPC peering, Private Service Connect, and GKE cluster integration. Atlas on GCP supports multi-region clusters spanning GCP regions with automatic failover.
Self-Managed on GKE uses Persistent Disk SSD with the MongoDB or Percona operator. GKE Workload Identity provides secure, keyless authentication for backup to GCS.
# GKE SSD StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: pd-ssd-mongodb
provisioner: pd.csi.storage.gke.io
parameters:
type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
Bare Metal k3s/Rancher with Longhorn
For data sovereignty, compliance, or cost optimisation, MongoDB runs effectively on bare metal Kubernetes using k3s with Rancher management and Longhorn distributed storage. This architecture eliminates cloud provider dependencies while maintaining the same operator-based management model.
Longhorn Storage for MongoDB
# Install Longhorn
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--set defaultSettings.defaultReplicaCount=3 \
--set defaultSettings.storageMinimalAvailablePercentage=15
# StorageClass for MongoDB on Longhorn
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-mongo
provisioner: driver.longhorn.io
parameters:
numberOfReplicas: "3"
staleReplicaTimeout: "2880"
dataLocality: best-effort
fsType: xfs
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain
XFS is recommended over ext4 for MongoDB on Linux. MongoDB's WiredTiger storage engine benefits from XFS's allocation patterns, particularly for the journal and data files. Longhorn's three-way replication provides volume-level redundancy on top of MongoDB's replica set-level redundancy, giving you defence in depth against storage failures.
MetalLB and Headless Services
# MetalLB IP Pool for MongoDB
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
name: mongo-pool
namespace: metallb-system
spec:
addresses:
- 192.168.1.220-192.168.1.225
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: mongo-l2
namespace: metallb-system
spec:
ipAddressPools:
- mongo-pool
The MongoDB operator creates a headless Service that gives each pod a stable DNS name (mongo-0.mongo-svc.databases.svc.cluster.local). This is essential for replica set member discovery. If you need external access, MetalLB assigns a routable IP to a LoadBalancer Service in front of the mongos routers (for sharded clusters) or the primary (for replica sets).
Backup Strategies
MongoDB offers multiple backup approaches, each suited to different scenarios.
mongodump / mongorestore
Logical backups that export BSON documents. Portable and human-inspectable, but slow for large datasets and don't support point-in-time recovery on their own.
# Full logical backup with oplog for consistency
# (--oplog requires a full-instance dump, so do not name a single database in the URI)
mongodump --uri="mongodb://backup-user:password@mongo-0:27017,mongo-1:27017,mongo-2:27017/?replicaSet=rs-production&authSource=admin" \
--oplog \
--gzip \
--out=/backups/$(date +%Y%m%d-%H%M%S)
# Restore
mongorestore --uri="mongodb://admin:password@mongo-0:27017/?replicaSet=rs-production&authSource=admin" \
--oplogReplay \
--gzip \
/backups/20260412-030000/
Percona Backup for MongoDB (PBM)
PBM provides physical backups, incremental backups, and point-in-time recovery. It is the recommended backup tool for self-managed MongoDB in production.
# Configure PBM storage
pbm config --set storage.type=s3 \
--set storage.s3.bucket=company-mongodb-backups \
--set storage.s3.region=us-east-1 \
--set storage.s3.credentials.access-key-id=$AWS_ACCESS_KEY \
--set storage.s3.credentials.secret-access-key=$AWS_SECRET_KEY
# Full backup
pbm backup --type=logical --compression=gzip
# Physical backup (faster, requires WiredTiger)
pbm backup --type=physical --compression=gzip
# Incremental backups (take a base increment first, then subsequent increments)
pbm backup --type=incremental --base
pbm backup --type=incremental
# Point-in-time recovery
pbm restore --time="2026-04-12T09:30:00Z"
# List backups
pbm list
# Check backup status
pbm status
Cloud Snapshots
On cloud providers, EBS snapshots (AWS), managed disk snapshots (Azure), and persistent disk snapshots (GCP) provide fast, storage-level backups. Combined with db.fsyncLock() for point-in-time consistency, they offer the fastest backup and restore times for large datasets.
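The db.fsyncLock() step can be scripted around the snapshot call so writes are flushed and blocked only for the snapshot window (a sketch; run it against a secondary, and the volume ID and credentials are placeholders):

```shell
# Flush pending writes and block new ones on the member being snapshotted
mongosh "mongodb://admin:password@mongo-2:27017/admin" --eval "db.fsyncLock()"
# Take the storage-level snapshot while the member is locked
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "mongodb-data-snapshot"
# Unlock as soon as the snapshot has been initiated
mongosh "mongodb://admin:password@mongo-2:27017/admin" --eval "db.fsyncUnlock()"
```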
# Kubernetes VolumeSnapshot for MongoDB
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: mongodb-snapshot-20260412   # generate a dated name in your CI/cron tooling; shell substitution does not run inside a manifest
namespace: databases
spec:
volumeSnapshotClassName: csi-snapclass
source:
persistentVolumeClaimName: data-volume-production-mongodb-0
Oplog-Based Point-in-Time Recovery
The oplog enables point-in-time recovery (PITR) when combined with a base backup. PBM continuously archives oplog entries to the backup storage. To restore to a specific point in time, PBM restores the most recent base backup and then replays oplog entries up to the target timestamp. This is configured in the Percona Operator CRD via the pitr section.
Connection String Configuration for HA
Proper connection string configuration is critical for application resilience during failover events.
# Standard connection string with all replica set members
mongodb://appuser:password@mongo-0.mongo-svc:27017,mongo-1.mongo-svc:27017,mongo-2.mongo-svc:27017/appdb?replicaSet=rs-production&w=majority&j=true&readPreference=secondaryPreferred&retryWrites=true&retryReads=true&connectTimeoutMS=10000&socketTimeoutMS=30000&serverSelectionTimeoutMS=15000&maxPoolSize=100&minPoolSize=10
# SRV-based connection string (DNS discovery)
mongodb+srv://appuser:password@mongo-cluster.databases.svc.cluster.local/appdb?w=majority&retryWrites=true&readPreference=nearest
# For sharded clusters (connect through mongos)
mongodb://appuser:password@mongos-0:27017,mongos-1:27017/appdb?w=majority&retryWrites=true&readPreference=nearest
Key connection string parameters for HA: retryWrites=true and retryReads=true enable automatic retry of operations that fail during failover. w=majority ensures writes survive primary elections. serverSelectionTimeoutMS controls how long the driver waits to find a suitable server — set it higher than the expected election time (at least 15 seconds). maxPoolSize limits the connection pool per mongos/replica set member to prevent connection exhaustion.
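These parameters compose mechanically, so a small helper (illustrative, stdlib only; the defaults mirror this section) can keep them consistent across services:

```python
from urllib.parse import urlencode

# HA-oriented defaults from this section; hosts, db, and replica set are caller-supplied
HA_DEFAULTS = {
    "w": "majority",
    "retryWrites": "true",
    "retryReads": "true",
    "readPreference": "secondaryPreferred",
    "serverSelectionTimeoutMS": "15000",  # longer than the expected election time
    "maxPoolSize": "100",
}

def build_uri(hosts, db, replica_set, **overrides):
    """Assemble a MongoDB connection string with HA defaults applied."""
    params = {**HA_DEFAULTS, "replicaSet": replica_set, **overrides}
    return f"mongodb://{','.join(hosts)}/{db}?{urlencode(params)}"

uri = build_uri(["mongo-0:27017", "mongo-1:27017"], "appdb", "rs-production")
print(uri)
```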
Index Optimization and Query Performance
Indexes are the primary lever for MongoDB query performance. A missing index on a frequently queried field forces a collection scan that degrades linearly with data size.
# Create compound index for common query pattern
# (the legacy background: true option is deprecated and ignored since MongoDB 4.2,
# which replaced blocking foreground builds with an optimised build process)
db.orders.createIndex(
{ customer_id: 1, order_date: -1, status: 1 },
{ name: "idx_customer_orders" }
)
# Partial index (only index documents matching a filter)
db.events.createIndex(
{ timestamp: 1 },
{ name: "idx_active_events", partialFilterExpression: { status: "active" } }
)
# TTL index for automatic document expiration
db.sessions.createIndex(
{ createdAt: 1 },
{ expireAfterSeconds: 86400, name: "idx_session_ttl" }
)
# Text index for search
db.products.createIndex(
{ name: "text", description: "text" },
{ weights: { name: 10, description: 5 }, name: "idx_product_search" }
)
# Analyze query performance
db.orders.find({ customer_id: "c123" }).sort({ order_date: -1 }).explain("executionStats")
# Find unused indexes
db.orders.aggregate([ { $indexStats: {} } ])
Review $indexStats regularly to identify unused indexes that waste storage and slow writes. Use the explain() method to verify that queries use the expected index and check for high totalDocsExamined relative to nReturned, which indicates an inefficient query plan.
WiredTiger Storage Engine Tuning
WiredTiger has been MongoDB's default storage engine since 3.2 and its only production storage engine since 4.2, when MMAPv1 was removed. Its performance characteristics are heavily influenced by cache sizing, compression, and journal configuration.
# WiredTiger configuration in mongod.conf
storage:
dbPath: /data/db
# journaling is always enabled in MongoDB 6.1+; the storage.journal.enabled
# and storage.journal.commitIntervalMs options have been removed
wiredTiger:
engineConfig:
cacheSizeGB: 4 # default is the larger of 50% of (RAM - 1 GB) or 256 MB
journalCompressor: snappy
directoryForIndexes: true # separate dir for index files
collectionConfig:
blockCompressor: snappy # or zstd for better ratio
indexConfig:
prefixCompression: true
operationProfiling:
mode: slowOp
slowOpThresholdMs: 100
replication:
oplogSizeMB: 51200 # 50 GB oplog
replSetName: rs-production
net:
maxIncomingConnections: 10000
compression:
compressors: snappy,zstd,zlib
setParameter:
wiredTigerConcurrentReadTransactions: 128
wiredTigerConcurrentWriteTransactions: 128
The WiredTiger cache should be sized to hold your working set — the data and indexes actively accessed by your queries. If the cache is too small, WiredTiger evicts pages frequently, causing high I/O. If it is too large, it leaves insufficient memory for the OS filesystem cache and other processes. A starting point is 50% of available RAM minus 1 GB (for the OS and other processes), capped at the working set size.
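The default formula is a useful baseline before tuning; a sketch of that arithmetic:

```python
def default_wt_cache_gb(ram_gb: float) -> float:
    """WiredTiger default cache: the larger of 50% of (RAM - 1 GB) or 256 MB (0.25 GB)."""
    return max(0.5 * (ram_gb - 1.0), 0.25)

print(default_wt_cache_gb(16))  # 7.5 GB on a 16 GB node
print(default_wt_cache_gb(1))   # 0.25 GB floor on tiny nodes
```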
Authentication and TLS Encryption
# Generate TLS certificates for MongoDB
# CA certificate
openssl req -x509 -newkey rsa:4096 -days 3650 -nodes \
-keyout ca.key -out ca.crt \
-subj "/CN=MongoDB-CA"
# Server certificate (include all member hostnames in SAN)
openssl req -newkey rsa:4096 -nodes \
-keyout server.key -out server.csr \
-subj "/CN=mongo-0.mongo-svc.databases.svc.cluster.local" \
-addext "subjectAltName=DNS:mongo-0.mongo-svc.databases.svc.cluster.local,DNS:mongo-1.mongo-svc.databases.svc.cluster.local,DNS:mongo-2.mongo-svc.databases.svc.cluster.local,DNS:localhost,IP:127.0.0.1"
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 365 -copy_extensions copy
# (-copy_extensions needs OpenSSL 3.0+; on older releases pass the SANs again
# via -extfile, otherwise the signed certificate silently drops them)
# Combine cert and key into PEM
cat server.crt server.key > server.pem
# Create Kubernetes secrets
kubectl create secret tls mongodb-tls-cert \
--cert=server.crt --key=server.key -n databases
kubectl create secret generic mongodb-ca-cert \
--from-file=ca.crt=ca.crt -n databases
# MongoDB TLS configuration (mongod.conf)
net:
tls:
mode: requireTLS
certificateKeyFile: /etc/mongodb/tls/server.pem
CAFile: /etc/mongodb/tls/ca.crt
allowConnectionsWithoutCertificates: false
security:
authorization: enabled
clusterAuthMode: x509
Monitoring with Prometheus and Grafana
MongoDB exposes metrics through the mongodb_exporter (from Percona) that integrates with Prometheus. Key metrics to monitor include connection counts, operation rates, replication lag, WiredTiger cache usage, and query targeting efficiency.
# Deploy mongodb_exporter as a sidecar or standalone
apiVersion: apps/v1
kind: Deployment
metadata:
name: mongodb-exporter
namespace: databases
spec:
replicas: 1
selector:
matchLabels:
app: mongodb-exporter
template:
metadata:
labels:
app: mongodb-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9216"
spec:
containers:
- name: exporter
image: percona/mongodb_exporter:0.40.0
args:
- --mongodb.uri=mongodb://monitoring:password@production-mongodb-0.production-mongodb-svc:27017,production-mongodb-1.production-mongodb-svc:27017,production-mongodb-2.production-mongodb-svc:27017/?replicaSet=production-mongodb&authSource=admin
- --collect-all
- --compatible-mode
ports:
- containerPort: 9216
resources:
requests:
cpu: 100m
memory: 128Mi
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: mongodb-metrics
namespace: databases
labels:
release: kube-prometheus-stack
spec:
selector:
matchLabels:
app: mongodb-exporter
endpoints:
- port: metrics   # requires a Service exposing 9216 under this port name with labels matching the selector above
interval: 15s
scrapeTimeout: 10s
# Critical Prometheus alert rules for MongoDB
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mongodb-alerts
namespace: databases
labels:
release: kube-prometheus-stack
spec:
groups:
- name: mongodb-health
rules:
- alert: MongoDBReplicationLagHigh
expr: mongodb_mongod_replset_member_replication_lag > 30
for: 5m
labels:
severity: warning
annotations:
summary: "MongoDB replica {{ $labels.name }} lag exceeds 30s"
- alert: MongoDBConnectionsHigh
expr: mongodb_connections{state="current"} / mongodb_connections{state="available"} > 0.8
for: 2m
labels:
severity: critical
annotations:
summary: "MongoDB connections above 80% capacity"
- alert: MongoDBWiredTigerCacheEvictions
expr: rate(mongodb_wiredtiger_cache_evicted_pages_total[5m]) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "High WiredTiger cache eviction rate"
- alert: MongoDBReplicaSetNoPrimary
expr: mongodb_mongod_replset_number_of_members{state="PRIMARY"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "MongoDB replica set has no primary"
- alert: MongoDBQueryTargetingInefficient
expr: rate(mongodb_mongod_metrics_query_executor_total{state="scanned_objects"}[5m]) / rate(mongodb_mongod_metrics_query_executor_total{state="returned"}[5m]) > 100
for: 15m
labels:
severity: warning
annotations:
summary: "MongoDB scanning 100x more documents than returned"
Change Streams for Real-Time Applications
Change streams provide a real-time notification mechanism for data changes in MongoDB. They leverage the oplog to push change events to applications, enabling event-driven architectures, real-time dashboards, and data synchronisation pipelines without polling.
// Watch changes on a collection
const pipeline = [
{ $match: { operationType: { $in: ["insert", "update", "replace"] } } },
{ $match: { "fullDocument.status": "active" } }
];
const changeStream = db.collection("orders").watch(pipeline, {
fullDocument: "updateLookup", // include full document on updates
resumeAfter: resumeToken // resume from last processed event
});
changeStream.on("change", (change) => {
console.log("Change detected:", change.operationType);
console.log("Document:", change.fullDocument);
// Store resume token for crash recovery
saveResumeToken(change._id);
});
changeStream.on("error", (error) => {
console.error("Change stream error:", error);
// Reconnect using saved resume token
});
Change streams require a replica set or sharded cluster (they do not work on standalone mongod instances). They survive primary elections — the driver automatically reconnects and resumes from the last received resume token. For production use, always persist the resume token so your application can recover from restarts without missing events.
Rolling Maintenance and Version Upgrades
MongoDB supports rolling upgrades where you upgrade one replica set member at a time, starting with secondaries and finishing with the primary (which triggers a step-down and election). This allows zero-downtime upgrades for minor and major version changes.
# Rolling upgrade procedure for self-managed replica set
# 1. Upgrade each secondary one at a time
# On secondary (mongo-2):
sudo systemctl stop mongod
sudo yum install -y mongodb-org-7.0.14 # or apt-get
sudo systemctl start mongod
# Wait for the member to reach SECONDARY state
rs.status()
# 2. Step down the primary
rs.stepDown()
# 3. Upgrade the old primary (now a secondary)
sudo systemctl stop mongod
sudo yum install -y mongodb-org-7.0.14
sudo systemctl start mongod
# 4. Verify cluster health
rs.status()
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
# 5. After a soak period, raise the feature compatibility version
#    (treat this as one-way: downgrading FCV is difficult once new on-disk features are in use)
db.adminCommand({ setFeatureCompatibilityVersion: "7.0", confirm: true })
With the MongoDB Community Operator or Percona Operator on Kubernetes, rolling upgrades are automated — you simply update the version field in the CRD and the operator handles the rolling update of each pod, waiting for each member to become healthy before proceeding.
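With the Community Operator, the whole procedure above collapses to editing one field. A sketch of the relevant CRD fragment, assuming the `production-mongodb` resource used in the kubectl examples later in this guide:

```yaml
apiVersion: mongodbcommunity.mongodb.com/v1
kind: MongoDBCommunity
metadata:
  name: production-mongodb        # assumed resource name from this guide's examples
  namespace: databases
spec:
  members: 3
  type: ReplicaSet
  version: "7.0.14"               # bump this; the operator rolls each pod in turn
  featureCompatibilityVersion: "7.0"  # hold FCV here until you are ready to commit
```

Apply with `kubectl apply`, then watch pod restarts and replica set state until all members report healthy.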
Disaster Recovery and Failover Testing
A high availability deployment that has never been tested under failure is an assumption, not a guarantee. Regular failover testing validates your architecture, your monitoring alerts, and your team's incident response procedures.
Controlled Failover Tests
# Test 1: Step down the primary
rs.stepDown(120) // old primary is ineligible for re-election for 120 seconds
// Monitor: election should complete in ~10-12s
// Verify: application reconnects and resumes operations
# Test 2: Kill a secondary pod in Kubernetes
kubectl delete pod production-mongodb-1 -n databases --grace-period=0 --force
// The StatefulSet recreates the pod
// The operator rejoins it to the replica set
# Test 3: Simulate network partition
kubectl exec production-mongodb-0 -n databases -- \
iptables -A INPUT -p tcp --dport 27017 -j DROP
// The isolated primary should step down (cannot reach majority)
// Remaining members elect a new primary
// Cleanup: remove iptables rule and let member rejoin
# Test 4: Simulate storage failure
kubectl exec production-mongodb-2 -n databases -- \
chmod 000 /data/db
// mongod should crash; Kubernetes restarts the pod
// The member resyncs from the primary's oplog
What to Measure During Failover
- RTO (Recovery Time Objective) — Time from primary failure to new primary accepting writes. Target: under 30 seconds for replica sets.
- RPO (Recovery Point Objective) — Data loss during failover. With w: majority, RPO is zero for acknowledged writes. With w: 1, RPO equals the replication lag at the time of failure.
- Application error rate — Percentage of requests that fail during the failover window. With retryWrites=true, most write failures are automatically retried by the driver.
- Change stream continuity — Verify that change stream consumers resume from their saved token without missing events.
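One practical way to capture RTO and error rate during a drill is a once-per-second write probe whose results you analyse afterwards. A sketch, assuming a probe log of `{ ts, ok }` entries (the shape is an assumption for this example):

```javascript
// Derive failover metrics from a drill's probe log.
// probes: array of { ts: epochMillis, ok: boolean }, one per write attempt.
function failoverMetrics(probes) {
  let firstFail = null;
  let lastFail = null;
  let failures = 0;
  for (const p of probes) {
    if (!p.ok) {
      failures += 1;
      if (firstFail === null) firstFail = p.ts;
      lastFail = p.ts;
    }
  }
  return {
    // Observed RTO: width of the window in which writes were failing (ms)
    rtoMillis: firstFail === null ? 0 : lastFail - firstFail,
    // Fraction of probes that failed over the whole drill
    errorRate: probes.length === 0 ? 0 : failures / probes.length,
  };
}
```

Compare `rtoMillis` against your 30-second target; with `retryWrites=true` the application-visible error rate should be well below the raw probe failure rate.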
Disaster Recovery Runbook
- Single member failure — Automatic recovery via Kubernetes pod restart and oplog catch-up. No manual action required unless the member needs a full resync.
- Primary failure — Automatic election promotes a secondary within 10-12 seconds. Verify application connectivity and replication lag on remaining secondaries.
- Majority failure — If a majority of members are down, the replica set becomes read-only (no elections possible). Restore members or use rs.reconfig({ force: true }) as a last resort (this can cause data loss).
- Complete cluster loss — Deploy a new cluster, restore from the latest PBM backup, and replay the oplog to the target point in time. Update connection strings and DNS records.
- Regional failover — If the primary region is lost, a secondary in another region is automatically elected primary (if it has sufficient priority and the remaining members form a majority). Update DNS to route traffic to the new primary region.
Production Tuning Recommendations
OS-Level Tuning
# Disable Transparent Huge Pages (critical for MongoDB; persist via a systemd unit or tuned profile)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Set readahead to 8-32 sectors for SSD (adjust the device name to your data volume)
blockdev --setra 32 /dev/sda

# Increase file descriptor and process limits (persist in /etc/security/limits.conf or the systemd unit)
ulimit -n 64000
ulimit -u 64000

# Kernel parameters — add to /etc/sysctl.d/99-mongodb.conf, then apply with sysctl --system
vm.swappiness = 1                  # keep swapping minimal without disabling it entirely
vm.dirty_ratio = 15                # cap dirty pages before synchronous writeback kicks in
vm.dirty_background_ratio = 5      # start background writeback early
net.core.somaxconn = 4096          # listen backlog for connection bursts
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_keepalive_time = 120  # detect dead peers faster than the 2-hour default
Resource Sizing Guidelines
- WiredTiger cache: 50% of (available RAM minus 1 GB) — the server default — or the size of your working set, whichever is smaller.
- Oplog size: Enough to hold 24-72 hours of write operations. Start with 50 GB and monitor with rs.printReplicationInfo().
- Storage IOPS: MongoDB is I/O-intensive, especially during compaction and checkpointing. Use NVMe SSD or cloud storage with provisioned IOPS (gp3 with 6000+ IOPS on AWS, Premium SSD v2 on Azure, pd-ssd on GCP).
- CPU: MongoDB benefits from multiple cores for concurrent read/write operations, WiredTiger's concurrent transactions, and background tasks (checkpointing, compaction, replication).
- Network: Replication traffic can be significant for write-heavy workloads. Ensure low-latency, high-bandwidth networking between replica set members.
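The cache sizing rule above can be expressed as a small helper — a sketch for capacity planning, where the function name and GB units are choices for this example and 0.25 GB reflects the server's 256 MB floor:

```javascript
// Compute a WiredTiger cache size (GB) from total RAM and working set.
// Mirrors the server default of max(50% of (RAM - 1 GB), 256 MB), capped
// at the working set when that is smaller.
function wiredTigerCacheGB(totalRamGB, workingSetGB) {
  const serverDefault = Math.max(0.5 * (totalRamGB - 1), 0.25);
  // Take the smaller of the default and the working set,
  // but never drop below the 256 MB floor.
  return Math.max(Math.min(serverDefault, workingSetGB), 0.25);
}
```

On a 16 GB node the default works out to 7.5 GB; if your working set is only 4 GB, setting `storage.wiredTiger.engineConfig.cacheSizeGB: 4` leaves the rest of RAM for the filesystem cache.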
Connection Management
// Application-side connection pool configuration (Node.js driver)
const { MongoClient } = require("mongodb");
const client = new MongoClient(uri, {
maxPoolSize: 100,
minPoolSize: 10,
maxIdleTimeMS: 60000,
waitQueueTimeoutMS: 5000,
connectTimeoutMS: 10000,
socketTimeoutMS: 30000,
serverSelectionTimeoutMS: 15000,
retryWrites: true,
retryReads: true,
w: "majority",
readPreference: "secondaryPreferred",
compressors: ["snappy", "zstd"]
});
Monitoring Checklist
- Replication lag — Alert when any secondary exceeds 30 seconds of lag.
- Connection saturation — Alert when current connections exceed 80% of maxIncomingConnections.
- WiredTiger cache — Alert when the cache dirty fill ratio exceeds 20% (indicates write pressure exceeding checkpoint throughput).
- Oplog window — Alert when the oplog window drops below 12 hours (risk of secondaries needing full resync after maintenance).
- Query targeting — Alert when the ratio of scanned documents to returned documents exceeds 100 (missing index).
- Disk usage — Alert at 70% and 85% thresholds. MongoDB can use significant temporary disk space during compaction.
- Backup freshness — Alert when the last successful backup is older than your RPO window.
- Ticket availability — Monitor WiredTiger read and write tickets. Exhaustion causes operation queuing and latency spikes.
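The oplog-window check in the list above is simple enough to codify. A sketch, assuming you feed it the first- and last-entry timestamps (epoch seconds) that rs.printReplicationInfo() reports:

```javascript
// Oplog window: how many hours of writes the oplog currently spans.
function oplogWindowHours(firstEntryTs, lastEntryTs) {
  // Timestamps are epoch seconds from the oldest and newest oplog entries.
  return (lastEntryTs - firstEntryTs) / 3600;
}

// True when the window has shrunk below the alert threshold (12h by
// default, matching the checklist above) and secondaries risk needing
// a full resync after extended maintenance.
function oplogWindowAlert(firstEntryTs, lastEntryTs, thresholdHours = 12) {
  return oplogWindowHours(firstEntryTs, lastEntryTs) < thresholdHours;
}
```

A shrinking window under a steady workload means write volume has grown relative to the oplog size — the signal to resize before maintenance, not after.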
Operational Commands Quick Reference
# Replica set status
rs.status()
rs.conf()
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
# Cluster health (sharded)
sh.status()
db.adminCommand({ balancerStatus: 1 })
db.adminCommand({ listShards: 1 })
# Server diagnostics
db.serverStatus()
db.currentOp({ "$all": true })
db.adminCommand({ hostInfo: 1 })
# Kill long-running operations
db.killOp(opId)
# Compaction (reclaim disk space)
db.runCommand({ compact: "orders" })
# Profiler (identify slow queries)
db.setProfilingLevel(1, { slowms: 100 })
db.system.profile.find().sort({ ts: -1 }).limit(10)
# Index management
db.orders.getIndexes()
db.orders.createIndex({ field: 1 })  # index builds are non-blocking since 4.2; the background option is obsolete
db.orders.dropIndex("index_name")
# Kubernetes-specific
kubectl get mongodbcommunity -n databases
kubectl describe mongodbcommunity production-mongodb -n databases
kubectl logs production-mongodb-0 -n databases -c mongod
kubectl exec -it production-mongodb-0 -n databases -c mongod -- mongosh
Decision Matrix: Choosing Your MongoDB HA Architecture
- Single cloud, managed preference — Use MongoDB Atlas. It handles HA, backups, monitoring, and scaling with minimal operational overhead.
- AWS with MongoDB-compatible needs only — Consider Amazon DocumentDB if your application uses a basic subset of the MongoDB API and you want a fully managed experience. Otherwise, Atlas or self-managed on EKS.
- Azure with full MongoDB features — Use Cosmos DB for MongoDB vCore for a managed experience, or Atlas on Azure for the genuine MongoDB managed service.
- Multi-cloud or hybrid — Self-managed with the MongoDB Community Operator or Percona Operator. The Kubernetes abstraction enables consistent deployment across providers.
- Bare metal or edge — k3s + Rancher + Longhorn + MongoDB Community Operator or Percona Operator. No cloud dependencies required.
- Large-scale horizontal scaling — Sharded cluster with zone sharding for data locality. Use the Percona Operator, which supports sharded deployments natively.
- Real-time event-driven applications — MongoDB change streams provide built-in CDC. Ensure you use a replica set or sharded cluster (not standalone).
Conclusion
MongoDB's built-in replication and sharding primitives give it an architectural advantage for high availability — replica sets with automatic failover and sharded clusters with horizontal scaling are native capabilities, not bolted-on afterthoughts. But these primitives must be configured correctly and operated with discipline to deliver the availability guarantees that production systems demand.
A production MongoDB deployment requires three data-bearing replica set members across failure domains, w: majority write concern for durability, proper oplog sizing for replication resilience, WiredTiger cache tuning for your working set, and a tested backup strategy with point-in-time recovery capability. Kubernetes operators — whether the MongoDB Community Operator for replica sets or the Percona Operator for full-featured deployments including sharding and integrated backups — automate the lifecycle management that would otherwise require significant operational investment.
The deployment patterns across AWS, Azure, GCP, and bare metal share the same core MongoDB configuration. What changes is the storage class, the backup destination, and the networking layer. This consistency is the value of an operator-based approach: your team learns one tool, one operational model, and one set of runbooks that work everywhere.
Start with a three-member replica set, w: majority writes, a daily PBM backup with continuous oplog archiving, and the core Prometheus alerts for replication lag, connection saturation, and cache pressure. Test your failover on day one — not during your first incident. Expand to sharding, zone-based data locality, and multi-region deployments as your data volume and availability requirements grow. The infrastructure handles the mechanics; your responsibility is to understand the architecture deeply enough to make the right trade-offs for your workload and to test those assumptions relentlessly.