Couchbase High Availability in Production: XDCR, Kubernetes Operator, and Multi-Region Deployment
Introduction: Why Couchbase for High Availability Production Workloads
Couchbase Server is a distributed, multi-model NoSQL database engineered for interactive applications that demand consistent low-latency performance at any scale. Unlike databases that bolt on distributed features as an afterthought, Couchbase was architected from its inception around a shared-nothing, peer-to-peer topology where every node is equal and data is automatically sharded across the cluster using a deterministic hashing mechanism called vBuckets. This architectural choice eliminates single points of failure at the data layer and enables horizontal scaling without application changes.
What sets Couchbase apart in the high availability landscape is its Cross Data Center Replication (XDCR) — a built-in, asynchronous replication engine that continuously streams mutations between geographically distributed clusters. Combined with automatic failover, rack/zone awareness, and a rich set of integrated services (Data, Index, Query, Search, Analytics, and Eventing), Couchbase provides a unified platform that can serve as both the operational database and the analytical engine for modern applications.
In this comprehensive guide, we will explore every aspect of running Couchbase Server in high availability production environments: the internal architecture that makes HA possible, XDCR configuration for multi-region deployments, the Couchbase Autonomous Operator for Kubernetes, cloud-specific deployment patterns for AWS EKS, Azure AKS, and GCP GKE, bare metal k3s deployments with Rancher and Longhorn, backup and restore strategies, N1QL query tuning, security hardening, monitoring, and Couchbase Mobile with Sync Gateway for edge deployments. By the end, you will have actionable knowledge to deploy and operate production-grade Couchbase clusters on any infrastructure.
Couchbase Server Architecture: Services, vBuckets, and Automatic Sharding
Understanding Couchbase's internal architecture is critical before deploying for high availability. Couchbase uses a Multi-Dimensional Scaling (MDS) architecture where different services can be independently deployed and scaled across cluster nodes. This gives operators fine-grained control over resource allocation and performance isolation.
The Six Core Services
Couchbase Server provides six integrated services, each handling a distinct workload type:
- Data Service (KV) — The core key-value engine built on a memory-first architecture. It handles CRUD operations, manages vBucket distribution, and serves as the persistence layer. Data is stored in memory (managed cache) and asynchronously persisted to disk. This service must run on at least one node in every cluster.
- Index Service (GSI) — Maintains Global Secondary Indexes that support N1QL queries. Indexes are stored separately from data, allowing independent scaling. Supports standard and memory-optimized index storage modes.
- Query Service (N1QL) — Executes N1QL (SQL++ for JSON) queries against the cluster. Stateless by design, making it easy to scale horizontally. Coordinates with the Data and Index services to plan and execute queries.
- Search Service (FTS) — Provides full-text search capabilities powered by the Bleve search engine. Supports fuzzy matching, geospatial queries, faceted search, and custom analyzers. Indexes are partitioned and replicated across Search nodes.
- Analytics Service (CBAS) — Runs complex analytical queries using a parallel processing engine based on Apache AsterixDB. Operates on its own shadow copy of the data, ensuring that analytical workloads never impact operational latency.
- Eventing Service — Executes server-side JavaScript functions in response to data mutations. Enables real-time data enrichment, transformations, cascade deletes, and integration triggers without external infrastructure.
vBucket Distribution and Automatic Sharding
Couchbase distributes data across the cluster using 1024 vBuckets (virtual buckets). Every document is mapped to a vBucket using a CRC32 hash of the document key modulo 1024. The cluster map — maintained by every node and cached by every SDK client — maps each vBucket to a specific node. This deterministic mapping means clients always know exactly which node holds any given document, enabling single-hop reads and writes with sub-millisecond latency.
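As a toy illustration of this mapping (not the SDKs' exact bit-level implementation, which may shift and mask the CRC32 before the modulo), the routing logic can be sketched in Python; the cluster map layout here is invented for the example:

```python
import zlib

NUM_VBUCKETS = 1024

def vbucket_for_key(key: str) -> int:
    # Deterministic key -> vBucket mapping described above (illustrative;
    # real SDKs apply the same CRC32 idea with library-specific bit handling)
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def node_for_key(key: str, cluster_map: list) -> str:
    # Single-hop routing: the cached cluster map names the owning node
    return cluster_map[vbucket_for_key(key)]

# Invented cluster map: vBucket id -> node, round-robin over 4 nodes
cluster_map = [f"node-{vb % 4}" for vb in range(NUM_VBUCKETS)]
print(node_for_key("user::1001", cluster_map))
```

Because the hash is deterministic, every client computes the same vBucket for the same key and never needs a proxy tier in front of the cluster.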
When nodes are added or removed, Couchbase redistributes vBuckets automatically through a process called rebalance. During rebalance, the cluster moves vBuckets between nodes while remaining fully operational. The rebalance is carefully orchestrated to maintain the configured number of replicas at all times, and clients are seamlessly redirected to the new vBucket locations through cluster map updates.
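To see why rebalance only redirects a subset of clients, consider a hypothetical before/after cluster map when a fourth node joins. The redistribution rule below is invented for illustration, but it shows the key property: only vBuckets reassigned to the new node change owners, so most traffic is undisturbed:

```python
NUM_VBUCKETS = 1024

# Before rebalance: three nodes own the 1024 vBuckets round-robin
cluster_map = {vb: f"node-{vb % 3}" for vb in range(NUM_VBUCKETS)}

def rebalance_in(old_map: dict, new_node: str, num_nodes: int) -> dict:
    # Hypothetical redistribution when a node joins: vBuckets whose id
    # falls in the new node's slot move; everything else keeps its owner
    return {vb: (new_node if vb % num_nodes == num_nodes - 1 else owner)
            for vb, owner in old_map.items()}

new_map = rebalance_in(cluster_map, "node-3", 4)
moved = sum(1 for vb in cluster_map if cluster_map[vb] != new_map[vb])
print(f"vBuckets that changed owner: {moved} of {NUM_VBUCKETS}")
```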
Intra-Cluster Replication and Auto-Failover
Each vBucket has one active copy and up to three replica copies distributed across different nodes. When a client writes a document, the write goes to the active vBucket on the responsible node. The Data Service then replicates the mutation to replica vBuckets on other nodes through the internal DCP (Database Change Protocol) stream. By default, Couchbase configures one replica, but for production HA deployments, two replicas are recommended:
# Configure bucket with 2 replicas via CLI
/opt/couchbase/bin/couchbase-cli bucket-create \
--cluster localhost:8091 \
--username Administrator \
--password password \
--bucket production-data \
--bucket-type couchbase \
--bucket-ramsize 4096 \
--bucket-replica 2 \
--bucket-priority high \
--bucket-eviction-policy valueOnly \
--enable-flush 0 \
--compression-mode active \
--max-ttl 0 \
--durability-min-level majorityAndPersistActive
Auto-failover is Couchbase's mechanism for automatically detecting and recovering from node failures. When a node becomes unresponsive, the cluster orchestrator waits for a configurable timeout (minimum 5 seconds, recommended 30 seconds for production), then promotes the replica vBuckets on surviving nodes to active status. This happens without any application-side intervention — SDK clients receive an updated cluster map and immediately route requests to the new active vBuckets.
# Configure auto-failover settings
/opt/couchbase/bin/couchbase-cli setting-autofailover \
--cluster localhost:8091 \
--username Administrator \
--password password \
--enable-auto-failover 1 \
--auto-failover-timeout 30 \
--max-failovers 3 \
--enable-failover-of-server-groups 1 \
--failover-on-data-disk-issues 1 \
--failover-data-disk-period 120 \
--can-abort-rebalance 1
Key auto-failover parameters:
- auto-failover-timeout — Seconds to wait before triggering failover. Lower values reduce downtime but increase false-positive risk. 30 seconds is the recommended production setting.
- max-failovers — Maximum number of sequential auto-failovers before manual intervention is required. Keep it small enough that a majority of nodes always survives: for a 5-node cluster the safe maximum is 2, since a third failover would leave only 2 of 5 nodes and break quorum.
- enable-failover-of-server-groups — Enables failover of an entire server group (rack/zone), critical for zone-aware deployments.
- failover-on-data-disk-issues — Triggers failover when the Data Service detects persistent disk I/O errors.
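The quorum arithmetic behind the max-failovers guidance can be sketched as a back-of-the-envelope helper (this is sizing reasoning, not a Couchbase API):

```python
def max_safe_auto_failovers(num_nodes: int) -> int:
    # A majority of nodes (floor(n/2) + 1) must survive for the cluster
    # orchestrator to keep quorum; the failover budget is what's left
    majority = num_nodes // 2 + 1
    return num_nodes - majority

for n in (3, 5, 7):
    print(n, max_safe_auto_failovers(n))
```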
XDCR: Cross Data Center Replication
XDCR is Couchbase's flagship multi-region replication technology. Unlike database-level replication found in traditional RDBMS systems, XDCR operates at the bucket level and streams individual document mutations between independent Couchbase clusters. Each cluster remains fully autonomous — it can accept reads and writes independently, making XDCR ideal for active-active multi-region deployments where users need low-latency access from any geography.
Unidirectional vs Bidirectional XDCR
Unidirectional XDCR replicates mutations from a source cluster to a target cluster in one direction. This is suitable for disaster recovery scenarios, read replicas in remote regions, or feeding data from an operational cluster to an analytics cluster.
Bidirectional XDCR creates replication links in both directions between two clusters, enabling active-active deployments where both clusters accept writes. This is the most powerful configuration but requires careful conflict resolution planning.
Setting Up XDCR Replication
Configuring XDCR involves creating a remote cluster reference and then defining replication links at the bucket level. Below are the CLI commands and REST API calls for a complete bidirectional setup:
# Step 1: Create remote cluster reference on the US-EAST cluster
/opt/couchbase/bin/couchbase-cli xdcr-setup \
--cluster cb-us-east.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name eu-west-cluster \
--xdcr-hostname cb-eu-west.example.com:8091 \
--xdcr-username Administrator \
--xdcr-password password \
--xdcr-demand-encryption 1 \
--xdcr-encryption-type full \
--xdcr-certificate /path/to/eu-west-ca.pem
# Step 2: Create replication from US-EAST to EU-WEST for the 'app' bucket
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-us-east.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name eu-west-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app \
--xdcr-replication-mode xmem \
--enable-compression 1 \
--filter-expression "" \
--priority high \
--network-usage-limit 0
# Step 3: Create the reverse replication on EU-WEST cluster (bidirectional)
/opt/couchbase/bin/couchbase-cli xdcr-setup \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name us-east-cluster \
--xdcr-hostname cb-us-east.example.com:8091 \
--xdcr-username Administrator \
--xdcr-password password \
--xdcr-demand-encryption 1 \
--xdcr-encryption-type full \
--xdcr-certificate /path/to/us-east-ca.pem
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name us-east-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app \
--xdcr-replication-mode xmem \
--enable-compression 1
Conflict Resolution Strategies
In bidirectional XDCR, the same document can be modified concurrently on different clusters, creating conflicts. Couchbase provides multiple conflict resolution strategies:
- Timestamp-based (LWW — Last Write Wins) — The mutation with the most recent timestamp (hybrid logical clock) wins. This works well for most active-active use cases but requires tight NTP synchronization across all clusters. It must be selected at bucket creation time and cannot be changed later.
- Sequence Number-based — Uses the internal revision sequence number to determine the winner; the mutation whose document has been revised more times wins. This is the default conflict resolution mode, and it remains useful when clock synchronization across clusters is unreliable.
- Custom Conflict Resolution (Enterprise) — Couchbase Enterprise Edition supports custom merge functions that execute server-side JavaScript to resolve conflicts with application-specific logic. This enables scenarios like merging shopping cart items from different regions or applying domain-specific conflict resolution rules.
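A simplified model of the first two strategies might look like the following; the real XDCR engine additionally tie-breaks on CAS, revision, expiration, and flags, so this is only a sketch of the primary comparison:

```python
from dataclasses import dataclass

@dataclass
class Mutation:
    cas: int        # hybrid logical clock timestamp (CAS) of the mutation
    rev_seqno: int  # revision sequence number (times the doc was mutated)
    body: dict

def resolve_lww(local: Mutation, remote: Mutation) -> Mutation:
    # Timestamp-based (last-write-wins): the higher CAS wins
    return remote if remote.cas > local.cas else local

def resolve_seqno(local: Mutation, remote: Mutation) -> Mutation:
    # Sequence-number-based: the more-revised document wins
    return remote if remote.rev_seqno > local.rev_seqno else local

local  = Mutation(cas=1700000002, rev_seqno=3, body={"qty": 2})
remote = Mutation(cas=1700000001, rev_seqno=5, body={"qty": 7})
print(resolve_lww(local, remote).body)    # local wrote later
print(resolve_seqno(local, remote).body)  # remote was revised more often
```

Note how the two policies can pick different winners for the same pair of mutations, which is why the mode is fixed per bucket rather than per operation.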
# Create a bucket with timestamp-based conflict resolution
curl -X POST http://localhost:8091/pools/default/buckets \
-u Administrator:password \
-d name=app \
-d ramQuota=4096 \
-d replicaNumber=2 \
-d bucketType=couchbase \
-d conflictResolutionType=lww \
-d compressionMode=active \
-d durabilityMinLevel=majorityAndPersistActive
# XDCR advanced settings via REST API
curl -X POST http://localhost:8091/settings/replications/<replication-id> \
-u Administrator:password \
-d optimisticReplicationThreshold=256 \
-d sourceNozzlePerNode=4 \
-d targetNozzlePerNode=4 \
-d checkpointInterval=600 \
-d batchCount=500 \
-d batchSize=2048 \
-d failureRestartInterval=10 \
-d docBatchSizeKb=2048 \
-d networkUsageLimit=0 \
-d priority=High
XDCR Filtering
XDCR supports filtering so you can replicate only a subset of documents. Filter expressions use a N1QL-like syntax that can match document keys (via REGEXP_CONTAINS on META().id) as well as field values, and can also control how expirations and deletions are replicated:
# Replicate only documents with keys starting with 'user::'
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster localhost:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name remote-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app-users \
--filter-expression "^user::" \
--filter-skip-restream 0
# Replicate documents matching a complex pattern
# (orders from 2026 with specific type)
--filter-expression "REGEXP_CONTAINS(META().id, '^order::2026') AND type='premium'"
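Before enabling a replication, it can help to reason locally about which documents a filter will select. The sketch below approximates the expression above in Python with invented sample documents; it is illustrative only, not how the XDCR engine evaluates filters:

```python
import re

def matches_filter(doc_id: str, doc: dict) -> bool:
    # Local approximation of:
    #   REGEXP_CONTAINS(META().id, '^order::2026') AND type = 'premium'
    return bool(re.search(r"^order::2026", doc_id)) and doc.get("type") == "premium"

docs = {
    "order::2026-0001": {"type": "premium"},
    "order::2026-0002": {"type": "standard"},
    "order::2025-0999": {"type": "premium"},
}
replicated = [k for k, v in docs.items() if matches_filter(k, v)]
print(replicated)
```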
Couchbase Autonomous Operator for Kubernetes
The Couchbase Autonomous Operator (CAO) is an enterprise-grade Kubernetes operator that automates the deployment, management, scaling, and recovery of Couchbase Server clusters. Unlike simple StatefulSet deployments, the Autonomous Operator understands Couchbase's internal topology — it manages rebalance operations, coordinates rolling upgrades, handles server group awareness, and integrates with Kubernetes scheduling primitives to ensure optimal placement of Couchbase pods.
CouchbaseCluster CRD Specification
The CouchbaseCluster CRD is the central configuration that declares the desired state of your Couchbase deployment. The Autonomous Operator reconciles this into StatefulSets, Services, PVCs, Secrets, and RBAC resources. Below is a production-ready CRD:
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-production
  namespace: couchbase
spec:
  image: couchbase/server:7.6.1-enterprise
  antiAffinity: true
  platform: aws
  cluster:
    autoFailoverTimeout: 30s
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120s
    autoFailoverServerGroup: true
    clusterName: cb-production
    dataServiceMemoryQuota: 8Gi
    indexServiceMemoryQuota: 4Gi
    searchServiceMemoryQuota: 2Gi
    analyticsServiceMemoryQuota: 4Gi
    eventingServiceMemoryQuota: 2Gi
    indexStorageSetting: memory_optimized
    autoCompaction:
      databaseFragmentationThreshold:
        percent: 30
        size: 1Gi
      viewFragmentationThreshold:
        percent: 30
        size: 1Gi
      parallelCompaction: false
      timeWindow:
        start: "02:00"
        end: "06:00"
        abortCompactionOutsideWindow: true
  security:
    adminSecret: cb-admin-credentials
    rbac:
      managed: true
      selector:
        matchLabels:
          cluster: cb-production
    ldap:
      hosts:
        - ldap.example.com
      port: 636
      encryption: TLS
  networking:
    tls:
      static:
        serverSecret: couchbase-server-tls
        operatorSecret: couchbase-operator-tls
    exposeAdminConsole: true
    adminConsoleServices:
      - data
    adminConsoleServiceType: NodePort
    exposedFeatures:
      - client
      - xdcr
    exposedFeatureServiceType: NodePort
  buckets:
    managed: true
    selector:
      matchLabels:
        cluster: cb-production
  servers:
    - name: data-zone-a
      size: 2
      services:
        - data
        - index
      serverGroups:
        - zone-a
      pod:
        metadata:
          labels:
            couchbase-service: data-index
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9091"
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1a
          tolerations:
            - key: couchbase
              operator: Equal
              value: "true"
              effect: NoSchedule
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 20Gi
      volumeMounts:
        default: couchbase-data
        data: couchbase-data
        index: couchbase-index
    - name: data-zone-b
      size: 2
      services:
        - data
        - index
      serverGroups:
        - zone-b
      pod:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1b
          tolerations:
            - key: couchbase
              operator: Equal
              value: "true"
              effect: NoSchedule
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 20Gi
      volumeMounts:
        default: couchbase-data
        data: couchbase-data
        index: couchbase-index
    - name: query-search
      size: 2
      services:
        - query
        - search
      serverGroups:
        - zone-a
        - zone-b
      pod:
        spec:
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 12Gi
      volumeMounts:
        default: couchbase-default
    - name: analytics-eventing
      size: 2
      services:
        - analytics
        - eventing
      serverGroups:
        - zone-c
      pod:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1c
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
            limits:
              cpu: "16"
              memory: 40Gi
      volumeMounts:
        default: couchbase-analytics
        analytics:
          - couchbase-analytics
  serverGroups:
    - zone-a
    - zone-b
    - zone-c
  volumeClaimTemplates:
    - metadata:
        name: couchbase-data
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 100Gi
    - metadata:
        name: couchbase-index
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 50Gi
    - metadata:
        name: couchbase-default
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: couchbase-analytics
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 200Gi
Operator Installation via Helm
# Add Couchbase Helm repository
helm repo add couchbase https://couchbase-partners.github.io/helm-charts/
helm repo update
# Install the Couchbase Autonomous Operator
helm install couchbase-operator couchbase/couchbase-operator \
--namespace couchbase \
--create-namespace \
--set operator.image.repository=couchbase/operator \
--set operator.image.tag=2.7.1 \
--set admissionController.enabled=true
# Create the admin credentials secret
kubectl create secret generic cb-admin-credentials \
--namespace couchbase \
--from-literal=username=Administrator \
--from-literal=password=$(openssl rand -base64 24)
# Deploy the CouchbaseCluster CRD
kubectl apply -f couchbase-cluster.yaml
# Verify deployment
kubectl get couchbaseclusters -n couchbase
kubectl get pods -n couchbase -l app=couchbase
kubectl get svc -n couchbase
Server Groups and Rack/Zone Awareness
Server groups are Couchbase's mechanism for ensuring that active vBuckets and their replicas are placed in different failure domains (availability zones, racks, or data centers). When server groups are configured, Couchbase guarantees that no active and replica pair for the same vBucket resides in the same server group. This means a complete zone failure will not result in data loss.
The Autonomous Operator maps server groups to Kubernetes node topology labels, automatically scheduling pods in the correct zones. Combined with pod anti-affinity rules, this ensures that Couchbase pods are distributed across physical infrastructure for maximum resilience.
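The invariant the operator maintains can be expressed as a small placement checker; the node-to-zone assignments below are invented for illustration:

```python
def placement_violations(vbucket_placement: dict, node_group: dict) -> list:
    # Flag any vBucket whose active and replica copies share a server group,
    # the invariant Couchbase enforces for rack/zone safety
    bad = []
    for vb, nodes in vbucket_placement.items():
        groups = [node_group[n] for n in nodes]
        if len(set(groups)) < len(groups):
            bad.append(vb)
    return bad

node_group = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-b", "n4": "zone-b"}
placement = {
    0: ["n1", "n3"],  # active in zone-a, replica in zone-b: safe
    1: ["n1", "n2"],  # both copies in zone-a: violation
}
print(placement_violations(placement, node_group))
```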
AWS EKS Deployment
Amazon EKS requires specific configuration for optimal Couchbase performance. The key considerations are storage (EBS gp3 for throughput), instance types (memory-optimized r6i/r7i for Data nodes), and networking (VPC CNI for pod-level networking).
# EBS gp3 StorageClass optimized for Couchbase
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-couchbase
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "500"
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/mrk-abcdef"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Recommended EKS node groups for Couchbase
# Data nodes: r6i.2xlarge (8 vCPU, 64 GiB) or r7i.2xlarge
# Index/Query: m6i.2xlarge (8 vCPU, 32 GiB)
# Analytics: r6i.4xlarge (16 vCPU, 128 GiB)
# Eventing: m6i.xlarge (4 vCPU, 16 GiB)
# EKS managed node group with taints for Couchbase
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: couchbase-eks
  region: us-east-1
managedNodeGroups:
  - name: cb-data
    instanceType: r6i.2xlarge
    desiredCapacity: 4
    minSize: 4
    maxSize: 8
    volumeSize: 200
    volumeType: gp3
    volumeIOPS: 6000
    volumeThroughput: 500
    availabilityZones: ["us-east-1a", "us-east-1b"]
    labels:
      workload: couchbase-data
    taints:
      - key: couchbase
        value: "true"
        effect: NoSchedule
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
  - name: cb-query
    instanceType: m6i.2xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["us-east-1a", "us-east-1b"]
    labels:
      workload: couchbase-query
Azure AKS Deployment
Azure AKS uses Premium SSD v2 or Ultra Disk for Couchbase's I/O demands and Azure Private Link for secure XDCR connectivity between regions.
# Azure Premium SSD v2 StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-premium-couchbase
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  DiskIOPSReadWrite: "6000"
  DiskMBpsReadWrite: "500"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Ultra Disk StorageClass for high-performance workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-ultra-couchbase
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS
  DiskIOPSReadWrite: "10000"
  DiskMBpsReadWrite: "1000"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# AKS recommended VM sizes:
# Data nodes: Standard_E8s_v5 (8 vCPU, 64 GiB)
# Index/Query: Standard_D8s_v5 (8 vCPU, 32 GiB)
# Analytics: Standard_E16s_v5 (16 vCPU, 128 GiB)
GCP GKE Deployment
Google Kubernetes Engine uses SSD Persistent Disks and Workload Identity for secure access to Google Cloud Storage for backups.
# GKE SSD Persistent Disk StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-couchbase
provisioner: pd.csi.storage.gke.io
parameters:
  # pd-ssd performance scales with disk size; provisioned IOPS/throughput
  # parameters apply to Hyperdisk, not pd-ssd
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GKE Hyperdisk Balanced for cost-effective performance
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-couchbase
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-balanced
  provisioned-iops-on-create: "6000"
  provisioned-throughput-on-create: "500"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GKE recommended machine types:
# Data nodes: n2-highmem-8 (8 vCPU, 64 GB)
# Index/Query: n2-standard-8 (8 vCPU, 32 GB)
# Analytics: n2-highmem-16 (16 vCPU, 128 GB)
Bare Metal k3s/Rancher with Longhorn
For organizations that need full infrastructure control without cloud vendor lock-in, bare metal k3s with Rancher management and Longhorn distributed storage provides an excellent foundation for Couchbase HA. This architecture is popular in regulated industries, edge computing scenarios, and cost-sensitive environments.
# k3s bare metal setup for Couchbase
# Install k3s on master nodes (HA with embedded etcd)
curl -sfL https://get.k3s.io | sh -s - server \
--cluster-init \
--disable traefik \
--disable servicelb \
--write-kubeconfig-mode 644 \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-1
# Join additional server nodes
curl -sfL https://get.k3s.io | sh -s - server \
--server https://master-1:6443 \
--token $(cat /var/lib/rancher/k3s/server/node-token) \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-2
# Join agent nodes
curl -sfL https://get.k3s.io | sh -s - agent \
--server https://master-1:6443 \
--token $(cat /var/lib/rancher/k3s/server/node-token) \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-3
# Install Longhorn for distributed storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--set defaultSettings.defaultReplicaCount=3 \
--set defaultSettings.defaultDataPath=/mnt/longhorn \
--set defaultSettings.guaranteedInstanceManagerCPU=12
# Create Longhorn StorageClass for Couchbase
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-couchbase
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  dataLocality: best-effort
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
Backup and Restore with cbbackupmgr
Couchbase provides cbbackupmgr, an enterprise backup tool that supports full, incremental, and differential backups with optional compression and encryption. For production HA deployments, a robust backup strategy combines Couchbase-level backups with cloud snapshot capabilities.
Backup Configuration
# Initialize a backup repository
/opt/couchbase/bin/cbbackupmgr config \
--archive /backup/couchbase \
--repo production-backup \
--include-data production-data \
--include-data user-profiles \
--exclude-data _system
# Run a full backup
/opt/couchbase/bin/cbbackupmgr backup \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://localhost \
--username Administrator \
--password "$CB_PASSWORD" \
--threads 4 \
--no-progress-bar
# Run an incremental backup (only mutations since last backup)
/opt/couchbase/bin/cbbackupmgr backup \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://localhost \
--username Administrator \
--password "$CB_PASSWORD" \
--threads 4
# List available backups
/opt/couchbase/bin/cbbackupmgr list \
--archive /backup/couchbase \
--repo production-backup
# Restore from a specific backup
/opt/couchbase/bin/cbbackupmgr restore \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://target-cluster:8091 \
--username Administrator \
--password "$CB_PASSWORD" \
--start 2026-04-12T00_00_00 \
--end 2026-04-12T14_30_00 \
--threads 4
Automated Backup Script for Kubernetes
#!/bin/bash
# couchbase-backup.sh — Automated backup to S3-compatible storage
set -euo pipefail
CLUSTER_HOST="cb-production-srv.couchbase.svc.cluster.local"
BACKUP_DIR="/backup/couchbase"
REPO_NAME="prod-$(date +%Y%m%d)"
S3_BUCKET="s3://couchbase-backups/production"
RETENTION_DAYS=14
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
log "Starting Couchbase backup for cluster: $CLUSTER_HOST"
if [ ! -d "$BACKUP_DIR/$REPO_NAME" ]; then
log "Configuring new backup repository: $REPO_NAME"
cbbackupmgr config \
--archive "$BACKUP_DIR" \
--repo "$REPO_NAME" \
--include-data production-data \
--include-data user-profiles
fi
log "Running incremental backup..."
cbbackupmgr backup \
--archive "$BACKUP_DIR" \
--repo "$REPO_NAME" \
--cluster "couchbase://$CLUSTER_HOST" \
--username "$CB_USERNAME" \
--password "$CB_PASSWORD" \
--threads 4 \
--no-progress-bar
BACKUP_SIZE=$(du -sh "$BACKUP_DIR/$REPO_NAME" | cut -f1)
log "Backup complete. Size: $BACKUP_SIZE"
log "Syncing to S3: $S3_BUCKET/$REPO_NAME"
aws s3 sync "$BACKUP_DIR/$REPO_NAME" "$S3_BUCKET/$REPO_NAME" \
--storage-class STANDARD_IA \
--sse aws:kms
log "Cleaning up backups older than $RETENTION_DAYS days..."
find "$BACKUP_DIR" -maxdepth 1 -name "prod-*" -mtime +"$RETENTION_DAYS" -exec rm -rf {} \;
log "Backup pipeline complete."
CouchbaseBackup CRD (Operator-Managed)
The Autonomous Operator provides CRDs for automated backup management:
apiVersion: couchbase.com/v2
kind: CouchbaseBackup
metadata:
  name: cb-daily-backup
  namespace: couchbase
spec:
  strategy: full_incremental
  full:
    schedule: "0 2 * * 0"    # Full backup every Sunday at 2 AM
  incremental:
    schedule: "0 2 * * 1-6"  # Incremental Mon-Sat at 2 AM
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  backOffLimit: 3
  logRetention: 168h
  size: 100Gi
  s3bucket: s3://couchbase-backups/production
---
apiVersion: couchbase.com/v2
kind: CouchbaseBackupRestore
metadata:
  name: cb-restore-pitr
  namespace: couchbase
spec:
  backup: cb-daily-backup
  repo: "20260412"
  start:
    int: 1
  end:
    int: 5
  backOffLimit: 3
N1QL Query Performance Tuning
N1QL (SQL++ for JSON) is Couchbase's query language. Tuning N1QL performance requires understanding the query planner, index design, and server-side optimizations.
Index Strategies: GSI and FTS
-- Global Secondary Index (GSI) for common query patterns
-- Composite index for user lookups
CREATE INDEX idx_users_email_status
ON `user-profiles`(email, status)
WHERE type = 'user'
WITH {"num_replica": 1, "defer_build": false};
-- Covering index (includes all queried fields to avoid fetch)
CREATE INDEX idx_orders_covering
ON `production-data`(customer_id, order_date, total_amount, status)
WHERE type = 'order'
WITH {"num_replica": 1};
-- Array index for nested documents
CREATE INDEX idx_order_items
ON `production-data`(DISTINCT ARRAY item.product_id FOR item IN items END)
WHERE type = 'order'
WITH {"num_replica": 1};
-- Partial index for active records only
CREATE INDEX idx_active_sessions
ON `production-data`(user_id, created_at)
WHERE type = 'session' AND status = 'active'
WITH {"num_replica": 1};
-- Adaptive index for dynamic query patterns
CREATE INDEX idx_adaptive_products
ON `production-data`(DISTINCT PAIRS(self))
WHERE type = 'product'
WITH {"num_replica": 1};
-- Check index status
SELECT name, state, index_key, `condition`
FROM system:indexes
WHERE keyspace_id = 'production-data';
-- Analyze query execution plan
EXPLAIN SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM `user-profiles` u
JOIN `production-data` o ON o.customer_id = u.id
WHERE u.status = 'active' AND o.type = 'order'
GROUP BY u.name, u.email
ORDER BY order_count DESC
LIMIT 100;
-- Use ADVISE to get index recommendations
ADVISE SELECT * FROM `production-data`
WHERE type = 'order'
AND customer_id = 'cust-12345'
AND order_date BETWEEN '2026-01-01' AND '2026-04-12'
ORDER BY order_date DESC;
Query Optimization Tips
-- Use PREPARE for frequently executed queries (cached plan)
PREPARE get_user_orders AS
SELECT o.id, o.order_date, o.total_amount, o.status
FROM `production-data` o
WHERE o.type = 'order'
AND o.customer_id = $customer_id
ORDER BY o.order_date DESC
LIMIT $page_size OFFSET $page_offset;
-- Execute prepared statement
EXECUTE get_user_orders
USING {"customer_id": "cust-12345", "page_size": 20, "page_offset": 0};
-- Use META().id for direct key-value lookups (fastest path)
SELECT META().id, *
FROM `production-data`
USE KEYS ["order::2026-001", "order::2026-002", "order::2026-003"];
-- Correlated subquery with USE KEYS for joins
SELECT u.name,
(SELECT o.id, o.total_amount
FROM `production-data` o
USE KEYS u.order_ids
WHERE o.status = 'completed') AS completed_orders
FROM `user-profiles` u
WHERE META(u).id = 'user::12345';
-- Use INFER to understand document schema
INFER `production-data` WITH {"sample_size": 10000, "similarity_metric": 0.6};
Memory Management and Bucket Configuration
Couchbase's memory-first architecture means RAM allocation directly impacts performance. Each service has its own memory quota, and buckets share the Data Service quota. Proper sizing prevents cache evictions that degrade latency.
# Configure cluster-level memory quotas
/opt/couchbase/bin/couchbase-cli setting-cluster \
--cluster localhost:8091 \
--username Administrator \
--password password \
--cluster-ramsize 8192 \
--cluster-index-ramsize 4096 \
--cluster-fts-ramsize 2048 \
--cluster-eventing-ramsize 2048 \
--cluster-analytics-ramsize 4096
# Memory allocation guidelines:
# Data Service: 60% of available node RAM
# Index Service: 20% of available node RAM
# Search Service: 10% of available node RAM
# OS/overhead: 10% reserved
# Bucket memory sizing formula:
# Required RAM = (avg_doc_size * num_docs * 2.5) / num_data_nodes
# The 2.5 multiplier accounts for:
# - Metadata overhead (~56 bytes per document)
# - Internal fragmentation
# - Replica copies in memory
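The sizing formula above translates directly into a calculator (the 2.5x overhead factor is the rule of thumb from the comments, not an exact figure for every workload):

```python
def bucket_ram_per_node_mb(avg_doc_size_bytes: int, num_docs: int,
                           num_data_nodes: int, overhead: float = 2.5) -> int:
    # Working set times the overhead factor (metadata, fragmentation,
    # in-memory replicas), divided evenly across the Data nodes
    total_bytes = avg_doc_size_bytes * num_docs * overhead
    return int(total_bytes / num_data_nodes / (1024 * 1024))

# 10M documents averaging 2 KiB, spread across 4 Data nodes
print(bucket_ram_per_node_mb(2048, 10_000_000, 4))
```

Compare the result against the per-node Data Service quota before creating the bucket; a result near or above the quota predicts cache evictions under load.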
# Create an optimized production bucket
curl -X POST http://localhost:8091/pools/default/buckets \
-u Administrator:password \
-d name=production-data \
-d ramQuota=4096 \
-d bucketType=couchbase \
-d replicaNumber=2 \
-d threadsNumber=8 \
-d evictionPolicy=valueOnly \
-d compressionMode=active \
-d maxTTL=0 \
-d conflictResolutionType=lww \
-d flushEnabled=0 \
-d durabilityMinLevel=majorityAndPersistActive
# Eviction policies:
# valueOnly - Evicts document values but keeps metadata in RAM
# Best for workloads where key access patterns are predictable
# fullEviction - Evicts both values and metadata
# Best for very large datasets that exceed available RAM
# noEviction - (Ephemeral buckets only) Rejects writes when RAM is full
# Best for caching use cases
TLS Encryption and RBAC
Securing Couchbase in production requires encryption of data in transit (TLS), fine-grained role-based access control (RBAC), and audit logging.
# Enable TLS for all Couchbase services
/opt/couchbase/bin/couchbase-cli ssl-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set-node-certificate
# Enforce minimum TLS version
/opt/couchbase/bin/couchbase-cli setting-security \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--tls-min-version tlsv1.2 \
--tls-honor-cipher-order 1 \
--hsts-max-age 31536000 \
--hsts-preload-enabled 1
# Create application-specific RBAC users
# Capture the generated password in a variable so it can be stored
# in your secrets manager; otherwise it is lost after this command runs
APP_PASSWORD="$(openssl rand -base64 32)"
/opt/couchbase/bin/couchbase-cli user-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--rbac-username app-service \
--rbac-password "$APP_PASSWORD" \
--rbac-name "Application Service Account" \
--roles 'data_reader[production-data],data_writer[production-data],query_select[production-data],query_insert[production-data],query_update[production-data],query_delete[production-data]' \
--auth-domain local
# Create a read-only analytics user (again, capture the generated password)
ANALYTICS_PASSWORD="$(openssl rand -base64 32)"
/opt/couchbase/bin/couchbase-cli user-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--rbac-username analytics-reader \
--rbac-password "$ANALYTICS_PASSWORD" \
--roles 'data_reader[production-data],query_select[production-data],analytics_reader[production-data]' \
--auth-domain local
# Enable audit logging
/opt/couchbase/bin/couchbase-cli setting-audit \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--audit-enabled 1 \
--audit-log-path /opt/couchbase/var/lib/couchbase/logs \
--audit-log-rotate-interval 86400 \
--audit-log-rotate-size 20971520
Kubernetes TLS with cert-manager
# Certificate for Couchbase server TLS
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: couchbase-server-tls
namespace: couchbase
spec:
secretName: couchbase-server-tls
duration: 8760h # 1 year
renewBefore: 720h # 30 days before expiry
privateKey:
algorithm: RSA
size: 4096
usages:
- server auth
- client auth
dnsNames:
- "*.cb-production.couchbase.svc.cluster.local"
- "*.cb-production.couchbase.svc"
- "cb-production-srv.couchbase.svc.cluster.local"
- "localhost"
issuerRef:
name: couchbase-ca-issuer
kind: ClusterIssuer
Monitoring with Prometheus Exporter
Couchbase exposes rich metrics through its REST API. The couchbase-exporter translates these into Prometheus format for comprehensive monitoring.
# Deploy Couchbase Prometheus Exporter
apiVersion: apps/v1
kind: Deployment
metadata:
name: couchbase-exporter
namespace: couchbase
spec:
replicas: 1
selector:
matchLabels:
app: couchbase-exporter
template:
metadata:
labels:
app: couchbase-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9091"
spec:
containers:
- name: exporter
image: couchbase/exporter:1.0.9
args:
- --couchbase-address=cb-production-srv.couchbase.svc.cluster.local
- --couchbase-port=8091
- --couchbase-username=$(CB_USERNAME)
- --couchbase-password=$(CB_PASSWORD)
- --server-address=0.0.0.0:9091
- --per-node-refresh=5
env:
- name: CB_USERNAME
valueFrom:
secretKeyRef:
name: cb-admin-credentials
key: username
- name: CB_PASSWORD
valueFrom:
secretKeyRef:
name: cb-admin-credentials
key: password
ports:
- containerPort: 9091
name: metrics
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: couchbase-monitor
namespace: couchbase
spec:
selector:
matchLabels:
app: couchbase-exporter
endpoints:
- port: metrics
interval: 15s
path: /metrics
Key Couchbase Metrics to Monitor
- cb_bucket_ops_per_sec — Total operations per second per bucket. Baseline your normal throughput and alert on anomalies.
- cb_bucket_mem_used_bytes — Bucket memory usage. Alert when approaching the RAM quota to prevent evictions.
- cb_bucket_cache_miss_ratio — Ratio of requests that miss the cache and require disk fetch. Should stay below 2% for optimal performance.
- cb_bucket_disk_queue_items — Disk write queue depth. A growing queue indicates disk I/O cannot keep up with write throughput.
- cb_xdcr_changes_left — Number of mutations pending XDCR replication. Indicates cross-region replication lag.
- cb_xdcr_docs_written — Documents replicated per second through XDCR.
- cb_node_cpu_utilization_percent — Per-node CPU usage. Couchbase is CPU-intensive for compaction and indexing.
- cb_bucket_vbucket_active_num — Number of active vBuckets per node. Should be roughly even across data nodes.
- cb_index_num_docs_pending — Documents pending index update. Indicates index build lag.
- cb_n1ql_requests_per_sec — N1QL query throughput. Combined with average latency, identifies query performance issues.
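The cache miss ratio in the list above is derived from two raw bucket stats: background disk fetches divided by total reads. A sketch of that computation, assuming the conventional stat names `ep_bg_fetched` and `cmd_get`:

```python
# Illustrative computation of the cache miss ratio from raw bucket stats,
# assuming the conventional definition: background disk fetches
# (ep_bg_fetched) divided by total read operations (cmd_get).

def cache_miss_ratio(ep_bg_fetched, cmd_get):
    """Fraction of reads that had to go to disk; 0.0 when there are no reads."""
    if cmd_get == 0:
        return 0.0
    return ep_bg_fetched / cmd_get

# A healthy bucket: 150 disk fetches out of 10,000 reads is 1.5%,
# under the 2% target noted above
ratio = cache_miss_ratio(150, 10_000)
```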
Prometheus Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: couchbase-alerts
namespace: couchbase
spec:
groups:
- name: couchbase.rules
rules:
- alert: CouchbaseNodeDown
expr: cb_node_healthy == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Couchbase node {{ $labels.node }} is unhealthy"
- alert: CouchbaseHighCacheMissRate
expr: cb_bucket_cache_miss_ratio > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Cache miss rate {{ $value | humanizePercentage }} on bucket {{ $labels.bucket }}"
- alert: CouchbaseXDCRLag
expr: cb_xdcr_changes_left > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "XDCR replication lag: {{ $value }} pending mutations"
- alert: CouchbaseDiskQueueGrowing
expr: deriv(cb_bucket_disk_queue_items[5m]) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Disk queue growing on bucket {{ $labels.bucket }}"
- alert: CouchbaseMemoryPressure
expr: cb_bucket_mem_used_bytes / cb_bucket_mem_quota_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Memory usage at {{ $value | humanizePercentage }} for bucket {{ $labels.bucket }}"
SDK Connection String Configuration for HA
Couchbase SDKs are topology-aware — they maintain an internal cluster map and route operations directly to the correct node. Proper SDK configuration is critical for HA, ensuring fast failover detection and automatic retry on transient errors.
// Node.js SDK configuration for HA
const couchbase = require('couchbase');
const clusterConnStr = 'couchbases://cb-node1.example.com,cb-node2.example.com,cb-node3.example.com';
const cluster = await couchbase.connect(clusterConnStr, {
username: process.env.CB_USERNAME,
password: process.env.CB_PASSWORD,
timeouts: {
kvTimeout: 2500, // Key-value operation timeout (ms)
kvDurableTimeout: 10000, // Durable write timeout
queryTimeout: 75000, // N1QL query timeout
searchTimeout: 75000, // FTS search timeout
analyticsTimeout: 75000, // Analytics query timeout
connectTimeout: 10000, // Initial connection timeout
managementTimeout: 75000 // Management API timeout
},
security: {
trustStorePath: '/etc/couchbase/ca.pem'
},
transactions: {
durabilityLevel: couchbase.DurabilityLevel.MajorityAndPersistToActive,
timeout: 15000
}
});
const bucket = cluster.bucket('production-data');
const collection = bucket.defaultCollection();
// Durable write with synchronous durability (majority replication
// plus persistence on the active node before the server acknowledges)
await collection.upsert('order::2026-001', orderDocument, {
durabilityLevel: couchbase.DurabilityLevel.MajorityAndPersistToActive,
timeout: 10000
});
// Read with replica fallback for HA
try {
const result = await collection.get('user::12345');
} catch (err) {
if (err instanceof couchbase.TimeoutError) {
const replicaResult = await collection.getAnyReplica('user::12345');
}
}
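The timeout-then-replica pattern above generalizes to any transient fault: retry the active copy briefly, then accept a possibly stale replica read. A minimal sketch with the SDK calls replaced by injectable callables so the control flow is explicit (`KvTimeout`, `get_primary`, and `get_replica` are placeholders, not Couchbase SDK names):

```python
import time

# Sketch of the timeout-then-replica-read pattern shown above. The SDK
# calls are replaced by injectable callables; KvTimeout, get_primary, and
# get_replica are placeholders, not real SDK APIs.

class KvTimeout(Exception):
    pass

def get_with_replica_fallback(key, get_primary, get_replica,
                              retries=2, backoff_s=0.01):
    """Try the active vBucket first; on timeout, retry with exponential
    backoff, then fall back to a replica read (which may be slightly stale)."""
    for attempt in range(retries):
        try:
            return get_primary(key)
        except KvTimeout:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return get_replica(key)  # last resort: replica copy may lag the active
```

The design trade-off mirrors the SDK's `getAnyReplica`: availability over strict consistency, appropriate for reads that tolerate a brief replication lag.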
// Java SDK configuration for HA
import com.couchbase.client.java.*;
import com.couchbase.client.java.env.*;
import java.nio.file.Paths;
import java.time.Duration;
ClusterEnvironment env = ClusterEnvironment.builder()
.timeoutConfig(TimeoutConfig.builder()
.kvTimeout(Duration.ofMillis(2500))
.kvDurableTimeout(Duration.ofSeconds(10))
.queryTimeout(Duration.ofSeconds(75))
.connectTimeout(Duration.ofSeconds(10))
.build())
.ioConfig(IoConfig.builder()
.numKvConnections(4)
.enableMutationTokens(true)
.enableDnsSrv(true)
.build())
.securityConfig(SecurityConfig.builder()
.enableTls(true)
.trustCertificate(Paths.get("/etc/couchbase/ca.pem"))
.build())
.build();
Cluster cluster = Cluster.connect(
"couchbases://cb-node1.example.com,cb-node2.example.com",
ClusterOptions.clusterOptions("username", "password")
.environment(env)
);
Kubernetes Service DNS for SDK Connection
# When using the Autonomous Operator, connect via the headless service:
# couchbase://cb-production-srv.couchbase.svc.cluster.local
#
# The operator creates these services:
# cb-production-srv - Headless service for SDK auto-discovery
# cb-production-ui - Web Console (port 8091/18091)
# cb-production-cloud - External connectivity (NodePort/LoadBalancer)
#
# For external SDK access (outside Kubernetes), use:
# - NodePort with explicit node addresses
# - LoadBalancer with MetalLB (bare metal)
# - Ingress with TCP passthrough for port 11210 (SDK) and 11207 (SDK TLS)
Couchbase Mobile and Sync Gateway for Edge Deployments
Couchbase Mobile extends the Couchbase ecosystem to edge devices and mobile applications. Couchbase Lite runs embedded on iOS, Android, and IoT devices, while Sync Gateway acts as the synchronization middleware between Couchbase Lite and Couchbase Server.
// Sync Gateway configuration for production
{
"interface": ":4984",
"adminInterface": "127.0.0.1:4985",
"logging": {
"console": {
"log_level": "info",
"log_keys": ["HTTP", "Sync", "Auth", "Changes"]
}
},
"databases": {
"mobile-app": {
"server": "couchbases://cb-production-srv.couchbase.svc.cluster.local",
"bucket": "production-data",
"username": "sync-gateway",
"password": "${SG_PASSWORD}",
"enable_shared_bucket_access": true,
"import_docs": true,
"num_index_replicas": 1,
"delta_sync": {
"enabled": true,
"rev_max_age_seconds": 86400
},
"cache": {
"channel_cache": {
"max_number": 50000,
"compact_high_watermark_pct": 80,
"compact_low_watermark_pct": 60
},
"rev_cache": {
"size": 5000,
"shard_count": 16
}
},
"users": {
"GUEST": {"disabled": true}
},
"sync": "function(doc, oldDoc) { if (doc.type === 'user-data') { channel(doc.channels); requireAccess(doc.channels); } else { channel('public'); } }"
}
}
}
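The inline sync function in the config above routes documents to channels. Transcribed into Python for readability (purely illustrative — Sync Gateway only executes the JavaScript version):

```python
# Python transcription of the JavaScript sync function above, to make the
# channel-routing logic easier to read. Sync Gateway itself runs only the
# JavaScript version; this sketch is illustrative.

def route_document(doc):
    """Mirror the sync function: user-data documents go to (and require
    write access to) their own channels; everything else is public."""
    if doc.get("type") == "user-data":
        chans = doc.get("channels", [])
        return {"channels": chans, "require_access": chans}
    return {"channels": ["public"], "require_access": []}
```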
# Deploy Sync Gateway on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
name: sync-gateway
namespace: couchbase
spec:
replicas: 3
selector:
matchLabels:
app: sync-gateway
template:
metadata:
labels:
app: sync-gateway
spec:
containers:
- name: sync-gateway
image: couchbase/sync-gateway:3.1.4-enterprise
args: ["/etc/sync-gateway/config.json"]
ports:
- containerPort: 4984
name: public
- containerPort: 4985
name: admin
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
volumeMounts:
- name: config
mountPath: /etc/sync-gateway
volumes:
- name: config
configMap:
name: sync-gateway-config
Capacity Planning and Sizing
Proper capacity planning is essential for Couchbase performance and cost optimization. The following table provides sizing guidelines based on workload tier:
| Workload Tier | Data Nodes | Index/Query | RAM per Node | Storage | Throughput |
|---|---|---|---|---|---|
| Development | 1 (all services) | Co-located | 4 GB | 20 GB SSD | <1k ops/s |
| Small Production | 3 Data | 2 Query+Index | 16 GB | 100 GB SSD | 10k ops/s |
| Medium Production | 5 Data | 3 Query+Index | 32 GB | 500 GB SSD | 50k ops/s |
| Large Production | 7-10 Data | 4+ Query+Index | 64 GB | 1 TB NVMe | 200k+ ops/s |
| Enterprise / Global | 10+ Data (multi-region) | 6+ Query+Index | 128 GB | 2+ TB NVMe | 500k+ ops/s |
Sizing Formula
# Data Service RAM sizing
# Required RAM per node = (num_documents * (doc_metadata_size + avg_value_size)) / num_data_nodes * (1 + num_replicas)
# doc_metadata_size = 56 bytes (fixed overhead per document)
# Include 25% headroom for fragmentation and growth
# Example: 100M documents, 1KB avg size, 3 data nodes, 1 replica
# RAM = (100,000,000 * (56 + 1024)) / 3 * 2 = ~72 GB per node
# With 25% headroom: ~90 GB per node
# Index Service RAM sizing (memory-optimized)
# RAM = total_index_size * 3 (for build/merge overhead)
# Use system:indexes to check current index sizes
# Disk sizing
# Disk = (num_documents * avg_doc_size * (1 + num_replicas)) * 3 (compaction headroom)
# Use SSD/NVMe with provisioned IOPS for predictable performance
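The formulas above can be implemented directly. A sketch reproducing the worked example, where the constants (56-byte metadata, 25% headroom, 3x compaction factor) come straight from the comments and GB means decimal gigabytes:

```python
# Illustrative implementation of the sizing formulas above; constants come
# from the text, and GB here means decimal gigabytes (1e9 bytes).

DOC_METADATA_BYTES = 56  # fixed overhead per document

def data_ram_per_node_gb(num_docs, avg_value_bytes, num_data_nodes,
                         num_replicas, headroom=0.25):
    """Per-node Data Service RAM, including replicas and 25% headroom."""
    raw = num_docs * (DOC_METADATA_BYTES + avg_value_bytes)
    per_node = raw / num_data_nodes * (1 + num_replicas)
    return per_node * (1 + headroom) / 1e9

def disk_per_cluster_gb(num_docs, avg_doc_bytes, num_replicas,
                        compaction_factor=3):
    """Cluster-wide disk, with 3x headroom for compaction."""
    return (num_docs * avg_doc_bytes * (1 + num_replicas)
            * compaction_factor / 1e9)

# Worked example from above: 100M docs, 1 KB values, 3 data nodes, 1 replica
# -> ~72 GB raw per node, ~90 GB once the 25% headroom is applied
ram = data_ram_per_node_gb(100_000_000, 1024, 3, 1)
```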
Disaster Recovery and Failover Procedures
A comprehensive disaster recovery plan ensures business continuity when infrastructure failures exceed the scope of automatic failover.
Single Node Failure (Automatic)
# Auto-failover handles single node failures automatically.
# Verify failover occurred:
/opt/couchbase/bin/couchbase-cli server-list \
--cluster localhost:8091 \
--username Administrator \
--password password
# After replacing the failed node, add and rebalance:
/opt/couchbase/bin/couchbase-cli server-add \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-add new-node.example.com:8091 \
--server-add-username Administrator \
--server-add-password password \
--services data,index
/opt/couchbase/bin/couchbase-cli rebalance \
--cluster localhost:8091 \
--username Administrator \
--password password
Complete Cluster Failure (Manual)
# Scenario: Primary region (US-EAST) completely lost
# Step 1: Verify XDCR target cluster (EU-WEST) has latest data
# Check XDCR replication status before failure
curl -s http://eu-west-node:8091/pools/default/remoteClusters \
-u Administrator:password | jq .
# Step 2: Pause XDCR replications pointing to the failed cluster
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--pause \
--xdcr-replicator <replication-id>
# Step 3: Update application connection strings to EU-WEST cluster
# (via DNS update, service mesh, or environment variable change)
# Step 4: Scale up EU-WEST cluster if needed to handle full production load
# In Kubernetes, update the CouchbaseCluster CRD.
# Note: a JSON merge patch replaces the entire servers list,
# so include every server class you want to keep:
kubectl patch couchbasecluster cb-eu-west -n couchbase --type merge \
-p '{"spec":{"servers":[{"name":"data-zone-a","size":4}]}}'
# Step 5: After US-EAST cluster is restored, re-establish XDCR
# and perform a full resync from EU-WEST back to US-EAST
Graceful Failover and Recovery
# Graceful failover (for maintenance, drains data before removal)
/opt/couchbase/bin/couchbase-cli failover \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-failover node-to-remove.example.com:8091
# Recovery (re-add the node after maintenance)
/opt/couchbase/bin/couchbase-cli recovery \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-recovery node-to-recover.example.com:8091 \
--recovery-type delta
# Delta recovery re-synchronizes only the changed data,
# which is much faster than full recovery.
# Full recovery rebuilds the node from scratch.
# Rebalance to complete the recovery
/opt/couchbase/bin/couchbase-cli rebalance \
--cluster localhost:8091 \
--username Administrator \
--password password
Helm Values for Complete Production Deployment
Below is a comprehensive Helm values file for deploying Couchbase with the Autonomous Operator in a production environment:
# helm-values-production.yaml
couchbase-operator:
operator:
image:
repository: couchbase/operator
tag: 2.7.1
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
admissionController:
enabled: true
resources:
requests:
cpu: 100m
memory: 128Mi
cluster:
image: couchbase/server:7.6.1-enterprise
antiAffinity: true
autoFailoverTimeout: 30s
autoFailoverMaxCount: 3
autoFailoverOnDataDiskIssues: true
autoFailoverServerGroup: true
security:
adminSecret: cb-admin-credentials
networking:
tls:
static:
serverSecret: couchbase-server-tls
operatorSecret: couchbase-operator-tls
exposeAdminConsole: true
adminConsoleServiceType: NodePort
buckets:
managed: true
servers:
data:
size: 4
services:
- data
- index
serverGroups:
- zone-a
- zone-b
resources:
requests:
cpu: "4"
memory: 16Gi
limits:
cpu: "8"
memory: 20Gi
volumeMounts:
default: couchbase-data
data: couchbase-data
index: couchbase-index
query:
size: 2
services:
- query
- search
resources:
requests:
cpu: "4"
memory: 8Gi
limits:
cpu: "8"
memory: 12Gi
volumeMounts:
default: couchbase-default
analytics:
size: 2
services:
- analytics
- eventing
serverGroups:
- zone-c
resources:
requests:
cpu: "8"
memory: 32Gi
limits:
cpu: "16"
memory: 40Gi
volumeMounts:
default: couchbase-analytics
analytics:
- couchbase-analytics
serverGroups:
- zone-a
- zone-b
- zone-c
volumeClaimTemplates:
- metadata:
name: couchbase-data
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 100Gi
- metadata:
name: couchbase-index
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 50Gi
- metadata:
name: couchbase-default
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 20Gi
- metadata:
name: couchbase-analytics
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 200Gi
Conclusion
Couchbase Server's architecture—built around vBucket-based sharding, memory-first data access, and integrated multi-model services—provides a uniquely powerful foundation for high availability production deployments. The combination of intra-cluster replication with automatic failover ensures that single-node failures are handled transparently, while XDCR extends this resilience across geographic regions for global applications.
The Couchbase Autonomous Operator for Kubernetes transforms what would be complex manual operations into declarative, self-healing deployments. Server groups provide rack/zone awareness, the operator manages rebalance operations during scaling events, and integrated backup CRDs automate disaster recovery preparation.
Key takeaways from this guide:
- Leverage Multi-Dimensional Scaling — Separate Data, Index, Query, Search, Analytics, and Eventing services onto dedicated node pools for independent scaling and resource isolation.
- Configure XDCR for multi-region resilience — Bidirectional XDCR with timestamp-based conflict resolution enables active-active deployments across AWS, Azure, and GCP. Always ensure NTP synchronization.
- Use server groups for zone awareness — Map server groups to availability zones or racks to guarantee that active and replica vBuckets are in different failure domains.
- Size memory carefully — Couchbase's performance is directly tied to how much of the working set fits in RAM. Use the sizing formulas and monitor cache miss ratios.
- Implement comprehensive monitoring — Deploy the Prometheus exporter from day one. XDCR replication lag, cache miss ratio, disk queue depth, and node health are your critical signals.
- Automate backups with cbbackupmgr — Combine full and incremental backups with cloud snapshots. Test restore procedures regularly.
- Secure with TLS and RBAC — Enable node-to-node and client-to-node TLS encryption. Use fine-grained RBAC roles for every application service account.
- Configure SDKs for HA — Use multiple bootstrap nodes, configure appropriate timeouts, implement replica reads as fallback, and leverage durable writes for critical data.
- Plan for disaster recovery — Document and rehearse failover procedures for single-node, multi-node, and complete cluster failure scenarios. XDCR standby clusters should be ready for promotion at all times.
With this comprehensive foundation, you are equipped to deploy and operate Couchbase Server in high availability production environments across any infrastructure—from managed Kubernetes on AWS, Azure, and GCP to bare metal k3s clusters managed by Rancher. The combination of Couchbase's native distributed architecture with Kubernetes orchestration delivers a database platform that meets the demands of modern, globally distributed applications.