Couchbase High Availability in Production: XDCR, Kubernetes Operator, and Multi-Region Deployment
Introduction: Why Couchbase for High Availability Production Workloads
Couchbase Server is a distributed, multi-model NoSQL database engineered for interactive applications that demand consistent low-latency performance at any scale. Unlike databases that bolt on distributed features as an afterthought, Couchbase was architected from its inception around a shared-nothing, peer-to-peer topology where every node is equal and data is automatically sharded across the cluster using a deterministic hashing mechanism called vBuckets. This architectural choice eliminates single points of failure at the data layer and enables horizontal scaling without application changes.
What sets Couchbase apart in the high availability landscape is its Cross Data Center Replication (XDCR) — a built-in, asynchronous replication engine that continuously streams mutations between geographically distributed clusters. Combined with automatic failover, rack/zone awareness, and a rich set of integrated services (Data, Index, Query, Search, Analytics, and Eventing), Couchbase provides a unified platform that can serve as both the operational database and the analytical engine for modern applications.
In this comprehensive guide, we will explore every aspect of running Couchbase Server in high availability production environments: the internal architecture that makes HA possible, XDCR configuration for multi-region deployments, the Couchbase Autonomous Operator for Kubernetes, cloud-specific deployment patterns for AWS EKS, Azure AKS, and GCP GKE, bare metal k3s deployments with Rancher and Longhorn, backup and restore strategies, N1QL query tuning, security hardening, monitoring, and Couchbase Mobile with Sync Gateway for edge deployments. By the end, you will have actionable knowledge to deploy and operate production-grade Couchbase clusters on any infrastructure.
Couchbase Server Architecture: Services, vBuckets, and Automatic Sharding
Understanding Couchbase's internal architecture is critical before deploying for high availability. Couchbase uses a Multi-Dimensional Scaling (MDS) architecture where different services can be independently deployed and scaled across cluster nodes. This gives operators fine-grained control over resource allocation and performance isolation.
The Six Core Services
Couchbase Server provides six integrated services, each handling a distinct workload type:
- Data Service (KV) — The core key-value engine built on a memory-first architecture. It handles CRUD operations, manages vBucket distribution, and serves as the persistence layer. Data is stored in memory (managed cache) and asynchronously persisted to disk. This service must run on at least one node in every cluster.
- Index Service (GSI) — Maintains Global Secondary Indexes that support N1QL queries. Indexes are stored separately from data, allowing independent scaling. Supports standard and memory-optimized index storage modes.
- Query Service (N1QL) — Executes N1QL (SQL++ for JSON) queries against the cluster. Stateless by design, making it easy to scale horizontally. Coordinates with the Data and Index services to plan and execute queries.
- Search Service (FTS) — Provides full-text search capabilities powered by the Bleve search engine. Supports fuzzy matching, geospatial queries, faceted search, and custom analyzers. Indexes are partitioned and replicated across Search nodes.
- Analytics Service (CBAS) — Runs complex analytical queries using a parallel processing engine based on Apache AsterixDB. Operates on its own shadow copy of the data, ensuring that analytical workloads never impact operational latency.
- Eventing Service — Executes server-side JavaScript functions in response to data mutations. Enables real-time data enrichment, transformations, cascade deletes, and integration triggers without external infrastructure.
vBucket Distribution and Automatic Sharding
Couchbase distributes data across the cluster using 1024 vBuckets (virtual buckets). Every document is mapped to a vBucket using a CRC32 hash of the document key modulo 1024. The cluster map — maintained by every node and cached by every SDK client — maps each vBucket to a specific node. This deterministic mapping means clients always know exactly which node holds any given document, enabling single-hop reads and writes with sub-millisecond latency.
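As a toy illustration of this mapping (not the SDKs' exact bit-level implementation, which may shift and mask the CRC32 before the modulo), the routing logic can be sketched in Python; the cluster map layout here is invented for the example:

```python
import zlib

NUM_VBUCKETS = 1024

def vbucket_for_key(key: str) -> int:
    # Deterministic key -> vBucket mapping described above (illustrative;
    # real SDKs apply the same CRC32 idea with library-specific bit handling)
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

def node_for_key(key: str, cluster_map: list) -> str:
    # Single-hop routing: the cached cluster map names the owning node
    return cluster_map[vbucket_for_key(key)]

# Invented cluster map: vBucket id -> node, round-robin over 4 nodes
cluster_map = [f"node-{vb % 4}" for vb in range(NUM_VBUCKETS)]
print(node_for_key("user::1001", cluster_map))
```

Because the hash is deterministic, every client computes the same vBucket for the same key and never needs a proxy tier in front of the cluster.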
When nodes are added or removed, Couchbase redistributes vBuckets automatically through a process called rebalance. During rebalance, the cluster moves vBuckets between nodes while remaining fully operational. The rebalance is carefully orchestrated to maintain the configured number of replicas at all times, and clients are seamlessly redirected to the new vBucket locations through cluster map updates.
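To see why rebalance only redirects a subset of clients, consider a hypothetical before/after cluster map when a fourth node joins. The redistribution rule below is invented for illustration, but it shows the key property: only vBuckets reassigned to the new node change owners, so most traffic is undisturbed:

```python
NUM_VBUCKETS = 1024

# Before rebalance: three nodes own the 1024 vBuckets round-robin
cluster_map = {vb: f"node-{vb % 3}" for vb in range(NUM_VBUCKETS)}

def rebalance_in(old_map: dict, new_node: str, num_nodes: int) -> dict:
    # Hypothetical redistribution when a node joins: vBuckets whose id
    # falls in the new node's slot move; everything else keeps its owner
    return {vb: (new_node if vb % num_nodes == num_nodes - 1 else owner)
            for vb, owner in old_map.items()}

new_map = rebalance_in(cluster_map, "node-3", 4)
moved = sum(1 for vb in cluster_map if cluster_map[vb] != new_map[vb])
print(f"vBuckets that changed owner: {moved} of {NUM_VBUCKETS}")
```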
Intra-Cluster Replication and Auto-Failover
Each vBucket has one active copy and up to three replica copies distributed across different nodes. When a client writes a document, the write goes to the active vBucket on the responsible node. The Data Service then replicates the mutation to replica vBuckets on other nodes through the internal DCP (Database Change Protocol) stream. By default, Couchbase configures one replica, but for production HA deployments, two replicas are recommended:
# Configure bucket with 2 replicas via CLI
/opt/couchbase/bin/couchbase-cli bucket-create \
--cluster localhost:8091 \
--username Administrator \
--password password \
--bucket production-data \
--bucket-type couchbase \
--bucket-ramsize 4096 \
--bucket-replica 2 \
--bucket-priority high \
--bucket-eviction-policy valueOnly \
--enable-flush 0 \
--compression-mode active \
--max-ttl 0 \
--durability-min-level majorityAndPersistActive
Auto-failover is Couchbase's mechanism for automatically detecting and recovering from node failures. When a node becomes unresponsive, the cluster orchestrator waits for a configurable timeout (minimum 5 seconds, recommended 30 seconds for production), then promotes the replica vBuckets on surviving nodes to active status. This happens without any application-side intervention — SDK clients receive an updated cluster map and immediately route requests to the new active vBuckets.
# Configure auto-failover settings
/opt/couchbase/bin/couchbase-cli setting-autofailover \
--cluster localhost:8091 \
--username Administrator \
--password password \
--enable-auto-failover 1 \
--auto-failover-timeout 30 \
--max-failovers 3 \
--enable-failover-of-server-groups 1 \
--failover-on-data-disk-issues 1 \
--failover-data-disk-period 120 \
--can-abort-rebalance 1
Key auto-failover parameters:
- auto-failover-timeout — Seconds to wait before triggering failover. Lower values reduce downtime but increase false-positive risk. 30 seconds is the recommended production setting.
- max-failovers — Maximum number of sequential auto-failovers before manual intervention is required. Keep it small enough that a majority of nodes always survives: for a 5-node cluster the safe maximum is 2, since a third failover would leave only 2 of 5 nodes and break quorum.
- enable-failover-of-server-groups — Enables failover of an entire server group (rack/zone), critical for zone-aware deployments.
- failover-on-data-disk-issues — Triggers failover when the Data Service detects persistent disk I/O errors.
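The quorum arithmetic behind the max-failovers guidance can be sketched as a back-of-the-envelope helper (this is sizing reasoning, not a Couchbase API):

```python
def max_safe_auto_failovers(num_nodes: int) -> int:
    # A majority of nodes (floor(n/2) + 1) must survive for the cluster
    # orchestrator to keep quorum; the failover budget is what's left
    majority = num_nodes // 2 + 1
    return num_nodes - majority

for n in (3, 5, 7):
    print(n, max_safe_auto_failovers(n))
```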
XDCR: Cross Data Center Replication
XDCR is Couchbase's flagship multi-region replication technology. Unlike database-level replication found in traditional RDBMS systems, XDCR operates at the bucket level and streams individual document mutations between independent Couchbase clusters. Each cluster remains fully autonomous — it can accept reads and writes independently, making XDCR ideal for active-active multi-region deployments where users need low-latency access from any geography.
Unidirectional vs Bidirectional XDCR
Unidirectional XDCR replicates mutations from a source cluster to a target cluster in one direction. This is suitable for disaster recovery scenarios, read replicas in remote regions, or feeding data from an operational cluster to an analytics cluster.
Bidirectional XDCR creates replication links in both directions between two clusters, enabling active-active deployments where both clusters accept writes. This is the most powerful configuration but requires careful conflict resolution planning.
Setting Up XDCR Replication
Configuring XDCR involves creating a remote cluster reference and then defining replication links at the bucket level. Below are the CLI commands and REST API calls for a complete bidirectional setup:
# Step 1: Create remote cluster reference on the US-EAST cluster
/opt/couchbase/bin/couchbase-cli xdcr-setup \
--cluster cb-us-east.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name eu-west-cluster \
--xdcr-hostname cb-eu-west.example.com:8091 \
--xdcr-username Administrator \
--xdcr-password password \
--xdcr-demand-encryption 1 \
--xdcr-encryption-type full \
--xdcr-certificate /path/to/eu-west-ca.pem
# Step 2: Create replication from US-EAST to EU-WEST for the 'app' bucket
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-us-east.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name eu-west-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app \
--xdcr-replication-mode xmem \
--enable-compression 1 \
--filter-expression "" \
--priority high \
--network-usage-limit 0
# Step 3: Create the reverse replication on EU-WEST cluster (bidirectional)
/opt/couchbase/bin/couchbase-cli xdcr-setup \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name us-east-cluster \
--xdcr-hostname cb-us-east.example.com:8091 \
--xdcr-username Administrator \
--xdcr-password password \
--xdcr-demand-encryption 1 \
--xdcr-encryption-type full \
--xdcr-certificate /path/to/us-east-ca.pem
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name us-east-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app \
--xdcr-replication-mode xmem \
--enable-compression 1
Conflict Resolution Strategies
In bidirectional XDCR, the same document can be modified concurrently on different clusters, creating conflicts. Couchbase provides multiple conflict resolution strategies:
- Timestamp-based (LWW — Last Write Wins) — The mutation with the most recent timestamp (hybrid logical clock) wins. This works well for most active-active use cases but requires tight NTP synchronization across all clusters. It must be selected at bucket creation time and cannot be changed later.
- Sequence Number-based — Uses the internal revision sequence number to determine the winner; the mutation whose document has been revised more times wins. This is the default conflict resolution mode, and it remains useful when clock synchronization across clusters is unreliable.
- Custom Conflict Resolution (Enterprise) — Couchbase Enterprise Edition supports custom merge functions that execute server-side JavaScript to resolve conflicts with application-specific logic. This enables scenarios like merging shopping cart items from different regions or applying domain-specific conflict resolution rules.
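A simplified model of the first two strategies might look like the following; the real XDCR engine additionally tie-breaks on CAS, revision, expiration, and flags, so this is only a sketch of the primary comparison:

```python
from dataclasses import dataclass

@dataclass
class Mutation:
    cas: int        # hybrid logical clock timestamp (CAS) of the mutation
    rev_seqno: int  # revision sequence number (times the doc was mutated)
    body: dict

def resolve_lww(local: Mutation, remote: Mutation) -> Mutation:
    # Timestamp-based (last-write-wins): the higher CAS wins
    return remote if remote.cas > local.cas else local

def resolve_seqno(local: Mutation, remote: Mutation) -> Mutation:
    # Sequence-number-based: the more-revised document wins
    return remote if remote.rev_seqno > local.rev_seqno else local

local  = Mutation(cas=1700000002, rev_seqno=3, body={"qty": 2})
remote = Mutation(cas=1700000001, rev_seqno=5, body={"qty": 7})
print(resolve_lww(local, remote).body)    # local wrote later
print(resolve_seqno(local, remote).body)  # remote was revised more often
```

Note how the two policies can pick different winners for the same pair of mutations, which is why the mode is fixed per bucket rather than per operation.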
# Create a bucket with timestamp-based conflict resolution
curl -X POST http://localhost:8091/pools/default/buckets \
-u Administrator:password \
-d name=app \
-d ramQuota=4096 \
-d replicaNumber=2 \
-d bucketType=couchbase \
-d conflictResolutionType=lww \
-d compressionMode=active \
-d durabilityMinLevel=majorityAndPersistActive
# XDCR advanced settings via REST API
curl -X POST http://localhost:8091/settings/replications/<replication-id> \
-u Administrator:password \
-d optimisticReplicationThreshold=256 \
-d sourceNozzlePerNode=4 \
-d targetNozzlePerNode=4 \
-d checkpointInterval=600 \
-d batchCount=500 \
-d batchSize=2048 \
-d failureRestartInterval=10 \
-d docBatchSizeKb=2048 \
-d networkUsageLimit=0 \
-d priority=High
XDCR Filtering
XDCR supports filtering so you can replicate only a subset of documents. Filter expressions use a N1QL-like syntax that can match document keys (via REGEXP_CONTAINS on META().id) as well as field values, and can also control how expirations and deletions are replicated:
# Replicate only documents with keys starting with 'user::'
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster localhost:8091 \
--username Administrator \
--password password \
--create \
--xdcr-cluster-name remote-cluster \
--xdcr-from-bucket app \
--xdcr-to-bucket app-users \
--filter-expression "^user::" \
--filter-skip-restream 0
# Replicate documents matching a complex pattern
# (orders from 2026 with specific type)
--filter-expression "REGEXP_CONTAINS(META().id, '^order::2026') AND type='premium'"
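Before enabling a replication, it can help to reason locally about which documents a filter will select. The sketch below approximates the expression above in Python with invented sample documents; it is illustrative only, not how the XDCR engine evaluates filters:

```python
import re

def matches_filter(doc_id: str, doc: dict) -> bool:
    # Local approximation of:
    #   REGEXP_CONTAINS(META().id, '^order::2026') AND type = 'premium'
    return bool(re.search(r"^order::2026", doc_id)) and doc.get("type") == "premium"

docs = {
    "order::2026-0001": {"type": "premium"},
    "order::2026-0002": {"type": "standard"},
    "order::2025-0999": {"type": "premium"},
}
replicated = [k for k, v in docs.items() if matches_filter(k, v)]
print(replicated)
```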
Couchbase Autonomous Operator for Kubernetes
The Couchbase Autonomous Operator (CAO) is an enterprise-grade Kubernetes operator that automates the deployment, management, scaling, and recovery of Couchbase Server clusters. Unlike simple StatefulSet deployments, the Autonomous Operator understands Couchbase's internal topology — it manages rebalance operations, coordinates rolling upgrades, handles server group awareness, and integrates with Kubernetes scheduling primitives to ensure optimal placement of Couchbase pods.
CouchbaseCluster CRD Specification
The CouchbaseCluster CRD is the central configuration that declares the desired state of your Couchbase deployment. The Autonomous Operator reconciles this into StatefulSets, Services, PVCs, Secrets, and RBAC resources. Below is a production-ready CRD:
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-production
  namespace: couchbase
spec:
  image: couchbase/server:7.6.1-enterprise
  antiAffinity: true
  platform: aws
  cluster:
    autoFailoverTimeout: 30s
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120s
    autoFailoverServerGroup: true
    clusterName: cb-production
    dataServiceMemoryQuota: 8Gi
    indexServiceMemoryQuota: 4Gi
    searchServiceMemoryQuota: 2Gi
    analyticsServiceMemoryQuota: 4Gi
    eventingServiceMemoryQuota: 2Gi
    indexStorageSetting: memory_optimized
    autoCompaction:
      databaseFragmentationThreshold:
        percent: 30
        size: 1Gi
      viewFragmentationThreshold:
        percent: 30
        size: 1Gi
      parallelCompaction: false
      timeWindow:
        start: "02:00"
        end: "06:00"
        abortCompactionOutsideWindow: true
  security:
    adminSecret: cb-admin-credentials
    rbac:
      managed: true
      selector:
        matchLabels:
          cluster: cb-production
    ldap:
      hosts:
        - ldap.example.com
      port: 636
      encryption: TLS
  networking:
    tls:
      static:
        serverSecret: couchbase-server-tls
        operatorSecret: couchbase-operator-tls
    exposeAdminConsole: true
    adminConsoleServices:
      - data
    adminConsoleServiceType: NodePort
    exposedFeatures:
      - client
      - xdcr
    exposedFeatureServiceType: NodePort
  buckets:
    managed: true
    selector:
      matchLabels:
        cluster: cb-production
  servers:
    - name: data-zone-a
      size: 2
      services:
        - data
        - index
      serverGroups:
        - zone-a
      pod:
        metadata:
          labels:
            couchbase-service: data-index
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/port: "9091"
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1a
          tolerations:
            - key: couchbase
              operator: Equal
              value: "true"
              effect: NoSchedule
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 20Gi
      volumeMounts:
        default: couchbase-data
        data: couchbase-data
        index: couchbase-index
    - name: data-zone-b
      size: 2
      services:
        - data
        - index
      serverGroups:
        - zone-b
      pod:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1b
          tolerations:
            - key: couchbase
              operator: Equal
              value: "true"
              effect: NoSchedule
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 20Gi
      volumeMounts:
        default: couchbase-data
        data: couchbase-data
        index: couchbase-index
    - name: query-search
      size: 2
      services:
        - query
        - search
      serverGroups:
        - zone-a
        - zone-b
      pod:
        spec:
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
            limits:
              cpu: "8"
              memory: 12Gi
      volumeMounts:
        default: couchbase-default
    - name: analytics-eventing
      size: 2
      services:
        - analytics
        - eventing
      serverGroups:
        - zone-c
      pod:
        spec:
          nodeSelector:
            topology.kubernetes.io/zone: us-east-1c
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
            limits:
              cpu: "16"
              memory: 40Gi
      volumeMounts:
        default: couchbase-analytics
        analytics:
          - couchbase-analytics
  serverGroups:
    - zone-a
    - zone-b
    - zone-c
  volumeClaimTemplates:
    - metadata:
        name: couchbase-data
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 100Gi
    - metadata:
        name: couchbase-index
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 50Gi
    - metadata:
        name: couchbase-default
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 20Gi
    - metadata:
        name: couchbase-analytics
      spec:
        storageClassName: ebs-gp3-couchbase
        resources:
          requests:
            storage: 200Gi
Operator Installation via Helm
# Add Couchbase Helm repository
helm repo add couchbase https://couchbase-partners.github.io/helm-charts/
helm repo update
# Install the Couchbase Autonomous Operator
helm install couchbase-operator couchbase/couchbase-operator \
--namespace couchbase \
--create-namespace \
--set operator.image.repository=couchbase/operator \
--set operator.image.tag=2.7.1 \
--set admissionController.enabled=true
# Create the admin credentials secret
kubectl create secret generic cb-admin-credentials \
--namespace couchbase \
--from-literal=username=Administrator \
--from-literal=password=$(openssl rand -base64 24)
# Deploy the CouchbaseCluster CRD
kubectl apply -f couchbase-cluster.yaml
# Verify deployment
kubectl get couchbaseclusters -n couchbase
kubectl get pods -n couchbase -l app=couchbase
kubectl get svc -n couchbase
Server Groups and Rack/Zone Awareness
Server groups are Couchbase's mechanism for ensuring that active vBuckets and their replicas are placed in different failure domains (availability zones, racks, or data centers). When server groups are configured, Couchbase guarantees that no active and replica pair for the same vBucket resides in the same server group. This means a complete zone failure will not result in data loss.
The Autonomous Operator maps server groups to Kubernetes node topology labels, automatically scheduling pods in the correct zones. Combined with pod anti-affinity rules, this ensures that Couchbase pods are distributed across physical infrastructure for maximum resilience.
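The invariant the operator maintains can be expressed as a small placement checker; the node-to-zone assignments below are invented for illustration:

```python
def placement_violations(vbucket_placement: dict, node_group: dict) -> list:
    # Flag any vBucket whose active and replica copies share a server group,
    # the invariant Couchbase enforces for rack/zone safety
    bad = []
    for vb, nodes in vbucket_placement.items():
        groups = [node_group[n] for n in nodes]
        if len(set(groups)) < len(groups):
            bad.append(vb)
    return bad

node_group = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-b", "n4": "zone-b"}
placement = {
    0: ["n1", "n3"],  # active in zone-a, replica in zone-b: safe
    1: ["n1", "n2"],  # both copies in zone-a: violation
}
print(placement_violations(placement, node_group))
```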
AWS EKS Deployment
Amazon EKS requires specific configuration for optimal Couchbase performance. The key considerations are storage (EBS gp3 for throughput), instance types (memory-optimized r6i/r7i for Data nodes), and networking (VPC CNI for pod-level networking).
# EBS gp3 StorageClass optimized for Couchbase
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3-couchbase
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "500"
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/mrk-abcdef"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Recommended EKS node groups for Couchbase
# Data nodes: r6i.2xlarge (8 vCPU, 64 GiB) or r7i.2xlarge
# Index/Query: m6i.2xlarge (8 vCPU, 32 GiB)
# Analytics: r6i.4xlarge (16 vCPU, 128 GiB)
# Eventing: m6i.xlarge (4 vCPU, 16 GiB)
# EKS managed node group with taints for Couchbase
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: couchbase-eks
  region: us-east-1
managedNodeGroups:
  - name: cb-data
    instanceType: r6i.2xlarge
    desiredCapacity: 4
    minSize: 4
    maxSize: 8
    volumeSize: 200
    volumeType: gp3
    volumeIOPS: 6000
    volumeThroughput: 500
    availabilityZones: ["us-east-1a", "us-east-1b"]
    labels:
      workload: couchbase-data
    taints:
      - key: couchbase
        value: "true"
        effect: NoSchedule
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
  - name: cb-query
    instanceType: m6i.2xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    availabilityZones: ["us-east-1a", "us-east-1b"]
    labels:
      workload: couchbase-query
Azure AKS Deployment
Azure AKS uses Premium SSD v2 or Ultra Disk for Couchbase's I/O demands and Azure Private Link for secure XDCR connectivity between regions.
# Azure Premium SSD v2 StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-premium-couchbase
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  DiskIOPSReadWrite: "6000"
  DiskMBpsReadWrite: "500"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Ultra Disk StorageClass for high-performance workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-ultra-couchbase
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS
  DiskIOPSReadWrite: "10000"
  DiskMBpsReadWrite: "1000"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# AKS recommended VM sizes:
# Data nodes: Standard_E8s_v5 (8 vCPU, 64 GiB)
# Index/Query: Standard_D8s_v5 (8 vCPU, 32 GiB)
# Analytics: Standard_E16s_v5 (16 vCPU, 128 GiB)
GCP GKE Deployment
Google Kubernetes Engine uses SSD Persistent Disks and Workload Identity for secure access to Google Cloud Storage for backups.
# GKE SSD Persistent Disk StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-couchbase
provisioner: pd.csi.storage.gke.io
parameters:
  # pd-ssd performance scales with disk size; provisioned IOPS/throughput
  # parameters apply to Hyperdisk, not pd-ssd
  type: pd-ssd
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GKE Hyperdisk Balanced for cost-effective performance
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hyperdisk-couchbase
provisioner: pd.csi.storage.gke.io
parameters:
  type: hyperdisk-balanced
  provisioned-iops-on-create: "6000"
  provisioned-throughput-on-create: "500"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GKE recommended machine types:
# Data nodes: n2-highmem-8 (8 vCPU, 64 GB)
# Index/Query: n2-standard-8 (8 vCPU, 32 GB)
# Analytics: n2-highmem-16 (16 vCPU, 128 GB)
Bare Metal k3s/Rancher with Longhorn
For organizations that need full infrastructure control without cloud vendor lock-in, bare metal k3s with Rancher management and Longhorn distributed storage provides an excellent foundation for Couchbase HA. This architecture is popular in regulated industries, edge computing scenarios, and cost-sensitive environments.
# k3s bare metal setup for Couchbase
# Install k3s on master nodes (HA with embedded etcd)
curl -sfL https://get.k3s.io | sh -s - server \
--cluster-init \
--disable traefik \
--disable servicelb \
--write-kubeconfig-mode 644 \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-1
# Join additional server nodes
curl -sfL https://get.k3s.io | sh -s - server \
--server https://master-1:6443 \
--token $(cat /var/lib/rancher/k3s/server/node-token) \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-2
# Join agent nodes
curl -sfL https://get.k3s.io | sh -s - agent \
--server https://master-1:6443 \
--token $(cat /var/lib/rancher/k3s/server/node-token) \
--node-taint couchbase=true:NoSchedule \
--node-label topology.kubernetes.io/zone=rack-3
# Install Longhorn for distributed storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--set defaultSettings.defaultReplicaCount=3 \
--set defaultSettings.defaultDataPath=/mnt/longhorn \
--set defaultSettings.guaranteedInstanceManagerCPU=12
# Create Longhorn StorageClass for Couchbase
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-couchbase
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  dataLocality: best-effort
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
Backup and Restore with cbbackupmgr
Couchbase provides cbbackupmgr, an enterprise backup tool that supports full, incremental, and differential backups with optional compression and encryption. For production HA deployments, a robust backup strategy combines Couchbase-level backups with cloud snapshot capabilities.
Backup Configuration
# Initialize a backup repository
/opt/couchbase/bin/cbbackupmgr config \
--archive /backup/couchbase \
--repo production-backup \
--include-data production-data \
--include-data user-profiles \
--exclude-data _system
# Run a full backup
/opt/couchbase/bin/cbbackupmgr backup \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://localhost \
--username Administrator \
--password "$CB_PASSWORD" \
--threads 4 \
--no-progress-bar
# Run an incremental backup (only mutations since last backup)
/opt/couchbase/bin/cbbackupmgr backup \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://localhost \
--username Administrator \
--password "$CB_PASSWORD" \
--threads 4
# List available backups
/opt/couchbase/bin/cbbackupmgr list \
--archive /backup/couchbase \
--repo production-backup
# Restore from a specific backup
/opt/couchbase/bin/cbbackupmgr restore \
--archive /backup/couchbase \
--repo production-backup \
--cluster couchbase://target-cluster:8091 \
--username Administrator \
--password "$CB_PASSWORD" \
--start 2026-04-12T00_00_00 \
--end 2026-04-12T14_30_00 \
--threads 4
Automated Backup Script for Kubernetes
#!/bin/bash
# couchbase-backup.sh — Automated backup to S3-compatible storage
set -euo pipefail
CLUSTER_HOST="cb-production-srv.couchbase.svc.cluster.local"
BACKUP_DIR="/backup/couchbase"
REPO_NAME="prod-$(date +%Y%m%d)"
S3_BUCKET="s3://couchbase-backups/production"
RETENTION_DAYS=14
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
log "Starting Couchbase backup for cluster: $CLUSTER_HOST"
if [ ! -d "$BACKUP_DIR/$REPO_NAME" ]; then
log "Configuring new backup repository: $REPO_NAME"
cbbackupmgr config \
--archive "$BACKUP_DIR" \
--repo "$REPO_NAME" \
--include-data production-data \
--include-data user-profiles
fi
log "Running incremental backup..."
cbbackupmgr backup \
--archive "$BACKUP_DIR" \
--repo "$REPO_NAME" \
--cluster "couchbase://$CLUSTER_HOST" \
--username "$CB_USERNAME" \
--password "$CB_PASSWORD" \
--threads 4 \
--no-progress-bar
BACKUP_SIZE=$(du -sh "$BACKUP_DIR/$REPO_NAME" | cut -f1)
log "Backup complete. Size: $BACKUP_SIZE"
log "Syncing to S3: $S3_BUCKET/$REPO_NAME"
aws s3 sync "$BACKUP_DIR/$REPO_NAME" "$S3_BUCKET/$REPO_NAME" \
--storage-class STANDARD_IA \
--sse aws:kms
log "Cleaning up backups older than $RETENTION_DAYS days..."
find "$BACKUP_DIR" -maxdepth 1 -name "prod-*" -mtime +"$RETENTION_DAYS" -exec rm -rf {} \;
log "Backup pipeline complete."
CouchbaseBackup CRD (Operator-Managed)
The Autonomous Operator provides CRDs for automated backup management:
apiVersion: couchbase.com/v2
kind: CouchbaseBackup
metadata:
  name: cb-daily-backup
  namespace: couchbase
spec:
  strategy: full_incremental
  full:
    schedule: "0 2 * * 0"    # Full backup every Sunday at 2 AM
  incremental:
    schedule: "0 2 * * 1-6"  # Incremental Mon-Sat at 2 AM
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  backOffLimit: 3
  logRetention: 168h
  size: 100Gi
  s3bucket: s3://couchbase-backups/production
---
apiVersion: couchbase.com/v2
kind: CouchbaseBackupRestore
metadata:
  name: cb-restore-pitr
  namespace: couchbase
spec:
  backup: cb-daily-backup
  repo: "20260412"
  start:
    int: 1
  end:
    int: 5
  backOffLimit: 3
N1QL Query Performance Tuning
N1QL (SQL++ for JSON) is Couchbase's query language. Tuning N1QL performance requires understanding the query planner, index design, and server-side optimizations.
Index Strategies: GSI and FTS
-- Global Secondary Index (GSI) for common query patterns
-- Composite index for user lookups
CREATE INDEX idx_users_email_status
ON `user-profiles`(email, status)
WHERE type = 'user'
WITH {"num_replica": 1, "defer_build": false};
-- Covering index (includes all queried fields to avoid fetch)
CREATE INDEX idx_orders_covering
ON `production-data`(customer_id, order_date, total_amount, status)
WHERE type = 'order'
WITH {"num_replica": 1};
-- Array index for nested documents
CREATE INDEX idx_order_items
ON `production-data`(DISTINCT ARRAY item.product_id FOR item IN items END)
WHERE type = 'order'
WITH {"num_replica": 1};
-- Partial index for active records only
CREATE INDEX idx_active_sessions
ON `production-data`(user_id, created_at)
WHERE type = 'session' AND status = 'active'
WITH {"num_replica": 1};
-- Adaptive index for dynamic query patterns
CREATE INDEX idx_adaptive_products
ON `production-data`(DISTINCT PAIRS(self))
WHERE type = 'product'
WITH {"num_replica": 1};
-- Check index status
SELECT name, state, index_key, `condition`
FROM system:indexes
WHERE keyspace_id = 'production-data';
-- Analyze query execution plan
EXPLAIN SELECT u.name, u.email, COUNT(o.id) AS order_count
FROM `user-profiles` u
JOIN `production-data` o ON o.customer_id = u.id
WHERE u.status = 'active' AND o.type = 'order'
GROUP BY u.name, u.email
ORDER BY order_count DESC
LIMIT 100;
-- Use ADVISE to get index recommendations
ADVISE SELECT * FROM `production-data`
WHERE type = 'order'
AND customer_id = 'cust-12345'
AND order_date BETWEEN '2026-01-01' AND '2026-04-12'
ORDER BY order_date DESC;
Query Optimization Tips
-- Use PREPARE for frequently executed queries (cached plan)
PREPARE get_user_orders AS
SELECT o.id, o.order_date, o.total_amount, o.status
FROM `production-data` o
WHERE o.type = 'order'
AND o.customer_id = $customer_id
ORDER BY o.order_date DESC
LIMIT $page_size OFFSET $page_offset;
-- Execute prepared statement
EXECUTE get_user_orders
USING {"customer_id": "cust-12345", "page_size": 20, "page_offset": 0};
-- Use META().id for direct key-value lookups (fastest path)
SELECT META().id, *
FROM `production-data`
USE KEYS ["order::2026-001", "order::2026-002", "order::2026-003"];
-- Correlated subquery with USE KEYS for joins
SELECT u.name,
(SELECT o.id, o.total_amount
FROM `production-data` o
USE KEYS u.order_ids
WHERE o.status = 'completed') AS completed_orders
FROM `user-profiles` u
WHERE META(u).id = 'user::12345';
-- Use INFER to understand document schema
INFER `production-data` WITH {"sample_size": 10000, "similarity_metric": 0.6};
Memory Management and Bucket Configuration
Couchbase's memory-first architecture means RAM allocation directly impacts performance. Each service has its own memory quota, and buckets share the Data Service quota. Proper sizing prevents cache evictions that degrade latency.
# Configure cluster-level memory quotas
/opt/couchbase/bin/couchbase-cli setting-cluster \
--cluster localhost:8091 \
--username Administrator \
--password password \
--cluster-ramsize 8192 \
--cluster-index-ramsize 4096 \
--cluster-fts-ramsize 2048 \
--cluster-eventing-ramsize 2048 \
--cluster-analytics-ramsize 4096
# Memory allocation guidelines:
# Data Service: 60% of available node RAM
# Index Service: 20% of available node RAM
# Search Service: 10% of available node RAM
# OS/overhead: 10% reserved
# Bucket memory sizing formula:
# Required RAM = (avg_doc_size * num_docs * 2.5) / num_data_nodes
# The 2.5 multiplier accounts for:
# - Metadata overhead (~56 bytes per document)
# - Internal fragmentation
# - Replica copies in memory
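The sizing formula above translates directly into a calculator (the 2.5x overhead factor is the rule of thumb from the comments, not an exact figure for every workload):

```python
def bucket_ram_per_node_mb(avg_doc_size_bytes: int, num_docs: int,
                           num_data_nodes: int, overhead: float = 2.5) -> int:
    # Working set times the overhead factor (metadata, fragmentation,
    # in-memory replicas), divided evenly across the Data nodes
    total_bytes = avg_doc_size_bytes * num_docs * overhead
    return int(total_bytes / num_data_nodes / (1024 * 1024))

# 10M documents averaging 2 KiB, spread across 4 Data nodes
print(bucket_ram_per_node_mb(2048, 10_000_000, 4))
```

Compare the result against the per-node Data Service quota before creating the bucket; a result near or above the quota predicts cache evictions under load.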
# Create an optimized production bucket
curl -X POST http://localhost:8091/pools/default/buckets \
-u Administrator:password \
-d name=production-data \
-d ramQuota=4096 \
-d bucketType=couchbase \
-d replicaNumber=2 \
-d threadsNumber=8 \
-d evictionPolicy=valueOnly \
-d compressionMode=active \
-d maxTTL=0 \
-d conflictResolutionType=lww \
-d flushEnabled=0 \
-d durabilityMinLevel=majorityAndPersistActive
# Eviction policies:
# valueOnly - Evicts document values but keeps metadata in RAM
# Best for workloads where key access patterns are predictable
# fullEviction - Evicts both values and metadata
# Best for very large datasets that exceed available RAM
# noEviction - (Ephemeral buckets only) Rejects writes when RAM is full
# Best for caching use cases
TLS Encryption and RBAC
Securing Couchbase in production requires encryption of data in transit (TLS), fine-grained role-based access control (RBAC), and audit logging.
# Enable TLS for all Couchbase services
/opt/couchbase/bin/couchbase-cli ssl-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set-node-certificate
# Enforce minimum TLS version
/opt/couchbase/bin/couchbase-cli setting-security \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--tls-min-version tlsv1.2 \
--tls-honor-cipher-order 1 \
--hsts-max-age 31536000 \
--hsts-preload-enabled 1
# Create application-specific RBAC users
# Capture the generated password in a variable so it can be stored
# in your secrets manager; otherwise it is lost after this command runs
APP_PASSWORD="$(openssl rand -base64 32)"
/opt/couchbase/bin/couchbase-cli user-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--rbac-username app-service \
--rbac-password "$APP_PASSWORD" \
--rbac-name "Application Service Account" \
--roles 'data_reader[production-data],data_writer[production-data],query_select[production-data],query_insert[production-data],query_update[production-data],query_delete[production-data]' \
--auth-domain local
# Create a read-only analytics user (again, capture the generated password)
ANALYTICS_PASSWORD="$(openssl rand -base64 32)"
/opt/couchbase/bin/couchbase-cli user-manage \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--rbac-username analytics-reader \
--rbac-password "$ANALYTICS_PASSWORD" \
--roles 'data_reader[production-data],query_select[production-data],analytics_reader[production-data]' \
--auth-domain local
# Enable audit logging
/opt/couchbase/bin/couchbase-cli setting-audit \
--cluster localhost:8091 \
--username Administrator \
--password password \
--set \
--audit-enabled 1 \
--audit-log-path /opt/couchbase/var/lib/couchbase/logs \
--audit-log-rotate-interval 86400 \
--audit-log-rotate-size 20971520
Kubernetes TLS with cert-manager
# Certificate for Couchbase server TLS
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: couchbase-server-tls
namespace: couchbase
spec:
secretName: couchbase-server-tls
duration: 8760h # 1 year
renewBefore: 720h # 30 days before expiry
privateKey:
algorithm: RSA
size: 4096
usages:
- server auth
- client auth
dnsNames:
- "*.cb-production.couchbase.svc.cluster.local"
- "*.cb-production.couchbase.svc"
- "cb-production-srv.couchbase.svc.cluster.local"
- "localhost"
issuerRef:
name: couchbase-ca-issuer
kind: ClusterIssuer
Monitoring with Prometheus Exporter
Couchbase exposes rich metrics through its REST API. The couchbase-exporter translates these into Prometheus format for comprehensive monitoring.
# Deploy Couchbase Prometheus Exporter
apiVersion: apps/v1
kind: Deployment
metadata:
name: couchbase-exporter
namespace: couchbase
spec:
replicas: 1
selector:
matchLabels:
app: couchbase-exporter
template:
metadata:
labels:
app: couchbase-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9091"
spec:
containers:
- name: exporter
image: couchbase/exporter:1.0.9
args:
- --couchbase-address=cb-production-srv.couchbase.svc.cluster.local
- --couchbase-port=8091
- --couchbase-username=$(CB_USERNAME)
- --couchbase-password=$(CB_PASSWORD)
- --server-address=0.0.0.0:9091
- --per-node-refresh=5
env:
- name: CB_USERNAME
valueFrom:
secretKeyRef:
name: cb-admin-credentials
key: username
- name: CB_PASSWORD
valueFrom:
secretKeyRef:
name: cb-admin-credentials
key: password
ports:
- containerPort: 9091
name: metrics
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 256Mi
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: couchbase-monitor
namespace: couchbase
spec:
selector:
matchLabels:
app: couchbase-exporter
endpoints:
- port: metrics
interval: 15s
path: /metrics
Key Couchbase Metrics to Monitor
- cb_bucket_ops_per_sec — Total operations per second per bucket. Baseline your normal throughput and alert on anomalies.
- cb_bucket_mem_used_bytes — Bucket memory usage. Alert when approaching the RAM quota to prevent evictions.
- cb_bucket_cache_miss_ratio — Ratio of requests that miss the cache and require disk fetch. Should stay below 2% for optimal performance.
- cb_bucket_disk_queue_items — Disk write queue depth. A growing queue indicates disk I/O cannot keep up with write throughput.
- cb_xdcr_changes_left — Number of mutations pending XDCR replication. Indicates cross-region replication lag.
- cb_xdcr_docs_written — Documents replicated per second through XDCR.
- cb_node_cpu_utilization_percent — Per-node CPU usage. Couchbase is CPU-intensive for compaction and indexing.
- cb_bucket_vbucket_active_num — Number of active vBuckets per node. Should be roughly even across data nodes.
- cb_index_num_docs_pending — Documents pending index update. Indicates index build lag.
- cb_n1ql_requests_per_sec — N1QL query throughput. Combined with average latency, identifies query performance issues.
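The cache miss ratio in the list above is derived from two raw bucket stats: background disk fetches divided by total reads. A sketch of that computation, assuming the conventional stat names `ep_bg_fetched` and `cmd_get`:

```python
# Illustrative computation of the cache miss ratio from raw bucket stats,
# assuming the conventional definition: background disk fetches
# (ep_bg_fetched) divided by total read operations (cmd_get).

def cache_miss_ratio(ep_bg_fetched, cmd_get):
    """Fraction of reads that had to go to disk; 0.0 when there are no reads."""
    if cmd_get == 0:
        return 0.0
    return ep_bg_fetched / cmd_get

# A healthy bucket: 150 disk fetches out of 10,000 reads is 1.5%,
# under the 2% target noted above
ratio = cache_miss_ratio(150, 10_000)
```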
Prometheus Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: couchbase-alerts
namespace: couchbase
spec:
groups:
- name: couchbase.rules
rules:
- alert: CouchbaseNodeDown
expr: cb_node_healthy == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Couchbase node {{ $labels.node }} is unhealthy"
- alert: CouchbaseHighCacheMissRate
expr: cb_bucket_cache_miss_ratio > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Cache miss rate {{ $value | humanizePercentage }} on bucket {{ $labels.bucket }}"
- alert: CouchbaseXDCRLag
expr: cb_xdcr_changes_left > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "XDCR replication lag: {{ $value }} pending mutations"
- alert: CouchbaseDiskQueueGrowing
expr: deriv(cb_bucket_disk_queue_items[5m]) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "Disk queue growing on bucket {{ $labels.bucket }}"
- alert: CouchbaseMemoryPressure
expr: cb_bucket_mem_used_bytes / cb_bucket_mem_quota_bytes > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "Memory usage at {{ $value | humanizePercentage }} for bucket {{ $labels.bucket }}"
SDK Connection String Configuration for HA
Couchbase SDKs are topology-aware — they maintain an internal cluster map and route operations directly to the correct node. Proper SDK configuration is critical for HA, ensuring fast failover detection and automatic retry on transient errors.
// Node.js SDK configuration for HA
const couchbase = require('couchbase');
const clusterConnStr = 'couchbases://cb-node1.example.com,cb-node2.example.com,cb-node3.example.com';
const cluster = await couchbase.connect(clusterConnStr, {
username: process.env.CB_USERNAME,
password: process.env.CB_PASSWORD,
timeouts: {
kvTimeout: 2500, // Key-value operation timeout (ms)
kvDurableTimeout: 10000, // Durable write timeout
queryTimeout: 75000, // N1QL query timeout
searchTimeout: 75000, // FTS search timeout
analyticsTimeout: 75000, // Analytics query timeout
connectTimeout: 10000, // Initial connection timeout
managementTimeout: 75000 // Management API timeout
},
security: {
trustStorePath: '/etc/couchbase/ca.pem'
},
transactions: {
durabilityLevel: couchbase.DurabilityLevel.MajorityAndPersistToActive,
timeout: 15000
}
});
const bucket = cluster.bucket('production-data');
const collection = bucket.defaultCollection();
// Durable write with synchronous durability (majority replication
// plus persistence on the active node before the server acknowledges)
await collection.upsert('order::2026-001', orderDocument, {
durabilityLevel: couchbase.DurabilityLevel.MajorityAndPersistToActive,
timeout: 10000
});
// Read with replica fallback for HA
try {
const result = await collection.get('user::12345');
} catch (err) {
if (err instanceof couchbase.TimeoutError) {
const replicaResult = await collection.getAnyReplica('user::12345');
}
}
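The timeout-then-replica pattern above generalizes to any transient fault: retry the active copy briefly, then accept a possibly stale replica read. A minimal sketch with the SDK calls replaced by injectable callables so the control flow is explicit (`KvTimeout`, `get_primary`, and `get_replica` are placeholders, not Couchbase SDK names):

```python
import time

# Sketch of the timeout-then-replica-read pattern shown above. The SDK
# calls are replaced by injectable callables; KvTimeout, get_primary, and
# get_replica are placeholders, not real SDK APIs.

class KvTimeout(Exception):
    pass

def get_with_replica_fallback(key, get_primary, get_replica,
                              retries=2, backoff_s=0.01):
    """Try the active vBucket first; on timeout, retry with exponential
    backoff, then fall back to a replica read (which may be slightly stale)."""
    for attempt in range(retries):
        try:
            return get_primary(key)
        except KvTimeout:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return get_replica(key)  # last resort: replica copy may lag the active
```

The design trade-off mirrors the SDK's `getAnyReplica`: availability over strict consistency, appropriate for reads that tolerate a brief replication lag.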
// Java SDK configuration for HA
import com.couchbase.client.java.*;
import com.couchbase.client.java.env.*;
import java.nio.file.Paths;
import java.time.Duration;
ClusterEnvironment env = ClusterEnvironment.builder()
.timeoutConfig(TimeoutConfig.builder()
.kvTimeout(Duration.ofMillis(2500))
.kvDurableTimeout(Duration.ofSeconds(10))
.queryTimeout(Duration.ofSeconds(75))
.connectTimeout(Duration.ofSeconds(10))
.build())
.ioConfig(IoConfig.builder()
.numKvConnections(4)
.enableMutationTokens(true)
.enableDnsSrv(true)
.build())
.securityConfig(SecurityConfig.builder()
.enableTls(true)
.trustCertificate(Paths.get("/etc/couchbase/ca.pem"))
.build())
.build();
Cluster cluster = Cluster.connect(
"couchbases://cb-node1.example.com,cb-node2.example.com",
ClusterOptions.clusterOptions("username", "password")
.environment(env)
);
Kubernetes Service DNS for SDK Connection
# When using the Autonomous Operator, connect via the headless service:
# couchbase://cb-production-srv.couchbase.svc.cluster.local
#
# The operator creates these services:
# cb-production-srv - Headless service for SDK auto-discovery
# cb-production-ui - Web Console (port 8091/18091)
# cb-production-cloud - External connectivity (NodePort/LoadBalancer)
#
# For external SDK access (outside Kubernetes), use:
# - NodePort with explicit node addresses
# - LoadBalancer with MetalLB (bare metal)
# - Ingress with TCP passthrough for port 11210 (SDK) and 11207 (SDK TLS)
Couchbase Mobile and Sync Gateway for Edge Deployments
Couchbase Mobile extends the Couchbase ecosystem to edge devices and mobile applications. Couchbase Lite runs embedded on iOS, Android, and IoT devices, while Sync Gateway acts as the synchronization middleware between Couchbase Lite and Couchbase Server.
// Sync Gateway configuration for production
{
"interface": ":4984",
"adminInterface": "127.0.0.1:4985",
"logging": {
"console": {
"log_level": "info",
"log_keys": ["HTTP", "Sync", "Auth", "Changes"]
}
},
"databases": {
"mobile-app": {
"server": "couchbases://cb-production-srv.couchbase.svc.cluster.local",
"bucket": "production-data",
"username": "sync-gateway",
"password": "${SG_PASSWORD}",
"enable_shared_bucket_access": true,
"import_docs": true,
"num_index_replicas": 1,
"delta_sync": {
"enabled": true,
"rev_max_age_seconds": 86400
},
"cache": {
"channel_cache": {
"max_number": 50000,
"compact_high_watermark_pct": 80,
"compact_low_watermark_pct": 60
},
"rev_cache": {
"size": 5000,
"shard_count": 16
}
},
"users": {
"GUEST": {"disabled": true}
},
"sync": "function(doc, oldDoc) { if (doc.type === 'user-data') { channel(doc.channels); requireAccess(doc.channels); } else { channel('public'); } }"
}
}
}
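The inline sync function in the config above routes documents to channels. Transcribed into Python for readability (purely illustrative — Sync Gateway only executes the JavaScript version):

```python
# Python transcription of the JavaScript sync function above, to make the
# channel-routing logic easier to read. Sync Gateway itself runs only the
# JavaScript version; this sketch is illustrative.

def route_document(doc):
    """Mirror the sync function: user-data documents go to (and require
    write access to) their own channels; everything else is public."""
    if doc.get("type") == "user-data":
        chans = doc.get("channels", [])
        return {"channels": chans, "require_access": chans}
    return {"channels": ["public"], "require_access": []}
```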
# Deploy Sync Gateway on Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
name: sync-gateway
namespace: couchbase
spec:
replicas: 3
selector:
matchLabels:
app: sync-gateway
template:
metadata:
labels:
app: sync-gateway
spec:
containers:
- name: sync-gateway
image: couchbase/sync-gateway:3.1.4-enterprise
args: ["/etc/sync-gateway/config.json"]
ports:
- containerPort: 4984
name: public
- containerPort: 4985
name: admin
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
volumeMounts:
- name: config
mountPath: /etc/sync-gateway
volumes:
- name: config
configMap:
name: sync-gateway-config
Capacity Planning and Sizing
Proper capacity planning is essential for Couchbase performance and cost optimization. The following table provides sizing guidelines based on workload tier:
| Workload Tier | Data Nodes | Index/Query | RAM per Node | Storage | Throughput |
|---|---|---|---|---|---|
| Development | 1 (all services) | Co-located | 4 GB | 20 GB SSD | <1k ops/s |
| Small Production | 3 Data | 2 Query+Index | 16 GB | 100 GB SSD | 10k ops/s |
| Medium Production | 5 Data | 3 Query+Index | 32 GB | 500 GB SSD | 50k ops/s |
| Large Production | 7-10 Data | 4+ Query+Index | 64 GB | 1 TB NVMe | 200k+ ops/s |
| Enterprise / Global | 10+ Data (multi-region) | 6+ Query+Index | 128 GB | 2+ TB NVMe | 500k+ ops/s |
Sizing Formula
# Data Service RAM sizing
# Required RAM per node = (num_documents * (doc_metadata_size + avg_value_size)) / num_data_nodes * (1 + num_replicas)
# doc_metadata_size = 56 bytes (fixed overhead per document)
# Include 25% headroom for fragmentation and growth
# Example: 100M documents, 1KB avg size, 3 data nodes, 1 replica
# RAM = (100,000,000 * (56 + 1024)) / 3 * 2 = ~72 GB per node
# With 25% headroom: ~90 GB per node
# Index Service RAM sizing (memory-optimized)
# RAM = total_index_size * 3 (for build/merge overhead)
# Use system:indexes to check current index sizes
# Disk sizing
# Disk = (num_documents * avg_doc_size * (1 + num_replicas)) * 3 (compaction headroom)
# Use SSD/NVMe with provisioned IOPS for predictable performance
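The formulas above can be implemented directly. A sketch reproducing the worked example, where the constants (56-byte metadata, 25% headroom, 3x compaction factor) come straight from the comments and GB means decimal gigabytes:

```python
# Illustrative implementation of the sizing formulas above; constants come
# from the text, and GB here means decimal gigabytes (1e9 bytes).

DOC_METADATA_BYTES = 56  # fixed overhead per document

def data_ram_per_node_gb(num_docs, avg_value_bytes, num_data_nodes,
                         num_replicas, headroom=0.25):
    """Per-node Data Service RAM, including replicas and 25% headroom."""
    raw = num_docs * (DOC_METADATA_BYTES + avg_value_bytes)
    per_node = raw / num_data_nodes * (1 + num_replicas)
    return per_node * (1 + headroom) / 1e9

def disk_per_cluster_gb(num_docs, avg_doc_bytes, num_replicas,
                        compaction_factor=3):
    """Cluster-wide disk, with 3x headroom for compaction."""
    return (num_docs * avg_doc_bytes * (1 + num_replicas)
            * compaction_factor / 1e9)

# Worked example from above: 100M docs, 1 KB values, 3 data nodes, 1 replica
# -> ~72 GB raw per node, ~90 GB once the 25% headroom is applied
ram = data_ram_per_node_gb(100_000_000, 1024, 3, 1)
```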
Disaster Recovery and Failover Procedures
A comprehensive disaster recovery plan ensures business continuity when infrastructure failures exceed the scope of automatic failover.
Single Node Failure (Automatic)
# Auto-failover handles single node failures automatically.
# Verify failover occurred:
/opt/couchbase/bin/couchbase-cli server-list \
--cluster localhost:8091 \
--username Administrator \
--password password
# After replacing the failed node, add and rebalance:
/opt/couchbase/bin/couchbase-cli server-add \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-add new-node.example.com:8091 \
--server-add-username Administrator \
--server-add-password password \
--services data,index
/opt/couchbase/bin/couchbase-cli rebalance \
--cluster localhost:8091 \
--username Administrator \
--password password
Complete Cluster Failure (Manual)
# Scenario: Primary region (US-EAST) completely lost
# Step 1: Verify XDCR target cluster (EU-WEST) has latest data
# Check XDCR replication status before failure
curl -s http://eu-west-node:8091/pools/default/remoteClusters \
-u Administrator:password | jq .
# Step 2: Pause XDCR replications pointing to the failed cluster
/opt/couchbase/bin/couchbase-cli xdcr-replicate \
--cluster cb-eu-west.example.com:8091 \
--username Administrator \
--password password \
--pause \
--xdcr-replicator <replication-id>
# Step 3: Update application connection strings to EU-WEST cluster
# (via DNS update, service mesh, or environment variable change)
# Step 4: Scale up EU-WEST cluster if needed to handle full production load
# In Kubernetes, update the CouchbaseCluster CRD.
# Note: a JSON merge patch replaces the entire servers list,
# so include every server class you want to keep:
kubectl patch couchbasecluster cb-eu-west -n couchbase --type merge \
-p '{"spec":{"servers":[{"name":"data-zone-a","size":4}]}}'
# Step 5: After US-EAST cluster is restored, re-establish XDCR
# and perform a full resync from EU-WEST back to US-EAST
Graceful Failover and Recovery
# Graceful failover (for maintenance, drains data before removal)
/opt/couchbase/bin/couchbase-cli failover \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-failover node-to-remove.example.com:8091
# Recovery (re-add the node after maintenance)
/opt/couchbase/bin/couchbase-cli recovery \
--cluster localhost:8091 \
--username Administrator \
--password password \
--server-recovery node-to-recover.example.com:8091 \
--recovery-type delta
# Delta recovery re-synchronizes only the changed data,
# which is much faster than full recovery.
# Full recovery rebuilds the node from scratch.
# Rebalance to complete the recovery
/opt/couchbase/bin/couchbase-cli rebalance \
--cluster localhost:8091 \
--username Administrator \
--password password
Helm Values for Complete Production Deployment
Below is a comprehensive Helm values file for deploying Couchbase with the Autonomous Operator in a production environment:
# helm-values-production.yaml
couchbase-operator:
operator:
image:
repository: couchbase/operator
tag: 2.7.1
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
admissionController:
enabled: true
resources:
requests:
cpu: 100m
memory: 128Mi
cluster:
image: couchbase/server:7.6.1-enterprise
antiAffinity: true
autoFailoverTimeout: 30s
autoFailoverMaxCount: 3
autoFailoverOnDataDiskIssues: true
autoFailoverServerGroup: true
security:
adminSecret: cb-admin-credentials
networking:
tls:
static:
serverSecret: couchbase-server-tls
operatorSecret: couchbase-operator-tls
exposeAdminConsole: true
adminConsoleServiceType: NodePort
buckets:
managed: true
servers:
data:
size: 4
services:
- data
- index
serverGroups:
- zone-a
- zone-b
resources:
requests:
cpu: "4"
memory: 16Gi
limits:
cpu: "8"
memory: 20Gi
volumeMounts:
default: couchbase-data
data: couchbase-data
index: couchbase-index
query:
size: 2
services:
- query
- search
resources:
requests:
cpu: "4"
memory: 8Gi
limits:
cpu: "8"
memory: 12Gi
volumeMounts:
default: couchbase-default
analytics:
size: 2
services:
- analytics
- eventing
serverGroups:
- zone-c
resources:
requests:
cpu: "8"
memory: 32Gi
limits:
cpu: "16"
memory: 40Gi
volumeMounts:
default: couchbase-analytics
analytics:
- couchbase-analytics
serverGroups:
- zone-a
- zone-b
- zone-c
volumeClaimTemplates:
- metadata:
name: couchbase-data
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 100Gi
- metadata:
name: couchbase-index
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 50Gi
- metadata:
name: couchbase-default
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 20Gi
- metadata:
name: couchbase-analytics
spec:
storageClassName: ebs-gp3-couchbase
resources:
requests:
storage: 200Gi
Conclusion
Couchbase Server's architecture—built around vBucket-based sharding, memory-first data access, and integrated multi-model services—provides a uniquely powerful foundation for high availability production deployments. The combination of intra-cluster replication with automatic failover ensures that single-node failures are handled transparently, while XDCR extends this resilience across geographic regions for global applications.
The Couchbase Autonomous Operator for Kubernetes transforms what would be complex manual operations into declarative, self-healing deployments. Server groups provide rack/zone awareness, the operator manages rebalance operations during scaling events, and integrated backup CRDs automate disaster recovery preparation.
Key takeaways from this guide:
- Leverage Multi-Dimensional Scaling — Separate Data, Index, Query, Search, Analytics, and Eventing services onto dedicated node pools for independent scaling and resource isolation.
- Configure XDCR for multi-region resilience — Bidirectional XDCR with timestamp-based conflict resolution enables active-active deployments across AWS, Azure, and GCP. Always ensure NTP synchronization.
- Use server groups for zone awareness — Map server groups to availability zones or racks to guarantee that active and replica vBuckets are in different failure domains.
- Size memory carefully — Couchbase's performance is directly tied to how much of the working set fits in RAM. Use the sizing formulas and monitor cache miss ratios.
- Implement comprehensive monitoring — Deploy the Prometheus exporter from day one. XDCR replication lag, cache miss ratio, disk queue depth, and node health are your critical signals.
- Automate backups with cbbackupmgr — Combine full and incremental backups with cloud snapshots. Test restore procedures regularly.
- Secure with TLS and RBAC — Enable node-to-node and client-to-node TLS encryption. Use fine-grained RBAC roles for every application service account.
- Configure SDKs for HA — Use multiple bootstrap nodes, configure appropriate timeouts, implement replica reads as fallback, and leverage durable writes for critical data.
- Plan for disaster recovery — Document and rehearse failover procedures for single-node, multi-node, and complete cluster failure scenarios. XDCR standby clusters should be ready for promotion at all times.
With this comprehensive foundation, you are equipped to deploy and operate Couchbase Server in high availability production environments across any infrastructure—from managed Kubernetes on AWS, Azure, and GCP to bare metal k3s clusters managed by Rancher. The combination of Couchbase's native distributed architecture with Kubernetes orchestration delivers a database platform that meets the demands of modern, globally distributed applications.