Upgrades are where database reliability is either proven or broken. This article provides a paced, support-friendly runbook for upgrading Couchbase Server under the Couchbase Autonomous Operator (CAO), using the native spec.paused field to gate progress between nodes. The result: one node at a time, a stabilization window between swaps, clearer signals, and a larger rollback window.
Reference topology
The paced upgrade loop
Goals
- Upgrade Couchbase Server with minimal risk.
- Keep a deliberate pause + stabilize + health check window between nodes.
- Maintain rollback options for as long as practical.
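The loop itself — swap one node, pause the Operator, stabilize, check health, resume — can be sketched as a small shell function. The function name, arguments, and the 300-second default window below are illustrative choices, not part of CAO; only `spec.paused` is the Operator's own field.

```shell
# Sketch of one pause gate between node swaps. "pause_gate" and its
# arguments are illustrative; adjust the window to your environment.
pause_gate() {
  local ns="$1" cluster="$2" stabilize_secs="${3:-300}"
  # Pause: the Operator makes no further changes while we observe.
  kubectl -n "$ns" patch couchbasecluster "$cluster" --type merge \
    -p '{"spec":{"paused":true}}'
  sleep "$stabilize_secs"   # stabilization window between swaps
  # Resume: the Operator proceeds to the next node.
  kubectl -n "$ns" patch couchbasecluster "$cluster" --type merge \
    -p '{"spec":{"paused":false}}'
}
```

Run your health checks during the sleep window, and never leave the cluster paused unattended.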
Pre-upgrade checklist (do not skip)
- All green: cluster phase Available, no active rebalance, no warning events.
- Backups current and restorable: full backup completed; restore drill completed, or the time a restore would take is understood.
- Rollback tag recorded: verify the old image still exists and can be pulled.
- XDCR decision recorded: disable during prod upgrades for a clean signal (recommended), or keep running in pre-prod to exercise behaviour.
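The rollback-tag check above can be scripted. This sketch assumes a docker CLI with access to your registry (`docker manifest inspect` queries the registry without pulling layers; `skopeo inspect` or `crane manifest` are equivalents), and the function name is illustrative.

```shell
# Sketch: confirm the rollback image tag still exists in the registry.
# Assumes docker CLI with registry credentials already configured.
check_rollback_tag() {
  docker manifest inspect "$1" > /dev/null 2>&1
}
```

Record the exact tag you verified alongside the upgrade ticket so the rollback path is unambiguous.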
Quick verification commands
```shell
export ENV=dev
export REGION=west
export NS=couchbase-${ENV}-${REGION}

kubectl -n "$NS" get couchbasecluster -o wide
kubectl -n "$NS" get pods -l app=couchbase
kubectl -n "$NS" get events --field-selector type=Warning | tail -20
kubectl -n "$NS" get couchbasecluster "$NS" -o jsonpath='paused={.spec.paused} phase={.status.phase} rebalance={.status.rebalanceProgress}{"\n"}'
```
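The last command's output can be folded into a yes/no gate before each swap. Whether an idle cluster reports an empty `rebalanceProgress` or the string `none` may vary by Operator version, so this sketch (with an illustrative function name) accepts both.

```shell
# Sketch: proceed only when the cluster looks healthy. Assumes $NS is
# both the namespace and the CouchbaseCluster name, as in the commands above.
check_health() {
  local phase rebalance
  phase=$(kubectl -n "$NS" get couchbasecluster "$NS" \
    -o jsonpath='{.status.phase}')
  rebalance=$(kubectl -n "$NS" get couchbasecluster "$NS" \
    -o jsonpath='{.status.rebalanceProgress}')
  # Healthy: phase Available and no rebalance in progress.
  [ "$phase" = "Available" ] && { [ -z "$rebalance" ] || [ "$rebalance" = "none" ]; }
}
```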
Execution paths
- Preferred: run the upgrade from your CI workflow (dry-run first, then real run).
- Fallback: run the paced upgrade script from a workstation (dry-run first, then real run).
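Whichever path you choose, the dry-run-first discipline can be enforced with a tiny wrapper. `DRY_RUN`, `run`, and the version tag below are illustrative assumptions, not part of any shipped tooling.

```shell
# Sketch of a dry-run gate: default to printing, require an explicit
# opt-in to execute. DRY_RUN and "run" are illustrative names.
DRY_RUN="${DRY_RUN:-true}"
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# Example: bump the server image (tag and namespace are placeholders).
run kubectl -n "${NS:-couchbase-dev-west}" patch couchbasecluster \
  "${NS:-couchbase-dev-west}" --type merge \
  -p '{"spec":{"image":"couchbase/server:7.6.2"}}'
```

Review the dry-run output, then re-run with `DRY_RUN=false` once it matches your intent.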
Monitoring signals (what support should watch)
- Pod images: shifting old → new; one swap at a time is ideal.
- Pause state: spec.paused toggles to true during stabilization; never leave it true unattended.
- Rebalance: returns to none between swaps; investigate persistent rebalances.
- XDCR: changes_left spikes during rebalance and drains during stabilization; failure to drain is an incident signal.
- Restarts: any unexpected restarts post-swap are a red flag.
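To watch images roll one swap at a time, a helper like this (illustrative name, reusing the `app=couchbase` label from the verification commands) prints one pod name and image per line:

```shell
# Sketch: list each pod with the image its first container runs,
# so support can watch old -> new progress one swap at a time.
list_images() {
  kubectl -n "$NS" get pods -l app=couchbase \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
}
```

Run it between swaps; more than one pod changing image at once is a signal to pause and investigate.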
Rollback triggers
- Node fails to become healthy within your timeout window.
- Rebalance fails and does not resolve with a single retry after investigation.
- Application error rate exceeds the agreed tolerance.
- XDCR fails to recover after the agreed recovery window.
- Any bucket becomes unavailable (missing vbuckets) — treat as P1.
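When a trigger fires while spec.paused is still true, the first mechanical step is to revert the spec to the recorded rollback tag and let the Operator act on it. How safe a full downgrade is depends on how far the upgrade progressed, so treat this strictly as a sketch of the revert step; `rollback` and its arguments are illustrative.

```shell
# Sketch: revert the image to the recorded rollback tag and unpause
# so the Operator acts on it. Assumes OLD_IMAGE was captured in the
# pre-upgrade checklist.
rollback() {
  local ns="$1" cluster="$2" old_image="$3"
  kubectl -n "$ns" patch couchbasecluster "$cluster" --type merge \
    -p "{\"spec\":{\"image\":\"${old_image}\",\"paused\":false}}"
}
```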
Post-upgrade validation (sign-off)
- All pods on the target image
- Cluster phase Available
- No new warning events for 30+ minutes
- Backup succeeded post-upgrade
- XDCR steady-state recovered (if used)
- Application dashboards green for 30+ minutes
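The first sign-off item can be checked mechanically. `all_on_target` is an illustrative helper that succeeds only if every pod reports the target image:

```shell
# Sketch: succeed only when every couchbase pod runs the target image.
all_on_target() {
  local target="$1" img
  for img in $(kubectl -n "$NS" get pods -l app=couchbase \
      -o jsonpath='{.items[*].spec.containers[0].image}'); do
    [ "$img" = "$target" ] || return 1
  done
}
```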
Tip: If you want a shorter narrative version first, start with the blog overview: Couchbase upgrades with CAO pause gates.