Redis High Availability in Production: Sentinel, Cluster, and Kubernetes Operators
Redis is the backbone of modern application infrastructure. It serves as a cache, session store, message broker, rate limiter, leaderboard engine, and real-time analytics pipeline for millions of applications worldwide. A single Redis instance can handle hundreds of thousands of operations per second with sub-millisecond latency — but a single instance is also a single point of failure. When Redis goes down, applications experience cascading failures: cache stampedes overwhelm backend databases, sessions are lost, rate limiters stop working, and real-time features go dark. Building a highly available Redis deployment is not optional for production systems — it is an engineering requirement.
This guide is a comprehensive, production-focused deep dive into Redis high availability. We will cover Redis replication fundamentals (asynchronous replication, the WAIT command, and partial resynchronisation); Redis Sentinel for automatic failover and service discovery; Redis Cluster for horizontal scaling with hash slot distribution; Kubernetes operators (Spotahome, OpsTree, and Redis Enterprise); persistence strategies (RDB snapshots, AOF, and hybrid persistence); cloud-managed deployments on AWS ElastiCache, Azure Cache for Redis, and GCP Memorystore; and bare metal k3s deployments with Rancher and Longhorn. We will also examine memory management and eviction policies; TLS encryption and ACL-based access control; Pub/Sub and Streams behaviour in HA configurations; Redis modules (RedisJSON, RediSearch, RedisTimeSeries) in HA; connection pooling and client configuration for failover resilience; backup and restore strategies; monitoring with Redis INFO, the Prometheus exporter, and Grafana dashboards; Dragonfly and KeyDB as Redis-compatible alternatives; performance tuning with pipelining, Lua scripting, and memory optimisation; common failure scenarios and troubleshooting procedures; and capacity planning and scaling strategies.
Redis Replication Fundamentals
Redis replication is the foundation upon which all high availability architectures are built. A Redis master instance accepts writes and asynchronously propagates them to one or more replica instances. Replicas maintain a near-real-time copy of the master's dataset and serve read queries, providing both data redundancy and read scalability.
Unlike PostgreSQL's WAL-based streaming replication, Redis uses a command-based replication protocol. Every write command executed on the master is serialised into the replication stream and sent to connected replicas, which execute the same commands against their local datasets. This approach is simple and efficient but has important implications for consistency — since replication is asynchronous by default, there is always a window where acknowledged writes on the master have not yet reached replicas.
Asynchronous Replication and the WAIT Command
By default, Redis replication is fully asynchronous. The master acknowledges a write to the client immediately after applying it locally, without waiting for any replica to confirm receipt. This gives maximum write throughput but introduces a potential data loss window — if the master crashes before a write reaches any replica, that write is lost.
The WAIT command provides a synchronous replication primitive. After issuing a write, the client can call WAIT numreplicas timeout to block until the specified number of replicas have acknowledged the write or the timeout expires. This does not make Redis fully synchronous — WAIT only guarantees that replicas have received the data, not that it has been persisted to disk on the replicas. However, it significantly reduces the data loss window.
# Write a critical value and wait for 2 replicas to acknowledge
SET order:12345 '{"status":"confirmed","amount":599.99}'
WAIT 2 5000
# Returns the number of replicas that acknowledged within 5000ms
# Returns 0 if no replica acknowledged (timeout or no replicas connected)
Use WAIT selectively for critical writes (financial transactions, order confirmations) while allowing non-critical writes (cache updates, session refreshes) to proceed asynchronously. The per-command flexibility avoids the latency penalty of full synchronous replication.
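This selective pattern can be sketched as a small helper. The function name `set_with_ack` is our own, and the client is duck-typed: it assumes a redis-py-style object exposing `set(key, value)` and `wait(numreplicas, timeout_ms)`.

```python
def set_with_ack(client, key, value, min_replicas=2, timeout_ms=5000):
    """Write a key, then block until min_replicas acknowledge receipt.

    Returns True if enough replicas acknowledged within the timeout.
    Assumes a redis-py-style client: set(key, value) and
    wait(numreplicas, timeout_ms) -> number of acknowledging replicas.
    """
    client.set(key, value)
    acked = client.wait(min_replicas, timeout_ms)
    return acked >= min_replicas

# Critical write: require 2 replica acknowledgements before proceeding
# if not set_with_ack(r, "order:12345", payload):
#     handle_degraded_durability()
#
# Non-critical cache update: plain asynchronous SET, no WAIT
# r.set("cache:user:123", profile_json, ex=300)
```

Keeping WAIT out of the hot cache path preserves throughput while still narrowing the loss window for the writes that matter.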
Partial Resynchronisation (PSYNC)
When a replica disconnects briefly (network blip, restart), it does not need a full dataset transfer to rejoin. Redis maintains a replication backlog — a circular buffer of recent write commands — on the master. When the replica reconnects, it sends its replication offset to the master. If the offset is still within the backlog, the master sends only the missing commands (partial resync). If the offset has fallen outside the backlog, a full resynchronisation is triggered, which involves generating and transferring an RDB snapshot.
# redis.conf — Replication backlog configuration
repl-backlog-size 256mb # Size of the replication backlog buffer
repl-backlog-ttl 3600 # Seconds to retain backlog after last replica disconnects
repl-diskless-sync yes # Transfer RDB via socket instead of disk (faster for full sync)
repl-diskless-sync-delay 5 # Wait 5s for more replicas before starting diskless sync
repl-diskless-sync-max-replicas 0 # 0 = no early start; wait the full delay before syncing (Redis 7+)
repl-diskless-load on-empty-db # Replica loads RDB from socket directly into memory
Sizing the replication backlog correctly is critical. It should be large enough to hold all write commands generated during the longest expected replica disconnection. For a Redis instance processing 50MB/s of write traffic, a 256MB backlog covers about 5 seconds of writes — increase it if your replicas might be offline for longer periods.
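The sizing rule above reduces to a one-line calculation. This sketch and its 2x safety factor are our own convention, not a Redis default:

```python
def repl_backlog_bytes(write_mb_per_s: float, max_disconnect_s: float,
                       safety_factor: float = 2.0) -> int:
    """Replication backlog size needed to cover a replica outage.

    write_mb_per_s   -- sustained replication stream rate in MB/s
    max_disconnect_s -- longest disconnection a replica should survive
                        without forcing a full resync
    safety_factor    -- headroom for write-traffic spikes
    """
    return int(write_mb_per_s * max_disconnect_s * safety_factor * 1024 * 1024)

# 50 MB/s of writes, replicas may be offline for up to 60 seconds:
size = repl_backlog_bytes(50, 60)
print(f"repl-backlog-size {size // (1024 * 1024)}mb")  # repl-backlog-size 6000mb
```

An undersized backlog silently converts every longer disconnection into a full RDB transfer, so err on the generous side; the buffer is allocated once and shared by all replicas.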
Configuring Master-Replica Replication
# redis.conf — Master configuration
bind 0.0.0.0
port 6379
protected-mode no
requirepass strong_master_password
masterauth strong_master_password
# Persistence
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec
aof-use-rdb-preamble yes
# Replication
repl-backlog-size 256mb
repl-backlog-ttl 3600
repl-diskless-sync yes
min-replicas-to-write 1 # Refuse writes if fewer than 1 replica connected
min-replicas-max-lag 10 # Replica considered disconnected if lag > 10 seconds
# redis.conf — Replica configuration
bind 0.0.0.0
port 6379
protected-mode no
requirepass strong_master_password
masterauth strong_master_password
replicaof master-host 6379
replica-read-only yes
replica-serve-stale-data yes # Serve (possibly stale) data during sync
replica-priority 100 # Lower values get promoted first by Sentinel
The min-replicas-to-write and min-replicas-max-lag settings prevent the master from accepting writes when it cannot guarantee data durability on replicas. This is a critical safety net — without it, a network-partitioned master continues accepting writes that will be lost when Sentinel promotes a replica.
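When these settings block a write, the master rejects it with an error whose text begins with "NOREPLICAS". A minimal retry sketch (the client is duck-typed; treat the exact error prefix as an assumption verified against recent Redis versions):

```python
import time

def set_with_retry(client, key, value, attempts=3, backoff_s=0.5):
    """Retry writes rejected because too few healthy replicas are connected.

    Redis refuses the write with an error starting "NOREPLICAS" when
    min-replicas-to-write / min-replicas-max-lag are not satisfied.
    Returns True on success, False if all attempts were rejected.
    """
    for attempt in range(attempts):
        try:
            client.set(key, value)
            return True
        except Exception as exc:
            if "NOREPLICAS" not in str(exc):
                raise  # unrelated error: propagate to the caller
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return False
```

Backing off rather than failing immediately gives a briefly lagging replica time to catch up, which is usually what happens after a short network blip.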
Redis Sentinel: Automatic Failover and Service Discovery
Redis Sentinel is a distributed system that monitors Redis master and replica instances, detects master failures, performs automatic failover by promoting a replica to master, and provides service discovery so clients can always find the current master. Sentinel runs as a separate process alongside Redis and operates through a consensus protocol — a quorum of Sentinel instances must agree that a master is unreachable before initiating failover.
Sentinel Configuration
Sentinel requires a minimum of three instances to tolerate one Sentinel failure and still maintain quorum. Each Sentinel monitors the same Redis master and communicates with other Sentinels through a gossip protocol to agree on the master's health status.
# /etc/redis/sentinel.conf — Sentinel instance configuration
port 26379
bind 0.0.0.0
protected-mode no
# Monitor the master named "mymaster" at 10.0.1.10:6379
# The quorum value (2) means 2 Sentinels must agree the master is down
sentinel monitor mymaster 10.0.1.10 6379 2
# Authentication
sentinel auth-pass mymaster strong_master_password
# Timing parameters
sentinel down-after-milliseconds mymaster 5000 # SDOWN after 5s of no PING response
sentinel failover-timeout mymaster 60000 # Max 60s for failover procedure
sentinel parallel-syncs mymaster 1 # Only 1 replica syncs from new master at a time
# Deny script execution for security
sentinel deny-scripts-reconfig yes
# Notification script (called on failover events)
# sentinel notification-script mymaster /opt/redis/notify.sh
# Client reconfiguration script (called when master changes)
# sentinel client-reconfig-script mymaster /opt/redis/reconfig.sh
# Logging
logfile /var/log/redis/sentinel.log
loglevel notice
# Enable TLS for Sentinel communication
# tls-port 26379
# port 0
# tls-cert-file /etc/redis/tls/sentinel.crt
# tls-key-file /etc/redis/tls/sentinel.key
# tls-ca-cert-file /etc/redis/tls/ca.crt
# tls-replication yes
# tls-auth-clients optional
Sentinel's failure detection works in two phases. First, an individual Sentinel marks a master as Subjectively Down (SDOWN) when it receives no valid reply to PING within down-after-milliseconds. Then, when a quorum of Sentinels agree the master is unreachable, it is marked as Objectively Down (ODOWN), and the failover process begins. One Sentinel is elected as the failover leader, which selects the best replica (based on priority, replication offset, and runid), promotes it to master, reconfigures remaining replicas to follow the new master, and updates Sentinel state.
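Sentinel publishes these state transitions on Pub/Sub channels (+sdown, +odown, +switch-master), and the +switch-master payload has the form "<master-name> <old-ip> <old-port> <new-ip> <new-port>". A small monitoring sketch; the parser is ours, and the live subscription at the bottom assumes redis-py:

```python
def parse_switch_master(payload: str) -> dict:
    """Parse a Sentinel +switch-master message payload.

    Format: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
    """
    name, old_ip, old_port, new_ip, new_port = payload.split()
    return {
        "master": name,
        "old": (old_ip, int(old_port)),
        "new": (new_ip, int(new_port)),
    }

def watch_sentinel(host="10.0.1.20", port=26379, password=None):
    """Subscribe to Sentinel failover events and log master switches."""
    import redis  # local import: only needed for the live subscription
    sentinel = redis.Redis(host=host, port=port, password=password)
    pubsub = sentinel.pubsub()
    pubsub.psubscribe("+switch-master", "+sdown", "+odown")
    for message in pubsub.listen():
        if message["type"] != "pmessage":
            continue
        channel = message["channel"].decode()
        payload = message["data"].decode()
        if channel == "+switch-master":
            print("failover:", parse_switch_master(payload))
        else:
            print(channel, payload)
```

Feeding these events into your alerting pipeline gives you a failover audit trail independent of client-side reconnect behaviour.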
Sentinel Service Discovery and Client Configuration
The key advantage of Sentinel over static master-replica setups is service discovery. Clients do not connect to a fixed Redis address — they ask Sentinel for the current master address and subscribe to failover notifications. Every major Redis client library supports Sentinel natively.
// Node.js — ioredis with Sentinel support
const Redis = require('ioredis');
const redis = new Redis({
sentinels: [
{ host: '10.0.1.20', port: 26379 },
{ host: '10.0.1.21', port: 26379 },
{ host: '10.0.1.22', port: 26379 }
],
name: 'mymaster',
password: 'strong_master_password',
sentinelPassword: 'sentinel_password',
db: 0,
retryStrategy(times) {
const delay = Math.min(times * 200, 5000);
return delay;
},
reconnectOnError(err) {
const targetError = 'READONLY';
if (err.message.includes(targetError)) {
return true; // Reconnect on READONLY error (failover happened)
}
return false;
},
maxRetriesPerRequest: 3,
enableReadyCheck: true,
connectTimeout: 10000,
lazyConnect: false
});
redis.on('connect', () => console.log('Connected to Redis master'));
redis.on('error', (err) => console.error('Redis error:', err));
// ioredis tracks the master through Sentinel and reconnects automatically
// after a failover; to observe failovers explicitly, subscribe to the
// '+switch-master' channel on a direct Sentinel connection.
# Python — redis-py with Sentinel support
from redis.sentinel import Sentinel
import redis
sentinel = Sentinel(
[('10.0.1.20', 26379), ('10.0.1.21', 26379), ('10.0.1.22', 26379)],
socket_timeout=5,
password='strong_master_password',
sentinel_kwargs={'password': 'sentinel_password'}
)
# Get a connection to the current master (for writes)
master = sentinel.master_for(
'mymaster',
socket_timeout=5,
retry_on_timeout=True,
db=0
)
# Get a connection to a replica (for reads)
replica = sentinel.slave_for(
'mymaster',
socket_timeout=5,
db=0
)
# Usage
master.set('session:user123', '{"logged_in": true}')
result = replica.get('session:user123')
print(result)
// Go — go-redis with Sentinel support
package main
import (
"context"
"fmt"
"time"
"github.com/redis/go-redis/v9"
)
func main() {
ctx := context.Background()
rdb := redis.NewFailoverClient(&redis.FailoverOptions{
MasterName: "mymaster",
SentinelAddrs: []string{"10.0.1.20:26379", "10.0.1.21:26379", "10.0.1.22:26379"},
Password: "strong_master_password",
SentinelPassword: "sentinel_password",
DB: 0,
DialTimeout: 10 * time.Second,
ReadTimeout: 5 * time.Second,
WriteTimeout: 5 * time.Second,
PoolSize: 50,
MinIdleConns: 10,
MaxRetries: 3,
MinRetryBackoff: 200 * time.Millisecond,
MaxRetryBackoff: 5 * time.Second,
})
defer rdb.Close()
err := rdb.Set(ctx, "key", "value", 5*time.Minute).Err()
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
val, err := rdb.Get(ctx, "key").Result()
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("key = %s\n", val)
}
Redis Cluster: Horizontal Scaling with Hash Slots
While Sentinel provides high availability for a single dataset, Redis Cluster provides both HA and horizontal scaling. Redis Cluster partitions data across multiple master nodes using a hash slot mechanism — the keyspace is divided into 16,384 hash slots, and each master is responsible for a subset of those slots. Each master has one or more replicas for failover. The cluster collectively provides automatic sharding, built-in failover, and the ability to scale both storage and throughput linearly by adding nodes.
Setting Up a Redis Cluster
# Create a 6-node Redis Cluster (3 masters + 3 replicas)
# Each node needs a redis.conf with cluster-enabled
# redis.conf for each cluster node (one node per host, all listening on port 7000)
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 5000
appendonly yes
appendfsync everysec
aof-use-rdb-preamble yes
requirepass cluster_password
masterauth cluster_password
bind 0.0.0.0
protected-mode no
repl-backlog-size 256mb
# Start Redis on each of the 6 hosts (one node per host, port 7000)
redis-server /etc/redis/7000.conf
# ... repeat on all 6 hosts
# Create the cluster
redis-cli --cluster create \
10.0.1.10:7000 10.0.1.11:7000 10.0.1.12:7000 \
10.0.1.13:7000 10.0.1.14:7000 10.0.1.15:7000 \
--cluster-replicas 1 \
-a cluster_password
# Verify cluster status
redis-cli -c -h 10.0.1.10 -p 7000 -a cluster_password cluster info
redis-cli -c -h 10.0.1.10 -p 7000 -a cluster_password cluster nodes
Resharding and Multi-Key Operations
Resharding moves hash slots between masters to rebalance data after adding or removing nodes. During resharding, keys in migrating slots may trigger ASK redirections, which cluster-aware clients handle transparently.
# Add a new node to the cluster
redis-cli --cluster add-node 10.0.1.16:7000 10.0.1.10:7000 -a cluster_password
# Reshard slots to the new node
redis-cli --cluster reshard 10.0.1.10:7000 \
--cluster-from all \
--cluster-to NEW_NODE_ID \
--cluster-slots 4096 \
--cluster-yes \
-a cluster_password
# Rebalance the cluster automatically
redis-cli --cluster rebalance 10.0.1.10:7000 -a cluster_password
# Check cluster slot distribution
redis-cli -c -h 10.0.1.10 -p 7000 -a cluster_password cluster slots
Multi-key operations in Redis Cluster only work when all keys involved reside in the same hash slot. Use hash tags to ensure related keys map to the same slot: {user:123}.profile and {user:123}.sessions both hash on user:123, guaranteeing they land on the same node. This enables MGET, MSET, transactions, and Lua scripts across related keys.
# Hash tags ensure these keys are in the same slot
SET {order:5000}.details '{"item":"widget","qty":3}'
SET {order:5000}.payment '{"method":"card","status":"paid"}'
SET {order:5000}.shipping '{"carrier":"fedex","tracking":"FX123"}'
# Multi-key operations work because all keys share the {order:5000} hash tag
MGET {order:5000}.details {order:5000}.payment {order:5000}.shipping
# Transaction across same-slot keys
MULTI
SET {order:5000}.details '{"item":"widget","qty":3,"status":"confirmed"}'
SET {order:5000}.payment '{"method":"card","status":"captured"}'
EXEC
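The slot mapping can be reproduced client-side: Redis hashes the key (or, if a non-empty hash tag is present, only the substring between the first '{' and the next '}') with CRC16/XMODEM and takes the result modulo 16384. A sketch following the algorithm in the cluster specification:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16/XMODEM (polynomial 0x1021, init 0), as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Return the hash slot (0-16383) a key maps to, honouring hash tags."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# All {order:5000} keys land in the same slot, so MGET/MULTI work:
assert key_slot("{order:5000}.details") == key_slot("{order:5000}.payment")
```

This is also handy at design time: you can check that a proposed key scheme co-locates the keys you need for transactions before deploying it.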
Redis Operator for Kubernetes
Running Redis in Kubernetes requires careful handling of persistent storage, network identity, graceful failover, and configuration management. Kubernetes operators encode this operational knowledge into custom controllers that manage Redis clusters declaratively through Custom Resource Definitions (CRDs).
Spotahome Redis Operator
The Spotahome operator (also known as redis-operator) is a mature, widely-used operator for deploying Redis Sentinel-based HA in Kubernetes. It manages Redis master-replica sets with Sentinel-based failover.
# Install the Spotahome Redis Operator
helm repo add spotahome https://spotahome.github.io/redis-operator
helm install redis-operator spotahome/redis-operator \
--namespace redis-system --create-namespace
# RedisFailover CRD — 3 Redis instances + 3 Sentinels
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: redis-ha
  namespace: production
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
    customConfig:
      down-after-milliseconds: "5000"
      failover-timeout: "60000"
  redis:
    replicas: 3
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 16Gi
    storage:
      persistentVolumeClaim:
        metadata:
          name: redis-data
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 50Gi
          storageClassName: longhorn
    customConfig:
      maxmemory: "6gb"
      maxmemory-policy: "allkeys-lru"
      save: "900 1 300 10 60 10000"
      appendonly: "yes"
      appendfsync: "everysec"
      aof-use-rdb-preamble: "yes"
      repl-backlog-size: "256mb"
    exporter:
      enabled: true
      image: oliver006/redis_exporter:latest
      args:
        - --include-system-metrics
OpsTree Redis Operator
The OpsTree operator supports both Redis Sentinel (standalone HA) and Redis Cluster (sharded HA) topologies, making it more versatile for different use cases.
# Install the OpsTree Redis Operator
helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
helm install redis-operator ot-helm/redis-operator \
--namespace redis-system --create-namespace
# Redis Cluster CRD — 3 masters + 3 replicas
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisCluster
metadata:
  name: redis-cluster
  namespace: production
spec:
  clusterSize: 3
  clusterVersion: v7
  persistenceEnabled: true
  kubernetesConfig:
    image: redis:7.2-alpine
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "4"
        memory: 16Gi
  redisLeader:
    replicas: 3
    redisConfig:
      additionalRedisConfig: |
        maxmemory 6gb
        maxmemory-policy allkeys-lru
        appendonly yes
        appendfsync everysec
  redisFollower:
    replicas: 3
    redisConfig:
      additionalRedisConfig: |
        maxmemory 6gb
        replica-read-only yes
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: longhorn
  redisExporter:
    enabled: true
    image: quay.io/opstree/redis-exporter:v1.44.0
Redis Enterprise Operator
Redis Enterprise provides a commercial Kubernetes operator with advanced features including Active-Active geo-replication (CRDTs), Redis modules support, auto-tiering (RAM + flash), and automated cluster management. It is the recommended option for organisations that need enterprise-grade SLAs and support.
# Redis Enterprise Operator CRD
apiVersion: app.redislabs.com/v1
kind: RedisEnterpriseCluster
metadata:
  name: redis-enterprise
  namespace: redis-enterprise
spec:
  nodes: 3
  persistentSpec:
    enabled: true
    storageClassName: longhorn
    volumeSize: 100Gi
  redisEnterpriseNodeResources:
    limits:
      cpu: "8"
      memory: 32Gi
    requests:
      cpu: "4"
      memory: 16Gi
  uiServiceType: ClusterIP
  servicesRiggerSpec:
    databaseServiceType: ClusterIP
---
apiVersion: app.redislabs.com/v1alpha1
kind: RedisEnterpriseDatabase
metadata:
  name: redis-ha-db
  namespace: redis-enterprise
spec:
  memorySize: 10GB
  replication: true
  shardCount: 3
  persistence: aofEverySecond
  tlsMode: enabled
  modulesList:
    - name: search
      version: latest
    - name: json
      version: latest
Persistence Strategies: RDB, AOF, and Hybrid
Redis offers three persistence mechanisms. Choosing the right strategy depends on your recovery point objective (RPO), performance requirements, and storage constraints.
RDB snapshots create point-in-time snapshots of the entire dataset at configured intervals. They are compact, fast to load on restart, and ideal for backups. However, data written between snapshots is lost on crash. RDB uses a forked child process, so snapshot creation does not block the main Redis thread — but the fork itself can cause a latency spike on large datasets due to copy-on-write memory allocation.
AOF (Append Only File) logs every write operation to disk. It provides much better durability than RDB — with appendfsync everysec, you lose at most one second of data on crash. With appendfsync always, you lose nothing, but at a significant performance cost. AOF files are larger than RDB and slower to load on restart.
Hybrid persistence (the recommended approach) combines both: aof-use-rdb-preamble yes writes an RDB snapshot at the beginning of the AOF file, followed by AOF entries for subsequent writes. This gives fast startup times (RDB section) with strong durability (AOF section).
# redis.conf — Hybrid persistence (recommended for production)
# RDB snapshots
save 900 1 # Snapshot if at least 1 write in 900 seconds
save 300 10 # Snapshot if at least 10 writes in 300 seconds
save 60 10000 # Snapshot if at least 10000 writes in 60 seconds
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /data/redis
# AOF
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec # Best balance of durability and performance
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-use-rdb-preamble yes # Hybrid: RDB preamble + AOF tail
aof-timestamp-enabled yes # Enable timestamps for PITR (Redis 7+)
# Recovery options
rdb-del-sync-files no
aof-load-truncated yes
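The effect of these settings can be verified at runtime from INFO persistence. A small health-check sketch: the field names are real INFO fields, while the function name and the snapshot-backlog threshold are our own.

```python
def persistence_healthy(info: dict) -> list:
    """Check an INFO persistence section (as returned by redis-py's
    r.info('persistence')) and return a list of problems found."""
    problems = []
    if not info.get("aof_enabled"):
        problems.append("AOF is disabled")
    if info.get("aof_last_write_status") != "ok":
        problems.append("last AOF write failed")
    if info.get("rdb_last_bgsave_status") != "ok":
        problems.append("last RDB snapshot failed")
    if info.get("rdb_changes_since_last_save", 0) > 1_000_000:
        problems.append("very many writes since last snapshot")
    return problems

# Usage with redis-py (assumed):
#   import redis
#   r = redis.Redis(host="10.0.1.10", password="strong_master_password")
#   for problem in persistence_healthy(r.info("persistence")):
#       print("ALERT:", problem)
```

Wiring this into a periodic check catches silent persistence failures (full disk, failed BGSAVE fork) long before a restart reveals them.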
Cloud-Managed Redis Deployments
AWS ElastiCache for Redis
AWS ElastiCache offers fully managed Redis with two HA modes: Cluster Mode Disabled (single shard, up to 5 replicas, Sentinel-like failover) and Cluster Mode Enabled (up to 500 shards, each with up to 5 replicas, hash slot distribution). Global Datastore provides cross-region replication for disaster recovery.
# AWS CLI — Create ElastiCache Redis Cluster Mode Enabled
aws elasticache create-replication-group \
--replication-group-id redis-ha-prod \
--replication-group-description "Production Redis HA Cluster" \
--engine redis \
--engine-version 7.1 \
--cache-node-type cache.r7g.2xlarge \
--num-node-groups 3 \
--replicas-per-node-group 2 \
--automatic-failover-enabled \
--multi-az-enabled \
--at-rest-encryption-enabled \
--transit-encryption-enabled \
--auth-token strong_auth_token \
--cache-subnet-group-name redis-subnet-group \
--security-group-ids sg-0123456789abcdef0 \
--snapshot-retention-limit 7 \
--snapshot-window "03:00-05:00" \
--preferred-maintenance-window "sun:05:00-sun:07:00" \
--cache-parameter-group-name redis-ha-params \
--log-delivery-configurations '[
{"LogType":"slow-log","DestinationType":"cloudwatch-logs","DestinationDetails":{"CloudWatchLogsDetails":{"LogGroup":"/aws/elasticache/redis-ha-prod"}}},
{"LogType":"engine-log","DestinationType":"cloudwatch-logs","DestinationDetails":{"CloudWatchLogsDetails":{"LogGroup":"/aws/elasticache/redis-ha-prod"}}}
]'
# Create Global Datastore for cross-region DR
aws elasticache create-global-replication-group \
--global-replication-group-id-suffix redis-global \
--primary-replication-group-id redis-ha-prod
# Add secondary region
aws elasticache create-replication-group \
--replication-group-id redis-ha-dr \
--replication-group-description "DR Redis in eu-west-2" \
--global-replication-group-id ldgnf-redis-global \
--cache-node-type cache.r7g.2xlarge \
--num-node-groups 3 \
--replicas-per-node-group 1 \
--region eu-west-2
# Custom parameter group for HA tuning
aws elasticache create-cache-parameter-group \
--cache-parameter-group-name redis-ha-params \
--cache-parameter-group-family redis7 \
--description "HA-optimised Redis 7 parameters"
aws elasticache modify-cache-parameter-group \
--cache-parameter-group-name redis-ha-params \
--parameter-name-values \
"ParameterName=maxmemory-policy,ParameterValue=allkeys-lru" \
"ParameterName=timeout,ParameterValue=300" \
"ParameterName=tcp-keepalive,ParameterValue=60" \
"ParameterName=activedefrag,ParameterValue=yes"
Azure Cache for Redis
Azure Cache for Redis provides three tiers: Basic (no replication), Standard (replicated), and Premium/Enterprise. The Premium tier supports clustering, geo-replication, zone redundancy, VNet injection, and data persistence. The Enterprise tier adds Redis modules and Active-Active geo-distribution.
# Azure CLI — Create Premium Azure Cache for Redis with clustering
az redis create \
--resource-group redis-ha-rg \
--name redis-ha-prod \
--location westeurope \
--sku Premium \
--vm-size P3 \
--shard-count 3 \
--replicas-per-master 1 \
--zones 1 2 3 \
--minimum-tls-version 1.2 \
--redis-version 7
# Enable geo-replication (link primary to secondary)
az redis server-link create \
--name redis-ha-prod \
--resource-group redis-ha-rg \
--server-to-link /subscriptions/.../redis-ha-dr \
--replication-role Secondary
# Configure data persistence
az redis update \
--name redis-ha-prod \
--resource-group redis-ha-rg \
--set redisConfiguration.rdb-backup-enabled=true \
--set redisConfiguration.rdb-backup-frequency=60 \
--set redisConfiguration.rdb-storage-connection-string="DefaultEndpointsProtocol=https;..."
# Enable diagnostics
az monitor diagnostic-settings create \
--name redis-diagnostics \
--resource /subscriptions/.../redis-ha-prod \
--workspace /subscriptions/.../log-analytics-workspace \
--metrics '[{"category":"AllMetrics","enabled":true}]'
GCP Memorystore for Redis
GCP offers two Memorystore options for high availability: the Standard tier (a primary instance with an automatic failover replica) and Memorystore for Redis Cluster (a sharded, fully managed cluster with automatic scaling). The Standard tier is suitable for most HA use cases, while Redis Cluster handles large datasets requiring horizontal scaling.
# GCP — Create Standard tier Memorystore (HA with auto-failover)
gcloud redis instances create redis-ha-prod \
--size=26 \
--region=europe-west1 \
--zone=europe-west1-b \
--alternative-zone=europe-west1-c \
--tier=standard \
--redis-version=redis_7_2 \
--redis-config="maxmemory-policy=allkeys-lru,activedefrag=yes" \
--network=projects/my-project/global/networks/vpc-main \
--transit-encryption-mode=SERVER_AUTHENTICATION \
--enable-auth \
--persistence-mode=RDB \
--rdb-snapshot-period=12h \
--rdb-snapshot-start-time="2026-04-12T03:00:00Z" \
--maintenance-window-day=SUNDAY \
--maintenance-window-hour=4
# GCP — Create Memorystore Redis Cluster
gcloud redis clusters create redis-cluster-prod \
--region=europe-west1 \
--shard-count=3 \
--replica-count=1 \
--network=projects/my-project/global/networks/vpc-main \
--transit-encryption-mode=SERVER_AUTHENTICATION
Multi-Region Redis Deployment
Multi-region Redis deployment is essential for disaster recovery and global latency reduction. The approach varies by architecture: Redis Enterprise uses Active-Active with CRDTs (Conflict-free Replicated Data Types) for true multi-master writes, while open-source Redis and cloud-managed services use active-passive replication with read replicas in secondary regions.
Bare Metal k3s/Rancher with Longhorn
For organisations running their own hardware, deploying Redis HA on bare metal k3s provides full control over the infrastructure, eliminates cloud vendor lock-in, and can be significantly more cost-effective for large-scale deployments. k3s is a lightweight, certified Kubernetes distribution ideal for edge and bare metal environments, while Rancher provides a management plane for multi-cluster operations.
k3s Installation and Redis Operator Deployment
# Install k3s on the first server node
curl -sfL https://get.k3s.io | K3S_TOKEN=redis-cluster-token \
INSTALL_K3S_EXEC="server --cluster-init --disable traefik --disable servicelb" sh -
# Join additional server nodes
curl -sfL https://get.k3s.io | K3S_TOKEN=redis-cluster-token \
K3S_URL=https://10.0.0.1:6443 \
INSTALL_K3S_EXEC="server" sh -
# Install Longhorn for distributed block storage
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
--namespace longhorn-system --create-namespace \
--set defaultSettings.defaultDataPath=/mnt/longhorn \
--set defaultSettings.defaultReplicaCount=3 \
--set defaultSettings.storageMinimalAvailablePercentage=15
# Install MetalLB for bare-metal LoadBalancer services
helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb --namespace metallb-system --create-namespace
# Configure MetalLB IP address pool
kubectl apply -f - <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: redis-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.0.0.100-10.0.0.120   # example range; adjust to your network
EOF
Redis HA Helm Values for k3s
# values-redis-ha.yaml — Helm values for Redis HA on k3s/Longhorn
redis:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 16Gi
  storage:
    persistentVolumeClaim:
      metadata:
        name: redis-data
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: longhorn
  customConfig:
    maxmemory: "6gb"
    maxmemory-policy: "allkeys-lru"
    save: "900 1 300 10 60 10000"
    appendonly: "yes"
    appendfsync: "everysec"
    aof-use-rdb-preamble: "yes"
    repl-backlog-size: "256mb"
    tcp-keepalive: "60"
    timeout: "300"
    hz: "10"
    activedefrag: "yes"
  exporter:
    enabled: true
    image: oliver006/redis_exporter:latest
sentinel:
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi
  customConfig:
    down-after-milliseconds: "5000"
    failover-timeout: "60000"
    parallel-syncs: "1"
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/component
              operator: In
              values:
                - redis
        topologyKey: kubernetes.io/hostname
Memory Management and Eviction Policies
Redis stores all data in memory, making memory management the most critical operational concern. When Redis reaches the configured maxmemory limit, it must decide what to do with incoming write commands. The eviction policy controls this behaviour.
# redis.conf — Memory management
maxmemory 6gb
maxmemory-policy allkeys-lru
# Available eviction policies:
# noeviction — Return errors on writes when memory limit reached
# allkeys-lru — Evict least recently used keys (general-purpose cache)
# allkeys-lfu — Evict least frequently used keys (better for skewed access patterns)
# volatile-lru — Evict LRU keys with TTL set
# volatile-lfu — Evict LFU keys with TTL set
# volatile-ttl — Evict keys with shortest TTL first
# allkeys-random — Evict random keys
# volatile-random — Evict random keys with TTL set
# Active defragmentation (Redis 4.0+)
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000
# Memory usage monitoring
# redis-cli INFO memory
# Key metrics:
# used_memory — Total bytes allocated by Redis
# used_memory_rss — Resident set size (OS-level memory)
# mem_fragmentation_ratio — RSS / used_memory (roughly 1.0-1.5 is healthy; higher means fragmentation, below 1.0 means swapping)
# maxmemory — Configured memory limit
# evicted_keys — Total keys evicted due to maxmemory
For HA deployments, set maxmemory to approximately 75% of the node's available RAM. The remaining 25% accommodates the replication output buffer, AOF rewrite buffer, copy-on-write memory during RDB snapshots, and OS overhead. For a 16GB pod, set maxmemory to 12GB. For a 64GB bare metal server, set it to 48GB.
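The 75% rule and the INFO memory metrics above combine into a simple sizing and monitoring sketch; the function names and warning thresholds here are our own conventions.

```python
def recommended_maxmemory(ram_bytes: int, fraction: float = 0.75) -> int:
    """Apply the ~75% rule: leave headroom for replication buffers,
    AOF rewrite buffers, copy-on-write during BGSAVE, and the OS."""
    return int(ram_bytes * fraction)

def memory_warnings(info: dict) -> list:
    """Inspect an INFO memory dict (e.g. redis-py's r.info('memory'))."""
    warnings = []
    ratio = info.get("mem_fragmentation_ratio", 1.0)
    if ratio > 1.5:
        warnings.append(f"high fragmentation: {ratio}")
    if ratio < 1.0:
        warnings.append(f"RSS below used_memory (likely swapping): {ratio}")
    maxmemory = info.get("maxmemory", 0)
    if maxmemory and info.get("used_memory", 0) > 0.9 * maxmemory:
        warnings.append("within 10% of maxmemory; evictions imminent")
    return warnings

gib = 1024 ** 3
print(recommended_maxmemory(16 * gib) // gib)  # 12 (GiB) for a 16 GiB pod
```

Alerting before evictions start (rather than on evicted_keys climbing) gives you time to scale up or tighten TTLs without losing hot data.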
TLS Encryption and ACLs
Production Redis deployments must encrypt data in transit with TLS and enforce fine-grained access control with ACLs (Access Control Lists), introduced in Redis 6.0.
# redis.conf — TLS configuration
tls-port 6380
port 0 # Disable non-TLS port entirely
tls-cert-file /etc/redis/tls/redis.crt
tls-key-file /etc/redis/tls/redis.key
tls-ca-cert-file /etc/redis/tls/ca.crt
tls-auth-clients optional # Set to "yes" to require client certificates (mutual TLS)
tls-replication yes # Encrypt replication traffic
tls-cluster yes # Encrypt cluster bus traffic
tls-protocols "TLSv1.3" # Only allow TLS 1.3
# ACL configuration (Redis 6.0+)
# user on|off [>password] [~pattern] [+command|-command] [&channel]
user default off # Disable the default user
user admin on >strong_admin_pass ~* +@all
user appuser on >app_pass ~app:* ~session:* ~cache:* +@read +@write +@connection -@admin -@dangerous
user readonly on >readonly_pass ~* +@read +@connection -@write -@admin
user replicator on >repl_pass +psync +replconf +ping
# Alternatively, load users from an external ACL file instead of redis.conf
# ("user" directives in redis.conf and aclfile are mutually exclusive)
# aclfile /etc/redis/users.acl
Pub/Sub and Streams in HA Setups
Redis Pub/Sub and Streams behave differently in HA configurations. Understanding these differences is essential for building reliable event-driven systems.
Pub/Sub messages are fire-and-forget — they are not persisted, not replicated, and not buffered. In a Sentinel-based HA setup, subscribers connected to the master receive messages normally, but during a failover, the new master has no knowledge of previous subscriptions. Clients must re-subscribe after reconnecting. With Redis Cluster, Pub/Sub messages are broadcast to all nodes in the cluster, so subscribers connected to any node receive published messages (though this generates inter-node traffic).
Redis Streams are a persistent, replicated data structure that provides reliable message delivery in HA environments. Stream entries are replicated to replicas through the normal replication mechanism, survive failovers, and support consumer groups with at-least-once delivery semantics. For HA messaging, Streams should always be preferred over Pub/Sub.
# Redis Streams with consumer groups — HA-safe message processing
# Create a stream and consumer group
XGROUP CREATE events:orders orders-processors $ MKSTREAM
# Produce events
XADD events:orders * action "order_placed" order_id "12345" amount "599.99"
XADD events:orders * action "order_placed" order_id "12346" amount "149.99"
# Consume events (in consumer group — at-least-once delivery)
XREADGROUP GROUP orders-processors worker-1 COUNT 10 BLOCK 5000 STREAMS events:orders >
# Acknowledge processed events
XACK events:orders orders-processors 1681234567890-0
# Check pending messages (unacknowledged)
XPENDING events:orders orders-processors - + 10
# Claim abandoned messages (from a dead consumer)
XAUTOCLAIM events:orders orders-processors worker-2 60000 0-0 COUNT 10
# Trim stream to prevent unbounded growth
XTRIM events:orders MAXLEN ~ 100000
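In application code, the XPENDING/XAUTOCLAIM recovery step amounts to a selection rule over pending entries. A pure-Python sketch of that rule, operating on XPENDING detail rows of the form (id, consumer, idle-ms, delivery-count); the 60-second threshold mirrors the XAUTOCLAIM call above, while the dead-letter cutoff is our own addition, not a Redis feature:

```python
# Sketch: decide which pending stream entries a recovering worker should
# claim, given XPENDING detail rows (id, consumer, idle_ms, deliveries).
# The 60000ms idle threshold mirrors the XAUTOCLAIM example above; the
# poison-message (dead-letter) cutoff is our own assumption.

def entries_to_claim(pending, min_idle_ms=60000, max_deliveries=10):
    claim, dead_letter = [], []
    for entry_id, consumer, idle_ms, deliveries in pending:
        if idle_ms < min_idle_ms:
            continue  # the owning consumer may still be working on it
        if deliveries >= max_deliveries:
            dead_letter.append(entry_id)  # failed repeatedly: route elsewhere
        else:
            claim.append(entry_id)
    return claim, dead_letter

pending = [
    ("1681234567890-0", "worker-1", 120000, 2),   # abandoned -> claim
    ("1681234567891-0", "worker-1", 5000, 1),     # still fresh -> leave
    ("1681234567892-0", "worker-3", 300000, 12),  # poison -> dead-letter
]
claim, dead = entries_to_claim(pending)
print(claim, dead)
```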
Redis Modules in HA
Redis modules extend Redis with specialised data structures and functionality. The three most popular — RedisJSON, RediSearch, and RedisTimeSeries — work with replication and Sentinel, but have specific considerations in HA deployments.
# redis.conf — Loading modules
loadmodule /opt/redis-stack/lib/rejson.so
loadmodule /opt/redis-stack/lib/redisearch.so
loadmodule /opt/redis-stack/lib/redistimeseries.so
# Modules are replicated to replicas via the command stream
# Ensure the same modules are installed on all nodes (master + replicas)
# RedisJSON — store and query JSON documents
JSON.SET user:1001 $ '{"name":"Alice","email":"alice@example.com","orders":42}'
JSON.GET user:1001 $.name
# RediSearch — full-text search with indexing
FT.CREATE idx:users ON JSON PREFIX 1 user: SCHEMA $.name AS name TEXT $.email AS email TAG $.orders AS orders NUMERIC
FT.SEARCH idx:users "@name:Alice"
# RedisTimeSeries — time-series data
TS.CREATE metrics:cpu:node1 RETENTION 86400000 LABELS host node1 metric cpu
TS.ADD metrics:cpu:node1 * 73.5
TS.RANGE metrics:cpu:node1 - + AGGREGATION avg 60000
When running Redis Stack (the module bundle) in HA, ensure all nodes — master and replicas — have identical module versions installed. Module commands are replicated via the standard replication stream, so replicas must be able to execute them. After a failover, RediSearch indexes exist on the promoted replica and serve queries immediately.
Connection Pooling and Client Configuration for HA
Proper connection pooling is essential for Redis HA performance and resilience. Connection pools reduce the overhead of establishing TCP connections, provide automatic retry logic, and enable graceful failover handling.
// Node.js — ioredis connection pool with cluster mode
const Redis = require('ioredis');

// Cluster mode connection
const cluster = new Redis.Cluster(
  [
    { host: '10.0.1.10', port: 7000 },
    { host: '10.0.1.11', port: 7000 },
    { host: '10.0.1.12', port: 7000 }
  ],
  {
    redisOptions: {
      password: 'cluster_password',
      connectTimeout: 10000,
      maxRetriesPerRequest: 3
    },
    scaleReads: 'slave', // Route reads to replicas
    clusterRetryStrategy(times) {
      return Math.min(times * 200, 5000);
    },
    slotsRefreshTimeout: 2000,
    slotsRefreshInterval: 5000,
    enableOfflineQueue: true,
    enableReadyCheck: true,
    natMap: {} // For NAT/port-forwarded environments
  }
);

cluster.on('error', (err) => console.error('Cluster error:', err));
cluster.on('node error', (err, address) => {
  console.error(`Node ${address} error:`, err);
});
# Python — redis-py connection pool with cluster mode
from redis.cluster import RedisCluster, ClusterNode
from redis.backoff import ExponentialBackoff
from redis.retry import Retry

retry = Retry(ExponentialBackoff(cap=5, base=0.1), retries=5)

# startup_nodes takes ClusterNode instances in redis-py
rc = RedisCluster(
    startup_nodes=[
        ClusterNode("10.0.1.10", 7000),
        ClusterNode("10.0.1.11", 7000),
        ClusterNode("10.0.1.12", 7000)
    ],
    password="cluster_password",
    decode_responses=True,
    read_from_replicas=True,
    retry=retry,
    retry_on_timeout=True,
    socket_timeout=5,
    socket_connect_timeout=5,
    max_connections=50,
    health_check_interval=30
)
rc.set("key", "value")
print(rc.get("key"))
// Go — go-redis cluster client with connection pooling
package main

import (
    "time"

    "github.com/redis/go-redis/v9"
)

func NewRedisCluster() *redis.ClusterClient {
    return redis.NewClusterClient(&redis.ClusterOptions{
        Addrs: []string{
            "10.0.1.10:7000",
            "10.0.1.11:7000",
            "10.0.1.12:7000",
        },
        Password:        "cluster_password",
        ReadOnly:        true,
        RouteRandomly:   true,
        RouteByLatency:  false,
        PoolSize:        50,
        MinIdleConns:    10,
        DialTimeout:     10 * time.Second,
        ReadTimeout:     5 * time.Second,
        WriteTimeout:    5 * time.Second,
        PoolTimeout:     10 * time.Second,
        MaxRetries:      5,
        MinRetryBackoff: 200 * time.Millisecond,
        MaxRetryBackoff: 5 * time.Second,
    })
}
Backup and Restore Strategies
Even with HA replication, regular backups are essential for disaster recovery, compliance, and protection against logical errors (accidental FLUSHALL, bad application writes). Redis backups are based on RDB snapshots and AOF files.
#!/bin/bash
# redis-backup.sh — Automated Redis backup script
REDIS_HOST="10.0.1.10"
REDIS_PORT="6379"
REDIS_PASS="strong_master_password"
BACKUP_DIR="/backups/redis"
S3_BUCKET="s3://redis-backups-prod"
RETENTION_DAYS=30
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Record the last save time, then trigger an RDB snapshot on the master
LAST_SAVE=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASS" LASTSAVE)
redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASS" BGSAVE
# Wait for the background save to complete (LASTSAVE advances when it finishes)
while [ "$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASS" LASTSAVE)" = "$LAST_SAVE" ]; do
    sleep 1
done
# Copy the RDB file out of the Redis data directory
RDB_DIR=$(redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" -a "$REDIS_PASS" CONFIG GET dir | tail -1)
cp "${RDB_DIR}/dump.rdb" "${BACKUP_DIR}/dump_${DATE}.rdb"
# Verify backup integrity before compressing (redis-check-rdb reads raw RDB)
redis-check-rdb "${BACKUP_DIR}/dump_${DATE}.rdb" \
    && echo "Backup verified: dump_${DATE}.rdb" \
    || { echo "ERROR: Backup verification failed!" >&2; exit 1; }
# Compress and upload to S3
gzip "${BACKUP_DIR}/dump_${DATE}.rdb"
aws s3 cp "${BACKUP_DIR}/dump_${DATE}.rdb.gz" "${S3_BUCKET}/daily/dump_${DATE}.rdb.gz" \
    --storage-class STANDARD_IA
# Cleanup old local backups
find "$BACKUP_DIR" -name "dump_*.rdb.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup complete: ${BACKUP_DIR}/dump_${DATE}.rdb.gz"
# Restore from RDB backup
# 1. Stop Redis
systemctl stop redis
# 2. Replace the RDB file
gunzip /backups/redis/dump_20260412_030000.rdb.gz
cp /backups/redis/dump_20260412_030000.rdb /data/redis/dump.rdb
chown redis:redis /data/redis/dump.rdb
# 3. Disable AOF in redis.conf before starting (with appendonly yes, Redis
#    loads the AOF instead of the RDB and ignores the restored snapshot;
#    CONFIG SET is not possible while the server is stopped)
sed -i 's/^appendonly yes/appendonly no/' /etc/redis/redis.conf
# 4. Start Redis (loads RDB)
systemctl start redis
# 5. Re-enable AOF and rewrite it from the loaded data
redis-cli -a password CONFIG SET appendonly yes
redis-cli -a password BGREWRITEAOF
Monitoring with Redis INFO, Prometheus, and Grafana
Comprehensive monitoring is the foundation of operational excellence for Redis HA. Redis exposes rich internal metrics through the INFO command, which the Prometheus Redis exporter translates into time-series metrics for Grafana dashboards and alerting.
# Key Redis INFO sections for HA monitoring
redis-cli -a password INFO replication
# role:master
# connected_slaves:2
# slave0:ip=10.0.1.11,port=6379,state=online,offset=1234567,lag=0
# slave1:ip=10.0.1.12,port=6379,state=online,offset=1234560,lag=1
# master_replid:abc123...
# master_repl_offset:1234567
# repl_backlog_size:268435456
# repl_backlog_first_byte_offset:1000000
redis-cli -a password INFO memory
# used_memory:6442450944
# used_memory_human:6.00G
# used_memory_rss:6879707136
# mem_fragmentation_ratio:1.07
# maxmemory:6442450944
# maxmemory_policy:allkeys-lru
# evicted_keys:12345
redis-cli -a password INFO stats
# total_connections_received:50000
# total_commands_processed:12345678
# instantaneous_ops_per_sec:85432
# keyspace_hits:11000000
# keyspace_misses:1345678
# expired_keys:500000
# evicted_keys:12345
redis-cli -a password INFO clients
# connected_clients:150
# blocked_clients:0
# maxclients:10000
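INFO output is plain `key:value` lines, so deriving metrics such as cache hit rate takes only a few lines of parsing. A sketch against the sample stats above (the embedded text stands in for a live `redis-cli INFO stats` call):

```python
# Sketch: parse redis-cli INFO output (key:value lines) and derive the
# cache hit rate from keyspace_hits / (hits + misses), as monitored above.
# SAMPLE_INFO stands in for a live INFO stats call.

SAMPLE_INFO = """\
total_commands_processed:12345678
instantaneous_ops_per_sec:85432
keyspace_hits:11000000
keyspace_misses:1345678
evicted_keys:12345
"""

def parse_info(text: str) -> dict:
    out = {}
    for line in text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            out[key] = int(value) if value.isdigit() else value
    return out

stats = parse_info(SAMPLE_INFO)
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses)
print(f"hit rate: {hit_rate:.1%}")
```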
# Prometheus Redis Exporter deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-exporter
  template:
    metadata:
      labels:
        app: redis-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
    spec:
      containers:
        - name: redis-exporter
          image: oliver006/redis_exporter:latest
          args:
            - --redis.addr=redis://redis-ha-master:6379
            - --redis.password=$(REDIS_PASSWORD)
            - --include-system-metrics
            # Add --is-cluster only when redis.addr points at a Redis Cluster
            # node; it is not needed for a Sentinel-managed master
          env:
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-secret
                  key: password
          ports:
            - containerPort: 9121
              name: metrics
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
# PrometheusRule for Redis HA alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: redis-ha-alerts
  namespace: monitoring
spec:
  groups:
    - name: redis-availability
      rules:
        - alert: RedisDown
          expr: redis_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Redis instance {{ $labels.instance }} is down"
        - alert: RedisReplicaDisconnected
          expr: redis_connected_slaves < 2
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Redis master has fewer than 2 connected replicas"
        - alert: RedisReplicationLagHigh
          expr: redis_connected_slave_lag_seconds > 5
          for: 3m
          labels:
            severity: warning
          annotations:
            summary: "Redis replication lag exceeds 5 seconds on {{ $labels.instance }}"
        - alert: RedisMemoryUsageHigh
          expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis memory usage above 90% on {{ $labels.instance }}"
        - alert: RedisEvictionsHigh
          expr: rate(redis_evicted_keys_total[5m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis evicting keys at >100/s on {{ $labels.instance }}"
        - alert: RedisKeyspaceHitRateLow
          # rate() tracks the current hit rate, not the all-time ratio
          # accumulated since the server started
          expr: >
            rate(redis_keyspace_hits_total[5m])
            / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
            < 0.8
          for: 10m
          labels:
            severity: info
          annotations:
            summary: "Redis cache hit rate below 80% on {{ $labels.instance }}"
        - alert: RedisSentinelDown
          expr: redis_sentinel_master_status != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Redis Sentinel reports master unhealthy"
    - name: redis-performance
      rules:
        - alert: RedisSlowlogGrowing
          expr: increase(redis_slowlog_length[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis slow log growing rapidly on {{ $labels.instance }}"
        - alert: RedisConnectionsNearLimit
          expr: redis_connected_clients / redis_config_maxclients > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis connected clients above 80% of maxclients"
Dragonfly and KeyDB: Redis-Compatible Alternatives
While Redis is the dominant in-memory data store, two Redis-compatible alternatives have gained traction for specific use cases.
Dragonfly
Dragonfly is a modern, multi-threaded Redis replacement that aims to be a drop-in replacement while leveraging all available CPU cores. Traditional Redis is single-threaded for command processing — Dragonfly uses a shared-nothing architecture with multiple threads to achieve significantly higher throughput on multi-core machines. It supports the Redis protocol, most Redis commands, and can replace Redis without application changes.
# Run Dragonfly as a Redis replacement
docker run -d --name dragonfly \
-p 6379:6379 \
-v /data/dragonfly:/data \
docker.dragonflydb.io/dragonflydb/dragonfly \
--maxmemory 12gb \
--proactor_threads 8 \
--dbfilename dump.rdb \
--requirepass strong_password
# Dragonfly in Kubernetes
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dragonfly
  namespace: production
spec:
  serviceName: dragonfly  # StatefulSets require a (headless) governing Service
  replicas: 1
  selector:
    matchLabels:
      app: dragonfly
  template:
    metadata:
      labels:
        app: dragonfly
    spec:
      containers:
        - name: dragonfly
          image: docker.dragonflydb.io/dragonflydb/dragonfly:latest
          args:
            - --maxmemory=12gb
            - --proactor_threads=8
            - --requirepass=strong_password
            - "--snapshot_cron=*/30 * * * *"
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: "4"
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 16Gi
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: longhorn
        resources:
          requests:
            storage: 100Gi
Dragonfly's key advantages: multi-threaded command processing (the project claims up to 25x single-node Redis throughput on large multi-core machines), better memory efficiency (Dashtable hash tables instead of the Redis dict), built-in snapshotting without fork() overhead, and native support for large datasets. However, as of 2026, Dragonfly's replication support is still maturing: it supports primary-replica replication but has no Sentinel-equivalent automatic failover system. For HA, use Kubernetes health checks and StatefulSet restart policies, or deploy behind a load balancer with application-level failover.
KeyDB
KeyDB is a multi-threaded fork of Redis maintained by Snap (the company behind Snapchat). It is fully compatible with Redis and adds multi-threading, active-active replication (multi-master), FLASH storage tiering, and subkey expiration. KeyDB's active-active replication is particularly interesting for HA — two KeyDB instances can accept writes simultaneously and replicate to each other, providing zero-downtime failover.
# keydb.conf — Multi-threaded configuration with active replication
# Use 4 threads for command processing
server-threads 4
bind 0.0.0.0
port 6379
requirepass strong_password
masterauth strong_password
# Active-active replication (multi-master)
active-replica yes
# Bidirectional: each node replicates from its peer
replicaof peer-host 6379
# On the peer node, configure the reverse:
# replicaof this-host 6379
# FLASH storage tiering (for datasets larger than RAM)
# storage-provider flash /mnt/flash-storage 100
# maxmemory 16gb
# Will keep hot data in RAM and spill cold data to SSD
# SubKey expiration (a KeyDB-specific command; Redis 7.4 later added
# hash-field TTLs via HEXPIRE)
# Allows setting TTL on hash fields, not just top-level keys
# EXPIREMEMBER myhash field1 3600
KeyDB is a strong choice when you need multi-master replication for geographic distribution or zero-downtime maintenance, or when you need higher throughput than single-threaded Redis can provide but want to stay closer to the Redis codebase than Dragonfly.
Performance Tuning
Pipelining
Pipelining batches multiple commands into a single network round-trip, dramatically reducing latency for bulk operations. Instead of waiting for each response before sending the next command, the client sends all commands at once and reads all responses together.
# Python — Pipelining with redis-py
import redis
import time

r = redis.Redis(host='10.0.1.10', port=6379, password='password', decode_responses=True)

# Without pipelining: 1000 round-trips
start = time.time()
for i in range(1000):
    r.set(f'key:{i}', f'value:{i}')
print(f'Without pipeline: {time.time() - start:.3f}s')

# With pipelining: 1 round-trip for 1000 commands
start = time.time()
pipe = r.pipeline(transaction=False)
for i in range(1000):
    pipe.set(f'key:{i}', f'value:{i}')
pipe.execute()
print(f'With pipeline: {time.time() - start:.3f}s')
# Typically 5-10x faster
Lua Scripting
Lua scripts execute atomically on the Redis server, eliminating round-trips for complex operations and guaranteeing that no other command executes between the script's operations. In Redis Cluster, ensure all keys accessed by a Lua script reside in the same hash slot using hash tags.
# Lua script for atomic rate limiting
# KEYS[1] = rate limit key
# ARGV[1] = max requests
# ARGV[2] = window in seconds
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0  -- Rate limited
end
return 1  -- Allowed
# Load and execute
redis-cli -a password EVAL "\
local current = redis.call('INCR', KEYS[1]) \
if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end \
if current > tonumber(ARGV[1]) then return 0 end \
return 1" 1 ratelimit:user:123 100 60
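The script's fixed-window semantics (INCR, set the TTL on the first hit, reject past the limit) can be mirrored in plain Python to unit-test limit values before deploying the Lua version. A sketch with a dict standing in for Redis; the class and method names are our own:

```python
# Sketch: the same fixed-window rate limit as the Lua script above, mirrored
# in pure Python. A dict stands in for Redis; class/method names are ours.
import time

class FixedWindowLimiter:
    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.counters = {}  # key -> (count, window_expiry)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        count, expiry = self.counters.get(key, (0, 0.0))
        if now >= expiry:  # window elapsed, like the Lua key's TTL firing
            count, expiry = 0, now + self.window_s
        count += 1  # INCR
        self.counters[key] = (count, expiry)
        return count <= self.limit  # Lua returns 1 (allowed) / 0 (limited)

limiter = FixedWindowLimiter(limit=3, window_s=60)
results = [limiter.allow("user:123", now=t) for t in (0, 1, 2, 3, 61)]
print(results)  # [True, True, True, False, True]
```

The fourth call inside the window is rejected; at t=61 the window has expired and the counter resets, matching the EXPIRE behaviour of the Lua version.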
Memory Optimisation
# redis.conf — Memory optimisation settings
# Use ziplist encoding for small hashes, lists, sorted sets
hash-max-listpack-entries 128
hash-max-listpack-value 64
list-max-listpack-size -2 # 8KB per node
zset-max-listpack-entries 128
zset-max-listpack-value 64
set-max-intset-entries 512
# Lazy freeing (avoid blocking on large key deletion)
lazyfree-lazy-eviction yes
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-user-del yes
lazyfree-lazy-user-flush yes
# jemalloc tuning
# MALLOC_CONF="background_thread:true,dirty_decay_ms:5000,muzzy_decay_ms:5000"
# Analyse memory usage
redis-cli -a password MEMORY DOCTOR
redis-cli -a password MEMORY STATS
redis-cli -a password --bigkeys
redis-cli -a password --memkeys
Common Failure Scenarios and Troubleshooting
Scenario 1: Master Fails, Sentinel Promotes Replica
# Diagnosis
redis-cli -p 26379 SENTINEL master mymaster
# Check: flags should show 's_down' or 'o_down' if master is unreachable
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
# Returns the current master address (should be the promoted replica)
# Check Sentinel logs for failover events
tail -f /var/log/redis/sentinel.log
# Look for: +sdown, +odown, +try-failover, +elected-leader,
# +failover-state-select-slave, +selected-slave,
# +failover-state-send-slaveof-noone, +failover-end
Scenario 2: Split-Brain with Two Masters
# Diagnosis: check if min-replicas-to-write is configured
redis-cli -a password CONFIG GET min-replicas-to-write
redis-cli -a password CONFIG GET min-replicas-max-lag
# Prevention: configure min-replicas on all masters
redis-cli -a password CONFIG SET min-replicas-to-write 1
redis-cli -a password CONFIG SET min-replicas-max-lag 10
# If split-brain occurred: identify the stale master
# Compare replication offsets as a heuristic only: once histories diverge,
# each master has its own master_replid, and writes accepted on the losing
# side are discarded when it is reattached as a replica
redis-cli -h master1-ip -a password INFO replication | grep master_repl_offset
redis-cli -h master2-ip -a password INFO replication | grep master_repl_offset
# Force the stale master to become a replica
redis-cli -h stale-master-ip -a password REPLICAOF correct-master-ip 6379
Scenario 3: Full Resync Storm After Network Partition
# Diagnosis: check replication backlog
redis-cli -a password INFO replication | grep repl_backlog
# If repl_backlog_first_byte_offset is ahead of replica's offset, full sync triggers
# Prevention: increase backlog size
redis-cli -a password CONFIG SET repl-backlog-size 512mb
# Monitor for full syncs
redis-cli -a password INFO stats | grep sync_full
redis-cli -a password INFO stats | grep sync_partial_ok
redis-cli -a password INFO stats | grep sync_partial_err
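The backlog size needed to keep resyncs partial follows directly from write throughput and the longest partition you want to ride out. A sketch of that arithmetic; the 2x safety factor is our own assumption, not a Redis default:

```python
# Sketch: size repl-backlog-size so replicas can partial-resync after a
# disconnect. The backlog must hold all writes produced during the outage;
# the 2x safety factor is an assumption, not a Redis default.

def backlog_size_mb(write_mb_per_sec: float, max_disconnect_s: int, safety: float = 2.0) -> float:
    return write_mb_per_sec * max_disconnect_s * safety

# e.g. 5MB/s of write traffic, tolerate 60s partitions
print(f"repl-backlog-size {backlog_size_mb(5, 60):.0f}mb")
```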
Scenario 4: Memory Exhaustion and OOM Kill
# Diagnosis
redis-cli -a password INFO memory
# Check: used_memory vs maxmemory, mem_fragmentation_ratio
redis-cli -a password MEMORY DOCTOR
# Returns advice on memory issues
# Prevention: set proper maxmemory and eviction
redis-cli -a password CONFIG SET maxmemory 12gb
redis-cli -a password CONFIG SET maxmemory-policy allkeys-lru
# Find large keys consuming memory
redis-cli -a password --bigkeys
redis-cli -a password --memkeys --memkeys-samples 100
# Emergency: manually evict keys
redis-cli -a password SCAN 0 COUNT 1000 TYPE string
# Identify and DEL unnecessary large keys
Scenario 5: Slow Commands Blocking Replication
# Diagnosis: check slowlog
redis-cli -a password SLOWLOG GET 20
redis-cli -a password SLOWLOG LEN
# Check for blocking commands
redis-cli -a password CLIENT LIST | grep -E 'cmd=(keys|sort|smembers)'
# Prevention: configure slowlog threshold
redis-cli -a password CONFIG SET slowlog-log-slower-than 10000 # 10ms
redis-cli -a password CONFIG SET slowlog-max-len 256
# Rename dangerous commands (redis.conf directives, applied at restart;
# on Redis 7+ prefer ACL rules such as -@dangerous instead)
rename-command KEYS ""
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command DEBUG ""
Capacity Planning and Scaling Strategies
Capacity planning for Redis HA involves estimating memory requirements, network bandwidth, and CPU utilisation across master and replica nodes. The key metrics to plan around are dataset size, operations per second, average key/value size, and replication overhead.
# Capacity estimation formulas
# Memory per node:
# Base dataset size (use redis-cli DBSIZE and MEMORY USAGE on a sample)
# + Replication output buffer: ~64MB per replica
# + AOF rewrite buffer: ~64MB during rewrites
# + Copy-on-write overhead during BGSAVE: up to 2x during heavy writes
# + Client output buffers: ~1KB per client
# + OS overhead: ~1-2GB
# Rule of thumb: maxmemory = 75% of available RAM
# Network bandwidth:
# Replication: write_throughput_bytes * num_replicas
# Client traffic: ops_per_sec * avg_response_size
# Full sync: dataset_size (one-time during replica bootstrap or failover)
# Example sizing for 20GB dataset, 100K ops/sec:
# RAM per node: 20GB data + 4GB buffers + 2GB OS = 26GB -> 32GB node (75% = 24GB maxmemory)
# CPU: 1 core handles ~100K ops/sec for simple commands (GET/SET)
# Network: 100K ops * 1KB avg = 100MB/s client + 50MB/s replication = 150MB/s per master
# Scaling decision tree:
# Need more read throughput? -> Add replicas (up to 5 per master)
# Need more write throughput? -> Redis Cluster (add shards)
# Need more memory? -> Redis Cluster (distribute dataset across shards)
# Need lower latency? -> Reduce network hops (co-locate, use unix sockets)
# Need global distribution? -> Multi-region replication or Redis Enterprise Active-Active
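The sizing walk-through above can be scripted so the arithmetic stays consistent across node classes. A sketch encoding the same rules of thumb (64MB buffers, ~1KB per client, 2GB OS overhead, the 75% maxmemory rule); the 20% copy-on-write fraction is our own assumption standing in for "up to 2x during heavy writes":

```python
# Sketch: per-node RAM estimate using the rules of thumb above. Constants
# (64MB buffers, ~1KB/client, 2GB OS overhead, 75% maxmemory rule) come
# from this section; the 20% copy-on-write headroom is an assumption.
GB = 1024 ** 3
MB = 1024 ** 2

def estimate_node_ram_gb(dataset_gb, replicas=2, clients=1000,
                         cow_fraction=0.2, os_overhead_gb=2):
    buffers = replicas * 64 * MB        # replication output buffers
    buffers += 64 * MB                  # AOF rewrite buffer
    buffers += clients * 1024           # client output buffers (~1KB each)
    cow = dataset_gb * GB * cow_fraction
    return (dataset_gb * GB + cow + buffers + os_overhead_gb * GB) / GB

def required_node_size_gb(dataset_gb, **kw):
    need = estimate_node_ram_gb(dataset_gb, **kw)
    for size in (8, 16, 32, 64, 128, 256):
        # maxmemory (75% of node RAM) must also cover the dataset itself
        if size * 0.75 >= dataset_gb and size >= need:
            return size
    raise ValueError("dataset too large for one node: shard with Redis Cluster")

# The example from the text: 20GB dataset -> ~26GB need -> 32GB node
print(f"need ~{estimate_node_ram_gb(20):.1f}GB, use a {required_node_size_gb(20)}GB node")
```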
Horizontal Scaling with Redis Cluster
# Add shards to an existing Redis Cluster
# 1. Start new Redis nodes
redis-server /etc/redis/new-master.conf
redis-server /etc/redis/new-replica.conf
# 2. Add the new master to the cluster
redis-cli --cluster add-node new-master:7000 existing-node:7000 -a password
# 3. Add the new replica to follow the new master
redis-cli --cluster add-node new-replica:7000 existing-node:7000 \
--cluster-slave --cluster-master-id NEW_MASTER_ID -a password
# 4. Reshard slots to the new master
redis-cli --cluster reshard existing-node:7000 \
--cluster-from all --cluster-to NEW_MASTER_ID \
--cluster-slots 4096 --cluster-yes -a password
# 5. Verify the new slot distribution
redis-cli -c -h existing-node -p 7000 -a password CLUSTER SLOTS
# Remove a shard (scale down)
# 1. Reshard all slots away from the node
redis-cli --cluster reshard existing-node:7000 \
--cluster-from REMOVING_NODE_ID --cluster-to TARGET_NODE_ID \
--cluster-slots 5461 --cluster-yes -a password
# 2. Remove the empty node
redis-cli --cluster del-node existing-node:7000 REMOVING_NODE_ID -a password
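Resharding moves hash slots, which clients compute as CRC16(key) mod 16384, with hash tags in braces pinning related keys to one slot (which is what makes multi-key operations and Lua scripts work in Cluster). A minimal sketch of the slot function using the XMODEM CRC16 variant from the cluster specification:

```python
# Sketch: Redis Cluster key-to-slot mapping, per the cluster specification:
# CRC16 (XMODEM polynomial 0x1021) of the key, or of its {hash tag} when a
# non-empty tag is present, modulo 16384 slots.

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    s = key.find("{")
    if s != -1:
        e = key.find("}", s + 1)
        if e > s + 1:  # non-empty tag: hash only the tag contents
            key = key[s + 1:e]
    return crc16_xmodem(key.encode()) % 16384

# Hash tags force related keys into the same slot (and thus the same shard)
print(key_hash_slot("{user1000}.following"), key_hash_slot("{user1000}.followers"))
```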
Vertical Scaling Considerations
# When to scale vertically vs horizontally:
# Scale UP (bigger instances) when:
# - Dataset fits in single-node memory
# - Workload uses multi-key operations (MGET, SUNION, Lua across keys)
# - Operational simplicity is more important than cost efficiency
# - Using Redis modules that don't support Cluster mode well
# Scale OUT (more shards) when:
# - Dataset exceeds single-node memory
# - Write throughput exceeds single-thread capacity (~200K ops/sec)
# - You need per-shard isolation for multi-tenant workloads
# - Cost per GB of RAM is a concern (many smaller nodes vs few large ones)
# Cloud instance recommendations:
# AWS: cache.r7g.xlarge (4 vCPU, 26GB) to cache.r7g.16xlarge (64 vCPU, 419GB)
# Azure: P1 (6GB) to P5 (120GB) per shard
# GCP: 5GB to 300GB per instance (Memorystore Standard)
# Bare metal:
# CPU: 2-4 cores dedicated to Redis (single-threaded, but background tasks use extra cores)
# RAM: 32-128GB per node (NVMe for swap-as-last-resort)
# Network: 10Gbps minimum, 25Gbps for large datasets
# Storage: NVMe SSD for AOF/RDB persistence (IOPS matters for fsync)
Conclusion
Redis high availability is not a single configuration choice — it is a comprehensive system design that spans replication topology, failure detection, automatic failover, client configuration, persistence strategy, memory management, security, monitoring, and operational procedures. The right HA architecture depends on your specific requirements.
For most applications, Redis Sentinel with three Sentinel instances monitoring a master and two replicas provides a proven, battle-tested HA solution with automatic failover in under 30 seconds. When you need horizontal scaling beyond a single master's throughput or memory capacity, Redis Cluster distributes the dataset across multiple shards while maintaining built-in failover per shard. For Kubernetes environments, operators like Spotahome and OpsTree encode operational best practices into declarative CRDs, while Redis Enterprise provides the most feature-rich option with Active-Active geo-replication and module support.
Cloud-managed services — AWS ElastiCache, Azure Cache for Redis, and GCP Memorystore — eliminate the operational burden of running Redis infrastructure but come with reduced flexibility and higher costs at scale. For organisations with bare metal infrastructure, k3s with Rancher and Longhorn provides a fully open-source, cloud-independent alternative that delivers enterprise-grade HA without vendor lock-in.
Alternatives like Dragonfly and KeyDB are worth evaluating for specific use cases — Dragonfly for raw multi-threaded throughput on large machines, and KeyDB for active-active multi-master replication. Both are Redis-compatible and can serve as drop-in replacements in many scenarios.
Regardless of which architecture you choose, the operational fundamentals remain constant: configure proper persistence (hybrid RDB + AOF), enforce min-replicas-to-write to prevent split-brain data loss, size your replication backlog for your write volume, encrypt all traffic with TLS, enforce least-privilege access with ACLs, monitor replication lag and memory usage with Prometheus and Grafana, back up RDB snapshots to durable off-site storage, and — most critically — test your failover regularly. A failover system that has never been tested is a system that does not work. Run monthly failover drills, inject failures with chaos engineering tools, and measure your actual recovery time. The confidence you gain from systematic testing is what separates a Redis deployment that survives production incidents from one that turns a server failure into a company-wide outage.