Runbook: Database Failover
Overview
This runbook covers the process of failing over to a standby database in case of primary database failure.
Symptoms
- Primary database unreachable
- Replication lag increasing indefinitely
- Database corruption detected
- Catastrophic hardware failure
Impact
- Complete service outage
- Data writes blocked
- Potential data loss (depending on replication lag)
Detection
- Alert: DatabasePrimaryDown
- Alert: DatabaseReplicationLagHigh
- Dashboard: Database Health
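If you need to confirm alert state from a terminal, the Alertmanager v2 API can be queried directly. A minimal sketch, assuming Alertmanager is reachable at alertmanager.monitoring:9093 (the hostname and port are assumptions, not part of this runbook):
# Hypothetical Alertmanager address -- adjust to your environment
curl -sG 'http://alertmanager.monitoring:9093/api/v2/alerts' \
  --data-urlencode 'filter=alertname="DatabasePrimaryDown"' | jq '.[].status'
curl -sG 'http://alertmanager.monitoring:9093/api/v2/alerts' \
  --data-urlencode 'filter=alertname="DatabaseReplicationLagHigh"' | jq '.[].status'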
Pre-failover Checks
1. Verify Primary is Down
# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432
# Check from multiple locations
for host in api-1 api-2 worker-1; do
ssh $host "pg_isready -h primary.db.tenki.cloud"
done
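If pg_isready fails from every location, it helps to distinguish a dead host from a dead database process before deciding to fail over. A rough sketch, assuming SSH access to the primary host and a systemd-managed PostgreSQL service (both are assumptions):
# Is the host itself still up? (assumes ICMP is permitted)
ping -c 3 primary.db.tenki.cloud
# Is only the database process down? (service name is an assumption)
ssh primary.db.tenki.cloud "systemctl status postgresql --no-pager" || echo "host or service unreachable"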
2. Check Replication Status
-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;
3. Assess Data Loss Risk
- Note the last replayed transaction timestamp on the standby (captured by the query below)
- Document the current replication lag
- Make a go/no-go decision based on business impact
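One way to capture the data-loss picture in a single step is to record the standby's last received and replayed WAL positions together with the lag. A sketch using standard PostgreSQL functions, run against the standby (the -U postgres role is an assumption):
# Record this output in the incident timeline before promoting
psql -h standby.db.tenki.cloud -U postgres -c "
  SELECT pg_last_wal_receive_lsn()               AS last_received_lsn,
         pg_last_wal_replay_lsn()                AS last_replayed_lsn,
         pg_last_xact_replay_timestamp()         AS last_replayed_commit,
         now() - pg_last_xact_replay_timestamp() AS replication_lag;"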
Failover Process
1. Stop All Application Traffic
# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production
# Verify no active application connections remain (run the query below via psql)
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
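Scaling to zero is asynchronous, so it is worth waiting for the pods to actually terminate before promoting. A sketch, assuming the deployments label their pods app=api and app=worker (the label selectors are assumptions):
# Wait for application pods to terminate before touching the database
kubectl wait --for=delete pod -l app=api -n production --timeout=120s
kubectl wait --for=delete pod -l app=worker -n production --timeout=120s
# Confirm nothing is left running
kubectl get pods -n production -l 'app in (api,worker)'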
2. Promote Standby
# On standby server
pg_ctl promote -D /var/lib/postgresql/data
# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby
3. Update Connection Strings
# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"
# Or update environment variables
kubectl set env deployment/api deployment/worker \
DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
-n production
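Before moving on, confirm the new value actually landed in the pod templates; the pods pick it up when traffic resumes in step 5. A sketch, assuming DATABASE_URL lives in the first container of each pod spec (an assumption):
# Print the DATABASE_URL now baked into each deployment
kubectl get deployment api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DATABASE_URL")].value}'
kubectl get deployment worker -n production \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DATABASE_URL")].value}'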
4. Verify New Primary
-- Check if accepting writes
SELECT pg_is_in_recovery(); -- Should return false
-- Test write
INSERT INTO health_check (timestamp) VALUES (now());
5. Resume Application Traffic
# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production
# Monitor for errors
kubectl logs -f deployment/api -n production
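Log tailing catches crashes, but an end-to-end smoke test confirms writes are flowing through the new primary. A sketch, assuming the API exposes a health endpoint at https://api.tenki.cloud/healthz (the hostname and path are assumptions, not part of this runbook):
# Hypothetical health endpoint -- substitute your real smoke test
curl -fsS https://api.tenki.cloud/healthz && echo "API healthy"
# Watch for crash loops while traffic ramps back up
kubectl get pods -n production -w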
Post-Failover Tasks
1. Immediate
- Monitor application health
- Check for data inconsistencies
- Communicate status to stakeholders
2. Within 1 Hour
- Set up a new standby from the old primary, if it is recoverable (see the pg_basebackup sketch after this list)
- Update monitoring to reflect new topology
- Document timeline and impact
3. Within 24 Hours
- Root cause analysis
- Update disaster recovery procedures
- Test backup restoration process
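For the "set up new standby" task above, one common approach is to rebuild the old primary as a replica of the promoted server using pg_basebackup. A minimal sketch, run on the old primary host, assuming a replication role named replicator, a systemd-managed service, and the default data directory (all assumptions):
# On the old primary host: stop PostgreSQL, set aside the stale data, clone from the new primary
systemctl stop postgresql
mv /var/lib/postgresql/data /var/lib/postgresql/data.old
pg_basebackup -h standby.db.tenki.cloud -U replicator \
  -D /var/lib/postgresql/data -R -X stream -P
# -R writes standby.signal and primary_conninfo so the node starts as a replica
systemctl start postgresql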
Rollback Procedure
If the failover was premature or the original primary recovers:
- Stop applications again
- Ensure data consistency
- Compare transaction IDs / WAL positions on both servers (see the comparison sketch after this list)
- Check for split-brain scenarios (writes accepted on both servers)
- Resync if needed
pg_rewind --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=primary.db.tenki.cloud"
- Switch back to primary
- Resume traffic
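For the WAL-position and split-brain checks above, one approach is to ask each server whether it believes it is the primary and compare timelines and checkpoints from the control file. A sketch, assuming SSH access and the default data directory (both assumptions):
for host in primary.db.tenki.cloud standby.db.tenki.cloud; do
  echo "== $host"
  # false = the server is accepting writes (acting as primary)
  psql -h "$host" -tAc "SELECT pg_is_in_recovery();"
  # Timeline and latest checkpoint from the control file
  ssh "$host" "pg_controldata /var/lib/postgresql/data | grep -E 'TimeLineID|checkpoint location'"
done
If both servers report false, writes were accepted on both sides (split-brain) and the data must be reconciled before any resync.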
Prevention
- Regular failover drills
- Monitor replication lag closely
- Implement automatic failover with proper fencing
- Use synchronous replication for critical data
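For the synchronous-replication point, the relevant settings are synchronous_standby_names and synchronous_commit. A sketch of enabling them on the primary via ALTER SYSTEM, assuming the standby connects with application_name 'standby1' (an assumption):
# Run on the primary; 'standby1' must match the standby's application_name
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'standby1';"
psql -c "ALTER SYSTEM SET synchronous_commit = 'on';"
psql -c "SELECT pg_reload_conf();"
# Confirm the standby now shows sync_state = 'sync'
psql -c "SELECT application_name, sync_state FROM pg_stat_replication;"
Note that with a single synchronous standby, commits block if that standby becomes unavailable, which is why this is usually paired with the fencing and failover automation mentioned above.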