Runbook: Database Failover

Overview

This runbook covers the process of failing over to a standby database when the primary database fails.

Symptoms

  • Primary database unreachable
  • Replication lag increasing indefinitely
  • Database corruption detected
  • Catastrophic hardware failure

Impact

  • Complete service outage
  • Data writes blocked
  • Potential data loss (depending on replication lag)

Detection

  • Alert: DatabasePrimaryDown
  • Alert: DatabaseReplicationLagHigh
  • Dashboard: Database Health

Pre-Failover Checks

1. Verify Primary is Down

# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432

# Check from multiple locations
for host in api-1 api-2 worker-1; do
  ssh $host "pg_isready -h primary.db.tenki.cloud"
done

2. Check Replication Status

-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;

3. Assess Data Loss Risk

  • Note the last transaction timestamp (see the query sketch after this list)
  • Document replication lag
  • Make go/no-go decision based on business impact
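
A minimal sketch of capturing this on the standby with standard catalog functions (PostgreSQL 10+ names; run in psql against the standby):

-- On standby: record what was last replayed before promotion
SELECT pg_last_xact_replay_timestamp()                 AS last_replayed_txn,
       pg_last_wal_replay_lsn()                        AS last_replayed_lsn,
       now() - pg_last_xact_replay_timestamp()         AS approx_data_loss_window;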

Failover Process

1. Stop All Application Traffic

# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production

# Verify no active connections (run the following in psql)
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
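
If non-idle sessions linger after the scale-down and the old primary is still reachable, one option (a sketch; requires superuser or pg_signal_backend membership) is to terminate them explicitly:

-- Terminate remaining non-idle sessions, excluding the current one
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid <> pg_backend_pid();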

2. Promote Standby

# On standby server
pg_ctl promote -D /var/lib/postgresql/data

# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby

3. Update Connection Strings

# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"

# Or update environment variables
kubectl set env deployment/api deployment/worker \
  DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
  -n production

4. Verify New Primary

-- Check if accepting writes
SELECT pg_is_in_recovery();  -- Should return false

-- Test write
INSERT INTO health_check (timestamp) VALUES (now());

5. Resume Application Traffic

# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production

# Monitor for errors
kubectl logs -f deployment/api -n production

Post-Failover Tasks

1. Immediate

  • Monitor application health (spot checks sketched below)
  • Check for data inconsistencies
  • Communicate status to stakeholders
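
A few quick spot checks, sketched here reusing the hosts, credentials, and health_check table from the examples above:

# Pod health and recent application errors
kubectl get pods -n production
kubectl logs deployment/api -n production --since=10m | grep -i error

# Confirm writes are landing on the new primary
psql -h standby.db.tenki.cloud -U user -d tenki \
  -c "SELECT max(timestamp) FROM health_check;"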

2. Within 1 Hour

  • Set up new standby from old primary (if recoverable; see the pg_basebackup sketch after this list)
  • Update monitoring to reflect new topology
  • Document timeline and impact
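
For the new standby, a minimal sketch using pg_basebackup on the old primary host, assuming streaming replication on PostgreSQL 12+ and a replication role named repl (both assumptions; adjust service name, paths, and credentials):

# On the old primary host: stop PostgreSQL, clear the data directory, re-seed from the new primary
systemctl stop postgresql
rm -rf /var/lib/postgresql/data/*
pg_basebackup -h standby.db.tenki.cloud -U repl \
  -D /var/lib/postgresql/data -R -X stream -P
systemctl start postgresql

The -R flag writes standby.signal and primary_conninfo so the node starts up as a replica of the new primary.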

3. Within 24 Hours

  • Root cause analysis
  • Update disaster recovery procedures
  • Test backup restoration process
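
For the restore test, swap in whatever backup tooling production actually uses; as an illustrative sketch only, a logical dump/restore round trip into a scratch database looks like this:

# Dump, restore into a throwaway database, and sanity-check a known table
pg_dump -h standby.db.tenki.cloud -U user -d tenki -Fc -f /tmp/tenki.dump
createdb -h standby.db.tenki.cloud -U user tenki_restore_test
pg_restore -h standby.db.tenki.cloud -U user -d tenki_restore_test /tmp/tenki.dump
psql -h standby.db.tenki.cloud -U user -d tenki_restore_test -c "SELECT count(*) FROM health_check;"
dropdb -h standby.db.tenki.cloud -U user tenki_restore_test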

Rollback Procedure

If the failover was premature or the primary recovers:

  1. Stop applications again
  2. Ensure data consistency (see the comparison queries after this list)
    • Compare transaction IDs and WAL positions on both nodes
    • Check for split-brain scenarios
  3. Resync if needed (stop PostgreSQL on the node being rewound first; pg_rewind requires a stopped target)
    pg_rewind --target-pgdata=/var/lib/postgresql/data \
              --source-server="host=primary.db.tenki.cloud"
    
  4. Switch back to primary
  5. Resume traffic
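
For step 2, a sketch of the comparison, assuming psql access to both nodes (on a node still in recovery, use pg_last_wal_replay_lsn() instead of pg_current_wal_lsn()):

-- Run on the recovered old primary and on the promoted standby, then compare
SELECT pg_current_wal_lsn() AS current_lsn,
       timeline_id
FROM pg_control_checkpoint();

-- Writes accepted on both nodes after the divergence point indicate split-brain;
-- reconcile those rows manually before switching back.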

Prevention

  • Regular failover drills
  • Monitor replication lag closely
  • Implement automatic failover with proper fencing
  • Use synchronous replication for critical data
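
For the last point, a minimal sketch of turning on synchronous replication from psql; the standby name standby1 is illustrative and must match the standby's application_name in primary_conninfo:

-- Require at least one named standby to acknowledge each commit
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();

The trade-off: if no listed standby is reachable, commits block, so pair this with the fencing and automatic-failover tooling mentioned above.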