Runbook: Database Failover
Overview
This runbook covers the process of failing over to a standby database in case of primary database failure.
Symptoms
- Primary database unreachable
- Replication lag increasing indefinitely
- Database corruption detected
- Catastrophic hardware failure
Impact
- Complete service outage
- Data writes blocked
- Potential data loss (depending on replication lag)
Detection
- Alert: DatabasePrimaryDown
- Alert: DatabaseReplicationLagHigh
- Dashboard: Database Health
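If you need to confirm alert state from a terminal, the Alertmanager v2 API can be queried directly. A minimal sketch, assuming Alertmanager is reachable at alertmanager.monitoring:9093 (the hostname and port are assumptions, not part of this runbook):
# Hypothetical Alertmanager address -- adjust to your environment
curl -sG 'http://alertmanager.monitoring:9093/api/v2/alerts' \
  --data-urlencode 'filter=alertname="DatabasePrimaryDown"' | jq '.[].status'
curl -sG 'http://alertmanager.monitoring:9093/api/v2/alerts' \
  --data-urlencode 'filter=alertname="DatabaseReplicationLagHigh"' | jq '.[].status'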
Pre-failover Checks
1. Verify Primary is Down
# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432
# Check from multiple locations
for host in api-1 api-2 worker-1; do
ssh $host "pg_isready -h primary.db.tenki.cloud"
done
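If pg_isready fails from every location, it helps to distinguish a dead host from a dead database process before deciding to fail over. A rough sketch, assuming SSH access to the primary host and a systemd-managed PostgreSQL service (both are assumptions):
# Is the host itself still up? (assumes ICMP is permitted)
ping -c 3 primary.db.tenki.cloud
# Is only the database process down? (service name is an assumption)
ssh primary.db.tenki.cloud "systemctl status postgresql --no-pager" || echo "host or service unreachable"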
2. Check Replication Status
-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;
3. Assess Data Loss Risk
- Note the last replayed transaction timestamp on the standby (captured by the query below)
- Document the current replication lag
- Make a go/no-go decision based on business impact
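One way to capture the data-loss picture in a single step is to record the standby's last received and replayed WAL positions together with the lag. A sketch using standard PostgreSQL functions, run against the standby (the -U postgres role is an assumption):
# Record this output in the incident timeline before promoting
psql -h standby.db.tenki.cloud -U postgres -c "
  SELECT pg_last_wal_receive_lsn()               AS last_received_lsn,
         pg_last_wal_replay_lsn()                AS last_replayed_lsn,
         pg_last_xact_replay_timestamp()         AS last_replayed_commit,
         now() - pg_last_xact_replay_timestamp() AS replication_lag;"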
Failover Process
1. Stop All Application Traffic
# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production
# Verify no active application connections remain (run the query below via psql)
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
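Scaling to zero is asynchronous, so it is worth waiting for the pods to actually terminate before promoting. A sketch, assuming the deployments label their pods app=api and app=worker (the label selectors are assumptions):
# Wait for application pods to terminate before touching the database
kubectl wait --for=delete pod -l app=api -n production --timeout=120s
kubectl wait --for=delete pod -l app=worker -n production --timeout=120s
# Confirm nothing is left running
kubectl get pods -n production -l 'app in (api,worker)'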
2. Promote Standby
# On standby server
pg_ctl promote -D /var/lib/postgresql/data
# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby
3. Update Connection Strings
# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"
# Or update environment variables
kubectl set env deployment/api deployment/worker \
DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
-n production
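Before moving on, confirm the new value actually landed in the pod templates; the pods pick it up when traffic resumes in step 5. A sketch, assuming DATABASE_URL lives in the first container of each pod spec (an assumption):
# Print the DATABASE_URL now baked into each deployment
kubectl get deployment api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DATABASE_URL")].value}'
kubectl get deployment worker -n production \
  -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="DATABASE_URL")].value}'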
4. Verify New Primary
-- Check if accepting writes
SELECT pg_is_in_recovery(); -- Should return false
-- Test write
INSERT INTO health_check (timestamp) VALUES (now());
5. Resume Application Traffic
# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production
# Monitor for errors
kubectl logs -f deployment/api -n production
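Log tailing catches crashes, but an end-to-end smoke test confirms writes are flowing through the new primary. A sketch, assuming the API exposes a health endpoint at https://api.tenki.cloud/healthz (the hostname and path are assumptions, not part of this runbook):
# Hypothetical health endpoint -- substitute your real smoke test
curl -fsS https://api.tenki.cloud/healthz && echo "API healthy"
# Watch for crash loops while traffic ramps back up
kubectl get pods -n production -w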
Post-Failover Tasks
1. Immediate
- Monitor application health
- Check for data inconsistencies
- Communicate status to stakeholders
2. Within 1 Hour
- Set up a new standby from the old primary, if it is recoverable (see the pg_basebackup sketch after this list)
- Update monitoring to reflect new topology
- Document timeline and impact
3. Within 24 Hours
- Root cause analysis
- Update disaster recovery procedures
- Test backup restoration process
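For the "set up new standby" task above, one common approach is to rebuild the old primary as a replica of the promoted server using pg_basebackup. A minimal sketch, run on the old primary host, assuming a replication role named replicator, a systemd-managed service, and the default data directory (all assumptions):
# On the old primary host: stop PostgreSQL, set aside the stale data, clone from the new primary
systemctl stop postgresql
mv /var/lib/postgresql/data /var/lib/postgresql/data.old
pg_basebackup -h standby.db.tenki.cloud -U replicator \
  -D /var/lib/postgresql/data -R -X stream -P
# -R writes standby.signal and primary_conninfo so the node starts as a replica
systemctl start postgresql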
Rollback Procedure
If the failover was premature or the original primary recovers:
- Stop applications again
- Ensure data consistency
- Compare transaction IDs / WAL positions on both servers (see the comparison sketch after this list)
- Check for split-brain scenarios (writes accepted on both servers)
- Resync if needed
pg_rewind --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=primary.db.tenki.cloud"
- Switch back to primary
- Resume traffic
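For the WAL-position and split-brain checks above, one approach is to ask each server whether it believes it is the primary and compare timelines and checkpoints from the control file. A sketch, assuming SSH access and the default data directory (both assumptions):
for host in primary.db.tenki.cloud standby.db.tenki.cloud; do
  echo "== $host"
  # false = the server is accepting writes (acting as primary)
  psql -h "$host" -tAc "SELECT pg_is_in_recovery();"
  # Timeline and latest checkpoint from the control file
  ssh "$host" "pg_controldata /var/lib/postgresql/data | grep -E 'TimeLineID|checkpoint location'"
done
If both servers report false, writes were accepted on both sides (split-brain) and the data must be reconciled before any resync.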
Prevention
- Regular failover drills
- Monitor replication lag closely
- Implement automatic failover with proper fencing
- Use synchronous replication for critical data
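For the synchronous-replication point, the relevant settings are synchronous_standby_names and synchronous_commit. A sketch of enabling them on the primary via ALTER SYSTEM, assuming the standby connects with application_name 'standby1' (an assumption):
# Run on the primary; 'standby1' must match the standby's application_name
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'standby1';"
psql -c "ALTER SYSTEM SET synchronous_commit = 'on';"
psql -c "SELECT pg_reload_conf();"
# Confirm the standby now shows sync_state = 'sync'
psql -c "SELECT application_name, sync_state FROM pg_stat_replication;"
Note that with a single synchronous standby, commits block if that standby becomes unavailable, which is why this is usually paired with the fencing and failover automation mentioned above.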