# Runbook: High Database CPU

## Alert Details

- Alert Name: HighDatabaseCPU
- Severity: P2
- Team: Backend/Platform
- Last Updated: 2025-06-12
## Symptoms

- Database CPU usage > 80% for 5+ minutes
- API response times > 500ms
- Increased error rates in logs
- Grafana dashboard shows CPU spike
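
To confirm the spike from the CLI rather than Grafana, a quick check (assumes metrics-server is available in the cluster):

```bash
# Current CPU/memory usage for the database pod
kubectl top pod postgres-0
```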
## Quick Diagnostics

```bash
# 1. Check current connections
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;"

# 2. Find slow queries (requires pg_stat_statements; see note below)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
  substring(query, 1, 50) AS query_start,
  calls,
  mean_exec_time,
  total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;"
```
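
If the slow-query command fails with "relation pg_stat_statements does not exist", the extension is not enabled; a quick check, assuming the same postgres-0 pod and tenki database as above:

```bash
# Empty result means the extension is not installed; enabling it also
# requires shared_preload_libraries = 'pg_stat_statements' and a restart
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_stat_statements';"
```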
```bash
# 3. Check for locks
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
  pid,
  usename,
  pg_blocking_pids(pid) AS blocked_by,
  query_start,
  substring(query, 1, 50) AS query
FROM pg_stat_activity
WHERE pg_blocking_pids(pid)::text != '{}';"
```
## Resolution Steps

### 1. Immediate Mitigation (5 mins)

```bash
# Scale up API to reduce per-instance load
kubectl scale deployment/engine --replicas=10

# Kill long-running queries (>5 minutes), excluding this session
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 minutes'
  AND pid <> pg_backend_pid()
  AND query NOT LIKE '%pg_stat_activity%';"
```
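
Where load allows, a gentler first pass is pg_cancel_backend, which cancels the running query without dropping the client connection; same filters as the terminate query above:

```bash
# Cancel (rather than terminate) queries running longer than 5 minutes
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 minutes'
  AND pid <> pg_backend_pid()
  AND query NOT LIKE '%pg_stat_activity%';"
```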
### 2. Identify Root Cause (10 mins)

Check recent deployments:

```bash
kubectl get deployments -o wide | grep engine
kubectl rollout history deployment/engine
```

Review the slow query log:

```bash
kubectl logs postgres-0 | grep "duration:" | tail -50
```
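
"duration:" lines only appear if slow-statement logging is enabled; worth confirming the threshold (the setting is standard PostgreSQL, its value here is environment-specific):

```bash
# -1 disables slow-statement logging; e.g. 100 logs statements over 100 ms
kubectl exec -it postgres-0 -- psql -U postgres -c "SHOW log_min_duration_statement;"
```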
Check for missing indexes:

```sql
-- Run on affected tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflow_runs
WHERE status = 'pending'
  AND created_at > NOW() - INTERVAL '1 hour';
```
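
A Seq Scan node over a large table in the plan above is the usual sign of a missing index. For a broader sweep, pg_stat_user_tables shows which tables are sequentially scanned most:

```bash
# Tables with high seq_scan and low idx_scan counts are index candidates
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 10;"
```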
### 3. Fix Implementation

If a missing index:

```sql
-- CONCURRENTLY avoids blocking writes on large tables,
-- but is slower and cannot run inside a transaction
CREATE INDEX CONCURRENTLY idx_workflow_runs_status_created
ON workflow_runs (status, created_at)
WHERE status IN ('pending', 'running');
```
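
CREATE INDEX CONCURRENTLY leaves an invalid index behind if it fails partway, so confirm the index built and is being used (the index name matches the one created above):

```bash
# indisvalid must be true; idx_scan > 0 means the planner is using it
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT i.indexrelid::regclass AS index, i.indisvalid, s.idx_scan
FROM pg_index i
JOIN pg_stat_user_indexes s ON s.indexrelid = i.indexrelid
WHERE i.indexrelid = 'idx_workflow_runs_status_created'::regclass;"
```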
If a bad query shipped in a recent deploy:

```bash
# Roll back to the previous version
kubectl rollout undo deployment/engine

# Or deploy a hotfix
git checkout main
git pull
# Fix query
git commit -am "fix: optimize workflow query"
git push
# Deploy via CI/CD
```
### 4. Verify Resolution

```bash
# Monitor pod CPU (should drop within 5 mins; requires metrics-server)
watch -n 5 "kubectl top pod postgres-0"

# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.tenki.lab/health

# Verify no more long-running queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND query_start < now() - interval '1 minute';"
```
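
The -w flag reads its format from curl-format.txt; if that file is missing on the host, a minimal version using standard curl write-out variables:

```bash
# Writes a timing-format file for curl's -w flag
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
```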
## Long-term Prevention

- Add a query timeout to the engine configuration (see the sketch after this list)
- Set up query monitoring in Datadog/NewRelic
- Regular ANALYZE on high-traffic tables
- Consider read replicas for analytics queries
- Implement connection pooling with PgBouncer
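
A database-side statement timeout is the quickest guardrail while the engine config change lands; a minimal sketch, where the 30-second ceiling is an assumption to tune:

```bash
# New sessions against tenki inherit a 30s statement timeout
kubectl exec -it postgres-0 -- psql -U postgres -c "ALTER DATABASE tenki SET statement_timeout = '30s';"
```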
## Escalation Path

- 15 mins: If CPU still high → Page backend on-call
- 30 mins: If impacting customers → Incident Commander
- 45 mins: If data corruption risk → CTO
## Related Runbooks

## Post-Incident

- Create incident report
- Add missing monitoring
- Update this runbook with findings
- Schedule postmortem if customer impact