Runbook: High Database CPU

Alert Details

  • Alert Name: HighDatabaseCPU
  • Severity: P2
  • Team: Backend/Platform
  • Last Updated: 2025-06-12

Symptoms

  • Database CPU usage > 80% for 5+ minutes
  • API response times > 500ms
  • Increased error rates in logs
  • Grafana dashboard shows CPU spike
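
A quick way to confirm the error-rate symptom from the engine side (the grep pattern is an assumption about the log format):

# Count error lines from one engine pod over the last 10 minutes
kubectl logs deployment/engine --since=10m | grep -ci error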

Quick Diagnostics

# 1. Check current connections
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT count(*), state
  FROM pg_stat_activity
  GROUP BY state;"

# 2. Find slow queries (mean_exec_time is in milliseconds)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT
    substring(query, 1, 50) as query_start,
    calls,
    mean_exec_time,
    total_exec_time
  FROM pg_stat_statements
  WHERE mean_exec_time > 100
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

# 3. Check for locks
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT
    pid,
    usename,
    pg_blocking_pids(pid) as blocked_by,
    query_start,
    substring(query, 1, 50) as query
  FROM pg_stat_activity
  WHERE pg_blocking_pids(pid)::text != '{}';"
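
The slow-query check in #2 assumes the pg_stat_statements extension is installed; a quick way to verify:

# Confirm pg_stat_statements is available
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT extname
  FROM pg_extension
  WHERE extname = 'pg_stat_statements';"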

Resolution Steps

1. Immediate Mitigation (5 mins)

# Scale up API to reduce per-instance load
kubectl scale deployment/engine --replicas=10
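
To confirm the extra replicas actually came up before moving on (same deployment name as above):

# Wait for the scale-out to complete
kubectl rollout status deployment/engine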

# Kill long-running queries (>5 minutes)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state != 'idle'
    AND query_start < now() - interval '5 minutes'
    AND query NOT LIKE '%pg_stat_activity%';"
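
If terminating sessions feels too heavy-handed, pg_cancel_backend cancels only the running query and keeps the connection open; a gentler variant of the same filter:

# Cancel (rather than terminate) long-running queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT pg_cancel_backend(pid)
  FROM pg_stat_activity
  WHERE state != 'idle'
    AND query_start < now() - interval '5 minutes'
    AND query NOT LIKE '%pg_stat_activity%';"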

2. Identify Root Cause (10 mins)

Check recent deployments:

kubectl get deployments -o wide | grep engine
kubectl rollout history deployment/engine
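
To see which image the engine is currently running (useful for correlating the CPU spike with a deploy):

kubectl describe deployment/engine | grep -i "image:"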

Review slow query log:

kubectl logs postgres-0 | grep "duration:" | tail -50
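
Entries only show up here if slow-query logging is enabled; check the current threshold first (-1 means nothing is logged):

kubectl exec -it postgres-0 -- psql -U postgres tenki -c "SHOW log_min_duration_statement;"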

Check for missing indexes:

-- Run on affected tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflow_runs
WHERE status = 'pending'
  AND created_at > NOW() - INTERVAL '1 hour';
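
A complementary check: tables with a high seq_scan count relative to idx_scan are often the ones missing an index:

kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT relname, seq_scan, idx_scan
  FROM pg_stat_user_tables
  ORDER BY seq_scan DESC
  LIMIT 10;"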

3. Fix Implementation

If missing index:

-- Create index (be careful on large tables)
CREATE INDEX CONCURRENTLY idx_workflow_runs_status_created
ON workflow_runs(status, created_at)
WHERE status IN ('pending', 'running');
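
After CREATE INDEX CONCURRENTLY, confirm the index ended up valid; a failed concurrent build leaves an invalid index behind that should be dropped and retried:

kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT indexrelid::regclass AS index_name, indisvalid
  FROM pg_index
  WHERE indexrelid::regclass::text = 'idx_workflow_runs_status_created';"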

If bad query from recent deploy:

# Rollback to previous version
kubectl rollout undo deployment/engine

# Or deploy hotfix
git checkout main
git pull
# Fix query
git commit -am "fix: optimize workflow query"
git push
# Deploy via CI/CD

4. Verify Resolution

# Monitor database pod CPU (should drop within 5 mins; requires metrics-server)
watch -n 5 "kubectl top pod postgres-0"

# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.tenki.lab/health

# Verify no more slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '1 minute';"

Long-term Prevention

  1. Add a query timeout to the engine configuration (see the sketch after this list)
  2. Set up query monitoring in Datadog/NewRelic
  3. Run ANALYZE regularly on high-traffic tables
  4. Consider read replicas for analytics queries
  5. Implement connection pooling with PgBouncer
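
For item 1, a minimal sketch of a server-side query timeout, assuming the engine connects as a dedicated database role named "engine" (adjust the role name and value to your setup):

# Cap query runtime for the engine's database role (hypothetical role name)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  ALTER ROLE engine SET statement_timeout = '30s';"

New connections pick this up; existing sessions keep their previous setting until they reconnect.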

Escalation Path

  1. 15 mins: If CPU still high → Page backend on-call
  2. 30 mins: If impacting customers → Incident Commander
  3. 45 mins: If data corruption risk → CTO

Post-Incident

  • Create incident report
  • Add missing monitoring
  • Update this runbook with findings
  • Schedule a postmortem if there was customer impact