Runbook: High API Latency

Overview

This runbook covers troubleshooting and resolving high API latency issues.

Symptoms

  • p95 latency > 500ms
  • User reports of slow loading
  • Timeout errors in client applications
  • Increased error rates due to timeouts

Impact

  • Poor user experience
  • Increased error rates
  • Potential cascading failures
  • Customer complaints

Detection

  • Alert: APILatencyHigh
  • Threshold: p95 > 500ms for 5 minutes
  • Dashboard: API Performance
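
If the alert fires and you want to confirm the number by hand, the same p95 can be queried directly from the metrics backend. A minimal sketch, assuming Prometheus is reachable at the usual in-cluster address and latency is recorded in an http_request_duration_seconds histogram (adjust both names for your stack):

# Spot-check current p95 latency (Prometheus and metric name are assumptions)
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'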

Response

Immediate Actions

  1. Check current latency (a per-endpoint query sketch follows this list)

    • View p50, p95, p99 latencies
    • Identify affected endpoints
    • Check error rates
  2. Verify system health

    # Check pod status
    kubectl get pods -n production
    
    # Check resource usage
    kubectl top pods -n production
    
    # Check recent deployments
    kubectl rollout history deployment/api -n production
    
  3. Enable detailed logging (temporarily)

    kubectl set env deployment/api LOG_LEVEL=debug -n production
    

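For step 1 above, the same histogram can be broken down per route to see which endpoints are driving the latency; as before, the metric and label names (http_request_duration_seconds, handler) are assumptions to adapt to your instrumentation.

# p95 latency per endpoint over the last 5 minutes (names are assumptions)
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))'
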
Diagnosis

  1. Database performance (quick checks for this and the next item follow this list)

    • Check slow query log
    • Review connection pool status
    • Look for lock contention
  2. External dependencies

    • GitHub API response times
    • Payment processor latency
    • CDN performance
  3. Application issues

    • Memory leaks (increasing memory usage)
    • CPU bottlenecks
    • Inefficient algorithms
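
Quick checks that go with items 1 and 2 above, sketched assuming a PostgreSQL database and an HTTP dependency you can hit with curl; the connection string is a placeholder:

# Connection pool state and lock contention (assumes PostgreSQL)
psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
psql "$DATABASE_URL" -c "SELECT count(*) AS waiting FROM pg_locks WHERE NOT granted;"

# Rough external-dependency timing
curl -o /dev/null -s -w 'total: %{time_total}s\n' https://api.github.com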

Common Causes and Fixes

1. Database Queries

Symptom: High database CPU, slow queries

Fix:

-- Find slow queries
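-- (on PostgreSQL 13+ these columns are named mean_exec_time and total_exec_time)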
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Add a missing index (table and column names are placeholders)
CREATE INDEX CONCURRENTLY idx_mytable_mycolumn ON mytable(mycolumn);

2. Cache Misses

Symptom: High cache miss rate

Fix:

  • Warm up caches after deployment
  • Increase cache TTL for stable data
  • Review cache key generation
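
If the cache is Redis, the hit/miss counters give a quick read on the miss rate, and TTLs can be inspected or extended per key; the key name below is a placeholder:

# Cache hit/miss counters (assumes Redis)
redis-cli info stats | grep -E 'keyspace_(hits|misses)'

# Inspect and extend the TTL of a hot key (key name is a placeholder)
redis-cli ttl cache:user:1234
redis-cli expire cache:user:1234 3600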

3. Resource Constraints

Symptom: High CPU/memory usage

Fix:

# Scale horizontally
kubectl scale deployment api --replicas=6 -n production

# Or scale vertically (requires restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production
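
If the load is sustained rather than a short spike, an autoscaler avoids repeating the manual scale command; the thresholds below are illustrative:

# Optional: scale automatically on CPU instead of by hand
kubectl autoscale deployment api --min=4 --max=12 --cpu-percent=70 -n production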

4. Inefficient Code

Symptom: Specific endpoints consistently slow

Fix:

  • Profile the endpoint
  • Optimize algorithms
  • Implement pagination
  • Add caching layer
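
As a concrete example of the pagination point, replacing an unbounded query with keyset pagination keeps response size and query time flat as tables grow; the table and column names below are placeholders:

# Keyset pagination sketch (table/column names are placeholders)
psql "$DATABASE_URL" -c "
  SELECT id, name, created_at
  FROM items
  WHERE id > 10000   -- last id the client has already seen
  ORDER BY id
  LIMIT 100;"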

Recovery

  1. Quick wins

    • Increase cache TTLs
    • Scale out services
    • Enable read replicas
  2. Rollback if needed

    kubectl rollout undo deployment/api -n production
    
  3. Communicate status

    • Update status page
    • Notify affected customers
    • Post in #incidents channel

Prevention

  • Load testing before major releases
  • Gradual rollouts with canary deployments
  • Query performance regression tests
  • Capacity planning reviews
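
For the load-testing item, even a short constant-rate run against staging before a release catches most latency regressions. A sketch assuming the hey load generator and a placeholder URL:

# ~500 req/s for 2 minutes (assumes `hey`; URL is a placeholder)
hey -z 2m -c 50 -q 10 https://staging.example.com/api/v1/items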

Monitoring

Key metrics to watch:

  • API latency percentiles
  • Database query time
  • Cache hit rates
  • Resource utilization
  • Error rates