# Runbook: High API Latency

## Overview

This runbook covers troubleshooting and resolving high API latency.
## Symptoms

- p95 latency > 500ms
- User reports of slow loading
- Timeout errors in client applications
- Increased error rates due to timeouts
## Impact

- Poor user experience
- Increased error rates
- Potential cascading failures
- Customer complaints
## Detection

- Alert: `APILatencyHigh`, triggered when p95 > 500ms for 5 minutes
- Dashboard: API Performance
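As a sketch, the `APILatencyHigh` alert above could be expressed as a Prometheus-style alerting rule. The metric name `http_request_duration_seconds_bucket` and the label set are assumptions about the instrumentation, not details confirmed by this runbook:

```yaml
groups:
  - name: api-latency
    rules:
      - alert: APILatencyHigh
        # Assumed histogram metric; substitute your API's actual metric name.
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API p95 latency above 500ms for 5 minutes"
```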
## Response

### Immediate Actions

1. Check current latency
   - View p50, p95, and p99 latencies
   - Identify affected endpoints
   - Check error rates

2. Verify system health

   ```bash
   # Check pod status
   kubectl get pods -n production

   # Check resource usage
   kubectl top pods -n production

   # Check recent deployments
   kubectl rollout history deployment/api -n production
   ```

3. Enable detailed logging (temporarily)

   ```bash
   kubectl set env deployment/api LOG_LEVEL=debug -n production
   ```

   Remember to revert `LOG_LEVEL` once diagnosis is complete; debug logging itself adds latency under load.
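If you need to sanity-check dashboard numbers against raw request durations, the percentile math in step 1 is easy to reproduce. A minimal nearest-rank sketch (the sample durations are illustrative, not real service data):

```python
def percentile(samples, p):
    """Nearest-rank percentile: p is in [0, 100], samples are raw durations."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank index: ceil(p/100 * n) - 1, clamped to a valid index.
    k = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
    return ordered[k]

durations_ms = [120, 95, 480, 210, 640, 150, 300, 880, 130, 175]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(durations_ms, p)}ms")
```

With these samples, p95 lands at 880ms, well above the 500ms threshold that fires `APILatencyHigh`.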
## Diagnosis

1. Database performance
   - Check the slow query log
   - Review connection pool status
   - Look for lock contention

2. External dependencies
   - GitHub API response times
   - Payment processor latency
   - CDN performance

3. Application issues
   - Memory leaks (steadily increasing memory usage)
   - CPU bottlenecks
   - Inefficient algorithms
## Common Causes and Fixes

### 1. Database Queries

Symptom: High database CPU, slow queries

Fix:

```sql
-- Find the slowest queries by mean execution time
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Add missing indexes without blocking writes
CREATE INDEX CONCURRENTLY idx_table_column ON table(column);
```
### 2. Cache Misses

Symptom: High cache miss rate

Fix:
- Warm up caches after deployment
- Increase cache TTLs for stable data
- Review cache key generation
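A longer TTL only helps if the cache tracks expiry and hit rate correctly. A minimal TTL cache with hit-rate accounting, as a sketch (the class and key names are illustrative, not a specific library):

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1  # expired entries count as misses
        return None

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = TTLCache(ttl_seconds=300)
cache.set("user:42", {"name": "Ada"})
cache.get("user:42")   # hit
cache.get("user:99")   # miss
print(f"hit rate: {cache.hit_rate():.0%}")  # 50%
```

Watching `hit_rate()` before and after a TTL change tells you whether the change actually moved the miss rate or whether the misses come from key churn instead.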
### 3. Resource Constraints

Symptom: High CPU/memory usage

Fix:

```bash
# Scale horizontally
kubectl scale deployment api --replicas=6 -n production

# Or scale vertically (requires a pod restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production
```
### 4. Inefficient Code

Symptom: Specific endpoints consistently slow

Fix:
- Profile the endpoint
- Optimize algorithms
- Implement pagination
- Add a caching layer
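For consistently slow list endpoints, keyset (cursor) pagination avoids the cost of large `OFFSET` scans. A minimal sketch over an in-memory list; a real implementation would push the filter into SQL (e.g. `WHERE id > :cursor ORDER BY id LIMIT :limit`), and the function name is illustrative:

```python
def paginate(rows, cursor=None, limit=50):
    """Keyset pagination: rows must be sorted by id ascending."""
    if cursor is not None:
        rows = [r for r in rows if r["id"] > cursor]
    page = rows[:limit]
    # A full page implies there may be more; hand back the last id as the cursor.
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor

rows = [{"id": i} for i in range(1, 8)]
page1, c1 = paginate(rows, limit=3)              # ids 1-3, cursor 3
page2, c2 = paginate(rows, cursor=c1, limit=3)   # ids 4-6, cursor 6
```

Each request then costs the same regardless of how deep the client has paged, which is exactly the property `OFFSET`-based pagination lacks.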
## Recovery

1. Quick wins
   - Increase cache TTLs
   - Scale out services
   - Enable read replicas

2. Roll back if needed

   ```bash
   kubectl rollout undo deployment/api -n production
   ```

3. Communicate status
   - Update the status page
   - Notify affected customers
   - Post in the #incidents channel
## Prevention

- Load testing before major releases
- Gradual rollouts with canary deployments
- Query performance regression tests
- Capacity planning reviews
## Monitoring

Key metrics to watch:
- API latency percentiles
- Database query time
- Cache hit rates
- Resource utilization
- Error rates