# Runbook: High API Latency

## Overview

This runbook covers troubleshooting and resolving high API latency.
## Symptoms

- p95 latency > 500ms
- User reports of slow loading
- Timeout errors in client applications
- Increased error rates due to timeouts
## Impact

- Poor user experience
- Increased error rates
- Potential cascading failures
- Customer complaints
## Detection

- Alert: `APILatencyHigh`, triggered when p95 > 500ms for 5 minutes
- Dashboard: API Performance
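As a sketch, the `APILatencyHigh` alert above could be expressed as a Prometheus-style alerting rule. The metric name `http_request_duration_seconds_bucket` and the label set are assumptions about the instrumentation, not details confirmed by this runbook:

```yaml
groups:
  - name: api-latency
    rules:
      - alert: APILatencyHigh
        # Assumed histogram metric; substitute your API's actual metric name.
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API p95 latency above 500ms for 5 minutes"
```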
## Response

### Immediate Actions

1. Check current latency
   - View p50, p95, and p99 latencies
   - Identify affected endpoints
   - Check error rates

2. Verify system health

   ```bash
   # Check pod status
   kubectl get pods -n production

   # Check resource usage
   kubectl top pods -n production

   # Check recent deployments
   kubectl rollout history deployment/api -n production
   ```

3. Enable detailed logging (temporarily)

   ```bash
   kubectl set env deployment/api LOG_LEVEL=debug -n production
   ```

   Remember to revert `LOG_LEVEL` once diagnosis is complete; debug logging itself adds latency under load.
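If you need to sanity-check dashboard numbers against raw request durations, the percentile math in step 1 is easy to reproduce. A minimal nearest-rank sketch (the sample durations are illustrative, not real service data):

```python
def percentile(samples, p):
    """Nearest-rank percentile: p is in [0, 100], samples are raw durations."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank index: ceil(p/100 * n) - 1, clamped to a valid index.
    k = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
    return ordered[k]

durations_ms = [120, 95, 480, 210, 640, 150, 300, 880, 130, 175]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(durations_ms, p)}ms")
```

With these samples, p95 lands at 880ms, well above the 500ms threshold that fires `APILatencyHigh`.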
## Diagnosis

1. Database performance
   - Check the slow query log
   - Review connection pool status
   - Look for lock contention

2. External dependencies
   - GitHub API response times
   - Payment processor latency
   - CDN performance

3. Application issues
   - Memory leaks (steadily increasing memory usage)
   - CPU bottlenecks
   - Inefficient algorithms
## Common Causes and Fixes

### 1. Database Queries

Symptom: High database CPU, slow queries

Fix:

```sql
-- Find the slowest queries by mean execution time
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;

-- Add missing indexes without blocking writes
CREATE INDEX CONCURRENTLY idx_table_column ON table(column);
```
### 2. Cache Misses

Symptom: High cache miss rate

Fix:
- Warm up caches after deployment
- Increase cache TTLs for stable data
- Review cache key generation
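A longer TTL only helps if the cache tracks expiry and hit rate correctly. A minimal TTL cache with hit-rate accounting, as a sketch (the class and key names are illustrative, not a specific library):

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            self.hits += 1
            return entry[0]
        self.misses += 1  # expired entries count as misses
        return None

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = TTLCache(ttl_seconds=300)
cache.set("user:42", {"name": "Ada"})
cache.get("user:42")   # hit
cache.get("user:99")   # miss
print(f"hit rate: {cache.hit_rate():.0%}")  # 50%
```

Watching `hit_rate()` before and after a TTL change tells you whether the change actually moved the miss rate or whether the misses come from key churn instead.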
### 3. Resource Constraints

Symptom: High CPU/memory usage

Fix:

```bash
# Scale horizontally
kubectl scale deployment api --replicas=6 -n production

# Or scale vertically (requires a pod restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production
```
### 4. Inefficient Code

Symptom: Specific endpoints consistently slow

Fix:
- Profile the endpoint
- Optimize algorithms
- Implement pagination
- Add a caching layer
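For consistently slow list endpoints, keyset (cursor) pagination avoids the cost of large `OFFSET` scans. A minimal sketch over an in-memory list; a real implementation would push the filter into SQL (e.g. `WHERE id > :cursor ORDER BY id LIMIT :limit`), and the function name is illustrative:

```python
def paginate(rows, cursor=None, limit=50):
    """Keyset pagination: rows must be sorted by id ascending."""
    if cursor is not None:
        rows = [r for r in rows if r["id"] > cursor]
    page = rows[:limit]
    # A full page implies there may be more; hand back the last id as the cursor.
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor

rows = [{"id": i} for i in range(1, 8)]
page1, c1 = paginate(rows, limit=3)              # ids 1-3, cursor 3
page2, c2 = paginate(rows, cursor=c1, limit=3)   # ids 4-6, cursor 6
```

Each request then costs the same regardless of how deep the client has paged, which is exactly the property `OFFSET`-based pagination lacks.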
## Recovery

1. Quick wins
   - Increase cache TTLs
   - Scale out services
   - Enable read replicas

2. Roll back if needed

   ```bash
   kubectl rollout undo deployment/api -n production
   ```

3. Communicate status
   - Update the status page
   - Notify affected customers
   - Post in the #incidents channel
## Prevention

- Load testing before major releases
- Gradual rollouts with canary deployments
- Query performance regression tests
- Capacity planning reviews
## Monitoring

Key metrics to watch:
- API latency percentiles
- Database query time
- Cache hit rates
- Resource utilization
- Error rates