Monitoring Guide
Overview
This guide covers monitoring and observability practices for Tenki Cloud operations.
Stack
- Prometheus: Metrics collection
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Tempo: Distributed tracing
- Alertmanager: Alert routing
Metrics
Application Metrics
Key metrics to monitor (see the instrumentation sketch after this list):
- Request rate and latency
- Error rates (4xx, 5xx)
- Database connection pool stats
- Background job queue depth
- GitHub API rate limits
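The application metrics above map naturally onto Prometheus instrumentation. Below is a minimal sketch, assuming the service is written in Go and uses the official client (github.com/prometheus/client_golang); the metric and label names are illustrative assumptions, not the service's actual ones.

```go
// Illustrative metric registration; names and labels are assumptions.
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request rate and latency, partitioned by route and status code.
	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"route", "code"})

	// 4xx/5xx error rates are derived from this counter in PromQL.
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests.",
	}, []string{"route", "code"})

	// Background job queue depth.
	jobQueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "job_queue_depth",
		Help: "Number of jobs waiting to be processed.",
	})

	// Remaining GitHub API rate limit, updated after each API call.
	githubRateLimitRemaining = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "github_api_rate_limit_remaining",
		Help: "Remaining GitHub API requests in the current window.",
	})
)

// Handler exposes the registered metrics for Prometheus to scrape.
func Handler() http.Handler { return promhttp.Handler() }
```

Database connection pool stats can be exposed with the client's collectors package (for example a database/sql stats collector) rather than hand-rolled gauges.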
Infrastructure Metrics
- CPU and memory usage
- Disk I/O and space
- Network throughput
- Container health
- Database performance
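Most of these come from the usual exporters (node_exporter for host CPU, memory, disk, and network; cAdvisor for container health; a database exporter for query performance) rather than from application code, so they only need scrape configuration. The one piece that lives in the service itself is the in-process view, which the Go client provides out of the box; a small sketch, with the function name being an assumption:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

// RegisterRuntimeCollectors adds Go runtime and process-level metrics
// (goroutines, GC, open file descriptors, RSS) to the given registry.
// These complement, but do not replace, node_exporter and cAdvisor for
// host- and container-level metrics.
func RegisterRuntimeCollectors(reg *prometheus.Registry) {
	reg.MustRegister(
		collectors.NewGoCollector(),
		collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
	)
}
```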
Dashboards
Available Dashboards
- Application Overview: High-level health metrics
- API Performance: Request rates, latencies, errors
- Database Health: Connections, query performance
- GitHub Integration: Runner stats, API usage
- Billing System: Transaction volumes, failures
Creating Dashboards
- Use Grafana dashboards as code
- Store dashboards in deployments/grafana/dashboards/
- Follow the naming convention category-name.json
- Include appropriate tags and metadata (see the check sketched after this list)
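A lightweight way to keep the convention honest is a test that runs in CI over the dashboard files. The sketch below is hypothetical: the name pattern and required fields are assumptions derived from the bullets above, and the glob path assumes the test is run from the repository root.

```go
package dashboards_test

import (
	"encoding/json"
	"os"
	"path/filepath"
	"regexp"
	"testing"
)

// namePattern enforces the category-name.json convention, e.g. api-performance.json.
var namePattern = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)+\.json$`)

func TestDashboardConventions(t *testing.T) {
	// Path is relative to where the test is executed; adjust as needed.
	files, err := filepath.Glob("deployments/grafana/dashboards/*.json")
	if err != nil {
		t.Fatal(err)
	}
	for _, f := range files {
		if !namePattern.MatchString(filepath.Base(f)) {
			t.Errorf("%s does not follow the category-name.json convention", f)
		}
		raw, err := os.ReadFile(f)
		if err != nil {
			t.Fatal(err)
		}
		var dash struct {
			Title string   `json:"title"`
			Tags  []string `json:"tags"`
		}
		if err := json.Unmarshal(raw, &dash); err != nil {
			t.Errorf("%s is not valid JSON: %v", f, err)
			continue
		}
		if dash.Title == "" || len(dash.Tags) == 0 {
			t.Errorf("%s is missing a title or tags", f)
		}
	}
}
```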
Alerts
Alert Rules
Critical alerts:
- API availability < 99.9%
- Database CPU > 80%
- Free disk space < 20%
- Error rate > 5%
- GitHub API rate limit remaining < 1000
Alert Routing
- Critical: PagerDuty (immediate response)
- Warning: Slack #alerts channel
- Info: Email daily digest
Logs
Log Levels
- ERROR: Actionable errors requiring investigation
- WARN: Potential issues, degraded performance
- INFO: Important business events
- DEBUG: Detailed troubleshooting information
Structured Logging
Always use structured logging with consistent fields (see the sketch after this list):
- trace_id: Request correlation ID
- user_id: User identifier
- org_id: Organization identifier
- error: Error message and stack trace
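A minimal sketch of this with Go's log/slog JSON handler; the events and identifiers are made up, and the handler configuration is an assumption rather than the project's actual logging setup. Note that slog does not attach stack traces on its own; those need to be added explicitly if required.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

func main() {
	// JSON output keeps fields machine-parseable for Loki queries.
	logger := slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelInfo, // DEBUG is enabled per environment, not in production
	}))

	// INFO: important business event, with consistent correlation fields.
	logger.Info("runner provisioned",
		slog.String("trace_id", "4bf92f3577b34da6a3ce929d0e0e4736"),
		slog.String("user_id", "usr_123"),
		slog.String("org_id", "org_456"),
	)

	// ERROR: actionable error, always carrying the error itself.
	err := errors.New("billing webhook rejected: signature mismatch")
	logger.Error("billing event failed",
		slog.String("trace_id", "4bf92f3577b34da6a3ce929d0e0e4736"),
		slog.String("org_id", "org_456"),
		slog.Any("error", err),
	)
}
```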
Tracing
Instrumentation
- Trace all API endpoints
- Include database queries
- Add custom spans for business logic
- Propagate trace context to external services
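A minimal sketch of the instrumentation side with the OpenTelemetry Go SDK; the tracer, route, and attribute names are illustrative, and exporter/provider setup (shipping spans to Tempo) is omitted. Database query spans would typically come from an instrumented driver wrapper rather than hand-written spans.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// chargeOrg wraps a piece of business logic in a custom span.
func chargeOrg(ctx context.Context, orgID string) error {
	_, span := otel.Tracer("billing").Start(ctx, "billing.ChargeOrg")
	defer span.End()
	span.SetAttributes(attribute.String("org_id", orgID))
	// ... business logic, including instrumented database queries ...
	return nil
}

func main() {
	// Clients built on otelhttp's transport propagate trace context to
	// external services such as the GitHub API.
	_ = &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/runners", func(w http.ResponseWriter, r *http.Request) {
		_ = chargeOrg(r.Context(), "org_456")
	})
	// Wrapping the mux gives every API endpoint a server span.
	http.ListenAndServe(":8080", otelhttp.NewHandler(mux, "api"))
}
```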
Sampling
- 100% sampling for errors
- 10% sampling for successful requests
- Adjust based on traffic volume
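Head-based samplers decide before a request finishes, so they cannot implement "100% for errors" on their own; that part is usually handled by tail-based sampling in a collector. The 10% ratio for ordinary traffic can be set in the SDK, as in this sketch with the OpenTelemetry Go SDK:

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newTracerProvider samples roughly 10% of root traces and honours the
// parent's decision for child spans. Error traces are kept at 100% by a
// tail-sampling policy in the collector (not shown here), since the head
// sampler runs before the outcome of the request is known.
func newTracerProvider() *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
		// sdktrace.WithBatcher(exporter) would be added here to ship spans to Tempo.
	)
}
```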
SLOs and SLIs
Service Level Indicators
- API latency (p50, p95, p99)
- Error rate
- Availability
- Database query time
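Two of these indicators can be computed directly from the request metrics sketched earlier. The example below queries them ad hoc with the Prometheus Go API client; the metric names are the same illustrative assumptions as before, and in practice these expressions would normally live in recording rules feeding the SLO dashboards.

```go
package sli

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// PromQL for two of the SLIs; metric names assume the earlier instrumentation sketch.
const (
	p95Latency = `histogram_quantile(0.95,
		sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`
	errorRate = `sum(rate(http_requests_total{code=~"5.."}[5m]))
		/ sum(rate(http_requests_total[5m]))`
)

// PrintSLIs evaluates both expressions against a Prometheus server.
func PrintSLIs(ctx context.Context, prometheusURL string) error {
	client, err := api.NewClient(api.Config{Address: prometheusURL})
	if err != nil {
		return err
	}
	v1api := promv1.NewAPI(client)
	for name, query := range map[string]string{"p95_latency": p95Latency, "error_rate": errorRate} {
		result, warnings, err := v1api.Query(ctx, query, time.Now())
		if err != nil {
			return err
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s: %v\n", name, result)
	}
	return nil
}
```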
Service Level Objectives
- 99.9% API availability
- p95 latency < 500ms
- Error rate < 0.1%
- Zero data loss
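For context on the availability target: 99.9% over a 30-day window corresponds to an error budget of roughly (1 - 0.999) × 30 × 24 × 60 ≈ 43 minutes of downtime per month, which is what the critical availability alert protects.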