
Monitoring Guide

Overview

This guide covers monitoring and observability practices for Tenki Cloud operations.

Stack

  • Prometheus: Metrics collection
  • Grafana: Visualization and dashboards
  • Loki: Log aggregation
  • Tempo: Distributed tracing
  • Alertmanager: Alert routing
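
The components above might be wired together with a Prometheus scrape configuration along these lines. This is a minimal sketch; the job names, hostnames, and ports are illustrative assumptions, not the actual Tenki Cloud deployment:

```yaml
# prometheus.yml -- illustrative targets; names and ports are assumptions
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["api:9090"]            # application /metrics endpoint
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]  # host-level metrics

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
```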

Metrics

Application Metrics

Key metrics to monitor:

  • Request rate and latency
  • Error rates (4xx, 5xx)
  • Database connection pool stats
  • Background job queue depth
  • GitHub API rate limits
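
The application metrics above map naturally onto PromQL. Assuming conventional metric names such as `http_requests_total` and `http_request_duration_seconds` (illustrative, not confirmed for this codebase), the first three bullets could be queried as:

```promql
# Request rate (per second, by route)
sum(rate(http_requests_total[5m])) by (route)

# 5xx error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency from a histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```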

Infrastructure Metrics

  • CPU and memory usage
  • Disk I/O and space
  • Network throughput
  • Container health
  • Database performance

Dashboards

Available Dashboards

  1. Application Overview: High-level health metrics
  2. API Performance: Request rates, latencies, errors
  3. Database Health: Connections, query performance
  4. GitHub Integration: Runner stats, API usage
  5. Billing System: Transaction volumes, failures

Creating Dashboards

  1. Define Grafana dashboards as code (JSON models in version control)
  2. Store dashboards in deployments/grafana/dashboards/
  3. Follow naming convention: category-name.json
  4. Include appropriate tags and metadata
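
Following the steps above, a dashboard file (e.g. `api-performance.json`, a hypothetical name matching the convention) might start from a skeleton like this. Only a few fields are shown; a real Grafana dashboard model carries more metadata:

```json
{
  "title": "API Performance",
  "tags": ["tenki", "api"],
  "timezone": "utc",
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "targets": [{ "expr": "sum(rate(http_requests_total[5m]))" }]
    }
  ]
}
```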

Alerts

Alert Rules

Critical alerts:

  • API availability < 99.9%
  • Database CPU > 80%
  • Free disk space < 20%
  • Error rate > 5%
  • Remaining GitHub API rate limit < 1,000 requests
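
As a sketch, one of the critical alerts above expressed as a Prometheus alerting rule (the metric names are assumptions carried over from the earlier examples):

```yaml
# rules/critical.yml -- example rule; metric names are illustrative
groups:
  - name: critical
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
```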

Alert Routing

  1. Critical: PagerDuty (immediate response)
  2. Warning: Slack #alerts channel
  3. Info: Email daily digest
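
The three routing tiers translate into an Alertmanager routing tree roughly like the following. Receiver names and the routing key are placeholders, not the production configuration:

```yaml
# alertmanager.yml -- routing sketch matching the tiers above
route:
  receiver: email-digest          # default: info-level daily digest
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack-alerts

receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <secret>
  - name: slack-alerts
    slack_configs:
      - channel: "#alerts"
  - name: email-digest
    email_configs:
      - to: ops@example.com
```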

Logs

Log Levels

  • ERROR: Actionable errors requiring investigation
  • WARN: Potential issues, degraded performance
  • INFO: Important business events
  • DEBUG: Detailed troubleshooting information

Structured Logging

Always use structured logging with consistent fields:

  • trace_id: Request correlation ID
  • user_id: User identifier
  • org_id: Organization identifier
  • error: Error message and stack trace
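
A minimal sketch of structured logging with these fields, using only the Python standard library (the logger name and field values are illustrative; in practice teams often use a library such as structlog):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the correlation fields."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields; populated per call site via the `extra` argument.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "org_id": getattr(record, "org_id", None),
        }
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry)

logger = logging.getLogger("tenki")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: pass the correlation fields via `extra` so they land in every record.
logger.info("runner provisioned", extra={"trace_id": "abc123", "org_id": "org-42"})
```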

Tracing

Instrumentation

  • Trace all API endpoints
  • Include database queries
  • Add custom spans for business logic
  • Propagate trace context to external services
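
To illustrate the "custom spans for business logic" point, here is a stripped-down span as a context manager. This is a teaching sketch in plain Python, not the tracing SDK's API; in practice the equivalent helper from your tracing library would record the span instead of printing it:

```python
import time
import uuid
from contextlib import contextmanager
from typing import Optional

@contextmanager
def span(name: str, trace_id: Optional[str] = None):
    """Minimal custom span: records a name, a trace id, and the duration."""
    trace_id = trace_id or uuid.uuid4().hex  # new trace if none was propagated
    start = time.monotonic()
    try:
        yield trace_id
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        # A real tracer would export this; printing keeps the sketch observable.
        print(f"span={name} trace_id={trace_id} duration_ms={duration_ms:.1f}")

# Usage: wrap a unit of business logic and reuse the trace id in log lines.
with span("bill-customer") as tid:
    pass  # business logic here
```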

Sampling

  • 100% sampling for errors
  • 10% sampling for successful requests
  • Adjust based on traffic volume
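
The sampling policy above fits in a few lines. A sketch of a head-based sampling decision, assuming the caller already knows whether the request errored (function and parameter names are illustrative):

```python
import random
from typing import Optional

def should_sample(is_error: bool, rate: float = 0.10,
                  rng: Optional[random.Random] = None) -> bool:
    """Keep every error trace; keep `rate` of successful ones."""
    if is_error:
        return True  # 100% sampling for errors
    rng = rng or random  # injectable RNG makes the decision testable
    return rng.random() < rate
```

The `rate` default encodes the 10% figure above; "adjust based on traffic volume" amounts to tuning that one parameter.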

SLOs and SLIs

Service Level Indicators

  • API latency (p50, p95, p99)
  • Error rate
  • Availability
  • Database query time

Service Level Objectives

  • 99.9% API availability
  • p95 latency < 500ms
  • Error rate < 0.1%
  • Zero data loss
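
An availability SLO is easiest to reason about as an error budget. For the 99.9% target above, the arithmetic is: 30 days × 24 h × 60 min × (1 − 0.999) ≈ 43.2 minutes of allowable downtime per month. As a small helper (the function name is ours, not from the codebase):

```python
def monthly_error_budget_minutes(slo: float, days: int = 30) -> float:
    """Downtime a service may accrue per month while still meeting `slo`."""
    return days * 24 * 60 * (1.0 - slo)

# 99.9% availability leaves roughly 43.2 minutes/month of budget.
print(monthly_error_budget_minutes(0.999))
```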