GitHub Runners Architecture

This document gives an overview of Tenki Cloud’s GitHub Actions runner system and describes how we manage self-hosted runners at scale.

Overview

Tenki Cloud provides a managed GitHub Actions runner platform that allows users to run their CI/CD workflows on dedicated, scalable infrastructure. The system integrates deeply with GitHub through a GitHub App, orchestrates runner lifecycle through Temporal workflows, and manages the underlying Kubernetes infrastructure.

System Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     GitHub      │────▶│  GitHub Proxy    │────▶│    Temporal     │
│    Webhooks     │     │   (Node.js)      │     │   Workflows     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                           │
                                                           ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Kubernetes    │◀────│  Runner Service  │◀────│    Database     │
│   (Runners)     │     │     (Go)         │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘

Core Components

1. GitHub Proxy

The GitHub proxy serves as the entry point for all GitHub webhook events. Built with Node.js and Probot, it:

  • Receives webhook events from GitHub (installation, workflow_job, workflow_run, push)
  • Validates webhook signatures for security
  • Forwards events to Temporal workflows for processing
  • Preserves GitHub headers for workflow_job events
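As an illustration of the signature check listed above: GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the X-Hub-Signature-256 header as "sha256=" followed by the hex digest. In the proxy itself this is handled by Probot in Node.js; the Go sketch below only shows the mechanism, and the function names and secret value are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// signPayload returns the header value GitHub would send for this
// secret and raw body. It is used here only to build a valid example.
func signPayload(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature reports whether header carries the HMAC-SHA256 of body.
// hmac.Equal compares in constant time to avoid timing leaks.
func verifySignature(secret, body []byte, header string) bool {
	const prefix = "sha256="
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // hypothetical webhook secret
	body := []byte(`{"action":"queued"}`)
	header := signPayload(secret, body)

	fmt.Println(verifySignature(secret, body, header))                // valid delivery
	fmt.Println(verifySignature(secret, []byte("tampered"), header)) // modified body
}
```

The check must run against the raw bytes of the request body, before any JSON parsing or re-serialization, because the digest covers the exact bytes GitHub sent.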

Key event handlers:

  • installation.created/deleted: Manages GitHub App installations
  • workflow_job: Processes individual CI/CD job events
  • workflow_run: Tracks overall workflow execution
  • push: Monitors changes to workflow files
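One way the push handler could decide whether a push is relevant is to filter the changed paths for workflow definitions, which GitHub reads from .yml or .yaml files directly under .github/workflows/. The sketch below assumes the caller has already collected the changed paths (for example from the added, modified, and removed arrays of each commit in the push payload); the function names are hypothetical and this is not necessarily how the proxy does it.

```go
package main

import (
	"fmt"
	"path"
)

// isWorkflowFile reports whether p is a workflow definition located
// directly under .github/workflows/ with a .yml or .yaml extension.
func isWorkflowFile(p string) bool {
	dir, file := path.Split(p)
	if dir != ".github/workflows/" {
		return false
	}
	ext := path.Ext(file)
	return ext == ".yml" || ext == ".yaml"
}

// changedWorkflowFiles keeps only the workflow files from a list of
// changed repository paths.
func changedWorkflowFiles(paths []string) []string {
	var out []string
	for _, p := range paths {
		if isWorkflowFile(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	paths := []string{
		"README.md",
		".github/workflows/ci.yml",
		".github/workflows/release.yaml",
		".github/dependabot.yml",
	}
	fmt.Println(changedWorkflowFiles(paths))
}
```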

2. Runner Service

The runner service is the core business logic layer, implemented in Go with Connect RPC:

  • Manages runner lifecycle: Creation, deletion, suspension
  • Handles GitHub integration: Repository synchronization, workflow analysis
  • Controls Kubernetes resources: Deployments, autoscalers, secrets
  • Tracks usage and billing: Job metrics, duration, failures

Key operations:

  • InstallRunners: Initialize a new GitHub App installation
  • CreateRunner: Provision custom runner configurations
  • GetRunnerMetrics: Performance analytics (p50/p90, failure rates)

3. Temporal Workflows

Temporal provides durable workflow orchestration for long-running operations:

Primary Workflows

Runner Installation Workflow

  • Long-running workflow per GitHub installation
  • Responds to signals: Install, Uninstall, Suspend, AddRepositories
  • Manages entire runner lifecycle
  • Handles failure recovery and retries
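The signal-driven shape of this workflow can be sketched without the Temporal SDK as a loop that consumes signals and updates installation state. This is a plain-Go illustration only: a channel stands in for Temporal's signal delivery, the state names ("pending", "active", and so on) are assumptions, and the real workflow also runs activities and retries that are omitted here.

```go
package main

import "fmt"

type signal string

// Signal names taken from the list above.
const (
	sigInstall         signal = "Install"
	sigUninstall       signal = "Uninstall"
	sigSuspend         signal = "Suspend"
	sigAddRepositories signal = "AddRepositories"
)

// message is one signal plus its optional payload.
type message struct {
	sig   signal
	repos []string
}

// installation is the state carried by the long-running loop.
// The state values are hypothetical.
type installation struct {
	state string
	repos []string
}

// run consumes signals until Uninstall arrives or the channel closes,
// and returns the final state.
func run(signals <-chan message) installation {
	inst := installation{state: "pending"}
	for m := range signals {
		switch m.sig {
		case sigInstall:
			inst.state = "active"
		case sigSuspend:
			inst.state = "suspended"
		case sigAddRepositories:
			inst.repos = append(inst.repos, m.repos...)
		case sigUninstall:
			inst.state = "uninstalled"
			return inst // the loop ends once the app is uninstalled
		}
	}
	return inst
}

func main() {
	ch := make(chan message, 3)
	ch <- message{sig: sigInstall}
	ch <- message{sig: sigAddRepositories, repos: []string{"org/api"}}
	ch <- message{sig: sigSuspend}
	close(ch)
	fmt.Printf("%+v\n", run(ch))
}
```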

GitHub Job Workflow

  • Processes each GitHub Actions job
  • Tracks state transitions (queued → in_progress → completed)
  • Creates billing events for usage tracking
  • Forwards requests to Actions Runner Controller
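The state tracking above can be sketched as a forward-only transition function. Webhook deliveries may arrive late or out of order, and a job cancelled while queued can go straight to completed, so this sketch only ever moves a job forward and ignores stale or unknown events. That policy is an assumption for illustration, not a description of the actual workflow code.

```go
package main

import "fmt"

// rank orders the job states named above.
var rank = map[string]int{
	"queued":      0,
	"in_progress": 1,
	"completed":   2,
}

// advance returns the job state after observing event. Events that are
// unknown, or that would move the job backwards, leave current unchanged.
func advance(current, event string) string {
	cur, okCur := rank[current]
	ev, okEv := rank[event]
	if !okCur || !okEv || ev <= cur {
		return current
	}
	return event
}

func main() {
	state := "queued"
	for _, ev := range []string{"in_progress", "queued", "completed"} {
		state = advance(state, ev)
		fmt.Println(ev, "->", state)
	}
}
```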

GitHub Run Workflow

  • Monitors overall workflow execution
  • Provides visibility into CI/CD pipeline status
  • Updates database with run metadata

4. Data Models

Runner

message Runner {
  string id = 1;
  string name = 2;
  string namespace = 3;
  string runner_offering_id = 4;
  repeated string repositories = 5;
  string status = 6;
  bool is_custom = 7;
  // Resource specifications
  string cpu = 8;
  string memory = 9;
}

RunnerInstallation

message RunnerInstallation {
  int64 installation_id = 1;
  string workspace_id = 2;
  string state = 3;
  string github_account_type = 4;
  bool is_service_enabled = 5;
}

RunnerOffering

message RunnerOffering {
  string id = 1;
  string name = 2;
  string cpu = 3;
  string memory = 4;
  string image_repository = 5;
  bool is_autoscale = 6;
}
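Runner and RunnerOffering both carry cpu and memory fields, linked through runner_offering_id. The sketch below shows one plausible way to resolve the effective resources for a runner: start from the offering and let a custom runner's own non-empty values override it. That precedence rule is an assumption, and the Go structs are hand-written mirrors of a subset of the messages rather than generated code.

```go
package main

import "fmt"

// Runner mirrors a subset of the Runner message above.
type Runner struct {
	ID               string
	RunnerOfferingID string
	IsCustom         bool
	CPU              string
	Memory           string
}

// RunnerOffering mirrors a subset of the RunnerOffering message above.
type RunnerOffering struct {
	ID     string
	Name   string
	CPU    string
	Memory string
}

// resolveResources returns the CPU and memory to use for r.
// Assumption: a custom runner's non-empty values take precedence over
// the offering's; otherwise the offering's values apply.
func resolveResources(r Runner, offerings map[string]RunnerOffering) (cpu, memory string, err error) {
	o, ok := offerings[r.RunnerOfferingID]
	if !ok {
		return "", "", fmt.Errorf("unknown runner offering %q", r.RunnerOfferingID)
	}
	cpu, memory = o.CPU, o.Memory
	if r.IsCustom {
		if r.CPU != "" {
			cpu = r.CPU
		}
		if r.Memory != "" {
			memory = r.Memory
		}
	}
	return cpu, memory, nil
}

func main() {
	offerings := map[string]RunnerOffering{
		"small": {ID: "small", Name: "Small", CPU: "2", Memory: "4Gi"}, // hypothetical offering
	}
	r := Runner{ID: "r1", RunnerOfferingID: "small", IsCustom: true, Memory: "8Gi"}
	cpu, memory, err := resolveResources(r, offerings)
	fmt.Println(cpu, memory, err)
}
```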

Event Flow

1. GitHub App Installation

sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant K8s as Kubernetes

    GH->>GP: installation.created
    GP->>T: Start RunnerInstallWorkflow
    T->>RS: Install signal
    RS->>RS: Sync repositories
    RS->>K8s: Create namespace
    RS->>K8s: Deploy runners
    RS->>GH: Installation complete

2. Workflow Job Execution

sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant ARC as Actions Runner Controller
    participant B as Billing

    GH->>GP: workflow_job (queued)
    GP->>T: Start GithubJobWorkflow
    T->>RS: Create job record
    T->>ARC: Forward job request
    GH->>GP: workflow_job (completed)
    T->>B: Create usage event
    T->>RS: Update job metrics
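The usage event in the last steps of this flow needs a job duration. GitHub's workflow_job payload carries started_at and completed_at timestamps in RFC 3339 format, so a duration can be derived as sketched below. Rounding up to whole minutes is an assumption made for the example; the source does not state Tenki Cloud's billing granularity, and the function name is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// billableMinutes parses the two RFC 3339 timestamps and returns the
// elapsed time rounded up to whole minutes.
func billableMinutes(startedAt, completedAt string) (int, error) {
	start, err := time.Parse(time.RFC3339, startedAt)
	if err != nil {
		return 0, fmt.Errorf("parse started_at: %w", err)
	}
	end, err := time.Parse(time.RFC3339, completedAt)
	if err != nil {
		return 0, fmt.Errorf("parse completed_at: %w", err)
	}
	d := end.Sub(start)
	if d < 0 {
		return 0, fmt.Errorf("completed_at %s is before started_at %s", completedAt, startedAt)
	}
	// Integer ceiling division avoids floating-point rounding surprises.
	return int((d + time.Minute - 1) / time.Minute), nil
}

func main() {
	minutes, err := billableMinutes("2024-01-01T10:00:00Z", "2024-01-01T10:01:30Z")
	fmt.Println(minutes, err)
}
```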

Key Features

Multi-tenancy

  • Workspace isolation: Each workspace has dedicated resources
  • Project organization: Runners are scoped to projects
  • Kubernetes namespaces: Logical isolation at the infrastructure level
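If each workspace maps to its own namespace, the namespace name has to be a valid Kubernetes DNS label: at most 63 characters, lowercase alphanumerics or '-', starting and ending with an alphanumeric. The sketch below derives such a name from a workspace ID. The "ws-" prefix and the function name are hypothetical, and because the mapping is lossy a real implementation would also need to guard against two IDs producing the same name.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceFor derives a DNS-label-safe namespace name from a workspace ID.
// Characters outside [a-z0-9] are replaced with '-'.
func namespaceFor(workspaceID string) string {
	var b strings.Builder
	b.WriteString("ws-") // hypothetical prefix; also guarantees an alphanumeric first character
	for _, r := range strings.ToLower(workspaceID) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
		} else {
			b.WriteByte('-')
		}
	}
	name := b.String()
	if len(name) > 63 { // the output is pure ASCII, so byte slicing is safe
		name = name[:63]
	}
	// A label must not end with '-'.
	return strings.TrimRight(name, "-")
}

func main() {
	fmt.Println(namespaceFor("Acme_Corp.42"))
}
```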

Custom Runners

  • Container registry support: GCP, AWS, or custom registries
  • Custom images: Build and manage custom runner images
  • Resource configurations: Flexible CPU/memory specifications

Auto-scaling

  • Horizontal Pod Autoscaler: Scale based on job queue
  • Dynamic provisioning: Add runners based on repository activity
  • Cost optimization: Scale down when idle
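For reference, the core Horizontal Pod Autoscaler calculation is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the configured minimum and maximum. The sketch below reproduces that arithmetic with a hypothetical "queued jobs per runner" metric. It leaves out the HPA's tolerance band and stabilization behavior, and the metric Tenki Cloud actually scales on is not specified in this document.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA formula and clamps the result
// to [minReplicas, maxReplicas]. A non-positive target is treated as
// "no change", which is a choice made for this sketch.
func desiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	desired := current
	if targetMetric > 0 {
		desired = int(math.Ceil(float64(current) * currentMetric / targetMetric))
	}
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// 4 runners, 3 queued jobs per runner observed, target of 1 per runner.
	fmt.Println(desiredReplicas(4, 3, 1, 1, 20))
	// Idle queue: scale down to the minimum.
	fmt.Println(desiredReplicas(4, 0, 1, 1, 20))
}
```

Because the formula multiplies by the current replica count, it cannot scale up from zero on its own, which is one reason a minimum of at least one replica (or a separate scale-from-zero mechanism) is common.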

Observability

  • Metrics collection: Job duration, success rates, queue times
  • Workflow tracking: Complete visibility into CI/CD pipelines
  • Performance analytics: P50/P90 latencies, failure analysis

Security Considerations

Authentication

  • GitHub App: OAuth-based authentication
  • Webhook validation: Signature verification on all events
  • Token management: Secure storage in Kubernetes secrets

Authorization

  • Workspace boundaries: Strict tenant isolation
  • Repository access: Fine-grained permissions per runner
  • RBAC integration: Keto-based permission system

Network Security

  • Private networking: Runners in isolated VPCs
  • Egress controls: Restricted outbound access
  • TLS everywhere: Encrypted communication throughout

Operational Aspects

Monitoring

  • Temporal UI: Workflow state and history
  • Prometheus metrics: Resource usage and performance
  • Application logs: Structured logging with trace IDs
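As a small illustration of structured logging with trace IDs, the sketch below uses Go's standard log/slog package (Go 1.21 or later) to attach a trace_id attribute to every record a logger emits. Whether the Runner Service uses slog or another logging library is not stated in this document; the field name and helper functions are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger that stamps every record with traceID.
func newLogger(w io.Writer, traceID string) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("trace_id", traceID)
}

// logRecord writes one record into a buffer and decodes it again, to
// show the fields a log consumer would see.
func logRecord(traceID, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	newLogger(&buf, traceID).Info(msg)
	rec := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	// Emits one JSON line with time, level, msg, trace_id, and job_id.
	newLogger(os.Stdout, "trace-123").Info("job completed", "job_id", 42)

	rec, err := logRecord("trace-456", "runner created")
	fmt.Println(rec["trace_id"], rec["msg"], err)
}
```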

Failure Handling

  • Temporal retries: Automatic retry with exponential backoff
  • Circuit breakers: Prevent cascading failures
  • Manual recovery: Reset workflows for reconciliation

Maintenance

  • Rolling updates: Zero-downtime deployments
  • Database migrations: Version-controlled schema changes
  • Backup strategies: Regular snapshots of critical data

Future Enhancements

  1. GPU Support: Enable ML/AI workloads
  2. Spot Instance Integration: Cost optimization with preemptible VMs
  3. Advanced Caching: Distributed cache for dependencies
  4. Windows Runners: Support for Windows-based workflows
  5. Enhanced Analytics: Deeper insights into CI/CD performance