GitHub Runners Architecture
This document describes Tenki Cloud’s GitHub Actions runner system and how we manage self-hosted runners at scale.
Overview
Tenki Cloud provides a managed GitHub Actions runner platform that allows users to run their CI/CD workflows on dedicated, scalable infrastructure. The system integrates deeply with GitHub through a GitHub App, orchestrates runner lifecycle through Temporal workflows, and manages the underlying Kubernetes infrastructure.
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     GitHub      │────▶│   GitHub Proxy   │────▶│    Temporal     │
│    Webhooks     │     │    (Node.js)     │     │    Workflows    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Kubernetes    │◀────│  Runner Service  │◀────│    Database     │
│    (Runners)    │     │       (Go)       │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Core Components
1. GitHub Proxy
The GitHub proxy serves as the entry point for all GitHub webhook events. Built with Node.js and Probot, it:
- Receives webhook events from GitHub (installation, workflow_job, workflow_run, push)
- Validates webhook signatures for security
- Forwards events to Temporal workflows for processing
- Preserves GitHub headers for workflow_job events
Key event handlers:
- installation.created/deleted: Manages GitHub App installations
- workflow_job: Processes individual CI/CD job events
- workflow_run: Tracks overall workflow execution
- push: Monitors changes to workflow files
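In a Probot app the signature check is normally performed by the framework, but the underlying mechanism is simple: GitHub computes an HMAC-SHA256 over the raw request body using the webhook secret and sends it in the X-Hub-Signature-256 header as `sha256=<hex digest>`. The sketch below illustrates that check in Python rather than the proxy's Node.js; `verify_signature`, `sign`, and the sample secret are hypothetical names used only for illustration.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Build the header value GitHub would send for this body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Return True if `header` carries the HMAC-SHA256 of `body`."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not
    # reveal how many leading characters were correct.
    return hmac.compare_digest(expected.encode(), header[len(prefix):].encode())


if __name__ == "__main__":
    secret = b"example-webhook-secret"  # hypothetical value
    body = b'{"action":"queued"}'
    print(verify_signature(secret, body, sign(secret, body)))
```

The digest must be computed over the exact bytes received. Re-serializing the parsed JSON before hashing can change whitespace or key order and cause valid deliveries to be rejected.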
2. Runner Service
The runner service is the core business logic layer, implemented in Go with Connect RPC:
- Manages runner lifecycle: Creation, deletion, suspension
- Handles GitHub integration: Repository synchronization, workflow analysis
- Controls Kubernetes resources: Deployments, autoscalers, secrets
- Tracks usage and billing: Job metrics, duration, failures
Key operations:
- InstallRunners: Initialize a new GitHub App installation
- CreateRunner: Provision custom runner configurations
- GetRunnerMetrics: Performance analytics (p50/p90, failure rates)
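To make the metrics returned by GetRunnerMetrics concrete, here is a minimal sketch of how p50/p90 durations and a failure rate could be derived from job records. It uses the nearest-rank percentile definition, which is an assumption; the service may use a different method or compute these in SQL. `JobSample`, `percentile`, and `summarize` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class JobSample:
    duration_seconds: float
    failed: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `pct` is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(jobs: list[JobSample]) -> dict[str, float]:
    """Aggregate durations and failures for a set of finished jobs."""
    durations = [j.duration_seconds for j in jobs]
    return {
        "p50": percentile(durations, 50),
        "p90": percentile(durations, 90),
        "failure_rate": sum(j.failed for j in jobs) / len(jobs),
    }
```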
3. Temporal Workflows
Temporal provides durable workflow orchestration for long-running operations:
Primary Workflows
Runner Installation Workflow
- Long-running workflow per GitHub installation
- Responds to signals: Install, Uninstall, Suspend, AddRepositories
- Manages entire runner lifecycle
- Handles failure recovery and retries
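The signal handling described above amounts to a small state machine. The sketch below models it in plain Python without the Temporal SDK; in a real workflow the same logic would sit inside a loop that waits on a signal channel. The state names (`pending`, `active`, `suspended`, `uninstalled`) and the `apply_signal` helper are assumptions for illustration, not the service's actual identifiers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InstallationState:
    state: str = "pending"  # hypothetical state names
    repositories: set = field(default_factory=set)


def apply_signal(inst: InstallationState, signal: str,
                 repos: Optional[list] = None) -> InstallationState:
    """Apply one signal to the installation and return it."""
    if inst.state == "uninstalled":
        raise ValueError("installation has been uninstalled")
    if signal == "Install":
        inst.state = "active"
        inst.repositories.update(repos or [])
    elif signal == "AddRepositories":
        inst.repositories.update(repos or [])
    elif signal == "Suspend":
        inst.state = "suspended"
    elif signal == "Uninstall":
        inst.state = "uninstalled"
        inst.repositories.clear()
    else:
        raise ValueError(f"unknown signal: {signal}")
    return inst
```

Treating `uninstalled` as terminal keeps late signals from reviving an installation whose resources have already been removed.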
GitHub Job Workflow
- Processes each GitHub Actions job
- Tracks state transitions (queued → in_progress → completed)
- Creates billing events for usage tracking
- Forwards requests to Actions Runner Controller
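A minimal sketch of the state tracking, assuming webhook events can arrive late or more than once: the job only ever moves forward through queued, in_progress, and completed, and stale events are ignored. `ORDER` and `next_state` are hypothetical names, and any status outside these three is rejected in this sketch.

```python
ORDER = {"queued": 0, "in_progress": 1, "completed": 2}


def next_state(current: str, incoming: str) -> str:
    """Advance only forward; ignore stale or duplicate events."""
    if incoming not in ORDER:
        raise ValueError(f"unexpected status: {incoming}")
    return incoming if ORDER[incoming] > ORDER[current] else current
```

A job cancelled before it starts jumps straight from queued to completed, which this ordering allows.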
GitHub Run Workflow
- Monitors overall workflow execution
- Provides visibility into CI/CD pipeline status
- Updates database with run metadata
4. Data Models
Runner
message Runner {
  string id = 1;
  string name = 2;
  string namespace = 3;
  string runner_offering_id = 4;
  repeated string repositories = 5;
  string status = 6;
  bool is_custom = 7;
  // Resource specifications
  string cpu = 8;
  string memory = 9;
}
RunnerInstallation
message RunnerInstallation {
  int64 installation_id = 1;
  string workspace_id = 2;
  string state = 3;
  string github_account_type = 4;
  bool is_service_enabled = 5;
}
RunnerOffering
message RunnerOffering {
  string id = 1;
  string name = 2;
  string cpu = 3;
  string memory = 4;
  string image_repository = 5;
  bool is_autoscale = 6;
}
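The cpu and memory fields are plain strings. Assuming they hold Kubernetes-style quantities such as `500m` or `4Gi` (this document does not say so), the sketch below shows how a request could be matched against the available offerings. `Offering` mirrors a subset of RunnerOffering; `cpu_millicores`, `memory_bytes`, and `smallest_fit` are hypothetical helpers, and only the `m`, `Ki`, `Mi`, and `Gi` suffixes are handled.

```python
from dataclasses import dataclass
from typing import Optional

_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_millicores(value: str) -> int:
    """Convert '500m' or '2' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def memory_bytes(value: str) -> int:
    """Convert '512Mi' or '4Gi' to bytes; a bare number is taken as bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


@dataclass
class Offering:
    """Subset of the RunnerOffering message."""
    id: str
    name: str
    cpu: str
    memory: str


def smallest_fit(offerings: list[Offering], cpu: str,
                 memory: str) -> Optional[Offering]:
    """Return the smallest offering that satisfies the request, or None."""
    fits = [
        o for o in offerings
        if cpu_millicores(o.cpu) >= cpu_millicores(cpu)
        and memory_bytes(o.memory) >= memory_bytes(memory)
    ]
    return min(
        fits,
        key=lambda o: (cpu_millicores(o.cpu), memory_bytes(o.memory)),
        default=None,
    )
```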
Event Flow
1. GitHub App Installation
sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant K8s as Kubernetes
    GH->>GP: installation.created
    GP->>T: Start RunnerInstallWorkflow
    T->>RS: Install signal
    RS->>RS: Sync repositories
    RS->>K8s: Create namespace
    RS->>K8s: Deploy runners
    RS->>GH: Installation complete
2. Workflow Job Execution
sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant ARC as Actions Runner Controller
    participant B as Billing
    GH->>GP: workflow_job (queued)
    GP->>T: Start GithubJobWorkflow
    T->>RS: Create job record
    T->>ARC: Forward job request
    GH->>GP: workflow_job (completed)
    T->>B: Create usage event
    T->>RS: Update job metrics
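The usage event created at the end of this flow can be derived from the completed workflow_job payload, which carries `started_at` and `completed_at` timestamps in ISO 8601 format. The sketch below is one way to compute it; the shape of the returned event and the rounding up to whole minutes are assumptions, not a documented billing rule.

```python
import math
from datetime import datetime


def parse_github_time(value: str) -> datetime:
    """Parse timestamps such as '2024-05-01T12:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def usage_event(payload: dict) -> dict:
    """Build a usage record from a completed workflow_job payload."""
    job = payload["workflow_job"]
    started = parse_github_time(job["started_at"])
    completed = parse_github_time(job["completed_at"])
    # Clamp at zero in case of clock skew between the two timestamps.
    seconds = max((completed - started).total_seconds(), 0)
    return {
        "job_id": job["id"],
        "conclusion": job["conclusion"],
        "duration_seconds": int(seconds),
        # Assumption: partial minutes are rounded up.
        "billable_minutes": math.ceil(seconds / 60),
    }
```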
Key Features
Multi-tenancy
- Workspace isolation: Each workspace has dedicated resources
- Project organization: Runners are scoped to projects
- Kubernetes namespaces: Logical isolation at the infrastructure level
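Kubernetes namespace names must be valid DNS labels: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters. A sketch of deriving a namespace name from a workspace ID under those rules follows; the `ws-` prefix and the `namespace_for` helper are hypothetical, and the actual naming scheme may differ.

```python
import re

MAX_LABEL_LENGTH = 63  # DNS label limit that applies to namespace names


def namespace_for(workspace_id: str, prefix: str = "ws-") -> str:
    """Map a workspace ID to a valid Kubernetes namespace name."""
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9-]+", "-", workspace_id.lower()).strip("-")
    if not slug:
        raise ValueError("workspace id has no usable characters")
    # Truncate, then make sure the name does not end with a hyphen.
    return (prefix + slug)[:MAX_LABEL_LENGTH].rstrip("-")
```

Truncation can make two long IDs collide, so a scheme like this is only safe when workspace IDs are short and unique, for example UUIDs.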
Custom Runners
- Container registry support: GCP, AWS, or custom registries
- Custom images: Build and manage custom runner images
- Resource configurations: Flexible CPU/memory specifications
Auto-scaling
- Horizontal Pod Autoscaler: Scale based on job queue
- Dynamic provisioning: Add runners based on repository activity
- Cost optimization: Scale down when idle
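The scaling behaviour can be summarized as a target runner count derived from queue state. The sketch below is a deliberate simplification and not the Horizontal Pod Autoscaler's own algorithm: it requests one runner per queued or running job and clamps the result to configured bounds. `desired_runners` and its parameters are hypothetical.

```python
def desired_runners(queued: int, busy: int,
                    min_runners: int = 0, max_runners: int = 10) -> int:
    """One runner per queued or running job, clamped to the bounds."""
    return max(min_runners, min(max_runners, queued + busy))
```

With `min_runners` set to 0 the pool drains completely when idle, which minimizes cost but means the first job after a quiet period waits for a runner to start.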
Observability
- Metrics collection: Job duration, success rates, queue times
- Workflow tracking: Complete visibility into CI/CD pipelines
- Performance analytics: P50/P90 latencies, failure analysis
Security Considerations
Authentication
- GitHub App: Installation access tokens for API calls, OAuth for user authorization
- Webhook validation: Signature verification on all events
- Token management: Secure storage in Kubernetes secrets
Authorization
- Workspace boundaries: Strict tenant isolation
- Repository access: Fine-grained permissions per runner
- RBAC integration: Keto-based permission system
Network Security
- Private networking: Runners in isolated VPCs
- Egress controls: Restricted outbound access
- TLS everywhere: Encrypted communication throughout
Operational Aspects
Monitoring
- Temporal UI: Workflow state and history
- Prometheus metrics: Resource usage and performance
- Application logs: Structured logging with trace IDs
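As an illustration of structured logging with trace IDs, the sketch below emits one JSON object per log line and carries a `trace_id` field so that entries from different components can be correlated. It uses Python's standard `logging` module; the field names and the `JsonFormatter` class are assumptions, not the services' actual log schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set through `extra={"trace_id": ...}`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("runner")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("job queued", extra={"trace_id": "example-trace-id"})
```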
Failure Handling
- Temporal retries: Automatic retry with exponential backoff
- Circuit breakers: Prevent cascading failures
- Manual recovery: Reset workflows for reconciliation
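Two of these mechanisms are easy to sketch. A Temporal retry policy is defined by an initial interval, a backoff coefficient, and a maximum interval; `backoff_intervals` reproduces the resulting delay schedule, with illustrative parameter values rather than the values configured in production. `CircuitBreaker` is a minimal, hypothetical implementation of the pattern with an injected clock, not the code used by the services.

```python
from typing import Optional


def backoff_intervals(attempts: int, initial: float = 1.0,
                      coefficient: float = 2.0,
                      maximum: float = 100.0) -> list[float]:
    """Delay before each retry: initial * coefficient**n, capped."""
    return [min(initial * coefficient**n, maximum) for n in range(attempts)]


class CircuitBreaker:
    """Opens after `threshold` consecutive failures and allows a
    trial call once `cooldown` seconds have passed."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True if a call may be attempted at time `now`."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown

    def record(self, success: bool, now: float) -> None:
        """Record the outcome of a call made at time `now`."""
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now
```

Passing the clock in as `now` keeps the breaker deterministic, which makes it straightforward to test without sleeping.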
Maintenance
- Rolling updates: Zero-downtime deployments
- Database migrations: Version-controlled schema changes
- Backup strategies: Regular snapshots of critical data
Future Enhancements
- GPU Support: Enable ML/AI workloads
- Spot Instance Integration: Cost optimization with preemptible VMs
- Advanced Caching: Distributed cache for dependencies
- Windows Runners: Support for Windows-based workflows
- Enhanced Analytics: Deeper insights into CI/CD performance