Tenki Cloud Documentation

Welcome to Tenki Cloud's documentation. This is your starting point for understanding the system architecture, development practices, and operational procedures.

Note: This documentation is built with mdBook. Run pnpm docs:dev to view it locally.

Documentation Organization

📁 Architecture

System design, technical decisions, and architectural diagrams.

💻 Development

Everything you need to start developing on Tenki Cloud.

🚀 Operations

Deployment, monitoring, and incident response.

📋 Product

Product vision, roadmap, and requirements.

Contributing to Documentation

When to Add Documentation

  • Architecture changes → Add an ADR
  • New features → Add a PRD
  • Operational issues → Add a runbook
  • API changes → Update the relevant guide

Documentation Standards

  1. Keep it concise - Get to the point quickly
  2. Use examples - Show, don't just tell
  3. Date your docs - Add "Last updated: YYYY-MM-DD" to guides
  4. Test your instructions - Make sure they actually work

Quick Doc Updates

# Install mdBook (first time only)
./docs/install-mdbook.sh

# Edit documentation
vim docs/src/development/getting-started.md

# Preview locally with hot reload
pnpm docs:dev

# Build static site
pnpm docs:build

# Submit changes
git add docs/
git commit -m "docs: update getting started guide"

Finding Information

By Role

Backend Engineer

Frontend Engineer

DevOps/SRE

Product Manager

By Task

"I need to…"

Maintenance

This documentation is maintained by the engineering team. Each team member is responsible for keeping their area of expertise documented.

  • Backend team owns: Backend guide, database docs, API patterns
  • Frontend team owns: Frontend guide, component docs
  • DevOps team owns: Deployment, monitoring, runbooks
  • Product team owns: Roadmap, PRDs, metrics

Last updated: 2025-06-12

Tenki Cloud System Architecture

Last updated: 2025-06-12

Overview

Tenki Cloud is a cloud compute marketplace that provides GitHub Actions runner management as a service. The system is built as a distributed microservices architecture with clear separation of concerns.

High-Level Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   GitHub.com    │────▶│   GitHub Proxy   │────▶│    Temporal     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
┌─────────────────┐     ┌──────────────────┐              ▼
│   Next.js App   │────▶│   tRPC Gateway   │     ┌─────────────────┐
└─────────────────┘     └──────────────────┘     │  Backend Engine │
                                │                └─────────────────┘
                                ▼                         │
                        ┌──────────────────┐              ▼
                        │   Backend API    │     ┌─────────────────┐
                        │  (Connect RPC)   │     │   PostgreSQL    │
                        └──────────────────┘     └─────────────────┘

Core Components

Frontend Layer

Next.js Application (apps/app/)

  • Server-side rendered React application
  • TypeScript with tRPC for type-safe API calls
  • Tailwind CSS with Radix UI components
  • Authentication via Kratos sessions

API Gateway Layer

tRPC Router (apps/app/src/server/api/)

  • Type-safe RPC layer between frontend and backend
  • Handles session management and authentication
  • Routes requests to appropriate backend services

Backend Services

Engine (backend/cmd/engine/)

  • Main orchestrator for all backend operations
  • Implements Connect RPC (gRPC-Web compatible)
  • Manages service lifecycle and dependencies

Domain Services (backend/internal/domain/)

  • Identity: User authentication (Kratos) and authorization (Keto)
  • Workspace: Multi-tenant workspace and project management
  • Runner: GitHub Actions runner lifecycle management
  • Billing: Usage tracking, TigerBeetle ledger, Stripe integration
  • Compute: VM provisioning via CloudStack/Kubernetes

Event Processing

GitHub Proxy (backend/cmd/github-proxy/)

  • Receives GitHub webhooks
  • Validates and transforms events
  • Publishes to Temporal for processing

Temporal Workflows

  • Long-running business processes
  • Runner provisioning workflows
  • Billing cycle management
  • Retry and failure handling

Data Layer

PostgreSQL

  • Primary data store
  • Managed via migrations (backend/schema/)
  • Type-safe queries via sqlc

Redpanda

  • Event streaming platform
  • Audit log collection
  • Inter-service communication

TigerBeetle

  • Financial ledger for billing
  • Double-entry bookkeeping
  • High-performance transaction processing

Key Design Decisions

1. Monorepo Structure

See ADR-001

2. Temporal for Workflows

See ADR-002

3. Connect RPC over REST

See ADR-003

Security Architecture

Authentication Flow

User → Next.js → Kratos → Session Cookie → tRPC → Backend

Authorization Model

  • Keto for fine-grained permissions
  • Workspace-based multi-tenancy
  • Project-level access control

Secrets Management

  • SOPS for encrypted configuration
  • Kubernetes secrets for runtime
  • No secrets in environment variables

Deployment Architecture

Kubernetes Deployment

  • GitOps via Flux
  • Horizontal pod autoscaling
  • Service mesh for inter-service communication

Infrastructure Components

  • Ingress: Traefik with automatic TLS
  • Monitoring: Prometheus + Grafana
  • Logging: Loki + Grafana
  • Tracing: Tempo

Data Flow Examples

Runner Provisioning

  1. GitHub sends webhook to proxy
  2. Proxy validates and publishes to Kafka
  3. Backend consumes event, starts Temporal workflow
  4. Workflow provisions runner in Kubernetes
  5. Runner registers with GitHub
  6. Status updates flow back via Temporal

Billing Flow

  1. Runner usage tracked via Temporal activities
  2. Usage events written to TigerBeetle
  3. Daily aggregation job calculates costs
  4. Monthly billing workflow generates invoices
  5. Stripe processes payments
  6. Payment status updates ledger

Scalability Considerations

Horizontal Scaling

  • Stateless services scale via replicas
  • Database uses read replicas for queries
  • Temporal workers scale independently

Performance Optimization

  • Redis for session caching
  • CDN for static assets
  • Database query optimization via indexes

Reliability

  • Circuit breakers for external services
  • Retry logic in Temporal workflows
  • Graceful degradation for non-critical features

Future Architecture Plans

  1. Multi-region deployment for global latency optimization
  2. GraphQL federation for more flexible API access
  3. Event sourcing for complete audit trail
  4. Service mesh for advanced traffic management

GitHub Runners Architecture

This document provides a comprehensive overview of Tenki Cloud's GitHub Actions runner system, detailing how we manage self-hosted runners at scale.

Overview

Tenki Cloud provides a managed GitHub Actions runner platform that allows users to run their CI/CD workflows on dedicated, scalable infrastructure. The system integrates deeply with GitHub through a GitHub App, orchestrates runner lifecycle through Temporal workflows, and manages the underlying Kubernetes infrastructure.

System Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     GitHub      │────▶│   GitHub Proxy   │────▶│    Temporal     │
│    Webhooks     │     │    (Node.js)     │     │   Workflows     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Kubernetes    │◀────│  Runner Service  │◀────│    Database     │
│   (Runners)     │     │      (Go)        │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘

Core Components

1. GitHub Proxy

The GitHub proxy serves as the entry point for all GitHub webhook events. Built with Node.js and Probot, it:

  • Receives webhook events from GitHub (installation, workflow_job, workflow_run, push)
  • Validates webhook signatures for security
  • Forwards events to Temporal workflows for processing
  • Preserves GitHub headers for workflow_job events

Key event handlers:

  • installation.created/deleted: Manages GitHub App installations
  • workflow_job: Processes individual CI/CD job events
  • workflow_run: Tracks overall workflow execution
  • push: Monitors changes to workflow files
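
The signature check above is GitHub's standard X-Hub-Signature-256 scheme: an HMAC-SHA256 of the raw request body, keyed with the webhook secret. The proxy itself is Node.js/Probot; the sketch below is an illustrative version in Go, not the production handler.

package webhook

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// verifySignature validates a webhook body against the X-Hub-Signature-256
// header, which carries "sha256=" followed by a hex-encoded HMAC of the body.
func verifySignature(secret, body []byte, signatureHeader string) bool {
	const prefix = "sha256="
	if !strings.HasPrefix(signatureHeader, prefix) {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	// hmac.Equal performs a constant-time comparison.
	return hmac.Equal([]byte(expected), []byte(strings.TrimPrefix(signatureHeader, prefix)))
}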

2. Runner Service

The runner service is the core business logic layer, implemented in Go with Connect RPC:

  • Manages runner lifecycle: Creation, deletion, suspension
  • Handles GitHub integration: Repository synchronization, workflow analysis
  • Controls Kubernetes resources: Deployments, autoscalers, secrets
  • Tracks usage and billing: Job metrics, duration, failures

Key operations:

  • InstallRunners: Initialize a new GitHub App installation
  • CreateRunner: Provision custom runner configurations
  • GetRunnerMetrics: Performance analytics (p50/p90, failure rates)
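
The p50/p90 figures above are plain duration percentiles over completed jobs. A rough nearest-rank sketch (illustrative only, not the GetRunnerMetrics implementation):

package metrics

import "sort"

// percentile returns the approximate p-th percentile (0 < p <= 1) of job
// durations using the nearest-rank method; p50/p90 correspond to p = 0.5/0.9.
func percentile(durationsSeconds []float64, p float64) float64 {
	if len(durationsSeconds) == 0 {
		return 0
	}
	sorted := append([]float64(nil), durationsSeconds...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted))*p+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}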

3. Temporal Workflows

Temporal provides durable workflow orchestration for long-running operations:

Primary Workflows

Runner Installation Workflow

  • Long-running workflow per GitHub installation
  • Responds to signals: Install, Uninstall, Suspend, AddRepositories
  • Manages entire runner lifecycle
  • Handles failure recovery and retries
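
A minimal sketch of this signal-driven shape using the Temporal Go SDK; the signal payloads and activity wiring are illustrative assumptions, not the production contract.

package workflows

import "go.temporal.io/sdk/workflow"

// Illustrative signal payloads; the real signal shapes may differ.
type InstallSignal struct {
	Repositories []string
}

type UninstallSignal struct{}

// RunnerInstallationWorkflow reacts to Install signals and exits on Uninstall.
// Activities for syncing repositories and deploying runners are omitted.
func RunnerInstallationWorkflow(ctx workflow.Context, installationID int64) error {
	installCh := workflow.GetSignalChannel(ctx, "Install")
	uninstallCh := workflow.GetSignalChannel(ctx, "Uninstall")

	for {
		done := false
		selector := workflow.NewSelector(ctx)
		selector.AddReceive(installCh, func(c workflow.ReceiveChannel, _ bool) {
			var sig InstallSignal
			c.Receive(ctx, &sig)
			// The production workflow would sync repositories, create the
			// namespace, and deploy runners here.
		})
		selector.AddReceive(uninstallCh, func(c workflow.ReceiveChannel, _ bool) {
			var sig UninstallSignal
			c.Receive(ctx, &sig)
			done = true
		})
		selector.Select(ctx)
		if done {
			return nil
		}
	}
}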

GitHub Job Workflow

  • Processes each GitHub Actions job
  • Tracks state transitions (queued → in_progress → completed)
  • Creates billing events for usage tracking
  • Forwards requests to Actions Runner Controller

GitHub Run Workflow

  • Monitors overall workflow execution
  • Provides visibility into CI/CD pipeline status
  • Updates database with run metadata

4. Data Models

Runner

message Runner {
  string id = 1;
  string name = 2;
  string namespace = 3;
  string runner_offering_id = 4;
  repeated string repositories = 5;
  string status = 6;
  bool is_custom = 7;
  // Resource specifications
  string cpu = 8;
  string memory = 9;
}

RunnerInstallation

message RunnerInstallation {
  int64 installation_id = 1;
  string workspace_id = 2;
  string state = 3;
  string github_account_type = 4;
  bool is_service_enabled = 5;
}

RunnerOffering

message RunnerOffering {
  string id = 1;
  string name = 2;
  string cpu = 3;
  string memory = 4;
  string image_repository = 5;
  bool is_autoscale = 6;
}

Event Flow

1. GitHub App Installation

sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant K8s as Kubernetes

    GH->>GP: installation.created
    GP->>T: Start RunnerInstallWorkflow
    T->>RS: Install signal
    RS->>RS: Sync repositories
    RS->>K8s: Create namespace
    RS->>K8s: Deploy runners
    RS->>GH: Installation complete

2. Workflow Job Execution

sequenceDiagram
    participant GH as GitHub
    participant GP as GitHub Proxy
    participant T as Temporal
    participant RS as Runner Service
    participant ARC as Actions Controller
    participant B as Billing

    GH->>GP: workflow_job (queued)
    GP->>T: Start GithubJobWorkflow
    T->>RS: Create job record
    T->>ARC: Forward job request
    GH->>GP: workflow_job (completed)
    T->>B: Create usage event
    T->>RS: Update job metrics

Key Features

Multi-tenancy

  • Workspace isolation: Each workspace has dedicated resources
  • Project organization: Runners are scoped to projects
  • Kubernetes namespaces: Physical isolation at infrastructure level

Custom Runners

  • Container registry support: GCP, AWS, or custom registries
  • Custom images: Build and manage custom runner images
  • Resource configurations: Flexible CPU/memory specifications

Auto-scaling

  • Horizontal Pod Autoscaler: Scale based on job queue
  • Dynamic provisioning: Add runners based on repository activity
  • Cost optimization: Scale down when idle

Observability

  • Metrics collection: Job duration, success rates, queue times
  • Workflow tracking: Complete visibility into CI/CD pipelines
  • Performance analytics: P50/P90 latencies, failure analysis

Security Considerations

Authentication

  • GitHub App: OAuth-based authentication
  • Webhook validation: Signature verification on all events
  • Token management: Secure storage in Kubernetes secrets

Authorization

  • Workspace boundaries: Strict tenant isolation
  • Repository access: Fine-grained permissions per runner
  • RBAC integration: Keto-based permission system

Network Security

  • Private networking: Runners in isolated VPCs
  • Egress controls: Restricted outbound access
  • TLS everywhere: Encrypted communication throughout

Operational Aspects

Monitoring

  • Temporal UI: Workflow state and history
  • Prometheus metrics: Resource usage and performance
  • Application logs: Structured logging with trace IDs

Failure Handling

  • Temporal retries: Automatic retry with exponential backoff
  • Circuit breakers: Prevent cascading failures
  • Manual recovery: Reset workflows for reconciliation

Maintenance

  • Rolling updates: Zero-downtime deployments
  • Database migrations: Version-controlled schema changes
  • Backup strategies: Regular snapshots of critical data

Future Enhancements

  1. GPU Support: Enable ML/AI workloads
  2. Spot Instance Integration: Cost optimization with preemptible VMs
  3. Advanced Caching: Distributed cache for dependencies
  4. Windows Runners: Support for Windows-based workflows
  5. Enhanced Analytics: Deeper insights into CI/CD performance

Billing System Architecture

This document provides a comprehensive overview of Tenki Cloud's billing system, which handles usage-based billing, payment processing, and financial accounting for GitHub Actions runners.

Overview

Tenki Cloud's billing system is designed to provide accurate, reliable, and scalable billing for compute usage. It integrates multiple systems:

  • TigerBeetle: High-performance financial database for double-entry bookkeeping
  • Stripe: Payment processing and invoice generation
  • Temporal: Workflow orchestration for billing cycles and retry logic
  • PostgreSQL: Storage for billing metadata and history

System Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  GitHub Actions │────▶│  Runner Service  │────▶│  Usage Events   │
│      Jobs       │     │                  │     │   (Database)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     Stripe      │◀────│ Billing Service  │◀────│    Temporal     │
│   (Payments)    │     │                  │     │   Workflows     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                │
                                ▼
                        ┌──────────────────┐
                        │   TigerBeetle    │
                        │   (Accounting)   │
                        └──────────────────┘

Core Components

1. Data Models

Customer

message Customer {
  string id = 1;
  string identity_id = 2;
  string workspace_id = 3;
  uint64 tb_account_id = 4;        // TigerBeetle account
  string stripe_customer_id = 5;    // Stripe customer
  string default_payment_method = 6;
  bool has_payment_method = 7;
  string payment_method_status = 8;
}

Invoice

message Invoice {
  string id = 1;
  string customer_id = 2;
  string billing_period = 3;        // YYYY-MM format
  string status = 4;                // draft, issued, paid, void
  int64 amount = 5;                 // in cents
  bytes pdf_content = 6;
  string pdf_url = 7;
  string stripe_invoice_id = 8;
  int32 retry_count = 9;
}

UsageEvent

message UsageEvent {
  string id = 1;
  string workspace_id = 2;
  string runner_id = 3;
  google.protobuf.Timestamp started_at = 4;
  google.protobuf.Timestamp finished_at = 5;
  int64 seconds = 6;
  string external_id = 7;           // Idempotency key
  uint64 tb_transfer_id = 8;        // TigerBeetle transfer
}

2. TigerBeetle Accounting

The system uses double-entry bookkeeping with predefined accounts:

Fixed Accounts

  • 1001 - TENKI_RECEIVABLE: Money owed to Tenki
  • 1010 - STRIPE_RECEIVABLE: Money in Stripe
  • 2001 - USER: Customer liability accounts
  • 4001 - REVENUE: Income account
  • 5010 - STRIPE_FEE: Payment processing fees
  • 5020 - MARKETING_EXPENSE: Promotional credits

Transfer Types

  • 1002 - T_StripePayment: Customer payments via Stripe
  • 2001 - T_RunnerCharge: GitHub Actions runner usage charge
  • 2002 - T_RunnerPromoCreditUsage: Promotional credit usage adjustment
  • 2003 - T_UsageReversal: Reversal of negative usage charges
  • 2010 - T_ComputeCharge: Charge for compute resources (future use)
  • 3001 - T_AccountSignup: Initial signup bonus credit
  • 3002 - T_MonthlyFreeCredit: Monthly free credit allowance
  • 3003 - T_PromoCredit: General promotional credit
  • 3004 - T_PromoCreditReversal: Reversal of promotional credits

Example Transactions

Usage Charge (Runner completes job):

Debit:  USER (Customer Account)     $5.00
Credit: REVENUE                     $5.00

Payment Received (Stripe payment):

Debit:  STRIPE_RECEIVABLE          $100.00
Credit: USER (Customer Account)    $100.00

Debit:  STRIPE_FEE                 $2.90
Credit: STRIPE_RECEIVABLE          $2.90

Promotional Credit:

Debit:  MARKETING_EXPENSE          $10.00
Credit: USER (Customer Account)    $10.00
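
As a rough illustration (not the real TigerBeetle client code), the usage charge above maps onto a single transfer whose debit/credit sides and code come from the tables in this section; amounts are in micro-cents, as described later under Precision Accounting.

package billing

// Transfer is a simplified stand-in for a ledger transfer; the production
// system uses the TigerBeetle client types instead of this struct.
type Transfer struct {
	DebitAccountID  uint64 // account to debit
	CreditAccountID uint64 // account to credit
	Code            uint16 // transfer type from the reference above
	Amount          uint64 // micro-cents
}

const (
	accountRevenue   = 4001 // REVENUE (account code used as a placeholder ID here)
	codeRunnerCharge = 2001 // T_RunnerCharge
)

// runnerUsageCharge debits the customer's USER account and credits REVENUE,
// mirroring the "Usage Charge" example above.
func runnerUsageCharge(customerAccountID, amountMicroCents uint64) Transfer {
	return Transfer{
		DebitAccountID:  customerAccountID,
		CreditAccountID: accountRevenue,
		Code:            codeRunnerCharge,
		Amount:          amountMicroCents,
	}
}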

Financial Flow Sequences

The following sequence diagram illustrates the complete financial flows in the Tenki Cloud billing system, showing how money moves between different accounts through various transfer codes:

sequenceDiagram
    participant USER as User/Customer
    participant SIGNUP as Signup Process
    participant GITHUB as GitHub Actions
    participant BILLING as Billing Service
    participant TB as TigerBeetle Ledger
    participant STRIPE as Stripe
    participant CYCLE as Billing Cycle
    participant AUDIT as Audit System

    Note over USER,AUDIT: Tenki Cloud Financial Flow System

    %% Phase 1: Account Creation & Initial Credits
    rect rgb(240, 248, 255)
        Note left of USER: Phase 1: Account Setup & Signup Credits
        USER->>SIGNUP: Create account
        SIGNUP->>BILLING: Create customer account
        BILLING->>TB: Create USER account (ACCOUNT_CODE_USER)

        SIGNUP->>BILLING: Add signup bonus
        BILLING->>TB: Transfer: T_AccountSignup<br/>MARKETING_EXPENSE → USER<br/>($10 signup credit)
    end

    %% Phase 2: Service Usage
    rect rgb(255, 253, 240)
        Note left of GITHUB: Phase 2: Service Usage & Charges
        GITHUB->>BILLING: Job execution event
        BILLING->>BILLING: Calculate usage cost
        BILLING->>TB: Transfer: T_RunnerCharge<br/>USER → REVENUE<br/>(Usage charges)
    end

    %% Phase 3: Payment Processing
    rect rgb(240, 255, 240)
        Note left of CYCLE: Phase 3: Billing Cycle & Payments

        Note over BILLING,TB: Step 1: Promotional credit adjustments
        CYCLE->>BILLING: Start billing cycle
        BILLING->>BILLING: Check promo credit usage for period
        BILLING->>TB: Transfer: T_RunnerPromoCreditUsage<br/>REVENUE → MARKETING_EXPENSE<br/>(Move promo usage from revenue)

        Note over BILLING,STRIPE: Step 2: Invoice generation
        BILLING->>STRIPE: Create Stripe invoice
        STRIPE->>USER: Send payment request
        alt Payment Success
            USER->>STRIPE: Make payment
            STRIPE->>BILLING: Payment webhook

            Note over BILLING,AUDIT: Payment Success Workflow
            BILLING->>TB: Transfer: T_StripePayment<br/>STRIPE_RECEIVABLE → USER<br/>(Payment received)

            BILLING->>TB: Transfer: T_MonthlyFreeCredit<br/>MARKETING_EXPENSE → USER<br/>($10/month free credit reset)

            BILLING->>BILLING: Create payment record in database
            BILLING->>AUDIT: Create billing audit record<br/>(compliance tracking)

        else Payment Failed
            STRIPE->>BILLING: Payment failed webhook
            BILLING->>BILLING: Schedule retry attempts
            BILLING->>BILLING: Start service interruption timer
        end
    end

Complete Transfer Code Reference

The system uses the following transfer codes for different types of financial transactions:

Payment & Withdrawal Operations (1000s)

  • 1001 - T_BankWithdrawal: Cash withdrawal from bank account
  • 1002 - T_StripePayment: Payment received from Stripe (invoice payment)

Service Charges (2000s)

  • 2001 - T_RunnerCharge: Charge for GitHub Actions runner usage
  • 2002 - T_RunnerPromoCreditUsage: Adjustment to move promotional credit usage from revenue to marketing expense
  • 2003 - T_UsageReversal: Reversal of negative usage charges
  • 2010 - T_ComputeCharge: Charge for compute resources (future use)

Credits & Bonuses (3000s)

  • 3001 - T_AccountSignup: Initial signup bonus credit
  • 3002 - T_MonthlyFreeCredit: Monthly free credit allowance (e.g., $10/month)
  • 3003 - T_PromoCredit: General promotional credit (campaigns, support, etc.)
  • 3004 - T_PromoCreditReversal: Reversal of promotional credits (corrections, violations, etc.)

Key Financial Flow Patterns

  1. Customer Onboarding: New users receive signup credits (T_AccountSignup) and monthly free credits (T_MonthlyFreeCredit) from the marketing expense account.

  2. Usage Billing: GitHub Actions runner usage generates charges (T_RunnerCharge) that move money from customer accounts to revenue.

  3. Promotional Credit Accounting: When promotional credits are used for services, the system adjusts by moving the equivalent amount from revenue back to marketing expense (T_RunnerPromoCreditUsage).

  4. Payment Processing: Customer payments through Stripe (T_StripePayment) add funds to customer accounts from the Stripe receivable account.

  5. Administrative Corrections: The system supports reversals for both usage charges (T_UsageReversal) and promotional credits (T_PromoCreditReversal) for corrections and violations.

3. Billing Service

The billing service provides APIs for:

  • Customer Management: Creating and retrieving billing customers
  • Balance Operations: Checking workspace credits/debits
  • Invoice Management: Generating and managing monthly invoices
  • Usage Tracking: Recording compute usage events
  • Payment Methods: Managing cards and payment details
  • Stripe Integration: Setup intents and billing portal

Key service methods:

// Record runner usage
RecordUsage(ctx, workspaceID, runnerID, startTime, endTime)

// Process monthly billing
ProcessInvoiceAndCharge(ctx, workspaceID, billingPeriod)

// Add promotional credits
AddPromotionalCredits(ctx, workspaceID, amount, description)

Workflow Orchestration

1. Billing Cycle Workflow

Runs monthly for each workspace:

flowchart TD
    A[Start Monthly Billing] --> B[Generate Stripe Invoice]
    B --> C[Send Invoice Email]
    C --> D{Amount > 0?}
    D -->|Yes| E[Charge Payment Method]
    D -->|No| F[Complete]
    E --> G{Payment Success?}
    G -->|Yes| H[Payment Succeeded Workflow]
    G -->|No| I[Payment Failed Workflow]
    H --> F
    I --> F

2. Payment Processing Workflows

Payment Succeeded:

  1. Record payment in TigerBeetle
  2. Create payment record in database
  3. Update invoice status

Payment Failed:

  1. Send failure notification
  2. Schedule retry attempts (max 5)
  3. Start service interruption timer

3. Retry Logic

Failed payments are retried on an escalating schedule:

  • Retry 1: 3 days later
  • Retry 2: 5 days later
  • Retry 3: 7 days later
  • Retry 4: 14 days later
  • Retry 5: 21 days later

If all retries fail by the 9th of the following month, services are suspended on the 10th.

4. Credit Management

Long-running workflow that handles credit operations via signals:

  • AddPromotionalCredits: Adds credits to workspace
  • DeductPromotionalCredits: Removes credits
  • Maintains audit trail in TigerBeetle
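
A hedged sketch of how a caller might deliver one of these signals with the Temporal Go SDK; the workflow ID, signal name, and payload shape are illustrative assumptions, not the production contract.

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

// AddPromotionalCreditsSignal is an illustrative payload, not the real one.
type AddPromotionalCreditsSignal struct {
	AmountMicroCents int64
	Description      string
}

func main() {
	c, err := client.Dial(client.Options{}) // defaults to localhost:7233
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Workflow ID and signal name are hypothetical examples.
	err = c.SignalWorkflow(context.Background(),
		"credit-management-workspace-123", "", "AddPromotionalCredits",
		AddPromotionalCreditsSignal{AmountMicroCents: 10_000_000, Description: "support credit"})
	if err != nil {
		log.Fatal(err)
	}
}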

Usage Flow

1. Recording Usage

When a GitHub Actions job completes:

sequenceDiagram
    participant Job as GitHub Job
    participant Runner as Runner Service
    participant Billing as Billing Service
    participant TB as TigerBeetle

    Job->>Runner: Job completed
    Runner->>Billing: Record usage event
    Billing->>Billing: Calculate cost
    Billing->>TB: Create usage transfer
    TB->>TB: Debit user account
    TB->>TB: Credit revenue account
    Billing->>Runner: Usage recorded

2. Monthly Billing

At the start of each month:

sequenceDiagram
    participant Temporal
    participant Billing as Billing Service
    participant Stripe
    participant Customer

    Temporal->>Billing: Start billing cycle
    Billing->>Billing: Calculate usage for month
    Billing->>Stripe: Create invoice
    Stripe->>Customer: Send invoice email
    Billing->>Stripe: Charge payment method
    alt Payment successful
        Stripe->>Billing: Payment confirmed
        Billing->>Billing: Record in TigerBeetle
    else Payment failed
        Stripe->>Billing: Payment failed
        Billing->>Temporal: Schedule retry
    end

Key Features

Precision Accounting

  • All amounts stored as micro-cents (1/1,000,000 of a cent)
  • Prevents rounding errors in usage calculations
  • Supports high-frequency micro-transactions
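
For example, a $0.003/min SKU works out to 300,000 micro-cents per minute. A minimal sketch of the idea, assuming partial minutes round up as the pricing spec later in this book does:

package billing

// rate2c4GBMicroCents is the per-minute rate for a $0.003/min SKU:
// 0.003 USD = 0.3 cents = 300,000 micro-cents.
const rate2c4GBMicroCents = 300_000

// jobCostMicroCents bills whole minutes, rounding partial minutes up.
func jobCostMicroCents(seconds, ratePerMinuteMicroCents int64) int64 {
	minutes := (seconds + 59) / 60
	return minutes * ratePerMinuteMicroCents
}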

Idempotency

  • External IDs prevent duplicate usage records
  • Workflow IDs ensure single execution
  • TigerBeetle provides transaction guarantees

Audit Trail

  • Every financial transaction recorded in TigerBeetle
  • Complete history of charges, payments, and credits
  • Immutable ledger for compliance

Self-Service

  • Stripe billing portal for payment method management
  • Invoice history and downloads
  • Usage reports by billing period

Graceful Degradation

  • Billing continues even if Stripe is unavailable
  • TigerBeetle ensures accounting accuracy
  • Workflows retry transient failures

Security Considerations

Payment Security

  • No credit card data stored in Tenki systems
  • All payment processing through PCI-compliant Stripe
  • Secure token-based payment method references

Access Control

  • Workspace-scoped billing operations
  • Admin-only credit management
  • Audit logs for all financial operations

Data Protection

  • Encrypted storage for sensitive data
  • TLS for all external communications
  • Regular backups of financial data

Operational Aspects

Monitoring

  • Temporal workflow status for billing cycles
  • TigerBeetle consistency checks
  • Stripe webhook processing metrics
  • Failed payment alerts

Troubleshooting

  • Workflow history in Temporal UI
  • TigerBeetle account balances
  • Stripe dashboard for payment issues
  • Database queries for usage history

Common Issues

  1. Payment failures: Check Stripe logs and retry status
  2. Missing usage: Verify runner job completion events
  3. Balance discrepancies: Audit TigerBeetle transfers
  4. Invoice generation: Check Temporal workflow status

Future Enhancements

  1. Volume Discounts: Tiered pricing based on usage
  2. Prepaid Packages: Bulk minute purchases
  3. Cost Alerts: Notifications for spending thresholds
  4. Multi-Currency: Support for international customers
  5. Advanced Analytics: Detailed cost breakdowns by repository/workflow

Architecture Decision Records

This directory contains Architecture Decision Records (ADRs) - documents that capture important architectural decisions made during the development of Tenki Cloud.

What is an ADR?

An ADR is a document that captures an important architectural decision made along with its context and consequences. Each ADR describes a single decision and is immutable once accepted.

ADR Template

# ADR-XXX: Title

## Status

[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]

## Context

What is the issue that we're seeing that is motivating this decision or change?

## Decision

What is the change that we're proposing and/or doing?

## Consequences

What becomes easier or more difficult to do because of this change?

### Positive

- List of positive consequences

### Negative

- List of negative consequences

## Alternatives Considered

What other options were evaluated and why were they rejected?

Current ADRs

Creating a New ADR

  1. Copy the template above
  2. Create a new file: XXX-short-description.md (increment XXX)
  3. Fill out all sections
  4. Submit PR for review
  5. Once accepted, the ADR becomes immutable

When to Write an ADR

Write an ADR when:

  • Selecting key technologies (databases, frameworks, protocols)
  • Defining major architectural patterns
  • Making security decisions
  • Choosing between significant alternatives
  • Deprecating existing patterns

ADR Lifecycle

  1. Proposed - Under discussion
  2. Accepted - Decision made and being implemented
  3. Deprecated - No longer recommended but still in use
  4. Superseded - Replaced by another ADR

ADR-001: Monorepo Structure

Status

Accepted (2024-01-15)

Context

Tenki Cloud consists of multiple interconnected services:

  • Frontend applications (Next.js web app, future mobile apps)
  • Backend services (Go microservices)
  • Shared packages (TypeScript utilities, proto definitions)
  • Infrastructure code (Kubernetes manifests, Terraform)

We need a repository structure that:

  1. Enables code sharing between services
  2. Ensures coordinated deployments
  3. Maintains clear boundaries between services
  4. Provides good developer experience

Decision

We will use a monorepo structure with:

  • pnpm workspaces for TypeScript/JavaScript projects
  • Go modules with replace directives for Go services
  • Turborepo for orchestrated builds
  • Shared tooling across all services

Repository structure:

tenki.app/
β”œβ”€β”€ apps/           # Deployable applications
β”œβ”€β”€ backend/        # Go services
β”œβ”€β”€ packages/       # Shared libraries
β”œβ”€β”€ proto/          # Protocol buffer definitions
└── infra/          # Infrastructure code

Consequences

Positive

  1. Atomic changes - Features spanning multiple services can be implemented in a single commit
  2. Shared tooling - Linting, formatting, and testing tools configured once
  3. Simplified dependencies - No need for private package registries
  4. Consistent versioning - All services released together
  5. Easier refactoring - Moving code between services is straightforward
  6. Single source of truth - Proto definitions shared directly

Negative

  1. Larger repository - Clone and fetch times increase over time
  2. Complex CI/CD - Need to determine which services to build/deploy
  3. Steeper learning curve - New developers must understand entire structure
  4. Potential for coupling - Easier to create inappropriate dependencies
  5. Tooling requirements - Requires pnpm, Go, and other tools installed

Alternatives Considered

1. Separate Repositories

Rejected because:

  • Coordination overhead for cross-service changes
  • Dependency version management complexity
  • Need for private package registry
  • Difficult to maintain API contracts

2. Git Submodules

Rejected because:

  • Poor developer experience
  • Complex update workflows
  • Easy to get into inconsistent states
  • Limited tool support

3. Lerna (instead of Turborepo)

Rejected because:

  • Turborepo has better performance
  • Native pnpm workspace support
  • Better caching mechanisms
  • Simpler configuration

Implementation Notes

  1. Use pnpm filters for targeted operations:

    pnpm -F app dev          # Run only app
    pnpm -F "backend/*" test # Test all backend
    
  2. Go services use local replace:

    replace github.com/luxorlabs/proto => ../../proto
    
  3. CI uses Turborepo caching:

    {
      "pipeline": {
        "build": {
          "cache": true
        }
      }
    }
    

ADR-002: Temporal Workflows

Status

Accepted

Context

We need a reliable workflow orchestration system for managing complex, long-running processes like GitHub runner lifecycle management, billing operations, and asynchronous tasks.

Decision

We will use Temporal for workflow orchestration because it provides:

  • Durable execution with automatic retries
  • Built-in error handling and compensation
  • Strong consistency guarantees
  • Visibility into workflow state and history
  • Language-specific SDKs with good Go support

Consequences

Positive

  • Reliable execution of critical business processes
  • Built-in observability and debugging capabilities
  • Simplified error handling for distributed operations
  • Ability to handle long-running workflows (hours/days)

Negative

  • Additional infrastructure to maintain
  • Learning curve for developers new to Temporal
  • Potential vendor lock-in for workflow logic

Implementation

Temporal workflows will be used for:

  • GitHub runner provisioning and lifecycle management
  • Billing and subscription management
  • Asynchronous job processing
  • Scheduled maintenance tasks

ADR-003: gRPC Gateway

Status

Accepted

Context

We need to expose our internal gRPC services to web clients that don't support gRPC directly. We also want to maintain a single source of truth for our API definitions while supporting both gRPC and REST/JSON clients.

Decision

We will use grpc-gateway to automatically generate a RESTful HTTP API from our gRPC service definitions. This allows us to:

  • Maintain a single API definition in protobuf
  • Support both gRPC and REST clients
  • Auto-generate OpenAPI documentation
  • Preserve strong typing across the stack

Consequences

Positive

  • Single source of truth for API definitions
  • Automatic REST API generation from protobuf
  • Built-in OpenAPI/Swagger documentation
  • Consistent API behavior between gRPC and REST
  • Strong typing preserved through code generation

Negative

  • Additional build step for gateway generation
  • Some gRPC features don't map perfectly to REST
  • Slightly increased complexity in the API layer
  • Need to carefully design protos for good REST mappings

Implementation

The grpc-gateway will:

  • Run as a reverse proxy in front of gRPC services
  • Translate HTTP/JSON requests to gRPC
  • Use protobuf annotations for REST endpoint configuration
  • Generate OpenAPI specs for documentation

Architecture Diagrams

This directory contains architectural diagrams for the Tenki Cloud platform.

Overview

Our architecture diagrams use Mermaid for easy maintenance and version control. Each diagram is stored as a .md file with embedded Mermaid syntax.

Available Diagrams

  • System Overview: High-level view of all components
  • Data Flow: How data moves through the system
  • Deployment Architecture: Infrastructure and deployment topology
  • Security Model: Authentication and authorization flows

Creating New Diagrams

  1. Create a new .md file in this directory
  2. Use Mermaid syntax for the diagram
  3. Include a description of what the diagram represents
  4. Update this README with a link to the new diagram

Viewing Diagrams

These diagrams are rendered automatically in:

  • GitHub markdown preview
  • Our documentation site (mdBook)
  • Most modern markdown editors

Mermaid Resources

Feature Specification: New Pricing + Free Credits Policy

Feature Branch: 001-new-pricing
Created: 2025-10-22
Status: Draft
Input: User description: "New Pricing + Free Credits Policy"

User Scenarios & Testing (mandatory)

User Story 1 - New User Onboarding with Free Credits (Priority: P1)

A new user signs up for Tenki and receives 1,000 free minutes (normalized to 2 vCPU runners) to explore the platform without requiring payment information upfront. They can access all features and offerings during this trial period.

Why this priority: This is the primary entry point for all new users and directly addresses the problem of acquiring users while deferring payment collection until value is demonstrated.

Independent Test: Can be fully tested by creating a new account, running jobs on various runner sizes, and verifying that free minutes are properly calculated and consumed based on vCPU scaling (e.g., 500 minutes on 4 vCPU, 250 minutes on 8 vCPU).

Acceptance Scenarios:

  1. Given a new user signs up for Tenki, When their account is created, Then they receive 1,000 free minutes normalized to 2 vCPU runners
  2. Given a user has free minutes remaining, When they use a 2 vCPU runner for 10 minutes, Then 10 minutes are deducted from their balance
  3. Given a user has free minutes remaining, When they use a 4 vCPU runner for 10 minutes, Then 20 minutes are deducted from their balance (scaled by vCPU ratio)
  4. Given a user has free minutes remaining, When they use an 8 vCPU runner for 10 minutes, Then 40 minutes are deducted from their balance (scaled by vCPU ratio)
  5. Given a new user with free credits, When they access the platform, Then they can use all features and runner types without restrictions
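
A minimal sketch of the normalization rule these scenarios exercise (illustrative names, assuming whole-minute inputs):

package pricing

const baselineVCPU = 2

// freeMinutesConsumed scales wall-clock minutes by the runner's vCPU count
// relative to the 2 vCPU baseline: 10 minutes on 4 vCPU consumes 20 minutes,
// and 10 minutes on 8 vCPU consumes 40.
func freeMinutesConsumed(wallClockMinutes, runnerVCPU int64) int64 {
	return wallClockMinutes * runnerVCPU / baselineVCPU
}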

User Story 2 - Payment Information Collection After Free Credits (Priority: P1)

When a user exhausts their 1,000 free minutes, the system prompts them to enter credit card information to continue using the platform on a pay-as-you-go basis.

Why this priority: This is the critical conversion point from free trial to paid customer and directly addresses the problem of verifying payment intent before allowing continued usage.

Independent Test: Can be tested by consuming all free minutes and verifying that the system blocks further usage until valid payment information is provided, then allows continued usage after payment details are entered.

Acceptance Scenarios:

  1. Given a user has consumed all 1,000 free minutes, When they attempt to run a new job, Then they are prompted to enter credit card information before proceeding
  2. Given a user is prompted for payment, When they enter valid credit card details, Then their account transitions to pay-as-you-go billing and jobs can proceed
  3. Given a user is prompted for payment, When they close the prompt without entering payment details, Then their jobs remain blocked until payment is provided
  4. Given a user has entered payment information, When they consume additional minutes, Then usage is tracked and billed according to PAYG pricing

User Story 3 - Pay-As-You-Go Usage and Billing (Priority: P2)

A paid user runs CI/CD jobs on various runner types and is charged per-minute based on the runner SKU pricing. They receive transparent billing for their actual usage with no upfront commitments.

Why this priority: This is the core revenue model for the platform and must work reliably for sustainable business operations.

Independent Test: Can be tested by running jobs on different runner SKUs, verifying per-minute charges match the pricing table, and confirming accurate invoice generation.

Acceptance Scenarios:

  1. Given a paid user runs a job on a 2c-4GB x64 runner for 10 minutes, When billing is calculated, Then they are charged $0.03 (10 min × $0.003/min)
  2. Given a paid user runs a job on a 4c-8GB x64 runner for 15 minutes, When billing is calculated, Then they are charged $0.09 (15 min × $0.006/min)
  3. Given a paid user with 40 concurrent jobs included, When they run 40 or fewer concurrent jobs, Then no additional concurrency charges apply
  4. Given a paid user, When they view their billing dashboard, Then they see itemized usage by runner type, duration, and total costs
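
A small sketch reproducing the arithmetic in scenarios 1-2, with rates taken from the scenarios themselves and expressed in mills (thousandths of a dollar) to keep the math integral:

package pricing

// Per-minute PAYG rates in mills, from the scenarios above.
var perMinuteMills = map[string]int64{
	"2c-4GB-x64": 3, // $0.003/min
	"4c-8GB-x64": 6, // $0.006/min
}

// chargeMills returns the usage charge in mills: 10 min on 2c-4GB is 30 mills
// ($0.03) and 15 min on 4c-8GB is 90 mills ($0.09).
func chargeMills(sku string, minutes int64) int64 {
	return minutes * perMinuteMills[sku]
}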

User Story 4 - Add-On Purchase and Management (Priority: P2)

A user on the PAYG plan purchases optional add-ons such as macOS M4 runner access, priority support, or priority queue boost to enhance their experience.

Why this priority: Add-ons provide upsell opportunities and feature-based segmentation, addressing the problem of revenue expansion and feature monetization.

Independent Test: Can be tested by purchasing an add-on (e.g., macOS access for $39/month), verifying access is granted, and confirming the recurring charge appears on invoices.

Acceptance Scenarios:

  1. Given a PAYG user, When they purchase macOS M4 runner access for $39/month, Then they can create and run jobs on macOS runners
  2. Given a user without macOS access, When they attempt to use macOS runners, Then they are prompted to purchase the add-on
  3. Given a user purchases priority support for $250/month, When they submit a support request, Then it is routed to the priority queue with private chat access
  4. Given a user purchases priority queue boost for $49/month per workspace, When their jobs are queued, Then they receive higher priority in job scheduling
  5. Given a user with add-ons, When they view their billing, Then add-on charges are itemized separately from usage charges

User Story 5 - Additional Concurrent Job Slot Purchase (Priority: P3)

A user exceeding the 40 included concurrent job slots purchases additional slots at $7/slot/month for x64 runners or $49/slot/month for macOS M4 runners.

Why this priority: This supports teams with high parallelism needs and provides incremental revenue, but is less critical than core pricing and add-ons.

Independent Test: Can be tested by running more than 40 concurrent jobs, purchasing additional slots, and verifying jobs execute in parallel up to the new limit.

Acceptance Scenarios:

  1. Given a user with 40 included concurrent slots, When they attempt to run 50 concurrent jobs, Then 10 jobs are queued until slots become available
  2. Given a user, When they purchase 10 additional x64 concurrent slots, Then they are charged $70/month and can run up to 50 concurrent x64 jobs
  3. Given a user, When they purchase 5 additional macOS M4 concurrent slots, Then they are charged $245/month and can run up to 45 concurrent macOS jobs (assuming base 40 applies to all runner types)
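
A rough sketch of the admission check implied by these scenarios, assuming x64 and macOS limits are tracked separately as the Assumptions section states (illustrative names only):

package scheduling

const baseConcurrentSlots = 40

// canStartJob reports whether another concurrent job may start for a given
// runner family, given jobs already running and purchased extra slots.
func canStartJob(runningJobs, purchasedExtraSlots int) bool {
	return runningJobs < baseConcurrentSlots+purchasedExtraSlots
}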

User Story 6 - Storage Billing (Priority: P3)

A user stores build artifacts, caches, and other data on the platform and is billed $0.20 per GB per month for storage consumption.

Why this priority: Storage is a necessary cost component but secondary to compute billing in terms of implementation priority and revenue impact.

Independent Test: Can be tested by uploading data, tracking storage usage over time, and verifying charges match $0.20/GB/month prorated.

Acceptance Scenarios:

  1. Given a user stores 50 GB of data, When monthly billing is calculated, Then they are charged $10 for storage
  2. Given a user uploads 20 GB on day 15 of the month, When monthly billing is calculated, Then they are charged approximately $2 (prorated for half month)
  3. Given a user has 10 GB of transparent cache included, When they use 10 GB or less total storage, Then no storage charges apply [NEEDS CLARIFICATION: Is the 10GB transparent cache counted toward the $0.20/GB storage billing, or is it separate?]
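
A hedged sketch of the proration implied by scenario 2, assuming daily storage samples averaged over the billing period (per FR-031 and the Assumptions section); the transparent-cache question flagged above is left open:

package billing

const storageCentsPerGBMonth = 20 // $0.20 per GB-month

// storageChargeCents averages daily storage samples (in GB) over the billing
// period and bills the average at $0.20/GB, rounding to the nearest cent.
func storageChargeCents(dailyGB []float64) int64 {
	if len(dailyGB) == 0 {
		return 0
	}
	var total float64
	for _, gb := range dailyGB {
		total += gb
	}
	average := total / float64(len(dailyGB))
	return int64(average*storageCentsPerGBMonth + 0.5)
}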

User Story 7 - Enterprise Custom Pricing Inquiry (Priority: P3)

An enterprise customer with predictable, high-volume usage requests committed use discounts and white-glove onboarding through a sales inquiry process.

Why this priority: Enterprise deals provide revenue predictability and larger contracts, but represent a smaller percentage of total users and require sales team involvement.

Independent Test: Can be tested by submitting an enterprise inquiry form, receiving a response from sales, and negotiating custom pricing terms outside the automated PAYG system.

Acceptance Scenarios:

  1. Given a user interested in enterprise pricing, When they submit an inquiry, Then they are contacted by the sales team within [NEEDS CLARIFICATION: response SLA not specified - 24 hours? 48 hours?]
  2. Given an enterprise customer commits to a minimum usage level, When their contract is established, Then they receive discounted per-minute rates compared to PAYG
  3. Given an enterprise customer, When they onboard, Then they receive dedicated success engineer support for migration and setup

User Story 8 - Premium Runner Pricing (Priority: P3)

A user opts to use premium runners (indicated by "Premium Pricing" in the SKU table) and is charged an additional fee on top of the base runner cost.

Why this priority: Premium runners provide differentiated service levels but are an optional enhancement to the base offering.

Independent Test: Can be tested by selecting a premium runner option, running a job, and verifying the charge includes the base price plus the premium surcharge.

Acceptance Scenarios:

  1. Given a user runs a job on a premium 2c-4GB runner for 10 minutes, When billing is calculated, Then they are charged $0.045 (10 min × ($0.003 + $0.0015)/min)
  2. Given a user runs a job on a premium 4c-8GB runner for 10 minutes, When billing is calculated, Then they are charged $0.090 (10 min × ($0.006 + $0.003)/min)
  3. Given a user, When they select a runner, Then they can choose between standard and premium options with clear pricing displayed [NEEDS CLARIFICATION: What specific benefits do premium runners provide - faster provisioning, dedicated resources, better SLA?]

Edge Cases

  • What happens when a user's payment method fails after exhausting free credits? Are jobs blocked immediately or is there a grace period?
  • How are partial minutes billed (e.g., a job that runs for 3.5 minutes)?
  • What happens if a user deletes stored data mid-month? Is storage billing prorated daily?
  • How are concurrent job limits enforced when a user has both x64 and macOS runners? Are the limits separate or combined?
  • What happens when a user downgrades or cancels add-ons mid-billing cycle? Do they receive prorated refunds or credits?
  • How are discounts (up to 50% offered by sales) applied to the billing system? Are they percentage discounts or fixed credits?
  • What happens when a user exhausts free credits in the middle of a running job? Is the job terminated or allowed to complete?
  • How is abuse detection handled for users who repeatedly create new accounts to exploit free credits?

Requirements (mandatory)

Functional Requirements

Free Credits System

  • FR-001: System MUST allocate 1,000 free minutes (normalized to 2 vCPU runners) to all new user accounts upon creation
  • FR-002: System MUST scale free minute consumption based on runner vCPU count (e.g., 4 vCPU uses 2× minutes, 8 vCPU uses 4× minutes)
  • FR-003: System MUST track free minute balance in real-time and display remaining balance to users
  • FR-004: System MUST allow users with free minutes to access all runner types and platform features without restrictions
  • FR-005: System MUST prevent job execution when free minutes are exhausted and payment information has not been provided

Payment Collection

  • FR-006: System MUST prompt users to enter credit card information when free minutes are exhausted
  • FR-007: System MUST validate and securely store payment information using industry-standard tokenization
  • FR-008: System MUST transition user accounts from free trial to PAYG billing status after payment information is collected
  • FR-009: System MUST block job execution for users who decline to provide payment information after exhausting free credits

Pay-As-You-Go Billing

  • FR-010: System MUST calculate per-minute charges for all runner types according to the defined pricing table (x64, macOS, premium)
  • FR-011: System MUST track actual usage time for each job execution down to the minute
  • FR-012: System MUST generate itemized invoices showing usage by runner type, duration, and cost
  • FR-013: System MUST charge payment methods on a monthly billing cycle for accumulated usage
  • FR-014: System MUST include 40 concurrent job slots in all PAYG accounts at no additional charge
  • FR-015: System MUST include 10 GB of transparent caching in all PAYG accounts at no additional charge

Add-On Management

  • FR-016: System MUST allow users to purchase macOS M4 runner access for $39/month per workspace
  • FR-017: System MUST allow users to purchase priority support for $250/month
  • FR-018: System MUST allow users to purchase priority queue boost for $49/month per workspace
  • FR-019: System MUST restrict access to add-on features until the corresponding add-on is purchased
  • FR-020: System MUST bill add-on charges as recurring monthly fees separate from usage charges
  • FR-021: System MUST allow users to enable, disable, or modify add-ons at any time
  • FR-022: System MUST grant macOS runner access only to users with the macOS add-on active
  • FR-023: System MUST route support requests to priority queue for users with priority support add-on
  • FR-024: System MUST prioritize job scheduling for workspaces with priority queue boost add-on

Concurrent Job Slot Management

  • FR-025: System MUST allow users to purchase additional concurrent job slots at $7/slot/month for x64 runners
  • FR-026: System MUST allow users to purchase additional concurrent job slots at $49/slot/month for macOS M4 runners
  • FR-027: System MUST enforce concurrent job limits based on base allocation plus purchased slots
  • FR-028: System MUST queue jobs that exceed concurrent slot limits until slots become available

Storage Billing

  • FR-029: System MUST track total storage consumption for each user account across artifacts, caches, and data
  • FR-030: System MUST bill storage at $0.20 per GB per month
  • FR-031: System MUST calculate storage billing based on average daily usage over the billing period
  • FR-032: System MUST display current storage usage and projected monthly costs to users

Runner Pricing

  • FR-033: System MUST support all x64 runner SKUs with specified pricing (2c-4GB through 64c-256GB)
  • FR-034: System MUST support macOS runner SKUs with specified pricing (6 vCPU, 12 vCPU)
  • FR-035: System MUST support premium pricing tier for eligible x64 runner SKUs with additional charges
  • FR-036: System MUST clearly display runner pricing to users when selecting runner types

Enterprise Tier

  • FR-037: System MUST provide a mechanism for users to request enterprise pricing and custom contracts
  • FR-038: System MUST support custom pricing configurations for enterprise accounts with committed use discounts
  • FR-039: System MUST allow sales team to configure account-specific discounts up to 50%
  • FR-040: System MUST support white-glove onboarding workflows for enterprise customers

Annual Prepayment Options (Internal)

  • FR-041: System MUST support 12-month prepayment for priority queue boost at $499 (15% discount)
  • FR-042: System MUST support 12-month prepayment for macOS M4 access at $399 (15% discount)
  • FR-043: System MUST apply prepaid add-ons to user accounts for 12-month duration

Abuse Prevention

  • FR-044: System MUST implement mechanisms to detect and prevent abuse patterns (repeated free credit exploitation, cryptocurrency mining, unauthorized Minecraft servers)
  • FR-045: System MUST require payment information as a verification gate to prevent abusive users from continuing operations

Key Entities

  • User Account: Represents an individual or organization using Tenki, with free credit balance, payment status, billing tier, and usage history
  • Free Credit Balance: The remaining free minutes available to a user, normalized to 2 vCPU baseline, consumed based on runner vCPU scaling
  • Payment Method: Tokenized credit card information associated with a user account for billing purposes
  • Add-On Subscription: A purchased add-on feature (macOS access, priority support, priority queue boost) with recurring billing
  • Concurrent Job Slot: Allocated capacity for running parallel jobs, includes base allocation plus purchased additional slots
  • Runner SKU: A specific runner configuration (vCPU, memory) with associated per-minute pricing
  • Usage Record: A log of job execution including runner type, duration, and calculated cost
  • Invoice: A monthly billing statement showing itemized usage charges, add-on fees, and total amount due
  • Enterprise Contract: A custom pricing agreement with committed use discounts and negotiated terms
  • Storage Allocation: The amount of data stored by a user, tracked for billing at $0.20/GB/month
  • Workspace: An organizational unit within a user account, relevant for workspace-specific add-ons (priority queue boost, macOS access)

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: 90% of new users successfully start their first job using free credits within 24 hours of signup
  • SC-002: Free credit system accurately scales minute consumption across all runner SKU types with 100% precision
  • SC-003: Payment conversion rate from free to paid users reaches at least 15% within 30 days of signup
  • SC-004: Billing calculations are accurate to the cent with zero disputes related to calculation errors in the first 90 days
  • SC-005: Users can view real-time usage and cost projections with data latency under 5 minutes
  • SC-006: Add-on purchases are reflected in user accounts and billing within 60 seconds of confirmation
  • SC-007: Abuse detection mechanisms block at least 95% of identified abusive patterns (mining, unauthorized servers) within 24 hours of detection
  • SC-008: Enterprise inquiry response time averages under 24 hours during business days
  • SC-009: Concurrent job limits are enforced in real-time with zero jobs exceeding purchased slot allocation
  • SC-010: Monthly revenue predictability improves by at least 30% through enterprise contracts and add-on subscriptions within 6 months of launch
  • SC-011: Customer support tickets related to billing and pricing decrease by 40% compared to the previous pricing model within 3 months
  • SC-012: Average revenue per user (ARPU) increases by at least 20% through add-on adoption within 6 months

Assumptions

  • Users understand vCPU-based scaling of free credits and can calculate their effective free minutes for different runner sizes
  • Industry-standard payment processing (Stripe or similar) is available and integrated for secure credit card handling
  • Enterprise sales team has capacity and process to handle custom pricing negotiations and white-glove onboarding
  • Abuse detection can leverage usage patterns, payment verification, and potentially behavioral analysis to identify bad actors
  • Storage billing is calculated daily and averaged over the monthly billing period for prorated charges
  • Partial minutes are rounded up to the next whole minute for billing purposes (industry standard for compute billing)
  • Payment method failures trigger automated retry logic and user notifications before blocking service
  • Annual prepayment options are available to sales team but not publicly advertised on the pricing page
  • The 10 GB transparent cache is included in base PAYG pricing and does not count toward the $0.20/GB storage billing
  • Concurrent job limits are enforced separately for x64 and macOS runners (not combined)
  • Add-ons can be canceled at any time but billing continues through the end of the current billing cycle (no prorated refunds)
  • Discounts applied by sales team are percentage-based and apply to all usage charges, not just specific SKUs
  • Jobs in progress when free credits are exhausted are allowed to complete before payment is required
  • Free credit abuse is mitigated by requiring unique email verification and detecting suspicious signup patterns.
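
To illustrate the vCPU-based scaling assumption above: free minutes are granted at the 2 vCPU baseline and consumed proportionally to runner size. Assuming a hypothetical grant of 2,000 baseline minutes (the actual grant is defined by the pricing plan, not this example), a 2 vCPU runner can run for the full 2,000 minutes, a 4 vCPU runner consumes credits at twice the baseline rate and can run for 1,000 minutes, and an 8 vCPU runner can run for 500 minutes.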

Getting Started with Tenki Cloud Development

Last updated: 2025-06-12

This guide will help you set up your development environment and run Tenki Cloud locally.

Prerequisites

Required Software

  1. Nix
  2. Devenv
  3. Direnv
  4. 1Password
  5. 1Password CLI
    • Integrate with 1Password CLI
    • Verify you have access to luxor/Engineering by running:
      op account list
      op vault list --account=luxor
      
    • If luxor isn’t listed as luxor.1password.com, try the other accounts from op account list:
      op vault list --account=<account_name>
      
    • If the luxor account isn’t showing at all, contact an administrator

Verify Prerequisites

nix --version
devenv version
direnv --version
op --version

Hardware Requirements

  • RAM: 16GB minimum (32GB recommended)
  • CPU: 4 cores minimum (8 cores recommended)
  • Disk: 50GB free space

Initial Setup

1. Clone the Repository

git clone https://github.com/luxorlabs/tenki.app.git
cd tenki.app

2. Pull Setup Keys

sh tools/scripts/setup.sh

If you run into an issue where it’s using the wrong account, try this:

op account list
# The account can be `luxor` if the URL is `luxor.1password.com`,
# or `my` if the URL is `my.1password.com` and there's only one account,
# or the third column (USER ID) if you have multiple accounts that all show the same URL.
sh tools/scripts/setup.sh <account>

Example: sh tools/scripts/setup.sh luxor

If you run into permission issues, try chmod +x ./tools/scripts/*.sh

3. Enable Development Environment

direnv allow

This will:

  • Install all required tools (Go, Node.js, pnpm, etc.)
  • Set up environment variables
  • Configure Git hooks

4. Install Dependencies

# Install all npm dependencies
pnpm install

# Generate protobuf code
bufgen

# Run Go mod tidy
tidy

# Update /etc/hosts entries
sync-hosts

# Initialize database
tb-format

5. Hosts and Other Setup

NOTE: Before running this script, hostctl must be installed on the host, since the script requires elevated privileges. Verify with hostctl --version

Add tenki.lab hosts:

sync-hosts

Format TigerBeetle database:

tb-format

Managing Environment and Secrets

Secret Files

  • resources/secrets/*.sops.yaml - Encrypted secrets pushed/committed to git
  • resources/secrets/*.local.yaml - Decrypted secrets (gitignored)

Commands

  • env-sync - Decrypt secrets and create a copy in the individual apps and backend folders
  • env-decrypt - Decrypt secrets, *.sops.yaml to *.local.yaml
  • env-encrypt - Encrypt secrets, *.local.yaml to *.sops.yaml

Pulling Latest Secrets

  1. git pull to get the latest secrets
  2. env-sync to create your own copy of the secrets

Updating Secrets

  1. Locate and update the secret you want to change in resources/secrets/*.local.yaml
  2. Run env-encrypt to encrypt the secret
  3. Commit the changes and push to GitHub

Overwriting Secrets (usually only needed once per setup)

  • For Next.js apps: .env.local will overwrite .env. Copy .env.sample to .env.local and update values
  • For backend: engine.local.yaml will overwrite engine.yaml. Copy engine.sample.yaml to engine.local.yaml and update values

Seeding and Migrations

NOTE: Before running these commands, the database must be up and running. Run dev up or devenv up

  • Run db up to migrate the database
  • Run psql -U postgres -d tenki -f ./tools/seed/20240915152331_seed.sql to seed the database with CSP & related data

Or run db deploy to run both.

  • Run db seed to seed the database with users, workspaces, projects, VMs. After this you can start the dev server & login with:
    • Email: odin@tenki.cloud
    • Password: tenki.app

Database Commands

db up          # Run migrations
db down        # Rollback last migration
db reset       # Reset database
db status      # Check migration status
db create add_users_table  # Create new migration
db deploy      # Run migrations and seed
db nuke        # Complete database reset
db seed        # Seed test data

For Redpanda, see internal docs to set it up.

NOTE: This should be automated in the future

Running the Application

Quick Start

# Start all services
dev up

# Access the application
open https://app.tenki.lab:4001

Development Domains

  • Frontend: https://app.tenki.lab:4001
  • Temporal UI: https://temporal.tenki.lab
  • Redpanda Console: https://redpanda.tenki.lab
  • Grafana: https://grafana.tenki.lab
  • API: https://api.tenki.lab

Individual Services

# Start specific service
dev up postgres
dev up temporal
dev up engine

# Other options
dev up --simple  # Minimal output
dev up -D        # Detached mode

# Service management
dev start [service]     # Start specific service
dev stop [service]      # Stop specific service
dev restart [service]   # Restart service
dev logs [service]      # View service logs
dev list               # List all services

# Examples
dev start  # (enter, then choose services, hit tab to select multiple)
dev start engine
dev logs -f postgres  # Follow logs

Development Workflow

Frontend Development

cd apps/app
pnpm dev

# Run type checking
pnpm type-check

# Run linting
pnpm lint
pnpm lint:fix

Backend Development

# Run Go services
cd backend
go run cmd/engine/main.go

# Run tests
gotest

# Generate mocks
gomocks

# Build binaries
make build-engine

Database Changes

# Create new migration
db create add_user_preferences

# Apply migrations
db up

# Rollback migration
db down

# Reset database
db reset

Resetting Existing/Flaky Local Environment

  • Close/stop all services
  • Run reset-local
  • In another terminal:
    • Run db deploy

Common Tasks

Adding shadcn/ui Components

# Add a new component
pnpm -F @shared/ui ui:add
# or
pnpm -F @shared/ui ui:add <component>

Then add the component to the exports in packages/ui/package.json:

"exports": {
  "./button": "./src/components/ui/button.tsx",
  "./alert-dialog": "./src/components/ui/alert-dialog.tsx"
}

Generating App Icons

pnpm -F app generate-icon

Then:

  • Copy public/images/favicon-196.png to:
    • src/app/favicon.png
    • src/app/icon.png
  • Copy all rel="apple-touch-startup-image" from src/asset-generator.html to src/app/layout.tsx

Adding a New API Endpoint

  1. Define proto in proto/tenki/cloud/
  2. Run bufgen to generate code
  3. Implement service in backend/internal/domain/
  4. Add tRPC router in apps/app/src/server/api/

Running Tests

# All tests
pnpm test
gotest

# Specific package
pnpm -F app test
cd backend && go test ./pkg/...

# Integration tests
gotest-integration

# With coverage
cd backend && go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

Debugging

# Check service health
dev list

# View all logs
dev logs

# Restart a service
dev restart engine

# Database console (direct connection)
psql -h localhost -U postgres -d tenki

# Temporal CLI
temporal workflow list

Troubleshooting

Port Already in Use

# Find process using port
lsof -i :4001

# Kill process
kill -9 <PID>

Database Connection Issues

# Restart postgres
dev restart postgres

# Check logs
dev logs postgres

# Reset database
db nuke

Proto Generation Fails

# Clean and regenerate
rm -rf backend/pkg/proto
bufgen

Node Modules Issues

# Clean and reinstall
rm -rf node_modules apps/*/node_modules packages/*/node_modules
pnpm install

Next Steps

Editor Setup

VS Code

  1. Install recommended extensions:

    • Go
    • ESLint
    • Prettier
    • Proto3
  2. Use workspace settings:

    {
      "editor.formatOnSave": true,
      "go.lintTool": "golangci-lint"
    }
    

GoLand/WebStorm

  1. Enable Go modules
  2. Set up file watchers for:
    • gofmt
    • prettier
    • eslint

Runner Prerequisites

Setting up GitHub App

  • Create a GitHub organization (skip this if you already have one)
  • Create a GitHub App
    • Run pnpm -F github-app run create or pnpm -F github-app run create -o <org>
    • If the name already exists, change it and continue
    • Once done, it will redirect you to a success screen; close the tab
  • In github-app-response.json, take note of the slug, pem, webhook_secret, client_id and client_secret

Backend Development Guide

Last updated: 2025-06-12

Overview

The Tenki backend is built with Go and follows Domain-Driven Design principles. Services communicate via Connect RPC (gRPC-Web compatible) and use Temporal for workflow orchestration.

Project Structure

backend/
β”œβ”€β”€ cmd/                  # Application entry points
β”‚   β”œβ”€β”€ engine/           # Main backend service
β”‚   └── tenki-cli/        # CLI tool
β”œβ”€β”€ internal/             # Private application code
β”‚   β”œβ”€β”€ app/              # Application layer
β”‚   └── domain/           # Business domains
β”‚       β”œβ”€β”€ billing/      # Billing domain
β”‚       β”œβ”€β”€ compute/      # VM management
β”‚       β”œβ”€β”€ identity/     # Auth & users
β”‚       β”œβ”€β”€ runner/       # GitHub runners
β”‚       └── workspace/    # Multi-tenancy
β”œβ”€β”€ pkg/                  # Public packages
β”‚   └── proto/            # Generated protobuf
β”œβ”€β”€ queries/              # SQL queries (sqlc)
└── schema/               # Database migrations

Development Workflow

Running the Backend

# Start dependencies
dev up postgres temporal kafka

# Run migrations
db deploy

# Start engine
cd backend
go run cmd/engine/main.go

# Or use the dev script
dev restart engine

Adding a New Feature

  1. Define the API

    // proto/tenki/cloud/workspace/v1/project.proto
    service ProjectService {
      rpc CreateProject(CreateProjectRequest) returns (CreateProjectResponse);
    }
    
  2. Generate code

    bufgen
    
  3. Implement domain logic

    // internal/domain/workspace/service/project.go
    func (s *Service) CreateProject(ctx context.Context, req *params.CreateProject) (*models.Project, error) {
        // Business logic here
    }
    
  4. Write SQL queries

    -- queries/workspace/project.sql
    -- name: CreateProject :one
    INSERT INTO projects (name, workspace_id)
    VALUES ($1, $2)
    RETURNING *;
    
  5. Generate SQL code

    cd backend && sqlc generate
    

Testing

Unit Tests

func TestService_CreateProject(t *testing.T) {
    tests := []struct {
        name    string
        input   *params.CreateProject
        want    *models.Project
        wantErr bool
    }{
        {
            name: "valid project",
            input: &params.CreateProject{
                Name:        "test-project",
                WorkspaceID: "ws-123",
            },
            want: &models.Project{
                Name: "test-project",
            },
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Test implementation
        })
    }
}

Integration Tests

//go:build integration

var _ = Describe("Project Service", func() {
    var (
        service *workspace.Service
        db      *sql.DB
    )

    BeforeEach(func() {
        db = setupTestDB()
        service = workspace.NewService(workspace.WithDB(db))
    })

    It("should create a project", func() {
        project, err := service.CreateProject(ctx, params)
        Expect(err).NotTo(HaveOccurred())
        Expect(project.Name).To(Equal("test"))
    })
})

Running Tests

# Unit tests only
gotest

# Integration tests
gotest-integration

# Specific package
cd backend && go test ./internal/domain/workspace/...

# With coverage
cd backend && go test -cover ./...

Database Operations

Migrations

# Create migration
echo "CREATE TABLE features (id uuid PRIMARY KEY);" > backend/schema/$(date +%Y%m%d%H%M%S)_add_features.sql

# Apply migrations
db up

# Rollback
db down

Query Development

  1. Write query in backend/queries/
  2. Run sqlc generate
  3. Use generated code in service

// Generated code usage
project, err := s.db.CreateProject(ctx, db.CreateProjectParams{
    Name:        req.Name,
    WorkspaceID: req.WorkspaceID,
})

Temporal Workflows

Workflow Definition

func RunnerProvisioningWorkflow(ctx workflow.Context, params RunnerParams) error {
    // Configure workflow
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    })

    // Execute activities
    var runner *models.Runner
    err := workflow.ExecuteActivity(ctx, CreateRunnerActivity, params).Get(ctx, &runner)
    if err != nil {
        return fmt.Errorf("create runner: %w", err)
    }

    return nil
}

Testing Workflows

func TestRunnerProvisioningWorkflow(t *testing.T) {
    suite := testsuite.WorkflowTestSuite{}
    env := suite.NewTestWorkflowEnvironment()

    // Mock activities
    env.OnActivity(CreateRunnerActivity, mock.Anything).Return(&models.Runner{ID: "123"}, nil)

    // Execute workflow
    env.ExecuteWorkflow(RunnerProvisioningWorkflow, params)

    require.True(t, env.IsWorkflowCompleted())
    require.NoError(t, env.GetWorkflowError())
}

API Patterns

Service Options

// Use functional options pattern
type Service struct {
    db       *db.Queries
    temporal client.Client
    logger   *slog.Logger
}

type Option func(*Service)

func WithDB(db *db.Queries) Option {
    return func(s *Service) {
        s.db = db
    }
}

func NewService(opts ...Option) *Service {
    s := &Service{
        logger: slog.Default(),
    }
    for _, opt := range opts {
        opt(s)
    }
    return s
}
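
A construction call under this pattern might look like the following sketch. Only WithDB is defined in this guide; the queries value is assumed to come from the sqlc-generated db package, and any further options follow the same shape.

// Sketch: wire a Service from its dependencies using the options above.
// Treat this as illustrative wiring, not the exact code in the repository.
func buildService(queries *db.Queries) *Service {
    return NewService(
        WithDB(queries),
    )
}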

Error Handling

// Define domain errors
var (
    ErrProjectNotFound = errors.New("project not found")
    ErrUnauthorized    = errors.New("unauthorized")
)

// Wrap errors with context
if err != nil {
    return fmt.Errorf("fetch project %s: %w", projectID, err)
}

// Check errors
if errors.Is(err, ErrProjectNotFound) {
    return connect.NewError(connect.CodeNotFound, err)
}

Debugging

Local Debugging

# Enable debug logging
export LOG_LEVEL=debug

# Run with delve
dlv debug cmd/engine/main.go

# Attach to running process
dlv attach $(pgrep engine)

Temporal UI

# View workflows
open https://temporal.tenki.lab

# List workflows via CLI
temporal workflow list --query 'WorkflowType="RunnerProvisioningWorkflow"'

# Describe workflow
temporal workflow describe -w <workflow-id>

Database Queries

# Connect to database
dev exec postgres psql -U postgres tenki

# Useful queries
SELECT * FROM runners WHERE created_at > NOW() - INTERVAL '1 hour';
SELECT COUNT(*) FROM workflow_runs GROUP BY status;

Performance Tips

  1. Use prepared statements - sqlc does this automatically
  2. Batch operations - Use CopyFrom for bulk inserts
  3. Connection pooling - Configure in engine.yaml
  4. Context cancellation - Always respect context.Done()
  5. Concurrent operations - Use errgroup for parallel work (see the sketch below)
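
As a minimal sketch of tip 5, here is one way to fan out work with errgroup; fetchRunner is a hypothetical single-item loader and Runner stands in for whatever model is being loaded.

import (
    "context"
    "fmt"

    "golang.org/x/sync/errgroup"
)

// fetchRunners loads several runners concurrently and fails fast on the first error.
// fetchRunner is a hypothetical helper used only for illustration.
func fetchRunners(ctx context.Context, ids []string) ([]*Runner, error) {
    g, ctx := errgroup.WithContext(ctx)
    runners := make([]*Runner, len(ids))

    for i, id := range ids {
        i, id := i, id // capture loop variables for the goroutine
        g.Go(func() error {
            r, err := fetchRunner(ctx, id)
            if err != nil {
                return fmt.Errorf("fetch runner %s: %w", id, err)
            }
            runners[i] = r
            return nil
        })
    }

    if err := g.Wait(); err != nil {
        return nil, err
    }
    return runners, nil
}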

Common Patterns

Repository Pattern

type RunnerRepository interface {
    Create(ctx context.Context, runner *Runner) error
    GetByID(ctx context.Context, id string) (*Runner, error)
    List(ctx context.Context, filter Filter) ([]*Runner, error)
}
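
Depending on the interface rather than a concrete store keeps services easy to mock in unit tests. A consumer might look like this sketch (the function name and its use of the Filter argument are assumptions for illustration):

// Illustrative consumer: only the RunnerRepository interface is required,
// so tests can substitute a generated mock.
func listRunners(ctx context.Context, repo RunnerRepository, filter Filter) ([]*Runner, error) {
    return repo.List(ctx, filter)
}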

Builder Pattern

query := NewQueryBuilder().
    Where("status", "active").
    OrderBy("created_at", "DESC").
    Limit(10).
    Build()
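
QueryBuilder is not defined elsewhere in this guide; one minimal sketch consistent with the usage above (the Query struct and the generated SQL shape are assumptions, not the actual implementation) could look like:

import (
    "fmt"
    "strings"
)

// Query is the assembled result; its shape is an assumption for this sketch.
type Query struct {
    SQL  string
    Args []any
}

// QueryBuilder accumulates clauses fluently, matching the usage shown above.
type QueryBuilder struct {
    wheres  []string
    args    []any
    orderBy string
    limit   int
}

func NewQueryBuilder() *QueryBuilder { return &QueryBuilder{} }

func (b *QueryBuilder) Where(column string, value any) *QueryBuilder {
    b.args = append(b.args, value)
    b.wheres = append(b.wheres, fmt.Sprintf("%s = $%d", column, len(b.args)))
    return b
}

func (b *QueryBuilder) OrderBy(column, direction string) *QueryBuilder {
    b.orderBy = column + " " + direction
    return b
}

func (b *QueryBuilder) Limit(n int) *QueryBuilder {
    b.limit = n
    return b
}

// Build assembles the clause fragment and its positional arguments.
func (b *QueryBuilder) Build() Query {
    sql := "WHERE " + strings.Join(b.wheres, " AND ")
    if b.orderBy != "" {
        sql += " ORDER BY " + b.orderBy
    }
    if b.limit > 0 {
        sql += fmt.Sprintf(" LIMIT %d", b.limit)
    }
    return Query{SQL: sql, Args: b.args}
}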

Middleware Pattern

func LoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)
        slog.Info("request", "method", r.Method, "path", r.URL.Path, "duration", time.Since(start))
    })
}

Resources

Frontend Development Guide

Last updated: 2025-06-12

Overview

The Tenki frontend is built with Next.js 15, React 19, and TypeScript. We use tRPC for type-safe API communication, Tailwind CSS for styling, and Radix UI for accessible components.

Tech Stack

  • Framework: Next.js 15 (App Router)
  • Language: TypeScript
  • Styling: Tailwind CSS + Radix UI
  • State: React Context + Zustand
  • API: tRPC
  • Forms: React Hook Form + Zod
  • Testing: Jest + React Testing Library

Project Structure

apps/app/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ app/                # Next.js app router pages
β”‚   β”‚   β”œβ”€β”€ (dashboard)/   # Protected routes
β”‚   β”‚   β”œβ”€β”€ auth/          # Auth pages
β”‚   β”‚   └── api/           # API routes
β”‚   β”œβ”€β”€ components/        # Reusable components
β”‚   β”œβ”€β”€ hooks/            # Custom hooks
β”‚   β”œβ”€β”€ server/           # Server-side code
β”‚   β”‚   └── api/         # tRPC routers
β”‚   β”œβ”€β”€ trpc/            # tRPC client setup
β”‚   └── utils/           # Utilities
β”œβ”€β”€ public/              # Static assets
└── next.config.mjs      # Next.js config

Development Workflow

Running the Frontend

# Start all services (recommended)
pnpm dev

# Or just the frontend
pnpm -F app dev

# Access at
open https://app.tenki.lab:4001

Creating Components

// components/project-card.tsx
interface ProjectCardProps {
  project: Project;
  onSelect?: (project: Project) => void;
}

export function ProjectCard({ project, onSelect }: ProjectCardProps) {
  return (
    <Card onClick={() => onSelect?.(project)} className="cursor-pointer transition-shadow hover:shadow-lg">
      <CardHeader>
        <CardTitle>{project.name}</CardTitle>
      </CardHeader>
      <CardContent>
        <p className="text-muted-foreground">{project.description}</p>
      </CardContent>
    </Card>
  );
}

Using tRPC

// In a client component
"use client";

import { trpc } from "@/trpc/client";

export function ProjectList() {
  // useUtils exposes the query cache for the invalidation below
  const utils = trpc.useUtils();
  const { data: projects, isLoading } = trpc.project.list.useQuery();

  const createProject = trpc.project.create.useMutation({
    onSuccess: () => {
      // Invalidate and refetch
      utils.project.list.invalidate();
    },
  });

  if (isLoading) return <Skeleton />;

  return (
    <div>
      {projects?.map((project) => (
        <ProjectCard key={project.id} project={project} />
      ))}
    </div>
  );
}

Creating tRPC Routes

// server/api/routers/project.ts
export const projectRouter = createTRPCRouter({
  list: protectedProcedure.query(async ({ ctx }) => {
    return ctx.db.project.findMany({
      where: { workspaceId: ctx.session.workspaceId },
    });
  }),

  create: protectedProcedure
    .input(
      z.object({
        name: z.string().min(1),
        description: z.string().optional(),
      }),
    )
    .mutation(async ({ ctx, input }) => {
      return ctx.db.project.create({
        data: {
          ...input,
          workspaceId: ctx.session.workspaceId,
        },
      });
    }),
});

Styling Guidelines

Using Tailwind

// Use semantic color classes
<div className="bg-background text-foreground">
  <button className="bg-primary text-primary-foreground hover:bg-primary/90">
    Click me
  </button>
</div>

// Responsive design
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
  {/* Grid items */}
</div>

// Dark mode support (automatic)
<div className="bg-white dark:bg-gray-900">
  Content adapts to theme
</div>

Component Composition

// Use Radix UI primitives
import * as Dialog from "@radix-ui/react-dialog";

export function CreateProjectDialog() {
  return (
    <Dialog.Root>
      <Dialog.Trigger asChild>
        <Button>Create Project</Button>
      </Dialog.Trigger>
      <Dialog.Portal>
        <Dialog.Overlay className="fixed inset-0 bg-black/50" />
        <Dialog.Content className="bg-background fixed top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2 rounded-lg p-6">
          <Dialog.Title>Create Project</Dialog.Title>
          {/* Form content */}
        </Dialog.Content>
      </Dialog.Portal>
    </Dialog.Root>
  );
}

State Management

Local State

// For simple component state
const [isOpen, setIsOpen] = useState(false);

Context for Feature State

// contexts/project-context.tsx
const ProjectContext = createContext<ProjectContextType | null>(null);

export function ProjectProvider({ children }: { children: ReactNode }) {
  const [selectedProject, setSelectedProject] = useState<Project | null>(null);

  return <ProjectContext.Provider value={{ selectedProject, setSelectedProject }}>{children}</ProjectContext.Provider>;
}

export function useProject() {
  const context = useContext(ProjectContext);
  if (!context) throw new Error("useProject must be used within ProjectProvider");
  return context;
}

Global State with Zustand

// stores/user-preferences.ts
import { create } from "zustand";

interface PreferencesStore {
  theme: "light" | "dark" | "system";
  setTheme: (theme: PreferencesStore["theme"]) => void;
}

export const usePreferences = create<PreferencesStore>((set) => ({
  theme: "system",
  setTheme: (theme) => set({ theme }),
}));

Forms

With React Hook Form + Zod

const ProjectSchema = z.object({
  name: z.string().min(1, "Name is required"),
  description: z.string().optional(),
  isPublic: z.boolean().default(false),
});

type ProjectForm = z.infer<typeof ProjectSchema>;

export function CreateProjectForm() {
  const form = useForm<ProjectForm>({
    resolver: zodResolver(ProjectSchema),
    defaultValues: {
      name: "",
      isPublic: false,
    },
  });

  const onSubmit = async (data: ProjectForm) => {
    await createProject.mutateAsync(data);
  };

  return (
    <Form {...form}>
      <form onSubmit={form.handleSubmit(onSubmit)}>
        <FormField
          control={form.control}
          name="name"
          render={({ field }) => (
            <FormItem>
              <FormLabel>Project Name</FormLabel>
              <FormControl>
                <Input {...field} />
              </FormControl>
              <FormMessage />
            </FormItem>
          )}
        />
        <Button type="submit">Create</Button>
      </form>
    </Form>
  );
}

Testing

Component Tests

// __tests__/project-card.test.tsx
import { ProjectCard } from "@/components/project-card";
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";

describe("ProjectCard", () => {
  it("displays project information", () => {
    const project = { id: "1", name: "Test Project", description: "Test" };
    render(<ProjectCard project={project} />);

    expect(screen.getByText("Test Project")).toBeInTheDocument();
    expect(screen.getByText("Test")).toBeInTheDocument();
  });

  it("calls onSelect when clicked", async () => {
    const onSelect = jest.fn();
    const project = { id: "1", name: "Test Project" };

    render(<ProjectCard project={project} onSelect={onSelect} />);
    await userEvent.click(screen.getByRole("article"));

    expect(onSelect).toHaveBeenCalledWith(project);
  });
});

Running Tests

# Run all tests
pnpm test

# Watch mode
pnpm test:watch

# With coverage
pnpm test:coverage

Performance

Image Optimization

import Image from "next/image";

<Image
  src="/logo.png"
  alt="Logo"
  width={200}
  height={50}
  priority // For above-the-fold images
/>;

Code Splitting

// Dynamic imports for heavy components
import dynamic from "next/dynamic";

const HeavyChart = dynamic(() => import("@/components/heavy-chart"), {
  loading: () => <Skeleton className="h-96" />,
  ssr: false, // Disable SSR for client-only components
});

Data Fetching

// Server component (default in app router)
async function ProjectPage({ params }: { params: { id: string } }) {
  const project = await api.project.get({ id: params.id });
  return <ProjectDetails project={project} />;
}

// Parallel data fetching
async function DashboardPage() {
  const [projects, stats] = await Promise.all([api.project.list(), api.stats.get()]);

  return (
    <>
      <StatsCard stats={stats} />
      <ProjectList projects={projects} />
    </>
  );
}

Common Patterns

Error Boundaries

export function ProjectErrorBoundary({ children }: { children: ReactNode }) {
  return (
    <ErrorBoundary
      fallback={
        <Alert variant="destructive">
          <AlertTitle>Something went wrong</AlertTitle>
          <AlertDescription>Unable to load projects. Please try again.</AlertDescription>
        </Alert>
      }
    >
      {children}
    </ErrorBoundary>
  );
}

Loading States

export function ProjectListSkeleton() {
  return (
    <div className="space-y-4">
      {Array.from({ length: 3 }).map((_, i) => (
        <Skeleton key={i} className="h-24" />
      ))}
    </div>
  );
}

Accessibility

// Always include ARIA labels
<button
  aria-label="Delete project"
  onClick={handleDelete}
>
  <TrashIcon />
</button>

// Keyboard navigation
<div
  role="button"
  tabIndex={0}
  onKeyDown={(e) => {
    if (e.key === 'Enter' || e.key === ' ') {
      handleClick();
    }
  }}
>
  Interactive element
</div>

Debugging

React DevTools

  1. Install React Developer Tools extension
  2. Use Components tab to inspect props/state
  3. Use Profiler tab for performance analysis

tRPC DevTools

// Automatically included in development
// View network tab for tRPC requests
// Check request/response payloads

Common Issues

Hydration Errors

// Ensure client and server markup match: only render client-only components in the browser
{typeof window !== "undefined" && <ClientOnlyComponent />}

State Not Updating

// Use callbacks for state depending on previous
setItems((prev) => [...prev, newItem]);

Resources

Frontend Testing

Unit Tests

  • Unit tests are colocated with the code they test. Good examples of test files can be found in the /apps/app/src/utils/__tests__ directory, which contains unit tests for the files in the /apps/app/src/utils/ folder.
  • Additional examples of unit test files exist throughout the frontend codebase. They have a .test.{ts,tsx} extension and are sometimes located in __tests__ directories.

Unit Test Approach

  • Implemented using vitest
  • Add as many unit tests as possible, especially for pure functions and complex business logic that can be tested independently without relying on extensive mocking and external dependencies.
  • Prioritize testing different properties and scenarios to catch easy-to-miss edge cases instead of only following the happy path with a few examples.

Running Unit Tests

  • Run pnpm test:unit to run all unit tests
  • Run pnpm test:unit:coverage to run all unit tests and get a coverage report in the terminal

Frontend Test Cases Guide

THIS DOCUMENT IS STILL A WIP…

The test-cases directory inside apps/app contains structured test specifications that define the expected behavior of our application. These specifications serve as a bridge between product requirements and automated tests.

Overview

The test cases are defined in JSON files and follow a strict schema (defined in schema.json). Each test case is identified by a unique ID and contains detailed information about what needs to be tested.

File Structure

  • schema.json - Defines the structure and validation rules for test case specifications
  • onboarding.spec.json - Test specifications for user onboarding flows
  • Additional .spec.json files for other features

Test Case Schema

Each test case follows this structure:

{
  "TEST-001": {
    "title": "Test case title",
    "priority": "P0",
    "preconditions": ["List of conditions that must be met"],
    "steps": ["Step-by-step test instructions"],
    "acceptance_criteria": ["List of criteria that must be met"]
  }
}

Fields Explained

  • title: A descriptive name for the test case
  • priority: Importance level (P0-P3)
    • P0: Critical path, must not break and must be covered by automated tests
    • P1: Core functionality
    • P2: Important but not critical
    • P3: Nice to have
  • preconditions: Required setup or state before running the test
  • steps: Detailed test steps
  • acceptance_criteria: What must be true for the test to pass

Priority Levels

  • P0: Critical business flows (e.g., user registration, login)
  • P1: Core features that significantly impact user experience
  • P2: Secondary features that enhance user experience
  • P3: Edge cases and nice-to-have features

Updating Test Cases

  1. Adding a New Test Case:

    • Choose an appropriate spec file (or create a new one for new features)
    • Add a new entry with a unique ID (format: XXX-###)
    • Fill in all required fields according to the schema
    • Validate against schema.json
  2. Modifying Existing Test Cases:

    • Update the relevant fields
    • Ensure changes are reflected in the corresponding automated tests
    • Keep the test ID unchanged
  3. Best Practices:

    • Keep steps clear and actionable
    • Write acceptance criteria that can be automated
    • Include edge cases and error scenarios
    • Document dependencies between test cases

Integration with Automated Tests

The test specifications in this directory serve as a source of truth for our automated tests. The relationship works as follows:

  1. Test specs define WHAT needs to be tested
  2. Automated tests implement HOW to test it
  3. Automated tests written for a test case should reference its corresponding test case ID

Example:

describe("ONB-001: User Registration - with email", () => {
  it("should complete registration flow successfully", async () => {
    // Test implementation
  });
});

Maintaining Test Coverage

  1. Every new feature should have corresponding test cases
  2. Test cases should be reviewed along with code changes
  3. Regular audits ensure test coverage matches specifications
  4. Update or deprecate test cases when features change

Database Guide

Overview

This guide covers database development practices for Tenki Cloud, including schema management, migrations, and query patterns.

Database Stack

  • PostgreSQL: Primary database
  • sqlc: Type-safe SQL query generation
  • golang-migrate: Database migration management

Schema Management

Migrations

All database schema changes must be made through migrations:

# Create a new migration
make migration name=add_user_settings

# Run migrations
make migrate-up

# Rollback last migration
make migrate-down

Best Practices

  1. Always include both up and down migrations
  2. Keep migrations small and focused
  3. Test rollbacks before merging
  4. Never modify existing migrations

Query Development

We use sqlc for type-safe database queries:

Writing Queries

  1. Add queries to pkg/db/queries/*.sql
  2. Use named parameters: @param_name
  3. Follow naming conventions:
    • GetUserByID for single row
    • ListUsersByOrg for multiple rows
    • CreateUser for inserts
    • UpdateUser for updates
    • DeleteUser for deletes

Generating Code

# Generate Go code from SQL
make sqlc

Performance

Indexing

  • Add indexes for frequently queried columns
  • Use composite indexes for multi-column queries
  • Monitor slow query logs

Query Optimization

  • Use EXPLAIN ANALYZE for query planning
  • Avoid N+1 queries
  • Batch operations when possible
  • Use database views for complex queries

Testing

Unit Tests

  • Mock database interfaces
  • Test query logic separately from business logic

Integration Tests

  • Use test database containers
  • Clean up test data after each test
  • Test migration up/down paths

Testing Guide

This guide covers the testing strategies and patterns used in Tenki Cloud, with a focus on writing effective tests for backend services, particularly those using Temporal workflows.

Overview

Tenki Cloud uses a comprehensive testing approach that includes:

  • Unit Tests: Fast, isolated tests using mocks to verify business logic
  • Integration Tests: End-to-end tests running in a real environment
  • Table-Driven Tests: Systematic approach for testing multiple scenarios
  • BDD-Style Tests: Behavior-driven tests using Ginkgo/Gomega

Testing Stack

Core Libraries

// Unit Testing
import (
    "testing"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    "github.com/stretchr/testify/mock"
)

// Integration Testing
import (
    "github.com/onsi/ginkgo/v2"
    "github.com/onsi/gomega"
)

// Temporal Testing
import (
    "go.temporal.io/sdk/testsuite"
)

Project Structure

internal/domain/{domain}/
β”œβ”€β”€ service/           # Business logic
β”œβ”€β”€ db/               # Database queries (sqlc generated)
β”œβ”€β”€ interface.go      # Service interfaces
β”œβ”€β”€ mock_*.go         # Generated mocks
└── worker/           # Temporal workers
    β”œβ”€β”€ activities/   # Temporal activities
    β”‚   β”œβ”€β”€ *.go     # Activity implementations
    β”‚   └── *_test.go # Activity unit tests
    β”œβ”€β”€ workflows/    # Temporal workflows
    β”‚   β”œβ”€β”€ *.go     # Workflow implementations
    β”‚   └── *_test.go # Workflow unit tests
    └── integration_*.go # Integration tests

Unit Testing

Activity Testing

Activities should be tested with mocked dependencies to ensure business logic correctness.

Basic Pattern

func TestActivities_GetRunnerInstallation(t *testing.T) {
    t.Parallel()

    tests := []struct {
        name           string
        installationId int64
        mockResponse   *connect.Response[runnerproto.GetRunnerInstallationResponse]
        mockError      error
        expectedResult *runnerproto.RunnerInstallation
        expectErr      bool
    }{
        {
            name:           "success",
            installationId: 1234,
            mockResponse: connect.NewResponse(&runnerproto.GetRunnerInstallationResponse{
                RunnerInstallation: &runnerproto.RunnerInstallation{
                    Id: "abc123",
                },
            }),
            expectedResult: &runnerproto.RunnerInstallation{Id: "abc123"},
        },
        {
            name:           "service error",
            installationId: 1234,
            mockError:      connect.NewError(connect.CodeInternal, nil),
            expectErr:      true,
        },
    }

    for _, tc := range tests {
        t.Run(tc.name, func(t *testing.T) {
            // Setup mock
            svc := &runner.MockService{}
            svc.On("GetRunnerInstallation", mock.Anything, mock.Anything).
                Return(tc.mockResponse, tc.mockError)

            // Create activities with mock
            a := newTestActivities(svc, t)

            // Execute
            result, err := a.GetRunnerInstallation(context.Background(), tc.installationId)

            // Assert
            if tc.expectErr {
                assert.Error(t, err)
                assert.Nil(t, result)
            } else {
                assert.NoError(t, err)
                assert.Equal(t, tc.expectedResult, result)
            }
        })
    }
}

Testing with Complex Arguments

// Use MatchedBy for complex argument validation
svc.On("UpdateRunners", mock.Anything,
    mock.MatchedBy(func(req *connect.Request[runnerproto.UpdateRunnersRequest]) bool {
        return assert.ElementsMatch(t, req.Msg.Ids, expectedIds) &&
               assert.Equal(t, req.Msg.State, expectedState)
    })).Return(nil, nil)

Test Helper Functions

Create reusable test helpers to reduce boilerplate:

func newTestActivities(svc runner.Service, t *testing.T) *activities {
    logger := log.NewTestLogger(t)
    sr := trace.NewSpanRecorder()
    tracer, _ := trace.NewTestTracer(sr)

    return &activities{
        logger:  logger,
        svc:     svc,
        tracer:  tracer,
    }
}

Workflow Testing

Workflows require mocking activities since they orchestrate multiple operations.

Basic Workflow Test

func TestGithubJobWorkflow(t *testing.T) {
    var ts testsuite.WorkflowTestSuite

    t.Run("happy path", func(t *testing.T) {
        env := ts.NewTestWorkflowEnvironment()

        // Register activities with stubs
        env.RegisterActivityWithOptions(stubFunc,
            temporal.RegisterOptions{Name: runner.GithubJobWorkflowActivity})

        // Mock activity responses
        env.OnActivity(runner.GithubJobWorkflowActivity, mock.Anything, mock.Anything).
            Return(nil, nil)

        // Execute workflow
        event := github.WorkflowJobEvent{
            Action: github.String("completed"),
            Installation: &github.Installation{ID: github.Int64(123)},
        }
        env.ExecuteWorkflow((&workflows{}).GithubJobWorkflow, event)

        // Assert completion
        require.True(t, env.IsWorkflowCompleted())
        require.NoError(t, env.GetWorkflowError())
    })
}

Testing Retry Logic

t.Run("retry on transient error", func(t *testing.T) {
    env := ts.NewTestWorkflowEnvironment()

    callCount := 0
    env.OnActivity(runner.SomeActivity, mock.Anything, mock.Anything).
        Return(func(context.Context, interface{}) error {
            callCount++
            if callCount < 3 {
                return errors.New("transient error")
            }
            return nil
        })

    env.ExecuteWorkflow(workflow, input)

    require.True(t, env.IsWorkflowCompleted())
    require.NoError(t, env.GetWorkflowError())
    assert.Equal(t, 3, callCount)
})

Integration Testing

Integration tests verify the entire system working together with real dependencies.

Setup with Ginkgo

Test Suite Entry Point

//go:build integration

func TestIntegration(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Runner Worker Integration Tests")
}

Suite Configuration

var _ = BeforeSuite(func() {
    // Start Temporal dev server
    cmd := exec.Command("temporal", "server", "start-dev",
        "--port", "7233",
        "--ui-port", "8233",
        "--db-filename", filepath.Join(tempDir, "temporal.db"))

    // Initialize global dependencies
    initializeDatabase()
    initializeTracing()
})

var _ = AfterSuite(func() {
    // Clean up
    stopTemporalServer()
    closeDatabase()
})

var _ = BeforeEach(func() {
    // Start transaction for test isolation
    tx = db.BeginTx()

    // Create service instances
    runnerService = createRunnerService(tx)

    // Start worker
    worker = temporal.NewWorker(client, taskQueue, temporal.WorkerOptions{})
    temporal.RegisterWorkflows(worker)
    temporal.RegisterActivities(worker, activities)
    worker.Start()
})

var _ = AfterEach(func() {
    // Rollback transaction
    tx.Rollback()

    // Stop worker
    worker.Stop()
})

Writing Integration Tests

var _ = Describe("Runner Installation", func() {
    Context("when installing runners", func() {
        It("should install runner successfully", func() {
            // Start workflow
            workflowId := fmt.Sprintf("test-install-%s", uuid.New())
            run, err := temporalClient.ExecuteWorkflow(
                context.Background(),
                client.StartWorkflowOptions{
                    ID:        workflowId,
                    TaskQueue: runner.TaskQueue,
                },
                runner.RunnerInstallWorkflow,
                installationId,
            )
            Expect(err).ToNot(HaveOccurred())

            // Trigger installation via service
            _, err = runnerService.InstallRunners(ctx, connect.NewRequest(
                &runnerproto.InstallRunnersRequest{
                    InstallationId: installationId,
                    WorkspaceId:    workspaceId,
                },
            ))
            Expect(err).ToNot(HaveOccurred())

            // Send signal to workflow
            err = temporalClient.SignalWorkflow(
                context.Background(),
                workflowId,
                "",
                runner.InstallSignal,
                runner.InstallSignalPayload{},
            )
            Expect(err).ToNot(HaveOccurred())

            // Wait for expected state
            Eventually(func() string {
                ins, err := runnerService.GetRunnerInstallation(ctx, req)
                if err != nil || ins == nil {
                    return ""
                }
                return ins.Msg.RunnerInstallation.State
            }, 30*time.Second, 1*time.Second).Should(Equal("active"))

            // Verify final state
            var result runner.RunnerInstallWorkflowResult
            err = run.Get(context.Background(), &result)
            Expect(err).ToNot(HaveOccurred())
            Expect(result.Success).To(BeTrue())
        })
    })
})

Testing Patterns & Best Practices

1. Table-Driven Tests

Use table-driven tests to cover multiple scenarios systematically:

tests := []struct {
    name      string
    input     string
    want      string
    wantErr   bool
    errMsg    string
}{
    {
        name:  "valid input",
        input: "test",
        want:  "TEST",
    },
    {
        name:    "empty input",
        input:   "",
        wantErr: true,
        errMsg:  "input cannot be empty",
    },
}

2. Mock Best Practices

  • Mock at interface boundaries
  • Use mock.MatchedBy for complex argument matching
  • Verify mock expectations when needed:
defer svc.AssertExpectations(t)

3. Test Isolation

  • Each test should be independent
  • Use database transactions with rollback
  • Clean up created resources
  • Reset global state between tests

4. Async Testing

Use Eventually for testing async operations:

Eventually(func() bool {
    // Check condition
    return conditionMet
}, timeout, interval).Should(BeTrue())

5. Error Testing

Always test both success and failure paths:

{
    name:      "network error",
    mockError: errors.New("connection refused"),
    expectErr: true,
},
{
    name:      "timeout error",
    mockError: context.DeadlineExceeded,
    expectErr: true,
},

6. Test Naming

Use descriptive test names that explain the scenario:

t.Run("returns error when installation not found", func(t *testing.T) {
    // test
})

7. Tracing in Tests

Verify tracing behavior when applicable:

sr := trace.NewSpanRecorder()
tracer, _ := trace.NewTestTracer(sr)

// After execution
spans := sr.Ended()
assert.Len(t, spans, 1)
assert.Equal(t, "OperationName", spans[0].Name())
assert.Equal(t, codes.Ok, spans[0].Status().Code)

Common Testing Scenarios

Testing Database Operations

func TestDatabaseOperation(t *testing.T) {
    // Use test database
    db := setupTestDatabase(t)
    defer cleanupDatabase(db)

    // Create queries
    queries := runnerdb.New(db)

    // Test operation
    err := queries.CreateRunner(context.Background(), params)
    require.NoError(t, err)

    // Verify
    runner, err := queries.GetRunner(context.Background(), id)
    require.NoError(t, err)
    assert.Equal(t, expectedName, runner.Name)
}

Testing Kubernetes Operations

func TestKubernetesOperation(t *testing.T) {
    // Create fake client
    objects := []runtime.Object{
        &corev1.Namespace{
            ObjectMeta: metav1.ObjectMeta{Name: "test"},
        },
    }
    k8sClient := fake.NewSimpleClientset(objects...)

    // Test operation
    err := createDeployment(k8sClient, namespace, deployment)
    require.NoError(t, err)

    // Verify
    deploy, err := k8sClient.AppsV1().Deployments(namespace).Get(
        context.Background(), name, metav1.GetOptions{})
    require.NoError(t, err)
    assert.Equal(t, expectedReplicas, *deploy.Spec.Replicas)
}

Testing External API Calls

func TestExternalAPI(t *testing.T) {
    // Create mock HTTP server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        assert.Equal(t, "/api/v1/resource", r.URL.Path)
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(expectedResponse)
    }))
    defer server.Close()

    // Test with mock server URL
    client := NewAPIClient(server.URL)
    result, err := client.GetResource(context.Background(), "id")
    require.NoError(t, err)
    assert.Equal(t, expectedResponse, result)
}

Running Tests

Unit Tests

# Run all unit tests
gotest

# Run specific package tests
cd backend && go test ./internal/domain/runner/...

# Run with coverage
cd backend && go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run specific test
cd backend && go test -run TestActivities_GetRunnerInstallation ./...

Integration Tests

# Ensure services are running
dev up

# Run all integration tests
gotest-integration

# Run specific integration test suite
cd backend && ginkgo -v ./internal/domain/runner/worker/

Continuous Integration

Tests should be part of your CI pipeline:

test:
  script:
    - gotest
    - gotest-integration
  coverage: '/coverage: \d+\.\d+%/'

Debugging Tests

Verbose Output

go test -v ./...

Focus on Specific Tests (Ginkgo)

FIt("should focus on this test", func() {
    // This test will run exclusively
})

Debug Logging

logger := log.NewTestLogger(t)
logger.Debug("test state", "value", someValue)

Test Timeouts

func TestLongRunning(t *testing.T) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()

    // Use ctx for operations
}

Summary

Effective testing in Tenki Cloud requires:

  • Clear separation between unit and integration tests
  • Proper use of mocks for isolation
  • Table-driven tests for comprehensive coverage
  • Integration tests for end-to-end validation
  • Consistent patterns across the codebase

Follow these patterns to ensure your code is well-tested, maintainable, and reliable.

Release System

Tenki Cloud uses a custom release system designed for polyglot monorepos, handling both TypeScript/Node.js applications and Go binaries seamlessly.

Overview

The release system automates version management, changelog generation, and artifact building across all components in the monorepo. It provides a developer-friendly workflow similar to Changesets but with full support for Go modules and Docker deployments.

Key Features

  • Polyglot Support: Handles both Node.js packages and Go binaries
  • Shared Go Versioning: All Go binaries use coordinated versions
  • Deployment Awareness: Different strategies for Docker vs binary deployment
  • Automatic PR Management: Creates and updates Release PRs
  • GitHub Integration: Native releases with artifact uploads
  • Developer-Friendly CLI: Interactive changelog creation

Quick Start

Creating a Release

  1. Create a changelog:

    changelog add          # Interactive with fzf (if available)
    changelog add --empty  # Empty changelog for internal changes
    
  2. Commit and push:

    git add .releases/your-changelog.md
    git commit -m "feat: add new feature"
    git push origin main
    
  3. Review the Release PR that gets created automatically

  4. Merge the Release PR to trigger the release

Checking Status

changelog status

Components

The release system manages these components:

Frontend Applications

  • @tenki/app β†’ Docker image: app:vX.Y.Z
  • @tenki/sentinel β†’ Docker image: sentinel:vX.Y.Z

Go Services (Docker)

  • @tenki/engine β†’ Docker image: engine:vX.Y.Z
  • @tenki/github-proxy β†’ Docker image: github-proxy:vX.Y.Z

Go Binaries (Direct Deployment)

  • @tenki/cli β†’ Binary releases: tenki-cli-vX.Y.Z-{os}-{arch}
  • @tenki/node-agent β†’ Binary releases: node-agent-vX.Y.Z-{os}-{arch}
  • @tenki/vm-agent β†’ Binary releases: vm-agent-vX.Y.Z-{os}-{arch}

Changelog Format

Changelog files use YAML frontmatter to specify affected packages and version bump types:

---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": major
---

Add new authentication features

- **Frontend**: Added MFA support with TOTP
- **Engine**: Fixed token refresh race condition
- **CLI**: Breaking change: new login command structure

This release improves security and fixes several authentication issues.

Version Bump Types

  • patch (0.0.X): Bug fixes, small improvements
  • minor (0.X.0): New features, backwards compatible
  • major (X.0.0): Breaking changes

Go Binary Versioning

All Go binaries share the same version from backend/go.mod. When any Go binary is updated, all Go binaries receive the same version bump using the highest bump type among them.

Example: If @tenki/cli needs a patch and @tenki/engine needs a minor, all Go binaries get a minor bump.

Workflow

1. Changelog Detection

  • Trigger: .releases/*.md files pushed to main
  • Action: Parses changelog files, determines version bumps
  • Result: Creates or updates Release PR

2. Release PR

  • Contains: Version bumps for all affected components
  • Updates: Individual CHANGELOG.md files
  • Shows: Artifacts that will be built
  • Cleanup: Deletes temporary changelog files

3. Release Automation

  • Trigger: Release PR merged to main
  • Actions:
    • Creates Git tags
    • Builds Docker images
    • Builds cross-platform binaries
    • Creates GitHub release with artifacts

CLI Commands

Interactive Changelog Creation

changelog add

Guides you through:

  1. Selecting affected packages
  2. Choosing version bump types
  3. Writing changelog content

Status Check

changelog status

Shows:

  • Pending changelog files
  • Existing Release PRs
  • Current component versions

File Structure

.releases/
β”œβ”€β”€ config.json              # Release configuration
└── *.md                     # Temporary changelog files

# Individual changelogs
apps/app/CHANGELOG.md
apps/sentinel/CHANGELOG.md
backend/cmd/engine/CHANGELOG.md
backend/cmd/tenki-cli/CHANGELOG.md
backend/cmd/node-agent/CHANGELOG.md
backend/cmd/github-proxy/CHANGELOG.md
backend/cmd/vm-agent/CHANGELOG.md

GitHub Actions

Changelog Detection

File: .github/workflows/changelog-detection.yml

  • Trigger: Push to main with .releases/*.md changes
  • Action: Processes changelogs and creates Release PR

Release Automation

File: .github/workflows/release.yml

  • Trigger: Release PR merged to main
  • Actions: Creates tags, builds artifacts, publishes releases

Configuration

The system is configured in .releases/config.json:

{
  "packages": {
    "@tenki/app": {
      "path": "apps/app",
      "type": "node",
      "changelog": "apps/app/CHANGELOG.md",
      "version_file": "apps/app/package.json",
      "deployment": "docker"
    },
    "@tenki/engine": {
      "path": "backend/cmd/engine",
      "type": "go-binary",
      "changelog": "backend/cmd/engine/CHANGELOG.md",
      "version_file": "backend/cmd/engine/VERSION",
      "binary_name": "engine",
      "deployment": "docker",
      "docker": {
        "component": "engine",
        "dockerfile": "backend/cmd/engine/Dockerfile",
        "context": "backend"
      }
    }
  },
  "release_branch": "release/next",
  "release_pr_title": "chore(release): version packages [skip ci]",
  "commit_message": "chore(release): version packages [skip ci]"
}

Best Practices

Changelog Writing

  • One changelog per logical change - Don’t combine unrelated features
  • Clear descriptions - Explain what changed and why
  • User-focused content - Write for end users, not developers
  • Appropriate bump types - Follow semantic versioning strictly

Release Management

  • Review Release PRs carefully - Verify versions and changelog entries
  • Test before merging - Ensure all CI checks pass
  • Coordinate deployments - Plan releases during appropriate windows
  • Monitor releases - Watch for issues after deployment

Package Dependencies

  • Shared package changes go in consuming app changelogs
  • No separate changelogs for packages/* directories
  • Document impact where users will see the changes

Troubleshooting

Release PR Not Created

  1. Check changelog format - Ensure YAML frontmatter is correct
  2. Verify file location - Files must be in .releases/ directory
  3. Check GitHub Actions - Review workflow logs for errors

Build Failures

  1. Run tests locally - Ensure all tests pass before merging
  2. Check Docker configs - Verify Dockerfile and build contexts
  3. Validate Go modules - Ensure go.mod is properly formatted

Version Conflicts

  1. Understand versioning - Go binaries share versions, Node.js apps are independent
  2. Check existing versions - Use changelog status to see current state
  3. Review Release PR - Verify calculated versions are correct

Examples

Simple Bug Fix

---
"@tenki/app": patch
---

Fix authentication token refresh issue

- Fixed race condition in token refresh logic
- Improved error handling for expired tokens

New Feature with Breaking Change

---
"@tenki/app": minor
"@tenki/cli": major
"@tenki/engine": minor
---

Add workspace management features

- **App**: New workspace dashboard with team management
- **CLI**: Breaking change: `tenki workspace` command restructured
- **Engine**: Added workspace isolation and resource quotas

Multi-Component Update

---
"@tenki/app": minor
"@tenki/engine": minor
"@tenki/node-agent": patch
---

Improve runner monitoring and management

- **App**: Added real-time runner status dashboard
- **Engine**: Implemented auto-scaling for custom images
- **Node Agent**: Fixed memory leak in status reporting

Release Quick Reference

Quick reference for the Tenki Cloud release system.

Commands

# Create new changelog (interactive with fzf if available)
changelog add

# Create empty changelog for internal changes
changelog add --empty

# Check status
changelog status

# Show help
changelog help

Changelog Format

---
"@tenki/app": minor
"@tenki/engine": patch
---

Brief description of changes

- Detailed change 1
- Detailed change 2

Components

Component            Type      Deployment   Output
@tenki/app           Node.js   Docker       app:vX.Y.Z
@tenki/sentinel      Node.js   Docker       sentinel:vX.Y.Z
@tenki/engine        Go        Docker       engine:vX.Y.Z
@tenki/github-proxy  Go        Docker       github-proxy:vX.Y.Z
@tenki/cli           Go        Binary       tenki-cli-vX.Y.Z-{os}-{arch}
@tenki/node-agent    Go        Binary       node-agent-vX.Y.Z-{os}-{arch}
@tenki/vm-agent      Go        Binary       vm-agent-vX.Y.Z-{os}-{arch}

Version Bump Types

Type    Version Change   Use Case
patch   1.0.0 β†’ 1.0.1    Bug fixes, small improvements
minor   1.0.0 β†’ 1.1.0    New features, backwards compatible
major   1.0.0 β†’ 2.0.0    Breaking changes

Workflow

  1. Create changelog β†’ changelog add
  2. Commit & push β†’ git add .releases/*.md && git commit && git push
  3. Review Release PR β†’ Automatically created
  4. Merge Release PR β†’ Triggers release automation
  5. Artifacts built β†’ Docker images + binaries published

File Locations

.releases/
β”œβ”€β”€ config.json                           # Configuration
└── your-feature.md                       # Temporary changelog

apps/app/CHANGELOG.md                      # App changelog
apps/sentinel/CHANGELOG.md                # Sentinel changelog
backend/cmd/engine/CHANGELOG.md           # Engine changelog
backend/cmd/tenki-cli/CHANGELOG.md        # CLI changelog
backend/cmd/node-agent/CHANGELOG.md       # Node agent changelog
backend/cmd/github-proxy/CHANGELOG.md     # GitHub proxy changelog
backend/cmd/vm-agent/CHANGELOG.md         # VM agent changelog

Common Patterns

Bug Fix

---
"@tenki/app": patch
---

Fix login redirect issue

New Feature

---
"@tenki/app": minor
"@tenki/engine": minor
---

Add workspace management

Breaking Change

---
"@tenki/cli": major
---

Restructure CLI commands

Multi-Component

---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": patch
---

Improve runner monitoring

Troubleshooting

Issue                      Solution
Release PR not created     Check changelog format and GitHub Actions logs
Build failure              Ensure tests pass and Docker configs are correct
Wrong version calculated   Review frontmatter and component dependencies
CLI not working            Run direnv reload to pick up new scripts

Go Binary Versioning

  • All Go binaries share the same version from backend/go.mod
  • Highest bump type among Go components is used for all
  • Example: cli: patch + engine: minor = all Go binaries get minor

Deployment Guide

Last updated: 2025-06-12

Overview

Tenki Cloud uses GitOps with Flux for Kubernetes deployments. All deployments are triggered via Git commits and automatically reconciled by Flux.

Deployment Environments

Environment   Domain                  Branch      Cluster
Development   *.tenki.lab             feature/*   Local
Staging       *.staging.tenki.cloud   staging     tenki-staging
Production    *.tenki.cloud           main        tenki-prod

Deployment Process

1. Local Development β†’ Staging

# 1. Ensure tests pass
pnpm test
gotest

# 2. Build and push images
make docker-build
make docker-push TAG=staging-$(git rev-parse --short HEAD)

# 3. Update staging manifests
cd infra/flux/apps/staging
vim engine-deployment.yaml  # Update image tag
git add .
git commit -m "deploy: engine staging-abc123"
git push

# 4. Monitor deployment
kubectl --context=staging get pods -w
flux logs -f

2. Staging β†’ Production

# 1. Create release PR
gh pr create --base main --title "Release v1.2.3"

# 2. After approval and merge, tag release
git checkout main
git pull
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3

# 3. CI/CD builds and pushes production images

# 4. Update production manifests
cd infra/flux/apps/production
# Update image tags to v1.2.3
git commit -m "deploy: production v1.2.3"
git push

# 5. Monitor rollout
kubectl --context=production rollout status deployment/engine

Service-Specific Deployments

Backend Engine

# Build
cd backend
make build-engine

# Test
make test

# Docker image
docker build -t tenki/engine:$TAG .
docker push tenki/engine:$TAG

# Deploy (or commit the new tag to the Flux manifest and let Flux reconcile)
kubectl set image deployment/engine engine=tenki/engine:$TAG

Frontend App

# Build
cd apps/app
pnpm build

# Docker image
docker build -t tenki/app:$TAG .
docker push tenki/app:$TAG

# Deploy
kubectl set image deployment/app app=tenki/app:$TAG

Database Migrations

# Always run migrations before deploying new code
kubectl exec -it deploy/engine -- /app/migrate up

# Verify migrations
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "\dt"

Rollback Procedures

Quick Rollback (< 5 mins)

# 1. Rollback deployment
kubectl rollout undo deployment/engine

# 2. Verify rollback
kubectl rollout status deployment/engine
kubectl logs -l app=engine --tail=100

# 3. Rollback database if needed
kubectl exec -it deploy/engine -- /app/migrate down

GitOps Rollback

# 1. Revert commit in Git
git revert <commit-hash>
git push

# 2. Flux will automatically sync
flux reconcile source git flux-system

# 3. Monitor
watch flux get kustomizations

Health Checks

Pre-deployment

# Check cluster health
kubectl get nodes
kubectl top nodes

# Check dependencies
kubectl get pods -n default
kubectl get pvc

# Verify secrets
kubectl get secrets

During Deployment

# Watch rollout
kubectl rollout status deployment/engine -w

# Monitor pods
kubectl get pods -l app=engine -w

# Check logs
kubectl logs -f -l app=engine --tail=50

Post-deployment

# Smoke tests
curl https://api.tenki.cloud/health
curl https://app.tenki.cloud

# Check metrics
open https://grafana.tenki.cloud/d/deployment

# Run integration tests
cd backend && gotest-integration

Monitoring Deployments

Grafana Dashboards

Key Metrics to Watch

  • Request rate changes
  • Error rate spikes
  • Response time increases
  • CPU/Memory usage
  • Database connections

Alerts

# Deployment alerts configured in Prometheus
- name: deployment_failed
  expr: kube_deployment_status_replicas_unavailable > 0
  for: 5m

- name: high_error_rate_after_deploy
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m

Blue-Green Deployments

For high-risk changes:

# 1. Deploy to green environment
kubectl apply -f engine-deployment-green.yaml

# 2. Test green environment
curl https://api-green.tenki.cloud/health

# 3. Switch traffic
kubectl patch service engine -p '{"spec":{"selector":{"version":"green"}}}'

# 4. Monitor
watch 'kubectl get pods -l app=engine'

# 5. If issues, switch back
kubectl patch service engine -p '{"spec":{"selector":{"version":"blue"}}}'

Deployment Checklist

Pre-deployment

  • All tests passing
  • Code reviewed and approved
  • Database migrations tested
  • Rollback plan prepared
  • Team notified in Slack

Deployment

  • Images built and pushed
  • Manifests updated
  • Deployment monitored
  • Health checks passing
  • Smoke tests completed

Post-deployment

  • Metrics normal
  • No error spikes
  • Customer reports checked
  • Documentation updated
  • Deployment logged

Troubleshooting

Pod Won’t Start

kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'

Image Pull Errors

# Check secret
kubectl get secret regcred -o yaml

# Re-create if needed
kubectl create secret docker-registry regcred \
  --docker-server=registry.tenki.cloud \
  --docker-username=$USER \
  --docker-password=$PASS

Configuration Issues

# Check ConfigMaps
kubectl get configmap
kubectl describe configmap engine-config

# Check Secrets
kubectl get secrets
kubectl describe secret engine-secrets

CI/CD Pipeline

Our GitHub Actions pipeline:

  1. On PR: Run tests, build images, deploy to preview
  2. On merge to main: Build, tag, push to registry
  3. On tag: Build production images, create release

See .github/workflows/deploy.yml in the repository root for the full pipeline definition.

Monitoring Guide

Overview

This guide covers monitoring and observability practices for Tenki Cloud operations.

Stack

  • Prometheus: Metrics collection
  • Grafana: Visualization and dashboards
  • Loki: Log aggregation
  • Tempo: Distributed tracing
  • Alertmanager: Alert routing

Metrics

Application Metrics

Key metrics to monitor:

  • Request rate and latency
  • Error rates (4xx, 5xx)
  • Database connection pool stats
  • Background job queue depth
  • GitHub API rate limits

Infrastructure Metrics

  • CPU and memory usage
  • Disk I/O and space
  • Network throughput
  • Container health
  • Database performance

Dashboards

Available Dashboards

  1. Application Overview: High-level health metrics
  2. API Performance: Request rates, latencies, errors
  3. Database Health: Connections, query performance
  4. GitHub Integration: Runner stats, API usage
  5. Billing System: Transaction volumes, failures

Creating Dashboards

  1. Use Grafana dashboards as code
  2. Store dashboards in deployments/grafana/dashboards/
  3. Follow naming convention: category-name.json
  4. Include appropriate tags and metadata

Alerts

Alert Rules

Critical alerts:

  • API availability < 99.9%
  • Database CPU > 80%
  • Disk space < 20%
  • Error rate > 5%
  • GitHub API rate limit < 1000

Alert Routing

  1. Critical: PagerDuty (immediate response)
  2. Warning: Slack #alerts channel
  3. Info: Email daily digest

Logs

Log Levels

  • ERROR: Actionable errors requiring investigation
  • WARN: Potential issues, degraded performance
  • INFO: Important business events
  • DEBUG: Detailed troubleshooting information

Structured Logging

Always use structured logging with consistent fields:

  • trace_id: Request correlation ID
  • user_id: User identifier
  • org_id: Organization identifier
  • error: Error message and stack trace

Tracing

Instrumentation

  • Trace all API endpoints
  • Include database queries
  • Add custom spans for business logic
  • Propagate trace context to external services

Sampling

  • 100% sampling for errors
  • 10% sampling for successful requests
  • Adjust based on traffic volume

SLOs and SLIs

Service Level Indicators

  • API latency (p50, p95, p99)
  • Error rate
  • Availability
  • Database query time

Service Level Objectives

  • 99.9% API availability
  • p95 latency < 500ms
  • Error rate < 0.1%
  • Zero data loss

Manual Billing Workflow Execution

This guide provides information for manually executing billing workflows using Temporal CLI or other workflow execution tools.

Prerequisites

  • Access to Temporal cluster
  • Proper authentication and permissions
  • Understanding of workspace IDs and billing periods

Common Parameters

All billing workflows use the following task queue:

  • Task Queue: BILLING_TASK_QUEUE

Workflows

BillingListWorkspaceBalanceWorkflow

Retrieves invoice line items and balance information for a specific workspace and billing period.

Workflow Details:

  • Task Queue: BILLING_TASK_QUEUE
  • Workflow Type: BillingListWorkspaceBalanceWorkflow

Payload Example:

{
  "workspace_id": "123e4567-e89b-12d3-a456-426614174000",
  "billing_period": "2024-01",
  "billing_period_start": "2024-01-01T00:00:00Z",
  "billing_period_end": "2024-01-31T23:59:59.999Z"
}

Parameters:

  • workspace_id (string): UUID of the workspace
  • billing_period (string): Billing period in YYYY-MM format
  • billing_period_start (time): Start of billing period (ISO 8601)
  • billing_period_end (time): End of billing period (ISO 8601)

Expected Result:

{
  "line_items": [
    {
      "description": "Runner Usage - tenki-standard-autoscale",
      "runner_label": "tenki-standard-autoscale",
      "quantity": 120,
      "unit_price": 0.01,
      "amount": 1.2
    }
  ],
  "total_amount": 1.2,
  "timestamp": "2024-01-31T23:59:59Z"
}

BillingCycleScheduleWorkflow

Parent workflow that orchestrates billing cycles for all workspaces. Typically triggered by a scheduled cron job.

Workflow Details:

  • Task Queue: BILLING_TASK_QUEUE
  • Workflow Type: BillingCycleScheduleWorkflow

Payload Example:

{
  "luxor_only": false,
  "exclude_workspaces": ["123e4567-e89b-12d3-a456-426614174000", "987fcdeb-51a2-43d1-b567-123456789abc"]
}

Parameters:

  • luxor_only (boolean, optional): Filter to process only Luxor customers (is_luxor = true)
  • exclude_workspaces (array of UUIDs, optional): List of workspace IDs to exclude from billing cycle

Behavior:

  • Queries all active workspaces with billing accounts
  • Spawns individual BillingCycleWorkflow child workflows for each workspace
  • Handles the current billing period automatically

BillingCycleWorkflow

Individual workspace billing processing workflow. Handles invoice generation, charging, and payment processing for a single workspace.

Workflow Details:

  • Task Queue: BILLING_TASK_QUEUE
  • Workflow Type: BillingCycleWorkflow

Payload Example:

{
  "workspace_id": "123e4567-e89b-12d3-a456-426614174000",
  "billing_period": "2024-01",
  "billing_period_start": "2024-01-01T00:00:00Z",
  "billing_period_end": "2024-01-31T23:59:59.999Z"
}

Parameters:

  • workspace_id (UUID): The workspace to process billing for
  • billing_period (string): Billing period in YYYY-MM format
  • billing_period_start (time): Start of billing period (ISO 8601)
  • billing_period_end (time): End of billing period (ISO 8601)

Workflow Steps:

  1. Generate Stripe invoice with line items
  2. Process invoice and attempt payment
  3. Handle TigerBeetle accounting transfers
  4. Create billing payment records
  5. Process promotional credit adjustments
  6. Reset monthly free credits

BillingPaymentReversalWorkflow

Reverses a payment by creating a reversal transfer in TigerBeetle and updating the payment status to β€˜reversed’. Used for refunds, chargebacks, or administrative corrections.

Workflow Details:

  • Task Queue: BILLING_TASK_QUEUE
  • Workflow Type: BillingPaymentReversalWorkflow

Payload Example:

{
  "payment_id": "123e4567-e89b-12d3-a456-426614174000",
  "workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
  "reason": "Customer requested refund",
  "initiated_by_email": "admin@tenki.cloud"
}

Parameters:

  • payment_id (UUID): The payment ID to reverse
  • workspace_id (UUID): The workspace that owns the payment
  • reason (string): Reason for the reversal (required)
  • initiated_by_email (string): Email of the person initiating the reversal (required)

Expected Result:

{
  "success": true,
  "reversal_transfer_id": "base64-encoded-transfer-id",
  "original_amount": "12.50",
  "reversed_at": "2024-01-31T15:30:00Z"
}

Workflow Steps:

  1. Validate required parameters (payment_id, workspace_id, reason, initiated_by_email)
  2. Lookup payment details from database and TigerBeetle
  3. Create reversal transfer in TigerBeetle using original transfer details
  4. Update payment status to β€˜reversed’ with reversal details

BillingUsageReversalWorkflow

Reverses a usage event by creating a reversal transfer in TigerBeetle and deleting the usage event record. Used for correcting erroneous charges or administrative adjustments to usage records.

Workflow Details:

  • Task Queue: BILLING_TASK_QUEUE
  • Workflow Type: BillingUsageReversalWorkflow

Payload Example:

{
  "usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
  "workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
  "reason": "Incorrect runner charge - job failed",
  "initiated_by_email": "admin@tenki.cloud"
}

Parameters:

  • usage_event_id (UUID): The usage event ID to reverse (required)
  • workspace_id (UUID): The workspace that owns the usage event (required)
  • reason (string): Reason for the reversal (required)
  • initiated_by_email (string): Email of the person initiating the reversal (required)

Expected Result:

{
  "success": true,
  "reversal_transfer_id": "base64-encoded-transfer-id",
  "original_amount": "0.50",
  "reversed_at": "2024-01-31T15:30:00Z"
}

Workflow Steps:

  1. Validate required parameters (usage_event_id, workspace_id, reason, initiated_by_email)
  2. Fetch usage event details from database
  3. Verify workspace ownership matches provided workspace_id
  4. Lookup actual transfer details from TigerBeetle
  5. Create reversal transfer in TigerBeetle using original transfer amounts
  6. Delete the usage event record from database

Important Notes:

  • Unlike payment reversals, usage event reversals permanently delete the record (no audit trail in the usage_events table)
  • The reversal transfer in TigerBeetle maintains the financial audit trail
  • Workspace ID validation ensures the usage event belongs to the specified workspace

Temporal CLI Examples

Execute BillingListWorkspaceBalanceWorkflow

temporal workflow start \
  --task-queue BILLING_TASK_QUEUE \
  --type BillingListWorkspaceBalanceWorkflow \
  --input '{
    "workspace_id": "123e4567-e89b-12d3-a456-426614174000",
    "billing_period": "2024-01",
    "billing_period_start": "2024-01-01T00:00:00Z",
    "billing_period_end": "2024-01-31T23:59:59.999Z"
  }'

Execute BillingCycleScheduleWorkflow

temporal workflow start \
  --task-queue BILLING_TASK_QUEUE \
  --type BillingCycleScheduleWorkflow \
  --input '{
    "luxor_only": false,
    "exclude_workspaces": []
  }'

Execute BillingCycleWorkflow

temporal workflow start \
  --task-queue BILLING_TASK_QUEUE \
  --type BillingCycleWorkflow \
  --input '{
    "workspace_id": "123e4567-e89b-12d3-a456-426614174000",
    "billing_period": "2024-01",
    "billing_period_start": "2024-01-01T00:00:00Z",
    "billing_period_end": "2024-01-31T23:59:59.999Z"
  }'

Execute BillingPaymentReversalWorkflow

temporal workflow start \
  --task-queue BILLING_TASK_QUEUE \
  --type BillingPaymentReversalWorkflow \
  --input '{
    "payment_id": "123e4567-e89b-12d3-a456-426614174000",
    "workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
    "reason": "Customer requested refund",
    "initiated_by_email": "admin@tenki.cloud"
  }'

Execute BillingUsageReversalWorkflow

temporal workflow start \
  --task-queue BILLING_TASK_QUEUE \
  --type BillingUsageReversalWorkflow \
  --input '{
    "usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
    "workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
    "reason": "Incorrect runner charge - job failed",
    "initiated_by_email": "admin@tenki.cloud"
  }'
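
The same workflows can also be started programmatically. A minimal sketch with the Temporal Go SDK client (dialing with defaults targets localhost:7233; the real cluster would need HostPort, Namespace, and TLS options):

package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Payload mirrors the CLI example above.
	input := map[string]any{
		"workspace_id":         "123e4567-e89b-12d3-a456-426614174000",
		"billing_period":       "2024-01",
		"billing_period_start": "2024-01-01T00:00:00Z",
		"billing_period_end":   "2024-01-31T23:59:59.999Z",
	}

	run, err := c.ExecuteWorkflow(context.Background(),
		client.StartWorkflowOptions{TaskQueue: "BILLING_TASK_QUEUE"},
		"BillingListWorkspaceBalanceWorkflow", input)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("started", run.GetID(), run.GetRunID())
}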

Notes

  • All timestamps should be in UTC
  • Workspace IDs must be valid UUIDs
  • Billing periods follow YYYY-MM format
  • The BillingCycleScheduleWorkflow is typically run automatically via Temporal schedules
  • Individual BillingCycleWorkflow executions can be run manually for specific workspaces
  • Use BillingListWorkspaceBalanceWorkflow to preview billing information before processing

Operational Runbooks

This section contains runbooks for common operational scenarios and incident response.

Available Runbooks

Runbook Template

When creating a new runbook, use this template:

# Runbook: [Issue Name]

## Alert Details

- **Alert Name**: `AlertNameInPrometheus`
- **Severity**: P1 | P2 | P3
- **Team**: Backend | Frontend | Platform
- **Last Updated**: YYYY-MM-DD

## Symptoms

- What the user/system experiences
- What metrics are affected
- What alerts fire

## Quick Diagnostics

```bash

# Commands to quickly assess the situation

```

## Resolution Steps

### 1. Immediate Mitigation (X mins)

Steps to stop the bleeding

### 2. Root Cause Analysis (X mins)

How to find what caused the issue

### 3. Fix Implementation

How to fix the underlying problem

### 4. Verification

How to confirm the fix worked

## Prevention

Long-term fixes to prevent recurrence

## Escalation Path

When and who to escalate to

## Related Runbooks

Links to related procedures

Writing Good Runbooks

  1. Be specific - Include exact commands and expected outputs
  2. Time-box steps - Indicate how long each step should take
  3. Include rollback - Always have a way to undo changes
  4. Test regularly - Run through the runbook quarterly
  5. Keep updated - Update after each incident

Incident Response Process

  1. Acknowledge the alert
  2. Assess using quick diagnostics
  3. Mitigate following the runbook
  4. Communicate status updates
  5. Resolve the root cause
  6. Document in incident report

Runbook: High Database CPU

Alert Details

  • Alert Name: HighDatabaseCPU
  • Severity: P2
  • Team: Backend/Platform
  • Last Updated: 2025-06-12

Symptoms

  • Database CPU usage > 80% for 5+ minutes
  • API response times > 500ms
  • Increased error rates in logs
  • Grafana dashboard shows CPU spike

Quick Diagnostics

# 1. Check current connections
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT count(*), state
  FROM pg_stat_activity
  GROUP BY state;"

# 2. Find slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT
    substring(query, 1, 50) as query_start,
    calls,
    mean_exec_time,
    total_exec_time
  FROM pg_stat_statements
  WHERE mean_exec_time > 100
  ORDER BY mean_exec_time DESC
  LIMIT 10;"

# 3. Check for locks
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT
    pid,
    usename,
    pg_blocking_pids(pid) as blocked_by,
    query_start,
    substring(query, 1, 50) as query
  FROM pg_stat_activity
  WHERE pg_blocking_pids(pid)::text != '{}';"

Resolution Steps

1. Immediate Mitigation (5 mins)

# Scale up API to reduce per-instance load
kubectl scale deployment/engine --replicas=10

# Kill long-running queries (>5 minutes)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state != 'idle'
    AND query_start < now() - interval '5 minutes'
    AND query NOT LIKE '%pg_stat_activity%';"

2. Identify Root Cause (10 mins)

Check recent deployments:

kubectl get deployments -o wide | grep engine
kubectl rollout history deployment/engine

Review slow query log:

kubectl logs postgres-0 | grep "duration:" | tail -50

Check for missing indexes:

-- Run on affected tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflow_runs
WHERE status = 'pending'
  AND created_at > NOW() - INTERVAL '1 hour';

3. Fix Implementation

If missing index:

-- Create index (be careful on large tables)
CREATE INDEX CONCURRENTLY idx_workflow_runs_status_created
ON workflow_runs(status, created_at)
WHERE status IN ('pending', 'running');

If bad query from recent deploy:

# Rollback to previous version
kubectl rollout undo deployment/engine

# Or deploy hotfix
git checkout main
git pull
# Fix query
git commit -am "fix: optimize workflow query"
git push
# Deploy via CI/CD

4. Verify Resolution

# Monitor CPU (should drop within 5 mins)
watch -n 5 "kubectl top pod postgres-0"

# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.tenki.lab/health

# Verify no more slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '1 minute';"

Long-term Prevention

  1. Add query timeout to engine configuration (see the sketch after this list)
  2. Set up query monitoring in Datadog/NewRelic
  3. Regular ANALYZE on high-traffic tables
  4. Consider read replicas for analytics queries
  5. Implement connection pooling with PgBouncer
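
A minimal sketch of a per-query timeout using database/sql contexts; the DSN, driver choice, and query are placeholders, not the engine's actual configuration:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // driver choice is an assumption
)

func main() {
	// Placeholder DSN; the engine reads its connection string from configuration.
	db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/tenki")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the query so a slow plan cannot hold a connection indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	var pending int
	err = db.QueryRowContext(ctx,
		`SELECT count(*) FROM workflow_runs WHERE status = 'pending'`).Scan(&pending)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pending workflow runs:", pending)
}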

Escalation Path

  1. 15 mins: If CPU still high β†’ Page backend on-call
  2. 30 mins: If impacting customers β†’ Incident Commander
  3. 45 mins: If data corruption risk β†’ CTO

Post-Incident

  • Create incident report
  • Add missing monitoring
  • Update this runbook with findings
  • Schedule postmortem if customer impact

Runbook: High API Latency

Overview

This runbook covers troubleshooting and resolving high API latency issues.

Symptoms

  • p95 latency > 500ms
  • User reports of slow loading
  • Timeout errors in client applications
  • Increased error rates due to timeouts

Impact

  • Poor user experience
  • Increased error rates
  • Potential cascading failures
  • Customer complaints

Detection

  • Alert: APILatencyHigh
  • Threshold: p95 > 500ms for 5 minutes
  • Dashboard: API Performance

Response

Immediate Actions

  1. Check current latency

    • View p50, p95, p99 latencies
    • Identify affected endpoints
    • Check error rates
  2. Verify system health

    # Check pod status
    kubectl get pods -n production
    
    # Check resource usage
    kubectl top pods -n production
    
    # Check recent deployments
    kubectl rollout history deployment/api -n production
    
  3. Enable detailed logging (temporarily)

    kubectl set env deployment/api LOG_LEVEL=debug -n production
    

Diagnosis

  1. Database performance

    • Check slow query log
    • Review connection pool status
    • Look for lock contention
  2. External dependencies

    • GitHub API response times
    • Payment processor latency
    • CDN performance
  3. Application issues

    • Memory leaks (increasing memory usage)
    • CPU bottlenecks
    • Inefficient algorithms

Common Causes and Fixes

1. Database Queries

Symptom: High database CPU, slow queries

Fix:

-- Find slow queries
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_table_column ON table(column);

2. Cache Misses

Symptom: High cache miss rate

Fix:

  • Warm up caches after deployment
  • Increase cache TTL for stable data
  • Review cache key generation

3. Resource Constraints

Symptom: High CPU/memory usage

Fix:

# Scale horizontally
kubectl scale deployment api --replicas=6 -n production

# Or scale vertically (requires restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production

4. Inefficient Code

Symptom: Specific endpoints consistently slow

Fix:

  • Profile the endpoint
  • Optimize algorithms
  • Implement pagination
  • Add caching layer

Recovery

  1. Quick wins

    • Increase cache TTLs
    • Scale out services
    • Enable read replicas
  2. Rollback if needed

    kubectl rollout undo deployment/api -n production
    
  3. Communicate status

    • Update status page
    • Notify affected customers
    • Post in #incidents channel

Prevention

  • Load testing before major releases
  • Gradual rollouts with canary deployments
  • Query performance regression tests
  • Capacity planning reviews

Monitoring

Key metrics to watch:

  • API latency percentiles
  • Database query time
  • Cache hit rates
  • Resource utilization
  • Error rates

Runbook: High Database Connections

Overview

This runbook describes how to handle situations where database connection pool is exhausted or nearing limits.

Symptoms

  • Application errors: β€œtoo many connections”
  • Slow API responses
  • Connection pool metrics showing high usage
  • Database showing max_connections limit reached

Impact

  • API requests fail
  • Background jobs unable to process
  • Users experience errors and timeouts

Detection

  • Alert: DatabaseConnectionsHigh
  • Threshold: > 80% of max_connections
  • Dashboard: Database Health

Response

Immediate Actions

  1. Check current connections

    SELECT count(*) FROM pg_stat_activity;
    SELECT usename, application_name, count(*)
    FROM pg_stat_activity
    GROUP BY usename, application_name
    ORDER BY count DESC;
    
  2. Identify idle connections

    SELECT pid, usename, application_name, state, state_change
    FROM pg_stat_activity
    WHERE state = 'idle'
    AND state_change < NOW() - INTERVAL '10 minutes';
    
  3. Kill long-idle connections (if safe)

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle'
    AND state_change < NOW() - INTERVAL '30 minutes';
    

Root Cause Analysis

  1. Check for connection leaks

    • Review recent deployments
    • Check for missing defer db.Close()
    • Look for transactions not being committed/rolled back
  2. Review pool configuration

    • Current settings in environment
    • Calculate optimal pool size
    • Check for misconfigured services
  3. Analyze traffic patterns

    • Sudden spike in requests
    • New feature causing more queries
    • Background job issues

Long-term Fixes

  1. Optimize connection pool settings

    // Keep total open connections well below Postgres max_connections.
    db.SetMaxOpenConns(25)
    db.SetMaxIdleConns(10)
    // Recycle connections periodically to avoid stale or long-lived sessions.
    db.SetConnMaxLifetime(5 * time.Minute)
    
  2. Implement connection pooler

    • Consider PgBouncer for connection multiplexing
    • Configure pool modes appropriately
  3. Code improvements

    • Use prepared statements
    • Batch queries where possible
    • Implement query result caching

Prevention

  • Monitor connection pool metrics
  • Load test with realistic concurrency
  • Regular code reviews for database usage
  • Implement circuit breakers

Runbook: Database Failover

Overview

This runbook covers the process of failing over to a standby database in case of primary database failure.

Symptoms

  • Primary database unreachable
  • Replication lag increasing indefinitely
  • Database corruption detected
  • Catastrophic hardware failure

Impact

  • Complete service outage
  • Data writes blocked
  • Potential data loss (depending on replication lag)

Detection

  • Alert: DatabasePrimaryDown
  • Alert: DatabaseReplicationLagHigh
  • Dashboard: Database Health

Pre-failover Checks

1. Verify Primary is Down

# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432

# Check from multiple locations
for host in api-1 api-2 worker-1; do
  ssh $host "pg_isready -h primary.db.tenki.cloud"
done

2. Check Replication Status

-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;

3. Assess Data Loss Risk

  • Note the last transaction timestamp
  • Document replication lag
  • Make go/no-go decision based on business impact

Failover Process

1. Stop All Application Traffic

# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production

# Verify no active connections
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';

2. Promote Standby

# On standby server
pg_ctl promote -D /var/lib/postgresql/data

# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby

3. Update Connection Strings

# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"

# Or update environment variables
kubectl set env deployment/api deployment/worker \
  DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
  -n production

4. Verify New Primary

-- Check if accepting writes
SELECT pg_is_in_recovery();  -- Should return false

-- Test write
INSERT INTO health_check (timestamp) VALUES (now());

5. Resume Application Traffic

# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production

# Monitor for errors
kubectl logs -f deployment/api -n production

Post-Failover Tasks

1. Immediate

  • Monitor application health
  • Check for data inconsistencies
  • Communicate status to stakeholders

2. Within 1 Hour

  • Set up new standby from old primary (if recoverable)
  • Update monitoring to reflect new topology
  • Document timeline and impact

3. Within 24 Hours

  • Root cause analysis
  • Update disaster recovery procedures
  • Test backup restoration process

Rollback Procedure

If failover was premature or primary recovers:

  1. Stop applications again
  2. Ensure data consistency
    • Compare transaction IDs
    • Check for split-brain scenarios
  3. Resync if needed
    pg_rewind --target-pgdata=/var/lib/postgresql/data \
              --source-server="host=primary.db.tenki.cloud"
    
  4. Switch back to primary
  5. Resume traffic

Prevention

  • Regular failover drills
  • Monitor replication lag closely
  • Implement automatic failover with proper fencing
  • Use synchronous replication for critical data

Runbook: Playwright Scenario Failed

Test Failure Due to Multiple Matching Elements with Similar Text

Alert Details

  • Alert Name: Tenki Production - App Can Login
  • Severity: P2
  • Team: Frontend
  • Last Updated: 2025-09-08

Symptoms

  • should allow entering email and password

Quick Diagnostics

kubectx tenki-prod-apps

Resolution Steps

1. Immediate Mitigation (5-10 mins)

  • checked staging and production if I can successfully login - seems to be working on my end upon testing
  • ran kubectx tenki-prod-apps and ran logs from a namespace - everything is in Running status

2. Root Cause Analysis (10 mins)

  • The test failed due to a strict mode violation in Playwright.
  • locator detected multiple Projects in the code, and it didn’t know which one to interact with.
  • Playwright expects a single unique element when using .toBeVisible() in strict mode.

3. Fix Implementation / Possible Resolution

  • Add a unique internal ID to the correct element so the test can reliably target it without confusion from similar elements.
  • Update the test to match exact text to avoid picking up similar elements.

4. Verification

  • Successful test when i ran the scenario in Monitors

Prevention

  • ensure a proper unique id for dynamic or conditionally rendered UI elements

Not Applicable

On-call Guide

Last updated: 2025-06-30

Qualification

  1. Watch the initial onboarding video.
  2. Refer to this Notion document.
  3. Duplicate this sample and use your name as the title.

Completing these steps is required to ensure you are properly qualified and aware of your on-call responsibilities.

Product Requirement Documents (PRDs)

This directory contains PRDs for major features and initiatives. Each PRD captures the why, what, and success criteria for a feature.

PRD Template

# PRD-XXX: Feature Name

**Author**: Name
**Date**: YYYY-MM-DD
**Status**: Draft | In Review | Approved | In Development | Launched

## Summary

One paragraph overview of what we're building and why.

## Problem Statement

What problem are we solving? Who experiences this problem? Why does it matter?

## Goals & Success Metrics

- **Primary Goal**: What we must achieve
- **Success Metrics**:
  - Metric 1: Target value
  - Metric 2: Target value

## User Stories

1. As a [user type], I want to [action] so that [benefit]
2. As a [user type], I want to [action] so that [benefit]

## Requirements

### Must Have (MVP)

- [ ] Requirement 1
- [ ] Requirement 2

### Should Have

- [ ] Requirement 3
- [ ] Requirement 4

### Nice to Have

- [ ] Requirement 5

## Technical Approach

High-level technical approach. Details go in technical design docs.

## Risks & Mitigations

| Risk   | Impact | Likelihood | Mitigation          |
| ------ | ------ | ---------- | ------------------- |
| Risk 1 | High   | Medium     | How we'll handle it |

## Timeline

- Week 1-2: Design and planning
- Week 3-4: Implementation
- Week 5: Testing and rollout

## Open Questions

- [ ] Question 1
- [ ] Question 2

Current PRDs

Writing a Good PRD

Do’s

  • Start with the problem, not the solution
  • Include measurable success criteria
  • Keep it concise (2-3 pages max)
  • Focus on the β€œwhat” and β€œwhy”, not β€œhow”
  • Include user stories

Don’ts

  • Don’t include implementation details
  • Don’t skip the problem statement
  • Don’t forget about edge cases
  • Don’t ignore risks

PRD Process

  1. Draft - PM creates initial PRD
  2. Review - Engineering, Design, and stakeholders review
  3. Approval - Leadership approves
  4. Development - Engineering implements
  5. Launch - Feature released
  6. Retrospective - Measure against success criteria

PRD-001: GitHub Integration

Author: Product Team
Date: 2024-01-20
Status: Launched

Summary

Enable customers to connect their GitHub organizations to Tenki Cloud and automatically provision runners for their repositories without any configuration or infrastructure management.

Problem Statement

Development teams waste significant time and money managing GitHub Actions infrastructure:

  • Setting up self-hosted runners requires DevOps expertise
  • Maintaining runner infrastructure distracts from product development
  • GitHub’s hosted runners are expensive and have limited customization
  • Scaling runners up/down based on demand is complex

Who experiences this: Engineering teams using GitHub Actions for CI/CD
Impact: Teams spend 10-20 hours/month on runner management instead of shipping features

Goals & Success Metrics

Primary Goal: Zero-config GitHub Actions runners that just work

Success Metrics:

  • Time to first runner: < 5 minutes from signup
  • Runner startup time: < 30 seconds
  • Platform uptime: 99.9%
  • Customer runner cost: 50% less than GitHub hosted
  • Monthly active organizations: 100 by Q2

User Stories

  1. As a developer, I want to connect my GitHub org so that runners are automatically available for all my repos
  2. As a team lead, I want to set spending limits so that we don’t exceed our CI/CD budget
  3. As a DevOps engineer, I want to customize runner specs so that our builds run efficiently
  4. As a finance manager, I want to see detailed usage reports so that I can allocate costs to teams

Requirements

Must Have (MVP)

  • GitHub App for OAuth authentication
  • Automatic runner provisioning for workflow_job events
  • Support for Linux runners (Ubuntu 22.04)
  • Basic usage dashboard showing minutes used
  • Automatic runner cleanup after job completion
  • Support for public and private repositories

Should Have

  • Multiple runner sizes (2-16 vCPU)
  • Usage alerts and spending limits
  • Windows and macOS runners
  • Runner caching between jobs
  • Team-based access controls

Nice to Have

  • Custom runner images
  • Dedicated runner pools
  • GitHub Enterprise Server support
  • API for programmatic management

Technical Approach

  1. GitHub App handles authentication and webhook events
  2. Webhook handler processes workflow_job events
  3. Temporal workflows orchestrate runner lifecycle
  4. Kubernetes operators manage runner pods
  5. Usage tracking via TigerBeetle for accurate billing

Risks & Mitigations

| Risk                      | Impact | Likelihood | Mitigation                                |
| ------------------------- | ------ | ---------- | ----------------------------------------- |
| GitHub API rate limits    | High   | Medium     | Implement caching and exponential backoff |
| Runner startup time > 30s | High   | Medium     | Pre-warm runner pools, optimize images    |
| Security vulnerabilities  | High   | Low        | Regular security audits, isolated runners |
| Cost overruns             | Medium | Medium     | Real-time usage tracking and limits       |

Timeline

  • Week 1-2: GitHub App development and authentication
  • Week 3-4: Webhook handling and runner provisioning
  • Week 5-6: Usage tracking and billing integration
  • Week 7: Beta testing with friendly customers
  • Week 8: Public launch

Open Questions

  • Should we support GitHub Enterprise? β†’ Not in MVP
  • How do we handle runner caching? β†’ Post-MVP feature
  • What’s our runner retention policy? β†’ 7 days for logs
  • How do we handle abuse/crypto mining? β†’ Usage anomaly detection

Post-Launch Results

Launched: 2025-04-15

Actual Metrics (as of 2024-06-01):

  • Time to first runner:
  • Runner startup time:
  • Platform uptime:
  • Cost savings:
  • Monthly active orgs:

Key Learnings:

  1. Pre-warming runner pools was critical for startup time
  2. Customers want custom images more than expected
  3. Windows runner demand higher than anticipated

Product Roadmap

Overview

This document outlines the product roadmap for Tenki Cloud, organized by quarters and strategic themes.

Q1 2025

Core Platform

  • βœ… GitHub integration MVP
  • βœ… Basic runner management
  • βœ… Usage tracking and billing
  • 🚧 Self-service onboarding
  • 🚧 Team management

Developer Experience

  • βœ… CLI tool
  • 🚧 VS Code extension
  • πŸ“‹ IntelliJ plugin

Q2 2025

Scale and Performance

  • πŸ“‹ Multi-region support
  • πŸ“‹ Runner auto-scaling
  • πŸ“‹ Performance optimizations
  • πŸ“‹ Caching improvements

Enterprise Features

  • πŸ“‹ SSO integration
  • πŸ“‹ Advanced access controls
  • πŸ“‹ Audit logging
  • πŸ“‹ Compliance certifications

Q3 2025

Ecosystem Integration

  • πŸ“‹ GitLab support
  • πŸ“‹ Bitbucket support
  • πŸ“‹ Jenkins integration
  • πŸ“‹ Kubernetes operators

Advanced Features

  • πŸ“‹ Custom runner images
  • πŸ“‹ GPU runner support
  • πŸ“‹ Spot instance integration
  • πŸ“‹ Advanced scheduling

Q4 2025

Platform Maturity

  • πŸ“‹ White-label solution
  • πŸ“‹ Marketplace integrations
  • πŸ“‹ Partner ecosystem
  • πŸ“‹ Advanced analytics

Legend

  • βœ… Completed
  • 🚧 In Progress
  • πŸ“‹ Planned

Feature Requests

Track feature requests in our GitHub Issues.

Feedback

We welcome feedback on our roadmap. Please reach out through:

  • GitHub Discussions
  • Support channels
  • Customer success team

Product Metrics

Overview

This document defines the key metrics we track to measure product success and guide decision-making.

North Star Metrics

Primary Metric: Weekly Active Builds

  • Definition: Unique organizations with at least one successful build in the past 7 days
  • Target: 20% month-over-month growth
  • Current: [Dashboard Link]

Product Metrics

Activation

  • Time to First Build: Time from signup to first successful build
    • Target: < 10 minutes
    • Measured from: Account creation to first build completion
  • Activation Rate: % of signups that complete first build within 7 days
    • Target: > 80%
    • Segmented by: Source, plan type

Engagement

  • Build Frequency: Average builds per organization per week

    • Target: > 50 builds/week for active orgs
    • Segmented by: Organization size, industry
  • Runner Utilization: % of time runners are actively building

    • Target: > 70% during business hours
    • Measured: CPU time / available time

Retention

  • 30-Day Retention: % of orgs active after 30 days

    • Target: > 85%
    • Cohorted by: Signup month
  • 90-Day Retention: % of orgs active after 90 days

    • Target: > 75%
    • Leading indicator: Build frequency in first week

Revenue

  • MRR Growth: Month-over-month recurring revenue growth

    • Target: 15% MoM
    • Segmented by: Plan type, acquisition channel
  • Net Revenue Retention: Revenue from existing customers

    • Target: > 120%
    • Includes: Upgrades, downgrades, churn

Operational Metrics

Performance

  • Build Success Rate: % of builds completing successfully

    • Target: > 99%
    • Excluding: User errors
  • API Latency: p95 response time

    • Target: < 200ms
    • Measured: All API endpoints

Quality

  • Customer Satisfaction (CSAT): Post-interaction survey

    • Target: > 4.5/5
    • Measured: Support interactions
  • Net Promoter Score (NPS): Quarterly survey

    • Target: > 50
    • Segmented by: Customer segment

Leading Indicators

Feature Adoption

  • CLI usage rate
  • API integration rate
  • Advanced features usage

Customer Health

  • Support ticket volume
  • Feature request patterns
  • Churn risk scores

Data Collection

Tools

  • Amplitude: Product analytics
  • Segment: Event tracking
  • Metabase: Business intelligence
  • Custom dashboards: Real-time metrics

Privacy

  • All metrics are aggregated
  • No PII in analytics
  • GDPR compliant tracking
  • User consent required

Reporting

Weekly

  • North star metric update
  • Key metric changes
  • Anomaly alerts

Monthly

  • Full metrics review
  • Cohort analysis
  • Revenue metrics
  • OKR progress

Quarterly

  • Strategic metric review
  • NPS survey results
  • Market comparison
  • Forecast updates

πŸ§ͺ Testing Plan: Tenki GitHub Runners Evaluation


πŸ“… Thursday 6/26 β€” Phase 1 & Phase 2: Staging & Controlled Evaluation

Phase 1: Staging Load Test

Objective: Validate stability and responsiveness of the new VM-based runners under parallel job load.

Setup:
Trigger ~50 GitHub Actions jobs in parallel using the gh-runner-test repository.

Definition of Done (DoD):

  • Jobs are picked up within 30 seconds.
  • Job duration is within +5% of baseline execution time from existing Docker-based runners.

Phase 2: Tenki Test Suite Evaluation

Condition: Executed only if Phase 1 is successful.

Objective: Assess runner performance using real-world workflows from the test suite.

Setup:

  • Switch the GitHub Actions workspace used by LuxorLabs/tenki-tests to the new VM-based runners.
  • Monitor CI jobs for performance and reliability.

Definition of Done (DoD):

  • End-to-end performance delta is < 5% compared to current production metrics.

πŸ“… Friday β€” Phase 3: β€œPre-Production” Migration

Phase 3: Luxor Workflow Migration

Precondition: All DoDs from Phase 1 and Phase 2 must be fully met.

Objective: Transition production workloads to the new runners, based on successful Thursday validation.

Setup:

  • Migrate all GitHub workflows under the Luxor Tenki Workspace to the new VM-based runners.

Definition of Done (DoD):

  • All jobs are successful.
  • Performance delta is < 5% compared to current production metrics.

Documentation Roadmap

This roadmap tracks documentation that needs to be written for Tenki Cloud. Items are prioritized based on impact and frequency of use.

🚨 Priority 1: Critical Gaps

These affect daily development and operations:

  • Environment Variables Reference - Complete list of all env vars
  • API Reference - tRPC endpoints and Connect/gRPC services
  • GitHub App Setup Guide - Step-by-step installation
  • Secrets Management Guide - SOPS usage and key rotation
  • Troubleshooting Guide - Common issues and solutions

πŸ”§ Priority 2: Configuration & Setup

Essential for proper deployment and configuration:

  • Service Configuration Guide - engine.yaml and other configs
  • Authentication Setup - Kratos and Keto configuration
  • Notification Service Guide - Email and webhook setup
  • Database Guide - Schema, migrations, and optimization
  • CLI Tool Documentation - tenki-cli command reference

πŸ“Š Priority 3: Operational Excellence

For production operations and monitoring:

  • Monitoring & Observability - Metrics, logs, and tracing
  • Backup & Restore Procedures - Database and state backup
  • Scaling Guidelines - When and how to scale services
  • Security Best Practices - Hardening and compliance
  • Audit Logging Guide - Event tracking and retention

πŸš€ Priority 4: Advanced Features

For power users and advanced scenarios:

  • Custom Runner Images - Building and managing
  • Temporal Workflows Guide - Patterns and testing
  • TigerBeetle Integration - Ledger design and reconciliation
  • Multi-region Setup - Geographic distribution
  • Performance Tuning - Optimization techniques

πŸ“ Contributing

To add documentation:

  1. Pick an item from this roadmap
  2. Create the documentation in the appropriate section
  3. Update SUMMARY.md to include your new page
  4. Remove the item from this roadmap
  5. Submit a PR

Progress Tracking

  • Total items: 24
  • Completed: 0
  • In Progress: 0
  • Remaining: 24

Last updated: 2025-06-12