Tenki Cloud Documentation
Welcome to Tenki Cloud's documentation. This is your starting point for understanding the system architecture, development practices, and operational procedures.
Note: This documentation is built with mdBook. Run `pnpm docs:dev` to view it locally.
Quick Links
- New to Tenki? Start with Getting Started
- Architecture Overview - System Architecture
- API Reference - Backend API Guide
- Deployment - Deployment Guide
Documentation Organization
Architecture
System design, technical decisions, and architectural diagrams.
Development
Everything you need to start developing on Tenki Cloud.
- Getting Started - Set up your dev environment
- Backend Development - Go services and APIs
- Frontend Development - React/Next.js apps
- Database Guide - Schema and migrations
Operations
Deployment, monitoring, and incident response.
- Deployment Guide
- Monitoring
- Runbooks - Operational procedures
Product
Product vision, roadmap, and requirements.
Contributing to Documentation
When to Add Documentation
- Architecture changes → Add an ADR
- New features → Add a PRD
- Operational issues → Add a runbook
- API changes → Update the relevant guide
Documentation Standards
- Keep it concise - Get to the point quickly
- Use examples - Show, don't just tell
- Date your docs - Add "Last updated: YYYY-MM-DD" to guides
- Test your instructions - Make sure they actually work
Quick Doc Updates
# Install mdBook (first time only)
./docs/install-mdbook.sh
# Edit documentation
vim docs/src/development/getting-started.md
# Preview locally with hot reload
pnpm docs:dev
# Build static site
pnpm docs:build
# Submit changes
git add docs/
git commit -m "docs: update getting started guide"
Finding Information
By Role
Backend Engineer
Frontend Engineer
DevOps/SRE
Product Manager
By Task
"I need to…"
- Set up my development environment → Getting Started
- Understand the system design → Architecture Overview
- Deploy to production → Deployment Guide
- Debug an issue → Runbooks
- Plan a new feature → PRD Template
Maintenance
This documentation is maintained by the engineering team. Each team member is responsible for keeping their area of expertise documented.
- Backend team owns: Backend guide, database docs, API patterns
- Frontend team owns: Frontend guide, component docs
- DevOps team owns: Deployment, monitoring, runbooks
- Product team owns: Roadmap, PRDs, metrics
Last updated: 2025-06-12
Tenki Cloud System Architecture
Last updated: 2025-06-12
Overview
Tenki Cloud is a cloud compute marketplace that provides GitHub Actions runner management as a service. The system is built as a distributed microservices architecture with clear separation of concerns.
High-Level Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   GitHub.com    │────▶│   GitHub Proxy   │────▶│    Temporal     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
┌─────────────────┐     ┌──────────────────┐              ▼
│   Next.js App   │────▶│   tRPC Gateway   │     ┌─────────────────┐
└─────────────────┘     └──────────────────┘     │  Backend Engine │
                                │                └─────────────────┘
                                ▼                         │
                        ┌──────────────────┐              ▼
                        │   Backend API    │     ┌─────────────────┐
                        │  (Connect RPC)   │     │   PostgreSQL    │
                        └──────────────────┘     └─────────────────┘
Core Components
Frontend Layer
Next.js Application (apps/app/)
- Server-side rendered React application
- TypeScript with tRPC for type-safe API calls
- Tailwind CSS with Radix UI components
- Authentication via Kratos sessions
API Gateway Layer
tRPC Router (apps/app/src/server/api/)
- Type-safe RPC layer between frontend and backend
- Handles session management and authentication
- Routes requests to appropriate backend services
Backend Services
Engine (backend/cmd/engine/)
- Main orchestrator for all backend operations
- Implements Connect RPC (gRPC-Web compatible)
- Manages service lifecycle and dependencies
Domain Services (backend/internal/domain/)
- Identity: User authentication (Kratos) and authorization (Keto)
- Workspace: Multi-tenant workspace and project management
- Runner: GitHub Actions runner lifecycle management
- Billing: Usage tracking, TigerBeetle ledger, Stripe integration
- Compute: VM provisioning via CloudStack/Kubernetes
Event Processing
GitHub Proxy (backend/cmd/github-proxy/)
- Receives GitHub webhooks
- Validates and transforms events
- Publishes to Temporal for processing
Temporal Workflows
- Long-running business processes
- Runner provisioning workflows
- Billing cycle management
- Retry and failure handling
Data Layer
PostgreSQL
- Primary data store
- Managed via migrations (`backend/schema/`)
- Type-safe queries via sqlc
Redpanda
- Event streaming platform
- Audit log collection
- Inter-service communication
TigerBeetle
- Financial ledger for billing
- Double-entry bookkeeping
- High-performance transaction processing
Key Design Decisions
1. Monorepo Structure
See ADR-001
2. Temporal for Workflows
See ADR-002
3. Connect RPC over REST
See ADR-003
Security Architecture
Authentication Flow
User → Next.js → Kratos → Session Cookie → tRPC → Backend
Authorization Model
- Keto for fine-grained permissions
- Workspace-based multi-tenancy
- Project-level access control
Secrets Management
- SOPS for encrypted configuration
- Kubernetes secrets for runtime
- No secrets in environment variables
Deployment Architecture
Kubernetes Deployment
- GitOps via Flux
- Horizontal pod autoscaling
- Service mesh for inter-service communication
Infrastructure Components
- Ingress: Traefik with automatic TLS
- Monitoring: Prometheus + Grafana
- Logging: Loki + Grafana
- Tracing: Tempo
Data Flow Examples
Runner Provisioning
- GitHub sends webhook to proxy
- Proxy validates and publishes to Kafka
- Backend consumes event, starts Temporal workflow
- Workflow provisions runner in Kubernetes
- Runner registers with GitHub
- Status updates flow back via Temporal
Billing Flow
- Runner usage tracked via Temporal activities
- Usage events written to TigerBeetle
- Daily aggregation job calculates costs
- Monthly billing workflow generates invoices
- Stripe processes payments
- Payment status updates ledger
Scalability Considerations
Horizontal Scaling
- Stateless services scale via replicas
- Database uses read replicas for queries
- Temporal workers scale independently
Performance Optimization
- Redis for session caching
- CDN for static assets
- Database query optimization via indexes
Reliability
- Circuit breakers for external services
- Retry logic in Temporal workflows
- Graceful degradation for non-critical features
Future Architecture Plans
- Multi-region deployment for global latency optimization
- GraphQL federation for more flexible API access
- Event sourcing for complete audit trail
- Service mesh for advanced traffic management
Related Documentation
GitHub Runners Architecture
This document provides a comprehensive overview of Tenki Cloud's GitHub Actions runner system, detailing how we manage self-hosted runners at scale.
Overview
Tenki Cloud provides a managed GitHub Actions runner platform that allows users to run their CI/CD workflows on dedicated, scalable infrastructure. The system integrates deeply with GitHub through a GitHub App, orchestrates runner lifecycle through Temporal workflows, and manages the underlying Kubernetes infrastructure.
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     GitHub      │────▶│   GitHub Proxy   │────▶│    Temporal     │
│    Webhooks     │     │    (Node.js)     │     │    Workflows    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Kubernetes    │◀────│  Runner Service  │◀────│    Database     │
│    (Runners)    │     │       (Go)       │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Core Components
1. GitHub Proxy
The GitHub proxy serves as the entry point for all GitHub webhook events. Built with Node.js and Probot, it:
- Receives webhook events from GitHub (installation, workflow_job, workflow_run, push)
- Validates webhook signatures for security
- Forwards events to Temporal workflows for processing
- Preserves GitHub headers for workflow_job events
Key event handlers:
- `installation.created/deleted`: Manages GitHub App installations
- `workflow_job`: Processes individual CI/CD job events
- `workflow_run`: Tracks overall workflow execution
- `push`: Monitors changes to workflow files
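Though the proxy itself is Node.js, the signature check it performs can be sketched in Go with stdlib HMAC. GitHub sends an `X-Hub-Signature-256` header of the form `sha256=<hex digest>`; function names here are illustrative, not the proxy's actual API:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signPayload computes the value GitHub sends in the
// X-Hub-Signature-256 header: "sha256=" + hex(HMAC-SHA256(secret, body)).
func signPayload(secret, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature compares the received header against the expected
// signature in constant time, which avoids timing side channels.
func verifySignature(secret, payload []byte, header string) bool {
	expected := signPayload(secret, payload)
	return hmac.Equal([]byte(expected), []byte(header))
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"action":"queued"}`)
	header := signPayload(secret, body)
	fmt.Println(verifySignature(secret, body, header))               // true
	fmt.Println(verifySignature(secret, []byte("tampered"), header)) // false
}
```

Events that fail this check are dropped before they ever reach Temporal.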
2. Runner Service
The runner service is the core business logic layer, implemented in Go with Connect RPC:
- Manages runner lifecycle: Creation, deletion, suspension
- Handles GitHub integration: Repository synchronization, workflow analysis
- Controls Kubernetes resources: Deployments, autoscalers, secrets
- Tracks usage and billing: Job metrics, duration, failures
Key operations:
- `InstallRunners`: Initialize a new GitHub App installation
- `CreateRunner`: Provision custom runner configurations
- `GetRunnerMetrics`: Performance analytics (p50/p90, failure rates)
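The p50/p90 figures can be computed with a nearest-rank percentile over job durations; this is a self-contained sketch, not the service's actual implementation:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of job durations
// (in seconds): the value at rank ceil(p/100 * n) of the sorted slice.
func percentile(durations []float64, p float64) float64 {
	if len(durations) == 0 {
		return 0
	}
	sorted := append([]float64(nil), durations...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted)))) // 1-indexed
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	d := []float64{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
	fmt.Println(percentile(d, 50)) // 50
	fmt.Println(percentile(d, 90)) // 90
}
```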
3. Temporal Workflows
Temporal provides durable workflow orchestration for long-running operations:
Primary Workflows
Runner Installation Workflow
- Long-running workflow per GitHub installation
- Responds to signals: Install, Uninstall, Suspend, AddRepositories
- Manages entire runner lifecycle
- Handles failure recovery and retries
GitHub Job Workflow
- Processes each GitHub Actions job
- Tracks state transitions (queued → in_progress → completed)
- Creates billing events for usage tracking
- Forwards requests to Actions Runner Controller
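The transitions the job workflow tracks can be modeled as a small table. The cancel-while-queued edge (`queued → completed`) is an assumption for illustration, not documented behavior:

```go
package main

import "fmt"

// validNext maps a GitHub Actions job state to the states it may
// legally move to; "completed" is terminal.
var validNext = map[string][]string{
	"queued":      {"in_progress", "completed"}, // completed covers cancellation while queued (assumed)
	"in_progress": {"completed"},
	"completed":   {},
}

func canTransition(from, to string) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("queued", "in_progress")) // true
	fmt.Println(canTransition("completed", "queued"))   // false
}
```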
GitHub Run Workflow
- Monitors overall workflow execution
- Provides visibility into CI/CD pipeline status
- Updates database with run metadata
4. Data Models
Runner
message Runner {
string id = 1;
string name = 2;
string namespace = 3;
string runner_offering_id = 4;
repeated string repositories = 5;
string status = 6;
bool is_custom = 7;
// Resource specifications
string cpu = 8;
string memory = 9;
}
RunnerInstallation
message RunnerInstallation {
int64 installation_id = 1;
string workspace_id = 2;
string state = 3;
string github_account_type = 4;
bool is_service_enabled = 5;
}
RunnerOffering
message RunnerOffering {
string id = 1;
string name = 2;
string cpu = 3;
string memory = 4;
string image_repository = 5;
bool is_autoscale = 6;
}
Event Flow
1. GitHub App Installation
sequenceDiagram
participant GH as GitHub
participant GP as GitHub Proxy
participant T as Temporal
participant RS as Runner Service
participant K8s as Kubernetes
GH->>GP: installation.created
GP->>T: Start RunnerInstallWorkflow
T->>RS: Install signal
RS->>RS: Sync repositories
RS->>K8s: Create namespace
RS->>K8s: Deploy runners
RS->>GH: Installation complete
2. Workflow Job Execution
sequenceDiagram
participant GH as GitHub
participant GP as GitHub Proxy
participant T as Temporal
participant RS as Runner Service
participant ARC as Actions Controller
participant B as Billing
GH->>GP: workflow_job (queued)
GP->>T: Start GithubJobWorkflow
T->>RS: Create job record
T->>ARC: Forward job request
GH->>GP: workflow_job (completed)
T->>B: Create usage event
T->>RS: Update job metrics
Key Features
Multi-tenancy
- Workspace isolation: Each workspace has dedicated resources
- Project organization: Runners are scoped to projects
- Kubernetes namespaces: Physical isolation at infrastructure level
Custom Runners
- Container registry support: GCP, AWS, or custom registries
- Custom images: Build and manage custom runner images
- Resource configurations: Flexible CPU/memory specifications
Auto-scaling
- Horizontal Pod Autoscaler: Scale based on job queue
- Dynamic provisioning: Add runners based on repository activity
- Cost optimization: Scale down when idle
Observability
- Metrics collection: Job duration, success rates, queue times
- Workflow tracking: Complete visibility into CI/CD pipelines
- Performance analytics: P50/P90 latencies, failure analysis
Security Considerations
Authentication
- GitHub App: OAuth-based authentication
- Webhook validation: Signature verification on all events
- Token management: Secure storage in Kubernetes secrets
Authorization
- Workspace boundaries: Strict tenant isolation
- Repository access: Fine-grained permissions per runner
- RBAC integration: Keto-based permission system
Network Security
- Private networking: Runners in isolated VPCs
- Egress controls: Restricted outbound access
- TLS everywhere: Encrypted communication throughout
Operational Aspects
Monitoring
- Temporal UI: Workflow state and history
- Prometheus metrics: Resource usage and performance
- Application logs: Structured logging with trace IDs
Failure Handling
- Temporal retries: Automatic retry with exponential backoff
- Circuit breakers: Prevent cascading failures
- Manual recovery: Reset workflows for reconciliation
Maintenance
- Rolling updates: Zero-downtime deployments
- Database migrations: Version-controlled schema changes
- Backup strategies: Regular snapshots of critical data
Future Enhancements
- GPU Support: Enable ML/AI workloads
- Spot Instance Integration: Cost optimization with preemptible VMs
- Advanced Caching: Distributed cache for dependencies
- Windows Runners: Support for Windows-based workflows
- Enhanced Analytics: Deeper insights into CI/CD performance
Billing System Architecture
This document provides a comprehensive overview of Tenki Cloud's billing system, which handles usage-based billing, payment processing, and financial accounting for GitHub Actions runners.
Overview
Tenki Cloud's billing system is designed to provide accurate, reliable, and scalable billing for compute usage. It integrates multiple systems:
- TigerBeetle: High-performance financial database for double-entry bookkeeping
- Stripe: Payment processing and invoice generation
- Temporal: Workflow orchestration for billing cycles and retry logic
- PostgreSQL: Storage for billing metadata and history
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ GitHub Actions  │────▶│  Runner Service  │────▶│  Usage Events   │
│      Jobs       │     │                  │     │   (Database)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     Stripe      │◀────│  Billing Service │◀────│    Temporal     │
│   (Payments)    │     │                  │     │    Workflows    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │   TigerBeetle   │
                        │  (Accounting)   │
                        └─────────────────┘
Core Components
1. Data Models
Customer
message Customer {
string id = 1;
string identity_id = 2;
string workspace_id = 3;
uint64 tb_account_id = 4; // TigerBeetle account
string stripe_customer_id = 5; // Stripe customer
string default_payment_method = 6;
bool has_payment_method = 7;
string payment_method_status = 8;
}
Invoice
message Invoice {
string id = 1;
string customer_id = 2;
string billing_period = 3; // YYYY-MM format
string status = 4; // draft, issued, paid, void
int64 amount = 5; // in cents
bytes pdf_content = 6;
string pdf_url = 7;
string stripe_invoice_id = 8;
int32 retry_count = 9;
}
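The `billing_period` field is a YYYY-MM string; resolving it to a concrete date range is straightforward with Go's stdlib reference layout (the helper name is illustrative, not the service's API):

```go
package main

import (
	"fmt"
	"time"
)

// periodBounds parses a YYYY-MM billing period and returns the UTC
// half-open interval [start, end) it covers.
func periodBounds(period string) (time.Time, time.Time, error) {
	start, err := time.Parse("2006-01", period)
	if err != nil {
		return time.Time{}, time.Time{}, err
	}
	return start, start.AddDate(0, 1, 0), nil
}

func main() {
	start, end, _ := periodBounds("2025-06")
	fmt.Println(start.Format("2006-01-02")) // 2025-06-01
	fmt.Println(end.Format("2006-01-02"))   // 2025-07-01
}
```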
UsageEvent
message UsageEvent {
string id = 1;
string workspace_id = 2;
string runner_id = 3;
google.protobuf.Timestamp started_at = 4;
google.protobuf.Timestamp finished_at = 5;
int64 seconds = 6;
string external_id = 7; // Idempotency key
uint64 tb_transfer_id = 8; // TigerBeetle transfer
}
2. TigerBeetle Accounting
The system uses double-entry bookkeeping with predefined accounts:
Fixed Accounts
- 1001 - `TENKI_RECEIVABLE`: Money owed to Tenki
- 1010 - `STRIPE_RECEIVABLE`: Money in Stripe
- 2001 - `USER`: Customer liability accounts
- 4001 - `REVENUE`: Income account
- 5010 - `STRIPE_FEE`: Payment processing fees
- 5020 - `MARKETING_EXPENSE`: Promotional credits
Transfer Types
- 1002 - `T_StripePayment`: Customer payments via Stripe
- 2001 - `T_RunnerCharge`: GitHub Actions runner usage charge
- 2002 - `T_RunnerPromoCreditUsage`: Promotional credit usage adjustment
- 2003 - `T_UsageReversal`: Reversal of negative usage charges
- 2010 - `T_ComputeCharge`: Charge for compute resources (future use)
- 3001 - `T_AccountSignup`: Initial signup bonus credit
- 3002 - `T_MonthlyFreeCredit`: Monthly free credit allowance
- 3003 - `T_PromoCredit`: General promotional credit
- 3004 - `T_PromoCreditReversal`: Reversal of promotional credits
Example Transactions
Usage Charge (Runner completes job):
Debit: USER (Customer Account) $5.00
Credit: REVENUE $5.00
Payment Received (Stripe payment):
Debit: STRIPE_RECEIVABLE $100.00
Credit: USER (Customer Account) $100.00
Debit: STRIPE_FEE $2.90
Credit: STRIPE_RECEIVABLE $2.90
Promotional Credit:
Debit: MARKETING_EXPENSE $10.00
Credit: USER (Customer Account) $10.00
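The postings above can be sketched as a toy double-entry ledger in micro-cents (1/1,000,000 of a cent, the precision used for usage charges). This is illustrative bookkeeping with a simplified sign convention, not the TigerBeetle client API:

```go
package main

import (
	"errors"
	"fmt"
)

// Ledger tracks the net balance of each account in micro-cents.
type Ledger struct {
	balances map[string]int64
}

func NewLedger() *Ledger { return &Ledger{balances: map[string]int64{}} }

// Post records a double-entry transfer: the debited and credited
// accounts move by the same amount, so all balances always sum to zero.
func (l *Ledger) Post(debit, credit string, microCents int64) error {
	if microCents <= 0 {
		return errors.New("amount must be positive")
	}
	l.balances[debit] += microCents
	l.balances[credit] -= microCents
	return nil
}

func main() {
	const dollar = 100 * 1_000_000 // $1.00 = 100 cents = 100,000,000 micro-cents

	l := NewLedger()
	l.Post("USER", "REVENUE", 5*dollar)             // usage charge
	l.Post("STRIPE_RECEIVABLE", "USER", 100*dollar) // payment received
	l.Post("MARKETING_EXPENSE", "USER", 10*dollar)  // promotional credit

	var sum int64
	for _, b := range l.balances {
		sum += b
	}
	fmt.Println(sum == 0) // true: the books balance by construction
}
```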
Financial Flow Sequences
The following sequence diagram illustrates the complete financial flows in the Tenki Cloud billing system, showing how money moves between different accounts through various transfer codes:
sequenceDiagram
participant USER as User/Customer
participant SIGNUP as Signup Process
participant GITHUB as GitHub Actions
participant BILLING as Billing Service
participant TB as TigerBeetle Ledger
participant STRIPE as Stripe
participant CYCLE as Billing Cycle
participant AUDIT as Audit System
Note over USER,AUDIT: Tenki Cloud Financial Flow System
%% Phase 1: Account Creation & Initial Credits
rect rgb(240, 248, 255)
Note left of USER: Phase 1: Account Setup & Signup Credits
USER->>SIGNUP: Create account
SIGNUP->>BILLING: Create customer account
BILLING->>TB: Create USER account (ACCOUNT_CODE_USER)
SIGNUP->>BILLING: Add signup bonus
BILLING->>TB: Transfer: T_AccountSignup<br/>MARKETING_EXPENSE → USER<br/>($10 signup credit)
end
%% Phase 2: Service Usage
rect rgb(255, 253, 240)
Note left of GITHUB: Phase 2: Service Usage & Charges
GITHUB->>BILLING: Job execution event
BILLING->>BILLING: Calculate usage cost
BILLING->>TB: Transfer: T_RunnerCharge<br/>USER → REVENUE<br/>(Usage charges)
end
%% Phase 3: Payment Processing
rect rgb(240, 255, 240)
Note left of CYCLE: Phase 3: Billing Cycle & Payments
Note over BILLING,TB: Step 1: Promotional credit adjustments
CYCLE->>BILLING: Start billing cycle
BILLING->>BILLING: Check promo credit usage for period
BILLING->>TB: Transfer: T_RunnerPromoCreditUsage<br/>REVENUE → MARKETING_EXPENSE<br/>(Move promo usage from revenue)
Note over BILLING,STRIPE: Step 2: Invoice generation
BILLING->>STRIPE: Create Stripe invoice
STRIPE->>USER: Send payment request
alt Payment Success
USER->>STRIPE: Make payment
STRIPE->>BILLING: Payment webhook
Note over BILLING,AUDIT: Payment Success Workflow
BILLING->>TB: Transfer: T_StripePayment<br/>STRIPE_RECEIVABLE → USER<br/>(Payment received)
BILLING->>TB: Transfer: T_MonthlyFreeCredit<br/>MARKETING_EXPENSE → USER<br/>($10/month free credit reset)
BILLING->>BILLING: Create payment record in database
BILLING->>AUDIT: Create billing audit record<br/>(compliance tracking)
else Payment Failed
STRIPE->>BILLING: Payment failed webhook
BILLING->>BILLING: Schedule retry attempts
BILLING->>BILLING: Start service interruption timer
end
end
Complete Transfer Code Reference
The system uses the following transfer codes for different types of financial transactions:
Payment & Withdrawal Operations (1000s)
- 1001 - `T_BankWithdrawal`: Cash withdrawal from bank account
- 1002 - `T_StripePayment`: Payment received from Stripe (invoice payment)
Service Charges (2000s)
- 2001 - `T_RunnerCharge`: Charge for GitHub Actions runner usage
- 2002 - `T_RunnerPromoCreditUsage`: Adjustment to move promotional credit usage from revenue to marketing expense
- 2003 - `T_UsageReversal`: Reversal of negative usage charges
- 2010 - `T_ComputeCharge`: Charge for compute resources (future use)
Credits & Bonuses (3000s)
- 3001 - `T_AccountSignup`: Initial signup bonus credit
- 3002 - `T_MonthlyFreeCredit`: Monthly free credit allowance (e.g., $10/month)
- 3003 - `T_PromoCredit`: General promotional credit (campaigns, support, etc.)
- 3004 - `T_PromoCreditReversal`: Reversal of promotional credits (corrections, violations, etc.)
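Rendered as Go constants, the thousands-band grouping becomes explicit (the const names and helper are illustrative renderings of the identifiers above, not the actual source):

```go
package main

import "fmt"

// Transfer codes, grouped by thousands band.
const (
	TBankWithdrawal         = 1001
	TStripePayment          = 1002
	TRunnerCharge           = 2001
	TRunnerPromoCreditUsage = 2002
	TUsageReversal          = 2003
	TComputeCharge          = 2010
	TAccountSignup          = 3001
	TMonthlyFreeCredit      = 3002
	TPromoCredit            = 3003
	TPromoCreditReversal    = 3004
)

// codeFamily classifies a transfer code by its thousands band.
func codeFamily(code int) string {
	switch code / 1000 {
	case 1:
		return "payments"
	case 2:
		return "service charges"
	case 3:
		return "credits & bonuses"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(codeFamily(TRunnerCharge))      // service charges
	fmt.Println(codeFamily(TMonthlyFreeCredit)) // credits & bonuses
}
```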
Key Financial Flow Patterns
1. Customer Onboarding: New users receive signup credits (`T_AccountSignup`) and monthly free credits (`T_MonthlyFreeCredit`) from the marketing expense account.
2. Usage Billing: GitHub Actions runner usage generates charges (`T_RunnerCharge`) that move money from customer accounts to revenue.
3. Promotional Credit Accounting: When promotional credits are used for services, the system adjusts by moving the equivalent amount from revenue back to marketing expense (`T_RunnerPromoCreditUsage`).
4. Payment Processing: Customer payments through Stripe (`T_StripePayment`) add funds to customer accounts from the Stripe receivable account.
5. Administrative Corrections: The system supports reversals for both usage charges (`T_UsageReversal`) and promotional credits (`T_PromoCreditReversal`) for corrections and violations.
3. Billing Service
The billing service provides APIs for:
- Customer Management: Creating and retrieving billing customers
- Balance Operations: Checking workspace credits/debits
- Invoice Management: Generating and managing monthly invoices
- Usage Tracking: Recording compute usage events
- Payment Methods: Managing cards and payment details
- Stripe Integration: Setup intents and billing portal
Key service methods:
// Record runner usage
RecordUsage(ctx, workspaceID, runnerID, startTime, endTime)
// Process monthly billing
ProcessInvoiceAndCharge(ctx, workspaceID, billingPeriod)
// Add promotional credits
AddPromotionalCredits(ctx, workspaceID, amount, description)
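A sketch of the cost math behind `RecordUsage`, assuming a per-second rate held in micro-cents; the rate and helper names are illustrative, not Tenki's actual pricing or API:

```go
package main

import (
	"fmt"
	"time"
)

// costForSeconds returns the charge for a run of the given length at a
// per-second rate, both kept in micro-cents to avoid rounding loss.
func costForSeconds(seconds, ratePerSecondMicroCents int64) int64 {
	if seconds < 0 {
		seconds = 0
	}
	return seconds * ratePerSecondMicroCents
}

// usageCost derives billable seconds from the started_at/finished_at
// timestamps recorded on a UsageEvent.
func usageCost(start, end time.Time, ratePerSecondMicroCents int64) int64 {
	return costForSeconds(int64(end.Sub(start).Seconds()), ratePerSecondMicroCents)
}

func main() {
	start := time.Date(2025, 6, 12, 10, 0, 0, 0, time.UTC)
	end := start.Add(90 * time.Second)
	// 1,000 micro-cents/second is an illustrative rate.
	fmt.Println(usageCost(start, end, 1_000)) // 90000
}
```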
Workflow Orchestration
1. Billing Cycle Workflow
Runs monthly for each workspace:
flowchart TD
A[Start Monthly Billing] --> B[Generate Stripe Invoice]
B --> C[Send Invoice Email]
C --> D{Amount > 0?}
D -->|Yes| E[Charge Payment Method]
D -->|No| F[Complete]
E --> G{Payment Success?}
G -->|Yes| H[Payment Succeeded Workflow]
G -->|No| I[Payment Failed Workflow]
H --> F
I --> F
2. Payment Processing Workflows
Payment Succeeded:
- Record payment in TigerBeetle
- Create payment record in database
- Update invoice status
Payment Failed:
- Send failure notification
- Schedule retry attempts (max 5)
- Start service interruption timer
3. Retry Logic
Failed payments are retried with exponential backoff:
- Retry 1: 3 days later
- Retry 2: 5 days later
- Retry 3: 7 days later
- Retry 4: 14 days later
- Retry 5: 21 days later
If all retries fail by the 9th of the following month, services are suspended on the 10th.
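The schedule above reduces to a lookup; this sketch is for illustration, while the production logic lives in the Temporal retry workflow:

```go
package main

import "fmt"

// retryDelayDays returns how many days after the previous failure
// attempt N is retried (attempts 1-5), or ok=false once retries are
// exhausted and suspension follows.
func retryDelayDays(attempt int) (days int, ok bool) {
	delays := []int{3, 5, 7, 14, 21}
	if attempt < 1 || attempt > len(delays) {
		return 0, false
	}
	return delays[attempt-1], true
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		d, ok := retryDelayDays(attempt)
		fmt.Println(attempt, d, ok)
	}
}
```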
4. Credit Management
Long-running workflow that handles credit operations via signals:
- `AddPromotionalCredits`: Adds credits to a workspace
- `DeductPromotionalCredits`: Removes credits
- Maintains audit trail in TigerBeetle
Usage Flow
1. Recording Usage
When a GitHub Actions job completes:
sequenceDiagram
participant Job as GitHub Job
participant Runner as Runner Service
participant Billing as Billing Service
participant TB as TigerBeetle
Job->>Runner: Job completed
Runner->>Billing: Record usage event
Billing->>Billing: Calculate cost
Billing->>TB: Create usage transfer
TB->>TB: Debit user account
TB->>TB: Credit revenue account
Billing->>Runner: Usage recorded
2. Monthly Billing
At the start of each month:
sequenceDiagram
participant Temporal
participant Billing as Billing Service
participant Stripe
participant Customer
Temporal->>Billing: Start billing cycle
Billing->>Billing: Calculate usage for month
Billing->>Stripe: Create invoice
Stripe->>Customer: Send invoice email
Billing->>Stripe: Charge payment method
alt Payment successful
Stripe->>Billing: Payment confirmed
Billing->>Billing: Record in TigerBeetle
else Payment failed
Stripe->>Billing: Payment failed
Billing->>Temporal: Schedule retry
end
Key Features
Precision Accounting
- All amounts stored as micro-cents (1/1,000,000 of a cent)
- Prevents rounding errors in usage calculations
- Supports high-frequency micro-transactions
Idempotency
- External IDs prevent duplicate usage records
- Workflow IDs ensure single execution
- TigerBeetle provides transaction guarantees
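External-ID deduplication can be sketched with an in-memory set; the service itself stores `external_id` on the UsageEvent, while this toy version only illustrates the accept-once semantics:

```go
package main

import "fmt"

// UsageRecorder bills each external_id at most once, so a redelivered
// webhook never double-charges a job.
type UsageRecorder struct {
	seen  map[string]bool
	total int64 // total billed seconds
}

func NewUsageRecorder() *UsageRecorder {
	return &UsageRecorder{seen: map[string]bool{}}
}

// Record returns false when the event was already processed.
func (r *UsageRecorder) Record(externalID string, seconds int64) bool {
	if r.seen[externalID] {
		return false
	}
	r.seen[externalID] = true
	r.total += seconds
	return true
}

func main() {
	r := NewUsageRecorder()
	fmt.Println(r.Record("job-123", 90)) // true: first delivery
	fmt.Println(r.Record("job-123", 90)) // false: duplicate webhook
	fmt.Println(r.total)                 // 90
}
```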
Audit Trail
- Every financial transaction recorded in TigerBeetle
- Complete history of charges, payments, and credits
- Immutable ledger for compliance
Self-Service
- Stripe billing portal for payment method management
- Invoice history and downloads
- Usage reports by billing period
Graceful Degradation
- Billing continues even if Stripe is unavailable
- TigerBeetle ensures accounting accuracy
- Workflows retry transient failures
Security Considerations
Payment Security
- No credit card data stored in Tenki systems
- All payment processing through PCI-compliant Stripe
- Secure token-based payment method references
Access Control
- Workspace-scoped billing operations
- Admin-only credit management
- Audit logs for all financial operations
Data Protection
- Encrypted storage for sensitive data
- TLS for all external communications
- Regular backups of financial data
Operational Aspects
Monitoring
- Temporal workflow status for billing cycles
- TigerBeetle consistency checks
- Stripe webhook processing metrics
- Failed payment alerts
Troubleshooting
- Workflow history in Temporal UI
- TigerBeetle account balances
- Stripe dashboard for payment issues
- Database queries for usage history
Common Issues
- Payment failures: Check Stripe logs and retry status
- Missing usage: Verify runner job completion events
- Balance discrepancies: Audit TigerBeetle transfers
- Invoice generation: Check Temporal workflow status
Future Enhancements
- Volume Discounts: Tiered pricing based on usage
- Prepaid Packages: Bulk minute purchases
- Cost Alerts: Notifications for spending thresholds
- Multi-Currency: Support for international customers
- Advanced Analytics: Detailed cost breakdowns by repository/workflow
Architecture Decision Records
This directory contains Architecture Decision Records (ADRs) - documents that capture important architectural decisions made during the development of Tenki Cloud.
What is an ADR?
An ADR is a document that captures an important architectural decision made along with its context and consequences. Each ADR describes a single decision and is immutable once accepted.
ADR Template
# ADR-XXX: Title
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]
## Context
What is the issue that we're seeing that is motivating this decision or change?
## Decision
What is the change that we're proposing and/or doing?
## Consequences
What becomes easier or more difficult to do because of this change?
### Positive
- List of positive consequences
### Negative
- List of negative consequences
## Alternatives Considered
What other options were evaluated and why were they rejected?
Current ADRs
- ADR-001: Monorepo Structure - Using monorepo for all services
- ADR-002: Temporal for Workflow Orchestration - Workflow engine choice
- ADR-003: Connect RPC over REST - API protocol decision
Creating a New ADR
- Copy the template above
- Create a new file: `XXX-short-description.md` (increment XXX)
- Fill out all sections
- Submit PR for review
- Once accepted, the ADR becomes immutable
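Incrementing XXX can be automated; a small illustrative helper (not part of the repo tooling) that derives the next number from existing filenames:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextADRNumber scans "XXX-short-description.md" filenames and returns
// the next zero-padded number.
func nextADRNumber(filenames []string) string {
	max := 0
	for _, f := range filenames {
		prefix, _, ok := strings.Cut(f, "-")
		if !ok {
			continue
		}
		if n, err := strconv.Atoi(prefix); err == nil && n > max {
			max = n
		}
	}
	return fmt.Sprintf("%03d", max+1)
}

func main() {
	files := []string{"001-monorepo.md", "002-temporal.md", "003-connect-rpc.md"}
	fmt.Println(nextADRNumber(files)) // 004
}
```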
When to Write an ADR
Write an ADR when:
- Selecting key technologies (databases, frameworks, protocols)
- Defining major architectural patterns
- Making security decisions
- Choosing between significant alternatives
- Deprecating existing patterns
ADR Lifecycle
- Proposed - Under discussion
- Accepted - Decision made and being implemented
- Deprecated - No longer recommended but still in use
- Superseded - Replaced by another ADR
ADR-001: Monorepo Structure
Status
Accepted (2024-01-15)
Context
Tenki Cloud consists of multiple interconnected services:
- Frontend applications (Next.js web app, future mobile apps)
- Backend services (Go microservices)
- Shared packages (TypeScript utilities, proto definitions)
- Infrastructure code (Kubernetes manifests, Terraform)
We need a repository structure that:
- Enables code sharing between services
- Ensures coordinated deployments
- Maintains clear boundaries between services
- Provides good developer experience
Decision
We will use a monorepo structure with:
- pnpm workspaces for TypeScript/JavaScript projects
- Go modules with `replace` directives for Go services
- Turborepo for orchestrated builds
- Shared tooling across all services
Repository structure:
tenki.app/
├── apps/       # Deployable applications
├── backend/    # Go services
├── packages/   # Shared libraries
├── proto/      # Protocol buffer definitions
└── infra/      # Infrastructure code
Consequences
Positive
- Atomic changes - Features spanning multiple services can be implemented in a single commit
- Shared tooling - Linting, formatting, and testing tools configured once
- Simplified dependencies - No need for private package registries
- Consistent versioning - All services released together
- Easier refactoring - Moving code between services is straightforward
- Single source of truth - Proto definitions shared directly
Negative
- Larger repository - Clone and fetch times increase over time
- Complex CI/CD - Need to determine which services to build/deploy
- Steeper learning curve - New developers must understand entire structure
- Potential for coupling - Easier to create inappropriate dependencies
- Tooling requirements - Requires pnpm, Go, and other tools installed
Alternatives Considered
1. Separate Repositories
Rejected because:
- Coordination overhead for cross-service changes
- Dependency version management complexity
- Need for private package registry
- Difficult to maintain API contracts
2. Git Submodules
Rejected because:
- Poor developer experience
- Complex update workflows
- Easy to get into inconsistent states
- Limited tool support
3. Lerna (instead of Turborepo)
Rejected because:
- Turborepo has better performance
- Native pnpm workspace support
- Better caching mechanisms
- Simpler configuration
Implementation Notes
- Use `pnpm` filters for targeted operations:
  pnpm -F app dev           # Run only app
  pnpm -F "backend/*" test  # Test all backend
- Go services use local `replace`:
  replace github.com/luxorlabs/proto => ../../proto
- CI uses Turborepo caching:
  { "pipeline": { "build": { "cache": true } } }
ADR-002: Temporal Workflows
Status
Accepted
Context
We need a reliable workflow orchestration system for managing complex, long-running processes like GitHub runner lifecycle management, billing operations, and asynchronous tasks.
Decision
We will use Temporal for workflow orchestration because it provides:
- Durable execution with automatic retries
- Built-in error handling and compensation
- Strong consistency guarantees
- Visibility into workflow state and history
- Language-specific SDKs with good Go support
Consequences
Positive
- Reliable execution of critical business processes
- Built-in observability and debugging capabilities
- Simplified error handling for distributed operations
- Ability to handle long-running workflows (hours/days)
Negative
- Additional infrastructure to maintain
- Learning curve for developers new to Temporal
- Potential vendor lock-in for workflow logic
Implementation
Temporal workflows will be used for:
- GitHub runner provisioning and lifecycle management
- Billing and subscription management
- Asynchronous job processing
- Scheduled maintenance tasks
ADR-003: gRPC Gateway
Status
Accepted
Context
We need to expose our internal gRPC services to web clients that don't support gRPC directly. We also want to maintain a single source of truth for our API definitions while supporting both gRPC and REST/JSON clients.
Decision
We will use grpc-gateway to automatically generate a RESTful HTTP API from our gRPC service definitions. This allows us to:
- Maintain a single API definition in protobuf
- Support both gRPC and REST clients
- Auto-generate OpenAPI documentation
- Preserve strong typing across the stack
Consequences
Positive
- Single source of truth for API definitions
- Automatic REST API generation from protobuf
- Built-in OpenAPI/Swagger documentation
- Consistent API behavior between gRPC and REST
- Strong typing preserved through code generation
Negative
- Additional build step for gateway generation
- Some gRPC features don't map perfectly to REST
- Slightly increased complexity in the API layer
- Need to carefully design protos for good REST mappings
Implementation
The grpc-gateway will:
- Run as a reverse proxy in front of gRPC services
- Translate HTTP/JSON requests to gRPC
- Use protobuf annotations for REST endpoint configuration
- Generate OpenAPI specs for documentation
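As an illustrative sketch (the service and message names here are hypothetical, not taken from our actual protos), a REST mapping is declared directly on the gRPC method via a `google.api.http` annotation, which grpc-gateway uses to generate the HTTP handler:

```protobuf
syntax = "proto3";

package tenki.cloud.workspace.v1;

import "google/api/annotations.proto";

service ProjectService {
  // grpc-gateway serves GET /v1/projects/{project_id} and translates
  // it into this gRPC call, binding the path segment to project_id.
  rpc GetProject(GetProjectRequest) returns (GetProjectResponse) {
    option (google.api.http) = {
      get: "/v1/projects/{project_id}"
    };
  }
}

message GetProjectRequest {
  string project_id = 1;
}

message GetProjectResponse {
  string name = 1;
}
```

Field names referenced in the URL template must match proto field names exactly, which is one reason protos need careful design for good REST mappings.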
Architecture Diagrams
This directory contains architectural diagrams for the Tenki Cloud platform.
Overview
Our architecture diagrams use Mermaid for easy maintenance and version control. Each diagram is stored as a .md file with embedded Mermaid syntax.
Available Diagrams
- System Overview: High-level view of all components
- Data Flow: How data moves through the system
- Deployment Architecture: Infrastructure and deployment topology
- Security Model: Authentication and authorization flows
Creating New Diagrams
1. Create a new `.md` file in this directory
2. Use Mermaid syntax for the diagram
3. Include a description of what the diagram represents
4. Update this README with a link to the new diagram
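For example, a minimal Mermaid flowchart (the component names here are illustrative, not a normative architecture) looks like this in a fenced block:

```mermaid
flowchart LR
    Browser -->|HTTPS/JSON| Gateway[grpc-gateway]
    Gateway -->|gRPC| Engine[engine service]
    Engine --> DB[(PostgreSQL)]
```

Keeping diagrams as text like this is what makes them diffable and reviewable in pull requests.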
Viewing Diagrams
These diagrams are rendered automatically in:
- GitHub markdown preview
- Our documentation site (mdBook)
- Most modern markdown editors
Mermaid Resources
Feature Specification: New Pricing + Free Credits Policy
Feature Branch: 001-new-pricing
Created: 2025-10-22
Status: Draft
Input: User description: "New Pricing + Free Credits Policy"
User Scenarios & Testing (mandatory)
User Story 1 - New User Onboarding with Free Credits (Priority: P1)
A new user signs up for Tenki and receives 1,000 free minutes (normalized to 2 vCPU runners) to explore the platform without requiring payment information upfront. They can access all features and offerings during this trial period.
Why this priority: This is the primary entry point for all new users and directly addresses the problem of acquiring users while deferring payment collection until value is demonstrated.
Independent Test: Can be fully tested by creating a new account, running jobs on various runner sizes, and verifying that free minutes are properly calculated and consumed based on vCPU scaling (e.g., 500 minutes on 4 vCPU, 250 minutes on 8 vCPU).
Acceptance Scenarios:
- Given a new user signs up for Tenki, When their account is created, Then they receive 1,000 free minutes normalized to 2 vCPU runners
- Given a user has free minutes remaining, When they use a 2 vCPU runner for 10 minutes, Then 10 minutes are deducted from their balance
- Given a user has free minutes remaining, When they use a 4 vCPU runner for 10 minutes, Then 20 minutes are deducted from their balance (scaled by vCPU ratio)
- Given a user has free minutes remaining, When they use an 8 vCPU runner for 10 minutes, Then 40 minutes are deducted from their balance (scaled by vCPU ratio)
- Given a new user with free credits, When they access the platform, Then they can use all features and runner types without restrictions
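The vCPU normalization in these scenarios reduces to a single multiplication against the 2 vCPU baseline. A minimal Go sketch (the function name is ours for illustration, not from the codebase):

```go
package main

import "fmt"

// FreeMinutesConsumed returns how many free minutes a job consumes,
// normalized to the 2 vCPU baseline: a 4 vCPU runner burns minutes
// at 2x the rate, an 8 vCPU runner at 4x.
func FreeMinutesConsumed(runtimeMinutes, vcpus int) int {
	return runtimeMinutes * vcpus / 2
}

func main() {
	for _, vcpus := range []int{2, 4, 8} {
		fmt.Printf("10 min on %d vCPU consumes %d free minutes\n",
			vcpus, FreeMinutesConsumed(10, vcpus))
	}
}
```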
User Story 2 - Payment Information Collection After Free Credits (Priority: P1)
When a user exhausts their 1,000 free minutes, the system prompts them to enter credit card information to continue using the platform on a pay-as-you-go basis.
Why this priority: This is the critical conversion point from free trial to paid customer and directly addresses the problem of verifying payment intent before allowing continued usage.
Independent Test: Can be tested by consuming all free minutes and verifying that the system blocks further usage until valid payment information is provided, then allows continued usage after payment details are entered.
Acceptance Scenarios:
- Given a user has consumed all 1,000 free minutes, When they attempt to run a new job, Then they are prompted to enter credit card information before proceeding
- Given a user is prompted for payment, When they enter valid credit card details, Then their account transitions to pay-as-you-go billing and jobs can proceed
- Given a user is prompted for payment, When they close the prompt without entering payment details, Then their jobs remain blocked until payment is provided
- Given a user has entered payment information, When they consume additional minutes, Then usage is tracked and billed according to PAYG pricing
User Story 3 - Pay-As-You-Go Usage and Billing (Priority: P2)
A paid user runs CI/CD jobs on various runner types and is charged per-minute based on the runner SKU pricing. They receive transparent billing for their actual usage with no upfront commitments.
Why this priority: This is the core revenue model for the platform and must work reliably for sustainable business operations.
Independent Test: Can be tested by running jobs on different runner SKUs, verifying per-minute charges match the pricing table, and confirming accurate invoice generation.
Acceptance Scenarios:
- Given a paid user runs a job on a 2c-4GB x64 runner for 10 minutes, When billing is calculated, Then they are charged $0.03 (10 min Γ $0.003/min)
- Given a paid user runs a job on a 4c-8GB x64 runner for 15 minutes, When billing is calculated, Then they are charged $0.09 (15 min Γ $0.006/min)
- Given a paid user with 40 concurrent jobs included, When they run 40 or fewer concurrent jobs, Then no additional concurrency charges apply
- Given a paid user, When they view their billing dashboard, Then they see itemized usage by runner type, duration, and total costs
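The per-minute charges above can be computed in integer micro-dollars to avoid float rounding errors; partial minutes are rounded up here, matching the assumption stated later in this spec. The representation and function name are our choices for illustration:

```go
package main

import "fmt"

// UsageChargeMicroDollars bills a job at ratePerMinute expressed in
// micro-dollars (e.g. $0.003/min == 3000), rounding partial minutes up.
func UsageChargeMicroDollars(durationSeconds, ratePerMinute int64) int64 {
	billedMinutes := (durationSeconds + 59) / 60 // round up partial minutes
	return billedMinutes * ratePerMinute
}

func main() {
	// 10 minutes on a 2c-4GB x64 runner at $0.003/min -> $0.03
	fmt.Println(UsageChargeMicroDollars(600, 3000)) // 30000 micro-dollars
	// 15 minutes on a 4c-8GB x64 runner at $0.006/min -> $0.09
	fmt.Println(UsageChargeMicroDollars(900, 6000)) // 90000 micro-dollars
}
```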
User Story 4 - Add-On Purchase and Management (Priority: P2)
A user on the PAYG plan purchases optional add-ons such as macOS M4 runner access, priority support, or priority queue boost to enhance their experience.
Why this priority: Add-ons provide upsell opportunities and feature-based segmentation, addressing the problem of revenue expansion and feature monetization.
Independent Test: Can be tested by purchasing an add-on (e.g., macOS access for $39/month), verifying access is granted, and confirming the recurring charge appears on invoices.
Acceptance Scenarios:
- Given a PAYG user, When they purchase macOS M4 runner access for $39/month, Then they can create and run jobs on macOS runners
- Given a user without macOS access, When they attempt to use macOS runners, Then they are prompted to purchase the add-on
- Given a user purchases priority support for $250/month, When they submit a support request, Then it is routed to the priority queue with private chat access
- Given a user purchases priority queue boost for $49/month per workspace, When their jobs are queued, Then they receive higher priority in job scheduling
- Given a user with add-ons, When they view their billing, Then add-on charges are itemized separately from usage charges
User Story 5 - Additional Concurrent Job Slot Purchase (Priority: P3)
A user exceeding the 40 included concurrent job slots purchases additional slots at $7/slot/month for x64 runners or $49/slot/month for macOS M4 runners.
Why this priority: This supports teams with high parallelism needs and provides incremental revenue, but is less critical than core pricing and add-ons.
Independent Test: Can be tested by running more than 40 concurrent jobs, purchasing additional slots, and verifying jobs execute in parallel up to the new limit.
Acceptance Scenarios:
- Given a user with 40 included concurrent slots, When they attempt to run 50 concurrent jobs, Then 10 jobs are queued until slots become available
- Given a user, When they purchase 10 additional x64 concurrent slots, Then they are charged $70/month and can run up to 50 concurrent x64 jobs
- Given a user, When they purchase 5 additional macOS M4 concurrent slots, Then they are charged $245/month and can run up to 45 concurrent macOS jobs (assuming base 40 applies to all runner types)
User Story 6 - Storage Billing (Priority: P3)
A user stores build artifacts, caches, and other data on the platform and is billed $0.20 per GB per month for storage consumption.
Why this priority: Storage is a necessary cost component but secondary to compute billing in terms of implementation priority and revenue impact.
Independent Test: Can be tested by uploading data, tracking storage usage over time, and verifying charges match $0.20/GB/month prorated.
Acceptance Scenarios:
- Given a user stores 50 GB of data, When monthly billing is calculated, Then they are charged $10 for storage
- Given a user uploads 20 GB on day 15 of the month, When monthly billing is calculated, Then they are charged approximately $2 (prorated for half month)
- Given a user has 10 GB of transparent cache included, When they use 10 GB or less total storage, Then no storage charges apply [NEEDS CLARIFICATION: Is the 10GB transparent cache counted toward the $0.20/GB storage billing, or is it separate?]
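One way to realize this proration is to average daily storage snapshots over the billing period and bill the average at $0.20/GB, as the Assumptions section below also suggests. The helper is a sketch under that assumption:

```go
package main

import "fmt"

// StorageChargeCents averages daily storage snapshots (in GB) over the
// billing period and bills the average at $0.20/GB/month (20 cents).
func StorageChargeCents(dailyGB []int) int {
	if len(dailyGB) == 0 {
		return 0
	}
	total := 0
	for _, gb := range dailyGB {
		total += gb
	}
	return total * 20 / len(dailyGB)
}

func main() {
	// 20 GB uploaded on day 15 of a 30-day month: 15 days at 0 GB,
	// then 15 days at 20 GB -> average 10 GB -> $2.00.
	days := make([]int, 30)
	for i := 15; i < 30; i++ {
		days[i] = 20
	}
	fmt.Println(StorageChargeCents(days)) // 200 cents
}
```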
User Story 7 - Enterprise Custom Pricing Inquiry (Priority: P3)
An enterprise customer with predictable, high-volume usage requests committed use discounts and white-glove onboarding through a sales inquiry process.
Why this priority: Enterprise deals provide revenue predictability and larger contracts, but represent a smaller percentage of total users and require sales team involvement.
Independent Test: Can be tested by submitting an enterprise inquiry form, receiving a response from sales, and negotiating custom pricing terms outside the automated PAYG system.
Acceptance Scenarios:
- Given a user interested in enterprise pricing, When they submit an inquiry, Then they are contacted by the sales team within [NEEDS CLARIFICATION: response SLA not specified - 24 hours? 48 hours?]
- Given an enterprise customer commits to a minimum usage level, When their contract is established, Then they receive discounted per-minute rates compared to PAYG
- Given an enterprise customer, When they onboard, Then they receive dedicated success engineer support for migration and setup
User Story 8 - Premium Runner Pricing (Priority: P3)
A user opts to use premium runners (indicated by βPremium Pricingβ in the SKU table) and is charged an additional fee on top of the base runner cost.
Why this priority: Premium runners provide differentiated service levels but are an optional enhancement to the base offering.
Independent Test: Can be tested by selecting a premium runner option, running a job, and verifying the charge includes the base price plus the premium surcharge.
Acceptance Scenarios:
- Given a user runs a job on a premium 2c-4GB runner for 10 minutes, When billing is calculated, Then they are charged $0.045 (10 min Γ ($0.003 + $0.0015)/min)
- Given a user runs a job on a premium 4c-8GB runner for 10 minutes, When billing is calculated, Then they are charged $0.090 (10 min Γ ($0.006 + $0.003)/min)
- Given a user, When they select a runner, Then they can choose between standard and premium options with clear pricing displayed [NEEDS CLARIFICATION: What specific benefits do premium runners provide - faster provisioning, dedicated resources, better SLA?]
Edge Cases
- What happens when a user's payment method fails after exhausting free credits? Are jobs blocked immediately or is there a grace period?
- How are partial minutes billed (e.g., a job that runs for 3.5 minutes)?
- What happens if a user deletes stored data mid-month? Is storage billing prorated daily?
- How are concurrent job limits enforced when a user has both x64 and macOS runners? Are the limits separate or combined?
- What happens when a user downgrades or cancels add-ons mid-billing cycle? Do they receive prorated refunds or credits?
- How are discounts (up to 50% offered by sales) applied to the billing system? Are they percentage discounts or fixed credits?
- What happens when a user exhausts free credits in the middle of a running job? Is the job terminated or allowed to complete?
- How is abuse detection handled for users who repeatedly create new accounts to exploit free credits?
Requirements (mandatory)
Functional Requirements
Free Credits System
- FR-001: System MUST allocate 1,000 free minutes (normalized to 2 vCPU runners) to all new user accounts upon creation
- FR-002: System MUST scale free minute consumption based on runner vCPU count (e.g., 4 vCPU uses 2Γ minutes, 8 vCPU uses 4Γ minutes)
- FR-003: System MUST track free minute balance in real-time and display remaining balance to users
- FR-004: System MUST allow users with free minutes to access all runner types and platform features without restrictions
- FR-005: System MUST prevent job execution when free minutes are exhausted and payment information has not been provided
Payment Collection
- FR-006: System MUST prompt users to enter credit card information when free minutes are exhausted
- FR-007: System MUST validate and securely store payment information using industry-standard tokenization
- FR-008: System MUST transition user accounts from free trial to PAYG billing status after payment information is collected
- FR-009: System MUST block job execution for users who decline to provide payment information after exhausting free credits
Pay-As-You-Go Billing
- FR-010: System MUST calculate per-minute charges for all runner types according to the defined pricing table (x64, macOS, premium)
- FR-011: System MUST track actual usage time for each job execution down to the minute
- FR-012: System MUST generate itemized invoices showing usage by runner type, duration, and cost
- FR-013: System MUST charge payment methods on a monthly billing cycle for accumulated usage
- FR-014: System MUST include 40 concurrent job slots in all PAYG accounts at no additional charge
- FR-015: System MUST include 10 GB of transparent caching in all PAYG accounts at no additional charge
Add-On Management
- FR-016: System MUST allow users to purchase macOS M4 runner access for $39/month per workspace
- FR-017: System MUST allow users to purchase priority support for $250/month
- FR-018: System MUST allow users to purchase priority queue boost for $49/month per workspace
- FR-019: System MUST restrict access to add-on features until the corresponding add-on is purchased
- FR-020: System MUST bill add-on charges as recurring monthly fees separate from usage charges
- FR-021: System MUST allow users to enable, disable, or modify add-ons at any time
- FR-022: System MUST grant macOS runner access only to users with the macOS add-on active
- FR-023: System MUST route support requests to priority queue for users with priority support add-on
- FR-024: System MUST prioritize job scheduling for workspaces with priority queue boost add-on
Concurrent Job Slot Management
- FR-025: System MUST allow users to purchase additional concurrent job slots at $7/slot/month for x64 runners
- FR-026: System MUST allow users to purchase additional concurrent job slots at $49/slot/month for macOS M4 runners
- FR-027: System MUST enforce concurrent job limits based on base allocation plus purchased slots
- FR-028: System MUST queue jobs that exceed concurrent slot limits until slots become available
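FR-027 and FR-028 amount to a counting semaphore sized at the base allocation plus purchased slots. A buffered-channel sketch in Go (names and structure are illustrative, not the actual scheduler):

```go
package main

import "fmt"

// SlotLimiter enforces a concurrent job cap of base + purchased slots.
// Jobs that fail TryAcquire would be queued until Release frees a slot.
type SlotLimiter struct {
	slots chan struct{}
}

func NewSlotLimiter(base, purchased int) *SlotLimiter {
	return &SlotLimiter{slots: make(chan struct{}, base+purchased)}
}

// TryAcquire claims a slot without blocking; false means "queue the job".
func (l *SlotLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot so a queued job can run.
func (l *SlotLimiter) Release() { <-l.slots }

func main() {
	// 40 included slots + 10 purchased x64 slots = cap of 50.
	lim := NewSlotLimiter(40, 10)
	running := 0
	for i := 0; i < 55; i++ {
		if lim.TryAcquire() {
			running++
		}
	}
	fmt.Println(running) // 50 run; the remaining 5 would be queued
}
```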
Storage Billing
- FR-029: System MUST track total storage consumption for each user account across artifacts, caches, and data
- FR-030: System MUST bill storage at $0.20 per GB per month
- FR-031: System MUST calculate storage billing based on average daily usage over the billing period
- FR-032: System MUST display current storage usage and projected monthly costs to users
Runner Pricing
- FR-033: System MUST support all x64 runner SKUs with specified pricing (2c-4GB through 64c-256GB)
- FR-034: System MUST support macOS runner SKUs with specified pricing (6 vCPU, 12 vCPU)
- FR-035: System MUST support premium pricing tier for eligible x64 runner SKUs with additional charges
- FR-036: System MUST clearly display runner pricing to users when selecting runner types
Enterprise Tier
- FR-037: System MUST provide a mechanism for users to request enterprise pricing and custom contracts
- FR-038: System MUST support custom pricing configurations for enterprise accounts with committed use discounts
- FR-039: System MUST allow sales team to configure account-specific discounts up to 50%
- FR-040: System MUST support white-glove onboarding workflows for enterprise customers
Annual Prepayment Options (Internal)
- FR-041: System MUST support 12-month prepayment for priority queue boost at $499 (15% discount)
- FR-042: System MUST support 12-month prepayment for macOS M4 access at $399 (15% discount)
- FR-043: System MUST apply prepaid add-ons to user accounts for 12-month duration
Abuse Prevention
- FR-044: System MUST implement mechanisms to detect and prevent abuse patterns (repeated free credit exploitation, cryptocurrency mining, unauthorized Minecraft servers)
- FR-045: System MUST require payment information as a verification gate to prevent abusive users from continuing operations
Key Entities
- User Account: Represents an individual or organization using Tenki, with free credit balance, payment status, billing tier, and usage history
- Free Credit Balance: The remaining free minutes available to a user, normalized to 2 vCPU baseline, consumed based on runner vCPU scaling
- Payment Method: Tokenized credit card information associated with a user account for billing purposes
- Add-On Subscription: A purchased add-on feature (macOS access, priority support, priority queue boost) with recurring billing
- Concurrent Job Slot: Allocated capacity for running parallel jobs, includes base allocation plus purchased additional slots
- Runner SKU: A specific runner configuration (vCPU, memory) with associated per-minute pricing
- Usage Record: A log of job execution including runner type, duration, and calculated cost
- Invoice: A monthly billing statement showing itemized usage charges, add-on fees, and total amount due
- Enterprise Contract: A custom pricing agreement with committed use discounts and negotiated terms
- Storage Allocation: The amount of data stored by a user, tracked for billing at $0.20/GB/month
- Workspace: An organizational unit within a user account, relevant for workspace-specific add-ons (priority queue boost, macOS access)
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: 90% of new users successfully start their first job using free credits within 24 hours of signup
- SC-002: Free credit system accurately scales minute consumption across all runner SKU types with 100% precision
- SC-003: Payment conversion rate from free to paid users reaches at least 15% within 30 days of signup
- SC-004: Billing calculations are accurate to the cent with zero disputes related to calculation errors in the first 90 days
- SC-005: Users can view real-time usage and cost projections with data latency under 5 minutes
- SC-006: Add-on purchases are reflected in user accounts and billing within 60 seconds of confirmation
- SC-007: Abuse detection mechanisms block at least 95% of identified abusive patterns (mining, unauthorized servers) within 24 hours of detection
- SC-008: Enterprise inquiry response time averages under 24 hours during business days
- SC-009: Concurrent job limits are enforced in real-time with zero jobs exceeding purchased slot allocation
- SC-010: Monthly revenue predictability improves by at least 30% through enterprise contracts and add-on subscriptions within 6 months of launch
- SC-011: Customer support tickets related to billing and pricing decrease by 40% compared to the previous pricing model within 3 months
- SC-012: Average revenue per user (ARPU) increases by at least 20% through add-on adoption within 6 months
Assumptions
- Users understand vCPU-based scaling of free credits and can calculate their effective free minutes for different runner sizes
- Industry-standard payment processing (Stripe or similar) is available and integrated for secure credit card handling
- Enterprise sales team has capacity and process to handle custom pricing negotiations and white-glove onboarding
- Abuse detection can leverage usage patterns, payment verification, and potentially behavioral analysis to identify bad actors
- Storage billing is calculated daily and averaged over the monthly billing period for prorated charges
- Partial minutes are rounded up to the next whole minute for billing purposes (industry standard for compute billing)
- Payment method failures trigger automated retry logic and user notifications before blocking service
- Annual prepayment options are available to sales team but not publicly advertised on the pricing page
- The 10 GB transparent cache is included in base PAYG pricing and does not count toward the $0.20/GB storage billing
- Concurrent job limits are enforced separately for x64 and macOS runners (not combined)
- Add-ons can be canceled at any time but billing continues through the end of the current billing cycle (no prorated refunds)
- Discounts applied by sales team are percentage-based and apply to all usage charges, not just specific SKUs
- Jobs in progress when free credits are exhausted are allowed to complete before payment is required
- Free credit abuse is mitigated by requiring unique email verification and detecting suspicious signup patterns.
Getting Started with Tenki Cloud Development
Last updated: 2025-06-12
This guide will help you set up your development environment and run Tenki Cloud locally.
Prerequisites
Required Software
- Nix
- Devenv
- Direnv
- 1Password
- 1Password CLI
- Integrate with the 1Password CLI
- Verify you have access to `luxor/Engineering` by running `op account list` and `op vault list --account=luxor`
- If `luxor` isn't showing as `luxor.1password.com`, try the other accounts from `op account list`: `op vault list --account=<account_name>`
- If the `luxor` account isn't showing at all, contact an administrator
Verify Prerequisites
nix --version
devenv version
direnv --version
op --version
Hardware Requirements
- RAM: 16GB minimum (32GB recommended)
- CPU: 4 cores minimum (8 cores recommended)
- Disk: 50GB free space
Initial Setup
1. Clone the Repository
git clone https://github.com/luxorlabs/tenki.app.git
cd tenki.app
2. Pull Setup Keys
sh tools/scripts/setup.sh
If you run into an issue where it's using the wrong account, try this:
op account list
# account can be `luxor` if url is `luxor.1password.com`,
# or `my` if url is `my.1password.com` and there's only one account
# or the 3rd column (USER ID) if you have multiple accounts and you're getting the same url for all accounts.
sh tools/scripts/setup.sh <account>
Example: sh tools/scripts/setup.sh luxor
If you run into permission issues, try
chmod +x ./tools/scripts/*.sh
3. Enable Development Environment
direnv allow
This will:
- Install all required tools (Go, Node.js, pnpm, etc.)
- Set up environment variables
- Configure Git hooks
4. Install Dependencies
# Install all npm dependencies
pnpm install
# Generate protobuf code
bufgen
# Run Go mod tidy
tidy
# Update /etc/hosts entries
sync-hosts
# Initialize database
tb-format
5. Hosts and Other Setup
NOTE: Before running this script, the host needs to have `hostctl` installed since it requires elevated execution. Verify with `hostctl --version`.
Add tenki.lab hosts:
sync-hosts
Format TigerBeetle database:
tb-format
Managing Environment and Secrets
Secret Files
- `resources/secrets/*.sops.yaml` - Encrypted secrets pushed/committed to git
- `resources/secrets/*.local.yaml` - Decrypted secrets (gitignored)
Commands
- `env-sync` - Decrypt secrets and create a copy in the individual apps/backend folders
- `env-decrypt` - Decrypt secrets, `*.sops.yaml` to `*.local.yaml`
- `env-encrypt` - Encrypt secrets, `*.local.yaml` to `*.sops.yaml`
Pulling Latest Secrets
- `git pull` to get the latest secrets
- `env-sync` to create your own copy of the secrets
Updating Secrets
1. Locate the secret you want to update in `resources/secrets/*.local.yaml`
2. Run `env-encrypt` to encrypt the secret
3. Commit the changes and push to GitHub
Overwriting Secrets (usually only needed once per setup)
- For Next.js apps: `.env.local` overrides `.env`. Copy `.env.sample` to `.env.local` and update the values
- For the backend: `engine.local.yaml` overrides `engine.yaml`. Copy `engine.sample.yaml` to `engine.local.yaml` and update the values
Seeding and Migrations
NOTE: Before running these commands, the database must be up and running. Run `dev up` or `devenv up`.
1. Run `db up` to migrate the database
2. Run `psql -U postgres -d tenki -f ./tools/seed/20240915152331_seed.sql` to seed the database with CSP & related data

   Or run `db deploy` to run both.
3. Run `db seed` to seed the database with users, workspaces, projects, and VMs. After this you can start the dev server and log in with:
   - Email: `odin@tenki.cloud`
   - Password: `tenki.app`
Database Commands
db up # Run migrations
db down # Rollback last migration
db reset # Reset database
db status # Check migration status
db create add_users_table # Create new migration
db deploy # Run migrations and seed
db nuke # Complete database reset
db seed # Seed test data
For Redpanda, see internal docs to set it up.
NOTE: This should be automated in the future
Running the Application
Quick Start
# Start all services
dev up
# Access the application
open https://app.tenki.lab:4001
Development Domains
- Frontend: https://app.tenki.lab:4001
- Temporal UI: https://temporal.tenki.lab
- Redpanda Console: https://redpanda.tenki.lab
- Grafana: https://grafana.tenki.lab
- API: https://api.tenki.lab
Individual Services
# Start specific service
dev up postgres
dev up temporal
dev up engine
# Other options
dev up --simple # Minimal output
dev up -D # Detached mode
# Service management
dev start [service] # Start specific service
dev stop [service] # Stop specific service
dev restart [service] # Restart service
dev logs [service] # View service logs
dev list # List all services
# Examples
dev start # (enter, then choose services, hit tab to select multiple)
dev start engine
dev logs -f postgres # Follow logs
Development Workflow
Frontend Development
cd apps/app
pnpm dev
# Run type checking
pnpm type-check
# Run linting
pnpm lint
pnpm lint:fix
Backend Development
# Run Go services
cd backend
go run cmd/engine/main.go
# Run tests
gotest
# Generate mocks
gomocks
# Build binaries
make build-engine
Database Changes
# Create new migration
db create add_user_preferences
# Apply migrations
db up
# Rollback migration
db down
# Reset database
db reset
Resetting Existing/Flaky Local Environment
1. Close/stop all services
2. Run `reset-local`
3. In another terminal, run `db deploy`
Common Tasks
Adding shadcn/ui Components
# Add a new component
pnpm -F @shared/ui ui:add
# or
pnpm -F @shared/ui ui:add <component>
Then add the component to the exports in packages/ui/package.json:
"exports": {
"./button": "./src/components/ui/button.tsx",
"./alert-dialog": "./src/components/ui/alert-dialog.tsx"
}
Generating App Icons
pnpm -F app generate-icon
Then:
- Copy `public/images/favicon-196.png` to:
  - `src/app/favicon.png`
  - `src/app/icon.png`
- Copy all `rel="apple-touch-startup-image"` entries from `src/asset-generator.html` to `src/app/layout.tsx`
Adding a New API Endpoint
1. Define the proto in `proto/tenki/cloud/`
2. Run `bufgen` to generate code
3. Implement the service in `backend/internal/domain/`
4. Add a tRPC router in `apps/app/src/server/api/`
Running Tests
# All tests
pnpm test
gotest
# Specific package
pnpm -F app test
cd backend && go test ./pkg/...
# Integration tests
gotest-integration
# With coverage
cd backend && go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
Debugging
# Check service health
dev list
# View all logs
dev logs
# Restart a service
dev restart engine
# Database console (direct connection)
psql -h localhost -U postgres -d tenki
# Temporal CLI
temporal workflow list
Troubleshooting
Port Already in Use
# Find process using port
lsof -i :4001
# Kill process
kill -9 <PID>
Database Connection Issues
# Restart postgres
dev restart postgres
# Check logs
dev logs postgres
# Reset database
db nuke
Proto Generation Fails
# Clean and regenerate
rm -rf backend/pkg/proto
bufgen
Node Modules Issues
# Clean and reinstall
rm -rf node_modules apps/*/node_modules packages/*/node_modules
pnpm install
Next Steps
- Read the Architecture Overview
- Set up your IDE/Editor
- Join the development Slack channel
- Pick a starter issue from GitHub
Editor Setup
VS Code
1. Install recommended extensions:
   - Go
   - ESLint
   - Prettier
   - Proto3
2. Use workspace settings: `{ "editor.formatOnSave": true, "go.lintTool": "golangci-lint" }`
GoLand/WebStorm
- Enable Go modules
- Set up file watchers for:
- gofmt
- prettier
- eslint
Runner Prerequisites
Setting up GitHub App
1. Create a GitHub organization (skip this if you already have one)
2. Create a GitHub App:
   - Run `pnpm -F github-app run create` or `pnpm -F github-app run create -o <org>`
   - If the `name` already exists, change it and continue
   - Once done, it will redirect you to a success screen; close the tab
3. In `github-app-response.json`, take note of the `slug`, `pem`, `webhook_secret`, `client_id`, and `client_secret`
Backend Development Guide
Last updated: 2025-06-12
Overview
The Tenki backend is built with Go and follows Domain-Driven Design principles. Services communicate via Connect RPC (gRPC-Web compatible) and use Temporal for workflow orchestration.
Project Structure
backend/
├── cmd/                  # Application entry points
│   ├── engine/           # Main backend service
│   └── tenki-cli/        # CLI tool
├── internal/             # Private application code
│   ├── app/              # Application layer
│   └── domain/           # Business domains
│       ├── billing/      # Billing domain
│       ├── compute/      # VM management
│       ├── identity/     # Auth & users
│       ├── runner/       # GitHub runners
│       └── workspace/    # Multi-tenancy
├── pkg/                  # Public packages
│   └── proto/            # Generated protobuf
├── queries/              # SQL queries (sqlc)
└── schema/               # Database migrations
Development Workflow
Running the Backend
# Start dependencies
dev up postgres temporal kafka
# Run migrations
db deploy
# Start engine
cd backend
go run cmd/engine/main.go
# Or use the dev script
dev restart engine
Adding a New Feature
1. Define the API in `proto/tenki/cloud/workspace/v1/project.proto`:

   `service ProjectService { rpc CreateProject(CreateProjectRequest) returns (CreateProjectResponse); }`

2. Generate code with `bufgen`
3. Implement the domain logic in `internal/domain/workspace/service/project.go`:

   `func (s *Service) CreateProject(ctx context.Context, req *params.CreateProject) (*models.Project, error)`

4. Write SQL queries in `queries/workspace/project.sql`:

   `-- name: CreateProject :one INSERT INTO projects (name, workspace_id) VALUES ($1, $2) RETURNING *;`

5. Generate SQL code: `cd backend && sqlc generate`
Testing
Unit Tests
func TestService_CreateProject(t *testing.T) {
tests := []struct {
name string
input *params.CreateProject
want *models.Project
wantErr bool
}{
{
name: "valid project",
input: &params.CreateProject{
Name: "test-project",
WorkspaceID: "ws-123",
},
want: &models.Project{
Name: "test-project",
},
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Test implementation
})
}
}
Integration Tests
//go:build integration
var _ = Describe("Project Service", func() {
var (
service *workspace.Service
db *sql.DB
)
BeforeEach(func() {
db = setupTestDB()
service = workspace.NewService(workspace.WithDB(db))
})
It("should create a project", func() {
project, err := service.CreateProject(ctx, params)
Expect(err).NotTo(HaveOccurred())
Expect(project.Name).To(Equal("test"))
})
})
Running Tests
# Unit tests only
gotest
# Integration tests
gotest-integration
# Specific package
cd backend && go test ./internal/domain/workspace/...
# With coverage
cd backend && go test -cover ./...
Database Operations
Migrations
# Create migration
echo "CREATE TABLE features (id uuid PRIMARY KEY);" > backend/schema/$(date +%Y%m%d%H%M%S)_add_features.sql
# Apply migrations
db up
# Rollback
db down
Query Development
1. Write the query in `backend/queries/`
2. Run `sqlc generate`
3. Use the generated code in your service
// Generated code usage
project, err := s.db.CreateProject(ctx, db.CreateProjectParams{
Name: req.Name,
WorkspaceID: req.WorkspaceID,
})
Temporal Workflows
Workflow Definition
func RunnerProvisioningWorkflow(ctx workflow.Context, params RunnerParams) error {
// Configure workflow
ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Minute,
RetryPolicy: &temporal.RetryPolicy{
MaximumAttempts: 3,
},
})
// Execute activities
var runner *models.Runner
err := workflow.ExecuteActivity(ctx, CreateRunnerActivity, params).Get(ctx, &runner)
if err != nil {
return fmt.Errorf("create runner: %w", err)
}
return nil
}
Testing Workflows
func TestRunnerProvisioningWorkflow(t *testing.T) {
suite := testsuite.WorkflowTestSuite{}
env := suite.NewTestWorkflowEnvironment()
// Mock activities
env.OnActivity(CreateRunnerActivity, mock.Anything).Return(&models.Runner{ID: "123"}, nil)
// Execute workflow
env.ExecuteWorkflow(RunnerProvisioningWorkflow, params)
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
}
API Patterns
Service Options
// Use functional options pattern
type Service struct {
db *db.Queries
temporal client.Client
logger *slog.Logger
}
type Option func(*Service)
func WithDB(db *db.Queries) Option {
return func(s *Service) {
s.db = db
}
}
func NewService(opts ...Option) *Service {
s := &Service{
logger: slog.Default(),
}
for _, opt := range opts {
opt(s)
}
return s
}
Error Handling
// Define domain errors
var (
ErrProjectNotFound = errors.New("project not found")
ErrUnauthorized = errors.New("unauthorized")
)
// Wrap errors with context
if err != nil {
return fmt.Errorf("fetch project %s: %w", projectID, err)
}
// Check errors
if errors.Is(err, ErrProjectNotFound) {
return connect.NewError(connect.CodeNotFound, err)
}
Debugging
Local Debugging
# Enable debug logging
export LOG_LEVEL=debug
# Run with delve
dlv debug cmd/engine/main.go
# Attach to running process
dlv attach $(pgrep engine)
Temporal UI
# View workflows
open https://temporal.tenki.lab
# List workflows via CLI
temporal workflow list --query 'WorkflowType="RunnerProvisioningWorkflow"'
# Describe workflow
temporal workflow describe -w <workflow-id>
Database Queries
# Connect to database
dev exec postgres psql -U postgres tenki
# Useful queries
SELECT * FROM runners WHERE created_at > NOW() - INTERVAL '1 hour';
SELECT COUNT(*) FROM workflow_runs GROUP BY status;
Performance Tips
- Use prepared statements - sqlc does this automatically
- Batch operations - Use CopyFrom for bulk inserts
- Connection pooling - Configure in engine.yaml
- Context cancellation - Always respect context.Done()
- Concurrent operations - Use errgroup for parallel work
Common Patterns
Repository Pattern
type RunnerRepository interface {
Create(ctx context.Context, runner *Runner) error
GetByID(ctx context.Context, id string) (*Runner, error)
List(ctx context.Context, filter Filter) ([]*Runner, error)
}
Builder Pattern
query := NewQueryBuilder().
Where("status", "active").
OrderBy("created_at", "DESC").
Limit(10).
Build()
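The QueryBuilder above is shown only at its call site. A minimal sketch of what such a fluent builder can look like — illustrative only; the method set and SQL rendering of the real implementation may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// QueryBuilder accumulates clauses and renders them in Build.
type QueryBuilder struct {
	wheres  []string
	args    []any
	orderBy string
	limit   int
}

func NewQueryBuilder() *QueryBuilder { return &QueryBuilder{} }

// Where adds an equality predicate using positional placeholders.
func (b *QueryBuilder) Where(col string, val any) *QueryBuilder {
	b.args = append(b.args, val)
	b.wheres = append(b.wheres, fmt.Sprintf("%s = $%d", col, len(b.args)))
	return b
}

func (b *QueryBuilder) OrderBy(col, dir string) *QueryBuilder {
	b.orderBy = col + " " + dir
	return b
}

func (b *QueryBuilder) Limit(n int) *QueryBuilder {
	b.limit = n
	return b
}

// Build renders the clause fragment plus its bind arguments; a real builder
// would also take the table name and validate column identifiers.
func (b *QueryBuilder) Build() (string, []any) {
	var sb strings.Builder
	if len(b.wheres) > 0 {
		sb.WriteString("WHERE " + strings.Join(b.wheres, " AND "))
	}
	if b.orderBy != "" {
		sb.WriteString(" ORDER BY " + b.orderBy)
	}
	if b.limit > 0 {
		sb.WriteString(fmt.Sprintf(" LIMIT %d", b.limit))
	}
	return sb.String(), b.args
}

func main() {
	query, args := NewQueryBuilder().
		Where("status", "active").
		OrderBy("created_at", "DESC").
		Limit(10).
		Build()
	fmt.Println(query, args)
}
```

Each method returns the receiver, which is what makes the chained call style possible.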
Middleware Pattern
func LoggingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
slog.Info("request", "method", r.Method, "path", r.URL.Path, "duration", time.Since(start))
})
}
Resources
Frontend Development Guide
Last updated: 2025-06-12
Overview
The Tenki frontend is built with Next.js 15, React 19, and TypeScript. We use tRPC for type-safe API communication, Tailwind CSS for styling, and Radix UI for accessible components.
Tech Stack
- Framework: Next.js 15 (App Router)
- Language: TypeScript
- Styling: Tailwind CSS + Radix UI
- State: React Context + Zustand
- API: tRPC
- Forms: React Hook Form + Zod
- Testing: Jest + React Testing Library
Project Structure
apps/app/
├── src/
│   ├── app/              # Next.js app router pages
│   │   ├── (dashboard)/  # Protected routes
│   │   ├── auth/         # Auth pages
│   │   └── api/          # API routes
│   ├── components/       # Reusable components
│   ├── hooks/            # Custom hooks
│   ├── server/           # Server-side code
│   │   └── api/          # tRPC routers
│   ├── trpc/             # tRPC client setup
│   └── utils/            # Utilities
├── public/               # Static assets
└── next.config.mjs       # Next.js config
Development Workflow
Running the Frontend
# Start all services (recommended)
pnpm dev
# Or just the frontend
pnpm -F app dev
# Access at
open https://app.tenki.lab:4001
Creating Components
// components/project-card.tsx
interface ProjectCardProps {
project: Project;
onSelect?: (project: Project) => void;
}
export function ProjectCard({ project, onSelect }: ProjectCardProps) {
return (
<Card onClick={() => onSelect?.(project)} className="cursor-pointer transition-shadow hover:shadow-lg">
<CardHeader>
<CardTitle>{project.name}</CardTitle>
</CardHeader>
<CardContent>
<p className="text-muted-foreground">{project.description}</p>
</CardContent>
</Card>
);
}
Using tRPC
// In a client component
"use client";
import { trpc } from "@/trpc/client";
export function ProjectList() {
  const utils = trpc.useUtils(); // needed for cache invalidation below
  const { data: projects, isLoading } = trpc.project.list.useQuery();
const createProject = trpc.project.create.useMutation({
onSuccess: () => {
// Invalidate and refetch
utils.project.list.invalidate();
},
});
if (isLoading) return <Skeleton />;
return (
<div>
{projects?.map((project) => (
<ProjectCard key={project.id} project={project} />
))}
</div>
);
}
Creating tRPC Routes
// server/api/routers/project.ts
export const projectRouter = createTRPCRouter({
list: protectedProcedure.query(async ({ ctx }) => {
return ctx.db.project.findMany({
where: { workspaceId: ctx.session.workspaceId },
});
}),
create: protectedProcedure
.input(
z.object({
name: z.string().min(1),
description: z.string().optional(),
}),
)
.mutation(async ({ ctx, input }) => {
return ctx.db.project.create({
data: {
...input,
workspaceId: ctx.session.workspaceId,
},
});
}),
});
Styling Guidelines
Using Tailwind
// Use semantic color classes
<div className="bg-background text-foreground">
<button className="bg-primary text-primary-foreground hover:bg-primary/90">
Click me
</button>
</div>
// Responsive design
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
{/* Grid items */}
</div>
// Dark mode support (automatic)
<div className="bg-white dark:bg-gray-900">
Content adapts to theme
</div>
Component Composition
// Use Radix UI primitives
import * as Dialog from "@radix-ui/react-dialog";
export function CreateProjectDialog() {
return (
<Dialog.Root>
<Dialog.Trigger asChild>
<Button>Create Project</Button>
</Dialog.Trigger>
<Dialog.Portal>
<Dialog.Overlay className="fixed inset-0 bg-black/50" />
<Dialog.Content className="bg-background fixed top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2 rounded-lg p-6">
<Dialog.Title>Create Project</Dialog.Title>
{/* Form content */}
</Dialog.Content>
</Dialog.Portal>
</Dialog.Root>
);
}
State Management
Local State
// For simple component state
const [isOpen, setIsOpen] = useState(false);
Context for Feature State
// contexts/project-context.tsx
const ProjectContext = createContext<ProjectContextType | null>(null);
export function ProjectProvider({ children }: { children: ReactNode }) {
const [selectedProject, setSelectedProject] = useState<Project | null>(null);
return <ProjectContext.Provider value={{ selectedProject, setSelectedProject }}>{children}</ProjectContext.Provider>;
}
export function useProject() {
const context = useContext(ProjectContext);
if (!context) throw new Error("useProject must be used within ProjectProvider");
return context;
}
Global State with Zustand
// stores/user-preferences.ts
import { create } from "zustand";
interface PreferencesStore {
theme: "light" | "dark" | "system";
setTheme: (theme: PreferencesStore["theme"]) => void;
}
export const usePreferences = create<PreferencesStore>((set) => ({
theme: "system",
setTheme: (theme) => set({ theme }),
}));
Forms
With React Hook Form + Zod
const ProjectSchema = z.object({
name: z.string().min(1, "Name is required"),
description: z.string().optional(),
isPublic: z.boolean().default(false),
});
type ProjectForm = z.infer<typeof ProjectSchema>;
export function CreateProjectForm() {
const form = useForm<ProjectForm>({
resolver: zodResolver(ProjectSchema),
defaultValues: {
name: "",
isPublic: false,
},
});
const onSubmit = async (data: ProjectForm) => {
await createProject.mutateAsync(data);
};
return (
<Form {...form}>
<form onSubmit={form.handleSubmit(onSubmit)}>
<FormField
control={form.control}
name="name"
render={({ field }) => (
<FormItem>
<FormLabel>Project Name</FormLabel>
<FormControl>
<Input {...field} />
</FormControl>
<FormMessage />
</FormItem>
)}
/>
<Button type="submit">Create</Button>
</form>
</Form>
);
}
Testing
Component Tests
// __tests__/project-card.test.tsx
import { ProjectCard } from "@/components/project-card";
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
describe("ProjectCard", () => {
it("displays project information", () => {
const project = { id: "1", name: "Test Project", description: "Test" };
render(<ProjectCard project={project} />);
expect(screen.getByText("Test Project")).toBeInTheDocument();
expect(screen.getByText("Test")).toBeInTheDocument();
});
it("calls onSelect when clicked", async () => {
const onSelect = jest.fn();
const project = { id: "1", name: "Test Project" };
render(<ProjectCard project={project} onSelect={onSelect} />);
await userEvent.click(screen.getByRole("article"));
expect(onSelect).toHaveBeenCalledWith(project);
});
});
Running Tests
# Run all tests
pnpm test
# Watch mode
pnpm test:watch
# With coverage
pnpm test:coverage
Performance
Image Optimization
import Image from "next/image";
<Image
src="/logo.png"
alt="Logo"
width={200}
height={50}
priority // For above-the-fold images
/>;
Code Splitting
// Dynamic imports for heavy components
const HeavyChart = dynamic(() => import("@/components/heavy-chart"), {
loading: () => <Skeleton className="h-96" />,
ssr: false, // Disable SSR for client-only components
});
Data Fetching
// Server component (default in app router)
async function ProjectPage({ params }: { params: { id: string } }) {
const project = await api.project.get({ id: params.id });
return <ProjectDetails project={project} />;
}
// Parallel data fetching
async function DashboardPage() {
const [projects, stats] = await Promise.all([api.project.list(), api.stats.get()]);
return (
<>
<StatsCard stats={stats} />
<ProjectList projects={projects} />
</>
);
}
Common Patterns
Error Boundaries
export function ProjectErrorBoundary({ children }: { children: ReactNode }) {
return (
<ErrorBoundary
fallback={
<Alert variant="destructive">
<AlertTitle>Something went wrong</AlertTitle>
<AlertDescription>Unable to load projects. Please try again.</AlertDescription>
</Alert>
}
>
{children}
</ErrorBoundary>
);
}
Loading States
export function ProjectListSkeleton() {
return (
<div className="space-y-4">
{Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-24" />
))}
</div>
);
}
Accessibility
// Always include ARIA labels
<button
aria-label="Delete project"
onClick={handleDelete}
>
<TrashIcon />
</button>
// Keyboard navigation
<div
role="button"
tabIndex={0}
onKeyDown={(e) => {
if (e.key === 'Enter' || e.key === ' ') {
handleClick();
}
}}
>
Interactive element
</div>
Debugging
React DevTools
- Install React Developer Tools extension
- Use Components tab to inspect props/state
- Use Profiler tab for performance analysis
tRPC DevTools
// Automatically included in development
// View network tab for tRPC requests
// Check request/response payloads
Common Issues
Hydration Errors
// Ensure client/server render match
{typeof window !== "undefined" && <ClientOnlyComponent />}
State Not Updating
// Use callbacks for state depending on previous
setItems((prev) => [...prev, newItem]);
Resources
Frontend Testing
Unit Tests
- Unit tests are colocated with the code they test. Good examples of test files can be found in the /apps/app/src/utils/__tests__ directory, which contains unit tests for the files in the /apps/app/src/utils/ folder.
- Additional examples of unit test files exist throughout the frontend codebase. They have a .test.{ts,tsx} extension and are sometimes located in __tests__ directories.
Unit Test Approach
- Implemented using vitest
- Add as many unit tests as possible, especially for pure functions and complex business logic that can be tested independently without relying on extensive mocking and external dependencies.
- Prioritize testing different properties and scenarios to catch easy-to-miss edge cases instead of only following the happy path with a few examples.
Running Unit Tests
- Run pnpm test:unit to run all unit tests
- Run pnpm test:unit:coverage to run all unit tests and get a coverage report in the terminal
Frontend Test Cases Guide
THIS DOCUMENT IS STILL A WIP…
The test-cases directory inside apps/app contains structured test specifications that define the expected behavior of our application. These specifications serve as a bridge between product requirements and automated tests.
Overview
The test cases are defined in JSON files and follow a strict schema (defined in schema.json). Each test case is identified by a unique ID and contains detailed information about what needs to be tested.
File Structure
- schema.json - Defines the structure and validation rules for test case specifications
- onboarding.spec.json - Test specifications for user onboarding flows
- Additional .spec.json files for other features
Test Case Schema
Each test case follows this structure:
{
"TEST-001": {
"title": "Test case title",
"priority": "P0",
"preconditions": ["List of conditions that must be met"],
"steps": ["Step-by-step test instructions"],
"acceptance_criteria": ["List of criteria that must be met"]
}
}
Fields Explained
- title: A descriptive name for the test case
- priority: Importance level (P0-P3)
  - P0: Critical path, must not break and must be covered by automated tests
  - P1: Core functionality
  - P2: Important but not critical
  - P3: Nice to have
- preconditions: Required setup or state before running the test
- steps: Detailed test steps
- acceptance_criteria: What must be true for the test to pass
Priority Levels
- P0: Critical business flows (e.g., user registration, login)
- P1: Core features that significantly impact user experience
- P2: Secondary features that enhance user experience
- P3: Edge cases and nice-to-have features
Updating Test Cases
1. Adding a New Test Case:
   - Choose an appropriate spec file (or create a new one for new features)
   - Add a new entry with a unique ID (format: XXX-###)
   - Fill in all required fields according to the schema
   - Validate against schema.json
2. Modifying Existing Test Cases:
   - Update the relevant fields
   - Ensure changes are reflected in the corresponding automated tests
   - Keep the test ID unchanged
3. Best Practices:
   - Keep steps clear and actionable
   - Write acceptance criteria that can be automated
   - Include edge cases and error scenarios
   - Document dependencies between test cases
Integration with Automated Tests
The test specifications in this directory serve as a source of truth for our automated tests. The relationship works as follows:
- Test specs define WHAT needs to be tested
- Automated tests implement HOW to test it
- Automated tests written for a test case should reference its corresponding test case ID
Example:
describe("ONB-001: User Registration - with email", () => {
it("should complete registration flow successfully", async () => {
// Test implementation
});
});
Maintaining Test Coverage
- Every new feature should have corresponding test cases
- Test cases should be reviewed along with code changes
- Regular audits ensure test coverage matches specifications
- Update or deprecate test cases when features change
Database Guide
Overview
This guide covers database development practices for Tenki Cloud, including schema management, migrations, and query patterns.
Database Stack
- PostgreSQL: Primary database
- sqlc: Type-safe SQL query generation
- golang-migrate: Database migration management
Schema Management
Migrations
All database schema changes must be made through migrations:
# Create a new migration
make migration name=add_user_settings
# Run migrations
make migrate-up
# Rollback last migration
make migrate-down
Best Practices
- Always include both up and down migrations
- Keep migrations small and focused
- Test rollbacks before merging
- Never modify existing migrations
Query Development
We use sqlc for type-safe database queries:
Writing Queries
1. Add queries to pkg/db/queries/*.sql
2. Use named parameters: @param_name
3. Follow naming conventions:
   - GetUserByID for a single row
   - ListUsersByOrg for multiple rows
   - CreateUser for inserts
   - UpdateUser for updates
   - DeleteUser for deletes
Generating Code
# Generate Go code from SQL
make sqlc
Performance
Indexing
- Add indexes for frequently queried columns
- Use composite indexes for multi-column queries
- Monitor slow query logs
Query Optimization
- Use EXPLAIN ANALYZE for query planning
- Avoid N+1 queries
- Batch operations when possible
- Use database views for complex queries
Testing
Unit Tests
- Mock database interfaces
- Test query logic separately from business logic
Integration Tests
- Use test database containers
- Clean up test data after each test
- Test migration up/down paths
Testing Guide
This guide covers the testing strategies and patterns used in Tenki Cloud, with a focus on writing effective tests for backend services, particularly those using Temporal workflows.
Overview
Tenki Cloud uses a comprehensive testing approach that includes:
- Unit Tests: Fast, isolated tests using mocks to verify business logic
- Integration Tests: End-to-end tests running in a real environment
- Table-Driven Tests: Systematic approach for testing multiple scenarios
- BDD-Style Tests: Behavior-driven tests using Ginkgo/Gomega
Testing Stack
Core Libraries
// Unit Testing
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/stretchr/testify/mock"
)
// Integration Testing
import (
"github.com/onsi/ginkgo/v2"
"github.com/onsi/gomega"
)
// Temporal Testing
import (
"go.temporal.io/sdk/testsuite"
)
Project Structure
internal/domain/{domain}/
├── service/              # Business logic
├── db/                   # Database queries (sqlc generated)
├── interface.go          # Service interfaces
├── mock_*.go             # Generated mocks
└── worker/               # Temporal workers
    ├── activities/       # Temporal activities
    │   ├── *.go          # Activity implementations
    │   └── *_test.go     # Activity unit tests
    ├── workflows/        # Temporal workflows
    │   ├── *.go          # Workflow implementations
    │   └── *_test.go     # Workflow unit tests
    └── integration_*.go  # Integration tests
Unit Testing
Activity Testing
Activities should be tested with mocked dependencies to ensure business logic correctness.
Basic Pattern
func TestActivities_GetRunnerInstallation(t *testing.T) {
t.Parallel()
tests := []struct {
name string
installationId int64
mockResponse *connect.Response[runnerproto.GetRunnerInstallationResponse]
mockError error
expectedResult *runnerproto.RunnerInstallation
expectErr bool
}{
{
name: "success",
installationId: 1234,
mockResponse: connect.NewResponse(&runnerproto.GetRunnerInstallationResponse{
RunnerInstallation: &runnerproto.RunnerInstallation{
Id: "abc123",
},
}),
expectedResult: &runnerproto.RunnerInstallation{Id: "abc123"},
},
{
name: "service error",
installationId: 1234,
mockError: connect.NewError(connect.CodeInternal, nil),
expectErr: true,
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
// Setup mock
svc := &runner.MockService{}
svc.On("GetRunnerInstallation", mock.Anything, mock.Anything).
Return(tc.mockResponse, tc.mockError)
// Create activities with mock
a := newTestActivities(svc, t)
// Execute
result, err := a.GetRunnerInstallation(context.Background(), tc.installationId)
// Assert
if tc.expectErr {
assert.Error(t, err)
assert.Nil(t, result)
} else {
assert.NoError(t, err)
assert.Equal(t, tc.expectedResult, result)
}
})
}
}
Testing with Complex Arguments
// Use MatchedBy for complex argument validation
svc.On("UpdateRunners", mock.Anything,
mock.MatchedBy(func(req *connect.Request[runnerproto.UpdateRunnersRequest]) bool {
return assert.ElementsMatch(t, req.Msg.Ids, expectedIds) &&
assert.Equal(t, req.Msg.State, expectedState)
})).Return(nil, nil)
Test Helper Functions
Create reusable test helpers to reduce boilerplate:
func newTestActivities(svc runner.Service, t *testing.T) *activities {
logger := log.NewTestLogger(t)
sr := trace.NewSpanRecorder()
tracer, _ := trace.NewTestTracer(sr)
return &activities{
logger: logger,
svc: svc,
tracer: tracer,
}
}
Workflow Testing
Workflows require mocking activities since they orchestrate multiple operations.
Basic Workflow Test
func TestGithubJobWorkflow(t *testing.T) {
var ts testsuite.WorkflowTestSuite
t.Run("happy path", func(t *testing.T) {
env := ts.NewTestWorkflowEnvironment()
// Register activities with stubs
env.RegisterActivityWithOptions(stubFunc,
temporal.RegisterOptions{Name: runner.GithubJobWorkflowActivity})
// Mock activity responses
env.OnActivity(runner.GithubJobWorkflowActivity, mock.Anything, mock.Anything).
Return(nil, nil)
// Execute workflow
event := github.WorkflowJobEvent{
Action: github.String("completed"),
Installation: &github.Installation{ID: github.Int64(123)},
}
env.ExecuteWorkflow((&workflows{}).GithubJobWorkflow, event)
// Assert completion
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
})
}
Testing Retry Logic
t.Run("retry on transient error", func(t *testing.T) {
env := ts.NewTestWorkflowEnvironment()
callCount := 0
env.OnActivity(runner.SomeActivity, mock.Anything, mock.Anything).
Return(func(context.Context, interface{}) error {
callCount++
if callCount < 3 {
return errors.New("transient error")
}
return nil
})
env.ExecuteWorkflow(workflow, input)
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
assert.Equal(t, 3, callCount)
})
Integration Testing
Integration tests verify the entire system working together with real dependencies.
Setup with Ginkgo
Test Suite Entry Point
//go:build integration
func TestIntegration(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Runner Worker Integration Tests")
}
Suite Configuration
var _ = BeforeSuite(func() {
// Start Temporal dev server
cmd := exec.Command("temporal", "server", "start-dev",
"--port", "7233",
"--ui-port", "8233",
"--db-filename", filepath.Join(tempDir, "temporal.db"))
// Initialize global dependencies
initializeDatabase()
initializeTracing()
})
var _ = AfterSuite(func() {
// Clean up
stopTemporalServer()
closeDatabase()
})
var _ = BeforeEach(func() {
// Start transaction for test isolation
tx = db.BeginTx()
// Create service instances
runnerService = createRunnerService(tx)
// Start worker
worker = temporal.NewWorker(client, taskQueue, temporal.WorkerOptions{})
temporal.RegisterWorkflows(worker)
temporal.RegisterActivities(worker, activities)
worker.Start()
})
var _ = AfterEach(func() {
// Rollback transaction
tx.Rollback()
// Stop worker
worker.Stop()
})
Writing Integration Tests
var _ = Describe("Runner Installation", func() {
Context("when installing runners", func() {
It("should install runner successfully", func() {
// Start workflow
workflowId := fmt.Sprintf("test-install-%s", uuid.New())
run, err := temporalClient.ExecuteWorkflow(
context.Background(),
client.StartWorkflowOptions{
ID: workflowId,
TaskQueue: runner.TaskQueue,
},
runner.RunnerInstallWorkflow,
installationId,
)
Expect(err).ToNot(HaveOccurred())
// Trigger installation via service
_, err = runnerService.InstallRunners(ctx, connect.NewRequest(
&runnerproto.InstallRunnersRequest{
InstallationId: installationId,
WorkspaceId: workspaceId,
},
))
Expect(err).ToNot(HaveOccurred())
// Send signal to workflow
err = temporalClient.SignalWorkflow(
context.Background(),
workflowId,
"",
runner.InstallSignal,
runner.InstallSignalPayload{},
)
Expect(err).ToNot(HaveOccurred())
// Wait for expected state
Eventually(func() string {
ins, err := runnerService.GetRunnerInstallation(ctx, req)
if err != nil || ins == nil {
return ""
}
return ins.Msg.RunnerInstallation.State
}, 30*time.Second, 1*time.Second).Should(Equal("active"))
// Verify final state
var result runner.RunnerInstallWorkflowResult
err = run.Get(context.Background(), &result)
Expect(err).ToNot(HaveOccurred())
Expect(result.Success).To(BeTrue())
})
})
})
Testing Patterns & Best Practices
1. Table-Driven Tests
Use table-driven tests to cover multiple scenarios systematically:
tests := []struct {
name string
input string
want string
wantErr bool
errMsg string
}{
{
name: "valid input",
input: "test",
want: "TEST",
},
{
name: "empty input",
input: "",
wantErr: true,
errMsg: "input cannot be empty",
},
}
2. Mock Best Practices
- Mock at interface boundaries
- Use mock.MatchedBy for complex argument matching
- Verify mock expectations when needed: defer svc.AssertExpectations(t)
3. Test Isolation
- Each test should be independent
- Use database transactions with rollback
- Clean up created resources
- Reset global state between tests
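The transaction-rollback point can be illustrated without a real database. In the sketch below, fakeTx is a stand-in for *sql.Tx (not a real driver): writes are buffered and discarded on rollback, so nothing a test body does survives into the next test.

```go
package main

import "fmt"

// fakeTx mimics the small slice of *sql.Tx behavior needed here: writes are
// buffered and discarded on rollback, leaving a clean slate for the next test.
type fakeTx struct {
	writes []string
}

func (tx *fakeTx) Exec(q string) { tx.writes = append(tx.writes, q) }
func (tx *fakeTx) Rollback()     { tx.writes = nil }

// runIsolated executes a test body inside a transaction and always rolls it
// back afterwards, the same shape as BeginTx in BeforeEach and Rollback in
// AfterEach with a real database.
func runIsolated(body func(tx *fakeTx)) {
	tx := &fakeTx{}
	defer tx.Rollback()
	body(tx)
}

func main() {
	runIsolated(func(tx *fakeTx) {
		tx.Exec("INSERT INTO runners (id) VALUES ('r1')")
		fmt.Println("pending writes:", len(tx.writes))
	})
}
```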
4. Async Testing
Use Eventually for testing async operations:
Eventually(func() bool {
// Check condition
return conditionMet
}, timeout, interval).Should(BeTrue())
5. Error Testing
Always test both success and failure paths:
{
name: "network error",
mockError: errors.New("connection refused"),
expectErr: true,
},
{
name: "timeout error",
mockError: context.DeadlineExceeded,
expectErr: true,
},
6. Test Naming
Use descriptive test names that explain the scenario:
t.Run("returns error when installation not found", func(t *testing.T) {
// test
})
7. Tracing in Tests
Verify tracing behavior when applicable:
sr := trace.NewSpanRecorder()
tracer, _ := trace.NewTestTracer(sr)
// After execution
spans := sr.Ended()
assert.Len(t, spans, 1)
assert.Equal(t, "OperationName", spans[0].Name())
assert.Equal(t, codes.Ok, spans[0].Status().Code)
Common Testing Scenarios
Testing Database Operations
func TestDatabaseOperation(t *testing.T) {
// Use test database
db := setupTestDatabase(t)
defer cleanupDatabase(db)
// Create queries
queries := runnerdb.New(db)
// Test operation
err := queries.CreateRunner(context.Background(), params)
require.NoError(t, err)
// Verify
runner, err := queries.GetRunner(context.Background(), id)
require.NoError(t, err)
assert.Equal(t, expectedName, runner.Name)
}
Testing Kubernetes Operations
func TestKubernetesOperation(t *testing.T) {
// Create fake client
objects := []runtime.Object{
&corev1.Namespace{
ObjectMeta: metav1.ObjectMeta{Name: "test"},
},
}
k8sClient := fake.NewSimpleClientset(objects...)
// Test operation
err := createDeployment(k8sClient, namespace, deployment)
require.NoError(t, err)
// Verify
deploy, err := k8sClient.AppsV1().Deployments(namespace).Get(
context.Background(), name, metav1.GetOptions{})
require.NoError(t, err)
assert.Equal(t, expectedReplicas, *deploy.Spec.Replicas)
}
Testing External API Calls
func TestExternalAPI(t *testing.T) {
// Create mock HTTP server
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
assert.Equal(t, "/api/v1/resource", r.URL.Path)
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(expectedResponse)
}))
defer server.Close()
// Test with mock server URL
client := NewAPIClient(server.URL)
result, err := client.GetResource(context.Background(), "id")
require.NoError(t, err)
assert.Equal(t, expectedResponse, result)
}
Running Tests
Unit Tests
# Run all unit tests
gotest
# Run specific package tests
cd backend && go test ./internal/domain/runner/...
# Run with coverage
cd backend && go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run specific test
cd backend && go test -run TestActivities_GetRunnerInstallation ./...
Integration Tests
# Ensure services are running
dev up
# Run all integration tests
gotest-integration
# Run specific integration test suite
cd backend && ginkgo -v ./internal/domain/runner/worker/
Continuous Integration
Tests should be part of your CI pipeline:
test:
script:
- gotest
- gotest-integration
coverage: '/coverage: \d+\.\d+%/'
Debugging Tests
Verbose Output
go test -v ./...
Focus on Specific Tests (Ginkgo)
FIt("should focus on this test", func() {
// This test will run exclusively
})
Debug Logging
logger := log.NewTestLogger(t)
logger.Debug("test state", "value", someValue)
Test Timeouts
func TestLongRunning(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Use ctx for operations
}
Summary
Effective testing in Tenki Cloud requires:
- Clear separation between unit and integration tests
- Proper use of mocks for isolation
- Table-driven tests for comprehensive coverage
- Integration tests for end-to-end validation
- Consistent patterns across the codebase
Follow these patterns to ensure your code is well-tested, maintainable, and reliable.
Release System
Tenki Cloud uses a custom release system designed for polyglot monorepos, handling both TypeScript/Node.js applications and Go binaries seamlessly.
Overview
The release system automates version management, changelog generation, and artifact building across all components in the monorepo. It provides a developer-friendly workflow similar to Changesets but with full support for Go modules and Docker deployments.
Key Features
- Polyglot Support: Handles both Node.js packages and Go binaries
- Shared Go Versioning: All Go binaries use coordinated versions
- Deployment Awareness: Different strategies for Docker vs binary deployment
- Automatic PR Management: Creates and updates Release PRs
- GitHub Integration: Native releases with artifact uploads
- Developer-Friendly CLI: Interactive changelog creation
Quick Start
Creating a Release
1. Create a changelog:
   changelog add          # Interactive with fzf (if available)
   changelog add --empty  # Empty changelog for internal changes
2. Commit and push:
   git add .releases/your-changelog.md
   git commit -m "feat: add new feature"
   git push origin main
3. Review the Release PR that gets created automatically
4. Merge the Release PR to trigger the release
Checking Status
changelog status
Components
The release system manages these components:
Frontend Applications
- @tenki/app → Docker image: app:vX.Y.Z
- @tenki/sentinel → Docker image: sentinel:vX.Y.Z
Go Services (Docker)
- @tenki/engine → Docker image: engine:vX.Y.Z
- @tenki/github-proxy → Docker image: github-proxy:vX.Y.Z
Go Binaries (Direct Deployment)
- @tenki/cli → Binary releases: tenki-cli-vX.Y.Z-{os}-{arch}
- @tenki/node-agent → Binary releases: node-agent-vX.Y.Z-{os}-{arch}
- @tenki/vm-agent → Binary releases: vm-agent-vX.Y.Z-{os}-{arch}
Changelog Format
Changelog files use YAML frontmatter to specify affected packages and version bump types:
---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": major
---
Add new authentication features
- **Frontend**: Added MFA support with TOTP
- **Engine**: Fixed token refresh race condition
- **CLI**: Breaking change: new login command structure
This release improves security and fixes several authentication issues.
Version Bump Types
- patch (0.0.X): Bug fixes, small improvements
- minor (0.X.0): New features, backwards compatible
- major (X.0.0): Breaking changes
Go Binary Versioning
All Go binaries share the same version from backend/go.mod. When any Go binary is updated, all Go binaries receive the same version bump using the highest bump type among them.
Example: If @tenki/cli needs a patch and @tenki/engine needs a minor, all Go binaries get a minor bump.
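The "highest bump wins" rule amounts to taking the maximum over an ordered set of bump types. A small illustrative helper — this is a sketch of the rule, not the actual release tooling:

```go
package main

import "fmt"

// bumpRank orders semver bump types so the most significant bump wins.
var bumpRank = map[string]int{"patch": 1, "minor": 2, "major": 3}

// highestBump returns the most significant bump among those requested;
// with no requests it defaults to patch.
func highestBump(bumps []string) string {
	best := "patch"
	for _, b := range bumps {
		if bumpRank[b] > bumpRank[best] {
			best = b
		}
	}
	return best
}

func main() {
	// @tenki/cli wants a patch, @tenki/engine wants a minor:
	fmt.Println(highestBump([]string{"patch", "minor"})) // minor
}
```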
Workflow
1. Changelog Detection
- Trigger: .releases/*.md files pushed to main
- Action: Parses changelog files, determines version bumps
- Result: Creates or updates Release PR
2. Release PR
- Contains: Version bumps for all affected components
- Updates: Individual CHANGELOG.md files
- Shows: Artifacts that will be built
- Cleanup: Deletes temporary changelog files
3. Release Automation
- Trigger: Release PR merged to main
- Actions:
- Creates Git tags
- Builds Docker images
- Builds cross-platform binaries
- Creates GitHub release with artifacts
CLI Commands
Interactive Changelog Creation
changelog add
Guides you through:
- Selecting affected packages
- Choosing version bump types
- Writing changelog content
Status Check
changelog status
Shows:
- Pending changelog files
- Existing Release PRs
- Current component versions
File Structure
.releases/
├── config.json  # Release configuration
└── *.md         # Temporary changelog files
# Individual changelogs
apps/app/CHANGELOG.md
apps/sentinel/CHANGELOG.md
backend/cmd/engine/CHANGELOG.md
backend/cmd/tenki-cli/CHANGELOG.md
backend/cmd/node-agent/CHANGELOG.md
backend/cmd/github-proxy/CHANGELOG.md
backend/cmd/vm-agent/CHANGELOG.md
GitHub Actions
Changelog Detection
File: .github/workflows/changelog-detection.yml
- Trigger: Push to main with `.releases/*.md` changes
- Action: Processes changelogs and creates Release PR
Release Automation
File: .github/workflows/release.yml
- Trigger: Release PR merged to main
- Actions: Creates tags, builds artifacts, publishes releases
Configuration
The system is configured in .releases/config.json:
{
"packages": {
"@tenki/app": {
"path": "apps/app",
"type": "node",
"changelog": "apps/app/CHANGELOG.md",
"version_file": "apps/app/package.json",
"deployment": "docker"
},
"@tenki/engine": {
"path": "backend/cmd/engine",
"type": "go-binary",
"changelog": "backend/cmd/engine/CHANGELOG.md",
"version_file": "backend/cmd/engine/VERSION",
"binary_name": "engine",
"deployment": "docker",
"docker": {
"component": "engine",
"dockerfile": "backend/cmd/engine/Dockerfile",
"context": "backend"
}
}
},
"release_branch": "release/next",
"release_pr_title": "chore(release): version packages [skip ci]",
"commit_message": "chore(release): version packages [skip ci]"
}
Best Practices
Changelog Writing
- One changelog per logical change - Don't combine unrelated features
- Clear descriptions - Explain what changed and why
- User-focused content - Write for end users, not developers
- Appropriate bump types - Follow semantic versioning strictly
Release Management
- Review Release PRs carefully - Verify versions and changelog entries
- Test before merging - Ensure all CI checks pass
- Coordinate deployments - Plan releases during appropriate windows
- Monitor releases - Watch for issues after deployment
Package Dependencies
- Shared package changes go in consuming app changelogs
- No separate changelogs for `packages/*` directories
- Document impact where users will see the changes
Troubleshooting
Release PR Not Created
- Check changelog format - Ensure YAML frontmatter is correct
- Verify file location - Files must be in the `.releases/` directory
- Check GitHub Actions - Review workflow logs for errors
Build Failures
- Run tests locally - Ensure all tests pass before merging
- Check Docker configs - Verify Dockerfile and build contexts
- Validate Go modules - Ensure `go.mod` is properly formatted
Version Conflicts
- Understand versioning - Go binaries share versions, Node.js apps are independent
- Check existing versions - Use `changelog status` to see current state
- Review Release PR - Verify calculated versions are correct
Examples
Simple Bug Fix
---
"@tenki/app": patch
---
Fix authentication token refresh issue
- Fixed race condition in token refresh logic
- Improved error handling for expired tokens
New Feature with Breaking Change
---
"@tenki/app": minor
"@tenki/cli": major
"@tenki/engine": minor
---
Add workspace management features
- **App**: New workspace dashboard with team management
- **CLI**: Breaking change: `tenki workspace` command restructured
- **Engine**: Added workspace isolation and resource quotas
Multi-Component Update
---
"@tenki/app": minor
"@tenki/engine": minor
"@tenki/node-agent": patch
---
Improve runner monitoring and management
- **App**: Added real-time runner status dashboard
- **Engine**: Implemented auto-scaling for custom images
- **Node Agent**: Fixed memory leak in status reporting
Release Quick Reference
Quick reference for the Tenki Cloud release system.
Commands
# Create new changelog (interactive with fzf if available)
changelog add
# Create empty changelog for internal changes
changelog add --empty
# Check status
changelog status
# Show help
changelog help
Changelog Format
---
"@tenki/app": minor
"@tenki/engine": patch
---
Brief description of changes
- Detailed change 1
- Detailed change 2
Components
| Component | Type | Deployment | Output |
|---|---|---|---|
| `@tenki/app` | Node.js | Docker | `app:vX.Y.Z` |
| `@tenki/sentinel` | Node.js | Docker | `sentinel:vX.Y.Z` |
| `@tenki/engine` | Go | Docker | `engine:vX.Y.Z` |
| `@tenki/github-proxy` | Go | Docker | `github-proxy:vX.Y.Z` |
| `@tenki/cli` | Go | Binary | `tenki-cli-vX.Y.Z-{os}-{arch}` |
| `@tenki/node-agent` | Go | Binary | `node-agent-vX.Y.Z-{os}-{arch}` |
| `@tenki/vm-agent` | Go | Binary | `vm-agent-vX.Y.Z-{os}-{arch}` |
Version Bump Types
| Type | Version Change | Use Case |
|---|---|---|
| `patch` | 1.0.0 → 1.0.1 | Bug fixes, small improvements |
| `minor` | 1.0.0 → 1.1.0 | New features, backwards compatible |
| `major` | 1.0.0 → 2.0.0 | Breaking changes |
Workflow
1. Create changelog → `changelog add`
2. Commit & push → `git add .releases/*.md && git commit && git push`
3. Review Release PR → Automatically created
4. Merge Release PR → Triggers release automation
5. Artifacts built → Docker images + binaries published
File Locations
.releases/
├── config.json                     # Configuration
└── your-feature.md                 # Temporary changelog
apps/app/CHANGELOG.md # App changelog
apps/sentinel/CHANGELOG.md # Sentinel changelog
backend/cmd/engine/CHANGELOG.md # Engine changelog
backend/cmd/tenki-cli/CHANGELOG.md # CLI changelog
backend/cmd/node-agent/CHANGELOG.md # Node agent changelog
backend/cmd/github-proxy/CHANGELOG.md # GitHub proxy changelog
backend/cmd/vm-agent/CHANGELOG.md # VM agent changelog
Common Patterns
Bug Fix
---
"@tenki/app": patch
---
Fix login redirect issue
New Feature
---
"@tenki/app": minor
"@tenki/engine": minor
---
Add workspace management
Breaking Change
---
"@tenki/cli": major
---
Restructure CLI commands
Multi-Component
---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": patch
---
Improve runner monitoring
Troubleshooting
| Issue | Solution |
|---|---|
| Release PR not created | Check changelog format and GitHub Actions logs |
| Build failure | Ensure tests pass and Docker configs are correct |
| Wrong version calculated | Review frontmatter and component dependencies |
| CLI not working | Run `direnv reload` to pick up new scripts |
Go Binary Versioning
- All Go binaries share the same version from `backend/go.mod`
- Highest bump type among Go components is used for all
- Example: `cli: patch` + `engine: minor` = all Go binaries get `minor`
Deployment Guide
Last updated: 2025-06-12
Overview
Tenki Cloud uses GitOps with Flux for Kubernetes deployments. All deployments are triggered via Git commits and automatically reconciled by Flux.
Deployment Environments
| Environment | Domain | Branch | Cluster |
|---|---|---|---|
| Development | *.tenki.lab | feature/* | Local |
| Staging | *.staging.tenki.cloud | staging | tenki-staging |
| Production | *.tenki.cloud | main | tenki-prod |
Deployment Process
1. Local Development β Staging
# 1. Ensure tests pass
pnpm test
gotest
# 2. Build and push images
make docker-build
make docker-push TAG=staging-$(git rev-parse --short HEAD)
# 3. Update staging manifests
cd infra/flux/apps/staging
vim engine-deployment.yaml # Update image tag
git add .
git commit -m "deploy: engine staging-abc123"
git push
# 4. Monitor deployment
kubectl --context=staging get pods -w
flux logs -f
2. Staging β Production
# 1. Create release PR
gh pr create --base main --title "Release v1.2.3"
# 2. After approval and merge, tag release
git checkout main
git pull
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
# 3. CI/CD builds and pushes production images
# 4. Update production manifests
cd infra/flux/apps/production
# Update image tags to v1.2.3
git commit -m "deploy: production v1.2.3"
git push
# 5. Monitor rollout
kubectl --context=production rollout status deployment/engine
Service-Specific Deployments
Backend Engine
# Build
cd backend
make build-engine
# Test
make test
# Docker image
docker build -t tenki/engine:$TAG .
docker push tenki/engine:$TAG
# Update manifest
kubectl set image deployment/engine engine=tenki/engine:$TAG
Frontend App
# Build
cd apps/app
pnpm build
# Docker image
docker build -t tenki/app:$TAG .
docker push tenki/app:$TAG
# Deploy
kubectl set image deployment/app app=tenki/app:$TAG
Database Migrations
# Always run migrations before deploying new code
kubectl exec -it deploy/engine -- /app/migrate up
# Verify migrations
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "\dt"
Rollback Procedures
Quick Rollback (< 5 mins)
# 1. Rollback deployment
kubectl rollout undo deployment/engine
# 2. Verify rollback
kubectl rollout status deployment/engine
kubectl logs -l app=engine --tail=100
# 3. Rollback database if needed
kubectl exec -it deploy/engine -- /app/migrate down
GitOps Rollback
# 1. Revert commit in Git
git revert <commit-hash>
git push
# 2. Flux will automatically sync
flux reconcile source git flux-system
# 3. Monitor
watch flux get kustomizations
Health Checks
Pre-deployment
# Check cluster health
kubectl get nodes
kubectl top nodes
# Check dependencies
kubectl get pods -n default
kubectl get pvc
# Verify secrets
kubectl get secrets
During Deployment
# Watch rollout
kubectl rollout status deployment/engine -w
# Monitor pods
kubectl get pods -l app=engine -w
# Check logs
kubectl logs -f -l app=engine --tail=50
Post-deployment
# Smoke tests
curl https://api.tenki.cloud/health
curl https://app.tenki.cloud
# Check metrics
open https://grafana.tenki.cloud/d/deployment
# Run integration tests
cd backend && gotest-integration
Monitoring Deployments
Grafana Dashboards
Key Metrics to Watch
- Request rate changes
- Error rate spikes
- Response time increases
- CPU/Memory usage
- Database connections
Alerts
# Deployment alerts configured in Prometheus
- name: deployment_failed
expr: kube_deployment_status_replicas_unavailable > 0
for: 5m
- name: high_error_rate_after_deploy
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
Blue-Green Deployments
For high-risk changes:
# 1. Deploy to green environment
kubectl apply -f engine-deployment-green.yaml
# 2. Test green environment
curl https://api-green.tenki.cloud/health
# 3. Switch traffic
kubectl patch service engine -p '{"spec":{"selector":{"version":"green"}}}'
# 4. Monitor
watch 'kubectl get pods -l app=engine'
# 5. If issues, switch back
kubectl patch service engine -p '{"spec":{"selector":{"version":"blue"}}}'
Deployment Checklist
Pre-deployment
- All tests passing
- Code reviewed and approved
- Database migrations tested
- Rollback plan prepared
- Team notified in Slack
Deployment
- Images built and pushed
- Manifests updated
- Deployment monitored
- Health checks passing
- Smoke tests completed
Post-deployment
- Metrics normal
- No error spikes
- Customer reports checked
- Documentation updated
- Deployment logged
Troubleshooting
Pod Won't Start
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'
Image Pull Errors
# Check secret
kubectl get secret regcred -o yaml
# Re-create if needed
kubectl create secret docker-registry regcred \
--docker-server=registry.tenki.cloud \
--docker-username=$USER \
--docker-password=$PASS
Configuration Issues
# Check ConfigMaps
kubectl get configmap
kubectl describe configmap engine-config
# Check Secrets
kubectl get secrets
kubectl describe secret engine-secrets
CI/CD Pipeline
Our GitHub Actions pipeline:
- On PR: Run tests, build images, deploy to preview
- On merge to main: Build, tag, push to registry
- On tag: Build production images, create release
See `.github/workflows/deploy.yml` in the repository root.
Monitoring Guide
Overview
This guide covers monitoring and observability practices for Tenki Cloud operations.
Stack
- Prometheus: Metrics collection
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Tempo: Distributed tracing
- Alertmanager: Alert routing
Metrics
Application Metrics
Key metrics to monitor:
- Request rate and latency
- Error rates (4xx, 5xx)
- Database connection pool stats
- Background job queue depth
- GitHub API rate limits
Infrastructure Metrics
- CPU and memory usage
- Disk I/O and space
- Network throughput
- Container health
- Database performance
Dashboards
Available Dashboards
- Application Overview: High-level health metrics
- API Performance: Request rates, latencies, errors
- Database Health: Connections, query performance
- GitHub Integration: Runner stats, API usage
- Billing System: Transaction volumes, failures
Creating Dashboards
- Use Grafana dashboard as code
- Store dashboards in `deployments/grafana/dashboards/`
- Follow naming convention: `category-name.json`
- Include appropriate tags and metadata
Alerts
Alert Rules
Critical alerts:
- API availability < 99.9%
- Database CPU > 80%
- Disk space < 20%
- Error rate > 5%
- GitHub API rate limit < 1000
Alert Routing
- Critical: PagerDuty (immediate response)
- Warning: Slack #alerts channel
- Info: Email daily digest
Logs
Log Levels
- ERROR: Actionable errors requiring investigation
- WARN: Potential issues, degraded performance
- INFO: Important business events
- DEBUG: Detailed troubleshooting information
Structured Logging
Always use structured logging with consistent fields:
- `trace_id`: Request correlation ID
- `user_id`: User identifier
- `org_id`: Organization identifier
- `error`: Error message and stack trace
Tracing
Instrumentation
- Trace all API endpoints
- Include database queries
- Add custom spans for business logic
- Propagate trace context to external services
Sampling
- 100% sampling for errors
- 10% sampling for successful requests
- Adjust based on traffic volume
SLOs and SLIs
Service Level Indicators
- API latency (p50, p95, p99)
- Error rate
- Availability
- Database query time
Service Level Objectives
- 99.9% API availability
- p95 latency < 500ms
- Error rate < 0.1%
- Zero data loss
Manual Billing Workflow Execution
This guide provides information for manually executing billing workflows using Temporal CLI or other workflow execution tools.
Prerequisites
- Access to Temporal cluster
- Proper authentication and permissions
- Understanding of workspace IDs and billing periods
Common Parameters
All billing workflows use the following task queue:
- Task Queue: `BILLING_TASK_QUEUE`
Workflows
BillingListWorkspaceBalanceWorkflow
Retrieves invoice line items and balance information for a specific workspace and billing period.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingListWorkspaceBalanceWorkflow`
Payload Example:
{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}
Parameters:
- `workspace_id` (string): UUID of the workspace
- `billing_period` (string): Billing period in YYYY-MM format
- `billing_period_start` (time): Start of billing period (ISO 8601)
- `billing_period_end` (time): End of billing period (ISO 8601)
Expected Result:
{
"line_items": [
{
"description": "Runner Usage - tenki-standard-autoscale",
"runner_label": "tenki-standard-autoscale",
"quantity": 120,
"unit_price": 0.01,
"amount": 1.2
}
],
"total_amount": 1.2,
"timestamp": "2024-01-31T23:59:59Z"
}
BillingCycleScheduleWorkflow
Parent workflow that orchestrates billing cycles for all workspaces. Typically triggered by a scheduled cron job.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingCycleScheduleWorkflow`
Payload Example:
{
"luxor_only": false,
"exclude_workspaces": ["123e4567-e89b-12d3-a456-426614174000", "987fcdeb-51a2-43d1-b567-123456789abc"]
}
Parameters:
- `luxor_only` (boolean, optional): Filter to process only Luxor customers (`is_luxor = true`)
- `exclude_workspaces` (array of UUIDs, optional): List of workspace IDs to exclude from the billing cycle
Behavior:
- Queries all active workspaces with billing accounts
- Spawns individual `BillingCycleWorkflow` child workflows for each workspace
- Handles the current billing period automatically
BillingCycleWorkflow
Individual workspace billing processing workflow. Handles invoice generation, charging, and payment processing for a single workspace.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingCycleWorkflow`
Payload Example:
{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}
Parameters:
- `workspace_id` (UUID): The workspace to process billing for
- `billing_period` (string): Billing period in YYYY-MM format
- `billing_period_start` (time): Start of billing period (ISO 8601)
- `billing_period_end` (time): End of billing period (ISO 8601)
Workflow Steps:
- Generate Stripe invoice with line items
- Process invoice and attempt payment
- Handle TigerBeetle accounting transfers
- Create billing payment records
- Process promotional credit adjustments
- Reset monthly free credits
BillingPaymentReversalWorkflow
Reverses a payment by creating a reversal transfer in TigerBeetle and updating the payment status to "reversed". Used for refunds, chargebacks, or administrative corrections.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingPaymentReversalWorkflow`
Payload Example:
{
"payment_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Customer requested refund",
"initiated_by_email": "admin@tenki.cloud"
}
Parameters:
- `payment_id` (UUID): The payment ID to reverse
- `workspace_id` (UUID): The workspace that owns the payment
- `reason` (string): Reason for the reversal (required)
- `initiated_by_email` (string): Email of the person initiating the reversal (required)
Expected Result:
{
"success": true,
"reversal_transfer_id": "base64-encoded-transfer-id",
"original_amount": "12.50",
"reversed_at": "2024-01-31T15:30:00Z"
}
Workflow Steps:
- Validate required parameters (payment_id, workspace_id, reason, initiated_by_email)
- Lookup payment details from database and TigerBeetle
- Create reversal transfer in TigerBeetle using original transfer details
- Update payment status to "reversed" with reversal details
BillingUsageReversalWorkflow
Reverses a usage event by creating a reversal transfer in TigerBeetle and deleting the usage event record. Used for correcting erroneous charges or administrative adjustments to usage records.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingUsageReversalWorkflow`
Payload Example:
{
"usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Incorrect runner charge - job failed",
"initiated_by_email": "admin@tenki.cloud"
}
Parameters:
- `usage_event_id` (UUID): The usage event ID to reverse (required)
- `workspace_id` (UUID): The workspace that owns the usage event (required)
- `reason` (string): Reason for the reversal (required)
- `initiated_by_email` (string): Email of the person initiating the reversal (required)
Expected Result:
{
"success": true,
"reversal_transfer_id": "base64-encoded-transfer-id",
"original_amount": "0.50",
"reversed_at": "2024-01-31T15:30:00Z"
}
Workflow Steps:
- Validate required parameters (usage_event_id, workspace_id, reason, initiated_by_email)
- Fetch usage event details from database
- Verify workspace ownership matches provided workspace_id
- Lookup actual transfer details from TigerBeetle
- Create reversal transfer in TigerBeetle using original transfer amounts
- Delete the usage event record from database
Important Notes:
- Unlike payment reversals, usage event reversals permanently delete the record (no audit trail in the usage_events table)
- The reversal transfer in TigerBeetle maintains the financial audit trail
- Workspace ID validation ensures the usage event belongs to the specified workspace
Temporal CLI Examples
Execute BillingListWorkspaceBalanceWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingListWorkspaceBalanceWorkflow \
--input '{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}'
Execute BillingCycleScheduleWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingCycleScheduleWorkflow \
--input '{
"luxor_only": false,
"exclude_workspaces": []
}'
Execute BillingCycleWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingCycleWorkflow \
--input '{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}'
Execute BillingPaymentReversalWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingPaymentReversalWorkflow \
--input '{
"payment_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Customer requested refund",
"initiated_by_email": "admin@tenki.cloud"
}'
Execute BillingUsageReversalWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingUsageReversalWorkflow \
--input '{
"usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Incorrect runner charge - job failed",
"initiated_by_email": "admin@tenki.cloud"
}'
Notes
- All timestamps should be in UTC
- Workspace IDs must be valid UUIDs
- Billing periods follow YYYY-MM format
- The `BillingCycleScheduleWorkflow` is typically run automatically via Temporal schedules
- Individual `BillingCycleWorkflow` executions can be run manually for specific workspaces
- Use `BillingListWorkspaceBalanceWorkflow` to preview billing information before processing
Operational Runbooks
This section contains runbooks for common operational scenarios and incident response.
Available Runbooks
- High Database CPU - When database CPU exceeds 80%
Runbook Template
When creating a new runbook, use this template:
# Runbook: [Issue Name]
## Alert Details
- **Alert Name**: `AlertNameInPrometheus`
- **Severity**: P1 | P2 | P3
- **Team**: Backend | Frontend | Platform
- **Last Updated**: YYYY-MM-DD
## Symptoms
- What the user/system experiences
- What metrics are affected
- What alerts fire
## Quick Diagnostics
\```bash
# Commands to quickly assess the situation
\```
## Resolution Steps
### 1. Immediate Mitigation (X mins)
Steps to stop the bleeding
### 2. Root Cause Analysis (X mins)
How to find what caused the issue
### 3. Fix Implementation
How to fix the underlying problem
### 4. Verification
How to confirm the fix worked
## Prevention
Long-term fixes to prevent recurrence
## Escalation Path
When and who to escalate to
## Related Runbooks
Links to related procedures
Writing Good Runbooks
- Be specific - Include exact commands and expected outputs
- Time-box steps - Indicate how long each step should take
- Include rollback - Always have a way to undo changes
- Test regularly - Run through the runbook quarterly
- Keep updated - Update after each incident
Incident Response Process
- Acknowledge the alert
- Assess using quick diagnostics
- Mitigate following the runbook
- Communicate status updates
- Resolve the root cause
- Document in incident report
Runbook: High Database CPU
Alert Details
- Alert Name: `HighDatabaseCPU`
- Severity: P2
- Team: Backend/Platform
- Last Updated: 2025-06-12
Symptoms
- Database CPU usage > 80% for 5+ minutes
- API response times > 500ms
- Increased error rates in logs
- Grafana dashboard shows CPU spike
Quick Diagnostics
# 1. Check current connections
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;"
# 2. Find slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
substring(query, 1, 50) as query_start,
calls,
mean_exec_time,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;"
# 3. Check for locks
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
pid,
usename,
pg_blocking_pids(pid) as blocked_by,
query_start,
substring(query, 1, 50) as query
FROM pg_stat_activity
WHERE pg_blocking_pids(pid)::text != '{}';"
Resolution Steps
1. Immediate Mitigation (5 mins)
# Scale up API to reduce per-instance load
kubectl scale deployment/engine --replicas=10
# Kill long-running queries (>5 minutes)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
AND query NOT LIKE '%pg_stat_activity%';"
2. Identify Root Cause (10 mins)
Check recent deployments:
kubectl get deployments -o wide | grep engine
kubectl rollout history deployment/engine
Review slow query log:
kubectl logs postgres-0 | grep "duration:" | tail -50
Check for missing indexes:
-- Run on affected tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflow_runs
WHERE status = 'pending'
AND created_at > NOW() - INTERVAL '1 hour';
3. Fix Implementation
If missing index:
-- Create index (be careful on large tables)
CREATE INDEX CONCURRENTLY idx_workflow_runs_status_created
ON workflow_runs(status, created_at)
WHERE status IN ('pending', 'running');
If bad query from recent deploy:
# Rollback to previous version
kubectl rollout undo deployment/engine
# Or deploy hotfix
git checkout main
git pull
# Fix query
git commit -am "fix: optimize workflow query"
git push
# Deploy via CI/CD
4. Verify Resolution
# Monitor CPU (should drop within 5 mins)
watch -n 5 "kubectl exec -it postgres-0 -- psql -U postgres -c 'SELECT round(100 * cpu_usage) as cpu_percent FROM pg_stat_database_stats;'"
# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.tenki.lab/health
# Verify no more slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '1 minute';"
Long-term Prevention
- Add query timeout to engine configuration
- Set up query monitoring in Datadog/NewRelic
- Regular ANALYZE on high-traffic tables
- Consider read replicas for analytics queries
- Implement connection pooling with PgBouncer
Escalation Path
- 15 mins: If CPU still high → Page backend on-call
- 30 mins: If impacting customers → Incident Commander
- 45 mins: If data corruption risk → CTO
Related Runbooks
Post-Incident
- Create incident report
- Add missing monitoring
- Update this runbook with findings
- Schedule postmortem if customer impact
Runbook: High API Latency
Overview
This runbook covers troubleshooting and resolving high API latency issues.
Symptoms
- p95 latency > 500ms
- User reports of slow loading
- Timeout errors in client applications
- Increased error rates due to timeouts
Impact
- Poor user experience
- Increased error rates
- Potential cascading failures
- Customer complaints
Detection
- Alert: `APILatencyHigh`
- Threshold: p95 > 500ms for 5 minutes
- Dashboard: API Performance
Response
Immediate Actions
1. Check current latency
- View p50, p95, p99 latencies
- Identify affected endpoints
- Check error rates
2. Verify system health
# Check pod status
kubectl get pods -n production
# Check resource usage
kubectl top pods -n production
# Check recent deployments
kubectl rollout history deployment/api -n production
3. Enable detailed logging (temporarily)
kubectl set env deployment/api LOG_LEVEL=debug -n production
Diagnosis
1. Database performance
- Check slow query log
- Review connection pool status
- Look for lock contention
2. External dependencies
- GitHub API response times
- Payment processor latency
- CDN performance
3. Application issues
- Memory leaks (increasing memory usage)
- CPU bottlenecks
- Inefficient algorithms
Common Causes and Fixes
1. Database Queries
Symptom: High database CPU, slow queries
Fix:
-- Find slow queries
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_table_column ON table(column);
2. Cache Misses
Symptom: High cache miss rate
Fix:
- Warm up caches after deployment
- Increase cache TTL for stable data
- Review cache key generation
3. Resource Constraints
Symptom: High CPU/memory usage
Fix:
# Scale horizontally
kubectl scale deployment api --replicas=6 -n production
# Or scale vertically (requires restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production
4. Inefficient Code
Symptom: Specific endpoints consistently slow
Fix:
- Profile the endpoint
- Optimize algorithms
- Implement pagination
- Add caching layer
Recovery
1. Quick wins
- Increase cache TTLs
- Scale out services
- Enable read replicas
2. Rollback if needed
kubectl rollout undo deployment/api -n production
3. Communicate status
- Update status page
- Notify affected customers
- Post in #incidents channel
Prevention
- Load testing before major releases
- Gradual rollouts with canary deployments
- Query performance regression tests
- Capacity planning reviews
Monitoring
Key metrics to watch:
- API latency percentiles
- Database query time
- Cache hit rates
- Resource utilization
- Error rates
Related
Runbook: High Database Connections
Overview
This runbook describes how to handle situations where the database connection pool is exhausted or nearing its limits.
Symptoms
- Application errors: "too many connections"
- Slow API responses
- Connection pool metrics showing high usage
- Database showing max_connections limit reached
Impact
- API requests fail
- Background jobs unable to process
- Users experience errors and timeouts
Detection
- Alert: `DatabaseConnectionsHigh`
- Threshold: > 80% of max_connections
- Dashboard: Database Health
Response
Immediate Actions
1. Check current connections
SELECT count(*) FROM pg_stat_activity;
SELECT usename, application_name, count(*)
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY count DESC;
2. Identify idle connections
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
3. Kill long-idle connections (if safe)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '30 minutes';
Root Cause Analysis
1. Check for connection leaks
- Review recent deployments
- Check for missing `defer db.Close()`
- Look for transactions not being committed/rolled back
2. Review pool configuration
- Current settings in environment
- Calculate optimal pool size
- Check for misconfigured services
3. Analyze traffic patterns
- Sudden spike in requests
- New feature causing more queries
- Background job issues
Long-term Fixes
1. Optimize connection pool settings
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)
db.SetConnMaxLifetime(5 * time.Minute)
2. Implement a connection pooler
- Consider PgBouncer for connection multiplexing
- Configure pool modes appropriately
3. Code improvements
- Use prepared statements
- Batch queries where possible
- Implement query result caching
Prevention
- Monitor connection pool metrics
- Load test with realistic concurrency
- Regular code reviews for database usage
- Implement circuit breakers
Related
Runbook: Database Failover
Overview
This runbook covers the process of failing over to a standby database in case of primary database failure.
Symptoms
- Primary database unreachable
- Replication lag increasing indefinitely
- Database corruption detected
- Catastrophic hardware failure
Impact
- Complete service outage
- Data writes blocked
- Potential data loss (depending on replication lag)
Detection
- Alert: `DatabasePrimaryDown`
- Alert: `DatabaseReplicationLagHigh`
- Dashboard: Database Health
Pre-failover Checks
1. Verify Primary is Down
# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432
# Check from multiple locations
for host in api-1 api-2 worker-1; do
ssh $host "pg_isready -h primary.db.tenki.cloud"
done
2. Check Replication Status
-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;
3. Assess Data Loss Risk
- Note the last transaction timestamp
- Document replication lag
- Make go/no-go decision based on business impact
Failover Process
1. Stop All Application Traffic
# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production
# Verify no active connections
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
2. Promote Standby
# On standby server
pg_ctl promote -D /var/lib/postgresql/data
# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby
3. Update Connection Strings
# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"
# Or update environment variables
kubectl set env deployment/api deployment/worker \
DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
-n production
4. Verify New Primary
-- Check if accepting writes
SELECT pg_is_in_recovery(); -- Should return false
-- Test write
INSERT INTO health_check (timestamp) VALUES (now());
5. Resume Application Traffic
# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production
# Monitor for errors
kubectl logs -f deployment/api -n production
Post-Failover Tasks
1. Immediate
- Monitor application health
- Check for data inconsistencies
- Communicate status to stakeholders
2. Within 1 Hour
- Set up new standby from old primary (if recoverable)
- Update monitoring to reflect new topology
- Document timeline and impact
3. Within 24 Hours
- Root cause analysis
- Update disaster recovery procedures
- Test backup restoration process
Rollback Procedure
If failover was premature or primary recovers:
- Stop applications again
- Ensure data consistency
- Compare transaction IDs
- Check for split-brain scenarios
- Resync if needed

  ```bash
  pg_rewind --target-pgdata=/var/lib/postgresql/data \
    --source-server="host=primary.db.tenki.cloud"
  ```

- Switch back to primary
- Resume traffic
Prevention
- Regular failover drills
- Monitor replication lag closely
- Implement automatic failover with proper fencing
- Use synchronous replication for critical data
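For the synchronous-replication item, the relevant settings live in `postgresql.conf` on the primary; the standby name below is a placeholder:

```ini
# postgresql.conf (primary) — standby name is a placeholder
synchronous_standby_names = 'FIRST 1 (tenki_standby)'
synchronous_commit = on   # COMMIT waits for standby confirmation
```

The trade-off: synchronous replication bounds data loss at failover to zero committed transactions, at the cost of added commit latency.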
Related
Runbook: Playwright Scenario Failed
Test Failure Due to Multiple Matching Elements with Similar Text
Alert Details
- Alert Name: Tenki Production - App Can Login
- Severity: P2
- Team: Frontend
- Last Updated: 2025-09-08
Symptoms
- Playwright test `should allow entering email and password` fails
Quick Diagnostics
kubectx tenki-prod-apps
Resolution Steps
1. Immediate Mitigation (5-10 mins)
- Verified login manually on staging and production; both worked.
- Ran `kubectx tenki-prod-apps` and checked logs across the namespace; all pods are in `Running` status.
2. Root Cause Analysis (10 mins)
- The test failed due to a strict mode violation in Playwright.
- The locator matched multiple elements containing the text `Projects`, so Playwright could not determine which one to interact with.
- Playwright requires a locator to resolve to a single unique element when asserting `.toBeVisible()` in strict mode.
3. Fix Implementation / Possible Resolution
- Add a unique internal ID to the correct element so the test can reliably target it without confusion from similar elements.
- Update the test to match exact text to avoid picking up similar elements.
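To make the failure mode concrete, here is a self-contained sketch of strict-mode resolution over a mock element list — not the real Playwright API — showing why a substring match throws and how a unique test id or exact text resolves it:

```typescript
// Mock of Playwright's strict-mode behavior: a locator that resolves
// to more than one element throws instead of guessing.
type El = { text: string; testId?: string };

function strictLocate(els: El[], match: (e: El) => boolean): El {
  const hits = els.filter(match);
  if (hits.length !== 1) {
    throw new Error(`strict mode violation: ${hits.length} elements matched`);
  }
  return hits[0];
}

// Hypothetical page: three elements whose text contains "Projects".
const page: El[] = [
  { text: "Projects" },                        // nav link
  { text: "Projects overview" },               // page heading
  { text: "Projects", testId: "projects-tab" } // the element we want
];

// Substring match is ambiguous — this is what made the test fail:
// strictLocate(page, (e) => e.text.includes("Projects")) // throws

// Fix 1: match by a unique test id added to the component
// (mirrors page.getByTestId("projects-tab") in Playwright)
const byId = strictLocate(page, (e) => e.testId === "projects-tab");

// Fix 2: exact-text match (mirrors getByText("...", { exact: true }))
const exact = strictLocate(page, (e) => e.text === "Projects overview");

console.log(byId.testId, exact.text);
```

Note that exact text alone is not always enough — two elements here have the exact text `Projects` — which is why the unique test id is the more robust fix.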
4. Verification
- The scenario passed when re-run from Monitors.
Prevention
- Ensure dynamic or conditionally rendered UI elements have unique, stable test IDs
Related Runbooks
Not Applicable
On-call Guide
Last updated: 2025-06-30
Qualification
1. Watch the initial onboarding video.
2. Refer to this Notion document.
3. Duplicate this sample and use your name as the title.
Completing these steps ensures you are properly qualified and aware of your on-call responsibilities.
Product Requirement Documents (PRDs)
This directory contains PRDs for major features and initiatives. Each PRD captures the why, what, and success criteria for a feature.
PRD Template
# PRD-XXX: Feature Name
**Author**: Name
**Date**: YYYY-MM-DD
**Status**: Draft | In Review | Approved | In Development | Launched
## Summary
One paragraph overview of what we're building and why.
## Problem Statement
What problem are we solving? Who experiences this problem? Why does it matter?
## Goals & Success Metrics
- **Primary Goal**: What we must achieve
- **Success Metrics**:
- Metric 1: Target value
- Metric 2: Target value
## User Stories
1. As a [user type], I want to [action] so that [benefit]
2. As a [user type], I want to [action] so that [benefit]
## Requirements
### Must Have (MVP)
- [ ] Requirement 1
- [ ] Requirement 2
### Should Have
- [ ] Requirement 3
- [ ] Requirement 4
### Nice to Have
- [ ] Requirement 5
## Technical Approach
High-level technical approach. Details go in technical design docs.
## Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
| ------ | ------ | ---------- | ------------------- |
| Risk 1 | High | Medium | How we'll handle it |
## Timeline
- Week 1-2: Design and planning
- Week 3-4: Implementation
- Week 5: Testing and rollout
## Open Questions
- [ ] Question 1
- [ ] Question 2
Current PRDs
- PRD-001: GitHub Integration - Connect GitHub organizations
Writing a Good PRD
Doβs
- Start with the problem, not the solution
- Include measurable success criteria
- Keep it concise (2-3 pages max)
- Focus on the βwhatβ and βwhyβ, not βhowβ
- Include user stories
Donβts
- Donβt include implementation details
- Donβt skip the problem statement
- Donβt forget about edge cases
- Donβt ignore risks
PRD Process
- Draft - PM creates initial PRD
- Review - Engineering, Design, and stakeholders review
- Approval - Leadership approves
- Development - Engineering implements
- Launch - Feature released
- Retrospective - Measure against success criteria
PRD-001: GitHub Integration
Author: Product Team
Date: 2024-01-20
Status: Launched
Summary
Enable customers to connect their GitHub organizations to Tenki Cloud and automatically provision runners for their repositories without any configuration or infrastructure management.
Problem Statement
Development teams waste significant time and money managing GitHub Actions infrastructure:
- Setting up self-hosted runners requires DevOps expertise
- Maintaining runner infrastructure distracts from product development
- GitHubβs hosted runners are expensive and have limited customization
- Scaling runners up/down based on demand is complex
Who experiences this: Engineering teams using GitHub Actions for CI/CD
Impact: Teams spend 10-20 hours/month on runner management instead of shipping features
Goals & Success Metrics
Primary Goal: Zero-config GitHub Actions runners that just work
Success Metrics:
- Time to first runner: < 5 minutes from signup
- Runner startup time: < 30 seconds
- Platform uptime: 99.9%
- Customer runner cost: 50% less than GitHub hosted
- Monthly active organizations: 100 by Q2
User Stories
- As a developer, I want to connect my GitHub org so that runners are automatically available for all my repos
- As a team lead, I want to set spending limits so that we donβt exceed our CI/CD budget
- As a DevOps engineer, I want to customize runner specs so that our builds run efficiently
- As a finance manager, I want to see detailed usage reports so that I can allocate costs to teams
Requirements
Must Have (MVP)
- GitHub App for OAuth authentication
- Automatic runner provisioning for workflow_job events
- Support for Linux runners (Ubuntu 22.04)
- Basic usage dashboard showing minutes used
- Automatic runner cleanup after job completion
- Support for public and private repositories
Should Have
- Multiple runner sizes (2-16 vCPU)
- Usage alerts and spending limits
- Windows and macOS runners
- Runner caching between jobs
- Team-based access controls
Nice to Have
- Custom runner images
- Dedicated runner pools
- GitHub Enterprise Server support
- API for programmatic management
Technical Approach
- GitHub App handles authentication and webhook events
- Webhook handler processes workflow_job events
- Temporal workflows orchestrate runner lifecycle
- Kubernetes operators manage runner pods
- Usage tracking via TigerBeetle for accurate billing
Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| GitHub API rate limits | High | Medium | Implement caching and exponential backoff |
| Runner startup time > 30s | High | Medium | Pre-warm runner pools, optimize images |
| Security vulnerabilities | High | Low | Regular security audits, isolated runners |
| Cost overruns | Medium | Medium | Real-time usage tracking and limits |
Timeline
- Week 1-2: GitHub App development and authentication
- Week 3-4: Webhook handling and runner provisioning
- Week 5-6: Usage tracking and billing integration
- Week 7: Beta testing with friendly customers
- Week 8: Public launch
Open Questions
- Should we support GitHub Enterprise? β Not in MVP
- How do we handle runner caching? β Post-MVP feature
- Whatβs our runner retention policy? β 7 days for logs
- How do we handle abuse/crypto mining? β Usage anomaly detection
Post-Launch Results
Launched: 2025-04-15
Actual Metrics (as of 2025-06-01):
- Time to first runner:
- Runner startup time:
- Platform uptime:
- Cost savings:
- Monthly active orgs:
Key Learnings:
- Pre-warming runner pools was critical for startup time
- Customers want custom images more than expected
- Windows runner demand higher than anticipated
Product Roadmap
Overview
This document outlines the product roadmap for Tenki Cloud, organized by quarters and strategic themes.
Q1 2025
Core Platform
- β GitHub integration MVP
- β Basic runner management
- β Usage tracking and billing
- π§ Self-service onboarding
- π§ Team management
Developer Experience
- β CLI tool
- π§ VS Code extension
- π IntelliJ plugin
Q2 2025
Scale and Performance
- π Multi-region support
- π Runner auto-scaling
- π Performance optimizations
- π Caching improvements
Enterprise Features
- π SSO integration
- π Advanced access controls
- π Audit logging
- π Compliance certifications
Q3 2025
Ecosystem Integration
- π GitLab support
- π Bitbucket support
- π Jenkins integration
- π Kubernetes operators
Advanced Features
- π Custom runner images
- π GPU runner support
- π Spot instance integration
- π Advanced scheduling
Q4 2025
Platform Maturity
- π White-label solution
- π Marketplace integrations
- π Partner ecosystem
- π Advanced analytics
Legend
- β Completed
- π§ In Progress
- π Planned
Feature Requests
Track feature requests in our GitHub Issues.
Feedback
We welcome feedback on our roadmap. Please reach out through:
- GitHub Discussions
- Support channels
- Customer success team
Product Metrics
Overview
This document defines the key metrics we track to measure product success and guide decision-making.
North Star Metrics
Primary Metric: Weekly Active Builds
- Definition: Unique organizations with at least one successful build in the past 7 days
- Target: 20% month-over-month growth
- Current: [Dashboard Link]
Product Metrics
Activation
- Time to First Build: Time from signup to first successful build
- Target: < 10 minutes
- Measured from: Account creation to first build completion
- Activation Rate: % of signups that complete first build within 7 days
- Target: > 80%
- Segmented by: Source, plan type
Engagement
- Build Frequency: Average builds per organization per week
- Target: > 50 builds/week for active orgs
- Segmented by: Organization size, industry
- Runner Utilization: % of time runners are actively building
- Target: > 70% during business hours
- Measured: CPU time / available time
Retention
- 30-Day Retention: % of orgs active after 30 days
- Target: > 85%
- Cohorted by: Signup month
- 90-Day Retention: % of orgs active after 90 days
- Target: > 75%
- Leading indicator: Build frequency in first week
Revenue
- MRR Growth: Month-over-month recurring revenue growth
- Target: 15% MoM
- Segmented by: Plan type, acquisition channel
- Net Revenue Retention: Revenue from existing customers
- Target: > 120%
- Includes: Upgrades, downgrades, churn
Operational Metrics
Performance
- Build Success Rate: % of builds completing successfully
- Target: > 99%
- Excluding: User errors
- API Latency: p95 response time
- Target: < 200ms
- Measured: All API endpoints
Quality
- Customer Satisfaction (CSAT): Post-interaction survey
- Target: > 4.5/5
- Measured: Support interactions
- Net Promoter Score (NPS): Quarterly survey
- Target: > 50
- Segmented by: Customer segment
Leading Indicators
Feature Adoption
- CLI usage rate
- API integration rate
- Advanced features usage
Customer Health
- Support ticket volume
- Feature request patterns
- Churn risk scores
Data Collection
Tools
- Amplitude: Product analytics
- Segment: Event tracking
- Metabase: Business intelligence
- Custom dashboards: Real-time metrics
Privacy
- All metrics are aggregated
- No PII in analytics
- GDPR compliant tracking
- User consent required
Reporting
Weekly
- North star metric update
- Key metric changes
- Anomaly alerts
Monthly
- Full metrics review
- Cohort analysis
- Revenue metrics
- OKR progress
Quarterly
- Strategic metric review
- NPS survey results
- Market comparison
- Forecast updates
π§ͺ Testing Plan: Tenki GitHub Runners Evaluation
π Thursday 6/26 β Phase 1 & Phase 2: Staging & Controlled Evaluation
Phase 1: Staging Load Test
Objective: Validate stability and responsiveness of the new VM-based runners under parallel job load.
Setup:
Trigger ~50 GitHub Actions jobs in parallel using the gh-runner-test repository.
Definition of Done (DoD):
- Jobs are picked up within 30 seconds.
- Job duration is within +5% of baseline execution time from existing Docker-based runners.
Phase 2: Tenki Test Suite Evaluation
Condition: Executed only if Phase 1 is successful.
Objective: Assess runner performance using real-world workflows from the test suite.
Setup:
- Switch the GitHub Actions workspace used by LuxorLabs/tenki-tests to the new VM-based runners.
- Monitor CI jobs for performance and reliability.
Definition of Done (DoD):
- End-to-end performance delta is < 5% compared to current production metrics.
π Friday β Phase 3: βPre-Productionβ Migration
Phase 3: Luxor Workflow Migration
Precondition: All DoDs from Phase 1 and Phase 2 must be fully met.
Objective: Transition production workloads to the new runners, based on successful Thursday validation.
Setup:
- Migrate all GitHub workflows under the Luxor Tenki Workspace to the new VM-based runners.
Definition of Done (DoD):
- All jobs are successful.
- Performance delta is < 5% compared to current production metrics.
Documentation Roadmap
This roadmap tracks documentation that needs to be written for Tenki Cloud. Items are prioritized based on impact and frequency of use.
π¨ Priority 1: Critical Gaps
These affect daily development and operations:
- Environment Variables Reference - Complete list of all env vars
- API Reference - tRPC endpoints and Connect/gRPC services
- GitHub App Setup Guide - Step-by-step installation
- Secrets Management Guide - SOPS usage and key rotation
- Troubleshooting Guide - Common issues and solutions
π§ Priority 2: Configuration & Setup
Essential for proper deployment and configuration:
- Service Configuration Guide - engine.yaml and other configs
- Authentication Setup - Kratos and Keto configuration
- Notification Service Guide - Email and webhook setup
- Database Guide - Schema, migrations, and optimization
- CLI Tool Documentation - tenki-cli command reference
π Priority 3: Operational Excellence
For production operations and monitoring:
- Monitoring & Observability - Metrics, logs, and tracing
- Backup & Restore Procedures - Database and state backup
- Scaling Guidelines - When and how to scale services
- Security Best Practices - Hardening and compliance
- Audit Logging Guide - Event tracking and retention
π Priority 4: Advanced Features
For power users and advanced scenarios:
- Custom Runner Images - Building and managing
- Temporal Workflows Guide - Patterns and testing
- TigerBeetle Integration - Ledger design and reconciliation
- Multi-region Setup - Geographic distribution
- Performance Tuning - Optimization techniques
π Contributing
To add documentation:
- Pick an item from this roadmap
- Create the documentation in the appropriate section
- Update SUMMARY.md to include your new page
- Remove the item from this roadmap
- Submit a PR
Progress Tracking
- Total items: 24
- Completed: 0
- In Progress: 0
- Remaining: 24
Last updated: 2025-06-12