Tenki Cloud Documentation
Welcome to Tenki Cloud's documentation. This is your starting point for understanding the system architecture, development practices, and operational procedures.
Note: This documentation is built with mdBook. Run `pnpm docs:dev` to view it locally.
Quick Links
- New to Tenki? Start with Getting Started
- Architecture Overview - System Architecture
- API Reference - Backend API Guide
- Deployment - Deployment Guide
Documentation Organization
Architecture
System design, technical decisions, and architectural diagrams.
Development
Everything you need to start developing on Tenki Cloud.
- Getting Started - Set up your dev environment
- Backend Development - Go services and APIs
- Frontend Development - React/Next.js apps
- Database Guide - Schema and migrations
Operations
Deployment, monitoring, and incident response.
- Deployment Guide
- Monitoring
- Runbooks - Operational procedures
Product
Product vision, roadmap, and requirements.
Contributing to Documentation
When to Add Documentation
- Architecture changes → Add an ADR
- New features → Add a PRD
- Operational issues → Add a runbook
- API changes → Update the relevant guide
Documentation Standards
- Keep it concise - Get to the point quickly
- Use examples - Show, don't just tell
- Date your docs - Add "Last updated: YYYY-MM-DD" to guides
- Test your instructions - Make sure they actually work
Quick Doc Updates
# Install mdBook (first time only)
./docs/install-mdbook.sh
# Edit documentation
vim docs/src/development/getting-started.md
# Preview locally with hot reload
pnpm docs:dev
# Build static site
pnpm docs:build
# Submit changes
git add docs/
git commit -m "docs: update getting started guide"
Finding Information
By Role
Backend Engineer
Frontend Engineer
DevOps/SRE
Product Manager
By Task
"I need to…"
- Set up my development environment → Getting Started
- Understand the system design → Architecture Overview
- Deploy to production → Deployment Guide
- Debug an issue → Runbooks
- Plan a new feature → PRD Template
Maintenance
This documentation is maintained by the engineering team. Each team member is responsible for keeping their area of expertise documented.
- Backend team owns: Backend guide, database docs, API patterns
- Frontend team owns: Frontend guide, component docs
- DevOps team owns: Deployment, monitoring, runbooks
- Product team owns: Roadmap, PRDs, metrics
Last updated: 2025-06-12
Tenki Cloud System Architecture
Last updated: 2025-06-12
Overview
Tenki Cloud is a cloud compute marketplace that provides GitHub Actions runner management as a service. The system is built as a distributed microservices architecture with clear separation of concerns.
High-Level Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   GitHub.com    │────▶│   GitHub Proxy   │────▶│    Temporal     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
┌─────────────────┐     ┌──────────────────┐              ▼
│   Next.js App   │────▶│   tRPC Gateway   │     ┌─────────────────┐
└─────────────────┘     └──────────────────┘     │  Backend Engine │
                                │                └─────────────────┘
                                ▼                         │
                        ┌──────────────────┐              ▼
                        │   Backend API    │     ┌─────────────────┐
                        │  (Connect RPC)   │     │   PostgreSQL    │
                        └──────────────────┘     └─────────────────┘
Core Components
Frontend Layer
Next.js Application (apps/app/)
- Server-side rendered React application
- TypeScript with tRPC for type-safe API calls
- Tailwind CSS with Radix UI components
- Authentication via Kratos sessions
API Gateway Layer
tRPC Router (apps/app/src/server/api/)
- Type-safe RPC layer between frontend and backend
- Handles session management and authentication
- Routes requests to appropriate backend services
Backend Services
Engine (backend/cmd/engine/)
- Main orchestrator for all backend operations
- Implements Connect RPC (gRPC-Web compatible)
- Manages service lifecycle and dependencies
Domain Services (backend/internal/domain/)
- Identity: User authentication (Kratos) and authorization (Keto)
- Workspace: Multi-tenant workspace and project management
- Runner: GitHub Actions runner lifecycle management
- Billing: Usage tracking, TigerBeetle ledger, Stripe integration
- Compute: VM provisioning via CloudStack/Kubernetes
Event Processing
GitHub Proxy (backend/cmd/github-proxy/)
- Receives GitHub webhooks
- Validates and transforms events
- Publishes to Temporal for processing
Temporal Workflows
- Long-running business processes
- Runner provisioning workflows
- Billing cycle management
- Retry and failure handling
Data Layer
PostgreSQL
- Primary data store
- Managed via migrations (`backend/schema/`)
- Type-safe queries via sqlc
Redpanda
- Event streaming platform
- Audit log collection
- Inter-service communication
TigerBeetle
- Financial ledger for billing
- Double-entry bookkeeping
- High-performance transaction processing
Key Design Decisions
1. Monorepo Structure
See ADR-001
2. Temporal for Workflows
See ADR-002
3. Connect RPC over REST
See ADR-003
Security Architecture
Authentication Flow
User → Next.js → Kratos → Session Cookie → tRPC → Backend
Authorization Model
- Keto for fine-grained permissions
- Workspace-based multi-tenancy
- Project-level access control
Secrets Management
- SOPS for encrypted configuration
- Kubernetes secrets for runtime
- No secrets in environment variables
Deployment Architecture
Kubernetes Deployment
- GitOps via Flux
- Horizontal pod autoscaling
- Service mesh for inter-service communication
Infrastructure Components
- Ingress: Traefik with automatic TLS
- Monitoring: Prometheus + Grafana
- Logging: Loki + Grafana
- Tracing: Tempo
Data Flow Examples
Runner Provisioning
- GitHub sends webhook to proxy
- Proxy validates and publishes to Kafka
- Backend consumes event, starts Temporal workflow
- Workflow provisions runner in Kubernetes
- Runner registers with GitHub
- Status updates flow back via Temporal
Billing Flow
- Runner usage tracked via Temporal activities
- Usage events written to TigerBeetle
- Daily aggregation job calculates costs
- Monthly billing workflow generates invoices
- Stripe processes payments
- Payment status updates ledger
Scalability Considerations
Horizontal Scaling
- Stateless services scale via replicas
- Database uses read replicas for queries
- Temporal workers scale independently
Performance Optimization
- Redis for session caching
- CDN for static assets
- Database query optimization via indexes
Reliability
- Circuit breakers for external services
- Retry logic in Temporal workflows
- Graceful degradation for non-critical features
Future Architecture Plans
- Multi-region deployment for global latency optimization
- GraphQL federation for more flexible API access
- Event sourcing for complete audit trail
- Service mesh for advanced traffic management
Related Documentation
GitHub Runners Architecture
This document provides a comprehensive overview of Tenki Cloud's GitHub Actions runner system, detailing how we manage self-hosted runners at scale.
Overview
Tenki Cloud provides a managed GitHub Actions runner platform that allows users to run their CI/CD workflows on dedicated, scalable infrastructure. The system integrates deeply with GitHub through a GitHub App, orchestrates runner lifecycle through Temporal workflows, and manages the underlying Kubernetes infrastructure.
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     GitHub      │────▶│   GitHub Proxy   │────▶│    Temporal     │
│    Webhooks     │     │    (Node.js)     │     │    Workflows    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Kubernetes    │◀────│  Runner Service  │◀────│    Database     │
│    (Runners)    │     │       (Go)       │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Core Components
1. GitHub Proxy
The GitHub proxy serves as the entry point for all GitHub webhook events. Built with Node.js and Probot, it:
- Receives webhook events from GitHub (installation, workflow_job, workflow_run, push)
- Validates webhook signatures for security
- Forwards events to Temporal workflows for processing
- Preserves GitHub headers for workflow_job events
Key event handlers:
- `installation.created/deleted`: Manages GitHub App installations
- `workflow_job`: Processes individual CI/CD job events
- `workflow_run`: Tracks overall workflow execution
- `push`: Monitors changes to workflow files
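Though the proxy itself is Node.js, the signature check it performs can be sketched in Go with stdlib HMAC. GitHub sends an `X-Hub-Signature-256` header of the form `sha256=<hex digest>`; function names here are illustrative, not the proxy's actual API:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signPayload computes the value GitHub sends in the
// X-Hub-Signature-256 header: "sha256=" + hex(HMAC-SHA256(secret, body)).
func signPayload(secret, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature compares the received header against the expected
// signature in constant time, which avoids timing side channels.
func verifySignature(secret, payload []byte, header string) bool {
	expected := signPayload(secret, payload)
	return hmac.Equal([]byte(expected), []byte(header))
}

func main() {
	secret := []byte("webhook-secret")
	body := []byte(`{"action":"queued"}`)
	header := signPayload(secret, body)
	fmt.Println(verifySignature(secret, body, header))               // true
	fmt.Println(verifySignature(secret, []byte("tampered"), header)) // false
}
```

Events that fail this check are dropped before they ever reach Temporal.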
2. Runner Service
The runner service is the core business logic layer, implemented in Go with Connect RPC:
- Manages runner lifecycle: Creation, deletion, suspension
- Handles GitHub integration: Repository synchronization, workflow analysis
- Controls Kubernetes resources: Deployments, autoscalers, secrets
- Tracks usage and billing: Job metrics, duration, failures
Key operations:
- `InstallRunners`: Initialize a new GitHub App installation
- `CreateRunner`: Provision custom runner configurations
- `GetRunnerMetrics`: Performance analytics (p50/p90, failure rates)
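The p50/p90 figures can be computed with a nearest-rank percentile over job durations; this is a self-contained sketch, not the service's actual implementation:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of job durations
// (in seconds): the value at rank ceil(p/100 * n) of the sorted slice.
func percentile(durations []float64, p float64) float64 {
	if len(durations) == 0 {
		return 0
	}
	sorted := append([]float64(nil), durations...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted)))) // 1-indexed
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	d := []float64{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
	fmt.Println(percentile(d, 50)) // 50
	fmt.Println(percentile(d, 90)) // 90
}
```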
3. Temporal Workflows
Temporal provides durable workflow orchestration for long-running operations:
Primary Workflows
Runner Installation Workflow
- Long-running workflow per GitHub installation
- Responds to signals: Install, Uninstall, Suspend, AddRepositories
- Manages entire runner lifecycle
- Handles failure recovery and retries
GitHub Job Workflow
- Processes each GitHub Actions job
- Tracks state transitions (queued → in_progress → completed)
- Creates billing events for usage tracking
- Forwards requests to Actions Runner Controller
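The transitions the job workflow tracks can be modeled as a small table. The cancel-while-queued edge (`queued → completed`) is an assumption for illustration, not documented behavior:

```go
package main

import "fmt"

// validNext maps a GitHub Actions job state to the states it may
// legally move to; "completed" is terminal.
var validNext = map[string][]string{
	"queued":      {"in_progress", "completed"}, // completed covers cancellation while queued (assumed)
	"in_progress": {"completed"},
	"completed":   {},
}

func canTransition(from, to string) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("queued", "in_progress")) // true
	fmt.Println(canTransition("completed", "queued"))   // false
}
```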
GitHub Run Workflow
- Monitors overall workflow execution
- Provides visibility into CI/CD pipeline status
- Updates database with run metadata
4. Data Models
Runner
message Runner {
string id = 1;
string name = 2;
string namespace = 3;
string runner_offering_id = 4;
repeated string repositories = 5;
string status = 6;
bool is_custom = 7;
// Resource specifications
string cpu = 8;
string memory = 9;
}
RunnerInstallation
message RunnerInstallation {
int64 installation_id = 1;
string workspace_id = 2;
string state = 3;
string github_account_type = 4;
bool is_service_enabled = 5;
}
RunnerOffering
message RunnerOffering {
string id = 1;
string name = 2;
string cpu = 3;
string memory = 4;
string image_repository = 5;
bool is_autoscale = 6;
}
Event Flow
1. GitHub App Installation
sequenceDiagram
participant GH as GitHub
participant GP as GitHub Proxy
participant T as Temporal
participant RS as Runner Service
participant K8s as Kubernetes
GH->>GP: installation.created
GP->>T: Start RunnerInstallWorkflow
T->>RS: Install signal
RS->>RS: Sync repositories
RS->>K8s: Create namespace
RS->>K8s: Deploy runners
RS->>GH: Installation complete
2. Workflow Job Execution
sequenceDiagram
participant GH as GitHub
participant GP as GitHub Proxy
participant T as Temporal
participant RS as Runner Service
participant ARC as Actions Controller
participant B as Billing
GH->>GP: workflow_job (queued)
GP->>T: Start GithubJobWorkflow
T->>RS: Create job record
T->>ARC: Forward job request
GH->>GP: workflow_job (completed)
T->>B: Create usage event
T->>RS: Update job metrics
Key Features
Multi-tenancy
- Workspace isolation: Each workspace has dedicated resources
- Project organization: Runners are scoped to projects
- Kubernetes namespaces: Physical isolation at infrastructure level
Custom Runners
- Container registry support: GCP, AWS, or custom registries
- Custom images: Build and manage custom runner images
- Resource configurations: Flexible CPU/memory specifications
Auto-scaling
- Horizontal Pod Autoscaler: Scale based on job queue
- Dynamic provisioning: Add runners based on repository activity
- Cost optimization: Scale down when idle
Observability
- Metrics collection: Job duration, success rates, queue times
- Workflow tracking: Complete visibility into CI/CD pipelines
- Performance analytics: P50/P90 latencies, failure analysis
Security Considerations
Authentication
- GitHub App: OAuth-based authentication
- Webhook validation: Signature verification on all events
- Token management: Secure storage in Kubernetes secrets
Authorization
- Workspace boundaries: Strict tenant isolation
- Repository access: Fine-grained permissions per runner
- RBAC integration: Keto-based permission system
Network Security
- Private networking: Runners in isolated VPCs
- Egress controls: Restricted outbound access
- TLS everywhere: Encrypted communication throughout
Operational Aspects
Monitoring
- Temporal UI: Workflow state and history
- Prometheus metrics: Resource usage and performance
- Application logs: Structured logging with trace IDs
Failure Handling
- Temporal retries: Automatic retry with exponential backoff
- Circuit breakers: Prevent cascading failures
- Manual recovery: Reset workflows for reconciliation
Maintenance
- Rolling updates: Zero-downtime deployments
- Database migrations: Version-controlled schema changes
- Backup strategies: Regular snapshots of critical data
Future Enhancements
- GPU Support: Enable ML/AI workloads
- Spot Instance Integration: Cost optimization with preemptible VMs
- Advanced Caching: Distributed cache for dependencies
- Windows Runners: Support for Windows-based workflows
- Enhanced Analytics: Deeper insights into CI/CD performance
Billing System Architecture
This document provides a comprehensive overview of Tenki Cloud's billing system, which handles usage-based billing, payment processing, and financial accounting for GitHub Actions runners.
Overview
Tenki Cloud's billing system is designed to provide accurate, reliable, and scalable billing for compute usage. It integrates multiple systems:
- TigerBeetle: High-performance financial database for double-entry bookkeeping
- Stripe: Payment processing and invoice generation
- Temporal: Workflow orchestration for billing cycles and retry logic
- PostgreSQL: Storage for billing metadata and history
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ GitHub Actions  │────▶│  Runner Service  │────▶│  Usage Events   │
│      Jobs       │     │                  │     │   (Database)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│     Stripe      │◀────│  Billing Service │◀────│    Temporal     │
│   (Payments)    │     │                  │     │    Workflows    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │   TigerBeetle   │
                        │  (Accounting)   │
                        └─────────────────┘
Core Components
1. Data Models
Customer
message Customer {
string id = 1;
string identity_id = 2;
string workspace_id = 3;
uint64 tb_account_id = 4; // TigerBeetle account
string stripe_customer_id = 5; // Stripe customer
string default_payment_method = 6;
bool has_payment_method = 7;
string payment_method_status = 8;
}
Invoice
message Invoice {
string id = 1;
string customer_id = 2;
string billing_period = 3; // YYYY-MM format
string status = 4; // draft, issued, paid, void
int64 amount = 5; // in cents
bytes pdf_content = 6;
string pdf_url = 7;
string stripe_invoice_id = 8;
int32 retry_count = 9;
}
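The `billing_period` field is a YYYY-MM string; resolving it to a concrete date range is straightforward with Go's stdlib reference layout (the helper name is illustrative, not the service's API):

```go
package main

import (
	"fmt"
	"time"
)

// periodBounds parses a YYYY-MM billing period and returns the UTC
// half-open interval [start, end) it covers.
func periodBounds(period string) (time.Time, time.Time, error) {
	start, err := time.Parse("2006-01", period)
	if err != nil {
		return time.Time{}, time.Time{}, err
	}
	return start, start.AddDate(0, 1, 0), nil
}

func main() {
	start, end, _ := periodBounds("2025-06")
	fmt.Println(start.Format("2006-01-02")) // 2025-06-01
	fmt.Println(end.Format("2006-01-02"))   // 2025-07-01
}
```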
UsageEvent
message UsageEvent {
string id = 1;
string workspace_id = 2;
string runner_id = 3;
google.protobuf.Timestamp started_at = 4;
google.protobuf.Timestamp finished_at = 5;
int64 seconds = 6;
string external_id = 7; // Idempotency key
uint64 tb_transfer_id = 8; // TigerBeetle transfer
}
2. TigerBeetle Accounting
The system uses double-entry bookkeeping with predefined accounts:
Fixed Accounts
- 1001 - `TENKI_RECEIVABLE`: Money owed to Tenki
- 1010 - `STRIPE_RECEIVABLE`: Money in Stripe
- 2001 - `USER`: Customer liability accounts
- 4001 - `REVENUE`: Income account
- 5010 - `STRIPE_FEE`: Payment processing fees
- 5020 - `MARKETING_EXPENSE`: Promotional credits
Transfer Types
- 1002 - `T_StripePayment`: Customer payments via Stripe
- 2001 - `T_RunnerCharge`: GitHub Actions runner usage charge
- 2002 - `T_RunnerPromoCreditUsage`: Promotional credit usage adjustment
- 2003 - `T_UsageReversal`: Reversal of negative usage charges
- 2010 - `T_ComputeCharge`: Charge for compute resources (future use)
- 3001 - `T_AccountSignup`: Initial signup bonus credit
- 3002 - `T_MonthlyFreeCredit`: Monthly free credit allowance
- 3003 - `T_PromoCredit`: General promotional credit
- 3004 - `T_PromoCreditReversal`: Reversal of promotional credits
Example Transactions
Usage Charge (Runner completes job):
Debit: USER (Customer Account) $5.00
Credit: REVENUE $5.00
Payment Received (Stripe payment):
Debit: STRIPE_RECEIVABLE $100.00
Credit: USER (Customer Account) $100.00
Debit: STRIPE_FEE $2.90
Credit: STRIPE_RECEIVABLE $2.90
Promotional Credit:
Debit: MARKETING_EXPENSE $10.00
Credit: USER (Customer Account) $10.00
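The postings above can be sketched as a toy double-entry ledger in micro-cents (1/1,000,000 of a cent, the precision used for usage charges). This is illustrative bookkeeping with a simplified sign convention, not the TigerBeetle client API:

```go
package main

import (
	"errors"
	"fmt"
)

// Ledger tracks the net balance of each account in micro-cents.
type Ledger struct {
	balances map[string]int64
}

func NewLedger() *Ledger { return &Ledger{balances: map[string]int64{}} }

// Post records a double-entry transfer: the debited and credited
// accounts move by the same amount, so all balances always sum to zero.
func (l *Ledger) Post(debit, credit string, microCents int64) error {
	if microCents <= 0 {
		return errors.New("amount must be positive")
	}
	l.balances[debit] += microCents
	l.balances[credit] -= microCents
	return nil
}

func main() {
	const dollar = 100 * 1_000_000 // $1.00 = 100 cents = 100,000,000 micro-cents

	l := NewLedger()
	l.Post("USER", "REVENUE", 5*dollar)             // usage charge
	l.Post("STRIPE_RECEIVABLE", "USER", 100*dollar) // payment received
	l.Post("MARKETING_EXPENSE", "USER", 10*dollar)  // promotional credit

	var sum int64
	for _, b := range l.balances {
		sum += b
	}
	fmt.Println(sum == 0) // true: the books balance by construction
}
```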
Financial Flow Sequences
The following sequence diagram illustrates the complete financial flows in the Tenki Cloud billing system, showing how money moves between different accounts through various transfer codes:
sequenceDiagram
participant USER as User/Customer
participant SIGNUP as Signup Process
participant GITHUB as GitHub Actions
participant BILLING as Billing Service
participant TB as TigerBeetle Ledger
participant STRIPE as Stripe
participant CYCLE as Billing Cycle
participant AUDIT as Audit System
Note over USER,AUDIT: Tenki Cloud Financial Flow System
%% Phase 1: Account Creation & Initial Credits
rect rgb(240, 248, 255)
Note left of USER: Phase 1: Account Setup & Signup Credits
USER->>SIGNUP: Create account
SIGNUP->>BILLING: Create customer account
BILLING->>TB: Create USER account (ACCOUNT_CODE_USER)
SIGNUP->>BILLING: Add signup bonus
BILLING->>TB: Transfer: T_AccountSignup<br/>MARKETING_EXPENSE → USER<br/>($10 signup credit)
end
%% Phase 2: Service Usage
rect rgb(255, 253, 240)
Note left of GITHUB: Phase 2: Service Usage & Charges
GITHUB->>BILLING: Job execution event
BILLING->>BILLING: Calculate usage cost
BILLING->>TB: Transfer: T_RunnerCharge<br/>USER → REVENUE<br/>(Usage charges)
end
%% Phase 3: Payment Processing
rect rgb(240, 255, 240)
Note left of CYCLE: Phase 3: Billing Cycle & Payments
Note over BILLING,TB: Step 1: Promotional credit adjustments
CYCLE->>BILLING: Start billing cycle
BILLING->>BILLING: Check promo credit usage for period
BILLING->>TB: Transfer: T_RunnerPromoCreditUsage<br/>REVENUE → MARKETING_EXPENSE<br/>(Move promo usage from revenue)
Note over BILLING,STRIPE: Step 2: Invoice generation
BILLING->>STRIPE: Create Stripe invoice
STRIPE->>USER: Send payment request
alt Payment Success
USER->>STRIPE: Make payment
STRIPE->>BILLING: Payment webhook
Note over BILLING,AUDIT: Payment Success Workflow
BILLING->>TB: Transfer: T_StripePayment<br/>STRIPE_RECEIVABLE → USER<br/>(Payment received)
BILLING->>TB: Transfer: T_MonthlyFreeCredit<br/>MARKETING_EXPENSE → USER<br/>($10/month free credit reset)
BILLING->>BILLING: Create payment record in database
BILLING->>AUDIT: Create billing audit record<br/>(compliance tracking)
else Payment Failed
STRIPE->>BILLING: Payment failed webhook
BILLING->>BILLING: Schedule retry attempts
BILLING->>BILLING: Start service interruption timer
end
end
Complete Transfer Code Reference
The system uses the following transfer codes for different types of financial transactions:
Payment & Withdrawal Operations (1000s)
- 1001 - `T_BankWithdrawal`: Cash withdrawal from bank account
- 1002 - `T_StripePayment`: Payment received from Stripe (invoice payment)
Service Charges (2000s)
- 2001 - `T_RunnerCharge`: Charge for GitHub Actions runner usage
- 2002 - `T_RunnerPromoCreditUsage`: Adjustment to move promotional credit usage from revenue to marketing expense
- 2003 - `T_UsageReversal`: Reversal of negative usage charges
- 2010 - `T_ComputeCharge`: Charge for compute resources (future use)
Credits & Bonuses (3000s)
- 3001 - `T_AccountSignup`: Initial signup bonus credit
- 3002 - `T_MonthlyFreeCredit`: Monthly free credit allowance (e.g., $10/month)
- 3003 - `T_PromoCredit`: General promotional credit (campaigns, support, etc.)
- 3004 - `T_PromoCreditReversal`: Reversal of promotional credits (corrections, violations, etc.)
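Rendered as Go constants, the thousands-band grouping becomes explicit (the const names and helper are illustrative renderings of the identifiers above, not the actual source):

```go
package main

import "fmt"

// Transfer codes, grouped by thousands band.
const (
	TBankWithdrawal         = 1001
	TStripePayment          = 1002
	TRunnerCharge           = 2001
	TRunnerPromoCreditUsage = 2002
	TUsageReversal          = 2003
	TComputeCharge          = 2010
	TAccountSignup          = 3001
	TMonthlyFreeCredit      = 3002
	TPromoCredit            = 3003
	TPromoCreditReversal    = 3004
)

// codeFamily classifies a transfer code by its thousands band.
func codeFamily(code int) string {
	switch code / 1000 {
	case 1:
		return "payments"
	case 2:
		return "service charges"
	case 3:
		return "credits & bonuses"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(codeFamily(TRunnerCharge))      // service charges
	fmt.Println(codeFamily(TMonthlyFreeCredit)) // credits & bonuses
}
```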
Key Financial Flow Patterns
1. Customer Onboarding: New users receive signup credits (`T_AccountSignup`) and monthly free credits (`T_MonthlyFreeCredit`) from the marketing expense account.
2. Usage Billing: GitHub Actions runner usage generates charges (`T_RunnerCharge`) that move money from customer accounts to revenue.
3. Promotional Credit Accounting: When promotional credits are used for services, the system adjusts by moving the equivalent amount from revenue back to marketing expense (`T_RunnerPromoCreditUsage`).
4. Payment Processing: Customer payments through Stripe (`T_StripePayment`) add funds to customer accounts from the Stripe receivable account.
5. Administrative Corrections: The system supports reversals for both usage charges (`T_UsageReversal`) and promotional credits (`T_PromoCreditReversal`) for corrections and violations.
3. Billing Service
The billing service provides APIs for:
- Customer Management: Creating and retrieving billing customers
- Balance Operations: Checking workspace credits/debits
- Invoice Management: Generating and managing monthly invoices
- Usage Tracking: Recording compute usage events
- Payment Methods: Managing cards and payment details
- Stripe Integration: Setup intents and billing portal
Key service methods:
// Record runner usage
RecordUsage(ctx, workspaceID, runnerID, startTime, endTime)
// Process monthly billing
ProcessInvoiceAndCharge(ctx, workspaceID, billingPeriod)
// Add promotional credits
AddPromotionalCredits(ctx, workspaceID, amount, description)
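A sketch of the cost math behind `RecordUsage`, assuming a per-second rate held in micro-cents; the rate and helper names are illustrative, not Tenki's actual pricing or API:

```go
package main

import (
	"fmt"
	"time"
)

// costForSeconds returns the charge for a run of the given length at a
// per-second rate, both kept in micro-cents to avoid rounding loss.
func costForSeconds(seconds, ratePerSecondMicroCents int64) int64 {
	if seconds < 0 {
		seconds = 0
	}
	return seconds * ratePerSecondMicroCents
}

// usageCost derives billable seconds from the started_at/finished_at
// timestamps recorded on a UsageEvent.
func usageCost(start, end time.Time, ratePerSecondMicroCents int64) int64 {
	return costForSeconds(int64(end.Sub(start).Seconds()), ratePerSecondMicroCents)
}

func main() {
	start := time.Date(2025, 6, 12, 10, 0, 0, 0, time.UTC)
	end := start.Add(90 * time.Second)
	// 1,000 micro-cents/second is an illustrative rate.
	fmt.Println(usageCost(start, end, 1_000)) // 90000
}
```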
Workflow Orchestration
1. Billing Cycle Workflow
Runs monthly for each workspace:
flowchart TD
A[Start Monthly Billing] --> B[Generate Stripe Invoice]
B --> C[Send Invoice Email]
C --> D{Amount > 0?}
D -->|Yes| E[Charge Payment Method]
D -->|No| F[Complete]
E --> G{Payment Success?}
G -->|Yes| H[Payment Succeeded Workflow]
G -->|No| I[Payment Failed Workflow]
H --> F
I --> F
2. Payment Processing Workflows
Payment Succeeded:
- Record payment in TigerBeetle
- Create payment record in database
- Update invoice status
Payment Failed:
- Send failure notification
- Schedule retry attempts (max 5)
- Start service interruption timer
3. Retry Logic
Failed payments are retried with exponential backoff:
- Retry 1: 3 days later
- Retry 2: 5 days later
- Retry 3: 7 days later
- Retry 4: 14 days later
- Retry 5: 21 days later
If all retries fail by the 9th of the following month, services are suspended on the 10th.
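The schedule above reduces to a lookup; this sketch is for illustration, while the production logic lives in the Temporal retry workflow:

```go
package main

import "fmt"

// retryDelayDays returns how many days after the previous failure
// attempt N is retried (attempts 1-5), or ok=false once retries are
// exhausted and suspension follows.
func retryDelayDays(attempt int) (days int, ok bool) {
	delays := []int{3, 5, 7, 14, 21}
	if attempt < 1 || attempt > len(delays) {
		return 0, false
	}
	return delays[attempt-1], true
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		d, ok := retryDelayDays(attempt)
		fmt.Println(attempt, d, ok)
	}
}
```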
4. Credit Management
Long-running workflow that handles credit operations via signals:
- `AddPromotionalCredits`: Adds credits to a workspace
- `DeductPromotionalCredits`: Removes credits
- Maintains audit trail in TigerBeetle
Usage Flow
1. Recording Usage
When a GitHub Actions job completes:
sequenceDiagram
participant Job as GitHub Job
participant Runner as Runner Service
participant Billing as Billing Service
participant TB as TigerBeetle
Job->>Runner: Job completed
Runner->>Billing: Record usage event
Billing->>Billing: Calculate cost
Billing->>TB: Create usage transfer
TB->>TB: Debit user account
TB->>TB: Credit revenue account
Billing->>Runner: Usage recorded
2. Monthly Billing
At the start of each month:
sequenceDiagram
participant Temporal
participant Billing as Billing Service
participant Stripe
participant Customer
Temporal->>Billing: Start billing cycle
Billing->>Billing: Calculate usage for month
Billing->>Stripe: Create invoice
Stripe->>Customer: Send invoice email
Billing->>Stripe: Charge payment method
alt Payment successful
Stripe->>Billing: Payment confirmed
Billing->>Billing: Record in TigerBeetle
else Payment failed
Stripe->>Billing: Payment failed
Billing->>Temporal: Schedule retry
end
Key Features
Precision Accounting
- All amounts stored as micro-cents (1/1,000,000 of a cent)
- Prevents rounding errors in usage calculations
- Supports high-frequency micro-transactions
Idempotency
- External IDs prevent duplicate usage records
- Workflow IDs ensure single execution
- TigerBeetle provides transaction guarantees
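External-ID deduplication can be sketched with an in-memory set; the service itself stores `external_id` on the UsageEvent, while this toy version only illustrates the accept-once semantics:

```go
package main

import "fmt"

// UsageRecorder bills each external_id at most once, so a redelivered
// webhook never double-charges a job.
type UsageRecorder struct {
	seen  map[string]bool
	total int64 // total billed seconds
}

func NewUsageRecorder() *UsageRecorder {
	return &UsageRecorder{seen: map[string]bool{}}
}

// Record returns false when the event was already processed.
func (r *UsageRecorder) Record(externalID string, seconds int64) bool {
	if r.seen[externalID] {
		return false
	}
	r.seen[externalID] = true
	r.total += seconds
	return true
}

func main() {
	r := NewUsageRecorder()
	fmt.Println(r.Record("job-123", 90)) // true: first delivery
	fmt.Println(r.Record("job-123", 90)) // false: duplicate webhook
	fmt.Println(r.total)                 // 90
}
```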
Audit Trail
- Every financial transaction recorded in TigerBeetle
- Complete history of charges, payments, and credits
- Immutable ledger for compliance
Self-Service
- Stripe billing portal for payment method management
- Invoice history and downloads
- Usage reports by billing period
Graceful Degradation
- Billing continues even if Stripe is unavailable
- TigerBeetle ensures accounting accuracy
- Workflows retry transient failures
Security Considerations
Payment Security
- No credit card data stored in Tenki systems
- All payment processing through PCI-compliant Stripe
- Secure token-based payment method references
Access Control
- Workspace-scoped billing operations
- Admin-only credit management
- Audit logs for all financial operations
Data Protection
- Encrypted storage for sensitive data
- TLS for all external communications
- Regular backups of financial data
Operational Aspects
Monitoring
- Temporal workflow status for billing cycles
- TigerBeetle consistency checks
- Stripe webhook processing metrics
- Failed payment alerts
Troubleshooting
- Workflow history in Temporal UI
- TigerBeetle account balances
- Stripe dashboard for payment issues
- Database queries for usage history
Common Issues
- Payment failures: Check Stripe logs and retry status
- Missing usage: Verify runner job completion events
- Balance discrepancies: Audit TigerBeetle transfers
- Invoice generation: Check Temporal workflow status
Future Enhancements
- Volume Discounts: Tiered pricing based on usage
- Prepaid Packages: Bulk minute purchases
- Cost Alerts: Notifications for spending thresholds
- Multi-Currency: Support for international customers
- Advanced Analytics: Detailed cost breakdowns by repository/workflow
Architecture Decision Records
This directory contains Architecture Decision Records (ADRs) - documents that capture important architectural decisions made during the development of Tenki Cloud.
What is an ADR?
An ADR is a document that captures an important architectural decision made along with its context and consequences. Each ADR describes a single decision and is immutable once accepted.
ADR Template
# ADR-XXX: Title
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-YYY]
## Context
What is the issue that we're seeing that is motivating this decision or change?
## Decision
What is the change that we're proposing and/or doing?
## Consequences
What becomes easier or more difficult to do because of this change?
### Positive
- List of positive consequences
### Negative
- List of negative consequences
## Alternatives Considered
What other options were evaluated and why were they rejected?
Current ADRs
- ADR-001: Monorepo Structure - Using monorepo for all services
- ADR-002: Temporal for Workflow Orchestration - Workflow engine choice
- ADR-003: Connect RPC over REST - API protocol decision
Creating a New ADR
- Copy the template above
- Create a new file: `XXX-short-description.md` (increment XXX)
- Fill out all sections
- Submit PR for review
- Once accepted, the ADR becomes immutable
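Incrementing XXX can be automated; a small illustrative helper (not part of the repo tooling) that derives the next number from existing filenames:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextADRNumber scans "XXX-short-description.md" filenames and returns
// the next zero-padded number.
func nextADRNumber(filenames []string) string {
	max := 0
	for _, f := range filenames {
		prefix, _, ok := strings.Cut(f, "-")
		if !ok {
			continue
		}
		if n, err := strconv.Atoi(prefix); err == nil && n > max {
			max = n
		}
	}
	return fmt.Sprintf("%03d", max+1)
}

func main() {
	files := []string{"001-monorepo.md", "002-temporal.md", "003-connect-rpc.md"}
	fmt.Println(nextADRNumber(files)) // 004
}
```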
When to Write an ADR
Write an ADR when:
- Selecting key technologies (databases, frameworks, protocols)
- Defining major architectural patterns
- Making security decisions
- Choosing between significant alternatives
- Deprecating existing patterns
ADR Lifecycle
- Proposed - Under discussion
- Accepted - Decision made and being implemented
- Deprecated - No longer recommended but still in use
- Superseded - Replaced by another ADR
ADR-001: Monorepo Structure
Status
Accepted (2024-01-15)
Context
Tenki Cloud consists of multiple interconnected services:
- Frontend applications (Next.js web app, future mobile apps)
- Backend services (Go microservices)
- Shared packages (TypeScript utilities, proto definitions)
- Infrastructure code (Kubernetes manifests, Terraform)
We need a repository structure that:
- Enables code sharing between services
- Ensures coordinated deployments
- Maintains clear boundaries between services
- Provides good developer experience
Decision
We will use a monorepo structure with:
- pnpm workspaces for TypeScript/JavaScript projects
- Go modules with `replace` directives for Go services
- Turborepo for orchestrated builds
- Shared tooling across all services
Repository structure:
tenki.app/
├── apps/       # Deployable applications
├── backend/    # Go services
├── packages/   # Shared libraries
├── proto/      # Protocol buffer definitions
└── infra/      # Infrastructure code
Consequences
Positive
- Atomic changes - Features spanning multiple services can be implemented in a single commit
- Shared tooling - Linting, formatting, and testing tools configured once
- Simplified dependencies - No need for private package registries
- Consistent versioning - All services released together
- Easier refactoring - Moving code between services is straightforward
- Single source of truth - Proto definitions shared directly
Negative
- Larger repository - Clone and fetch times increase over time
- Complex CI/CD - Need to determine which services to build/deploy
- Steeper learning curve - New developers must understand entire structure
- Potential for coupling - Easier to create inappropriate dependencies
- Tooling requirements - Requires pnpm, Go, and other tools installed
Alternatives Considered
1. Separate Repositories
Rejected because:
- Coordination overhead for cross-service changes
- Dependency version management complexity
- Need for private package registry
- Difficult to maintain API contracts
2. Git Submodules
Rejected because:
- Poor developer experience
- Complex update workflows
- Easy to get into inconsistent states
- Limited tool support
3. Lerna (instead of Turborepo)
Rejected because:
- Turborepo has better performance
- Native pnpm workspace support
- Better caching mechanisms
- Simpler configuration
Implementation Notes
- Use `pnpm` filters for targeted operations:
  pnpm -F app dev           # Run only app
  pnpm -F "backend/*" test  # Test all backend
- Go services use local `replace`:
  replace github.com/luxorlabs/proto => ../../proto
- CI uses Turborepo caching:
  { "pipeline": { "build": { "cache": true } } }
ADR-002: Temporal Workflows
Status
Accepted
Context
We need a reliable workflow orchestration system for managing complex, long-running processes like GitHub runner lifecycle management, billing operations, and asynchronous tasks.
Decision
We will use Temporal for workflow orchestration because it provides:
- Durable execution with automatic retries
- Built-in error handling and compensation
- Strong consistency guarantees
- Visibility into workflow state and history
- Language-specific SDKs with good Go support
Consequences
Positive
- Reliable execution of critical business processes
- Built-in observability and debugging capabilities
- Simplified error handling for distributed operations
- Ability to handle long-running workflows (hours/days)
Negative
- Additional infrastructure to maintain
- Learning curve for developers new to Temporal
- Potential vendor lock-in for workflow logic
Implementation
Temporal workflows will be used for:
- GitHub runner provisioning and lifecycle management
- Billing and subscription management
- Asynchronous job processing
- Scheduled maintenance tasks
ADR-003: gRPC Gateway
Status
Accepted
Context
We need to expose our internal gRPC services to web clients that don't support gRPC directly. We also want to maintain a single source of truth for our API definitions while supporting both gRPC and REST/JSON clients.
Decision
We will use grpc-gateway to automatically generate a RESTful HTTP API from our gRPC service definitions. This allows us to:
- Maintain a single API definition in protobuf
- Support both gRPC and REST clients
- Auto-generate OpenAPI documentation
- Preserve strong typing across the stack
Consequences
Positive
- Single source of truth for API definitions
- Automatic REST API generation from protobuf
- Built-in OpenAPI/Swagger documentation
- Consistent API behavior between gRPC and REST
- Strong typing preserved through code generation
Negative
- Additional build step for gateway generation
- Some gRPC features don't map perfectly to REST
- Slightly increased complexity in the API layer
- Need to carefully design protos for good REST mappings
Implementation
The grpc-gateway will:
- Run as a reverse proxy in front of gRPC services
- Translate HTTP/JSON requests to gRPC
- Use protobuf annotations for REST endpoint configuration
- Generate OpenAPI specs for documentation
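As an illustrative sketch (the service and message names here are hypothetical, not taken from our actual protos), a REST mapping is declared directly on the gRPC method via a `google.api.http` annotation, which grpc-gateway uses to generate the HTTP handler:

```protobuf
syntax = "proto3";

package tenki.cloud.workspace.v1;

import "google/api/annotations.proto";

service ProjectService {
  // grpc-gateway serves GET /v1/projects/{project_id} and translates
  // it into this gRPC call, binding the path segment to project_id.
  rpc GetProject(GetProjectRequest) returns (GetProjectResponse) {
    option (google.api.http) = {
      get: "/v1/projects/{project_id}"
    };
  }
}

message GetProjectRequest {
  string project_id = 1;
}

message GetProjectResponse {
  string name = 1;
}
```

Field names referenced in the URL template must match proto field names exactly, which is one reason protos need careful design for good REST mappings.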
Architecture Diagrams
This directory contains architectural diagrams for the Tenki Cloud platform.
Overview
Our architecture diagrams use Mermaid for easy maintenance and version control. Each diagram is stored as a .md file with embedded Mermaid syntax.
Available Diagrams
- System Overview: High-level view of all components
- Data Flow: How data moves through the system
- Deployment Architecture: Infrastructure and deployment topology
- Security Model: Authentication and authorization flows
Creating New Diagrams
1. Create a new `.md` file in this directory
2. Use Mermaid syntax for the diagram
3. Include a description of what the diagram represents
4. Update this README with a link to the new diagram
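For example, a minimal Mermaid flowchart (the component names here are illustrative, not a normative architecture) looks like this in a fenced block:

```mermaid
flowchart LR
    Browser -->|HTTPS/JSON| Gateway[grpc-gateway]
    Gateway -->|gRPC| Engine[engine service]
    Engine --> DB[(PostgreSQL)]
```

Keeping diagrams as text like this is what makes them diffable and reviewable in pull requests.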
Viewing Diagrams
These diagrams are rendered automatically in:
- GitHub markdown preview
- Our documentation site (mdBook)
- Most modern markdown editors
Mermaid Resources
Feature Specification: New Pricing + Free Credits Policy
Feature Branch: 001-new-pricing
Created: 2025-10-22
Status: Draft
Input: User description: "New Pricing + Free Credits Policy"
User Scenarios & Testing (mandatory)
User Story 1 - New User Onboarding with Free Credits (Priority: P1)
A new user signs up for Tenki and receives 1,000 free minutes (normalized to 2 vCPU runners) to explore the platform without requiring payment information upfront. They can access all features and offerings during this trial period.
Why this priority: This is the primary entry point for all new users and directly addresses the problem of acquiring users while deferring payment collection until value is demonstrated.
Independent Test: Can be fully tested by creating a new account, running jobs on various runner sizes, and verifying that free minutes are properly calculated and consumed based on vCPU scaling (e.g., 500 minutes on 4 vCPU, 250 minutes on 8 vCPU).
Acceptance Scenarios:
- Given a new user signs up for Tenki, When their account is created, Then they receive 1,000 free minutes normalized to 2 vCPU runners
- Given a user has free minutes remaining, When they use a 2 vCPU runner for 10 minutes, Then 10 minutes are deducted from their balance
- Given a user has free minutes remaining, When they use a 4 vCPU runner for 10 minutes, Then 20 minutes are deducted from their balance (scaled by vCPU ratio)
- Given a user has free minutes remaining, When they use an 8 vCPU runner for 10 minutes, Then 40 minutes are deducted from their balance (scaled by vCPU ratio)
- Given a new user with free credits, When they access the platform, Then they can use all features and runner types without restrictions
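The vCPU normalization in these scenarios reduces to a single multiplication against the 2 vCPU baseline. A minimal Go sketch (the function name is ours for illustration, not from the codebase):

```go
package main

import "fmt"

// FreeMinutesConsumed returns how many free minutes a job consumes,
// normalized to the 2 vCPU baseline: a 4 vCPU runner burns minutes
// at 2x the rate, an 8 vCPU runner at 4x.
func FreeMinutesConsumed(runtimeMinutes, vcpus int) int {
	return runtimeMinutes * vcpus / 2
}

func main() {
	for _, vcpus := range []int{2, 4, 8} {
		fmt.Printf("10 min on %d vCPU consumes %d free minutes\n",
			vcpus, FreeMinutesConsumed(10, vcpus))
	}
}
```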
User Story 2 - Payment Information Collection After Free Credits (Priority: P1)
When a user exhausts their 1,000 free minutes, the system prompts them to enter credit card information to continue using the platform on a pay-as-you-go basis.
Why this priority: This is the critical conversion point from free trial to paid customer and directly addresses the problem of verifying payment intent before allowing continued usage.
Independent Test: Can be tested by consuming all free minutes and verifying that the system blocks further usage until valid payment information is provided, then allows continued usage after payment details are entered.
Acceptance Scenarios:
- Given a user has consumed all 1,000 free minutes, When they attempt to run a new job, Then they are prompted to enter credit card information before proceeding
- Given a user is prompted for payment, When they enter valid credit card details, Then their account transitions to pay-as-you-go billing and jobs can proceed
- Given a user is prompted for payment, When they close the prompt without entering payment details, Then their jobs remain blocked until payment is provided
- Given a user has entered payment information, When they consume additional minutes, Then usage is tracked and billed according to PAYG pricing
User Story 3 - Pay-As-You-Go Usage and Billing (Priority: P2)
A paid user runs CI/CD jobs on various runner types and is charged per-minute based on the runner SKU pricing. They receive transparent billing for their actual usage with no upfront commitments.
Why this priority: This is the core revenue model for the platform and must work reliably for sustainable business operations.
Independent Test: Can be tested by running jobs on different runner SKUs, verifying per-minute charges match the pricing table, and confirming accurate invoice generation.
Acceptance Scenarios:
- Given a paid user runs a job on a 2c-4GB x64 runner for 10 minutes, When billing is calculated, Then they are charged $0.03 (10 min Γ $0.003/min)
- Given a paid user runs a job on a 4c-8GB x64 runner for 15 minutes, When billing is calculated, Then they are charged $0.09 (15 min Γ $0.006/min)
- Given a paid user with 40 concurrent jobs included, When they run 40 or fewer concurrent jobs, Then no additional concurrency charges apply
- Given a paid user, When they view their billing dashboard, Then they see itemized usage by runner type, duration, and total costs
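The per-minute charges above can be computed in integer micro-dollars to avoid float rounding errors; partial minutes are rounded up here, matching the assumption stated later in this spec. The representation and function name are our choices for illustration:

```go
package main

import "fmt"

// UsageChargeMicroDollars bills a job at ratePerMinute expressed in
// micro-dollars (e.g. $0.003/min == 3000), rounding partial minutes up.
func UsageChargeMicroDollars(durationSeconds, ratePerMinute int64) int64 {
	billedMinutes := (durationSeconds + 59) / 60 // round up partial minutes
	return billedMinutes * ratePerMinute
}

func main() {
	// 10 minutes on a 2c-4GB x64 runner at $0.003/min -> $0.03
	fmt.Println(UsageChargeMicroDollars(600, 3000)) // 30000 micro-dollars
	// 15 minutes on a 4c-8GB x64 runner at $0.006/min -> $0.09
	fmt.Println(UsageChargeMicroDollars(900, 6000)) // 90000 micro-dollars
}
```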
User Story 4 - Add-On Purchase and Management (Priority: P2)
A user on the PAYG plan purchases optional add-ons such as macOS M4 runner access, priority support, or priority queue boost to enhance their experience.
Why this priority: Add-ons provide upsell opportunities and feature-based segmentation, addressing the problem of revenue expansion and feature monetization.
Independent Test: Can be tested by purchasing an add-on (e.g., macOS access for $39/month), verifying access is granted, and confirming the recurring charge appears on invoices.
Acceptance Scenarios:
- Given a PAYG user, When they purchase macOS M4 runner access for $39/month, Then they can create and run jobs on macOS runners
- Given a user without macOS access, When they attempt to use macOS runners, Then they are prompted to purchase the add-on
- Given a user purchases priority support for $250/month, When they submit a support request, Then it is routed to the priority queue with private chat access
- Given a user purchases priority queue boost for $49/month per workspace, When their jobs are queued, Then they receive higher priority in job scheduling
- Given a user with add-ons, When they view their billing, Then add-on charges are itemized separately from usage charges
User Story 5 - Additional Concurrent Job Slot Purchase (Priority: P3)
A user exceeding the 40 included concurrent job slots purchases additional slots at $7/slot/month for x64 runners or $49/slot/month for macOS M4 runners.
Why this priority: This supports teams with high parallelism needs and provides incremental revenue, but is less critical than core pricing and add-ons.
Independent Test: Can be tested by running more than 40 concurrent jobs, purchasing additional slots, and verifying jobs execute in parallel up to the new limit.
Acceptance Scenarios:
- Given a user with 40 included concurrent slots, When they attempt to run 50 concurrent jobs, Then 10 jobs are queued until slots become available
- Given a user, When they purchase 10 additional x64 concurrent slots, Then they are charged $70/month and can run up to 50 concurrent x64 jobs
- Given a user, When they purchase 5 additional macOS M4 concurrent slots, Then they are charged $245/month and can run up to 45 concurrent macOS jobs (assuming base 40 applies to all runner types)
User Story 6 - Storage Billing (Priority: P3)
A user stores build artifacts, caches, and other data on the platform and is billed $0.20 per GB per month for storage consumption.
Why this priority: Storage is a necessary cost component but secondary to compute billing in terms of implementation priority and revenue impact.
Independent Test: Can be tested by uploading data, tracking storage usage over time, and verifying charges match $0.20/GB/month prorated.
Acceptance Scenarios:
- Given a user stores 50 GB of data, When monthly billing is calculated, Then they are charged $10 for storage
- Given a user uploads 20 GB on day 15 of the month, When monthly billing is calculated, Then they are charged approximately $2 (prorated for half month)
- Given a user has 10 GB of transparent cache included, When they use 10 GB or less total storage, Then no storage charges apply [NEEDS CLARIFICATION: Is the 10GB transparent cache counted toward the $0.20/GB storage billing, or is it separate?]
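One way to realize this proration is to average daily storage snapshots over the billing period and bill the average at $0.20/GB, as the Assumptions section below also suggests. The helper is a sketch under that assumption:

```go
package main

import "fmt"

// StorageChargeCents averages daily storage snapshots (in GB) over the
// billing period and bills the average at $0.20/GB/month (20 cents).
func StorageChargeCents(dailyGB []int) int {
	if len(dailyGB) == 0 {
		return 0
	}
	total := 0
	for _, gb := range dailyGB {
		total += gb
	}
	return total * 20 / len(dailyGB)
}

func main() {
	// 20 GB uploaded on day 15 of a 30-day month: 15 days at 0 GB,
	// then 15 days at 20 GB -> average 10 GB -> $2.00.
	days := make([]int, 30)
	for i := 15; i < 30; i++ {
		days[i] = 20
	}
	fmt.Println(StorageChargeCents(days)) // 200 cents
}
```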
User Story 7 - Enterprise Custom Pricing Inquiry (Priority: P3)
An enterprise customer with predictable, high-volume usage requests committed use discounts and white-glove onboarding through a sales inquiry process.
Why this priority: Enterprise deals provide revenue predictability and larger contracts, but represent a smaller percentage of total users and require sales team involvement.
Independent Test: Can be tested by submitting an enterprise inquiry form, receiving a response from sales, and negotiating custom pricing terms outside the automated PAYG system.
Acceptance Scenarios:
- Given a user interested in enterprise pricing, When they submit an inquiry, Then they are contacted by the sales team within [NEEDS CLARIFICATION: response SLA not specified - 24 hours? 48 hours?]
- Given an enterprise customer commits to a minimum usage level, When their contract is established, Then they receive discounted per-minute rates compared to PAYG
- Given an enterprise customer, When they onboard, Then they receive dedicated success engineer support for migration and setup
User Story 8 - Premium Runner Pricing (Priority: P3)
A user opts to use premium runners (indicated by βPremium Pricingβ in the SKU table) and is charged an additional fee on top of the base runner cost.
Why this priority: Premium runners provide differentiated service levels but are an optional enhancement to the base offering.
Independent Test: Can be tested by selecting a premium runner option, running a job, and verifying the charge includes the base price plus the premium surcharge.
Acceptance Scenarios:
- Given a user runs a job on a premium 2c-4GB runner for 10 minutes, When billing is calculated, Then they are charged $0.045 (10 min Γ ($0.003 + $0.0015)/min)
- Given a user runs a job on a premium 4c-8GB runner for 10 minutes, When billing is calculated, Then they are charged $0.090 (10 min Γ ($0.006 + $0.003)/min)
- Given a user, When they select a runner, Then they can choose between standard and premium options with clear pricing displayed [NEEDS CLARIFICATION: What specific benefits do premium runners provide - faster provisioning, dedicated resources, better SLA?]
Edge Cases
- What happens when a user's payment method fails after exhausting free credits? Are jobs blocked immediately or is there a grace period?
- How are partial minutes billed (e.g., a job that runs for 3.5 minutes)?
- What happens if a user deletes stored data mid-month? Is storage billing prorated daily?
- How are concurrent job limits enforced when a user has both x64 and macOS runners? Are the limits separate or combined?
- What happens when a user downgrades or cancels add-ons mid-billing cycle? Do they receive prorated refunds or credits?
- How are discounts (up to 50% offered by sales) applied to the billing system? Are they percentage discounts or fixed credits?
- What happens when a user exhausts free credits in the middle of a running job? Is the job terminated or allowed to complete?
- How is abuse detection handled for users who repeatedly create new accounts to exploit free credits?
Requirements (mandatory)
Functional Requirements
Free Credits System
- FR-001: System MUST allocate 1,000 free minutes (normalized to 2 vCPU runners) to all new user accounts upon creation
- FR-002: System MUST scale free minute consumption based on runner vCPU count (e.g., 4 vCPU uses 2Γ minutes, 8 vCPU uses 4Γ minutes)
- FR-003: System MUST track free minute balance in real-time and display remaining balance to users
- FR-004: System MUST allow users with free minutes to access all runner types and platform features without restrictions
- FR-005: System MUST prevent job execution when free minutes are exhausted and payment information has not been provided
Payment Collection
- FR-006: System MUST prompt users to enter credit card information when free minutes are exhausted
- FR-007: System MUST validate and securely store payment information using industry-standard tokenization
- FR-008: System MUST transition user accounts from free trial to PAYG billing status after payment information is collected
- FR-009: System MUST block job execution for users who decline to provide payment information after exhausting free credits
Pay-As-You-Go Billing
- FR-010: System MUST calculate per-minute charges for all runner types according to the defined pricing table (x64, macOS, premium)
- FR-011: System MUST track actual usage time for each job execution down to the minute
- FR-012: System MUST generate itemized invoices showing usage by runner type, duration, and cost
- FR-013: System MUST charge payment methods on a monthly billing cycle for accumulated usage
- FR-014: System MUST include 40 concurrent job slots in all PAYG accounts at no additional charge
- FR-015: System MUST include 10 GB of transparent caching in all PAYG accounts at no additional charge
Add-On Management
- FR-016: System MUST allow users to purchase macOS M4 runner access for $39/month per workspace
- FR-017: System MUST allow users to purchase priority support for $250/month
- FR-018: System MUST allow users to purchase priority queue boost for $49/month per workspace
- FR-019: System MUST restrict access to add-on features until the corresponding add-on is purchased
- FR-020: System MUST bill add-on charges as recurring monthly fees separate from usage charges
- FR-021: System MUST allow users to enable, disable, or modify add-ons at any time
- FR-022: System MUST grant macOS runner access only to users with the macOS add-on active
- FR-023: System MUST route support requests to priority queue for users with priority support add-on
- FR-024: System MUST prioritize job scheduling for workspaces with priority queue boost add-on
Concurrent Job Slot Management
- FR-025: System MUST allow users to purchase additional concurrent job slots at $7/slot/month for x64 runners
- FR-026: System MUST allow users to purchase additional concurrent job slots at $49/slot/month for macOS M4 runners
- FR-027: System MUST enforce concurrent job limits based on base allocation plus purchased slots
- FR-028: System MUST queue jobs that exceed concurrent slot limits until slots become available
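FR-027 and FR-028 amount to a counting semaphore sized at the base allocation plus purchased slots. A buffered-channel sketch in Go (names and structure are illustrative, not the actual scheduler):

```go
package main

import "fmt"

// SlotLimiter enforces a concurrent job cap of base + purchased slots.
// Jobs that fail TryAcquire would be queued until Release frees a slot.
type SlotLimiter struct {
	slots chan struct{}
}

func NewSlotLimiter(base, purchased int) *SlotLimiter {
	return &SlotLimiter{slots: make(chan struct{}, base+purchased)}
}

// TryAcquire claims a slot without blocking; false means "queue the job".
func (l *SlotLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot so a queued job can run.
func (l *SlotLimiter) Release() { <-l.slots }

func main() {
	// 40 included slots + 10 purchased x64 slots = cap of 50.
	lim := NewSlotLimiter(40, 10)
	running := 0
	for i := 0; i < 55; i++ {
		if lim.TryAcquire() {
			running++
		}
	}
	fmt.Println(running) // 50 run; the remaining 5 would be queued
}
```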
Storage Billing
- FR-029: System MUST track total storage consumption for each user account across artifacts, caches, and data
- FR-030: System MUST bill storage at $0.20 per GB per month
- FR-031: System MUST calculate storage billing based on average daily usage over the billing period
- FR-032: System MUST display current storage usage and projected monthly costs to users
Runner Pricing
- FR-033: System MUST support all x64 runner SKUs with specified pricing (2c-4GB through 64c-256GB)
- FR-034: System MUST support macOS runner SKUs with specified pricing (6 vCPU, 12 vCPU)
- FR-035: System MUST support premium pricing tier for eligible x64 runner SKUs with additional charges
- FR-036: System MUST clearly display runner pricing to users when selecting runner types
Enterprise Tier
- FR-037: System MUST provide a mechanism for users to request enterprise pricing and custom contracts
- FR-038: System MUST support custom pricing configurations for enterprise accounts with committed use discounts
- FR-039: System MUST allow sales team to configure account-specific discounts up to 50%
- FR-040: System MUST support white-glove onboarding workflows for enterprise customers
Annual Prepayment Options (Internal)
- FR-041: System MUST support 12-month prepayment for priority queue boost at $499 (15% discount)
- FR-042: System MUST support 12-month prepayment for macOS M4 access at $399 (15% discount)
- FR-043: System MUST apply prepaid add-ons to user accounts for 12-month duration
Abuse Prevention
- FR-044: System MUST implement mechanisms to detect and prevent abuse patterns (repeated free credit exploitation, cryptocurrency mining, unauthorized Minecraft servers)
- FR-045: System MUST require payment information as a verification gate to prevent abusive users from continuing operations
Key Entities
- User Account: Represents an individual or organization using Tenki, with free credit balance, payment status, billing tier, and usage history
- Free Credit Balance: The remaining free minutes available to a user, normalized to 2 vCPU baseline, consumed based on runner vCPU scaling
- Payment Method: Tokenized credit card information associated with a user account for billing purposes
- Add-On Subscription: A purchased add-on feature (macOS access, priority support, priority queue boost) with recurring billing
- Concurrent Job Slot: Allocated capacity for running parallel jobs, includes base allocation plus purchased additional slots
- Runner SKU: A specific runner configuration (vCPU, memory) with associated per-minute pricing
- Usage Record: A log of job execution including runner type, duration, and calculated cost
- Invoice: A monthly billing statement showing itemized usage charges, add-on fees, and total amount due
- Enterprise Contract: A custom pricing agreement with committed use discounts and negotiated terms
- Storage Allocation: The amount of data stored by a user, tracked for billing at $0.20/GB/month
- Workspace: An organizational unit within a user account, relevant for workspace-specific add-ons (priority queue boost, macOS access)
Success Criteria (mandatory)
Measurable Outcomes
- SC-001: 90% of new users successfully start their first job using free credits within 24 hours of signup
- SC-002: Free credit system accurately scales minute consumption across all runner SKU types with 100% precision
- SC-003: Payment conversion rate from free to paid users reaches at least 15% within 30 days of signup
- SC-004: Billing calculations are accurate to the cent with zero disputes related to calculation errors in the first 90 days
- SC-005: Users can view real-time usage and cost projections with data latency under 5 minutes
- SC-006: Add-on purchases are reflected in user accounts and billing within 60 seconds of confirmation
- SC-007: Abuse detection mechanisms block at least 95% of identified abusive patterns (mining, unauthorized servers) within 24 hours of detection
- SC-008: Enterprise inquiry response time averages under 24 hours during business days
- SC-009: Concurrent job limits are enforced in real-time with zero jobs exceeding purchased slot allocation
- SC-010: Monthly revenue predictability improves by at least 30% through enterprise contracts and add-on subscriptions within 6 months of launch
- SC-011: Customer support tickets related to billing and pricing decrease by 40% compared to the previous pricing model within 3 months
- SC-012: Average revenue per user (ARPU) increases by at least 20% through add-on adoption within 6 months
Assumptions
- Users understand vCPU-based scaling of free credits and can calculate their effective free minutes for different runner sizes
- Industry-standard payment processing (Stripe or similar) is available and integrated for secure credit card handling
- Enterprise sales team has capacity and process to handle custom pricing negotiations and white-glove onboarding
- Abuse detection can leverage usage patterns, payment verification, and potentially behavioral analysis to identify bad actors
- Storage billing is calculated daily and averaged over the monthly billing period for prorated charges
- Partial minutes are rounded up to the next whole minute for billing purposes (industry standard for compute billing)
- Payment method failures trigger automated retry logic and user notifications before blocking service
- Annual prepayment options are available to sales team but not publicly advertised on the pricing page
- The 10 GB transparent cache is included in base PAYG pricing and does not count toward the $0.20/GB storage billing
- Concurrent job limits are enforced separately for x64 and macOS runners (not combined)
- Add-ons can be canceled at any time but billing continues through the end of the current billing cycle (no prorated refunds)
- Discounts applied by sales team are percentage-based and apply to all usage charges, not just specific SKUs
- Jobs in progress when free credits are exhausted are allowed to complete before payment is required
- Free credit abuse is mitigated by requiring unique email verification and detecting suspicious signup patterns.
Getting Started with Tenki Cloud Development
Last updated: 2025-06-12
This guide will help you set up your development environment and run Tenki Cloud locally.
Prerequisites
Required Software
- Nix
- Devenv
- Direnv
- 1Password
- 1Password CLI
- Integrate with the 1Password CLI
- Verify you have access to `luxor/Engineering` by running `op account list` and `op vault list --account=luxor`
- If `luxor` isn't showing as `luxor.1password.com`, try the other accounts from `op account list`: `op vault list --account=<account_name>`
- If the `luxor` account isn't showing at all, contact an administrator
Verify Prerequisites
nix --version
devenv version
direnv --version
op --version
Hardware Requirements
- RAM: 16GB minimum (32GB recommended)
- CPU: 4 cores minimum (8 cores recommended)
- Disk: 50GB free space
Initial Setup
1. Clone the Repository
git clone https://github.com/luxorlabs/tenki.app.git
cd tenki.app
2. Pull Setup Keys
sh tools/scripts/setup.sh
If you run into an issue where it's using the wrong account, try this:
op account list
# account can be `luxor` if url is `luxor.1password.com`,
# or `my` if url is `my.1password.com` and there's only one account
# or the 3rd column (USER ID) if you have multiple accounts and you're getting the same url for all accounts.
sh tools/scripts/setup.sh <account>
Example: sh tools/scripts/setup.sh luxor
If you run into permission issues, try
chmod +x ./tools/scripts/*.sh
3. Enable Development Environment
direnv allow
This will:
- Install all required tools (Go, Node.js, pnpm, etc.)
- Set up environment variables
- Configure Git hooks
4. Install Dependencies
# Install all npm dependencies
pnpm install
# Generate protobuf code
bufgen
# Run Go mod tidy
tidy
# Update /etc/hosts entries
sync-hosts
# Initialize database
tb-format
5. Hosts and Other Setup
NOTE: Before running this script, the host needs to have `hostctl` installed since it requires elevated execution. Verify with `hostctl --version`.
Add tenki.lab hosts:
sync-hosts
Format TigerBeetle database:
tb-format
Managing Environment and Secrets
Secret Files
- `resources/secrets/*.sops.yaml` - Encrypted secrets pushed/committed to git
- `resources/secrets/*.local.yaml` - Decrypted secrets (gitignored)
Commands
- `env-sync` - Decrypt secrets and create a copy in the individual apps/backend folders
- `env-decrypt` - Decrypt secrets, `*.sops.yaml` to `*.local.yaml`
- `env-encrypt` - Encrypt secrets, `*.local.yaml` to `*.sops.yaml`
Pulling Latest Secrets
- `git pull` to get the latest secrets
- `env-sync` to create your own copy of the secrets
Updating Secrets
1. Locate the secret you want to update in `resources/secrets/*.local.yaml`
2. Run `env-encrypt` to encrypt the secret
3. Commit the changes and push to GitHub
Overwriting Secrets (usually only needed once per setup)
- For Next.js apps: `.env.local` overrides `.env`. Copy `.env.sample` to `.env.local` and update the values
- For the backend: `engine.local.yaml` overrides `engine.yaml`. Copy `engine.sample.yaml` to `engine.local.yaml` and update the values
Seeding and Migrations
NOTE: Before running these commands, the database must be up and running. Run `dev up` or `devenv up`.
1. Run `db up` to migrate the database
2. Run `psql -U postgres -d tenki -f ./tools/seed/20240915152331_seed.sql` to seed the database with CSP & related data

   Or run `db deploy` to run both.
3. Run `db seed` to seed the database with users, workspaces, projects, and VMs. After this you can start the dev server and log in with:
   - Email: `odin@tenki.cloud`
   - Password: `tenki.app`
Database Commands
db up # Run migrations
db down # Rollback last migration
db reset # Reset database
db status # Check migration status
db create add_users_table # Create new migration
db deploy # Run migrations and seed
db nuke # Complete database reset
db seed # Seed test data
For Redpanda, see internal docs to set it up.
NOTE: This should be automated in the future
Running the Application
Quick Start
# Start all services
dev up
# Access the application
open https://app.tenki.lab:4001
Development Domains
- Frontend: https://app.tenki.lab:4001
- Temporal UI: https://temporal.tenki.lab
- Redpanda Console: https://redpanda.tenki.lab
- Grafana: https://grafana.tenki.lab
- API: https://api.tenki.lab
Individual Services
# Start specific service
dev up postgres
dev up temporal
dev up engine
# Other options
dev up --simple # Minimal output
dev up -D # Detached mode
# Service management
dev start [service] # Start specific service
dev stop [service] # Stop specific service
dev restart [service] # Restart service
dev logs [service] # View service logs
dev list # List all services
# Examples
dev start # (enter, then choose services, hit tab to select multiple)
dev start engine
dev logs -f postgres # Follow logs
Development Workflow
Frontend Development
cd apps/app
pnpm dev
# Run type checking
pnpm type-check
# Run linting
pnpm lint
pnpm lint:fix
Backend Development
# Run Go services
cd backend
go run cmd/engine/main.go
# Run tests
gotest
# Generate mocks
gomocks
# Build binaries
make build-engine
Database Changes
# Create new migration
db create add_user_preferences
# Apply migrations
db up
# Rollback migration
db down
# Reset database
db reset
Resetting Existing/Flaky Local Environment
1. Close/stop all services
2. Run `reset-local`
3. In another terminal, run `db deploy`
Common Tasks
Adding shadcn/ui Components
# Add a new component
pnpm -F @shared/ui ui:add
# or
pnpm -F @shared/ui ui:add <component>
Then add the component to the exports in packages/ui/package.json:
"exports": {
"./button": "./src/components/ui/button.tsx",
"./alert-dialog": "./src/components/ui/alert-dialog.tsx"
}
Generating App Icons
pnpm -F app generate-icon
Then:
- Copy `public/images/favicon-196.png` to:
  - `src/app/favicon.png`
  - `src/app/icon.png`
- Copy all `rel="apple-touch-startup-image"` entries from `src/asset-generator.html` to `src/app/layout.tsx`
Adding a New API Endpoint
1. Define the proto in `proto/tenki/cloud/`
2. Run `bufgen` to generate code
3. Implement the service in `backend/internal/domain/`
4. Add a tRPC router in `apps/app/src/server/api/`
Running Tests
# All tests
pnpm test
gotest
# Specific package
pnpm -F app test
cd backend && go test ./pkg/...
# Integration tests
gotest-integration
# With coverage
cd backend && go test -v -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
Debugging
# Check service health
dev list
# View all logs
dev logs
# Restart a service
dev restart engine
# Database console (direct connection)
psql -h localhost -U postgres -d tenki
# Temporal CLI
temporal workflow list
Troubleshooting
Port Already in Use
# Find process using port
lsof -i :4001
# Kill process
kill -9 <PID>
Database Connection Issues
# Restart postgres
dev restart postgres
# Check logs
dev logs postgres
# Reset database
db nuke
Proto Generation Fails
# Clean and regenerate
rm -rf backend/pkg/proto
bufgen
Node Modules Issues
# Clean and reinstall
rm -rf node_modules apps/*/node_modules packages/*/node_modules
pnpm install
Next Steps
- Read the Architecture Overview
- Set up your IDE/Editor
- Join the development Slack channel
- Pick a starter issue from GitHub
Editor Setup
VS Code
1. Install recommended extensions:
   - Go
   - ESLint
   - Prettier
   - Proto3
2. Use workspace settings: `{ "editor.formatOnSave": true, "go.lintTool": "golangci-lint" }`
GoLand/WebStorm
- Enable Go modules
- Set up file watchers for:
- gofmt
- prettier
- eslint
Runner Prerequisites
Setting up GitHub App
1. Create a GitHub organization (skip this if you already have one)
2. Create a GitHub App:
   - Run `pnpm -F github-app run create` or `pnpm -F github-app run create -o <org>`
   - If the `name` already exists, change it and continue
   - Once done, it will redirect you to a success screen; close the tab
3. In `github-app-response.json`, take note of the `slug`, `pem`, `webhook_secret`, `client_id`, and `client_secret`
Backend Development Guide
Last updated: 2025-06-12
Overview
The Tenki backend is built with Go and follows Domain-Driven Design principles. Services communicate via Connect RPC (gRPC-Web compatible) and use Temporal for workflow orchestration.
Project Structure
backend/
├── cmd/                  # Application entry points
│   ├── engine/           # Main backend service
│   └── tenki-cli/        # CLI tool
├── internal/             # Private application code
│   ├── app/              # Application layer
│   └── domain/           # Business domains
│       ├── billing/      # Billing domain
│       ├── compute/      # VM management
│       ├── identity/     # Auth & users
│       ├── runner/       # GitHub runners
│       └── workspace/    # Multi-tenancy
├── pkg/                  # Public packages
│   └── proto/            # Generated protobuf
├── queries/              # SQL queries (sqlc)
└── schema/               # Database migrations
Development Workflow
Running the Backend
# Start dependencies
dev up postgres temporal kafka
# Run migrations
db deploy
# Start engine
cd backend
go run cmd/engine/main.go
# Or use the dev script
dev restart engine
Adding a New Feature
1. Define the API in `proto/tenki/cloud/workspace/v1/project.proto`:

   `service ProjectService { rpc CreateProject(CreateProjectRequest) returns (CreateProjectResponse); }`

2. Generate code with `bufgen`
3. Implement the domain logic in `internal/domain/workspace/service/project.go`:

   `func (s *Service) CreateProject(ctx context.Context, req *params.CreateProject) (*models.Project, error)`

4. Write SQL queries in `queries/workspace/project.sql`:

   `-- name: CreateProject :one INSERT INTO projects (name, workspace_id) VALUES ($1, $2) RETURNING *;`

5. Generate SQL code: `cd backend && sqlc generate`
Testing
Unit Tests
func TestService_CreateProject(t *testing.T) {
tests := []struct {
name string
input *params.CreateProject
want *models.Project
wantErr bool
}{
{
name: "valid project",
input: &params.CreateProject{
Name: "test-project",
WorkspaceID: "ws-123",
},
want: &models.Project{
Name: "test-project",
},
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Test implementation
})
}
}
Integration Tests
//go:build integration
var _ = Describe("Project Service", func() {
var (
service *workspace.Service
db *sql.DB
)
BeforeEach(func() {
db = setupTestDB()
service = workspace.NewService(workspace.WithDB(db))
})
It("should create a project", func() {
project, err := service.CreateProject(ctx, params)
Expect(err).NotTo(HaveOccurred())
Expect(project.Name).To(Equal("test"))
})
})
Running Tests
# Unit tests only
gotest
# Integration tests
gotest-integration
# Specific package
cd backend && go test ./internal/domain/workspace/...
# With coverage
cd backend && go test -cover ./...
Database Operations
Migrations
# Create migration
echo "CREATE TABLE features (id uuid PRIMARY KEY);" > backend/schema/$(date +%Y%m%d%H%M%S)_add_features.sql
# Apply migrations
db up
# Rollback
db down
Query Development
1. Write the query in `backend/queries/`
2. Run `sqlc generate`
3. Use the generated code in your service
// Generated code usage
project, err := s.db.CreateProject(ctx, db.CreateProjectParams{
Name: req.Name,
WorkspaceID: req.WorkspaceID,
})
Temporal Workflows
Workflow Definition
func RunnerProvisioningWorkflow(ctx workflow.Context, params RunnerParams) error {
// Configure workflow
ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Minute,
RetryPolicy: &temporal.RetryPolicy{
MaximumAttempts: 3,
},
})
// Execute activities
var runner *models.Runner
err := workflow.ExecuteActivity(ctx, CreateRunnerActivity, params).Get(ctx, &runner)
if err != nil {
return fmt.Errorf("create runner: %w", err)
}
return nil
}
Testing Workflows
func TestRunnerProvisioningWorkflow(t *testing.T) {
suite := testsuite.WorkflowTestSuite{}
env := suite.NewTestWorkflowEnvironment()
// Mock activities
env.OnActivity(CreateRunnerActivity, mock.Anything).Return(&models.Runner{ID: "123"}, nil)
// Execute workflow
env.ExecuteWorkflow(RunnerProvisioningWorkflow, params)
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
}
API Patterns
Service Options
// Use functional options pattern
type Service struct {
db *db.Queries
temporal client.Client
logger *slog.Logger
}
type Option func(*Service)
func WithDB(db *db.Queries) Option {
return func(s *Service) {
s.db = db
}
}
func NewService(opts ...Option) *Service {
s := &Service{
logger: slog.Default(),
}
for _, opt := range opts {
opt(s)
}
return s
}
Error Handling
// Define domain errors
var (
ErrProjectNotFound = errors.New("project not found")
ErrUnauthorized = errors.New("unauthorized")
)
// Wrap errors with context
if err != nil {
return fmt.Errorf("fetch project %s: %w", projectID, err)
}
// Check errors
if errors.Is(err, ErrProjectNotFound) {
return connect.NewError(connect.CodeNotFound, err)
}
Debugging
Local Debugging
# Enable debug logging
export LOG_LEVEL=debug
# Run with delve
dlv debug cmd/engine/main.go
# Attach to running process
dlv attach $(pgrep engine)
Temporal UI
# View workflows
open https://temporal.tenki.lab
# List workflows via CLI
temporal workflow list --query 'WorkflowType="RunnerProvisioningWorkflow"'
# Describe workflow
temporal workflow describe -w <workflow-id>
Database Queries
# Connect to database
dev exec postgres psql -U postgres tenki
# Useful queries
SELECT * FROM runners WHERE created_at > NOW() - INTERVAL '1 hour';
SELECT COUNT(*) FROM workflow_runs GROUP BY status;
Performance Tips
- Use prepared statements - sqlc does this automatically
- Batch operations - Use CopyFrom for bulk inserts
- Connection pooling - Configure in engine.yaml
- Context cancellation - Always respect context.Done()
- Concurrent operations - Use errgroup for parallel work
Common Patterns
Repository Pattern
type RunnerRepository interface {
Create(ctx context.Context, runner *Runner) error
GetByID(ctx context.Context, id string) (*Runner, error)
List(ctx context.Context, filter Filter) ([]*Runner, error)
}
Builder Pattern
query := NewQueryBuilder().
Where("status", "active").
OrderBy("created_at", "DESC").
Limit(10).
Build()
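The QueryBuilder above is shown only at its call site. A minimal sketch of what such a fluent builder can look like — illustrative only; the method set and SQL rendering of the real implementation may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// QueryBuilder accumulates clauses and renders them in Build.
type QueryBuilder struct {
	wheres  []string
	args    []any
	orderBy string
	limit   int
}

func NewQueryBuilder() *QueryBuilder { return &QueryBuilder{} }

// Where adds an equality predicate using positional placeholders.
func (b *QueryBuilder) Where(col string, val any) *QueryBuilder {
	b.args = append(b.args, val)
	b.wheres = append(b.wheres, fmt.Sprintf("%s = $%d", col, len(b.args)))
	return b
}

func (b *QueryBuilder) OrderBy(col, dir string) *QueryBuilder {
	b.orderBy = col + " " + dir
	return b
}

func (b *QueryBuilder) Limit(n int) *QueryBuilder {
	b.limit = n
	return b
}

// Build renders the clause fragment plus its bind arguments; a real builder
// would also take the table name and validate column identifiers.
func (b *QueryBuilder) Build() (string, []any) {
	var sb strings.Builder
	if len(b.wheres) > 0 {
		sb.WriteString("WHERE " + strings.Join(b.wheres, " AND "))
	}
	if b.orderBy != "" {
		sb.WriteString(" ORDER BY " + b.orderBy)
	}
	if b.limit > 0 {
		sb.WriteString(fmt.Sprintf(" LIMIT %d", b.limit))
	}
	return sb.String(), b.args
}

func main() {
	query, args := NewQueryBuilder().
		Where("status", "active").
		OrderBy("created_at", "DESC").
		Limit(10).
		Build()
	fmt.Println(query, args)
}
```

Each method returns the receiver, which is what makes the chained call style possible.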
Middleware Pattern
func LoggingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
next.ServeHTTP(w, r)
slog.Info("request", "method", r.Method, "path", r.URL.Path, "duration", time.Since(start))
})
}
Resources
Frontend Development Guide
Last updated: 2025-06-12
Overview
The Tenki frontend is built with Next.js 15, React 19, and TypeScript. We use tRPC for type-safe API communication, Tailwind CSS for styling, and Radix UI for accessible components.
Tech Stack
- Framework: Next.js 15 (App Router)
- Language: TypeScript
- Styling: Tailwind CSS + Radix UI
- State: React Context + Zustand
- API: tRPC
- Forms: React Hook Form + Zod
- Testing: Jest + React Testing Library
Project Structure
apps/app/
├── src/
│   ├── app/              # Next.js app router pages
│   │   ├── (dashboard)/  # Protected routes
│   │   ├── auth/         # Auth pages
│   │   └── api/          # API routes
│   ├── components/       # Reusable components
│   ├── hooks/            # Custom hooks
│   ├── server/           # Server-side code
│   │   └── api/          # tRPC routers
│   ├── trpc/             # tRPC client setup
│   └── utils/            # Utilities
├── public/               # Static assets
└── next.config.mjs       # Next.js config
Development Workflow
Running the Frontend
# Start all services (recommended)
pnpm dev
# Or just the frontend
pnpm -F app dev
# Access at
open https://app.tenki.lab:4001
Creating Components
// components/project-card.tsx
interface ProjectCardProps {
project: Project;
onSelect?: (project: Project) => void;
}
export function ProjectCard({ project, onSelect }: ProjectCardProps) {
return (
<Card onClick={() => onSelect?.(project)} className="cursor-pointer transition-shadow hover:shadow-lg">
<CardHeader>
<CardTitle>{project.name}</CardTitle>
</CardHeader>
<CardContent>
<p className="text-muted-foreground">{project.description}</p>
</CardContent>
</Card>
);
}
Using tRPC
// In a client component
"use client";
import { trpc } from "@/trpc/client";
export function ProjectList() {
  const utils = trpc.useUtils(); // needed for cache invalidation below
  const { data: projects, isLoading } = trpc.project.list.useQuery();
const createProject = trpc.project.create.useMutation({
onSuccess: () => {
// Invalidate and refetch
utils.project.list.invalidate();
},
});
if (isLoading) return <Skeleton />;
return (
<div>
{projects?.map((project) => (
<ProjectCard key={project.id} project={project} />
))}
</div>
);
}
Creating tRPC Routes
// server/api/routers/project.ts
export const projectRouter = createTRPCRouter({
list: protectedProcedure.query(async ({ ctx }) => {
return ctx.db.project.findMany({
where: { workspaceId: ctx.session.workspaceId },
});
}),
create: protectedProcedure
.input(
z.object({
name: z.string().min(1),
description: z.string().optional(),
}),
)
.mutation(async ({ ctx, input }) => {
return ctx.db.project.create({
data: {
...input,
workspaceId: ctx.session.workspaceId,
},
});
}),
});
Styling Guidelines
Using Tailwind
// Use semantic color classes
<div className="bg-background text-foreground">
<button className="bg-primary text-primary-foreground hover:bg-primary/90">
Click me
</button>
</div>
// Responsive design
<div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
{/* Grid items */}
</div>
// Dark mode support (automatic)
<div className="bg-white dark:bg-gray-900">
Content adapts to theme
</div>
Component Composition
// Use Radix UI primitives
import * as Dialog from "@radix-ui/react-dialog";
export function CreateProjectDialog() {
return (
<Dialog.Root>
<Dialog.Trigger asChild>
<Button>Create Project</Button>
</Dialog.Trigger>
<Dialog.Portal>
<Dialog.Overlay className="fixed inset-0 bg-black/50" />
<Dialog.Content className="bg-background fixed top-1/2 left-1/2 -translate-x-1/2 -translate-y-1/2 rounded-lg p-6">
<Dialog.Title>Create Project</Dialog.Title>
{/* Form content */}
</Dialog.Content>
</Dialog.Portal>
</Dialog.Root>
);
}
State Management
Local State
// For simple component state
const [isOpen, setIsOpen] = useState(false);
Context for Feature State
// contexts/project-context.tsx
const ProjectContext = createContext<ProjectContextType | null>(null);
export function ProjectProvider({ children }: { children: ReactNode }) {
const [selectedProject, setSelectedProject] = useState<Project | null>(null);
return <ProjectContext.Provider value={{ selectedProject, setSelectedProject }}>{children}</ProjectContext.Provider>;
}
export function useProject() {
const context = useContext(ProjectContext);
if (!context) throw new Error("useProject must be used within ProjectProvider");
return context;
}
Global State with Zustand
// stores/user-preferences.ts
import { create } from "zustand";
interface PreferencesStore {
theme: "light" | "dark" | "system";
setTheme: (theme: PreferencesStore["theme"]) => void;
}
export const usePreferences = create<PreferencesStore>((set) => ({
theme: "system",
setTheme: (theme) => set({ theme }),
}));
Forms
With React Hook Form + Zod
const ProjectSchema = z.object({
name: z.string().min(1, "Name is required"),
description: z.string().optional(),
isPublic: z.boolean().default(false),
});
type ProjectForm = z.infer<typeof ProjectSchema>;
export function CreateProjectForm() {
const form = useForm<ProjectForm>({
resolver: zodResolver(ProjectSchema),
defaultValues: {
name: "",
isPublic: false,
},
});
const onSubmit = async (data: ProjectForm) => {
await createProject.mutateAsync(data);
};
return (
<Form {...form}>
<form onSubmit={form.handleSubmit(onSubmit)}>
<FormField
control={form.control}
name="name"
render={({ field }) => (
<FormItem>
<FormLabel>Project Name</FormLabel>
<FormControl>
<Input {...field} />
</FormControl>
<FormMessage />
</FormItem>
)}
/>
<Button type="submit">Create</Button>
</form>
</Form>
);
}
Testing
Component Tests
// __tests__/project-card.test.tsx
import { ProjectCard } from "@/components/project-card";
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
describe("ProjectCard", () => {
it("displays project information", () => {
const project = { id: "1", name: "Test Project", description: "Test" };
render(<ProjectCard project={project} />);
expect(screen.getByText("Test Project")).toBeInTheDocument();
expect(screen.getByText("Test")).toBeInTheDocument();
});
it("calls onSelect when clicked", async () => {
const onSelect = jest.fn();
const project = { id: "1", name: "Test Project" };
render(<ProjectCard project={project} onSelect={onSelect} />);
await userEvent.click(screen.getByRole("article"));
expect(onSelect).toHaveBeenCalledWith(project);
});
});
Running Tests
# Run all tests
pnpm test
# Watch mode
pnpm test:watch
# With coverage
pnpm test:coverage
Performance
Image Optimization
import Image from "next/image";
<Image
src="/logo.png"
alt="Logo"
width={200}
height={50}
priority // For above-the-fold images
/>;
Code Splitting
// Dynamic imports for heavy components
const HeavyChart = dynamic(() => import("@/components/heavy-chart"), {
loading: () => <Skeleton className="h-96" />,
ssr: false, // Disable SSR for client-only components
});
Data Fetching
// Server component (default in app router)
async function ProjectPage({ params }: { params: { id: string } }) {
const project = await api.project.get({ id: params.id });
return <ProjectDetails project={project} />;
}
// Parallel data fetching
async function DashboardPage() {
const [projects, stats] = await Promise.all([api.project.list(), api.stats.get()]);
return (
<>
<StatsCard stats={stats} />
<ProjectList projects={projects} />
</>
);
}
Common Patterns
Error Boundaries
export function ProjectErrorBoundary({ children }: { children: ReactNode }) {
return (
<ErrorBoundary
fallback={
<Alert variant="destructive">
<AlertTitle>Something went wrong</AlertTitle>
<AlertDescription>Unable to load projects. Please try again.</AlertDescription>
</Alert>
}
>
{children}
</ErrorBoundary>
);
}
Loading States
export function ProjectListSkeleton() {
return (
<div className="space-y-4">
{Array.from({ length: 3 }).map((_, i) => (
<Skeleton key={i} className="h-24" />
))}
</div>
);
}
Accessibility
// Always include ARIA labels
<button
aria-label="Delete project"
onClick={handleDelete}
>
<TrashIcon />
</button>
// Keyboard navigation
<div
role="button"
tabIndex={0}
onKeyDown={(e) => {
if (e.key === 'Enter' || e.key === ' ') {
handleClick();
}
}}
>
Interactive element
</div>
Debugging
React DevTools
- Install React Developer Tools extension
- Use Components tab to inspect props/state
- Use Profiler tab for performance analysis
tRPC DevTools
// Automatically included in development
// View network tab for tRPC requests
// Check request/response payloads
Common Issues
Hydration Errors
// Ensure client/server render match
{typeof window !== "undefined" && <ClientOnlyComponent />}
State Not Updating
// Use callbacks for state depending on previous
setItems((prev) => [...prev, newItem]);
Resources
Frontend Testing
Unit Tests
- Unit tests are colocated with the code they test. Good examples of test files can be found in the /apps/app/src/utils/__tests__ directory, which contains unit tests for the files in the /apps/app/src/utils/ folder.
- Additional examples of unit test files exist throughout the frontend codebase. They have a .test.{ts,tsx} extension and are sometimes located in __tests__ directories.
Unit Test Approach
- Implemented using vitest
- Add as many unit tests as possible, especially for pure functions and complex business logic that can be tested independently without relying on extensive mocking and external dependencies.
- Prioritize testing different properties and scenarios to catch easy-to-miss edge cases instead of only following the happy path with a few examples.
Running Unit Tests
- Run pnpm test:unit to run all unit tests
- Run pnpm test:unit:coverage to run all unit tests and get a coverage report in the terminal
Frontend Test Cases Guide
THIS DOCUMENT IS STILL A WIP…
The test-cases directory inside apps/app contains structured test specifications that define the expected behavior of our application. These specifications serve as a bridge between product requirements and automated tests.
Overview
The test cases are defined in JSON files and follow a strict schema (defined in schema.json). Each test case is identified by a unique ID and contains detailed information about what needs to be tested.
File Structure
- schema.json - Defines the structure and validation rules for test case specifications
- onboarding.spec.json - Test specifications for user onboarding flows
- Additional .spec.json files for other features
Test Case Schema
Each test case follows this structure:
{
"TEST-001": {
"title": "Test case title",
"priority": "P0",
"preconditions": ["List of conditions that must be met"],
"steps": ["Step-by-step test instructions"],
"acceptance_criteria": ["List of criteria that must be met"]
}
}
Fields Explained
- title: A descriptive name for the test case
- priority: Importance level (P0-P3)
  - P0: Critical path, must not break and must be covered by automated tests
  - P1: Core functionality
  - P2: Important but not critical
  - P3: Nice to have
- preconditions: Required setup or state before running the test
- steps: Detailed test steps
- acceptance_criteria: What must be true for the test to pass
Priority Levels
- P0: Critical business flows (e.g., user registration, login)
- P1: Core features that significantly impact user experience
- P2: Secondary features that enhance user experience
- P3: Edge cases and nice-to-have features
Updating Test Cases
1. Adding a New Test Case:
   - Choose an appropriate spec file (or create a new one for new features)
   - Add a new entry with a unique ID (format: XXX-###)
   - Fill in all required fields according to the schema
   - Validate against schema.json
2. Modifying Existing Test Cases:
   - Update the relevant fields
   - Ensure changes are reflected in the corresponding automated tests
   - Keep the test ID unchanged
3. Best Practices:
   - Keep steps clear and actionable
   - Write acceptance criteria that can be automated
   - Include edge cases and error scenarios
   - Document dependencies between test cases
Integration with Automated Tests
The test specifications in this directory serve as a source of truth for our automated tests. The relationship works as follows:
- Test specs define WHAT needs to be tested
- Automated tests implement HOW to test it
- Automated tests written for a test case should reference its corresponding test case ID
Example:
describe("ONB-001: User Registration - with email", () => {
it("should complete registration flow successfully", async () => {
// Test implementation
});
});
Maintaining Test Coverage
- Every new feature should have corresponding test cases
- Test cases should be reviewed along with code changes
- Regular audits ensure test coverage matches specifications
- Update or deprecate test cases when features change
Database Guide
Overview
This guide covers database development practices for Tenki Cloud, including schema management, migrations, and query patterns.
Database Stack
- PostgreSQL: Primary database
- sqlc: Type-safe SQL query generation
- golang-migrate: Database migration management
Schema Management
Migrations
All database schema changes must be made through migrations:
# Create a new migration
make migration name=add_user_settings
# Run migrations
make migrate-up
# Rollback last migration
make migrate-down
Best Practices
- Always include both up and down migrations
- Keep migrations small and focused
- Test rollbacks before merging
- Never modify existing migrations
Query Development
We use sqlc for type-safe database queries:
Writing Queries
1. Add queries to pkg/db/queries/*.sql
2. Use named parameters: @param_name
3. Follow naming conventions:
   - GetUserByID for a single row
   - ListUsersByOrg for multiple rows
   - CreateUser for inserts
   - UpdateUser for updates
   - DeleteUser for deletes
Generating Code
# Generate Go code from SQL
make sqlc
Performance
Indexing
- Add indexes for frequently queried columns
- Use composite indexes for multi-column queries
- Monitor slow query logs
Query Optimization
- Use EXPLAIN ANALYZE for query planning
- Avoid N+1 queries
- Batch operations when possible
- Use database views for complex queries
Testing
Unit Tests
- Mock database interfaces
- Test query logic separately from business logic
Integration Tests
- Use test database containers
- Clean up test data after each test
- Test migration up/down paths
Testing Guide
This guide covers the testing strategies and patterns used in Tenki Cloud, with a focus on writing effective tests for backend services, particularly those using Temporal workflows.
Overview
Tenki Cloud uses a comprehensive testing approach that includes:
- Unit Tests: Fast, isolated tests using mocks to verify business logic
- Integration Tests: End-to-end tests running in a real environment
- Table-Driven Tests: Systematic approach for testing multiple scenarios
- BDD-Style Tests: Behavior-driven tests using Ginkgo/Gomega
Testing Stack
Core Libraries
// Unit Testing
import (
"testing"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"github.com/stretchr/testify/mock"
)
// Integration Testing
import (
"github.com/onsi/ginkgo/v2"
"github.com/onsi/gomega"
)
// Temporal Testing
import (
"go.temporal.io/sdk/testsuite"
)
Project Structure
internal/domain/{domain}/
├── service/              # Business logic
├── db/                   # Database queries (sqlc generated)
├── interface.go          # Service interfaces
├── mock_*.go             # Generated mocks
└── worker/               # Temporal workers
    ├── activities/       # Temporal activities
    │   ├── *.go          # Activity implementations
    │   └── *_test.go     # Activity unit tests
    ├── workflows/        # Temporal workflows
    │   ├── *.go          # Workflow implementations
    │   └── *_test.go     # Workflow unit tests
    └── integration_*.go  # Integration tests
Unit Testing
Activity Testing
Activities should be tested with mocked dependencies to ensure business logic correctness.
Basic Pattern
func TestActivities_GetRunnerInstallation(t *testing.T) {
t.Parallel()
tests := []struct {
name string
installationId int64
mockResponse *connect.Response[runnerproto.GetRunnerInstallationResponse]
mockError error
expectedResult *runnerproto.RunnerInstallation
expectErr bool
}{
{
name: "success",
installationId: 1234,
mockResponse: connect.NewResponse(&runnerproto.GetRunnerInstallationResponse{
RunnerInstallation: &runnerproto.RunnerInstallation{
Id: "abc123",
},
}),
expectedResult: &runnerproto.RunnerInstallation{Id: "abc123"},
},
{
name: "service error",
installationId: 1234,
mockError: connect.NewError(connect.CodeInternal, nil),
expectErr: true,
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
// Setup mock
svc := &runner.MockService{}
svc.On("GetRunnerInstallation", mock.Anything, mock.Anything).
Return(tc.mockResponse, tc.mockError)
// Create activities with mock
a := newTestActivities(svc, t)
// Execute
result, err := a.GetRunnerInstallation(context.Background(), tc.installationId)
// Assert
if tc.expectErr {
assert.Error(t, err)
assert.Nil(t, result)
} else {
assert.NoError(t, err)
assert.Equal(t, tc.expectedResult, result)
}
})
}
}
Testing with Complex Arguments
// Use MatchedBy for complex argument validation
svc.On("UpdateRunners", mock.Anything,
mock.MatchedBy(func(req *connect.Request[runnerproto.UpdateRunnersRequest]) bool {
return assert.ElementsMatch(t, req.Msg.Ids, expectedIds) &&
assert.Equal(t, req.Msg.State, expectedState)
})).Return(nil, nil)
Test Helper Functions
Create reusable test helpers to reduce boilerplate:
func newTestActivities(svc runner.Service, t *testing.T) *activities {
logger := log.NewTestLogger(t)
sr := trace.NewSpanRecorder()
tracer, _ := trace.NewTestTracer(sr)
return &activities{
logger: logger,
svc: svc,
tracer: tracer,
}
}
Workflow Testing
Workflows require mocking activities since they orchestrate multiple operations.
Basic Workflow Test
func TestGithubJobWorkflow(t *testing.T) {
var ts testsuite.WorkflowTestSuite
t.Run("happy path", func(t *testing.T) {
env := ts.NewTestWorkflowEnvironment()
// Register activities with stubs
env.RegisterActivityWithOptions(stubFunc,
temporal.RegisterOptions{Name: runner.GithubJobWorkflowActivity})
// Mock activity responses
env.OnActivity(runner.GithubJobWorkflowActivity, mock.Anything, mock.Anything).
Return(nil, nil)
// Execute workflow
event := github.WorkflowJobEvent{
Action: github.String("completed"),
Installation: &github.Installation{ID: github.Int64(123)},
}
env.ExecuteWorkflow((&workflows{}).GithubJobWorkflow, event)
// Assert completion
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
})
}
Testing Retry Logic
t.Run("retry on transient error", func(t *testing.T) {
env := ts.NewTestWorkflowEnvironment()
callCount := 0
env.OnActivity(runner.SomeActivity, mock.Anything, mock.Anything).
Return(func(context.Context, interface{}) error {
callCount++
if callCount < 3 {
return errors.New("transient error")
}
return nil
})
env.ExecuteWorkflow(workflow, input)
require.True(t, env.IsWorkflowCompleted())
require.NoError(t, env.GetWorkflowError())
assert.Equal(t, 3, callCount)
})
Integration Testing
Integration tests verify the entire system working together with real dependencies.
Setup with Ginkgo
Test Suite Entry Point
//go:build integration
func TestIntegration(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Runner Worker Integration Tests")
}
Suite Configuration
var _ = BeforeSuite(func() {
// Start Temporal dev server
cmd := exec.Command("temporal", "server", "start-dev",
"--port", "7233",
"--ui-port", "8233",
"--db-filename", filepath.Join(tempDir, "temporal.db"))
// Initialize global dependencies
initializeDatabase()
initializeTracing()
})
var _ = AfterSuite(func() {
// Clean up
stopTemporalServer()
closeDatabase()
})
var _ = BeforeEach(func() {
// Start transaction for test isolation
tx = db.BeginTx()
// Create service instances
runnerService = createRunnerService(tx)
// Start worker
worker = temporal.NewWorker(client, taskQueue, temporal.WorkerOptions{})
temporal.RegisterWorkflows(worker)
temporal.RegisterActivities(worker, activities)
worker.Start()
})
var _ = AfterEach(func() {
// Rollback transaction
tx.Rollback()
// Stop worker
worker.Stop()
})
Writing Integration Tests
var _ = Describe("Runner Installation", func() {
Context("when installing runners", func() {
It("should install runner successfully", func() {
// Start workflow
workflowId := fmt.Sprintf("test-install-%s", uuid.New())
run, err := temporalClient.ExecuteWorkflow(
context.Background(),
client.StartWorkflowOptions{
ID: workflowId,
TaskQueue: runner.TaskQueue,
},
runner.RunnerInstallWorkflow,
installationId,
)
Expect(err).ToNot(HaveOccurred())
// Trigger installation via service
_, err = runnerService.InstallRunners(ctx, connect.NewRequest(
&runnerproto.InstallRunnersRequest{
InstallationId: installationId,
WorkspaceId: workspaceId,
},
))
Expect(err).ToNot(HaveOccurred())
// Send signal to workflow
err = temporalClient.SignalWorkflow(
context.Background(),
workflowId,
"",
runner.InstallSignal,
runner.InstallSignalPayload{},
)
Expect(err).ToNot(HaveOccurred())
// Wait for expected state
Eventually(func() string {
ins, err := runnerService.GetRunnerInstallation(ctx, req)
if err != nil || ins == nil {
return ""
}
return ins.Msg.RunnerInstallation.State
}, 30*time.Second, 1*time.Second).Should(Equal("active"))
// Verify final state
var result runner.RunnerInstallWorkflowResult
err = run.Get(context.Background(), &result)
Expect(err).ToNot(HaveOccurred())
Expect(result.Success).To(BeTrue())
})
})
})
Testing Patterns & Best Practices
1. Table-Driven Tests
Use table-driven tests to cover multiple scenarios systematically:
tests := []struct {
name string
input string
want string
wantErr bool
errMsg string
}{
{
name: "valid input",
input: "test",
want: "TEST",
},
{
name: "empty input",
input: "",
wantErr: true,
errMsg: "input cannot be empty",
},
}
2. Mock Best Practices
- Mock at interface boundaries
- Use mock.MatchedBy for complex argument matching
- Verify mock expectations when needed: defer svc.AssertExpectations(t)
3. Test Isolation
- Each test should be independent
- Use database transactions with rollback
- Clean up created resources
- Reset global state between tests
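The transaction-rollback point can be illustrated without a real database. In the sketch below, fakeTx is a stand-in for *sql.Tx (not a real driver): writes are buffered and discarded on rollback, so nothing a test body does survives into the next test.

```go
package main

import "fmt"

// fakeTx mimics the small slice of *sql.Tx behavior needed here: writes are
// buffered and discarded on rollback, leaving a clean slate for the next test.
type fakeTx struct {
	writes []string
}

func (tx *fakeTx) Exec(q string) { tx.writes = append(tx.writes, q) }
func (tx *fakeTx) Rollback()     { tx.writes = nil }

// runIsolated executes a test body inside a transaction and always rolls it
// back afterwards, the same shape as BeginTx in BeforeEach and Rollback in
// AfterEach with a real database.
func runIsolated(body func(tx *fakeTx)) {
	tx := &fakeTx{}
	defer tx.Rollback()
	body(tx)
}

func main() {
	runIsolated(func(tx *fakeTx) {
		tx.Exec("INSERT INTO runners (id) VALUES ('r1')")
		fmt.Println("pending writes:", len(tx.writes))
	})
}
```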
4. Async Testing
Use Eventually for testing async operations:
Eventually(func() bool {
// Check condition
return conditionMet
}, timeout, interval).Should(BeTrue())
5. Error Testing
Always test both success and failure paths:
{
name: "network error",
mockError: errors.New("connection refused"),
expectErr: true,
},
{
name: "timeout error",
mockError: context.DeadlineExceeded,
expectErr: true,
},
6. Test Naming
Use descriptive test names that explain the scenario:
t.Run("returns error when installation not found", func(t *testing.T) {
// test
})
7. Tracing in Tests
Verify tracing behavior when applicable:
sr := trace.NewSpanRecorder()
tracer, _ := trace.NewTestTracer(sr)
// After execution
spans := sr.Ended()
assert.Len(t, spans, 1)
assert.Equal(t, "OperationName", spans[0].Name())
assert.Equal(t, codes.Ok, spans[0].Status().Code)
Common Testing Scenarios
Testing Database Operations
func TestDatabaseOperation(t *testing.T) {
// Use test database
db := setupTestDatabase(t)
defer cleanupDatabase(db)
// Create queries
queries := runnerdb.New(db)
// Test operation
err := queries.CreateRunner(context.Background(), params)
require.NoError(t, err)
// Verify
runner, err := queries.GetRunner(context.Background(), id)
require.NoError(t, err)
assert.Equal(t, expectedName, runner.Name)
}
Testing Kubernetes Operations
func TestKubernetesOperation(t *testing.T) {
// Create fake client
objects := []runtime.Object{
&corev1.Namespace{
ObjectMeta: metav1.ObjectMeta{Name: "test"},
},
}
k8sClient := fake.NewSimpleClientset(objects...)
// Test operation
err := createDeployment(k8sClient, namespace, deployment)
require.NoError(t, err)
// Verify
deploy, err := k8sClient.AppsV1().Deployments(namespace).Get(
context.Background(), name, metav1.GetOptions{})
require.NoError(t, err)
assert.Equal(t, expectedReplicas, *deploy.Spec.Replicas)
}
Testing External API Calls
func TestExternalAPI(t *testing.T) {
// Create mock HTTP server
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
assert.Equal(t, "/api/v1/resource", r.URL.Path)
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(expectedResponse)
}))
defer server.Close()
// Test with mock server URL
client := NewAPIClient(server.URL)
result, err := client.GetResource(context.Background(), "id")
require.NoError(t, err)
assert.Equal(t, expectedResponse, result)
}
Running Tests
Unit Tests
# Run all unit tests
gotest
# Run specific package tests
cd backend && go test ./internal/domain/runner/...
# Run with coverage
cd backend && go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run specific test
cd backend && go test -run TestActivities_GetRunnerInstallation ./...
Integration Tests
# Ensure services are running
dev up
# Run all integration tests
gotest-integration
# Run specific integration test suite
cd backend && ginkgo -v ./internal/domain/runner/worker/
Continuous Integration
Tests should be part of your CI pipeline:
test:
script:
- gotest
- gotest-integration
coverage: '/coverage: \d+\.\d+%/'
Debugging Tests
Verbose Output
go test -v ./...
Focus on Specific Tests (Ginkgo)
FIt("should focus on this test", func() {
// This test will run exclusively
})
Debug Logging
logger := log.NewTestLogger(t)
logger.Debug("test state", "value", someValue)
Test Timeouts
func TestLongRunning(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
// Use ctx for operations
}
Summary
Effective testing in Tenki Cloud requires:
- Clear separation between unit and integration tests
- Proper use of mocks for isolation
- Table-driven tests for comprehensive coverage
- Integration tests for end-to-end validation
- Consistent patterns across the codebase
Follow these patterns to ensure your code is well-tested, maintainable, and reliable.
Release System
Tenki Cloud uses a custom release system designed for polyglot monorepos, handling both TypeScript/Node.js applications and Go binaries seamlessly.
Overview
The release system automates version management, changelog generation, and artifact building across all components in the monorepo. It provides a developer-friendly workflow similar to Changesets but with full support for Go modules and Docker deployments.
Key Features
- Polyglot Support: Handles both Node.js packages and Go binaries
- Shared Go Versioning: All Go binaries use coordinated versions
- Deployment Awareness: Different strategies for Docker vs binary deployment
- Automatic PR Management: Creates and updates Release PRs
- GitHub Integration: Native releases with artifact uploads
- Developer-Friendly CLI: Interactive changelog creation
Quick Start
Creating a Release
1. Create a changelog:
   changelog add          # Interactive with fzf (if available)
   changelog add --empty  # Empty changelog for internal changes
2. Commit and push:
   git add .releases/your-changelog.md
   git commit -m "feat: add new feature"
   git push origin main
3. Review the Release PR that gets created automatically
4. Merge the Release PR to trigger the release
Checking Status
changelog status
Components
The release system manages these components:
Frontend Applications
- @tenki/app → Docker image: app:vX.Y.Z
- @tenki/sentinel → Docker image: sentinel:vX.Y.Z
Go Services (Docker)
- @tenki/engine → Docker image: engine:vX.Y.Z
- @tenki/github-proxy → Docker image: github-proxy:vX.Y.Z
Go Binaries (Direct Deployment)
- @tenki/cli → Binary releases: tenki-cli-vX.Y.Z-{os}-{arch}
- @tenki/node-agent → Binary releases: node-agent-vX.Y.Z-{os}-{arch}
- @tenki/vm-agent → Binary releases: vm-agent-vX.Y.Z-{os}-{arch}
Changelog Format
Changelog files use YAML frontmatter to specify affected packages and version bump types:
---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": major
---
Add new authentication features
- **Frontend**: Added MFA support with TOTP
- **Engine**: Fixed token refresh race condition
- **CLI**: Breaking change: new login command structure
This release improves security and fixes several authentication issues.
Version Bump Types
- patch (0.0.X): Bug fixes, small improvements
- minor (0.X.0): New features, backwards compatible
- major (X.0.0): Breaking changes
Go Binary Versioning
All Go binaries share the same version from backend/go.mod. When any Go binary is updated, all Go binaries receive the same version bump using the highest bump type among them.
Example: If @tenki/cli needs a patch and @tenki/engine needs a minor, all Go binaries get a minor bump.
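The "highest bump wins" rule amounts to taking the maximum over an ordered set of bump types. A small illustrative helper — this is a sketch of the rule, not the actual release tooling:

```go
package main

import "fmt"

// bumpRank orders semver bump types so the most significant bump wins.
var bumpRank = map[string]int{"patch": 1, "minor": 2, "major": 3}

// highestBump returns the most significant bump among those requested;
// with no requests it defaults to patch.
func highestBump(bumps []string) string {
	best := "patch"
	for _, b := range bumps {
		if bumpRank[b] > bumpRank[best] {
			best = b
		}
	}
	return best
}

func main() {
	// @tenki/cli wants a patch, @tenki/engine wants a minor:
	fmt.Println(highestBump([]string{"patch", "minor"})) // minor
}
```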
Workflow
1. Changelog Detection
- Trigger: .releases/*.md files pushed to main
- Action: Parses changelog files, determines version bumps
- Result: Creates or updates Release PR
2. Release PR
- Contains: Version bumps for all affected components
- Updates: Individual CHANGELOG.md files
- Shows: Artifacts that will be built
- Cleanup: Deletes temporary changelog files
3. Release Automation
- Trigger: Release PR merged to main
- Actions:
- Creates Git tags
- Builds Docker images
- Builds cross-platform binaries
- Creates GitHub release with artifacts
CLI Commands
Interactive Changelog Creation
changelog add
Guides you through:
- Selecting affected packages
- Choosing version bump types
- Writing changelog content
Status Check
changelog status
Shows:
- Pending changelog files
- Existing Release PRs
- Current component versions
File Structure
.releases/
├── config.json  # Release configuration
└── *.md         # Temporary changelog files
# Individual changelogs
apps/app/CHANGELOG.md
apps/sentinel/CHANGELOG.md
backend/cmd/engine/CHANGELOG.md
backend/cmd/tenki-cli/CHANGELOG.md
backend/cmd/node-agent/CHANGELOG.md
backend/cmd/github-proxy/CHANGELOG.md
backend/cmd/vm-agent/CHANGELOG.md
GitHub Actions
Changelog Detection
File: .github/workflows/changelog-detection.yml
- Trigger: Push to main with `.releases/*.md` changes
- Action: Processes changelogs and creates Release PR
Release Automation
File: .github/workflows/release.yml
- Trigger: Release PR merged to main
- Actions: Creates tags, builds artifacts, publishes releases
Configuration
The system is configured in .releases/config.json:
{
"packages": {
"@tenki/app": {
"path": "apps/app",
"type": "node",
"changelog": "apps/app/CHANGELOG.md",
"version_file": "apps/app/package.json",
"deployment": "docker"
},
"@tenki/engine": {
"path": "backend/cmd/engine",
"type": "go-binary",
"changelog": "backend/cmd/engine/CHANGELOG.md",
"version_file": "backend/cmd/engine/VERSION",
"binary_name": "engine",
"deployment": "docker",
"docker": {
"component": "engine",
"dockerfile": "backend/cmd/engine/Dockerfile",
"context": "backend"
}
}
},
"release_branch": "release/next",
"release_pr_title": "chore(release): version packages [skip ci]",
"commit_message": "chore(release): version packages [skip ci]"
}
Best Practices
Changelog Writing
- One changelog per logical change - Don't combine unrelated features
- Clear descriptions - Explain what changed and why
- User-focused content - Write for end users, not developers
- Appropriate bump types - Follow semantic versioning strictly
Release Management
- Review Release PRs carefully - Verify versions and changelog entries
- Test before merging - Ensure all CI checks pass
- Coordinate deployments - Plan releases during appropriate windows
- Monitor releases - Watch for issues after deployment
Package Dependencies
- Shared package changes go in consuming app changelogs
- No separate changelogs for `packages/*` directories
- Document impact where users will see the changes
Troubleshooting
Release PR Not Created
- Check changelog format - Ensure YAML frontmatter is correct
- Verify file location - Files must be in the `.releases/` directory
- Check GitHub Actions - Review workflow logs for errors
Build Failures
- Run tests locally - Ensure all tests pass before merging
- Check Docker configs - Verify Dockerfile and build contexts
- Validate Go modules - Ensure `go.mod` is properly formatted
Version Conflicts
- Understand versioning - Go binaries share versions, Node.js apps are independent
- Check existing versions - Use `changelog status` to see current state
- Review Release PR - Verify calculated versions are correct
Examples
Simple Bug Fix
---
"@tenki/app": patch
---
Fix authentication token refresh issue
- Fixed race condition in token refresh logic
- Improved error handling for expired tokens
New Feature with Breaking Change
---
"@tenki/app": minor
"@tenki/cli": major
"@tenki/engine": minor
---
Add workspace management features
- **App**: New workspace dashboard with team management
- **CLI**: Breaking change: `tenki workspace` command restructured
- **Engine**: Added workspace isolation and resource quotas
Multi-Component Update
---
"@tenki/app": minor
"@tenki/engine": minor
"@tenki/node-agent": patch
---
Improve runner monitoring and management
- **App**: Added real-time runner status dashboard
- **Engine**: Implemented auto-scaling for custom images
- **Node Agent**: Fixed memory leak in status reporting
Release Quick Reference
Quick reference for the Tenki Cloud release system.
Commands
# Create new changelog (interactive with fzf if available)
changelog add
# Create empty changelog for internal changes
changelog add --empty
# Check status
changelog status
# Show help
changelog help
Changelog Format
---
"@tenki/app": minor
"@tenki/engine": patch
---
Brief description of changes
- Detailed change 1
- Detailed change 2
Components
| Component | Type | Deployment | Output |
|---|---|---|---|
| `@tenki/app` | Node.js | Docker | `app:vX.Y.Z` |
| `@tenki/sentinel` | Node.js | Docker | `sentinel:vX.Y.Z` |
| `@tenki/engine` | Go | Docker | `engine:vX.Y.Z` |
| `@tenki/github-proxy` | Go | Docker | `github-proxy:vX.Y.Z` |
| `@tenki/cli` | Go | Binary | `tenki-cli-vX.Y.Z-{os}-{arch}` |
| `@tenki/node-agent` | Go | Binary | `node-agent-vX.Y.Z-{os}-{arch}` |
| `@tenki/vm-agent` | Go | Binary | `vm-agent-vX.Y.Z-{os}-{arch}` |
Version Bump Types
| Type | Version Change | Use Case |
|---|---|---|
| `patch` | 1.0.0 → 1.0.1 | Bug fixes, small improvements |
| `minor` | 1.0.0 → 1.1.0 | New features, backwards compatible |
| `major` | 1.0.0 → 2.0.0 | Breaking changes |
Workflow
1. Create changelog → `changelog add`
2. Commit & push → `git add .releases/*.md && git commit && git push`
3. Review Release PR → Automatically created
4. Merge Release PR → Triggers release automation
5. Artifacts built → Docker images + binaries published
File Locations
.releases/
├── config.json                     # Configuration
└── your-feature.md                 # Temporary changelog
apps/app/CHANGELOG.md # App changelog
apps/sentinel/CHANGELOG.md # Sentinel changelog
backend/cmd/engine/CHANGELOG.md # Engine changelog
backend/cmd/tenki-cli/CHANGELOG.md # CLI changelog
backend/cmd/node-agent/CHANGELOG.md # Node agent changelog
backend/cmd/github-proxy/CHANGELOG.md # GitHub proxy changelog
backend/cmd/vm-agent/CHANGELOG.md # VM agent changelog
Common Patterns
Bug Fix
---
"@tenki/app": patch
---
Fix login redirect issue
New Feature
---
"@tenki/app": minor
"@tenki/engine": minor
---
Add workspace management
Breaking Change
---
"@tenki/cli": major
---
Restructure CLI commands
Multi-Component
---
"@tenki/app": minor
"@tenki/engine": patch
"@tenki/cli": patch
---
Improve runner monitoring
Troubleshooting
| Issue | Solution |
|---|---|
| Release PR not created | Check changelog format and GitHub Actions logs |
| Build failure | Ensure tests pass and Docker configs are correct |
| Wrong version calculated | Review frontmatter and component dependencies |
| CLI not working | Run `direnv reload` to pick up new scripts |
Go Binary Versioning
- All Go binaries share the same version from `backend/go.mod`
- Highest bump type among Go components is used for all
- Example: `cli: patch` + `engine: minor` = all Go binaries get `minor`
Deployment Guide
Last updated: 2025-06-12
Overview
Tenki Cloud uses GitOps with Flux for Kubernetes deployments. All deployments are triggered via Git commits and automatically reconciled by Flux.
Deployment Environments
| Environment | Domain | Branch | Cluster |
|---|---|---|---|
| Development | *.tenki.lab | feature/* | Local |
| Staging | *.staging.tenki.cloud | staging | tenki-staging |
| Production | *.tenki.cloud | main | tenki-prod |
Deployment Process
1. Local Development β Staging
# 1. Ensure tests pass
pnpm test
gotest
# 2. Build and push images
make docker-build
make docker-push TAG=staging-$(git rev-parse --short HEAD)
# 3. Update staging manifests
cd infra/flux/apps/staging
vim engine-deployment.yaml # Update image tag
git add .
git commit -m "deploy: engine staging-abc123"
git push
# 4. Monitor deployment
kubectl --context=staging get pods -w
flux logs -f
2. Staging β Production
# 1. Create release PR
gh pr create --base main --title "Release v1.2.3"
# 2. After approval and merge, tag release
git checkout main
git pull
git tag -a v1.2.3 -m "Release v1.2.3"
git push origin v1.2.3
# 3. CI/CD builds and pushes production images
# 4. Update production manifests
cd infra/flux/apps/production
# Update image tags to v1.2.3
git commit -m "deploy: production v1.2.3"
git push
# 5. Monitor rollout
kubectl --context=production rollout status deployment/engine
Service-Specific Deployments
Backend Engine
# Build
cd backend
make build-engine
# Test
make test
# Docker image
docker build -t tenki/engine:$TAG .
docker push tenki/engine:$TAG
# Update manifest
kubectl set image deployment/engine engine=tenki/engine:$TAG
Frontend App
# Build
cd apps/app
pnpm build
# Docker image
docker build -t tenki/app:$TAG .
docker push tenki/app:$TAG
# Deploy
kubectl set image deployment/app app=tenki/app:$TAG
Database Migrations
# Always run migrations before deploying new code
kubectl exec -it deploy/engine -- /app/migrate up
# Verify migrations
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "\dt"
Rollback Procedures
Quick Rollback (< 5 mins)
# 1. Rollback deployment
kubectl rollout undo deployment/engine
# 2. Verify rollback
kubectl rollout status deployment/engine
kubectl logs -l app=engine --tail=100
# 3. Rollback database if needed
kubectl exec -it deploy/engine -- /app/migrate down
GitOps Rollback
# 1. Revert commit in Git
git revert <commit-hash>
git push
# 2. Flux will automatically sync
flux reconcile source git flux-system
# 3. Monitor
watch flux get kustomizations
Health Checks
Pre-deployment
# Check cluster health
kubectl get nodes
kubectl top nodes
# Check dependencies
kubectl get pods -n default
kubectl get pvc
# Verify secrets
kubectl get secrets
During Deployment
# Watch rollout
kubectl rollout status deployment/engine -w
# Monitor pods
kubectl get pods -l app=engine -w
# Check logs
kubectl logs -f -l app=engine --tail=50
Post-deployment
# Smoke tests
curl https://api.tenki.cloud/health
curl https://app.tenki.cloud
# Check metrics
open https://grafana.tenki.cloud/d/deployment
# Run integration tests
cd backend && gotest-integration
Monitoring Deployments
Grafana Dashboards
Key Metrics to Watch
- Request rate changes
- Error rate spikes
- Response time increases
- CPU/Memory usage
- Database connections
Alerts
# Deployment alerts configured in Prometheus
- name: deployment_failed
expr: kube_deployment_status_replicas_unavailable > 0
for: 5m
- name: high_error_rate_after_deploy
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 5m
Blue-Green Deployments
For high-risk changes:
# 1. Deploy to green environment
kubectl apply -f engine-deployment-green.yaml
# 2. Test green environment
curl https://api-green.tenki.cloud/health
# 3. Switch traffic
kubectl patch service engine -p '{"spec":{"selector":{"version":"green"}}}'
# 4. Monitor
watch 'kubectl get pods -l app=engine'
# 5. If issues, switch back
kubectl patch service engine -p '{"spec":{"selector":{"version":"blue"}}}'
Deployment Checklist
Pre-deployment
- All tests passing
- Code reviewed and approved
- Database migrations tested
- Rollback plan prepared
- Team notified in Slack
Deployment
- Images built and pushed
- Manifests updated
- Deployment monitored
- Health checks passing
- Smoke tests completed
Post-deployment
- Metrics normal
- No error spikes
- Customer reports checked
- Documentation updated
- Deployment logged
Troubleshooting
Pod Won't Start
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --sort-by='.lastTimestamp'
Image Pull Errors
# Check secret
kubectl get secret regcred -o yaml
# Re-create if needed
kubectl create secret docker-registry regcred \
--docker-server=registry.tenki.cloud \
--docker-username=$USER \
--docker-password=$PASS
Configuration Issues
# Check ConfigMaps
kubectl get configmap
kubectl describe configmap engine-config
# Check Secrets
kubectl get secrets
kubectl describe secret engine-secrets
CI/CD Pipeline
Our GitHub Actions pipeline:
- On PR: Run tests, build images, deploy to preview
- On merge to main: Build, tag, push to registry
- On tag: Build production images, create release
See `.github/workflows/deploy.yml` in the repository root.
Monitoring Guide
Overview
This guide covers monitoring and observability practices for Tenki Cloud operations.
Stack
- Prometheus: Metrics collection
- Grafana: Visualization and dashboards
- Loki: Log aggregation
- Tempo: Distributed tracing
- Alertmanager: Alert routing
Metrics
Application Metrics
Key metrics to monitor:
- Request rate and latency
- Error rates (4xx, 5xx)
- Database connection pool stats
- Background job queue depth
- GitHub API rate limits
Infrastructure Metrics
- CPU and memory usage
- Disk I/O and space
- Network throughput
- Container health
- Database performance
Dashboards
Available Dashboards
- Application Overview: High-level health metrics
- API Performance: Request rates, latencies, errors
- Database Health: Connections, query performance
- GitHub Integration: Runner stats, API usage
- Billing System: Transaction volumes, failures
Creating Dashboards
- Use Grafana dashboard as code
- Store dashboards in `deployments/grafana/dashboards/`
- Follow naming convention: `category-name.json`
- Include appropriate tags and metadata
Alerts
Alert Rules
Critical alerts:
- API availability < 99.9%
- Database CPU > 80%
- Disk space < 20%
- Error rate > 5%
- GitHub API rate limit < 1000
Alert Routing
- Critical: PagerDuty (immediate response)
- Warning: Slack #alerts channel
- Info: Email daily digest
Logs
Log Levels
- ERROR: Actionable errors requiring investigation
- WARN: Potential issues, degraded performance
- INFO: Important business events
- DEBUG: Detailed troubleshooting information
Structured Logging
Always use structured logging with consistent fields:
- `trace_id`: Request correlation ID
- `user_id`: User identifier
- `org_id`: Organization identifier
- `error`: Error message and stack trace
Tracing
Instrumentation
- Trace all API endpoints
- Include database queries
- Add custom spans for business logic
- Propagate trace context to external services
Sampling
- 100% sampling for errors
- 10% sampling for successful requests
- Adjust based on traffic volume
SLOs and SLIs
Service Level Indicators
- API latency (p50, p95, p99)
- Error rate
- Availability
- Database query time
Service Level Objectives
- 99.9% API availability
- p95 latency < 500ms
- Error rate < 0.1%
- Zero data loss
Manual Billing Workflow Execution
This guide provides information for manually executing billing workflows using Temporal CLI or other workflow execution tools.
Prerequisites
- Access to Temporal cluster
- Proper authentication and permissions
- Understanding of workspace IDs and billing periods
Common Parameters
All billing workflows use the following task queue:
- Task Queue: `BILLING_TASK_QUEUE`
Workflows
BillingListWorkspaceBalanceWorkflow
Retrieves invoice line items and balance information for a specific workspace and billing period.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingListWorkspaceBalanceWorkflow`
Payload Example:
{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}
Parameters:
- `workspace_id` (string): UUID of the workspace
- `billing_period` (string): Billing period in YYYY-MM format
- `billing_period_start` (time): Start of billing period (ISO 8601)
- `billing_period_end` (time): End of billing period (ISO 8601)
Expected Result:
{
"line_items": [
{
"description": "Runner Usage - tenki-standard-autoscale",
"runner_label": "tenki-standard-autoscale",
"quantity": 120,
"unit_price": 0.01,
"amount": 1.2
}
],
"total_amount": 1.2,
"timestamp": "2024-01-31T23:59:59Z"
}
BillingCycleScheduleWorkflow
Parent workflow that orchestrates billing cycles for all workspaces. Typically triggered by a scheduled cron job.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingCycleScheduleWorkflow`
Payload Example:
{
"luxor_only": false,
"exclude_workspaces": ["123e4567-e89b-12d3-a456-426614174000", "987fcdeb-51a2-43d1-b567-123456789abc"]
}
Parameters:
- `luxor_only` (boolean, optional): Filter to process only Luxor customers (`is_luxor = true`)
- `exclude_workspaces` (array of UUIDs, optional): List of workspace IDs to exclude from the billing cycle
Behavior:
- Queries all active workspaces with billing accounts
- Spawns individual `BillingCycleWorkflow` child workflows for each workspace
- Handles the current billing period automatically
BillingCycleWorkflow
Individual workspace billing processing workflow. Handles invoice generation, charging, and payment processing for a single workspace.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingCycleWorkflow`
Payload Example:
{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}
Parameters:
- `workspace_id` (UUID): The workspace to process billing for
- `billing_period` (string): Billing period in YYYY-MM format
- `billing_period_start` (time): Start of billing period (ISO 8601)
- `billing_period_end` (time): End of billing period (ISO 8601)
Workflow Steps:
- Generate Stripe invoice with line items
- Process invoice and attempt payment
- Handle TigerBeetle accounting transfers
- Create billing payment records
- Process promotional credit adjustments
- Reset monthly free credits
BillingPaymentReversalWorkflow
Reverses a payment by creating a reversal transfer in TigerBeetle and updating the payment status to "reversed". Used for refunds, chargebacks, or administrative corrections.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingPaymentReversalWorkflow`
Payload Example:
{
"payment_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Customer requested refund",
"initiated_by_email": "admin@tenki.cloud"
}
Parameters:
- `payment_id` (UUID): The payment ID to reverse
- `workspace_id` (UUID): The workspace that owns the payment
- `reason` (string): Reason for the reversal (required)
- `initiated_by_email` (string): Email of the person initiating the reversal (required)
Expected Result:
{
"success": true,
"reversal_transfer_id": "base64-encoded-transfer-id",
"original_amount": "12.50",
"reversed_at": "2024-01-31T15:30:00Z"
}
Workflow Steps:
- Validate required parameters (payment_id, workspace_id, reason, initiated_by_email)
- Lookup payment details from database and TigerBeetle
- Create reversal transfer in TigerBeetle using original transfer details
- Update payment status to "reversed" with reversal details
BillingUsageReversalWorkflow
Reverses a usage event by creating a reversal transfer in TigerBeetle and deleting the usage event record. Used for correcting erroneous charges or administrative adjustments to usage records.
Workflow Details:
- Task Queue: `BILLING_TASK_QUEUE`
- Workflow Type: `BillingUsageReversalWorkflow`
Payload Example:
{
"usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Incorrect runner charge - job failed",
"initiated_by_email": "admin@tenki.cloud"
}
Parameters:
- `usage_event_id` (UUID): The usage event ID to reverse (required)
- `workspace_id` (UUID): The workspace that owns the usage event (required)
- `reason` (string): Reason for the reversal (required)
- `initiated_by_email` (string): Email of the person initiating the reversal (required)
Expected Result:
{
"success": true,
"reversal_transfer_id": "base64-encoded-transfer-id",
"original_amount": "0.50",
"reversed_at": "2024-01-31T15:30:00Z"
}
Workflow Steps:
- Validate required parameters (usage_event_id, workspace_id, reason, initiated_by_email)
- Fetch usage event details from database
- Verify workspace ownership matches provided workspace_id
- Lookup actual transfer details from TigerBeetle
- Create reversal transfer in TigerBeetle using original transfer amounts
- Delete the usage event record from database
Important Notes:
- Unlike payment reversals, usage event reversals permanently delete the record (no audit trail in the usage_events table)
- The reversal transfer in TigerBeetle maintains the financial audit trail
- Workspace ID validation ensures the usage event belongs to the specified workspace
Temporal CLI Examples
Execute BillingListWorkspaceBalanceWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingListWorkspaceBalanceWorkflow \
--input '{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}'
Execute BillingCycleScheduleWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingCycleScheduleWorkflow \
--input '{
"luxor_only": false,
"exclude_workspaces": []
}'
Execute BillingCycleWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingCycleWorkflow \
--input '{
"workspace_id": "123e4567-e89b-12d3-a456-426614174000",
"billing_period": "2024-01",
"billing_period_start": "2024-01-01T00:00:00Z",
"billing_period_end": "2024-01-31T23:59:59.999Z"
}'
Execute BillingPaymentReversalWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingPaymentReversalWorkflow \
--input '{
"payment_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Customer requested refund",
"initiated_by_email": "admin@tenki.cloud"
}'
Execute BillingUsageReversalWorkflow
temporal workflow start \
--task-queue BILLING_TASK_QUEUE \
--type BillingUsageReversalWorkflow \
--input '{
"usage_event_id": "123e4567-e89b-12d3-a456-426614174000",
"workspace_id": "987fcdeb-51a2-43d1-b567-123456789abc",
"reason": "Incorrect runner charge - job failed",
"initiated_by_email": "admin@tenki.cloud"
}'
Notes
- All timestamps should be in UTC
- Workspace IDs must be valid UUIDs
- Billing periods follow YYYY-MM format
- The `BillingCycleScheduleWorkflow` is typically run automatically via Temporal schedules
- Individual `BillingCycleWorkflow` executions can be run manually for specific workspaces
- Use `BillingListWorkspaceBalanceWorkflow` to preview billing information before processing
Operational Runbooks
This section contains runbooks for common operational scenarios and incident response.
Available Runbooks
- High Database CPU - When database CPU exceeds 80%
Runbook Template
When creating a new runbook, use this template:
# Runbook: [Issue Name]
## Alert Details
- **Alert Name**: `AlertNameInPrometheus`
- **Severity**: P1 | P2 | P3
- **Team**: Backend | Frontend | Platform
- **Last Updated**: YYYY-MM-DD
## Symptoms
- What the user/system experiences
- What metrics are affected
- What alerts fire
## Quick Diagnostics
\```bash
# Commands to quickly assess the situation
\```
## Resolution Steps
### 1. Immediate Mitigation (X mins)
Steps to stop the bleeding
### 2. Root Cause Analysis (X mins)
How to find what caused the issue
### 3. Fix Implementation
How to fix the underlying problem
### 4. Verification
How to confirm the fix worked
## Prevention
Long-term fixes to prevent recurrence
## Escalation Path
When and who to escalate to
## Related Runbooks
Links to related procedures
Writing Good Runbooks
- Be specific - Include exact commands and expected outputs
- Time-box steps - Indicate how long each step should take
- Include rollback - Always have a way to undo changes
- Test regularly - Run through the runbook quarterly
- Keep updated - Update after each incident
Incident Response Process
- Acknowledge the alert
- Assess using quick diagnostics
- Mitigate following the runbook
- Communicate status updates
- Resolve the root cause
- Document in incident report
Runbook: High Database CPU
Alert Details
- Alert Name: `HighDatabaseCPU`
- Severity: P2
- Team: Backend/Platform
- Last Updated: 2025-06-12
Symptoms
- Database CPU usage > 80% for 5+ minutes
- API response times > 500ms
- Increased error rates in logs
- Grafana dashboard shows CPU spike
Quick Diagnostics
# 1. Check current connections
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;"
# 2. Find slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
substring(query, 1, 50) as query_start,
calls,
mean_exec_time,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;"
# 3. Check for locks
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT
pid,
usename,
pg_blocking_pids(pid) as blocked_by,
query_start,
substring(query, 1, 50) as query
FROM pg_stat_activity
WHERE pg_blocking_pids(pid)::text != '{}';"
Resolution Steps
1. Immediate Mitigation (5 mins)
# Scale up API to reduce per-instance load
kubectl scale deployment/engine --replicas=10
# Kill long-running queries (>5 minutes)
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
AND query NOT LIKE '%pg_stat_activity%';"
2. Identify Root Cause (10 mins)
Check recent deployments:
kubectl get deployments -o wide | grep engine
kubectl rollout history deployment/engine
Review slow query log:
kubectl logs postgres-0 | grep "duration:" | tail -50
Check for missing indexes:
-- Run on affected tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM workflow_runs
WHERE status = 'pending'
AND created_at > NOW() - INTERVAL '1 hour';
3. Fix Implementation
If missing index:
-- Create index (be careful on large tables)
CREATE INDEX CONCURRENTLY idx_workflow_runs_status_created
ON workflow_runs(status, created_at)
WHERE status IN ('pending', 'running');
If bad query from recent deploy:
# Rollback to previous version
kubectl rollout undo deployment/engine
# Or deploy hotfix
git checkout main
git pull
# Fix query
git commit -am "fix: optimize workflow query"
git push
# Deploy via CI/CD
4. Verify Resolution
# Monitor CPU (should drop within 5 mins)
watch -n 5 "kubectl exec -it postgres-0 -- psql -U postgres -c 'SELECT round(100 * cpu_usage) as cpu_percent FROM pg_stat_database_stats;'"
# Check API latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.tenki.lab/health
# Verify no more slow queries
kubectl exec -it postgres-0 -- psql -U postgres tenki -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '1 minute';"
Long-term Prevention
- Add query timeout to engine configuration
- Set up query monitoring in Datadog/NewRelic
- Regular ANALYZE on high-traffic tables
- Consider read replicas for analytics queries
- Implement connection pooling with PgBouncer
Escalation Path
- 15 mins: If CPU still high → Page backend on-call
- 30 mins: If impacting customers → Incident Commander
- 45 mins: If data corruption risk → CTO
Related Runbooks
Post-Incident
- Create incident report
- Add missing monitoring
- Update this runbook with findings
- Schedule postmortem if customer impact
Runbook: High API Latency
Overview
This runbook covers troubleshooting and resolving high API latency issues.
Symptoms
- p95 latency > 500ms
- User reports of slow loading
- Timeout errors in client applications
- Increased error rates due to timeouts
Impact
- Poor user experience
- Increased error rates
- Potential cascading failures
- Customer complaints
Detection
- Alert: `APILatencyHigh`
- Threshold: p95 > 500ms for 5 minutes
- Dashboard: API Performance
Response
Immediate Actions
1. Check current latency
- View p50, p95, p99 latencies
- Identify affected endpoints
- Check error rates
2. Verify system health
# Check pod status
kubectl get pods -n production
# Check resource usage
kubectl top pods -n production
# Check recent deployments
kubectl rollout history deployment/api -n production
3. Enable detailed logging (temporarily)
kubectl set env deployment/api LOG_LEVEL=debug -n production
Diagnosis
1. Database performance
- Check slow query log
- Review connection pool status
- Look for lock contention
2. External dependencies
- GitHub API response times
- Payment processor latency
- CDN performance
3. Application issues
- Memory leaks (increasing memory usage)
- CPU bottlenecks
- Inefficient algorithms
Common Causes and Fixes
1. Database Queries
Symptom: High database CPU, slow queries
Fix:
-- Find slow queries
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Add missing indexes
CREATE INDEX CONCURRENTLY idx_table_column ON table(column);
2. Cache Misses
Symptom: High cache miss rate
Fix:
- Warm up caches after deployment
- Increase cache TTL for stable data
- Review cache key generation
3. Resource Constraints
Symptom: High CPU/memory usage
Fix:
# Scale horizontally
kubectl scale deployment api --replicas=6 -n production
# Or scale vertically (requires restart)
kubectl set resources deployment api -c api --requests=memory=2Gi,cpu=1000m -n production
4. Inefficient Code
Symptom: Specific endpoints consistently slow
Fix:
- Profile the endpoint
- Optimize algorithms
- Implement pagination
- Add caching layer
Recovery
1. Quick wins
- Increase cache TTLs
- Scale out services
- Enable read replicas
2. Rollback if needed
kubectl rollout undo deployment/api -n production
3. Communicate status
- Update status page
- Notify affected customers
- Post in #incidents channel
Prevention
- Load testing before major releases
- Gradual rollouts with canary deployments
- Query performance regression tests
- Capacity planning reviews
Monitoring
Key metrics to watch:
- API latency percentiles
- Database query time
- Cache hit rates
- Resource utilization
- Error rates
Related
Runbook: High Database Connections
Overview
This runbook describes how to handle situations where the database connection pool is exhausted or nearing its limits.
Symptoms
- Application errors: "too many connections"
- Slow API responses
- Connection pool metrics showing high usage
- Database showing max_connections limit reached
Impact
- API requests fail
- Background jobs unable to process
- Users experience errors and timeouts
Detection
- Alert: `DatabaseConnectionsHigh`
- Threshold: > 80% of max_connections
- Dashboard: Database Health
Response
Immediate Actions
1. Check current connections
SELECT count(*) FROM pg_stat_activity;
SELECT usename, application_name, count(*)
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY count DESC;
2. Identify idle connections
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes';
3. Kill long-idle connections (if safe)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '30 minutes';
Root Cause Analysis
1. Check for connection leaks
- Review recent deployments
- Check for missing `defer db.Close()`
- Look for transactions not being committed/rolled back
2. Review pool configuration
- Current settings in environment
- Calculate optimal pool size
- Check for misconfigured services
3. Analyze traffic patterns
- Sudden spike in requests
- New feature causing more queries
- Background job issues
Long-term Fixes
1. Optimize connection pool settings
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(10)
db.SetConnMaxLifetime(5 * time.Minute)
2. Implement a connection pooler
- Consider PgBouncer for connection multiplexing
- Configure pool modes appropriately
3. Code improvements
- Use prepared statements
- Batch queries where possible
- Implement query result caching
Prevention
- Monitor connection pool metrics
- Load test with realistic concurrency
- Regular code reviews for database usage
- Implement circuit breakers
Related
Runbook: Database Failover
Overview
This runbook covers the process of failing over to a standby database in case of primary database failure.
Symptoms
- Primary database unreachable
- Replication lag increasing indefinitely
- Database corruption detected
- Catastrophic hardware failure
Impact
- Complete service outage
- Data writes blocked
- Potential data loss (depending on replication lag)
Detection
- Alert: `DatabasePrimaryDown`
- Alert: `DatabaseReplicationLagHigh`
- Dashboard: Database Health
Pre-failover Checks
1. Verify Primary is Down
# Check connectivity
pg_isready -h primary.db.tenki.cloud -p 5432
# Check from multiple locations
for host in api-1 api-2 worker-1; do
ssh $host "pg_isready -h primary.db.tenki.cloud"
done
2. Check Replication Status
-- On standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
-- Check if standby is receiving updates
SELECT * FROM pg_stat_replication;
3. Assess Data Loss Risk
- Note the last transaction timestamp
- Document replication lag
- Make go/no-go decision based on business impact
Failover Process
1. Stop All Application Traffic
# Scale down applications
kubectl scale deployment api worker --replicas=0 -n production
# Verify no active connections
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
2. Promote Standby
# On standby server
pg_ctl promote -D /var/lib/postgresql/data
# Or using managed service commands
gcloud sql instances promote-replica tenki-db-standby
3. Update Connection Strings
# Update DNS
terraform apply -var="database_host=standby.db.tenki.cloud"
# Or update environment variables
kubectl set env deployment/api deployment/worker \
DATABASE_URL=postgres://user:pass@standby.db.tenki.cloud/tenki \
-n production
4. Verify New Primary
-- Check if accepting writes
SELECT pg_is_in_recovery(); -- Should return false
-- Test write
INSERT INTO health_check (timestamp) VALUES (now());
5. Resume Application Traffic
# Scale up applications
kubectl scale deployment api --replicas=3 -n production
kubectl scale deployment worker --replicas=2 -n production
# Monitor for errors
kubectl logs -f deployment/api -n production
Post-Failover Tasks
1. Immediate
- Monitor application health
- Check for data inconsistencies
- Communicate status to stakeholders
2. Within 1 Hour
- Set up new standby from old primary (if recoverable)
- Update monitoring to reflect new topology
- Document timeline and impact
3. Within 24 Hours
- Root cause analysis
- Update disaster recovery procedures
- Test backup restoration process
Rollback Procedure
If failover was premature or primary recovers:
- Stop applications again
- Ensure data consistency
- Compare transaction IDs
- Check for split-brain scenarios
- Resync if needed

  ```bash
  pg_rewind --target-pgdata=/var/lib/postgresql/data \
    --source-server="host=primary.db.tenki.cloud"
  ```

- Switch back to primary
- Resume traffic
Prevention
- Regular failover drills
- Monitor replication lag closely
- Implement automatic failover with proper fencing
- Use synchronous replication for critical data
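For the synchronous-replication item, the relevant settings live in `postgresql.conf` on the primary; the standby name below is a placeholder:

```ini
# postgresql.conf (primary) — standby name is a placeholder
synchronous_standby_names = 'FIRST 1 (tenki_standby)'
synchronous_commit = on   # COMMIT waits for standby confirmation
```

The trade-off: synchronous replication bounds data loss at failover to zero committed transactions, at the cost of added commit latency.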
Related
Runbook: Playwright Scenario Failed
Test Failure Due to Multiple Matching Elements with Similar Text
Alert Details
- Alert Name: Tenki Production - App Can Login
- Severity: P2
- Team: Frontend
- Last Updated: 2025-09-08
Symptoms
- Playwright test `should allow entering email and password` fails
Quick Diagnostics
kubectx tenki-prod-apps
Resolution Steps
1. Immediate Mitigation (5-10 mins)
- Verified login manually on staging and production; both worked.
- Ran `kubectx tenki-prod-apps` and checked logs across the namespace; all pods are in `Running` status.
2. Root Cause Analysis (10 mins)
- The test failed due to a strict mode violation in Playwright.
- The locator matched multiple elements containing the text `Projects`, so Playwright could not determine which one to interact with.
- Playwright requires a locator to resolve to a single unique element when asserting `.toBeVisible()` in strict mode.
3. Fix Implementation / Possible Resolution
- Add a unique internal ID to the correct element so the test can reliably target it without confusion from similar elements.
- Update the test to match exact text to avoid picking up similar elements.
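To make the failure mode concrete, here is a self-contained sketch of strict-mode resolution over a mock element list — not the real Playwright API — showing why a substring match throws and how a unique test id or exact text resolves it:

```typescript
// Mock of Playwright's strict-mode behavior: a locator that resolves
// to more than one element throws instead of guessing.
type El = { text: string; testId?: string };

function strictLocate(els: El[], match: (e: El) => boolean): El {
  const hits = els.filter(match);
  if (hits.length !== 1) {
    throw new Error(`strict mode violation: ${hits.length} elements matched`);
  }
  return hits[0];
}

// Hypothetical page: three elements whose text contains "Projects".
const page: El[] = [
  { text: "Projects" },                        // nav link
  { text: "Projects overview" },               // page heading
  { text: "Projects", testId: "projects-tab" } // the element we want
];

// Substring match is ambiguous — this is what made the test fail:
// strictLocate(page, (e) => e.text.includes("Projects")) // throws

// Fix 1: match by a unique test id added to the component
// (mirrors page.getByTestId("projects-tab") in Playwright)
const byId = strictLocate(page, (e) => e.testId === "projects-tab");

// Fix 2: exact-text match (mirrors getByText("...", { exact: true }))
const exact = strictLocate(page, (e) => e.text === "Projects overview");

console.log(byId.testId, exact.text);
```

Note that exact text alone is not always enough — two elements here have the exact text `Projects` — which is why the unique test id is the more robust fix.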
4. Verification
- The scenario passed when re-run from Monitors.
Prevention
- Ensure dynamic or conditionally rendered UI elements have unique, stable test IDs
Related Runbooks
Not Applicable
On-call Guide
Last updated: 2025-06-30
Qualification
1. Watch the initial onboarding video.
2. Refer to this Notion document.
3. Duplicate this sample and use your name as the title.
Completing these steps ensures you are properly qualified and aware of your on-call responsibilities.
Product Requirement Documents (PRDs)
This directory contains PRDs for major features and initiatives. Each PRD captures the why, what, and success criteria for a feature.
PRD Template
# PRD-XXX: Feature Name
**Author**: Name
**Date**: YYYY-MM-DD
**Status**: Draft | In Review | Approved | In Development | Launched
## Summary
One paragraph overview of what we're building and why.
## Problem Statement
What problem are we solving? Who experiences this problem? Why does it matter?
## Goals & Success Metrics
- **Primary Goal**: What we must achieve
- **Success Metrics**:
- Metric 1: Target value
- Metric 2: Target value
## User Stories
1. As a [user type], I want to [action] so that [benefit]
2. As a [user type], I want to [action] so that [benefit]
## Requirements
### Must Have (MVP)
- [ ] Requirement 1
- [ ] Requirement 2
### Should Have
- [ ] Requirement 3
- [ ] Requirement 4
### Nice to Have
- [ ] Requirement 5
## Technical Approach
High-level technical approach. Details go in technical design docs.
## Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
| ------ | ------ | ---------- | ------------------- |
| Risk 1 | High | Medium | How we'll handle it |
## Timeline
- Week 1-2: Design and planning
- Week 3-4: Implementation
- Week 5: Testing and rollout
## Open Questions
- [ ] Question 1
- [ ] Question 2
Current PRDs
- PRD-001: GitHub Integration - Connect GitHub organizations
Writing a Good PRD
Doβs
- Start with the problem, not the solution
- Include measurable success criteria
- Keep it concise (2-3 pages max)
- Focus on the βwhatβ and βwhyβ, not βhowβ
- Include user stories
Donβts
- Donβt include implementation details
- Donβt skip the problem statement
- Donβt forget about edge cases
- Donβt ignore risks
PRD Process
- Draft - PM creates initial PRD
- Review - Engineering, Design, and stakeholders review
- Approval - Leadership approves
- Development - Engineering implements
- Launch - Feature released
- Retrospective - Measure against success criteria
PRD-001: GitHub Integration
Author: Product Team
Date: 2024-01-20
Status: Launched
Summary
Enable customers to connect their GitHub organizations to Tenki Cloud and automatically provision runners for their repositories without any configuration or infrastructure management.
Problem Statement
Development teams waste significant time and money managing GitHub Actions infrastructure:
- Setting up self-hosted runners requires DevOps expertise
- Maintaining runner infrastructure distracts from product development
- GitHubβs hosted runners are expensive and have limited customization
- Scaling runners up/down based on demand is complex
Who experiences this: Engineering teams using GitHub Actions for CI/CD
Impact: Teams spend 10-20 hours/month on runner management instead of shipping features
Goals & Success Metrics
Primary Goal: Zero-config GitHub Actions runners that just work
Success Metrics:
- Time to first runner: < 5 minutes from signup
- Runner startup time: < 30 seconds
- Platform uptime: 99.9%
- Customer runner cost: 50% less than GitHub hosted
- Monthly active organizations: 100 by Q2
User Stories
- As a developer, I want to connect my GitHub org so that runners are automatically available for all my repos
- As a team lead, I want to set spending limits so that we donβt exceed our CI/CD budget
- As a DevOps engineer, I want to customize runner specs so that our builds run efficiently
- As a finance manager, I want to see detailed usage reports so that I can allocate costs to teams
Requirements
Must Have (MVP)
- GitHub App for OAuth authentication
- Automatic runner provisioning for workflow_job events
- Support for Linux runners (Ubuntu 22.04)
- Basic usage dashboard showing minutes used
- Automatic runner cleanup after job completion
- Support for public and private repositories
Should Have
- Multiple runner sizes (2-16 vCPU)
- Usage alerts and spending limits
- Windows and macOS runners
- Runner caching between jobs
- Team-based access controls
Nice to Have
- Custom runner images
- Dedicated runner pools
- GitHub Enterprise Server support
- API for programmatic management
Technical Approach
- GitHub App handles authentication and webhook events
- Webhook handler processes workflow_job events
- Temporal workflows orchestrate runner lifecycle
- Kubernetes operators manage runner pods
- Usage tracking via TigerBeetle for accurate billing
Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| GitHub API rate limits | High | Medium | Implement caching and exponential backoff |
| Runner startup time > 30s | High | Medium | Pre-warm runner pools, optimize images |
| Security vulnerabilities | High | Low | Regular security audits, isolated runners |
| Cost overruns | Medium | Medium | Real-time usage tracking and limits |
Timeline
- Week 1-2: GitHub App development and authentication
- Week 3-4: Webhook handling and runner provisioning
- Week 5-6: Usage tracking and billing integration
- Week 7: Beta testing with friendly customers
- Week 8: Public launch
Open Questions
- Should we support GitHub Enterprise? β Not in MVP
- How do we handle runner caching? β Post-MVP feature
- Whatβs our runner retention policy? β 7 days for logs
- How do we handle abuse/crypto mining? β Usage anomaly detection
Post-Launch Results
Launched: 2025-04-15
Actual Metrics (as of 2025-06-01):
- Time to first runner:
- Runner startup time:
- Platform uptime:
- Cost savings:
- Monthly active orgs:
Key Learnings:
- Pre-warming runner pools was critical for startup time
- Customers want custom images more than expected
- Windows runner demand higher than anticipated
Product Roadmap
Overview
This document outlines the product roadmap for Tenki Cloud, organized by quarters and strategic themes.
Q1 2025
Core Platform
- β GitHub integration MVP
- β Basic runner management
- β Usage tracking and billing
- π§ Self-service onboarding
- π§ Team management
Developer Experience
- β CLI tool
- π§ VS Code extension
- π IntelliJ plugin
Q2 2025
Scale and Performance
- π Multi-region support
- π Runner auto-scaling
- π Performance optimizations
- π Caching improvements
Enterprise Features
- π SSO integration
- π Advanced access controls
- π Audit logging
- π Compliance certifications
Q3 2025
Ecosystem Integration
- π GitLab support
- π Bitbucket support
- π Jenkins integration
- π Kubernetes operators
Advanced Features
- π Custom runner images
- π GPU runner support
- π Spot instance integration
- π Advanced scheduling
Q4 2025
Platform Maturity
- π White-label solution
- π Marketplace integrations
- π Partner ecosystem
- π Advanced analytics
Legend
- β Completed
- π§ In Progress
- π Planned
Feature Requests
Track feature requests in our GitHub Issues.
Feedback
We welcome feedback on our roadmap. Please reach out through:
- GitHub Discussions
- Support channels
- Customer success team
Product Metrics
Overview
This document defines the key metrics we track to measure product success and guide decision-making.
North Star Metrics
Primary Metric: Weekly Active Builds
- Definition: Unique organizations with at least one successful build in the past 7 days
- Target: 20% month-over-month growth
- Current: [Dashboard Link]
Product Metrics
Activation
- Time to First Build: Time from signup to first successful build
- Target: < 10 minutes
- Measured from: Account creation to first build completion
- Activation Rate: % of signups that complete first build within 7 days
- Target: > 80%
- Segmented by: Source, plan type
Engagement
- Build Frequency: Average builds per organization per week
- Target: > 50 builds/week for active orgs
- Segmented by: Organization size, industry
- Runner Utilization: % of time runners are actively building
- Target: > 70% during business hours
- Measured: CPU time / available time
Retention
- 30-Day Retention: % of orgs active after 30 days
- Target: > 85%
- Cohorted by: Signup month
- 90-Day Retention: % of orgs active after 90 days
- Target: > 75%
- Leading indicator: Build frequency in first week
Revenue
- MRR Growth: Month-over-month recurring revenue growth
- Target: 15% MoM
- Segmented by: Plan type, acquisition channel
- Net Revenue Retention: Revenue from existing customers
- Target: > 120%
- Includes: Upgrades, downgrades, churn
Operational Metrics
Performance
- Build Success Rate: % of builds completing successfully
- Target: > 99%
- Excluding: User errors
- API Latency: p95 response time
- Target: < 200ms
- Measured: All API endpoints
Quality
- Customer Satisfaction (CSAT): Post-interaction survey
- Target: > 4.5/5
- Measured: Support interactions
- Net Promoter Score (NPS): Quarterly survey
- Target: > 50
- Segmented by: Customer segment
Leading Indicators
Feature Adoption
- CLI usage rate
- API integration rate
- Advanced features usage
Customer Health
- Support ticket volume
- Feature request patterns
- Churn risk scores
Data Collection
Tools
- Amplitude: Product analytics
- Segment: Event tracking
- Metabase: Business intelligence
- Custom dashboards: Real-time metrics
Privacy
- All metrics are aggregated
- No PII in analytics
- GDPR compliant tracking
- User consent required
Reporting
Weekly
- North star metric update
- Key metric changes
- Anomaly alerts
Monthly
- Full metrics review
- Cohort analysis
- Revenue metrics
- OKR progress
Quarterly
- Strategic metric review
- NPS survey results
- Market comparison
- Forecast updates
π§ͺ Testing Plan: Tenki GitHub Runners Evaluation
π Thursday 6/26 β Phase 1 & Phase 2: Staging & Controlled Evaluation
Phase 1: Staging Load Test
Objective: Validate stability and responsiveness of the new VM-based runners under parallel job load.
Setup:
Trigger ~50 GitHub Actions jobs in parallel using the gh-runner-test repository.
Definition of Done (DoD):
- Jobs are picked up within 30 seconds.
- Job duration is within +5% of baseline execution time from existing Docker-based runners.
Phase 2: Tenki Test Suite Evaluation
Condition: Executed only if Phase 1 is successful.
Objective: Assess runner performance using real-world workflows from the test suite.
Setup:
- Switch the GitHub Actions workspace used by LuxorLabs/tenki-tests to the new VM-based runners.
- Monitor CI jobs for performance and reliability.
Definition of Done (DoD):
- End-to-end performance delta is < 5% compared to current production metrics.
π Friday β Phase 3: βPre-Productionβ Migration
Phase 3: Luxor Workflow Migration
Precondition: All DoDs from Phase 1 and Phase 2 must be fully met.
Objective: Transition production workloads to the new runners, based on successful Thursday validation.
Setup:
- Migrate all GitHub workflows under the Luxor Tenki Workspace to the new VM-based runners.
Definition of Done (DoD):
- All jobs are successful.
- Performance delta is < 5% compared to current production metrics.
Documentation Roadmap
This roadmap tracks documentation that needs to be written for Tenki Cloud. Items are prioritized based on impact and frequency of use.
π¨ Priority 1: Critical Gaps
These affect daily development and operations:
- Environment Variables Reference - Complete list of all env vars
- API Reference - tRPC endpoints and Connect/gRPC services
- GitHub App Setup Guide - Step-by-step installation
- Secrets Management Guide - SOPS usage and key rotation
- Troubleshooting Guide - Common issues and solutions
π§ Priority 2: Configuration & Setup
Essential for proper deployment and configuration:
- Service Configuration Guide - engine.yaml and other configs
- Authentication Setup - Kratos and Keto configuration
- Notification Service Guide - Email and webhook setup
- Database Guide - Schema, migrations, and optimization
- CLI Tool Documentation - tenki-cli command reference
π Priority 3: Operational Excellence
For production operations and monitoring:
- Monitoring & Observability - Metrics, logs, and tracing
- Backup & Restore Procedures - Database and state backup
- Scaling Guidelines - When and how to scale services
- Security Best Practices - Hardening and compliance
- Audit Logging Guide - Event tracking and retention
π Priority 4: Advanced Features
For power users and advanced scenarios:
- Custom Runner Images - Building and managing
- Temporal Workflows Guide - Patterns and testing
- TigerBeetle Integration - Ledger design and reconciliation
- Multi-region Setup - Geographic distribution
- Performance Tuning - Optimization techniques
π Contributing
To add documentation:
- Pick an item from this roadmap
- Create the documentation in the appropriate section
- Update SUMMARY.md to include your new page
- Remove the item from this roadmap
- Submit a PR
Progress Tracking
- Total items: 24
- Completed: 0
- In Progress: 0
- Remaining: 24
Last updated: 2025-06-12