Confidential Document

This document is restricted to RRI leadership.

SCALE — Now Grow
S2

Heroku to K8s Migration

NOT STARTED · Wave 2-3 · 16 weeks

Executive Summary

RRI runs nearly all of its production infrastructure on Heroku: 27+ repositories across dozens of dynos. The Heroku contract ends in September 2026. Missing this deadline means a $150K+ contract extension plus ongoing operational risk on an unmigrated platform. Only 3 of the 27+ repos currently run on K8s.

The target architecture is Talos Linux + ArgoCD + Cilium service mesh on RRI-owned infrastructure. This is a 16-week phased migration that must start by April 15 to hit the September deadline. The migration covers order-ingestion (12 dynos — the highest-risk, most critical service), event-api (8 dynos), members-portal, and all supporting services.

The financial case is strong: K8s is 40-60% cheaper than Heroku at RRI’s dyno count, producing $96K-$180K/year in infrastructure savings. But this cannot be a one-person operation — Zach Hardesty is bus factor 1 on infrastructure, and the D1 bus factor elimination program must complete before S2 can safely execute.
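The savings range can be sanity-checked by back-calculating the Heroku spend it implies. The source does not state RRI's current annual Heroku bill, so the sketch below only derives what the quoted figures imply; the function names are illustrative.

```python
def implied_spend(annual_savings, savings_pct):
    """Annual Heroku spend implied by a savings amount and a savings percentage."""
    return annual_savings * 100 // savings_pct


def savings_range(annual_spend, pct_low=40, pct_high=60):
    """Savings at the low and high end of the quoted 40-60% reduction."""
    return annual_spend * pct_low // 100, annual_spend * pct_high // 100


# Figures quoted in the summary: 40-60% cheaper, $96K-$180K/year saved.
low_base = implied_spend(96_000, 40)
high_base = implied_spend(180_000, 60)
print(f"Implied Heroku spend: ${low_base:,} to ${high_base:,} per year")

# What the 40-60% reduction yields at each implied baseline.
for base in (low_base, high_base):
    low, high = savings_range(base)
    print(f"  at ${base:,}/year: savings of ${low:,} to ${high:,}")
```

If this reading is right, the quoted range combines uncertainty in both the percentage and the baseline spend: a single fixed baseline would give a narrower range than $96K-$180K.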

What Needs to Happen

  1. Architecture design: Talos Linux + ArgoCD + Cilium service mesh — Define the target K8s architecture. Talos for immutable OS, ArgoCD for GitOps deployments, Cilium for service mesh and network policies. Week 1-2.
  2. Set up K8s staging environment — Mirror production topology. 3 of 27+ repos already run on K8s — use these as the foundation. Validate ArgoCD deployment pipeline. Week 2-4.
  3. Migrate non-critical services first (canary approach) — Start with low-risk, low-traffic services to validate the migration playbook and build confidence. Week 4-6.
  4. Migrate order-ingestion (12 dynos) — Highest risk, most critical service. Handles all payment processing from Stripe/ClickFunnels/Shopify/CopeCart/POS. Requires extensive testing, rollback plan, and zero-downtime cutover. Week 6-10.
  5. Migrate event-api (8 dynos) — Handles Obv.io sync and magic link generation. Critical during events. Schedule migration between events for safety. Week 8-12.
  6. Migrate members-portal — Customer-facing portal (currently Node 11 on Heroku). Coordinate with D2 (Node 22 upgrade) — migrate the upgraded version. Week 10-13.
  7. Migrate supporting services — All remaining Heroku services: webhooks, background workers, internal tools. Week 12-15.
  8. Decommission Heroku — Final cutover. Must complete before September 2026 contract end. Validate all services running on K8s, confirm no Heroku dependencies remain. Week 15-16.
  9. Post-migration validation — Monitoring, cost validation, performance benchmarks. Confirm $96K-$180K/year savings materialize. Compare latency, error rates, and throughput against Heroku baselines. Week 16+.
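Several of the step windows above overlap, which matters given the staffing constraint. A minimal sketch that counts how many steps are active in each week, assuming every "Week A-B" range is inclusive at both ends and leaving out the open-ended step 9:

```python
from collections import Counter

# (step, label, first week, last week), copied from the list above.
# Step 9 ("Week 16+") is open-ended and intentionally omitted.
STEPS = [
    (1, "Architecture design", 1, 2),
    (2, "K8s staging environment", 2, 4),
    (3, "Non-critical services", 4, 6),
    (4, "order-ingestion", 6, 10),
    (5, "event-api", 8, 12),
    (6, "members-portal", 10, 13),
    (7, "Supporting services", 12, 15),
    (8, "Decommission Heroku", 15, 16),
]


def active_steps_per_week(steps):
    """Map each week number to the number of steps scheduled in it."""
    load = Counter()
    for _, _, first, last in steps:
        for week in range(first, last + 1):  # inclusive range (assumption)
            load[week] += 1
    return dict(sorted(load.items()))


def peak_weeks(steps):
    """Return the highest concurrent step count and the weeks where it occurs."""
    load = active_steps_per_week(steps)
    peak = max(load.values())
    return peak, [week for week, count in load.items() if count == peak]


peak, weeks = peak_weeks(STEPS)
print(f"Peak of {peak} concurrent steps in weeks {weeks}")
```

Under the inclusive reading, the busiest weeks are 10 and 12, when three service migrations overlap. If adjacent windows are meant to hand off at the boundary rather than overlap, the counts drop accordingly.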

Non-negotiable deadline: Heroku contract ends September 2026. Missing this deadline triggers a $150K+ contract extension. Migration must start by April 15 to have any chance of completing on time. Under Scenario A (current team only), the probability of missing this deadline is MEDIUM.

Claude Code acceleration: K8s manifests, ArgoCD configuration, Helm charts, Cilium network policies, and migration scripts are all highly automatable. Claude Code is estimated to save ~4-6 weeks on infrastructure-as-code, bringing 16 weeks down to 10-12 weeks. The human work (testing, validation, zero-downtime cutover planning) cannot be compressed.
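The deadline arithmetic in the two notes above can be checked with a short date calculation. Two assumptions are not in the source: that "April 15" means April 15, 2026, and that the contract is treated as ending on September 1, 2026 (the most conservative reading of "September 2026").

```python
from datetime import date, timedelta

START = date(2026, 4, 15)    # assumption: the April 15 start falls in 2026
DEADLINE = date(2026, 9, 1)  # assumption: earliest possible "September 2026" end


def finish_date(start, weeks):
    """Date on which a migration of the given length would complete."""
    return start + timedelta(weeks=weeks)


def buffer_days(start, weeks, deadline):
    """Days of slack between the finish date and the deadline."""
    return (deadline - finish_date(start, weeks)).days


# 16 weeks is the original plan; 12 and 10 are the accelerated estimates.
for weeks in (16, 12, 10):
    done = finish_date(START, weeks)
    slack = buffer_days(START, weeks, DEADLINE)
    print(f"{weeks} weeks: finishes {done.isoformat()}, {slack} days of buffer")
```

With these assumptions, the 16-week plan finishes on 2026-08-05 with 27 days of buffer, so a slip of roughly four weeks in either the start date or execution would consume the margin. A later contract end date within September would widen the buffer.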

Migration Scope

Service             | Current (Heroku)       | Risk Level | Migration Window
order-ingestion     | 12 dynos, BullMQ/Redis | CRITICAL   | Week 6-10
event-api           | 8 dynos                | HIGH       | Week 8-12
members-portal      | Node 11, Heroku        | HIGH       | Week 10-13
Supporting services | Various                | MEDIUM     | Week 12-15
Already on K8s      | 3 repos                | N/A        | Complete

Target Architecture

Component        | Technology           | Purpose
Operating System | Talos Linux          | Immutable, API-driven K8s OS. No SSH, no shell: security by design.
GitOps           | ArgoCD               | Declarative deployments from Git. Automatic sync, drift detection, rollback.
Service Mesh     | Cilium               | eBPF-based networking, network policies, observability. Replaces kube-proxy.
Monitoring       | Grafana + Prometheus | Existing stack (from D6). Extended with K8s-specific dashboards.
Infrastructure   | RRI-owned            | No vendor lock-in. Full control over compute, networking, storage.
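To make the GitOps and network-policy rows concrete, the sketch below generates an Argo CD Application and a CiliumNetworkPolicy for one service as JSON, which kubectl accepts alongside YAML. The repository URL, path, namespace, caller label, and port are hypothetical placeholders, not RRI's actual values, and the sync settings are one possible choice rather than the planned configuration.

```python
import json


def argocd_application(name, repo_url, path, namespace):
    """Build an Argo CD Application that auto-syncs one service from Git."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": "main"},
            "destination": {
                "server": "https://kubernetes.default.svc",  # in-cluster API
                "namespace": namespace,
            },
            # Automated sync with pruning and self-healing of drift.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def cilium_ingress_policy(app, allowed_from, port):
    """Allow ingress to `app` pods only from `allowed_from` pods on one TCP port."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{app}-ingress"},
        "spec": {
            "endpointSelector": {"matchLabels": {"app": app}},
            "ingress": [{
                "fromEndpoints": [{"matchLabels": {"app": allowed_from}}],
                "toPorts": [{"ports": [{"port": str(port), "protocol": "TCP"}]}],
            }],
        },
    }


# Hypothetical values for illustration only.
app = argocd_application(
    "order-ingestion",
    "https://git.example.com/rri/order-ingestion.git",
    "deploy/production",
    "order-ingestion",
)
policy = cilium_ingress_policy("order-ingestion", "api-gateway", 8080)
print(json.dumps(app, indent=2))
print(json.dumps(policy, indent=2))
```

Generating manifests from one function per resource type keeps the 27+ repos consistent, although Helm charts or Kustomize overlays would serve the same purpose.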

Completion Criteria

  • All 27+ repos migrated from Heroku to K8s
  • order-ingestion (12 dynos) running on K8s with zero-downtime cutover completed
  • event-api (8 dynos) running on K8s
  • members-portal running on K8s (Node 22 version from D2)
  • ArgoCD GitOps pipeline operational for all services
  • Cilium service mesh deployed with network policies
  • Heroku contract terminated — no remaining Heroku dependencies
  • Infrastructure savings of $96K-$180K/year validated
  • Performance benchmarks match or exceed Heroku baselines
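The last criterion above, matching or exceeding Heroku baselines, can be expressed as an automated check. This is a minimal sketch: the metric names, the 5% noise tolerance, and the sample numbers are all assumptions for illustration, not measured RRI data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    p95_latency_ms: float   # lower is better
    error_rate: float       # fraction of failed requests, lower is better
    throughput_rps: float   # requests per second, higher is better


def regressions(baseline, candidate, tolerance=0.05):
    """Return the metrics where candidate is worse than baseline beyond tolerance."""
    failed = []
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + tolerance):
        failed.append("p95_latency_ms")
    if candidate.error_rate > baseline.error_rate * (1 + tolerance):
        failed.append("error_rate")
    if candidate.throughput_rps < baseline.throughput_rps * (1 - tolerance):
        failed.append("throughput_rps")
    return failed


# Made-up numbers purely to exercise the check.
heroku = Benchmark(p95_latency_ms=200.0, error_rate=0.010, throughput_rps=500.0)
k8s = Benchmark(p95_latency_ms=180.0, error_rate=0.020, throughput_rps=520.0)
print(regressions(heroku, k8s))
```

An empty list means the K8s deployment meets the baseline on every metric. Setting `tolerance=0` gives the strict "match or exceed" reading of the criterion.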

Initiative Attributes

S2 — Heroku to K8s Migration
Cost: Net savings of $96K-$180K/year (K8s is 40-60% cheaper than Heroku at RRI’s dyno count)
Timeline (Original): 16 weeks (Waves 2-3; must start April 15 to hit the September deadline)
Timeline (With Claude Code): 10-12 weeks (acceleration on K8s manifests, ArgoCD, Helm charts)
Owner: Zach Hardesty (architecture) + new DevOps hire (execution)
Dependencies: Requires D1 (bus factor: Zach can’t be sole operator) and D5 (Redis must be managed/portable)
Deadline: September 2026 (Heroku contract end). Non-negotiable.
Revenue Impact: $96K-$180K/year infrastructure savings. Missing deadline = $150K+ contract extension.
Success Metrics: All services on K8s before September 2026; infrastructure costs reduced 40-60%; zero-downtime migration

Related Risks

ID  | Risk                                           | Severity | Probability                             | Mitigation
RF5 | S2 Heroku migration misses September deadline | HIGH     | MEDIUM (Scenario A) / LOW (Scenario B) | Must start April 15. New DevOps hire (H3) essential: Zach alone cannot run infra AND execute 16-week migration. Contract extension fallback: $150K+.

Scenario dependency: Under Scenario A (current team only), S2 has a MEDIUM probability of missing the September deadline because Zach is bus factor 1 and cannot simultaneously run infrastructure and execute a 16-week migration. Scenario B (fully resourced) adds a DevOps hire, reducing the probability to LOW.