Heroku to K8s Migration
Executive Summary
RRI runs nearly its entire production infrastructure on Heroku: 27+ repositories across dozens of dynos. The Heroku contract ends in September 2026. Missing this deadline means a $150K+ contract extension plus ongoing operational risk on an unmigrated platform. Only 3 of the 27+ repos currently run on K8s.
The target architecture is Talos Linux + ArgoCD + Cilium service mesh on RRI-owned infrastructure. This is a 16-week phased migration that must start by April 15 to hit the September deadline. The migration covers order-ingestion (12 dynos — the highest-risk, most critical service), event-api (8 dynos), members-portal, and all supporting services.
The financial case is strong: K8s is 40-60% cheaper than Heroku at RRI’s dyno count, producing $96K-$180K/year in infrastructure savings. But this cannot be a one-person operation — Zach Hardesty is bus factor 1 on infrastructure, and the D1 bus factor elimination program must complete before S2 can safely execute.
What Needs to Happen
- Architecture design: Talos Linux + ArgoCD + Cilium service mesh — Define the target K8s architecture. Talos for immutable OS, ArgoCD for GitOps deployments, Cilium for service mesh and network policies. Week 1-2.
- Set up K8s staging environment — Mirror production topology. 3 of 27+ repos already run on K8s — use these as the foundation. Validate ArgoCD deployment pipeline. Week 2-4.
- Migrate non-critical services first (canary approach) — Start with low-risk, low-traffic services to validate the migration playbook and build confidence. Week 4-6.
- Migrate order-ingestion (12 dynos) — Highest risk, most critical service. Handles all payment processing from Stripe/ClickFunnels/Shopify/CopeCart/POS. Requires extensive testing, rollback plan, and zero-downtime cutover. Week 6-10.
- Migrate event-api (8 dynos) — Handles Obv.io sync and magic link generation. Critical during events. Schedule migration between events for safety. Week 8-12.
- Migrate members-portal — Customer-facing portal (currently Node 11 on Heroku). Coordinate with D2 (Node 22 upgrade) — migrate the upgraded version. Week 10-13.
- Migrate supporting services — All remaining Heroku services: webhooks, background workers, internal tools. Week 12-15.
- Decommission Heroku — Final cutover. Must complete before September 2026 contract end. Validate all services running on K8s, confirm no Heroku dependencies remain. Week 15-16.
- Post-migration validation — Monitoring, cost validation, performance benchmarks. Confirm $96K-$180K/year savings materialize. Compare latency, error rates, and throughput against Heroku baselines. Week 16+.
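The ArgoCD pipeline referenced in the steps above is driven by one declarative Application per service. A minimal sketch, assuming a hypothetical Git host, repo path, and namespace (none of these are RRI's actual values):

```yaml
# Hypothetical ArgoCD Application for one migrated service.
# repoURL, path, and namespace are placeholders, not RRI's real configuration.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-ingestion
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/rri/order-ingestion.git
    targetRevision: main
    path: deploy/k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: order-ingestion
  syncPolicy:
    automated:
      prune: true      # remove cluster resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```

With `automated` sync, ArgoCD provides the drift detection and rollback behavior the plan depends on: the Git repo, not the cluster, is the source of truth.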
Non-negotiable deadline: the Heroku contract ends in September 2026, and missing it triggers a $150K+ contract extension. The migration must start by April 15 to complete on time. Under Scenario A (current team only), there is a MEDIUM probability of missing this deadline.
Claude Code acceleration: K8s manifests, ArgoCD configuration, Helm charts, Cilium network policies, and migration scripts are all highly automatable. Claude Code saves ~4-6 weeks on infrastructure-as-code, bringing 16 weeks down to 10-12 weeks. The human work (testing, validation, zero-downtime cutover planning) cannot be compressed.
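As a sketch of the manifest work described above, a single Heroku web process maps roughly onto a Deployment, with dyno count becoming replica count and Heroku config vars becoming a Secret. Image name, port, resource requests, and the Secret name below are illustrative assumptions:

```yaml
# Illustrative translation of a Heroku dyno formation into a K8s Deployment.
# Image, port, resources, and Secret name are assumptions, not RRI's values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-api
spec:
  replicas: 8                # one replica per former dyno
  selector:
    matchLabels:
      app: event-api
  template:
    metadata:
      labels:
        app: event-api
    spec:
      containers:
        - name: web
          image: registry.example.com/rri/event-api:latest
          ports:
            - containerPort: 3000
          envFrom:
            - secretRef:
                name: event-api-env   # stands in for Heroku config vars
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```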
Migration Scope
| Service | Current (Heroku) | Risk Level | Migration Window |
|---|---|---|---|
| order-ingestion | 12 dynos, BullMQ/Redis | CRITICAL | Week 6-10 |
| event-api | 8 dynos | HIGH | Week 8-12 |
| members-portal | Node 11, Heroku | HIGH | Week 10-13 |
| Supporting services | Various | MEDIUM | Week 12-15 |
| Already on K8s | 3 repos | N/A | Complete |
Target Architecture
| Component | Technology | Purpose |
|---|---|---|
| Operating System | Talos Linux | Immutable, API-driven K8s OS. No SSH, no shell — security by design. |
| GitOps | ArgoCD | Declarative deployments from Git. Automatic sync, drift detection, rollback. |
| Service Mesh | Cilium | eBPF-based networking, network policies, observability. Replaces kube-proxy. |
| Monitoring | Grafana + Prometheus | Existing stack (from D6). Extended with K8s-specific dashboards. |
| Infrastructure | RRI-owned | No vendor lock-in. Full control over compute, networking, storage. |
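Cilium's network policies can lock down the critical services as part of the mesh rollout. A minimal sketch of a CiliumNetworkPolicy restricting ingress to order-ingestion; the `webhook-gateway` label and port are hypothetical placeholders, not real RRI services:

```yaml
# Hypothetical policy: only pods labeled webhook-gateway may reach
# order-ingestion on its service port; other in-cluster traffic is denied.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: order-ingestion-ingress
  namespace: order-ingestion
spec:
  endpointSelector:
    matchLabels:
      app: order-ingestion
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: webhook-gateway   # placeholder label for illustration
      toPorts:
        - ports:
            - port: "3000"
              protocol: TCP
```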
Completion Criteria
- All 27+ repos migrated from Heroku to K8s
- order-ingestion (12 dynos) running on K8s with zero-downtime cutover completed
- event-api (8 dynos) running on K8s
- members-portal running on K8s (Node 22 version from D2)
- ArgoCD GitOps pipeline operational for all services
- Cilium service mesh deployed with network policies
- Heroku contract terminated — no remaining Heroku dependencies
- Infrastructure savings of $96K-$180K/year validated
- Performance benchmarks match or exceed Heroku baselines
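The benchmark criterion above implies an explicit pass/fail check of K8s metrics against recorded Heroku baselines. A minimal sketch, where the metric names, sample values, and 5% tolerance are illustrative assumptions rather than RRI's actual thresholds:

```python
# Illustrative post-migration check: every K8s metric must be within a
# tolerance of its Heroku baseline. Numbers here are made-up examples.

def meets_baseline(heroku: dict, k8s: dict, tolerance: float = 0.05) -> bool:
    """True if each K8s metric is no worse than baseline * (1 + tolerance).
    Lower is better for all metrics used here (latency ms, error rate)."""
    return all(
        k8s[name] <= baseline * (1 + tolerance)
        for name, baseline in heroku.items()
    )

heroku_baseline = {"p95_latency_ms": 180.0, "error_rate": 0.002}
k8s_observed = {"p95_latency_ms": 165.0, "error_rate": 0.002}

print(meets_baseline(heroku_baseline, k8s_observed))  # True
```

In practice the observed values would come from the Grafana + Prometheus stack already listed in the target architecture.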
Initiative Attributes
Related Risks
| ID | Risk | Severity | Probability | Mitigation |
|---|---|---|---|---|
| RF5 | S2 Heroku migration misses September deadline | HIGH | MEDIUM (Scenario A) / LOW (Scenario B) | Must start April 15. New DevOps hire (H3) essential — Zach alone cannot run infra AND execute 16-week migration. Contract extension fallback: $150K+. |
Scenario dependency: Under Scenario A (current team only), S2 is at MEDIUM probability of missing the September deadline because Zach is bus factor 1 and cannot simultaneously run infrastructure and execute a 16-week migration. Scenario B (fully resourced) adds a DevOps hire, reducing the probability to LOW.