Confidential Document

This document is restricted to RRI leadership.

DERISK — Remove What Can Kill You
D5

Redis Single Point of Failure

Status: NOT STARTED · Wave 0-2 · 3-4 weeks

Executive Summary

All BullMQ job queues — including the order-ingestion pipeline that processes every payment — run on a single unmanaged Redis instance on Linode (45.79.132.111:29579). No redundancy. No failover. No monitoring. No circuit breaker. If that single VM goes down, every payment webhook from Stripe, ClickFunnels, Shopify, and CopeCart stops processing.

The fix comes in two phases: a pre-UPW quick fix (Redis Sentinel on Linode for automatic failover) and a post-UPW permanent fix (migration to Upstash Redis Fixed 1GB plan — managed, Heroku-compatible, K8s-portable).

Critical detail: Must use Upstash Fixed plan, NOT Pay-as-You-Go. BullMQ polls Redis aggressively even when idle, making per-command billing unpredictable and potentially expensive.
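
The billing concern can be made concrete with a rough calculation. The sketch below is illustrative only: the idle command rate and the price per 100,000 commands are hypothetical inputs, not measured values or Upstash's published pricing, and should be replaced with real numbers before comparing against the $15/month Fixed plan.

```typescript
// Rough estimate of per-command billing for an idle BullMQ deployment.
// Every rate and price passed in is an assumption supplied by the caller.

const SECONDS_PER_DAY = 86400;

/** Total Redis commands issued in a month at a steady per-second rate. */
function monthlyCommands(commandsPerSecond: number, days: number = 30): number {
  return commandsPerSecond * SECONDS_PER_DAY * days;
}

/** Per-command bill given a (hypothetical) price per 100,000 commands. */
function payAsYouGoCost(commands: number, pricePer100k: number): number {
  return (commands / 100000) * pricePer100k;
}

// Example: queues and workers that together issue ~10 commands per second
// while idle generate 25,920,000 commands in a 30-day month.
const idleCommands = monthlyCommands(10);
```

Because the command count scales with the number of queues and workers rather than with order volume, a per-command bill is hard to forecast, which is the reason given above for preferring the Fixed plan.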

What Needs to Happen

  1. Audit maxmemory-policy TODAY: If it is not set to noeviction, Redis can evict BullMQ keys under memory pressure and jobs disappear silently. March 3.
  2. Deploy Redis Sentinel on Linode: Master + replica + 3 sentinels. Automatic failover in 15-30 seconds. $5-6/month extra. Pre-UPW quick fix, 3-5 days.
  3. Provision Upstash Redis Fixed 1GB: $15/month. Managed Redis with automatic failover. Heroku-compatible add-on, K8s-portable. Week 1 post-UPW.
  4. Configure BullMQ circuit breaker: Set enableOfflineQueue: false in the Redis connection options passed to every BullMQ Queue instance, so enqueue calls fail immediately instead of buffering in memory while Redis is down. The webhook handler then returns 503 to Stripe. Stripe retries failed webhooks for up to 72 hours, so no Stripe orders are lost as long as Redis recovers within that window. Retry windows for ClickFunnels, Shopify, and CopeCart are not covered by this figure and should be confirmed. Week 1.
  5. Migration: parallel queues: Run BullMQ on both Upstash and Linode simultaneously. Cutover via feature flag. Weeks 2-3.
  6. Decommission Linode Redis: After parallel validation period. Week 4.
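
One way the feature-flag cutover in step 5 could work is sketched below, under stated assumptions: workers consume from both Redis backends during the parallel period, and producers enqueue to whichever backend the flag currently names. The names Backend, QueueLike, and makeProducer are hypothetical. QueueLike only mirrors the shape of BullMQ's Queue.add(name, data) so the sketch runs without a Redis connection.

```typescript
// Flag-driven producer routing for a parallel-queue migration (sketch).
// Assumption: workers run against BOTH backends, so jobs already enqueued
// on the old backend still drain after the flag flips.

type Backend = "linode" | "upstash";

/** Minimal structural stand-in for a BullMQ Queue (hypothetical interface). */
interface QueueLike {
  add(name: string, data: unknown): Promise<unknown>;
}

/**
 * Returns an enqueue function that reads the flag on every call, so a flag
 * change takes effect for the next job without redeploying.
 */
function makeProducer(
  queues: Record<Backend, QueueLike>,
  readFlag: () => Backend
): (name: string, data: unknown) => Promise<unknown> {
  return (name, data) => queues[readFlag()].add(name, data);
}
```

Reading the flag per call keeps rollback to a single flag flip. Each job goes to exactly one backend, which avoids processing a payment twice during the parallel period.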

Eliminated options: ElastiCache (VPC-only, Heroku incompatible without VPN), MemoryDB (cluster-mode only, VPC-only, overkill), Redis Cluster (not needed — this is an availability problem, not a scaling problem).

Claude Code acceleration: BullMQ circuit breaker implementation, migration scripts with feature flags, and Redis Sentinel configuration are all ideal for AI-assisted development. Estimated savings: 1-2 weeks from the original 3-4 week timeline.

Completion Criteria

  • maxmemory-policy verified as noeviction on Linode Redis
  • Redis Sentinel deployed on Linode with automatic failover tested (pre-UPW)
  • Upstash Redis Fixed 1GB provisioned and connected
  • BullMQ circuit breaker active: enableOfflineQueue: false set in the Redis connection options of every Queue instance
  • Parallel queue migration validated — both Upstash and Linode processing jobs correctly
  • Linode Redis decommissioned after cutover validation
  • Stripe webhook retry behavior verified: 503 responses trigger 72-hour retry cycle
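
The first criterion can be checked mechanically. The sketch below assumes the reply of the Redis command CONFIG GET maxmemory-policy is available as a flat array of alternating names and values, which is how Redis returns it under the RESP2 protocol. The function names are hypothetical, and fetching the reply (for example through a Redis client's config command) is left out so the example runs without a server.

```typescript
// Verify that a CONFIG GET reply reports the noeviction policy (sketch).

/** Turn a flat [name, value, name, value, ...] reply into a lookup map. */
function parseConfigReply(reply: string[]): Map<string, string> {
  const settings = new Map<string, string>();
  for (let i = 0; i + 1 < reply.length; i += 2) {
    const name = reply[i];
    const value = reply[i + 1];
    if (name !== undefined && value !== undefined) {
      settings.set(name, value);
    }
  }
  return settings;
}

/** True only when the reply explicitly reports maxmemory-policy = noeviction. */
function hasSafeEvictionPolicy(reply: string[]): boolean {
  return parseConfigReply(reply).get("maxmemory-policy") === "noeviction";
}
```

A missing or empty reply is treated as unsafe, so the check fails closed if the server does not expose the setting.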

Initiative Attributes

D5 — Redis Single Point of Failure
Cost: $5-30/month (Sentinel) → $15/month (Upstash post-migration)
Timeline (Original): 3-4 weeks (Pre-UPW: 3-5 days. Full migration: Wave 2.)
Timeline (With Claude Code): 2 weeks (BullMQ circuit breaker + migration scripts)
Owner: Zach Hardesty + Johnny Yarlott + Spork
Dependencies: Soft: D6 (monitoring detects Redis failures faster), D7 (CI gates deployment changes to Redis config)
Unblocks: S2 (K8s migration requires Redis to be managed and portable out of Linode)
Revenue at Risk: Complete payment pipeline halt; every dollar of event revenue during Redis failure
Success Metrics: Upstash Redis in production with automatic failover; BullMQ circuit breaker tested

Tools Required

Tool | Purpose | Cost
Upstash Redis (Fixed 1GB) | Managed Redis for BullMQ: Heroku-compatible, K8s-portable, automatic failover | $15/month
Redis Sentinel | Pre-UPW quick fix: master + replica + 3 sentinels on Linode | $5-6/month
BullMQ | Circuit breaker config: enableOfflineQueue: false | Free (OSS)
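
For the Sentinel phase, application clients connect through the sentinels rather than to a fixed master address. The sketch below builds a connection options object with the sentinels and name fields that ioredis, the Redis client BullMQ uses, accepts for Sentinel connections. The host names and the master name in the usage comment are hypothetical, 26379 is the default Sentinel port, and failoverMajority only illustrates why three sentinels tolerate the loss of one.

```typescript
// Build Sentinel connection options and show the majority arithmetic (sketch).

interface SentinelAddress {
  host: string;
  port: number;
}

interface SentinelConnectionOptions {
  sentinels: SentinelAddress[];
  /** Name of the monitored master group, as configured on the sentinels. */
  name: string;
}

function buildSentinelOptions(
  hosts: string[],
  masterName: string,
  port: number = 26379
): SentinelConnectionOptions {
  return {
    sentinels: hosts.map((host) => ({ host, port })),
    name: masterName,
  };
}

/** Number of sentinels that must agree before a failover is authorized. */
function failoverMajority(sentinelCount: number): number {
  return Math.floor(sentinelCount / 2) + 1;
}

// Hypothetical usage:
// buildSentinelOptions(["sentinel-1", "sentinel-2", "sentinel-3"], "mymaster")
```

With three sentinels the majority is two, so a failover can still proceed when one sentinel is lost along with the master.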

Related Risks

No direct risk register entries. However, Redis failure is implicitly the highest-severity infrastructure risk — it would halt the entire payment pipeline. The circuit breaker (BullMQ enableOfflineQueue: false) combined with Stripe’s 72-hour webhook retry is the architectural mitigation.
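
A minimal sketch of that mitigation follows, assuming the webhook handler's only job is to enqueue the payload and report whether it succeeded. The names handleWebhook, Enqueue, and WebhookResult are hypothetical. In production the enqueue function would wrap a BullMQ Queue.add call on a connection created with enableOfflineQueue: false, so that it rejects promptly when Redis is unreachable instead of buffering the job in process memory.

```typescript
// Circuit-breaker behavior for a payment webhook endpoint (sketch).

interface WebhookResult {
  status: number;
  body: string;
}

/** Stand-in for a function that adds the payload to the order-ingestion queue. */
type Enqueue = (payload: unknown) => Promise<unknown>;

// Connection option the document relies on. It would be merged into the
// connection settings (host, port, credentials) given to each Queue.
const queueConnectionOptions = { enableOfflineQueue: false };

async function handleWebhook(
  payload: unknown,
  enqueue: Enqueue
): Promise<WebhookResult> {
  try {
    await enqueue(payload);
    // Acknowledge only after the job is safely in Redis.
    return { status: 200, body: "queued" };
  } catch {
    // Enqueue failed (for example, Redis is down): answer 503 so the
    // provider treats the delivery as failed and retries it later.
    return { status: 503, body: "queue unavailable" };
  }
}
```

The important design choice is ordering: the 200 is sent only after the enqueue promise resolves, so an acknowledged webhook always corresponds to a job stored in Redis.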