D5. Redis Single Point of Failure — RRI Technical Roadmap

Executive Summary

All BullMQ job queues — including the order-ingestion pipeline that processes every payment — run on a single unmanaged Redis instance on Linode (45.79.132.111:29579). No redundancy. No failover. No monitoring. No circuit breaker. If that single VM goes down, every payment webhook from Stripe, ClickFunnels, Shopify, and CopeCart stops processing.

The fix comes in two phases: a pre-UPW quick fix (Redis Sentinel on Linode for automatic failover) and a post-UPW permanent fix (migration to Upstash Redis Fixed 1GB plan — managed, Heroku-compatible, K8s-portable).

Critical detail: Must use Upstash Fixed plan, NOT Pay-as-You-Go. BullMQ polls Redis aggressively even when idle, making per-command billing unpredictable and potentially expensive.
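
The billing concern can be made concrete with some rough arithmetic. The sketch below compares a steady idle command rate under per-command billing against the $15/month Fixed plan. The pay-as-you-go price and the idle rates are assumptions chosen for illustration, not measurements from this system; the price should be checked against Upstash's current pricing page.

```typescript
// Assumption: pay-as-you-go price in USD per 100,000 commands.
// Illustrative only; confirm against Upstash's current pricing.
const ASSUMED_PAYG_USD_PER_100K = 0.2;

// From this roadmap: Upstash Fixed 1GB plan.
const FIXED_PLAN_USD_PER_MONTH = 15;

const SECONDS_PER_30_DAY_MONTH = 30 * 24 * 60 * 60;

/** Monthly pay-as-you-go cost for a steady average command rate. */
function paygMonthlyCostUsd(
  commandsPerSecond: number,
  usdPer100k: number = ASSUMED_PAYG_USD_PER_100K,
): number {
  const commandsPerMonth = commandsPerSecond * SECONDS_PER_30_DAY_MONTH;
  return (commandsPerMonth / 100000) * usdPer100k;
}

/** Average command rate at which pay-as-you-go equals the fixed plan price. */
function breakEvenCommandsPerSecond(
  fixedUsdPerMonth: number = FIXED_PLAN_USD_PER_MONTH,
  usdPer100k: number = ASSUMED_PAYG_USD_PER_100K,
): number {
  const commandsPerMonth = (fixedUsdPerMonth / usdPer100k) * 100000;
  return commandsPerMonth / SECONDS_PER_30_DAY_MONTH;
}

// Hypothetical idle rates (commands/second summed over all queues and workers).
for (const rate of [1, 5, 20]) {
  console.log(`${rate} cmd/s -> $${paygMonthlyCostUsd(rate).toFixed(2)}/month`);
}
console.log(`break-even: ${breakEvenCommandsPerSecond().toFixed(2)} cmd/s`);
```

Under the assumed price, the break-even point is roughly 2.9 commands per second averaged over the month. Measuring the real idle command rate on the Linode instance (for example from the `total_commands_processed` counter in `INFO stats`) would replace the hypothetical rates with actual numbers.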

What Needs to Happen

Audit maxmemory-policy TODAY: If it is not set to noeviction, Redis can silently evict BullMQ job keys whenever it reaches its memory limit, so jobs may already be disappearing. Due March 3.
Deploy Redis Sentinel on Linode: Master + replica + 3 sentinels. Automatic failover in 15-30 seconds. $5-6/month extra. Pre-UPW quick fix, 3-5 days.
Provision Upstash Redis Fixed 1GB: $15/month. Managed Redis with automatic failover. Heroku-compatible add-on, K8s-portable. Week 1 post-UPW.
Configure BullMQ circuit breaker: Set enableOfflineQueue: false on the Redis (ioredis) connection options passed to every BullMQ Queue instance, so enqueue calls fail immediately when Redis is unreachable instead of buffering in memory. Redis down = 503 response to Stripe. Stripe retries failed webhooks for up to 72 hours, so no Stripe orders are lost as long as Redis recovers within that window. Retry behavior for ClickFunnels, Shopify, and CopeCart should be confirmed separately. Week 1.
Migration, parallel queues: Run BullMQ on both Upstash and Linode simultaneously. Cutover via feature flag. Weeks 2-3.
Decommission Linode Redis: After the parallel validation period. Week 4.
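
The parallel-queue step above could be wired roughly as follows. This is a minimal sketch, assuming a two-field feature flag (`primary`, `shadowWrite`) and a helper named `enqueueDuringMigration`; both are hypothetical, since the roadmap does not specify the flag system. `QueueLike` is a structural stand-in so the example runs without a Redis server; a BullMQ `Queue` should fit the same shape through its `add(name, data)` method.

```typescript
// Structural stand-in for the one BullMQ Queue method used here.
interface QueueLike<T> {
  add(name: string, data: T): Promise<unknown>;
}

type Backend = "linode" | "upstash";

// Hypothetical feature-flag shape; not taken from the roadmap.
interface MigrationFlags {
  primary: Backend;     // backend whose workers are authoritative
  shadowWrite: boolean; // also enqueue on the other backend for validation
}

/** Enqueues on the primary backend and, optionally, on the shadow backend. */
async function enqueueDuringMigration<T>(
  queues: Record<Backend, QueueLike<T>>,
  flags: MigrationFlags,
  jobName: string,
  payload: T,
): Promise<Backend[]> {
  const secondary: Backend = flags.primary === "linode" ? "upstash" : "linode";
  const written: Backend[] = [];

  // A failure on the primary propagates so the caller can reject the webhook.
  await queues[flags.primary].add(jobName, payload);
  written.push(flags.primary);

  if (flags.shadowWrite) {
    try {
      await queues[secondary].add(jobName, payload);
      written.push(secondary);
    } catch (err) {
      // The shadow backend is only being validated: log its failure, do not raise.
      console.warn(`shadow enqueue on ${secondary} failed`, err);
    }
  }
  return written;
}
```

With this shape, cutover is a flag change from `primary: "linode"` to `primary: "upstash"`, and decommissioning corresponds to turning `shadowWrite` off. One design caveat: if workers run against both backends during the shadow period, each order is processed twice, so the shadow-side workers would need to be idempotent or run in a compare-only mode.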

Eliminated options: ElastiCache (VPC-only, Heroku incompatible without VPN), MemoryDB (cluster-mode only, VPC-only, overkill), Redis Cluster (not needed — this is an availability problem, not a scaling problem).

Claude Code acceleration: BullMQ circuit breaker implementation, migration scripts with feature flags, and Redis Sentinel configuration are all ideal for AI-assisted development. Estimated savings: 1-2 weeks from the original 3-4 week timeline.

Completion Criteria

maxmemory-policy verified as noeviction on Linode Redis
Redis Sentinel deployed on Linode with automatic failover tested (pre-UPW)
Upstash Redis Fixed 1GB provisioned and connected
BullMQ circuit breaker active: enableOfflineQueue: false set on the Redis connection of every Queue instance
Parallel queue migration validated — both Upstash and Linode processing jobs correctly
Linode Redis decommissioned after cutover validation
Stripe webhook retry behavior verified: 503 responses trigger 72-hour retry cycle
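
The first criterion above (maxmemory-policy verified as noeviction) can be checked from code as well as from `redis-cli CONFIG GET maxmemory-policy`. The following is a small sketch: `ConfigReader` and `auditEvictionPolicy` are illustrative names, and the interface mirrors the `config("GET", key)` call that ioredis exposes, so an ioredis client is expected to fit it.

```typescript
// Minimal view of a Redis client. For a single parameter, CONFIG GET
// replies with a flat [name, value] array.
interface ConfigReader {
  config(action: "GET", key: string): Promise<unknown>;
}

interface EvictionAudit {
  policy: string;
  safeForBullMQ: boolean;
}

/** Reads maxmemory-policy and reports whether it is the value BullMQ expects. */
async function auditEvictionPolicy(client: ConfigReader): Promise<EvictionAudit> {
  const reply = await client.config("GET", "maxmemory-policy");
  const policy =
    Array.isArray(reply) && typeof reply[1] === "string" ? reply[1] : "unknown";
  // Any policy other than noeviction lets Redis drop keys under memory pressure.
  return { policy, safeForBullMQ: policy === "noeviction" };
}
```

If the audit reports anything other than noeviction, `CONFIG SET maxmemory-policy noeviction` changes it at runtime; the same setting also needs to go into redis.conf so it survives a restart.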

Initiative Attributes

D5 — Redis Single Point of Failure

Cost

$5-6/month (Sentinel) → $15/month (Upstash post-migration)

Timeline (Original)

3-4 weeks (Pre-UPW: 3-5 days. Full migration: Wave 2.)

Timeline (With Claude Code)

2 weeks

⚡ BullMQ circuit breaker + migration scripts

Owner

Zach Hardesty + Johnny Yarlott + Spork

Dependencies

Soft: D6 (monitoring detects Redis failures faster), D7 (CI gates deployment changes to Redis config)

Unblocks

S2 (K8s migration requires Redis to be managed and portable out of Linode)

Revenue at Risk

Complete payment pipeline halt — every dollar of event revenue during Redis failure

Success Metrics

Upstash Redis in production with automatic failover; BullMQ circuit breaker tested

Tools Required

Tool	Purpose	Cost
Upstash Redis (Fixed 1GB)	Managed Redis for BullMQ — Heroku-compatible, K8s-portable, automatic failover	$15/month
Redis Sentinel	Pre-UPW quick fix — master + replica + 3 sentinels on Linode	$5-6/month
BullMQ	Circuit breaker config: `enableOfflineQueue: false` on each Queue's Redis connection	Free (OSS)
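
Once the Sentinel topology listed above is running, application code has to connect through the sentinels rather than to a fixed master address. The sketch below builds connection options in the shape ioredis accepts for Sentinel (`sentinels` plus the master group `name`). The host addresses, the master name, and the helper names are hypothetical.

```typescript
interface SentinelAddress {
  host: string;
  port: number;
}

// Subset of ioredis connection options relevant to Sentinel.
interface SentinelConnectionOptions {
  sentinels: SentinelAddress[];
  name: string;               // master group name from `sentinel monitor <name> ...`
  maxRetriesPerRequest: null; // BullMQ workers expect null here
}

const DEFAULT_SENTINEL_PORT = 26379;

/** Smallest number of sentinels that forms a majority. */
function majority(sentinelCount: number): number {
  return Math.floor(sentinelCount / 2) + 1;
}

/** Builds Sentinel-aware connection options from a list of sentinel hosts. */
function buildSentinelOptions(
  hosts: string[],
  masterName: string,
  port: number = DEFAULT_SENTINEL_PORT,
): SentinelConnectionOptions {
  if (hosts.length < 3) {
    // With fewer than 3, losing one sentinel leaves no majority to run a failover.
    throw new Error("at least 3 sentinels are needed to tolerate the loss of one");
  }
  return {
    sentinels: hosts.map((host) => ({ host, port })),
    name: masterName,
    maxRetriesPerRequest: null,
  };
}

// Hypothetical private addresses and master name.
const sentinelOptions = buildSentinelOptions(
  ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
  "bullmq-master",
);
console.log(
  `failover needs ${majority(sentinelOptions.sentinels.length)} of ` +
    `${sentinelOptions.sentinels.length} sentinels reachable`,
);
```

The resulting object would be passed as the `connection` option of BullMQ Queue and Worker instances. Placing the three sentinels on separate VMs matters more than the exact count: a majority of them must stay reachable for a failover to be authorized.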

Related Risks

No direct risk register entries. However, Redis failure is implicitly the highest-severity infrastructure risk, because it would halt the entire payment pipeline. The architectural mitigation is the circuit breaker (enableOfflineQueue: false on the Redis connection used by BullMQ Queue instances) combined with Stripe’s 72-hour webhook retry window.
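
The mitigation described above amounts to a webhook handler that fails fast and answers 503 when the job cannot be enqueued. A minimal sketch follows, assuming a framework-agnostic handler; `handleWebhook`, `withTimeout`, the queue name `order-ingest`, and the 2-second timeout are illustrative choices, not the existing implementation. `QueueLike` stands in for a BullMQ `Queue` so the example runs without Redis.

```typescript
// Stand-in for a BullMQ Queue. In production the queue would be created with
// something like: new Queue("order-ingest", { connection: { ..., enableOfflineQueue: false } })
// so that add() rejects immediately while Redis is disconnected.
interface QueueLike {
  add(name: string, data: unknown): Promise<unknown>;
}

interface WebhookResult {
  status: number;
  body: string;
}

/** Rejects if `work` has not settled within `ms` milliseconds. */
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

/** Enqueues the event; any enqueue failure becomes a 503 so the sender retries. */
async function handleWebhook(
  queue: QueueLike,
  event: unknown,
  timeoutMs: number = 2000,
): Promise<WebhookResult> {
  try {
    await withTimeout(queue.add("order-ingest", event), timeoutMs);
    return { status: 200, body: "queued" };
  } catch (err) {
    // Never acknowledge an event that was not durably enqueued.
    return { status: 503, body: "queue unavailable" };
  }
}
```

The key design choice is that a 2xx is returned only after the enqueue has succeeded. The timeout is a second line of defense for the case where the connection hangs rather than rejecting outright.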

Confidential Document
