Redis Single Point of Failure
Executive Summary
All BullMQ job queues — including the order-ingestion pipeline that processes every payment — run on a single unmanaged Redis instance on Linode (45.79.132.111:29579). No redundancy. No failover. No monitoring. No circuit breaker. If that single VM goes down, every payment webhook from Stripe, ClickFunnels, Shopify, and CopeCart stops processing.
The fix comes in two phases: a pre-UPW quick fix (Redis Sentinel on Linode for automatic failover) and a post-UPW permanent fix (migration to Upstash Redis Fixed 1GB plan — managed, Heroku-compatible, K8s-portable).
Critical detail: Must use Upstash Fixed plan, NOT Pay-as-You-Go. BullMQ polls Redis aggressively even when idle, making per-command billing unpredictable and potentially expensive.
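A rough way to see why per-command billing is hard to budget is to multiply out the idle traffic. The sketch below is back-of-the-envelope arithmetic only: the process count, the idle command rate, and the per-command price are all placeholder assumptions, not measured values or Upstash's published pricing.

```typescript
// Estimate of idle Redis command volume under per-command billing.
// Every input here is a hypothetical placeholder; measure real traffic before deciding.

interface IdleLoadAssumptions {
  processes: number;             // assumed number of BullMQ processes holding connections
  commandsPerSecondEach: number; // assumed idle command rate per process
  pricePer100kCommands: number;  // assumed pay-as-you-go price in USD
}

const SECONDS_PER_MONTH = 60 * 60 * 24 * 30;

// Total commands issued in a 30-day month while the queues sit idle.
function monthlyIdleCommands(a: IdleLoadAssumptions): number {
  return a.processes * a.commandsPerSecondEach * SECONDS_PER_MONTH;
}

// Cost of that idle traffic at the assumed per-command price.
function monthlyIdleCostUsd(a: IdleLoadAssumptions): number {
  return (monthlyIdleCommands(a) / 100_000) * a.pricePer100kCommands;
}

// Placeholder inputs: 10 processes, 1 command per second each, $0.20 per 100k commands.
const exampleAssumptions: IdleLoadAssumptions = {
  processes: 10,
  commandsPerSecondEach: 1,
  pricePer100kCommands: 0.2,
};
```

With these placeholder inputs, idle traffic alone comes to 25,920,000 commands per month, or roughly $51.84, before a single order is processed. The Fixed plan's flat $15/month removes that variable entirely.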
What Needs to Happen
- Audit `maxmemory-policy` TODAY: If it is not set to `noeviction`, Redis can silently evict BullMQ job keys under memory pressure, and jobs may already be disappearing. March 3.
- Deploy Redis Sentinel on Linode: Master + replica + 3 sentinels. 15-30 second automatic failover. $5-6/month extra. Pre-UPW quick fix, 3-5 days.
- Provision Upstash Redis Fixed 1GB: $15/month. Managed Redis with automatic failover. Heroku-compatible add-on, K8s-portable. Week 1 post-UPW.
- Configure BullMQ circuit breaker: Set `enableOfflineQueue: false` in the Redis connection options of every BullMQ Queue instance. Redis down = 503 response to Stripe. Stripe retries webhooks for 72 hours, so no Stripe orders are lost. Week 1.
- Migrate with parallel queues: Run BullMQ on both Upstash and Linode simultaneously. Cutover via feature flag. Weeks 2-3.
- Decommission Linode Redis: After parallel validation period. Week 4.
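One possible shape for the feature-flag cutover in the migration step is sketched below. It assumes workers consume from both backends during the parallel period while producers write to exactly one, so a payment is never enqueued twice. The `JobSink` interface, the `useUpstash` flag, and the `order-ingestion` job name are hypothetical stand-ins, not names taken from the codebase.

```typescript
// Minimal stand-in for the part of a queue the producer needs.
// A BullMQ Queue's add(name, data) fits this shape.
interface JobSink {
  add(jobName: string, data: unknown): Promise<unknown>;
}

interface Backends {
  linode: JobSink;
  upstash: JobSink;
}

// Hypothetical flag; in practice this would come from whatever flag system is in use.
interface MigrationFlags {
  useUpstash: boolean;
}

// Producers write to exactly one backend, chosen by the flag.
function pickBackend(flags: MigrationFlags): keyof Backends {
  return flags.useUpstash ? "upstash" : "linode";
}

// Enqueue an order on the selected backend and report which one was used,
// so the choice can be logged during the validation period.
async function enqueueOrder(
  backends: Backends,
  flags: MigrationFlags,
  data: unknown,
): Promise<keyof Backends> {
  const target = pickBackend(flags);
  await backends[target].add("order-ingestion", data);
  return target;
}
```

Flipping `useUpstash` back to `false` is the rollback path: new jobs return to Linode, and workers still attached to Upstash drain whatever is already queued there.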
Eliminated options: ElastiCache (VPC-only, Heroku incompatible without VPN), MemoryDB (cluster-mode only, VPC-only, overkill), Redis Cluster (not needed — this is an availability problem, not a scaling problem).
Claude Code acceleration: BullMQ circuit breaker implementation, migration scripts with feature flags, and Redis Sentinel configuration are all ideal for AI-assisted development. Estimated savings: 1-2 weeks from the original 3-4 week timeline.
Completion Criteria
- `maxmemory-policy` verified as `noeviction` on Linode Redis
- Redis Sentinel deployed on Linode with automatic failover tested (pre-UPW)
- Upstash Redis Fixed 1GB provisioned and connected
- BullMQ circuit breaker active: `enableOfflineQueue: false` on all Queue instances
- Parallel queue migration validated: both Upstash and Linode processing jobs correctly
- Linode Redis decommissioned after cutover validation
- Stripe webhook retry behavior verified: 503 responses trigger 72-hour retry cycle
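The first criterion can be checked by running `CONFIG GET maxmemory-policy` against the instance and inspecting the reply. The sketch below assumes the reply arrives as a flat name/value array, which is how Redis answers over RESP2. The function names are illustrative.

```typescript
// `CONFIG GET maxmemory-policy` replies with a flat [name, value] pair,
// for example ["maxmemory-policy", "allkeys-lru"].
function readMaxmemoryPolicy(reply: readonly string[]): string | undefined {
  const index = reply.indexOf("maxmemory-policy");
  return index >= 0 ? reply[index + 1] : undefined;
}

interface AuditResult {
  ok: boolean;
  message: string;
}

// BullMQ needs `noeviction`; any other policy lets Redis drop job keys under memory pressure.
function auditEvictionPolicy(reply: readonly string[]): AuditResult {
  const policy = readMaxmemoryPolicy(reply);
  if (policy === undefined) {
    return { ok: false, message: "maxmemory-policy not present in reply" };
  }
  if (policy !== "noeviction") {
    return { ok: false, message: `maxmemory-policy is ${policy}, expected noeviction` };
  }
  return { ok: true, message: "maxmemory-policy is noeviction" };
}
```

If the audit fails, `CONFIG SET maxmemory-policy noeviction` changes the running instance. The same setting also has to go into `redis.conf`, or it is lost on the next restart.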
Initiative Attributes
Tools Required
| Tool | Purpose | Cost |
|---|---|---|
| Upstash Redis (Fixed 1GB) | Managed Redis for BullMQ — Heroku-compatible, K8s-portable, automatic failover | $15/month |
| Redis Sentinel | Pre-UPW quick fix — master + replica + 3 sentinels on Linode | $5-6/month |
| BullMQ | Circuit breaker config: enableOfflineQueue: false | Free (OSS) |
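
For the Sentinel row, the application connects to the sentinels rather than to a fixed Redis address, and the sentinels report which node is currently master. The sketch below builds connection options using the `sentinels` and `name` fields that ioredis accepts. The hostnames and the `mymaster` group name are placeholders, and the helper functions are illustrative.

```typescript
interface SentinelAddress {
  host: string;
  port: number;
}

// Mirrors the `sentinels` and `name` fields ioredis accepts for Sentinel setups.
interface SentinelConnection {
  sentinels: SentinelAddress[];
  name: string; // master group name as configured on the sentinels
}

const DEFAULT_SENTINEL_PORT = 26379;

function buildSentinelConnection(hosts: string[], masterName: string): SentinelConnection {
  return {
    sentinels: hosts.map((host) => ({ host, port: DEFAULT_SENTINEL_PORT })),
    name: masterName,
  };
}

// A failover has to be authorized by a majority of the Sentinel processes.
function sentinelMajority(sentinelCount: number): number {
  return Math.floor(sentinelCount / 2) + 1;
}

// How many sentinels can be lost while a failover can still be authorized.
function tolerableSentinelFailures(sentinelCount: number): number {
  return sentinelCount - sentinelMajority(sentinelCount);
}

// Placeholder hostnames for the three sentinels.
const sentinelConnection = buildSentinelConnection(
  ["sentinel-1.example.internal", "sentinel-2.example.internal", "sentinel-3.example.internal"],
  "mymaster",
);
```

With 3 sentinels the majority is 2, so the setup tolerates losing one sentinel and can still fail over. This is why three is the usual minimum, and why the sentinels should not all share a VM with the master.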
Related Risks
No direct risk register entries. However, Redis failure is implicitly the highest-severity infrastructure risk — it would halt the entire payment pipeline. The circuit breaker (BullMQ enableOfflineQueue: false) combined with Stripe’s 72-hour webhook retry is the architectural mitigation.
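The mitigation can be sketched as a webhook handler that turns an enqueue failure into a 503. With `enableOfflineQueue: false` in the connection options, ioredis rejects commands immediately while disconnected instead of buffering them, so the enqueue call fails fast. The handler shape, the host, and the response bodies below are assumptions for illustration, not the project's actual code.

```typescript
// Connection options in the shape ioredis accepts; host and port are placeholders.
const queueConnection = {
  host: "redis.example.internal",
  port: 6379,
  enableOfflineQueue: false, // reject commands immediately while disconnected
};

// Injected so the handler can be exercised without a live Redis. In production this
// could wrap a BullMQ queue, e.g. (payload) => queue.add("order-ingestion", payload),
// where queue = new Queue("order-ingestion", { connection: queueConnection }).
type Enqueue = (payload: unknown) => Promise<unknown>;

interface WebhookResult {
  status: 200 | 503;
  body: string;
}

// Acknowledge the webhook only after the job is safely in Redis.
// Any enqueue failure becomes a 503 so the sender retries later.
async function handleWebhook(payload: unknown, enqueue: Enqueue): Promise<WebhookResult> {
  try {
    await enqueue(payload);
    return { status: 200, body: "queued" };
  } catch {
    return { status: 503, body: "queue unavailable, retry later" };
  }
}
```

The key design choice is ordering: the 2xx goes out only after the enqueue succeeds. A handler that acknowledges first and enqueues afterwards would lose the order during an outage, because Stripe would treat the event as delivered and never retry it.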