Confidential Document

This document is restricted to RRI leadership.

DERISK — Remove What Can Kill You
D6

Pre-Event Load Testing & Monitoring Infrastructure

IN PROGRESS (Phase 1) Wave 0-2 · 3 weeks

Executive Summary

RRI runs 9 major revenue events per year with no load testing, no synthetic monitoring, and no real-time operational dashboard. Leadership makes decisions during events on gut and panic because data is 3-4 hours delayed. The Grafana + data lake infrastructure exists (Zach built it) but leadership won’t use it because they don’t trust the data.

The solution combines three tools: k6 OSS for load testing (free, hybrid HTTP + browser VUs, runs on Zach’s K8s), Checkly Team for synthetic monitoring ($64/month, 22 global locations, Playwright-based), and Grafana (existing LGTM stack) for the operational dashboard.

For March 12 (scoped down): (1) Grafana dashboard with live order-ingestion metrics, (2) Cognito auth surge test validating D4, (3) Checkly checkout uptime monitoring from 12 regions. The full k6 suite is deferred to Q2.

What Needs to Happen

  1. Instrument order-ingestion with prometheus-client — Expose /metrics endpoint for Grafana scraping. Week 1, pre-UPW.
  2. Build Grafana event ops dashboard — 3-row layout: Business Metrics (orders/minute, revenue) / Pipeline Health (queue depth, processing time) / Infrastructure (CPU, memory, Redis). Week 1.
  3. Deploy Checkly monitors for checkout flow — Playwright-based synthetic monitoring from 12 global regions. Alerts on failure. Week 1.
  4. Build k6 test scenarios — Four scenarios: checkout-sanity-load (1,500 VUs), order-ingestion-webhook-flood (200 concurrent), cognito-auth-surge (3,000 VUs, validates D4), full-pipeline-e2e-browser (50 VUs). Q2.
  5. Full k6 suite deployed and running pre-event — Automated pre-event load test runs as part of event readiness checklist. Q2.
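
As a rough illustration of step 1, the sketch below shows the kind of payload a /metrics endpoint returns in the Prometheus text exposition format. It is a dependency-free stand-in written for this example, not the actual instrumentation; in practice a client library such as prom-client would generate this output. The metric name `order_ingestion_orders_total`, the `status` label, and the `recordOrder` helper are all hypothetical.

```javascript
// Dependency-free stand-in for a Prometheus client library (illustration only).
// Metric and label names are hypothetical. Label values are not escaped here.
class Counter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.values = new Map(); // serialized label set -> running count
  }

  // Increment the series identified by `labels` (object of label -> value).
  inc(labels = {}, amount = 1) {
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(',');
    this.values.set(key, (this.values.get(key) || 0) + amount);
  }

  // Render this counter in the Prometheus text exposition format.
  render() {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [key, value] of this.values) {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    }
    return lines.join('\n') + '\n';
  }
}

const ordersTotal = new Counter(
  'order_ingestion_orders_total',
  'Orders handled by order-ingestion'
);

// Would be called from the (hypothetical) order handler.
function recordOrder(status) {
  ordersTotal.inc({ status });
}

// Body to return for GET /metrics, served with
// Content-Type "text/plain; version=0.0.4".
function metricsBody() {
  return ordersTotal.render();
}
```

With a counter like this in place, an orders/minute panel could be driven by a PromQL expression along the lines of `rate(order_ingestion_orders_total[1m]) * 60`.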

Critical warning: NEVER load test against production Salesforce (risk of exhausting governor limits) or production Redis (risk of job corruption). All load tests require a staging environment.
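
One way to enforce this in test scripts is a guard that refuses any target not on an explicit staging allowlist, so unknown hosts fail closed. The following is a minimal sketch under that assumption; the hostnames are placeholders, not real RRI endpoints, and `assertStagingTarget` is a hypothetical helper name.

```javascript
// Hostnames are hypothetical placeholders, not real endpoints.
const ALLOWED_STAGING_HOSTS = new Set([
  'staging-checkout.example.com',
  'staging-order-ingestion.example.com',
]);

// Extract the host from an absolute http(s) URL using a regex, so the helper
// does not depend on a URL global being available in the script runtime.
// URLs carrying userinfo ("user:pw@host") are rejected outright.
function hostOf(target) {
  const match = /^https?:\/\/([^\/?#]+)/i.exec(target);
  if (!match || match[1].includes('@')) {
    throw new Error(`Not a plain absolute http(s) URL: ${target}`);
  }
  return match[1].replace(/:\d+$/, '').toLowerCase();
}

// Return the target unchanged if its host is allowlisted; otherwise throw.
function assertStagingTarget(target) {
  const host = hostOf(target);
  if (!ALLOWED_STAGING_HOSTS.has(host)) {
    throw new Error(`Refusing to load test ${host}: not an allowlisted staging host`);
  }
  return target;
}
```

Calling `assertStagingTarget(baseUrl)` once at the top of each script means a misconfigured run aborts before any request is sent. An allowlist is used rather than a blocklist of production hosts because a blocklist silently permits any host nobody thought to list.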

Claude Code acceleration: k6 test scripts, Grafana dashboard JSON definitions, and Checkly monitor configurations are all highly automatable. Claude Code can generate complete test scenarios from API documentation and produce monitoring-as-code configurations. Estimated savings: 1.5 weeks from the original 3-week timeline.
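
To give a sense of what a generated k6 configuration might look like, here is a sketch of a scenarios block for the four scenarios named in step 4. The VU counts come from this document; the executor choices, stage durations, and `exec` function names are assumptions, and "200 concurrent" is interpreted here as 200 constant VUs.

```javascript
// Sketch of a k6 options object. VU counts are from this document;
// executors, durations, and exec function names are assumptions.
function buildK6Options() {
  return {
    scenarios: {
      'checkout-sanity-load': {
        executor: 'ramping-vus',
        startVUs: 0,
        stages: [
          { duration: '5m', target: 1500 },  // ramp up
          { duration: '10m', target: 1500 }, // hold at peak
          { duration: '2m', target: 0 },     // ramp down
        ],
        exec: 'checkoutSanityLoad',
      },
      'order-ingestion-webhook-flood': {
        executor: 'constant-vus',
        vus: 200,
        duration: '5m',
        exec: 'webhookFlood',
      },
      'cognito-auth-surge': {
        executor: 'ramping-vus',
        startVUs: 0,
        stages: [
          { duration: '1m', target: 3000 }, // fast ramp to mimic a login surge
          { duration: '5m', target: 3000 },
        ],
        exec: 'cognitoAuthSurge',
      },
      'full-pipeline-e2e-browser': {
        executor: 'constant-vus',
        vus: 50,
        duration: '10m',
        exec: 'fullPipelineBrowser',
        options: { browser: { type: 'chromium' } }, // browser-based VUs
      },
    },
  };
}

// Peak VUs per scenario, e.g. for sizing the K8s runners.
function peakVUs(options) {
  const peaks = {};
  for (const [name, s] of Object.entries(options.scenarios)) {
    peaks[name] = s.stages
      ? Math.max(...s.stages.map((stage) => stage.target))
      : s.vus;
  }
  return peaks;
}
```

In an actual k6 script this would be wired up as `export const options = buildK6Options();`, with one exported function per `exec` name. Scenarios in a single script start together unless each is given a `startTime`, so the four would likely be staggered or run as separate scripts.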

Completion Criteria

  • Order-ingestion instrumented with prometheus-client and /metrics endpoint live
  • Grafana event ops dashboard deployed with 3-row layout showing live metrics
  • Checkly checkout monitoring active from 12 global regions with alert routing
  • k6 test scenarios covering all 4 critical paths: checkout, webhook flood, auth surge, e2e browser
  • Pre-event load test integrated into event readiness checklist
  • Grafana dashboard live and leadership-accessible before March 12
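
The 3-row dashboard layout in the criteria above (Business Metrics, Pipeline Health, Infrastructure) can be expressed as dashboard JSON and kept in version control. The sketch below builds a trimmed version of that JSON. Row and panel titles follow this document; the dashboard title, the refresh interval, and every metric name and PromQL expression are hypothetical. A real export would carry more fields (datasource, uid, schemaVersion).

```javascript
// Row and panel titles follow this document; all expressions are hypothetical.
const ROWS = [
  {
    title: 'Business Metrics',
    panels: [
      { title: 'Orders / minute', expr: 'sum(rate(order_ingestion_orders_total[1m])) * 60' },
      { title: 'Revenue', expr: 'sum(order_ingestion_revenue_usd_total)' },
    ],
  },
  {
    title: 'Pipeline Health',
    panels: [
      { title: 'Queue depth', expr: 'order_ingestion_queue_depth' },
      {
        title: 'Processing time (p95)',
        expr: 'histogram_quantile(0.95, sum by (le) (rate(order_ingestion_processing_seconds_bucket[5m])))',
      },
    ],
  },
  {
    title: 'Infrastructure',
    panels: [
      { title: 'CPU', expr: 'rate(process_cpu_seconds_total[1m])' },
      { title: 'Memory', expr: 'process_resident_memory_bytes' },
      { title: 'Redis', expr: 'redis_connected_clients' },
    ],
  },
];

// Build a dashboard object on Grafana's 24-column grid:
// each row header is 1 unit tall, each panel 8 units tall.
function buildDashboard() {
  const panels = [];
  let y = 0;
  let id = 1;
  for (const row of ROWS) {
    panels.push({
      id: id++, type: 'row', title: row.title, collapsed: false,
      gridPos: { h: 1, w: 24, x: 0, y }, panels: [],
    });
    y += 1;
    const width = Math.floor(24 / row.panels.length);
    row.panels.forEach((p, i) => {
      panels.push({
        id: id++, type: 'timeseries', title: p.title,
        gridPos: { h: 8, w: width, x: i * width, y },
        targets: [{ refId: 'A', expr: p.expr }],
      });
    });
    y += 8;
  }
  return { title: 'Event Ops', refresh: '10s', panels };
}
```

`JSON.stringify(buildDashboard(), null, 2)` produces a document that could serve as a starting point for dashboard import or provisioning, once a datasource is attached to each panel.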

Initiative Attributes

D6 — Pre-Event Load Testing & Monitoring
Cost: $64-$388/month ongoing + ~$8,125 one-time setup
Timeline (Original): 3 weeks (dashboard Week 1, k6 Week 2, Checkly Week 3)
Timeline (With Claude Code): 1.5 weeks (k6 test scripts + Grafana JSON + Checkly monitors)
Owner: Zach Hardesty + Spork + Johnny Yarlott (order-ingestion metrics)
Dependencies: Hard: D4 (Cognito hardening must be in place before auth surge test). Soft: D5 (Redis HA makes load test results valid)
Unblocks: U9 (k6 validates throughput improvements), S8 (real-time dashboard uses Grafana infrastructure built here)
Revenue at Risk: Unquantified; every event operates blind without this
Success Metrics: Grafana dashboard live before March 12; full k6 suite covering all 4 Heroku apps by Q2

Tools Required

Tool | Purpose | Cost
k6 OSS | Load testing: hybrid HTTP + browser VUs, runs on K8s via k6-operator | Free (OSS)
Checkly Team | Synthetic monitoring: 22 global locations, Playwright-based, monitoring-as-code | $64/month
Grafana (existing) | Dashboards + LGTM observability stack, already deployed by Zach | Already deployed
prometheus-client | Node.js metrics instrumentation for order-ingestion | Free (OSS)

Related Risks

ID | Risk | Severity | Probability | Mitigation
RF7 | Spork overload in Wave 0 (4 initiatives in 9 days) | MEDIUM | HIGH | Kill 6+ daily meetings before March 12. Route status through Kingler. D6 Phase 1 is Zach-led, reducing Spork’s direct involvement.