D6. Pre-Event Load Testing & Monitoring

Executive Summary

RRI runs 9 major revenue events per year with no load testing, no synthetic monitoring, and no real-time operational dashboard. Because event data arrives 3-4 hours late, leadership makes decisions during events on gut feel and panic. The Grafana and data lake infrastructure already exists (Zach built it), but leadership won’t use it because they don’t trust the data.

The solution combines three tools: k6 OSS for load testing (free, hybrid HTTP + browser VUs, runs on Zach’s K8s), Checkly Team for synthetic monitoring ($64/month, 22 global locations, Playwright-based), and Grafana (existing LGTM stack) for the operational dashboard.

For March 12 (scoped down): (1) Grafana dashboard with live order-ingestion metrics, (2) Cognito auth surge test validating D4, (3) Checkly checkout uptime monitoring from 12 regions. The full k6 suite is deferred to Q2.

What Needs to Happen

Instrument order-ingestion with prometheus-client — Expose /metrics endpoint for Grafana scraping. Week 1, pre-UPW.
Build Grafana event ops dashboard — 3-row layout: Business Metrics (orders/minute, revenue) / Pipeline Health (queue depth, processing time) / Infrastructure (CPU, memory, Redis). Week 1.
Deploy Checkly monitors for checkout flow — Playwright-based synthetic monitoring from 12 global regions. Alerts on failure. Week 1.
Build k6 test scenarios — Four scenarios: checkout-sanity-load (1,500 VUs), order-ingestion-webhook-flood (200 concurrent), cognito-auth-surge (3,000 VUs, validates D4), full-pipeline-e2e-browser (50 VUs). Q2.
Full k6 suite deployed and running pre-event — Automated pre-event load test runs as part of event readiness checklist. Q2.
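The four scenarios named above could be expressed as a k6 scenarios object. The following is a minimal sketch, not the final test suite: the VU counts come from the plan, while the executor choices, ramp durations, exec function names, and the reading of "200 concurrent" as 200 VUs are assumptions for illustration. In a real k6 script this object would sit under the exported options, next to the scenario functions it names.

```typescript
// Sketch of k6 scenario definitions for the four planned tests.
// VU counts come from the plan; durations, executors, and exec
// function names are illustrative assumptions.

interface Stage { duration: string; target: number; }

interface Scenario {
  executor: "ramping-vus" | "constant-vus";
  exec: string;                                 // hypothetical scenario function name
  vus?: number;                                 // constant-vus only
  duration?: string;                            // constant-vus only
  startVUs?: number;                            // ramping-vus only
  stages?: Stage[];                             // ramping-vus only
  options?: { browser: { type: "chromium" } };  // needed for browser scenarios
}

// Ramp up to the peak, hold, then ramp down (durations are assumed).
function ramp(peak: number): Stage[] {
  return [
    { duration: "2m", target: peak },
    { duration: "5m", target: peak },
    { duration: "1m", target: 0 },
  ];
}

const scenarios: Record<string, Scenario> = {
  "checkout-sanity-load": {
    executor: "ramping-vus", exec: "checkout", startVUs: 0, stages: ramp(1500),
  },
  "order-ingestion-webhook-flood": {
    executor: "constant-vus", exec: "webhookFlood", vus: 200, duration: "5m",
  },
  "cognito-auth-surge": {
    executor: "ramping-vus", exec: "authSurge", startVUs: 0, stages: ramp(3000),
  },
  "full-pipeline-e2e-browser": {
    executor: "constant-vus", exec: "e2eBrowser", vus: 50, duration: "5m",
    options: { browser: { type: "chromium" } },
  },
};

// Highest VU count a scenario reaches, useful for sizing the K8s runners.
function peakVUs(s: Scenario): number {
  if (s.executor === "constant-vus") return s.vus ?? 0;
  return Math.max(0, ...(s.stages ?? []).map((st) => st.target));
}

for (const [name, s] of Object.entries(scenarios)) {
  console.log(`${name}: peak ${peakVUs(s)} VUs`);
}
```

Keeping the scenarios in one object makes it easy to run a single scenario for the March 12 scope and enable the rest in Q2.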

Critical warning: NEVER load test against production Salesforce (tests would hit governor limits) or production Redis (tests risk job corruption). All load tests require a staging environment.
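One way to enforce this warning is a pre-flight guard that runs before any load test sends traffic, for example in a Node-based wrapper that launches the tests. The sketch below assumes staging hosts can be listed in an allowlist; the hostnames are placeholders, not RRI's real endpoints.

```typescript
// Allowlist of hosts that load tests may target.
// These hostnames are hypothetical placeholders.
const STAGING_HOSTS: ReadonlySet<string> = new Set([
  "staging-api.example.com",
  "staging-checkout.example.com",
]);

// Returns the URL unchanged if it points at an allowlisted staging host;
// throws otherwise, so a misconfigured target fails before any traffic is sent.
function assertStagingTarget(targetUrl: string): string {
  const host = new URL(targetUrl).hostname.toLowerCase();
  if (!STAGING_HOSTS.has(host)) {
    throw new Error(`Refusing to load test non-staging host: ${host}`);
  }
  return targetUrl;
}

// Example: an allowlisted target passes through.
console.log(assertStagingTarget("https://staging-api.example.com/orders"));
```

An allowlist is used rather than a blocklist of production hosts so that any new or unknown endpoint is rejected by default.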

Claude Code acceleration: k6 test scripts, Grafana dashboard JSON definitions, and Checkly monitor configurations are all highly automatable. Claude Code can generate complete test scenarios from API documentation and produce monitoring-as-code configurations. Estimated savings: 1.5 weeks from the original 3-week timeline.

Completion Criteria

Order-ingestion instrumented with prometheus-client and /metrics endpoint live
Grafana event ops dashboard deployed with 3-row layout showing live metrics
Checkly checkout monitoring active from 12 global regions with alert routing
k6 test scenarios covering all 4 critical paths: checkout, webhook flood, auth surge, e2e browser
Pre-event load test integrated into event readiness checklist
Grafana dashboard live and leadership-accessible before March 12
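To make the first criterion concrete, the sketch below shows the kind of plain-text body a live /metrics endpoint returns in the Prometheus text exposition format. It hand-rolls the format purely to show the output shape; the actual instrumentation would use the client library named in the plan. The metric names and values are hypothetical.

```typescript
// Dependency-free illustration of the Prometheus text exposition format.
// Metric names and values are hypothetical; real instrumentation would
// rely on the Prometheus client library rather than this helper.

type MetricType = "counter" | "gauge";

interface Metric {
  name: string;
  help: string;
  type: MetricType;
  value: number;
}

// Render each metric as HELP and TYPE comment lines followed by its sample.
function renderMetrics(metrics: Metric[]): string {
  return metrics
    .map((m) =>
      `# HELP ${m.name} ${m.help}\n` +
      `# TYPE ${m.name} ${m.type}\n` +
      `${m.name} ${m.value}\n`)
    .join("");
}

const body = renderMetrics([
  {
    name: "orders_ingested_total",
    help: "Orders accepted by order-ingestion.",
    type: "counter",
    value: 42,
  },
  {
    name: "order_queue_depth",
    help: "Jobs waiting in the processing queue.",
    type: "gauge",
    value: 7,
  },
]);

console.log(body);
```

With a counter like this in place, a Grafana panel could derive orders per minute with a query such as rate(orders_ingested_total[1m]) * 60, again assuming that metric name.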

Initiative Attributes

D6 — Pre-Event Load Testing & Monitoring

Cost

$64-$388/month ongoing + ~$8,125 one-time setup

Timeline (Original)

3 weeks (dashboard Week 1, k6 Week 2, Checkly Week 3)

Timeline (With Claude Code)

1.5 weeks

⚡ k6 test scripts + Grafana JSON + Checkly monitors

Owner

Zach Hardesty + Spork + Johnny Yarlott (order-ingestion metrics)

Dependencies

Hard: D4 (Cognito hardening must be in place before auth surge test). Soft: D5 (Redis HA makes load test results valid)

Unblocks

U9 (k6 validates throughput improvements), S8 (real-time dashboard uses Grafana infrastructure built here)

Revenue at Risk

Unquantified — every event operates blind without this

Success Metrics

Grafana dashboard live before March 12; full k6 suite covering all 4 Heroku apps by Q2

Tools Required

Tool	Purpose	Cost
k6 OSS	Load testing — hybrid HTTP + browser VUs, runs on K8s via k6-operator	Free (OSS)
Checkly Team	Synthetic monitoring — 22 global locations, Playwright-based, monitoring-as-code	$64/month
Grafana (existing)	Dashboards + LGTM observability stack — already deployed by Zach	Already deployed
prometheus-client (prom-client on npm)	Node.js metrics instrumentation for order-ingestion	Free (OSS)

Related Risks

ID	Risk	Severity	Probability	Mitigation
RF7	Spork overload in Wave 0 (4 initiatives in 9 days)	MEDIUM	HIGH	Kill 6+ daily meetings before March 12. Route status through Kingler. D6 Phase 1 is Zach-led, reducing Spork’s direct involvement.

Confidential Document

Pre-Event Load Testing & Monitoring Infrastructure
