Pre-Event Load Testing & Monitoring Infrastructure
Executive Summary
RRI runs 9 major revenue events per year with no load testing, no synthetic monitoring, and no real-time operational dashboard. Leadership makes decisions during events on gut and panic because data is 3-4 hours delayed. The Grafana + data lake infrastructure exists (Zach built it) but leadership won’t use it because they don’t trust the data.
The solution combines three tools: k6 OSS for load testing (free, hybrid HTTP + browser VUs, runs on Zach’s K8s), Checkly Team for synthetic monitoring ($64/month, 22 global locations, Playwright-based), and Grafana (existing LGTM stack) for the operational dashboard.
For March 12 (scoped down): (1) Grafana dashboard with live order-ingestion metrics, (2) Cognito auth surge test validating D4, (3) Checkly checkout uptime monitoring from 12 regions. Full k6 suite is Q2.
What Needs to Happen
- Instrument order-ingestion with `prometheus-client`: expose a `/metrics` endpoint for Grafana scraping. Week 1, pre-UPW.
- Build Grafana event ops dashboard: 3-row layout with Business Metrics (orders/minute, revenue), Pipeline Health (queue depth, processing time), and Infrastructure (CPU, memory, Redis). Week 1.
- Deploy Checkly monitors for checkout flow: Playwright-based synthetic monitoring from 12 global regions, with alerts on failure. Week 1.
- Build four k6 test scenarios: checkout-sanity-load (1,500 VUs), order-ingestion-webhook-flood (200 concurrent), cognito-auth-surge (3,000 VUs, validates D4), and full-pipeline-e2e-browser (50 VUs). Q2.
- Full k6 suite deployed and running pre-event: automated pre-event load test runs as part of the event readiness checklist. Q2.
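To make the first item above concrete, here is a minimal sketch of what a `/metrics` endpoint returns in the Prometheus text exposition format. It is dependency-free on purpose and does not use any metrics library's API; in the real service the instrumentation library would produce this output. The metric names (`orders_received_total`, `ingestion_queue_depth`) are hypothetical and not taken from the order-ingestion codebase.

```typescript
// Hand-rolled sketch of the Prometheus text exposition format.
// Metric names used below are hypothetical placeholders.

type MetricKind = "counter" | "gauge";

interface Metric {
  name: string;
  help: string;
  kind: MetricKind;
  value: number;
}

class MetricsRegistry {
  private metrics: Metric[] = [];

  register(name: string, help: string, kind: MetricKind): void {
    this.metrics.push({ name, help, kind, value: 0 });
  }

  // Counters only ever increase, so negative increments are rejected.
  inc(name: string, by = 1): void {
    const m = this.lookup(name);
    if (m.kind !== "counter") throw new Error(`${name} is not a counter`);
    if (by < 0) throw new Error("counter increments must be non-negative");
    m.value += by;
  }

  // Gauges can move in either direction (for example, queue depth).
  set(name: string, value: number): void {
    const m = this.lookup(name);
    if (m.kind !== "gauge") throw new Error(`${name} is not a gauge`);
    m.value = value;
  }

  // Render HELP, TYPE, and sample lines in registration order.
  render(): string {
    const lines: string[] = [];
    for (let i = 0; i < this.metrics.length; i++) {
      const m = this.metrics[i];
      lines.push(`# HELP ${m.name} ${m.help}`);
      lines.push(`# TYPE ${m.name} ${m.kind}`);
      lines.push(`${m.name} ${m.value}`);
    }
    return lines.join("\n") + "\n";
  }

  private lookup(name: string): Metric {
    for (let i = 0; i < this.metrics.length; i++) {
      if (this.metrics[i].name === name) return this.metrics[i];
    }
    throw new Error(`unknown metric: ${name}`);
  }
}

// Example usage: one counter and one gauge.
const registry = new MetricsRegistry();
registry.register("orders_received_total", "Total orders received.", "counter");
registry.register("ingestion_queue_depth", "Jobs waiting in the queue.", "gauge");
registry.inc("orders_received_total");
registry.set("ingestion_queue_depth", 7);
const exposition = registry.render();
```

An orders/minute panel would then typically be derived on the dashboard side from the rate of the counter, rather than computed in the service.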
Critical warning: NEVER load test against production Salesforce (governor limits) or production Redis (job corruption). All load tests require a staging environment.
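One way to enforce this warning mechanically is an allowlist check that every test script runs before generating load. The sketch below assumes an allowlist approach (anything not explicitly listed is refused); the hostnames and the function name `assertSafeLoadTestTarget` are hypothetical placeholders, not RRI's actual staging endpoints.

```typescript
// Hypothetical staging hosts; replace with the real staging endpoints.
const ALLOWED_TEST_HOSTS: string[] = [
  "staging-checkout.example.internal",
  "staging-redis.example.internal",
];

// Extract the hostname from a URL-like target such as
// "https://host/path" or "redis://host:6379".
function extractHost(target: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/:?#]+)/i.exec(target);
  if (!match) throw new Error(`cannot parse target: ${target}`);
  return match[1].toLowerCase();
}

// Throws unless the target host is explicitly allowlisted, so an
// unknown or production host fails closed instead of being tested.
function assertSafeLoadTestTarget(target: string): string {
  const host = extractHost(target);
  if (ALLOWED_TEST_HOSTS.indexOf(host) === -1) {
    throw new Error(`refusing to load test non-allowlisted host: ${host}`);
  }
  return host;
}
```

An allowlist is used here rather than a production denylist because a denylist silently permits any production host that nobody remembered to add.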
Claude Code acceleration: k6 test scripts, Grafana dashboard JSON definitions, and Checkly monitor configurations are all highly automatable. Claude Code can generate complete test scenarios from API documentation and produce monitoring-as-code configurations. Estimated savings: 1.5 weeks from the original 3-week timeline.
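As an illustration of the kind of configuration that could be generated, the sketch below expresses the four planned scenarios in the shape of a k6 `options.scenarios` object. The scenario names and VU counts come from this plan; the `ramping-vus` and `constant-vus` executors are standard k6 executors, but the choice of executor per scenario, the stage durations, and the `exec` function names are assumptions. In an actual k6 script the object would be exported as part of `options`, and each `exec` name would refer to an exported test function.

```typescript
interface Stage { duration: string; target: number }

interface RampingScenario {
  executor: "ramping-vus";
  startVUs: number;
  stages: Stage[];
  exec: string;
}

interface ConstantScenario {
  executor: "constant-vus";
  vus: number;
  duration: string;
  exec: string;
  // Browser scenarios in k6 declare the browser type here.
  options?: { browser: { type: "chromium" } };
}

type Scenario = RampingScenario | ConstantScenario;

// Ramp up, hold at peak, ramp down. Durations are placeholder assumptions.
function rampTo(peak: number, exec: string): RampingScenario {
  return {
    executor: "ramping-vus",
    startVUs: 0,
    stages: [
      { duration: "2m", target: peak },
      { duration: "5m", target: peak },
      { duration: "1m", target: 0 },
    ],
    exec,
  };
}

const scenarios: Record<string, Scenario> = {
  "checkout-sanity-load": rampTo(1500, "checkoutSanity"),
  "order-ingestion-webhook-flood": {
    executor: "constant-vus",
    vus: 200,
    duration: "5m",
    exec: "webhookFlood",
  },
  "cognito-auth-surge": rampTo(3000, "cognitoAuthSurge"),
  "full-pipeline-e2e-browser": {
    executor: "constant-vus",
    vus: 50,
    duration: "10m",
    exec: "fullPipelineBrowser",
    options: { browser: { type: "chromium" } },
  },
};

// Peak concurrent VUs for a scenario, useful for capacity planning.
function peakVUs(s: Scenario): number {
  if (s.executor === "constant-vus") return s.vus;
  let peak = s.startVUs;
  for (let i = 0; i < s.stages.length; i++) {
    peak = Math.max(peak, s.stages[i].target);
  }
  return peak;
}
```

Keeping the VU targets in one typed object makes it straightforward to review them against the plan before each event.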
Completion Criteria
- Order-ingestion instrumented with `prometheus-client` and `/metrics` endpoint live
- Grafana event ops dashboard deployed with 3-row layout showing live metrics
- Checkly checkout monitoring active from 12 global regions with alert routing
- k6 test scenarios covering all 4 critical paths: checkout, webhook flood, auth surge, e2e browser
- Pre-event load test integrated into event readiness checklist
- Grafana dashboard live and leadership-accessible before March 12
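The 3-row layout named in the criteria above could be kept as code and rendered into a Grafana dashboard JSON skeleton. The sketch below is one possible shape: the row and panel titles come from this plan, while the dashboard title, panel type, panel heights, and refresh interval are assumptions. Data sources and queries are deliberately left out, since they depend on the existing stack.

```typescript
interface GridPos { h: number; w: number; x: number; y: number }

interface Panel {
  id: number;
  type: string;
  title: string;
  gridPos: GridPos;
}

// Rows and panels as described in the plan.
const LAYOUT = [
  { row: "Business Metrics", panels: ["Orders per minute", "Revenue"] },
  { row: "Pipeline Health", panels: ["Queue depth", "Processing time"] },
  { row: "Infrastructure", panels: ["CPU", "Memory", "Redis"] },
];

// Grafana's grid is 24 columns wide; each row header is 1 unit tall.
function buildDashboard(title: string) {
  const panels: Panel[] = [];
  let y = 0;
  let id = 1;
  for (let i = 0; i < LAYOUT.length; i++) {
    const section = LAYOUT[i];
    panels.push({ id: id++, type: "row", title: section.row, gridPos: { h: 1, w: 24, x: 0, y } });
    y += 1;
    // Split the full width evenly between the panels in this row.
    const width = Math.floor(24 / section.panels.length);
    for (let j = 0; j < section.panels.length; j++) {
      panels.push({
        id: id++,
        type: "timeseries", // assumed panel type
        title: section.panels[j],
        gridPos: { h: 8, w: width, x: j * width, y },
      });
    }
    y += 8; // assumed panel height
  }
  // Refresh interval is an assumption suited to live event monitoring.
  return { title, refresh: "10s", panels };
}

const eventOpsDashboard = buildDashboard("Event Ops");
const dashboardJson = JSON.stringify(eventOpsDashboard, null, 2);
```

The resulting JSON still needs a data source and a query on each panel before it shows live data.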
Initiative Attributes
Tools Required
| Tool | Purpose | Cost |
|---|---|---|
| k6 OSS | Load testing — hybrid HTTP + browser VUs, runs on K8s via k6-operator | Free (OSS) |
| Checkly Team | Synthetic monitoring — 22 global locations, Playwright-based, monitoring-as-code | $64/month |
| Grafana (existing) | Dashboards + LGTM observability stack — already deployed by Zach | Already deployed |
| prometheus-client | Node.js metrics instrumentation for order-ingestion | Free (OSS) |
Related Risks
| ID | Risk | Severity | Probability | Mitigation |
|---|---|---|---|---|
| RF7 | Spork overload in Wave 0 (4 initiatives in 9 days) | MEDIUM | HIGH | Kill 6+ daily meetings before March 12. Route status through Kingler. D6 Phase 1 is Zach-led, reducing Spork’s direct involvement. |