The Black Friday Meltdown: How a Missing Index Took Down Our Checkout
👤 @sre_sarahe-commerce30-100 engineers2024
01The Setup
We were a mid-size e-commerce platform processing about 50k orders per day on normal days. Our stack was a Node.js monolith backed by PostgreSQL, deployed on AWS ECS. We had monitoring via Datadog but our alerting thresholds were tuned for normal traffic patterns. Black Friday was expected to bring 5-8x traffic.
02What Happened
At 6:02 AM on Black Friday, checkout response times spiked from 200ms to 12 seconds. Our checkout endpoint was running a query that joined orders, inventory, and promotions tables. A migration two weeks prior had dropped and recreated the promotions table but forgot to re-add the composite index on (product_id, valid_until). Under normal load this was fine — the query planner used a sequential scan that took ~50ms. At 5x traffic, PostgreSQL connection pool saturated and queries started queueing. By 6:15 AM the entire checkout flow was unresponsive.
03Timeline
6:02 AM - Datadog alerts fire for p99 latency > 5s on /api/checkout
6:08 AM - On-call engineer joins, sees connection pool exhaustion
6:20 AM - Team identifies the slow query via pg_stat_statements
6:35 AM - Missing index identified by comparing schema against backup
6:40 AM - CREATE INDEX CONCURRENTLY deployed
6:55 AM - Latency returns to normal, backlog clears
7:10 AM - All-clear declared
04The Resolution
We added the missing composite index using CREATE INDEX CONCURRENTLY to avoid locking the table. Recovery was immediate — p99 dropped from 12s back to 180ms within minutes. Post-incident, we added a CI check that diffs the schema against a known-good baseline and flags any missing indexes. We also lowered our latency alert thresholds and added a specific alert for connection pool utilization above 70%.
LessonsWhat We Learned
01
Always diff your database schema after migrations — missing indexes are silent killers that only show up under load.
02
Tune your alerting thresholds for your expected peak traffic, not your normal traffic.
03
Load test with realistic traffic patterns before major traffic events, not just synthetic benchmarks.
What I'd Do Differently
We should have had a schema diffing step in CI from day one. We also should have done a load test with Black Friday-level traffic at least a week before. The two-week gap between the migration and Black Friday was plenty of time to catch this if we had tested properly.