Building an On-Call Culture from Scratch at a "Move Fast, Break Things" Startup
@startup_sam · SaaS · < 10 engineers · 2023
01 The Setup
We were a 7-person engineering team at a seed-stage B2B SaaS startup. There was no on-call rotation: when things broke, the CTO would get a text from a customer and scramble to fix it. Deploys happened whenever anyone pushed to main. There were no runbooks, no incident channels, and no post-mortems. The engineering culture was very much "ship fast and figure it out." This worked when we had 5 customers, but we had just signed our 50th and the cracks were showing.
02 What Happened
The tipping point came when we lost a $200k ARR enterprise customer: a database connection leak caused 4 hours of downtime on a Friday night, and nobody noticed until Monday. The CTO called an all-hands and we decided to build an on-call culture. The challenge was doing this without killing the startup speed that got us to 50 customers. We introduced PagerDuty with a simple 1-week rotation among all 7 engineers. We wrote runbooks for the top 10 most common issues. We created an #incidents Slack channel with a lightweight post-mortem template.
03 Timeline
Month 1: Set up PagerDuty, wrote initial runbooks, 1-week on-call rotation
Month 2: Added Datadog APM, created SLOs for core endpoints
Month 3: First blameless post-mortem after a deploy broke auth
Month 4: On-call compensation policy ($500/week stipend)
Month 6: MTTR dropped from 4 hours to 22 minutes
04 The Resolution
Within 6 months, our MTTR went from ~4 hours to 22 minutes. We went from 3-4 customer-reported incidents per month to near zero, because we were catching and fixing issues before customers noticed. The biggest surprise was that shipping speed actually increased: engineers had more confidence in deploys when they knew there was a safety net. We kept the culture lightweight; the post-mortem template is 5 bullet points, not a 10-page document.
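For anyone who wants to track the same number, here is a minimal sketch of how MTTR could be computed from an incident log. It assumes MTTR is measured from detection to resolution (this write-up doesn't say exactly how it was measured), and the function name and the sample timestamps are made up for illustration, not real incident data.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents to average")
    # Start the sum from an empty timedelta so the durations add up cleanly.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (detected_at, resolved_at). Illustrative values only.
incidents = [
    (datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 9, 10)),
    (datetime(2023, 6, 9, 14, 30), datetime(2023, 6, 9, 15, 0)),
]

mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr}")
```

Tracking this month over month is cheap if incident start and end times are already being written down in post-mortems.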
Lessons: What We Learned
01 On-call culture doesn't have to mean slow culture. A safety net actually makes teams ship faster.
02 Start on-call rotations with experienced engineers, then gradually include juniors with shadowing.
03 Compensate on-call fairly from day one; it sets the tone that this work is valued.
What I'd Do Differently
We should have started the on-call rotation with just 3-4 senior engineers instead of all 7. Junior engineers were getting paged for issues they didn't have context to fix, which was stressful and slowed down response times. We fixed this by month 3, but it caused some early frustration.
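A rough sketch of what a seniors-first rotation with shadowing could look like as a schedule. Everything here is an assumption for illustration: the 4 senior / 3 junior split, the placeholder names, the start date, and the `build_rotation` helper itself. In practice the schedule would live in the paging tool rather than a script.

```python
from datetime import date, timedelta


def build_rotation(primaries, shadows, start, weeks):
    """Build a weekly schedule of (week_start, primary, shadow) tuples.

    Primaries carry the pager; shadows follow along without being paged.
    Both lists are cycled round-robin, one week per slot.
    """
    if not primaries:
        raise ValueError("need at least one primary engineer")
    schedule = []
    for week in range(weeks):
        primary = primaries[week % len(primaries)]
        # A shadow is optional; None means nobody is shadowing that week.
        shadow = shadows[week % len(shadows)] if shadows else None
        schedule.append((start + timedelta(weeks=week), primary, shadow))
    return schedule


# Hypothetical team split and names, for illustration only.
seniors = ["senior_a", "senior_b", "senior_c", "senior_d"]
juniors = ["junior_a", "junior_b", "junior_c"]

schedule = build_rotation(seniors, juniors, date(2023, 1, 2), weeks=6)
for week_start, primary, shadow in schedule:
    print(week_start, "primary:", primary, "shadow:", shadow)
```

Because the two lists have different lengths, each junior ends up shadowing different seniors over time, which spreads context around instead of pairing the same two people every cycle.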