Stories

6 stories from the trenches

🏗️ System Design

How We Built a Production-Grade AWS Infrastructure from Scratch in 6 Weeks — as a Team of Two

👤 Swift-Timber-19a · Early-stage startup SaaS · 2026

We were 14 months into building a B2B document intelligence platform for legal teams. Our entire infrastructure was a single $48/mo DigitalOcean VPS — one box, manually SSHed into,...

AWS · Terraform · GitHub Actions · Docker · +4

🦸 Heroic Save

Recovering from Terraform State Corruption 30 Minutes Before a Board Demo

👤 @sre_hero_maya · infrastructure · 2025

We provided a cloud infrastructure management platform. Our own infrastructure was managed by Terraform with state stored in an S3 backend with DynamoDB locking. We had a board dem...

Terraform · AWS · Incident Response · CI/CD · +1

🔄 Culture Change

Building an On-Call Culture from Scratch at a "Move Fast, Break Things" Startup

👤 @startup_sam · SaaS · 2023

We were a 7-person engineering team at a seed-stage B2B SaaS startup. There was no on-call rotation — when things broke, the CTO would get a text from a customer and scramble to fi...

PagerDuty · Datadog · On-Call · Incident Response · +1

😰 Near-Miss

How We Almost Lost Our Production Kubernetes Cluster to a Misconfigured CronJob

👤 @k8s_newbie_kim · fintech · 2024

We ran a 15-node Kubernetes cluster on GKE for our payment processing platform. The team was relatively new to Kubernetes — we had migrated from Heroku 6 months prior. We had basic...

Kubernetes · GCP · Prometheus · Grafana · +2

🚀 Migration

Migrating 200 Microservices from Jenkins to GitHub Actions in 3 Months

👤 @platform_pete · SaaS · 2025

Our platform team managed a Jenkins cluster running over 200 pipelines for our microservices. Jenkins was running on a fleet of 40 EC2 instances, costing us roughly $25k/month in c...

Jenkins · GitHub Actions · CI/CD · AWS · +1

Incident Report

The Black Friday Meltdown: How a Missing Index Took Down Our Checkout

👤 @sre_sarah · e-commerce · 2024

We were a mid-size e-commerce platform processing about 50k orders per day on normal days. Our stack was a Node.js monolith backed by PostgreSQL, deployed on AWS ECS. We had monito...

PostgreSQL · AWS · Datadog · Incident Response · +1