🏗️ System Design

How We Built a Production-Grade AWS Infrastructure from Scratch in 6 Weeks — as a Team of Two

👤 Swift-Timber-19a · Early-stage startup · SaaS · < 10 engineers · 2026

01 · The Setup

We were 14 months into building a B2B document intelligence platform for legal teams. Our entire infrastructure was a single $48/mo DigitalOcean VPS: one box, manually SSHed into, with deployments that were literally git pull && pm2 restart app. No staging environment. No CI pipeline. No backups beyond a weekly manual snapshot someone occasionally remembered to take. No monitoring beyond "a customer emails us."

It had worked fine for our 12 pilot customers, all of whom were small firms that never asked hard questions. Then our first enterprise prospect, a mid-size law firm, came back from their internal security review with a 47-item questionnaire: SOC 2 readiness, VPC isolation, audit logging, encryption at rest and in transit, data residency guarantees. We had none of it. Not even close. We had 6 weeks until their contract deadline, or they'd walk.

02 · What Happened

We made the call on a Monday morning: we weren't going to patch the DigitalOcean setup. We were going to build the right thing from scratch, in 6 weeks, with two engineers. I was the only one who had touched AWS infrastructure seriously before. Our backend dev agreed to pair with me half-time. Everyone else kept shipping product. No DevOps hire, no consultants, no extensions from the client. The decision itself was the incident: we chose to bet the contract on a complete infrastructure rebuild under a hard deadline, while still running a live product with paying customers on the old box.

Week 1 hit us fast. We set up a multi-account AWS Organizations structure (management, production, and staging accounts), with all Terraform state stored remotely in S3 and locked via DynamoDB. On day 3, before locking was fully wired up, we hit our first state collision. We caught it and fixed it, but it was a preview of how unforgiving the pace would be.

VPC design took two full days of debate. Single VPC per account: three public subnets, three private, three isolated for RDS, spread across three AZs, with a NAT Gateway in each AZ. More expensive than we wanted, but a single-AZ setup would have failed the security review on the spot.

By week 3 we had ECS Fargate running, ALB and ACM wired up, Route 53 aliases resolving, and a GitHub Actions pipeline doing lint → test → build → push to ECR → rolling ECS deploy, with OIDC federation for GitHub so we had zero long-lived IAM keys anywhere.

Week 4 was brutal: pure security posture work. CloudTrail to S3 with log integrity validation, AWS Config rules, GuardDuty across all accounts, Security Hub aggregating findings into the management account. We wrote a Python script to pipe Security Hub findings into a Notion dashboard so our CEO could track compliance posture without touching AWS.

Week 5 we wired up CloudWatch Container Insights, centralized log groups with error-rate metric filters, and a PagerDuty integration for critical alarms. First real runbook. First real on-call rotation: two people, but it existed.

Week 6 we did a full dry run of the 47-item questionnaire. We answered 44. The three remaining items weren't technical problems; they were missing policy documents: data retention, incident response, and access control. Our CEO and I wrote all three over a weekend. Sketches of a few of the core Terraform pieces follow.
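For a sense of scale, the remote-state setup that would have prevented the day-3 collision is tiny. A minimal sketch of the kind of backend block we used; the bucket and table names here are placeholders, not our real ones:

```hcl
# Remote Terraform state in S3, locked via DynamoDB: a minimal sketch.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"    # placeholder bucket name
    key            = "production/network.tfstate" # one key per stack
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "example-terraform-locks"    # lock table prevents concurrent applies
  }
}
```

The lock table only needs a string partition key named LockID; until it exists and is wired up, two concurrent applies can collide exactly the way ours did on day 3.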
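The three-tier VPC layout is also compact in Terraform. A simplified sketch, with illustrative CIDRs, AZ names, and resource names (the real modules parameterize all of this):

```hcl
# Three-tier subnet layout across three AZs: a simplified sketch.
locals {
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Public tier: the ALB and the per-AZ NAT Gateways live here.
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone       = local.azs[count.index]
  map_public_ip_on_launch = true
}

# Private tier: ECS tasks; egress only through the NAT in their AZ.
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index + 3)
  availability_zone = local.azs[count.index]
}

# Isolated tier: RDS only; no route to the internet at all.
resource "aws_subnet" "isolated" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index + 6)
  availability_zone = local.azs[count.index]
}

# One NAT Gateway per AZ: pricier, but an AZ outage doesn't take
# down egress for the surviving AZs.
resource "aws_eip" "nat" {
  count  = 3
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}
```

The isolated subnets deliberately get no route to a NAT Gateway or Internet Gateway, so the database tier can't reach out even if a security group is misconfigured.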
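And the piece that eliminated long-lived IAM keys: GitHub Actions authenticates via OIDC and assumes a short-lived role. A sketch, with a hypothetical org/repo and role name:

```hcl
# GitHub Actions OIDC federation: a sketch. No long-lived IAM keys;
# the workflow exchanges a short-lived web identity token for a role.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  # GitHub's published CA thumbprint at the time; verify before use.
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
}

data "aws_iam_policy_document" "github_assume" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      # Placeholder repo; scopes trust to one repo's main branch.
      values   = ["repo:example-org/example-repo:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "deploy" {
  name               = "github-actions-deploy" # placeholder role name
  assume_role_policy = data.aws_iam_policy_document.github_assume.json
}
```

The sub condition is what does the real work: only workflows running from that repo's main branch can assume the deploy role.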

03 · Timeline

Week 1: Multi-account AWS Organizations setup. Remote Terraform state in S3 + DynamoDB. Account structure locked. No console clicking after day 2.

Week 2: VPC design finalized. Three-tier subnet layout across 3 AZs. NAT Gateways. Security Groups locked down to least privilege from day one.

Week 3: ECS Fargate cluster, ALB, ACM certs, Route 53 alias records. GitHub Actions CI/CD pipeline with OIDC federation; no long-lived IAM keys.

Week 4: Security posture sprint: CloudTrail, Config, GuardDuty, Security Hub aggregation. EBS + RDS encryption enforced. Org-wide SCP enforcing S3 Block Public Access (sketched after this timeline).

Week 5: Observability: CloudWatch Container Insights, centralized log groups, PagerDuty on-call. First real runbook written. Disaster recovery test (simulated AZ failure). Alerting sketch below.

Week 6: Security questionnaire dry run. Three missing policy docs written. Final review with the law firm's security team. Contract signed.
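The week-4 S3 guardrail is worth showing, because it was one of the cheapest wins of the whole build. A sketch of an SCP that stops member accounts from loosening account-level S3 Block Public Access; the policy name and root ID are placeholders:

```hcl
# Org-wide guardrail: deny changes to account-level S3 Block Public
# Access in member accounts. A sketch; attach to the org root or an OU.
resource "aws_organizations_policy" "deny_s3_public" {
  name = "deny-s3-public-access-changes" # placeholder name
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyS3AccountPublicAccessBlockChanges"
      Effect   = "Deny"
      Action   = "s3:PutAccountPublicAccessBlock"
      Resource = "*"
    }]
  })
}

resource "aws_organizations_policy_attachment" "root" {
  policy_id = aws_organizations_policy.deny_s3_public.id
  target_id = "r-examplerootid" # placeholder org root ID
}
```

With this in place, even an account admin can't flip the account-wide public-access block off, which is exactly the kind of answer a security questionnaire wants to hear.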
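The week-5 alerting path was similarly plain: a metric filter counts error lines in the centralized log group, and an alarm pushes to an SNS topic wired to PagerDuty. A sketch with illustrative names, patterns, and thresholds:

```hcl
# Error-rate alerting: a sketch. Log group name, filter pattern,
# and threshold are illustrative, not our production values.
resource "aws_sns_topic" "pagerduty" {
  name = "critical-alarms" # PagerDuty subscribes to this topic
}

resource "aws_cloudwatch_log_metric_filter" "app_errors" {
  name           = "app-error-count"
  log_group_name = "/ecs/app" # placeholder centralized log group
  pattern        = "ERROR"

  metric_transformation {
    name      = "AppErrorCount"
    namespace = "App/Logs"
    value     = "1" # each matching log line counts as one
  }
}

resource "aws_cloudwatch_metric_alarm" "error_rate" {
  alarm_name          = "app-error-rate-high"
  namespace           = "App/Logs"
  metric_name         = "AppErrorCount"
  statistic           = "Sum"
  period              = 300 # 5-minute windows
  evaluation_periods  = 1
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no logs != an incident
  alarm_actions       = [aws_sns_topic.pagerduty.arn]
}
```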

04 · The Resolution

We closed the contract. The law firm's security team did their final review call on a Friday afternoon. We walked them through the multi-account structure, showed them the Security Hub dashboard, demonstrated CloudTrail log integrity, and pulled up the GitHub Actions pipeline running live. They asked about our data retention policy. We had it. Incident response plan? We had it. Encryption at rest for RDS and EBS: confirmed. They unmuted and said most startups they evaluate are still running single-account setups with root user access keys. We were not most startups anymore.

The contract was signed the following Monday. The deal was worth 14× our new monthly AWS bill of $1,100. The old DigitalOcean droplet stayed running in parallel for two more weeks while we migrated the existing pilot customers over, then we shut it down.

The Terraform codebase landed at roughly 3,400 lines across 12 modules. Six months later we sanitized it, stripped out anything company-specific, and open-sourced the VPC and ECS modules. They've been starred a few hundred times. Someone filed an issue asking if we'd add Kubernetes support. We said no.

The three policy documents our CEO and I wrote over that weekend have been pulled into every enterprise deal since. Copy-pasted, lightly edited, reused. A weekend of work that has quietly closed a lot of rooms. The backend dev who paired with me half-time for six weeks is now our de facto platform engineer. Never had the title. Never asked for it. Just never stopped.

Lessons · What We Learned

01

Start multi-account from day one, even if it feels like overkill. Retrofitting account boundaries later is far more painful than building them in at zero revenue.
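For what that looks like in practice: the skeleton of a multi-account setup is small enough to write on day one. A minimal sketch, assuming Terraform and placeholder account emails; real setups layer OUs and SCPs on top:

```hcl
# Minimal multi-account skeleton: a sketch, not our production config.
resource "aws_organizations_organization" "org" {
  feature_set = "ALL" # required for SCPs and org-wide controls
}

resource "aws_organizations_account" "production" {
  name  = "production"
  email = "aws-production@example.com" # placeholder root email
}

resource "aws_organizations_account" "staging" {
  name  = "staging"
  email = "aws-staging@example.com" # placeholder root email
}
```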

02

A compliance deadline is one of the best forcing functions for actually building infrastructure properly. The urgency killed all our bad habits around "we'll fix it later."