Recovering from Terraform State Corruption 30 Minutes Before a Board Demo
👤 @sre_hero_maya · infrastructure · 10-30 engineers · 2025
01 The Setup
We provided a cloud infrastructure management platform. Our own infrastructure was managed by Terraform, with state stored in an S3 backend and DynamoDB for locking. We had a board demo scheduled for 2 PM to show off our new multi-region deployment feature. At 1:30 PM, an engineer ran terraform apply on the production workspace to deploy a last-minute UI fix. Another engineer, not realizing there was an active apply, force-unlocked the state and ran their own apply for a different change.
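For reference, a backend setup like ours looks roughly like the block below (bucket and table names are invented for illustration). The dynamodb_table entry is what provides the locking that was later force-removed:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"    # hypothetical state bucket
    key            = "prod/terraform.tfstate"  # production workspace state
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"         # hypothetical lock table; guards concurrent applies
    encrypt        = true
  }
}
```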
02 What Happened
The concurrent applies resulted in a corrupted Terraform state file. Resources in the state no longer matched reality — some resources were duplicated, others were missing. Running terraform plan showed it wanted to destroy and recreate 47 production resources including our RDS database and ECS services. If anyone had run terraform apply at that point, it would have taken down the entire production environment 30 minutes before the board demo. Our lead SRE immediately recognized the danger and told everyone to stop touching Terraform.
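With DynamoDB locking in place, a second apply fails fast rather than corrupting anything; the damage came from overriding that guard. A sketch of what Engineer B likely saw and ran (lock ID invented, error text abridged):

```bash
# The second apply is rejected while Engineer A's apply holds the lock:
#   Error: Error acquiring the state lock
#   Lock Info:
#     ID:        6f3c0000-0000-0000-0000-000000000000
#     Operation: OperationTypeApply
#     Who:       engineer-a@laptop
#
# force-unlock discards the lock WITHOUT stopping the apply that holds it:
terraform force-unlock -force 6f3c0000-0000-0000-0000-000000000000
```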
03 Timeline
1:30 PM - Engineer A starts terraform apply
1:33 PM - Engineer B force-unlocks state and starts conflicting apply
1:35 PM - Both applies error out, state is corrupted
1:36 PM - Lead SRE sees Slack messages, declares a Terraform freeze
1:40 PM - SRE pulls last known good state from S3 versioning
1:48 PM - State manually reconciled using terraform import for 3 drifted resources
1:55 PM - terraform plan shows no changes — state matches reality
1:58 PM - Board demo starts on time, nobody knows what happened
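In shell terms, the 1:40 PM restore is a minimal sketch like the following, reusing the hypothetical bucket and key names from above (the version ID is a placeholder):

```bash
# Find the last state version written before the 1:30 PM applies
aws s3api list-object-versions \
  --bucket acme-terraform-state \
  --prefix prod/terraform.tfstate \
  --query 'Versions[].{Id: VersionId, When: LastModified}'

# Download that specific version
aws s3api get-object \
  --bucket acme-terraform-state \
  --key prod/terraform.tfstate \
  --version-id PLACEHOLDER_VERSION_ID \
  recovered.tfstate

# Overwrite the corrupted remote state; -force is needed because the
# restored snapshot has an older serial than the corrupted one
terraform state push -force recovered.tfstate
```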
04 The Resolution
S3 versioning saved us. The lead SRE restored the last known good state file from S3 version history, then used terraform import to reconcile the 3 resources that had actually been created by the partial applies. The entire recovery took 18 minutes. After this incident, we implemented mandatory CI/CD for all Terraform changes (no more local applies), added a Slack bot that announces active Terraform operations, and enabled state file snapshots before every apply.
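The reconciliation pass, sketched with illustrative resource addresses and IDs (stand-ins, not the actual three resources), plus the snapshot habit we adopted afterward:

```bash
# Adopt resources the partial applies actually created but that the
# restored (older) state doesn't know about
terraform import aws_security_group.ui_fix sg-0123456789abcdef0
terraform import aws_ecs_service.ui_fix prod-cluster/ui-fix   # format: cluster-name/service-name
terraform import aws_iam_role.ui_fix ui-fix-task-role         # IAM roles import by name

# Verify: exit code 0 means no changes, i.e. state matches reality
terraform plan -detailed-exitcode

# Post-incident habit: snapshot state before every apply
terraform state pull > "backups/state-$(date +%Y%m%dT%H%M%S).tfstate"
```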
Lessons: What We Learned
01 Never allow local Terraform applies against production — route all changes through CI/CD.
02 Enable S3 versioning on your Terraform state bucket — it is the ultimate safety net for state corruption. (A one-liner to enable it follows this list.)
03 Force-unlock should require peer approval — if the state is locked, there is usually a good reason.
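Lesson 02 in practice is a single AWS CLI call (bucket name illustrative). Note that it only protects writes made after it's enabled, so it has to be on before you need it:

```bash
aws s3api put-bucket-versioning \
  --bucket acme-terraform-state \
  --versioning-configuration Status=Enabled

# Confirm it took effect
aws s3api get-bucket-versioning --bucket acme-terraform-state
```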
What I'd Do Differently
We never should have allowed local terraform apply against production. This should have been a CI/CD-only operation from the start. The force-unlock command should also require approval from a second engineer.
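Terraform has no built-in approval gate for force-unlock, but a thin wrapper can enforce the two-person rule as a team convention. A hypothetical sketch, not something we ran at the time:

```bash
#!/usr/bin/env bash
# unlock-with-approval.sh: refuse to force-unlock unless a second
# engineer has approved. Purely illustrative; the approval mechanism
# (an env var) is a stand-in for whatever your team actually uses.
set -euo pipefail

LOCK_ID="${1:?usage: unlock-with-approval.sh <lock-id>}"
APPROVER="${UNLOCK_APPROVED_BY:-}"   # the second engineer sets this
RUNNER="$(whoami)"

if [[ -z "$APPROVER" || "$APPROVER" == "$RUNNER" ]]; then
  echo "Refusing: UNLOCK_APPROVED_BY must be set by someone other than $RUNNER" >&2
  exit 1
fi

echo "force-unlock of $LOCK_ID: run by $RUNNER, approved by $APPROVER"
terraform force-unlock -force "$LOCK_ID"
```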