How We Almost Lost Our Production Kubernetes Cluster to a Misconfigured CronJob
👤 @k8s_newbie_kim · fintech · 10-30 engineers · 2024
01 The Setup
We ran a 15-node Kubernetes cluster on GKE for our payment processing platform. The team was relatively new to Kubernetes — we had migrated from Heroku 6 months prior. We had basic monitoring with Prometheus and Grafana but our alerting coverage had gaps, especially around resource utilization at the node level. A developer had set up a CronJob to clean up stale payment session data from Redis.
02 What Happened
The CronJob was configured to run every minute instead of every hour (*/1 * * * * vs 0 * * * *). Each job spawned a pod that pulled a 2GB Docker image (the dev had used the full application image instead of a slim utility image). The CronJob had no concurrencyPolicy set, so Kubernetes was spawning a new pod every minute without waiting for the previous one to finish. Over a weekend, this created hundreds of pending pods that consumed all schedulable resources. By Monday morning, new deployments couldn't schedule pods and existing pods were getting OOMKilled from node memory pressure.
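A minimal sketch of what the misconfigured manifest likely looked like (the manifest, name, and image are not shown in the post; everything here is illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-session-cleanup     # hypothetical name
spec:
  schedule: "*/1 * * * *"         # BUG: every minute; intended "0 * * * *" (hourly)
  # concurrencyPolicy is unset, so it defaults to Allow: a new pod is
  # spawned each minute even while the previous run is still pulling
  # the image or still executing.
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: registry.example.com/payments-app:latest  # full 2GB app image
              command: ["python", "cleanup_sessions.py"]       # illustrative
              # no resources.requests/limits: nothing stops the pile-up
          restartPolicy: OnFailure
```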
03 The Resolution
An engineer noticed the issue during Monday morning standup when a deploy was stuck in Pending. We identified 847 completed and pending CronJob pods consuming cluster resources. We deleted the CronJob, cleaned up the completed pods, and the cluster recovered within 20 minutes. We then rewrote the CronJob with concurrencyPolicy: Forbid, proper resource limits, a slim Alpine-based image, and the correct schedule. We added alerts for unusual pod counts and node resource pressure.
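The rewritten CronJob could look roughly like this; the image name, history limits, deadline, and resource values are illustrative, not the team's actual settings:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-session-cleanup
spec:
  schedule: "0 * * * *"            # hourly, at minute 0 (human-readable comment per lesson 03)
  concurrencyPolicy: Forbid        # skip a run if the previous job is still active
  successfulJobsHistoryLimit: 3    # keep completed pods from accumulating
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # kill runs that hang instead of letting them stack up
      template:
        spec:
          containers:
            - name: cleanup
              image: registry.example.com/session-cleanup:1.0  # slim Alpine-based utility image
              resources:
                requests:
                  cpu: 50m
                  memory: 64Mi
                limits:
                  cpu: 200m
                  memory: 128Mi
          restartPolicy: OnFailure
```

With concurrencyPolicy: Forbid, a slow or stuck run simply causes the next scheduled run to be skipped rather than piling new pods onto the cluster.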
Lessons: What We Learned
01 Always set concurrencyPolicy on CronJobs — the default (Allow) can cause runaway pod creation.
02 Use ResourceQuotas and LimitRanges as guardrails, especially when your team is new to Kubernetes.
03 CronJob schedule syntax is easy to get wrong — add a comment in the manifest with the human-readable schedule.
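As a sketch of the second lesson, a per-namespace ResourceQuota and LimitRange might look like this (namespace and values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments        # hypothetical namespace
spec:
  hard:
    pods: "50"               # would have capped a runaway like the 847-pod pile-up
    requests.cpu: "20"
    requests.memory: 40Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: payments-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      default:               # applied when a container declares no limits
        cpu: 200m
        memory: 256Mi
      defaultRequest:        # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
```

A LimitRange also ensures that workloads which forget to set resources, like the original cleanup pod, still get sane defaults instead of unbounded consumption.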
What I'd Do Differently
We should have had a policy requiring code review for any Kubernetes manifests, not just application code. We also should have set default ResourceQuotas per namespace and LimitRanges to prevent any single workload from consuming unbounded resources.