A war story about infrastructure hubris, a CFO's question nobody could answer, and the two weeks that changed how I think about complexity forever.
The AWS bill arrived on a Tuesday.
$14,850.32.
I stared at it long enough for the coffee to go cold. Our engineering team's monthly salary was $18,000. Three engineers. 10,000 users. A CRUD API that barely broke a sweat.
We were spending almost as much on infrastructure as on the people building it.
Somewhere in the last six months, we had convinced ourselves we needed Kubernetes. Not because the data said so. Because we were afraid of looking like we didn't know what we were doing.
We were cosplaying as Google. And we were paying for it.
The Kubernetes Kool-Aid
Six months earlier, we were scaling fast. A thousand users became ten thousand in three months. Our monolith started showing cracks — slow deploys, occasional timeouts, the usual growing pains.
Our senior engineer had just left Google. "Kubernetes is the industry standard," he said. "If we don't migrate now, we'll have to do it later under pressure."
I didn't push back. Not because I agreed — but because I was afraid. Afraid of sounding inexperienced. Afraid of being the CTO who under-built and slowed us down. Afraid that saying no to someone who'd just left Google meant I didn't belong in the same room.
That fear cost us $14,850 a month. And worse — it cost us clarity.
We spent two months migrating. EKS cluster, three availability zones, twelve EC2 nodes, four load balancers, Datadog for monitoring. The works.
It felt professional. Enterprise-grade. Like we'd finally made it.
We hadn't made it. We'd made a mess.
Why Ego Wins. Every Time.
This isn't a story about Kubernetes being bad. It's about what happens when fear and status pressure make infrastructure decisions instead of data.
The pattern repeats across every startup I've talked to since: team hits 10x growth, someone with big-company experience joins, and the implicit message is clear — real companies use this stack. Nobody says it out loud. Nobody needs to.
The person advocating it sounds credible. The downside of under-building feels catastrophic. And the cost of over-building feels abstract — until the bill arrives.
What nobody admits: we weren't adopting Kubernetes because our system required it. We adopted it because it signaled seriousness. To investors. To recruits. To ourselves.
The counterintuitive truth: Over-engineering is a form of technical debt. It just looks like ambition.
The CFO's Question
Two months in, our CFO pulled me aside after the monthly review.
"Walk me through this AWS bill. What's a NAT Gateway? Why do we have three of them?"
I couldn't give her a good answer. Not because I didn't understand NAT Gateways — but because I couldn't justify why a 10,000-user startup needed three of them.
That night I actually audited the cluster, line by line:
We had twelve EC2 nodes. We needed three. We had four load balancers. We needed one. Our RDS instance was a db.r5.2xlarge provisioned for a traffic load we'd never seen. A db.t3.medium would have handled everything fine.
We were running Datadog's Kubernetes integration — $1,200 a month — to monitor pods that sat idle 85% of the time.
Here's the important clarification: this wasn't Kubernetes failing. This was us failing to right-size anything because we'd never questioned the original setup. The tool wasn't the problem. The unexamined assumptions were.
We were using roughly 15% of what we were paying for. I gave myself two weeks to find a way out.
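If you want to run the same gut check on your own fleet, the utilization numbers are one CloudWatch query away. A minimal Python sketch with boto3 that prints two weeks of average CPU per running instance; treat the window, filters, and period as illustrative defaults, not exactly what we ran:

# audit_utilization.py: average CPU across running EC2 instances (boto3 sketch)
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)  # two weeks of history

reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        stats = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
            StartTime=start,
            EndTime=end,
            Period=86400,            # one datapoint per day
            Statistics=["Average"],
        )
        points = stats["Datapoints"]
        avg = sum(p["Average"] for p in points) / len(points) if points else 0.0
        print(f'{inst["InstanceId"]} ({inst["InstanceType"]}): {avg:.1f}% avg CPU')

If the averages come back in the teens, you already know how the rest of this story goes.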
Two Weeks to Rip Out Six Months of Confidence
I looked at GKE Autopilot, ECS, and a few PaaS options. What drew me to Fly.io wasn't the pricing — it was a 20-minute proof of concept that actually worked without a Slack thread of tribal knowledge to execute it.
Day one: I deployed our API to Fly.io staging. The entire infrastructure configuration:
# fly.toml — our entire infrastructure config
app = "our-api"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 250
    soft_limit = 200

[autoscaling]
  min_machines = 2
  max_machines = 10

Twenty lines. Replaced two thousand lines of YAML across forty-seven files.
The load test results were not what I expected:
——————————————————————————————————
Metric           Kubernetes    Fly.io
——————————————————————————————————
P50 latency      45ms          38ms
P99 latency      320ms         180ms
Deploy time      8 min         45 sec
Config lines     2,000+        ~20
——————————————————————————————————

Fly.io was faster. The global Anycast routing meant requests hit the nearest edge location instead of routing everything through us-east-1.
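For transparency on where those percentiles come from: identical load runs against both stacks. We used a real load-testing tool, but the shape of the measurement fits in a few lines of Python. This sketch is sequential, so it measures latency rather than capacity; the URL and request count are placeholders:

# load_probe.py: crude sequential latency probe (a real test runs concurrently)
import time
import urllib.request

URL = "https://staging.example.com/healthz"  # placeholder endpoint
N = 500

latencies = []
for _ in range(N):
    t0 = time.perf_counter()
    urllib.request.urlopen(URL, timeout=10).read()
    latencies.append((time.perf_counter() - t0) * 1000)  # milliseconds

latencies.sort()
print(f"p50={latencies[int(0.50 * N)]:.0f}ms  p99={latencies[int(0.99 * N)]:.0f}ms")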
The Gotchas (Where It Actually Got Hard)
Nothing was frictionless. Three things required real rethinking — and I'm including this because every migration post that skips the friction is lying to you.
Persistent storage behaved differently than Kubernetes PVCs. We migrated to Neon Postgres — managed, serverless, cheaper. It took two days to get connection pooling right and one near-incident when I misconfigured the replica settings at 11 PM.
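Our application code isn't Python, but the pooling shape that finally held up translates to any client library. A minimal sketch with psycopg_pool against a Neon connection string; the pool sizes are illustrative, not prescriptive, and fetch_user plus the users table are made up for the example:

# db.py: client-side pooling against Neon with psycopg 3 (sizes illustrative)
import os
from psycopg_pool import ConnectionPool

# Neon also offers a server-side pooled endpoint (PgBouncer); we still cap
# client connections so a traffic burst can't exhaust the Postgres limit.
pool = ConnectionPool(
    conninfo=os.environ["DATABASE_URL"],
    min_size=2,     # warm connections for steady traffic
    max_size=10,    # hard ceiling; size this against your plan's limit
    max_idle=300,   # close connections idle for 5 minutes
)

def fetch_user(user_id: int):
    with pool.connection() as conn:  # borrows and auto-returns a connection
        return conn.execute(
            "SELECT id, email FROM users WHERE id = %s", (user_id,)
        ).fetchone()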
Background jobs were the hardest four days of the migration. Sidekiq workers that ran as fixed Kubernetes deployments needed to become Fly Machines that spin up on demand. The first version dropped jobs silently under load. I found out at 2 AM when a customer emailed. We fixed it the next morning but I aged about three years that night.
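That failure mode has a classic guard: never pop a job into thin air. Move it to a processing list atomically and remove it only after the work finishes, so a machine dying mid-job leaves evidence instead of silence. Our workers are Sidekiq, but the pattern is language-agnostic; a Python and Redis sketch, with queue names invented for illustration:

# worker.py: reliable-queue pattern so a dying machine can't eat jobs silently
import redis

r = redis.Redis()
QUEUE = "jobs:pending"         # invented queue names
PROCESSING = "jobs:processing"

def handle(job: bytes) -> None:
    print(f"processing {job!r}")  # real work goes here

while True:
    # Atomically move one job onto the processing list and return it.
    # If this machine dies right now, the job survives in PROCESSING,
    # where a periodic reaper can find it and re-enqueue it.
    job = r.brpoplpush(QUEUE, PROCESSING, timeout=5)
    if job is None:
        continue  # timed out; lets an on-demand machine decide to idle out
    try:
        handle(job)
        r.lrem(PROCESSING, 1, job)  # ack: remove only after success
    except Exception:
        pass  # leave the job in PROCESSING so the reaper retries it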
Monitoring was the easiest switch and the most embarrassing one. Fly.io's native metrics plus Grafana Cloud cost $50 a month. We'd been paying $1,200 for dashboards that told us our idle pods were still idle.
The Cutover
Tuesday, 2 AM. I flipped the DNS switch — alone, three energy drinks in, with a rollback plan I'd tested twice and still didn't trust.
The strategy: run both environments in parallel for 48 hours, shift traffic in stages — 10%, then 50%, then 100% — and keep the Kubernetes cluster warm for a week as a rollback option.
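"Shift traffic in stages" sounds abstract until you see the mechanism: weighted DNS, two answers for the same hostname with different weights. A sketch of the stage-one shift using Route 53 and boto3; the zone ID and hostnames are hypothetical, and your DNS provider's equivalent works the same way:

# shift_traffic.py: staged cutover with weighted DNS (Route 53 sketch)
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0000000000000"  # hypothetical hosted zone

def set_weight(set_id: str, target: str, weight: int) -> None:
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",  # hypothetical hostname
                "Type": "CNAME",
                "SetIdentifier": set_id,  # distinguishes weighted siblings
                "Weight": weight,         # relative share of DNS answers
                "TTL": 60,                # short TTL so each stage takes hold fast
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

# Stage one: 10% of resolutions to Fly.io, 90% stay on the old stack.
set_weight("fly", "our-api.fly.dev", 10)
set_weight("k8s", "legacy-alb.example.com", 90)
# Watch the dashboards, then repeat at 50/50 and finally 100/0.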
I watched the metrics for six hours. Error rate: 0.00%. Latency dropped immediately. No alerts fired. Nothing broke in the ways I'd rehearsed breaking it.
By Friday, we deleted the Kubernetes cluster. Twelve days, start to finish.
The Numbers
BEFORE — Kubernetes ($14,850/mo)
————————————————————————————————————
EC2 instances (12x m5.large) $6,400
RDS db.r5.2xlarge $4,200
Load balancers (4x ALB) $720
Data transfer $1,800
Datadog monitoring $1,200
NAT Gateways $450
————————————————————————————————————
TOTAL $14,850
AFTER — Fly.io ($680/mo)
————————————————————————————————————
Compute (6x shared-cpu-1x) $280
Neon Postgres $350
Grafana Cloud $50
————————————————————————————————————
TOTAL $680
Savings: $170,040 a year, a 95.4% reduction.

The 95% reduction isn't an indictment of Kubernetes. It's an indictment of our failure to right-size anything we deployed.
P99 latency dropped from 320ms to 180ms. Deploy time from eight minutes to forty-five seconds. Incidents from twelve a month to two — tracked in our incident log, not estimated.
The number that mattered most: our junior developer can now deploy to production without asking anyone. We had built a system that required a former Google engineer to operate. That wasn't infrastructure. That was a single point of failure wearing a Kubernetes badge.
We deleted more than two thousand lines of YAML. Our entire infrastructure config is now about twenty lines.
When You Should Not Do This
Fly.io is not the answer if you're at genuine scale — hundreds of services, complex multi-tenant isolation, specific compliance requirements tied to your infrastructure choices.
It's not the answer if you have a dedicated platform team whose job is making Kubernetes efficient. At that point K8s optimization is probably cheaper than migration.
The real question isn't "Kubernetes vs Fly.io." It's a harder one:
Does your current infrastructure match your team's operational maturity — or does it match the company you're pretending to be?
A heuristic that's served me better than traffic thresholds:
If onboarding requires tribal knowledge, your architecture has already outgrown your team.
That's the signal. Not user count. Not request volume. Team capability vs system complexity. When those two are misaligned, you're accumulating operational debt whether you feel it yet or not.
What Actually Changed
Three months out, we've shipped more features than in the six months prior. Not because Fly.io is magic. Because we stopped spending 40% of our engineering time on infrastructure we didn't understand well enough to operate.
The infrastructure lesson was real. But the deeper lesson was about what we were actually optimizing for — and who we were performing for.
We confused architectural sophistication with product progress. Sophisticated. Scalable. Serious. The kind of company that runs Kubernetes.
Complexity is seductive because it looks like ambition. The engineers who've been burned by it know the difference. The ones who haven't yet are probably reading this nodding while running twelve m5.large nodes at 15% utilization.
The infrastructure that serves you best is the one your actual team can actually operate. Not the team you're planning to hire. Not the scale you're planning to reach. The team you have, right now, today.
If you can't explain your infrastructure in five minutes, it's too complex.
If your entire team can't deploy to production, you're over-engineering.
If you're spending more on infrastructure than on the engineers running it, something is wrong.
We were doing all three. The bill just made it impossible to pretend otherwise.
Kubernetes didn't fail us. We failed to ask whether we'd earned it.
What's your infrastructure story? Have you felt the pull to build for the company you want to be instead of the one you are? Drop a comment — I read every one.