IaC in the SRE Lens: War Stories and the Future

Tiago Dias Generoso

ITNEXT

October 13, 2025 · ~9 min read

Introduction

Infrastructure as Code (IaC) is often sold as the cure for operational pain: faster deployments, fewer errors, and less toil. But anyone who has lived through a real migration knows the story is more complicated. IaC isn't a silver bullet — it comes with its own set of challenges that teams without hands-on experience rarely anticipate.

In this article, I'll share the war stories: the unexpected drifts, provider gaps, and timeouts that slowed us down and forced us to adapt. From there, we'll look ahead at how IaC is evolving — into GitOps, Policy-as-Code, and AI-driven automation — and why that evolution matters for SREs.

This is not just about theory. It's about scars, lessons learned, and what the future of reliability looks like when code, culture, and automation collide.

How Is IaC Relevant to SRE?

Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable configuration files. Instead of relying on manual steps or portal clicks, infrastructure becomes declarative, version-controlled, and repeatable. With Terraform, JSON, or YAML definitions, what once lived in scripts or tickets is now codified, reviewed, and automated.

But for Site Reliability Engineers, IaC is much more than automation — it's a strategy for reliability. In our migration, Terraform became the backbone for reducing toil, enforcing consistency across five regions, and enabling DevOps and SRE teams to collaborate through a single source of truth in GitHub.

Here's where SREs benefit directly from IaC:

Toil Reduction & Efficiency

Automation: eliminates repetitive tickets and manual steps.
Consistency: no more snowflake clusters — dev, QA, and prod share the same templates.
Scalability: resources provisioned rapidly, without waiting on queues.
Cost optimization: tagging, budgets, and policies enforced as code.

Reliability & Resilience

Repeatability: entire regions can be rebuilt predictably if they fail.
Drift detection: terraform plan surfaces out-of-band infra changes before they cause incidents.
Incident response: faster MTTR by spinning up infra from code instead of rebuilding by hand.
Rollback & recovery: versioned code plus Terraform state record the last applied configuration, which serves as a "last known good config" to roll back to.
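
As a rough sketch of how the drift check could be automated, the helper below wraps `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when changes are pending. The function names and the `workdir` parameter are hypothetical, not something from our pipeline. Keep in mind that exit code 2 also covers code changes that were never applied, so it only signals drift when the code being planned has already been applied.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANINGS = {
    0: "in-sync",   # plan succeeded with an empty diff
    1: "error",     # the plan itself failed
    2: "changes",   # plan succeeded with a non-empty diff
}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a short status label."""
    return PLAN_EXIT_MEANINGS.get(code, "unknown")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (a hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # non-zero exit codes are expected and handled above
    )
    return classify_plan_exit(result.returncode)
```

A scheduled job could call `check_drift` once per environment directory and open a ticket whenever the result is `"changes"`.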

Culture & Collaboration

GitOps for infra: changes flow through pull requests, reviewed across DevOps, SRE, and SecOps.
Shared ownership: DevOps teams build modules; SREs consume and operate them.
Reduced silos: no hidden scripts or portal fixes; Git is the source of truth.

Observability & RCA

Observability as Code: dashboards, alerts, and Logic Apps are provisioned alongside infra.
Error budgets: safer rollouts with canary applies and rollback paths.
Root Cause Analysis: Git history and state files provide forensic evidence during postmortems.
Faster detection (MTTD): monitoring is no longer optional — it's built into infra definitions.

Common Pain Points without IaC

If the benefits of IaC are consistency, automation, and reliability, the absence of IaC shows up as the opposite: manual toil, hidden inefficiencies, and environments that drift apart. Before Terraform, these were our daily realities:

Manual Operations

Creating every VM, NSG rule, or load balancer probe by clicking through a portal.
Copy-pasting firewall rules and configs across regions, hoping nothing was missed.
Re-running the same CLI commands after each upgrade or patch, one cluster at a time.
Manually checking logs and metrics after deployments instead of automating observability.

Hidden Toil

Each team had its own "helper scripts" on personal laptops — slightly different, undocumented, and fragile.
When things broke, engineers spent hours debugging one-off scripts instead of improving shared modules.
Fixes done in the portal were invisible to others and never captured in version control.

Inconsistent Environments

QA missing NSG rules while production had extra firewall exceptions.
Dev clusters running Kubernetes 1.28, while prod lagged behind on 1.26.
Monitoring alerts and dashboards configured differently in each region.
Storage accounts scattered with mixed replication strategies, without a clear policy.

These inconsistencies weren't just annoying — they slowed recovery during incidents, undermined confidence in testing, and created reliability risks that weren't visible until something broke in production.

Terraform Drift: War Stories

Now I want to focus on the problems that IaC itself introduces, the ones that can delay environment maintenance. In a small environment, deploying through IaC can sometimes take longer than doing it by hand. I am not talking about predictability or quality here, only about time.

Here are a few small examples of Terraform drift that blocked us from deploying what we needed.

War Story 1: Resources not supported by the provider

The Problem

IaC doesn't always mean faster. In real-world migrations, we faced cases where Terraform blocked deployments because of provider limitations.

Example: SRE needed to deploy MSSQL with SKU GP_S_Gen5_2.
Terraform plan failed → AzureRM provider didn't support the SKU.
To avoid delays, SRE deployed manually in the portal → causing drift between code and reality.

Why it matters

Small environments: manual feels "faster."
Large environments: drift multiplies → future deploys break.
IaC forces you to choose: speed now vs. long-term consistency.

The Solution

Root cause: AzureRM provider lagging behind Azure SKU updates.
Mitigation: Adopted AzAPI provider for unsupported resources.
Process: DevOps team updated modules, tested new version, SREs redeployed cleanly.

✨ Lesson Learned: IaC can feel slower at first, but discipline avoids snowballing drift. The fix isn't to bypass Terraform, it's to improve modules and providers so IaC stays the source of truth.

War Story 2: LB drift forcing replacement

The Problem

SRE needed to deploy a new VM into an existing environment.
Terraform plan unexpectedly flagged NIC ↔ LB association for replacement.
This would destroy and recreate the load balancer binding, risking downtime for the app.

Why it matters

Even a "small change" (add 1 VM) can cascade into a destructive action.
Terraform models an association as its own resource → when an identifying attribute changes, it is destroyed and recreated, with no in-place update.
Without guardrails, deployments stall while teams scramble to fix the plan.

The Solution

Root cause: Terraform resource model for NIC/LB was too rigid.
Mitigation: DevOps added a lifecycle rule (ignore_changes) to the module → Terraform now ignores changes to the NIC association.
Published a new version of the module.
SRE redeployed successfully, avoiding drift and app impact.

✨ Lesson Learned: Not every resource should be treated the same. Some infra changes (like associations) need lifecycle policies to prevent unintended replacements. Close DevOps–SRE collaboration keeps IaC safe for production-critical apps.
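
One possible guardrail against a surprise replacement like this one is to scan the machine-readable plan before anyone approves it. The sketch below reads the `resource_changes` array from the JSON that `terraform show -json` emits and flags every change whose actions include both `delete` and `create`. The sample plan is hand-written for illustration, and the resource addresses in it are hypothetical.

```python
import json


def find_replacements(plan: dict) -> dict[str, list]:
    """Map each resource the plan would destroy and recreate to the
    attribute paths Terraform reports as forcing the replacement."""
    flagged = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        # A replacement shows up as both "delete" and "create" in one change.
        if "delete" in actions and "create" in actions:
            # replace_paths is optional in the plan JSON, so default to empty.
            flagged[rc["address"]] = change.get("replace_paths", [])
    return flagged


# Hand-written sample in the shape of `terraform show -json` output (not real data).
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {
      "address": "azurerm_linux_virtual_machine.app02",
      "change": {"actions": ["create"]}
    },
    {
      "address": "azurerm_network_interface_backend_address_pool_association.app01",
      "change": {
        "actions": ["delete", "create"],
        "replace_paths": [["backend_address_pool_id"]]
      }
    }
  ]
}
""")

if __name__ == "__main__":
    for address, paths in find_replacements(SAMPLE_PLAN).items():
        print(f"WOULD REPLACE {address} (forced by {paths})")
```

In a pipeline, the input would come from `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, and a non-empty result could fail the PR check until someone confirms the replacement is intended.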

War Story 3: Provider updates generating drift

The Problem

SRE updated tags in Terraform to support FinOps cost allocation.
Simple PR expected… but the plan flagged unexpected drift in AKS upgrade settings (drain_timeout, max_surge, etc.).
SRE wasn't sure of the runtime impact → blocked applying changes to production.

Why it matters

Provider updates (AzureRM schema changes) introduce new defaults or deprecations.
Terraform shows drift even when nothing actually changed in Azure.
This creates uncertainty for SREs — a "safe" tag change suddenly looks risky.

The Solution

Escalated to DevOps team.
DevOps declared the new provider-introduced settings as explicit variables in tfvars and adjusted the module to use them.
Published a new version of the code, enabling SREs to safely reapply without unexpected drift.

✨ Lesson Learned: IaC is not only about code — it's also about provider behavior. Schema updates ripple into plans, and without clear division of responsibilities, SREs can get stuck. DevOps must continuously validate modules against provider updates, while SREs focus on safe deployment.
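
To make a "tag-only" PR easier to trust, one option is to compare the `before` and `after` objects in the plan JSON and report any attribute that changed besides the ones the PR was meant to touch. This is a minimal sketch under that assumption: the function names are hypothetical, the sample data is hand-written, and the attribute names inside it are illustrative rather than copied from our modules.

```python
def changed_attributes(resource_change: dict) -> list[str]:
    """Return the top-level attributes whose values differ between before and after."""
    change = resource_change.get("change", {})
    before = change.get("before") or {}
    after = change.get("after") or {}
    return sorted(k for k in set(before) | set(after) if before.get(k) != after.get(k))


def unexpected_changes(plan: dict, expected: set[str]) -> dict[str, list[str]]:
    """Map each resource address to changed attributes outside the expected set."""
    report = {}
    for rc in plan.get("resource_changes", []):
        extra = [k for k in changed_attributes(rc) if k not in expected]
        if extra:
            report[rc["address"]] = extra
    return report


# Hand-written sample (not real data): a tag update that also drags in upgrade settings.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "azurerm_kubernetes_cluster_node_pool.workers",
            "change": {
                "actions": ["update"],
                "before": {
                    "tags": {"env": "prod"},
                    "upgrade_settings": [{"max_surge": "10%"}],
                },
                "after": {
                    "tags": {"env": "prod", "cost_center": "cc-1234"},
                    "upgrade_settings": [
                        {"max_surge": "10%", "drain_timeout_in_minutes": 0}
                    ],
                },
            },
        },
        {
            "address": "azurerm_storage_account.logs",
            "change": {
                "actions": ["update"],
                "before": {"tags": {}},
                "after": {"tags": {"cost_center": "cc-1234"}},
            },
        },
    ]
}

if __name__ == "__main__":
    for address, attrs in unexpected_changes(SAMPLE_PLAN, {"tags"}).items():
        print(f"{address}: unexpected change in {attrs}")
```

Values that are only known after apply are left out of `after` in the plan JSON, so this check can over-report on those attributes. It is meant as a prompt for a closer look, not as a verdict.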

War Story 4: Long, unexplained timeouts

The Problem

SRE needed to decommission a VM.
Pull Request was created, plan looked fine, and apply started.
Suddenly, Terraform failed with a "context deadline exceeded" timeout when deleting a run command.
Multiple retries and a new PR didn't help. Azure logs and Terraform logs gave no clue.

Why it matters

Timeouts with no root cause block workflows, delay operations, and frustrate engineers.
In production contexts, such unknowns can slow down incident response and increase MTTR.
Worse, they erode trust in IaC pipelines, making engineers consider "manual fixes."

The Solution

After waiting ~2 hours, the SRE retried, and the deletion finally succeeded.
There was no clear RCA, but the issue was likely tied to Azure backend inconsistencies or temporary provider/API glitches.

✨ Lesson Learned: Even with IaC, cloud providers introduce opaque failures outside of SRE/DevOps control. The mitigation here was patience and retries.
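
"Patience and retries" can be encoded instead of left to whoever is on call. The sketch below retries an operation with exponentially growing waits. It is not what we ran at the time (we retried by hand), and the default delays, the `make_flaky` stand-in, and all names are assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 60.0,
    max_delay: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a stand-in operation that times out `failures` times, then succeeds."""
    remaining = {"count": failures}

    def operation() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TimeoutError("context deadline exceeded")
        return "deleted"

    return operation


if __name__ == "__main__":
    # Record the delays instead of sleeping so the demo finishes instantly.
    waited: list[float] = []
    print(retry_with_backoff(make_flaky(2), sleep=waited.append), waited)
```

In a real pipeline, `operation` could wrap the apply step itself. A capped number of attempts matters: after the last failure the error should reach a human rather than loop forever.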

War Story 5: VM restore forcing replacement

The Problem

An application issue required a full VM restore from backup.
The restore completed successfully, the application was validated, and users were satisfied.
But when the SRE later ran terraform plan, it showed drift in disk IDs, and flagged the VM for forced replacement.
This meant Terraform would try to recreate the VM unnecessarily — a risky and disruptive operation in production.

Why it matters

Restores are supposed to bring stability, but instead they broke alignment with IaC.
Disk ID mismatches caused Terraform to interpret the restored VM as a "different resource."
This added toil, risk, and confusion at the worst possible time (post-incident recovery).

The Solution

After escalating with Microsoft, the team learned the workaround:
Run a sequence of manual OS disk swap steps after restore (stop VM, snapshot, create new disk from snapshot, swap OS disk, etc.).
This preserved consistency between Azure resources and Terraform state.
I documented the detailed procedure in a separate Medium article.

✨ Lesson Learned: Even managed cloud restores don't always play well with IaC. SREs need documented playbooks for these exceptions, and DevOps should continuously adjust modules to handle such edge cases.

IaC is not dead — it's evolving

Some argue that Infrastructure as Code (IaC) is a "dead end." In reality, what we are witnessing is its evolution into a larger ecosystem that blends automation, governance, and intelligence:

1. IaC as the foundation

IaC (Terraform, Pulumi, ARM/Bicep, etc.) remains the bedrock: machine-readable definitions of infrastructure that bring automation, repeatability, and version control.
It answers what infrastructure should look like, codified in templates and modules.
Still critical because AI, GitOps, and Policy-as-Code need a foundation to operate on.

2. GitOps as the process

GitOps extends IaC by making Git the single source of truth.
Changes flow through Pull Requests (PRs), reviews, and automated pipelines.
This enforces collaboration, auditability, and consistency across environments (dev → prod → multi-region).
IaC describes infra, but GitOps defines how infra gets deployed, reconciled, and rolled back.

3. Policy-as-Code as the guardrails

With great automation comes great risk — one wrong line of IaC can create compliance or cost nightmares.
Policy-as-Code (OPA, Azure Policy, Kyverno) ensures guardrails:
Enforce cost tagging.
Prevent noncompliant SKUs.
Enforce security baselines (encryption, network rules).
Works hand-in-hand with GitOps by validating PRs before they reach production.
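
To illustrate the idea of validating a PR before it reaches production, here is a small stand-in written in plain Python rather than in OPA, Azure Policy, or Kyverno. It checks the plan JSON for resources that would be created or updated without the required cost tags. The tag names, function name, and sample plan are all hypothetical.

```python
REQUIRED_TAGS = {"cost_center", "owner"}  # hypothetical tagging policy


def missing_tags(plan: dict, required: set[str]) -> dict[str, list[str]]:
    """Map each planned resource to the required tags it would lack after apply."""
    violations = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        # Skip deletions (after is null) and resource types with no tags attribute.
        if not after or "tags" not in after:
            continue
        present = set(after["tags"] or {})
        missing = sorted(required - present)
        if missing:
            violations[rc["address"]] = missing
    return violations


# Hand-written sample in the shape of `terraform show -json` output (not real data).
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "azurerm_linux_virtual_machine.app01",
            "change": {"actions": ["create"], "after": {"tags": {"owner": "sre"}}},
        },
        {
            "address": "azurerm_storage_account.logs",
            "change": {
                "actions": ["update"],
                "after": {"tags": {"owner": "sre", "cost_center": "cc-1234"}},
            },
        },
        {
            "address": "azurerm_public_ip.old",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
}

if __name__ == "__main__":
    for address, tags in missing_tags(SAMPLE_PLAN, REQUIRED_TAGS).items():
        print(f"POLICY VIOLATION {address}: missing tags {tags}")
```

A dedicated policy engine adds things this sketch lacks, such as shared rule libraries and enforcement at the platform level, but the shape of the check is the same: evaluate the plan, fail the PR on violations.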

4. AI-driven automation as the accelerator

AI is not replacing IaC — it's making it smarter and faster.
From the InfoWorld article: AI can suggest IaC code snippets, detect drift, predict scaling needs, and even auto-generate templates.
From DevOps.com: Policy-as-Code becomes even more important in an AI era, since AI-generated IaC must still comply with governance rules.

Conclusion

The promise of IaC is real — but so are the pitfalls. We learned that drift is inevitable, providers lag, and sometimes Terraform makes things harder before it makes them easier. Still, every failure became a pattern, and every workaround became a lesson.

For SREs, the takeaway is clear: IaC is more than automation. It's the foundation for reliability, when combined with the right principles — error budgets, postmortems, canary rollouts, and guardrails.

And it's still evolving. With GitOps as the delivery model, Policy-as-Code as the safety net, and AI as the accelerator, IaC is moving beyond "deploy faster" to "operate smarter." The future of infrastructure isn't manual, and it isn't static. It's codified, collaborative, and increasingly intelligent.

Stay tuned: in the next article, I'll take you inside a real-world migration from a previous cloud provider to Azure, showing how we applied these ideas at scale.

Tiago Dias Generoso is a Distinguished IT Architect | Senior SRE | Master Inventor based in Pocos de Caldas, Brazil. The above article is personal and does not necessarily represent the employer's positions, strategies or opinions.

#iac #azure #sre
