Is Your Terraform Pipeline Production Ready?

Learn how to build a safe, auditable Terraform pipeline using committed plan files, PR-triggered CI verification, remote S3 state, and state locking — so no one can silently apply unreviewed infrastructure changes.

Someone on your team made a small “fix” to a Terraform module, didn’t run a plan, and hit terraform apply locally. Now you’re paging at 2am.

  • Terraform is being run manually from laptops with no audit trail
  • terraform apply is triggered in CI without anyone reviewing what will actually change
  • Drift between the committed code and the live state goes undetected until something breaks
  • The fix: commit a generated tf-plan.txt alongside your Terraform changes, then have CI re-generate the plan and verify it matches before applying anything

TL;DR

  1. Engineer runs create-plan locally — initialises backend, runs terraform plan, saves output to tf-plan.txt, stages both the .tf files and the plan.
  2. Engineer pushes and opens a PR — CI (CodeBuild, GitHub Actions, whatever) triggers on the PR event.
  3. CI re-runs terraform init + terraform plan, diffs the result against the committed tf-plan.txt. If they don’t match, the build fails.
  4. CI also runs policy checks (OPA / Sentinel / custom binary) against the plan JSON.
  5. On PR merge, CI runs the exact same verified plan with terraform apply.
  6. State lives in S3, locked via DynamoDB (or S3 native locking on Terraform ≥ 1.10).

The Problem

I first hit this properly at scale when a colleague ran terraform apply against prod after rebasing their branch — the rebase had pulled in someone else’s half-finished changes. The plan showed 12 resources to be destroyed. They hadn’t run a plan locally first. We caught it in time, but only because someone happened to be watching the terminal over their shoulder.

The underlying issue is that most Terraform workflows have no enforcement between “write code” and “apply to prod”. The CI pipeline might run terraform plan in a build step, but it rarely checks whether that plan matches what the engineer actually reviewed before opening the PR. The plan in CI is a fresh one — generated against live state at CI run time. If state has drifted, or if something in the workspace changed between when the engineer last ran a plan and when CI runs, the apply can do something unexpected.

This persists because it requires a cultural and tooling shift. Most teams feel that requiring a committed plan file is overhead. It isn’t — it’s the only way to make “what you approved in the PR review is what gets applied” a guarantee rather than a hope.

Symptoms

These are the signals that your pipeline isn’t there yet:

  • No tf-plan.txt or equivalent committed alongside .tf changes. Reviewers are approving diffs without knowing what the actual infrastructure change is.
  • CI generates a fresh plan on every run with no comparison against a previously reviewed plan. If someone force-pushes or state drifts between plan and apply, you get surprises.
  • State file is local or per-engineer. Two people working on the same module will overwrite each other’s state. One of them will destroy resources the other created.
  • No state locking. Two concurrent applies — say, a manual run and a CI run triggered by a merge — will corrupt state.
  • Apply happens automatically on merge with no retries or error handling. Transient AWS API errors (TLS handshake timeout, NoSuchBucket) will leave your infrastructure in a partial state.
  • No audit trail. You can’t answer “who applied this change, when, and from which PR?” after the fact.

The state corruption case is the one people underestimate. It doesn’t always cause an immediate outage — sometimes it just silently orphans resources, and you find out months later via an unexpected AWS bill.

The Solution: Plan-as-Commit + PR-Gated Apply

The solved state looks like this: every infrastructure change goes through a PR that contains both the .tf changes and a human-readable tf-plan.txt. CI verifies the committed plan matches what Terraform would actually do right now. Apply only runs after that verification passes.

Step 1: Simplified create-plan.sh

The local script’s job is to initialise the backend, generate the plan, and stage everything for commit. Keep it simple — just backend init and plan:

#!/bin/bash -ex

REPO=$(git config --local --get remote.origin.url | sed 's#.*/your-org/##;s#.git##')
RELPATH=$(echo $PWD | sed "s#$(git rev-parse --show-toplevel)/##")
TF_STATE_KEY="tfstate/$REPO/$RELPATH.tfstate"

export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
mkdir -p "$TF_PLUGIN_CACHE_DIR"

rm -rf .terraform .terraform.lock.hcl

terraform init \
  -backend-config="key=$TF_STATE_KEY" \
  -backend-config="bucket=your-tf-state" \
  -backend-config="encrypt=true" \
  -backend-config="dynamodb_table=tf-state-lock" \
  -backend-config="region=ap-southeast-1"

terraform fmt
git add *.tf

terraform plan -lock=true -out=plan.out "$@"

echo "// Plan generated at $(date)" > tf-plan.txt
terraform show -no-color plan.out >> tf-plan.txt
git add tf-plan.txt

echo "Done. Review tf-plan.txt, then commit and open a PR."

Note -lock=true here. Engineers should be locking state even on local plan runs — it prevents a race with any ongoing CI job.
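To see what the derived state key ends up as, here is the same sed pipeline run standalone against hypothetical values (your-org, an infra repo, and a networking/vpc module path are all placeholders for illustration):

```shell
#!/bin/sh
# Standalone sketch of the key derivation above, using a hypothetical
# HTTPS remote URL and a hypothetical module path inside the repo.
url="https://github.com/your-org/infra.git"
REPO=$(echo "$url" | sed 's#.*/your-org/##;s#.git##')  # strips everything up to the org, then .git
RELPATH="networking/vpc"                               # module path relative to the repo root
TF_STATE_KEY="tfstate/$REPO/$RELPATH.tfstate"
echo "$TF_STATE_KEY"
```

This prints tfstate/infra/networking/vpc.tfstate, so every module gets a unique, predictable key in the bucket. Note the first sed expression assumes an HTTPS-style remote; SSH remotes (git@github.com:your-org/...) use a colon before the org and would need a slightly different pattern.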

Required AWS permissions

Your local AWS credentials (or assumed role) must have the following permissions before running this script:

  • S3: s3:GetObject, s3:PutObject, s3:ListBucket on the state bucket
  • DynamoDB: dynamodb:GetItem, dynamodb:PutItem, dynamodb:DeleteItem on the lock table

Without these, terraform init or terraform plan -lock=true will fail with an access denied error. Check with aws sts get-caller-identity to confirm which identity you’re running as before debugging further.
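As a starting point, a minimal IAM policy covering both sets of permissions might look roughly like this (bucket and table names match the examples in this post; the Resource ARNs and the wildcard account ID are placeholders to adjust for your environment):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TerraformStateObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::your-tf-state/tfstate/*"
    },
    {
      "Sid": "TerraformStateList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-tf-state"
    },
    {
      "Sid": "TerraformStateLock",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
      "Resource": "arn:aws:dynamodb:ap-southeast-1:*:table/tf-state-lock"
    }
  ]
}
```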

To run create-plan from any directory on your Mac, install it once to /usr/local/bin:

sudo cp create-plan.sh /usr/local/bin/create-plan
sudo chmod +x /usr/local/bin/create-plan

Now from inside any Terraform module directory:

create-plan

Step 2: Remote State in S3

Local state files are the root cause of most state corruption incidents. Remote state with S3 means everyone works against the same source of truth.

terraform {
  backend "s3" {
    bucket         = "your-tf-state"
    key            = "tfstate/your-repo/path/to/module.tfstate"
    region         = "ap-southeast-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"
  }
}

Why S3 for state?

  • Single source of truth. No more “my local state file has resources yours doesn’t.”
  • Encryption at rest. State files contain sensitive values — resource IDs, sometimes credentials. S3 SSE-S3 or SSE-KMS handles this with one flag.
  • Versioning. Enable S3 bucket versioning and you can roll back to any previous state. This has saved me more than once after a botched apply.
  • Audit trail. S3 access logs or CloudTrail tells you exactly who read or wrote the state file and when.

Enable versioning on the bucket:

aws s3api put-bucket-versioning \
  --bucket your-tf-state \
  --versioning-configuration Status=Enabled

Step 3: State Locking

Without locking, two concurrent terraform apply runs will read the same state, make conflicting changes, and write back inconsistent state. One of those writes wins; the other’s changes are silently lost.

Option A: DynamoDB locking (classic)

Create the lock table once:

aws dynamodb create-table \
  --table-name tf-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region ap-southeast-1

Terraform writes a lock entry before any plan or apply and releases it after. If a job crashes mid-apply, the lock stays — run terraform force-unlock <lock-id> to release it manually.

Option B: S3 native locking (Terraform ≥ 1.10)

As of Terraform 1.10, the S3 backend supports native state locking using S3’s conditional write API — no DynamoDB table required. Add use_lockfile = true to your backend config:

terraform {
  backend "s3" {
    bucket       = "your-tf-state"
    key          = "tfstate/your-repo/module.tfstate"
    region       = "ap-southeast-1"
    encrypt      = true
    use_lockfile = true   # S3 native locking, no DynamoDB needed
  }
}

In create-plan.sh, replace the dynamodb_table backend-config flag with:

 -backend-config="use_lockfile=true"

Terraform writes a .tflock object alongside the state file using S3’s If-None-Match: * conditional write. Only one writer can succeed at a time. Fewer moving parts, one less AWS service to manage.

If you’re not on 1.10 yet, stick with DynamoDB. Both approaches work — pick based on your Terraform version.
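The one-writer-wins semantics are the same idea as shell noclobber: a redirect to a file that already exists fails, so only the first writer acquires the lock. A purely local sketch of the behaviour, no AWS involved:

```shell
#!/bin/sh
# Local analogy for S3's If-None-Match: * conditional write. With noclobber
# (set -C), a redirect only succeeds if the target file does not exist yet.
lockfile="/tmp/demo.tflock.$$"
rm -f "$lockfile"
( set -C; echo "runner-a" > "$lockfile" ) 2>/dev/null \
  && echo "runner-a acquired the lock"
( set -C; echo "runner-b" > "$lockfile" ) 2>/dev/null \
  || echo "runner-b blocked: lock already held"
rm -f "$lockfile"
```

The second writer fails atomically rather than overwriting, which is exactly the guarantee the .tflock object gives concurrent Terraform runs.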

Step 4: CI Verification (CodeBuild / GitHub Actions / any runner)

The CI script is named build-terraform. It accepts a single argument — test or apply — and is triggered by webhook events:

  • PR create / update → build-terraform test — re-generates the plan, verifies it matches the committed tf-plan.txt, and runs policy checks. The build must pass before the PR can be merged.
  • PR merge → build-terraform apply — runs the already-verified plan with terraform apply.

Save it to the runner’s working directory or a shared scripts path, then wire up your CodeBuild (or equivalent) webhook to pass test on PULL_REQUEST_CREATED and PULL_REQUEST_UPDATED events, and apply on PULL_REQUEST_MERGED.
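If you are on CodeBuild, the webhook wiring can itself live in Terraform. A sketch, assuming a hypothetical project named terraform-ci (the filter group mirrors the PR events above; inside the build, the script argument can be chosen by inspecting the CODEBUILD_WEBHOOK_EVENT environment variable):

```hcl
# Hypothetical CodeBuild project name; adjust to your setup.
resource "aws_codebuild_webhook" "terraform_ci" {
  project_name = "terraform-ci"

  filter_group {
    filter {
      type    = "EVENT"
      pattern = "PULL_REQUEST_CREATED, PULL_REQUEST_UPDATED, PULL_REQUEST_MERGED"
    }
  }
}
```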

Here’s the core of the script — update with your organisation-specific details:

#!/bin/sh -ex
# build-terraform <test|apply>

if [ $# -ne 1 ] || { [ "$1" != "apply" ] && [ "$1" != "test" ]; }; then
  echo "Usage: $0 <apply|test>"
  exit 1
fi

# Identify which module changed in this commit
tmp=$(mktemp)
# HEAD^1 fails on shallow clones (e.g. CodeBuild depth=1 with a single commit);
# fall back to listing all files tracked in HEAD instead.
if git rev-parse --verify HEAD^1 >/dev/null 2>&1; then
  git diff --name-only HEAD^1 HEAD > "$tmp"
else
  git ls-tree --name-only -r HEAD > "$tmp"
fi

plan_count=$(grep -c '/tf-plan.txt' "$tmp" || true)
if [ "$plan_count" -gt 1 ]; then
  echo "Error: more than one tf-plan.txt changed in this commit"
  exit 1
elif [ "$plan_count" -eq 0 ]; then
  echo "No tf-plan.txt in commit, skipping"
  exit 0
fi

dir=$(dirname "$(grep '/tf-plan.txt' "$tmp")")
cd "$dir"

REPO=$(git config --local --get remote.origin.url | sed 's#.*/your-org/##;s#.git##')
RELPATH=$(echo $PWD | sed "s#$(git rev-parse --show-toplevel)/##")

export TF_VAR_tfstate="tfstate/$REPO/$RELPATH.tfstate"
export TF_CLI_ARGS="-no-color"
export TF_LOG=INFO

# -force-copy migrates state backend without prompting — needed when an existing
# module is first onboarded to this pipeline and the backend config changes
terraform init \
  -force-copy \
  -backend-config="key=$TF_VAR_tfstate" \
  -backend-config="bucket=your-tf-state" \
  -backend-config="encrypt=true" \
  -backend-config="dynamodb_table=tf-state-lock" \
  -backend-config="region=ap-southeast-1"

# Generate fresh plan in CI
status=0
terraform plan -lock-timeout=30s -out=test-plan.out 2>stderr.txt || status=$?
if [ $status -ne 0 ]; then
  echo "Plan generation failed:"
  cat stderr.txt
  exit $status
fi

terraform show -no-color test-plan.out > test-plan.txt
terraform show -json test-plan.out > tfplan.json

# Print plan summary
grep '#' test-plan.txt || true
grep 'Plan:' test-plan.txt || true

# Verify CI plan matches committed plan
verify_plan() {
  grep -v '^//' tf-plan.txt > processed-tf-plan.txt
  skip=$(git log -n 1 --pretty=medium | grep -c 'skip-plan-verification' || true)
  diff test-plan.txt processed-tf-plan.txt > delta || true
  if [ -s delta ] && [ "$skip" -eq 0 ]; then
    echo "Plan mismatch — state may have drifted since this PR was opened:"
    cat delta
    exit 1
  fi
  echo "Plan matched."
}

verify_plan

# Apply only on merge (not on PR open)
if [ "$1" = "apply" ]; then
  for i in $(seq 1 10); do
    status=0
    terraform apply -lock-timeout=30s test-plan.out 2>stderr.log || status=$?
    # Retry on known transient AWS API errors
    retryable=$(grep -cE \
      "TLS handshake timeout|tcp.*timeout|tcp.*connection reset|NoSuchBucket|TooManyUpdates" \
      stderr.log || true)
    if [ $status -ne 0 ] && [ $retryable -eq 0 ]; then
      echo "Apply failed (non-retryable):"
      cat stderr.log
      break
    fi
    if [ $status -eq 0 ]; then
      break
    fi
    echo "Retrying in 5s (attempt $i)..."
    sleep 5
  done
  exit $status
fi

The key check is verify_plan. CI re-generates the plan against live state and diffs it against what was committed in the PR. If they don’t match — because state drifted, or because someone pushed a commit that changed the resources without updating the plan — the build fails. The engineer has to re-run create-plan.sh locally, update the committed plan, and push again.

This closes the loop: what a reviewer approved in the PR diff is exactly what terraform apply will execute.
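The TL;DR also mentions policy checks against the plan JSON, which the script above produces as tfplan.json but doesn't inspect. As a minimal, dependency-free stand-in for OPA or Sentinel, here is a sketch that flags any delete action in the plan (the inline JSON is a fabricated single-resource plan purely for illustration; a real check would parse terraform show -json output properly, e.g. with jq or OPA):

```shell
#!/bin/sh
# Minimal policy-gate sketch: fail if the plan JSON contains a "delete"
# action. The tfplan.json here is a tiny hand-written stand-in.
cat > tfplan.json <<'EOF'
{"resource_changes":[{"address":"aws_s3_bucket.logs","change":{"actions":["delete"]}}]}
EOF

deletes=$(grep -c '"delete"' tfplan.json || true)
if [ "$deletes" -gt 0 ]; then
  echo "Policy violation: plan contains delete actions"
fi
```

Wiring a check like this in after verify_plan means a reviewer-approved plan can still be blocked if it violates a hard rule, such as destroying production resources.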

What Production-Ready Looks Like

A fully hardened pipeline has a few more things worth adding:

  • IAM roles scoped per repo/module. CI should assume a role with write access only to the specific S3 key for the module it’s applying. Not a wildcard on the whole bucket.
  • Audit trail on apply. After a successful apply, emit an event to your audit log — PR number, repo, module path, timestamp. You want to be able to answer “what changed and who triggered it” without digging through CI logs.
  • Bucket versioning + lifecycle. S3 versioning on your state bucket means you can roll back state if an apply goes wrong. Add a lifecycle rule to expire old versions after 90 days to keep costs in check.
  • skip-plan-verification escape hatch. For genuine state drift (someone manually changed a resource), the pipeline gives a way to bypass the diff check by including skip-plan-verification in the commit message. Use it sparingly and treat it as a signal to go clean up the drift.

Conclusion

The core insight is simple: commit the plan, verify the plan, apply the plan — in that order, with CI enforcing each step. Once you have this in place, every infrastructure change has a paper trail from the engineer’s laptop to the AWS API call, and no one can apply changes that weren’t reviewed.

After running this pipeline, the on-call load from surprise terraform applies drops to near zero, PR reviews become meaningful (reviewers are looking at actual resource changes, not just HCL diffs), and you can confidently answer any audit question about what changed in your infrastructure and when.



This post is licensed under CC BY 4.0 by the author.