fix(spartan): break stale terraform state locks on ephemeral namespaces by AztecBot · Pull Request #22777 · AztecProtocol/aztec-packages

AztecBot · 2026-04-24T17:09:28Z

Summary

A run of network_deploy.sh that dies ungracefully (OOM, spot eviction, CI timeout) leaves a .tflock object in GCS for that namespace's terraform state, and every subsequent run fails at terraform plan with Error acquiring the state lock — the lock never clears without manual intervention.

That's what has been quietly killing the scheduled benchmarks for the last 2.5 months: the Weekly Real Proving Benchmark (prove-n-tps-real) and the Nightly Spartan Benchmarks (prove-n-tps-fake, block_capacity, plain n_tps). All three nightly jobs die at the same step, within 5–15 minutes, long before the benchmarks themselves run. There have been zero data points in bench/next since 2026-02-09.

Evidence

CI log http://ci.aztec-labs.com/1777006743878611 — nightly run 2026-04-24 05:03:39:

Executing: tf_run .../terraform/deploy-eth-devnet false true
Error: Error acquiring the state lock
  ID:        1775969935629294
  Path:      gs://aztec-terraform/aztec-gke-private/prove-n-tps-fake/deploy-eth-devnet/terraform.tfstate/default.tflock
  Operation: OperationTypeApply
  Who:       aztec-dev@next_amd64_n-proving-bench
  Version:   1.7.5
  Created:   2026-04-12 04:58:55.556620111 +0000 UTC

The lock has been held by a dead process since 2026-04-12. Every run since then tried to take the lock, failed in ~3 s, and exited before reaching make fast build output or anything else.
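For reference, the manual intervention mentioned in the summary usually means running terraform force-unlock with the ID printed in the error. The sketch below is illustrative only and is not part of this PR: the helper names lock_id_from_error and manual_unlock_command are hypothetical, and it assumes the error output has the shape shown in the CI log above.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not from the PR).

# Read terraform's "Error acquiring the state lock" output on stdin
# and print the first lock ID found.
lock_id_from_error() {
  awk '$1 == "ID:" { print $2; exit }'
}

# Print (do not run) the manual unlock command for a lock ID.
# The real command must be run from the terraform directory that is
# initialised against the same GCS backend.
manual_unlock_command() {
  printf 'terraform force-unlock -force %s\n' "$1"
}
```

Piping the failed plan output through lock_id_from_error and passing the result to manual_unlock_command gives the command an operator would have to run by hand for each stuck namespace, which is the step this PR automates for ephemeral namespaces.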

Fix

Drop any stale .tflock objects under this namespace's GCS prefix before running terraform. Gated on DESTROY_NAMESPACE=true + CLUSTER != kind, so it only fires for ephemeral bench/scenario namespaces (prove-n-tps-*, block-capacity, etc.) on a real GCS backend. Production namespaces (devnet, testnet, staging-*) don't set DESTROY_NAMESPACE=true and are untouched. Concurrency protection stays intact: bootstrap_ec2 already terminates any prior EC2 instance with the same instance_name, and the bench workflows declare a concurrency group.
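A minimal sketch of the gating and cleanup described above, not the actual diff. DESTROY_NAMESPACE and CLUSTER come from the description; the function names, the arguments, and the gs://<bucket>/<cluster>/<namespace>/ layout (inferred from the lock path in the CI log) are assumptions, as is the exact wildcard used.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names other than DESTROY_NAMESPACE and CLUSTER
# are illustrative and do not come from network_deploy.sh.

# Succeeds only for ephemeral namespaces on a real GCS backend.
should_break_stale_locks() {
  [ "${DESTROY_NAMESPACE:-false}" = "true" ] && [ "${CLUSTER:-}" != "kind" ]
}

# Build a wildcard matching .tflock objects nested under one namespace.
# Assumed layout: gs://<bucket>/<cluster>/<namespace>/<module>/terraform.tfstate/default.tflock
stale_lock_pattern() {
  printf 'gs://%s/%s/%s/**/*.tflock\n' "$1" "$2" "$3"
}

# Remove stale locks before terraform runs; a no-op when the gate is closed.
break_stale_locks() {
  should_break_stale_locks || return 0
  # `gcloud storage rm` exits non-zero when nothing matches, which is
  # the normal case here, so the failure is deliberately ignored.
  gcloud storage rm "$(stale_lock_pattern "$1" "$2" "$3")" || true
}
```

With DESTROY_NAMESPACE=true and a non-kind CLUSTER, calling break_stale_locks aztec-terraform aztec-gke-private prove-n-tps-fake would target the lock shown in the evidence section. With DESTROY_NAMESPACE unset, as on devnet or testnet, the function returns without touching GCS.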

Test plan

Trigger weekly-proving-bench.yml via workflow_dispatch; verify it gets past the Run real proving benchmarks step and into the actual 2.5 h bench run.
Trigger nightly-spartan-bench.yml via workflow_dispatch; verify all three *-benchmark jobs get past tf_run deploy-eth-devnet and reach the bench phase.
Confirm bench/next receives a new data point after the next scheduled run.

ClaudeBox log: https://claudebox.work/s/ea414e1a215ac1a6?run=4

BEGIN_COMMIT_OVERRIDE
chore(node): fix p2p services start-stop (AztecProtocol#22776)
fix(spartan): break stale terraform state locks on ephemeral namespaces (AztecProtocol#22777)
fix(spartan): drop broken Terraform validations on RPC ingress vars (AztecProtocol#22786)
END_COMMIT_OVERRIDE

fix(spartan): break stale terraform state locks on ephemeral namespaces

493d6c1

AztecBot added the ci-draft and claudebox labels Apr 24, 2026

alexghr approved these changes Apr 24, 2026

alexghr marked this pull request as ready for review April 24, 2026 17:23

alexghr enabled auto-merge (squash) April 24, 2026 17:25

alexghr disabled auto-merge April 24, 2026 18:51

fix(spartan): use gcloud storage rm instead of gsutil rm

bc85f4f

alexghr approved these changes Apr 24, 2026

alexghr merged commit 3ec89f3 into merge-train/spartan Apr 24, 2026
12 checks passed

alexghr deleted the claudebox/ea414e1a215ac1a6-4 branch April 24, 2026 19:03

AztecBot mentioned this pull request Apr 24, 2026

feat: merge-train/spartan #22779

Merged

alexghr merged 2 commits into merge-train/spartan from claudebox/ea414e1a215ac1a6-4
