fix(spartan): break stale terraform state locks on ephemeral namespaces #22777

Merged: alexghr merged 2 commits into merge-train/spartan from claudebox/ea414e1a215ac1a6-4 on Apr 24, 2026

@AztecBot

Collaborator

Summary

A run of network_deploy.sh that dies ungracefully (OOM, spot eviction, CI timeout) leaves a .tflock object in GCS for that namespace's terraform state. Every subsequent run then fails at terraform plan with Error acquiring the state lock, and the lock never clears without manual intervention.

This is what has been quietly killing the scheduled benchmarks for the last 2.5 months: the Weekly Real Proving Benchmark (prove-n-tps-real) and the Nightly Spartan Benchmarks (prove-n-tps-fake, block_capacity, plain n_tps). All three nightly jobs die at the same step within 5–15 minutes, long before the benchmarks themselves run. bench/next has received zero data points since 2026-02-09.

Evidence

CI log http://ci.aztec-labs.com/1777006743878611 — nightly run 2026-04-24 05:03:39:

Executing: tf_run .../terraform/deploy-eth-devnet false true
Error: Error acquiring the state lock
  ID:        1775969935629294
  Path:      gs://aztec-terraform/aztec-gke-private/prove-n-tps-fake/deploy-eth-devnet/terraform.tfstate/default.tflock
  Operation: OperationTypeApply
  Who:       aztec-dev@next_amd64_n-proving-bench
  Version:   1.7.5
  Created:   2026-04-12 04:58:55.556620111 +0000 UTC

The lock has been held by a dead process since 2026-04-12. Every run since then tried to take the lock, failed in ~3 s, and exited before reaching make fast build output or anything else.
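
For a one-off manual recovery, the lock ID printed in the error can be passed to terraform force-unlock. The sketch below is an illustration only and is not part of this PR: extract_lock_id is a hypothetical helper, and plan.log is an assumed file holding the captured terraform output.

```shell
# Hypothetical helper (not from this PR): read terraform output on stdin and
# print the first lock ID found in an "Error acquiring the state lock" block.
extract_lock_id() {
  sed -n 's/^[[:space:]]*ID:[[:space:]]*\([^[:space:]]\{1,\}\).*$/\1/p' | head -n 1
}

# Usage (assumes plan.log holds the failed run's output and the working
# directory is the affected terraform module):
#   terraform force-unlock -force "$(extract_lock_id < plan.log)"
```

This only clears one lock by hand; it does nothing to stop the next ungraceful exit from leaving another one behind.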

Fix

Drop any stale .tflock objects under this namespace's GCS prefix before running terraform. Gated on DESTROY_NAMESPACE=true + CLUSTER != kind, so it only fires for ephemeral bench/scenario namespaces (prove-n-tps-*, block-capacity, etc.) on a real GCS backend. Production namespaces (devnet, testnet, staging-*) don't set DESTROY_NAMESPACE=true and are untouched. Concurrency protection stays intact: bootstrap_ec2 already terminates any prior EC2 instance with the same instance_name, and the bench workflows declare a concurrency group.
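
The gating described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code merged in this PR: the function names, the state_prefix argument, and the choice to treat an unset CLUSTER as kind are all hypothetical, and it assumes gsutil is the GCS client available on the runner.

```shell
# Hypothetical sketch (names are assumptions, not the merged implementation).

# Succeeds only for ephemeral namespaces on a real GCS backend.
# Unset variables fall back to the safe side: no cleanup.
should_break_locks() {
  [ "${DESTROY_NAMESPACE:-false}" = "true" ] && [ "${CLUSTER:-kind}" != "kind" ]
}

# Remove every *.tflock object found under the given GCS prefix.
# $1: state prefix for one namespace, e.g. gs://<bucket>/<cluster>/<namespace>
break_stale_locks() {
  state_prefix="$1"
  should_break_locks || return 0
  # "**" asks gsutil for a recursive listing; a failed or empty listing
  # simply yields no objects to inspect.
  { gsutil ls "${state_prefix}/**" 2>/dev/null || true; } |
    while IFS= read -r object; do
      case "$object" in
        *.tflock)
          echo "Removing stale lock: $object"
          gsutil rm "$object"
          ;;
      esac
    done
}
```

With the path layout shown in the CI log, the call for the failing namespace would be break_stale_locks gs://aztec-terraform/aztec-gke-private/prove-n-tps-fake, issued before the first tf_run. Filtering on the .tflock suffix leaves the terraform.tfstate objects themselves untouched.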

Test plan

  • Trigger weekly-proving-bench.yml via workflow_dispatch; verify it gets past the Run real proving benchmarks step and into the actual 2.5 h bench run.
  • Trigger nightly-spartan-bench.yml via workflow_dispatch; verify all three *-benchmark jobs get past tf_run deploy-eth-devnet and reach the bench phase.
  • Confirm bench/next receives a new data point after the next scheduled run.

ClaudeBox log: https://claudebox.work/s/ea414e1a215ac1a6?run=4

@AztecBot added labels ci-draft (Run CI on draft PRs) and claudebox (Owned by claudebox; it can push to this PR) on Apr 24, 2026
@alexghr marked this pull request as ready for review April 24, 2026 17:23
@alexghr enabled auto-merge (squash) April 24, 2026 17:25
@alexghr disabled auto-merge April 24, 2026 18:51
@alexghr merged commit 3ec89f3 into merge-train/spartan Apr 24, 2026
12 checks passed
@alexghr deleted the claudebox/ea414e1a215ac1a6-4 branch April 24, 2026 19:03
chrismarino pushed a commit to chrismarino/aztec-packages that referenced this pull request May 5, 2026
BEGIN_COMMIT_OVERRIDE
chore(node): fix p2p services start-stop (AztecProtocol#22776)
fix(spartan): break stale terraform state locks on ephemeral namespaces
(AztecProtocol#22777)
fix(spartan): drop broken Terraform validations on RPC ingress vars
(AztecProtocol#22786)
END_COMMIT_OVERRIDE
