fix(spartan): break stale terraform state locks on ephemeral namespaces #22777
Merged
alexghr approved these changes on Apr 24, 2026
chrismarino pushed a commit to chrismarino/aztec-packages that referenced this pull request on May 5, 2026
BEGIN_COMMIT_OVERRIDE chore(node): fix p2p services start-stop (AztecProtocol#22776) fix(spartan): break stale terraform state locks on ephemeral namespaces (AztecProtocol#22777) fix(spartan): drop broken Terraform validations on RPC ingress vars (AztecProtocol#22786) END_COMMIT_OVERRIDE
Summary
A run of `network_deploy.sh` that dies ungracefully (OOM, spot eviction, CI timeout) leaves a `.tflock` object in GCS for that namespace's terraform state. Every subsequent run then fails at `terraform plan` with `Error acquiring the state lock`, and the lock never clears without manual intervention.
That's what has been quietly killing the scheduled benchmarks for the last 2.5 months. The Weekly Real Proving Benchmark (`prove-n-tps-real`) and the Nightly Spartan Benchmarks (`prove-n-tps-fake`, `block_capacity`, plain `n_tps`) are affected: all three nightly jobs die at the same step, in 5–15 minutes, long before the benchmarks themselves run. There have been zero data points in `bench/next` since 2026-02-09.
Evidence
CI log http://ci.aztec-labs.com/1777006743878611 — nightly run 2026-04-24 05:03:39:
The lock has been held by a dead process since 2026-04-12. Every run since then tried to take the lock, failed in ~3 s, and exited before reaching `make fast` build output or anything else.
Fix
Drop any stale `.tflock` objects under this namespace's GCS prefix before running terraform. Gated on `DESTROY_NAMESPACE=true` + `CLUSTER != kind`, so it only fires for ephemeral bench/scenario namespaces (`prove-n-tps-*`, `block-capacity`, etc.) on a real GCS backend. Production namespaces (`devnet`, `testnet`, `staging-*`) don't set `DESTROY_NAMESPACE=true` and are untouched. Concurrency protection stays intact: `bootstrap_ec2` already terminates any prior EC2 instance with the same `instance_name`, and the bench workflows declare a `concurrency` group.
Test plan
- `weekly-proving-bench.yml` via `workflow_dispatch`; verify it gets past the `Run real proving benchmarks` step and into the actual 2.5 h bench run.
- `nightly-spartan-bench.yml` via `workflow_dispatch`; verify all three `*-benchmark` jobs get past `tf_run deploy-eth-devnet` and reach the bench phase.
- `bench/next` receives a new data point after the next scheduled run.
ClaudeBox log: https://claudebox.work/s/ea414e1a215ac1a6?run=4
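For readers who have not opened the diff, the gating and cleanup described under Fix can be sketched roughly as below. This is an illustrative sketch, not the code merged in this PR: the function names, the bucket/prefix arguments, and the `GCS_LS` / `GCS_RM` override hooks are all hypothetical, and it assumes the `gcloud storage` CLI is available and that lock objects end in `.tflock`.

```shell
#!/usr/bin/env bash
# Sketch only. Function names, arguments, and the GCS_LS / GCS_RM hooks
# are assumptions for illustration, not taken from network_deploy.sh.

# Succeeds only for ephemeral namespaces on a real (non-kind) cluster.
should_break_locks() {
  [[ "${DESTROY_NAMESPACE:-false}" == "true" && "${CLUSTER:-}" != "kind" ]]
}

# Deletes every *.tflock object under gs://<bucket>/<prefix>/.
# Does nothing when the gate above is closed.
break_stale_locks() {
  local bucket="$1" prefix="$2" lock
  should_break_locks || return 0
  # The pattern is quoted so the shell passes "**" through to the lister.
  ${GCS_LS:-gcloud storage ls} "gs://${bucket}/${prefix}/**" 2>/dev/null \
    | grep '\.tflock$' \
    | while read -r lock; do
        echo "Removing stale lock: ${lock}"
        ${GCS_RM:-gcloud storage rm} "${lock}"
      done || true   # no lock objects found is not an error
}
```

Calling something like `break_stale_locks "$bucket" "$namespace_prefix"` just before the terraform step would match the ordering described above. Setting `GCS_RM=echo` turns the sketch into a dry run that only prints what would be deleted.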