
chore(ci): apply deploy-then-wait refactor to spartan nightly benchmarks #23489

Merged: spypsy merged 1 commit into next from cb/e47536c9d13d on May 22, 2026

@AztecBot (Collaborator) commented May 22, 2026

Summary

Applies the same restructuring introduced for the weekly proving benchmark in #23395 to the remaining spartan nightly benchmark workflows, to cut the SSM Undeliverable failures we keep hitting (e.g. the nightly 10 TPS run that died mid network_deploy).

The core idea (unchanged from #23395): stop doing the network deploy + multi-epoch chain warmup inside the single long-running EC2/SSM benchmark command. Instead:

  1. deploy-network.yml (runs on the GitHub runner via GKE, no SSM) deploys the network.
  2. A wait-for-l2-block job polls until the chain produces its first L2 block.
  3. The benchmark then runs with SKIP_NETWORK_DEPLOY=1, so the SSM command only builds + measures against the already-live network — a much shorter SSM lifetime, which is where Undeliverable was striking.
  4. A separate cleanup job (if: always()) tears the network down, and notify-failure keeps the existing Slack/ClaudeBox alerts.
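The polling in step 2 can be sketched as a generic retry loop. This is only an illustration of the shape, not the contents of wait_for_l2_block.sh: the names `wait_for` and `fake_probe` are hypothetical, and the stand-in probe simply succeeds on its third call, where a real probe would query the chain for its first L2 block.

```shell
# Hypothetical sketch: retry a probe command until it succeeds or a deadline passes.
# wait_for and fake_probe are illustrative names, not identifiers from the PR.
wait_for() {
  wf_timeout_s=$1
  wf_interval_s=$2
  shift 2
  wf_deadline=$(( $(date +%s) + wf_timeout_s ))
  # "$@" is the probe; it runs in the current shell, so it may update shell state.
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$wf_deadline" ]; then
      echo "timed out after ${wf_timeout_s}s waiting for: $*" >&2
      return 1
    fi
    sleep "$wf_interval_s"
  done
  return 0
}

# Stand-in probe (assumption): reports "no block yet" twice, then succeeds.
attempts=0
fake_probe() {
  attempts=$(( attempts + 1 ))
  [ "$attempts" -ge 3 ]
}

wait_for 10 0 fake_probe && echo "first block seen after ${attempts} probes"
# prints: first block seen after 3 probes
```

Keeping the wait in its own job means a slow chain start consumes runner time rather than the lifetime of the SSM command.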

Changes

Workflows restructured into select-image → deploy-network → wait-for-l2-block → benchmark (SKIP_NETWORK_DEPLOY=1) → cleanup → notify:

  • .github/workflows/nightly-bench-10tps.yml — single 10 TPS benchmark (env/namespace bench-10tps).
  • .github/workflows/nightly-spartan-bench.yml — all three benches: benchmark (tps-scenario → ns nightly-bench), proving-benchmark (prove-n-tps-fake), block-capacity-benchmark (block-capacity → ns nightly-block-capacity). The status aggregator and per-bench Slack messages are preserved. All jobs check out ref: next.

SKIP_NETWORK_DEPLOY plumbing (mirrors the proving-bench path from #23395) added to:

  • ci.sh: network-bench, network-block-capacity-bench, network-bench-10tps.
  • bootstrap.sh: ci-network-bench, ci-network-block-capacity-bench, ci-network-bench-10tps.
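As a rough illustration of what this plumbing amounts to, the sketch below gates a deploy step on SKIP_NETWORK_DEPLOY. The functions `deploy_network`, `run_benchmark`, and `network_bench` are placeholders invented for this example; the real commands in ci.sh and bootstrap.sh may be structured differently.

```shell
# Hypothetical sketch of a bench entry point gated by SKIP_NETWORK_DEPLOY.
# deploy_network / run_benchmark are placeholders, not the repo's real functions.
deploy_network() { echo "deploy"; }
run_benchmark()  { echo "benchmark"; }

network_bench() {
  # Only the literal value 1 skips the deploy; unset or any other value deploys first.
  if [ "${SKIP_NETWORK_DEPLOY:-0}" != "1" ]; then
    deploy_network
  fi
  run_benchmark
}

( unset SKIP_NETWORK_DEPLOY; network_bench )   # prints "deploy" then "benchmark"
( SKIP_NETWORK_DEPLOY=1; network_bench )       # prints only "benchmark"
```

Defaulting to "deploy" keeps existing callers working unchanged, so only the restructured workflows need to opt in.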

Bundled from #23395 (not yet on next)

This PR targets next, but the benchmark changes depend on helpers that currently only exist on merge-train/spartan via #23395. To make this PR self-contained and mergeable to next on its own, it also carries #23395's helper diff:

  • deploy-network.yml: notify_on_failure input.
  • spartan/bootstrap.sh: wait_for_l2_block command.
  • spartan/scripts/wait_for_l2_block.sh — RANDAO-aware wait logic.
  • bootstrap.sh / ci.sh: SKIP_NETWORK_DEPLOY for the proving-bench path.
  • weekly-proving-bench.yml — the original restructure from chore: refactor weekly proving test wait #23395.

These hunks are identical to #23395, so git will reconcile them when #23395 lands independently. If you'd rather land #23395 first and have me drop the bundled commit, say so.

Note on namespaces

For jobs whose deploy namespace differs from the env-file name (nightly-bench, nightly-block-capacity), the wait step sets NAMESPACE explicitly so wait_for_l2_block polls the right pods (env files use ${NAMESPACE:-default}, so the override is respected).
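The override relies on standard shell parameter expansion: `${VAR:-fallback}` yields the fallback only when VAR is unset or empty. A small sketch, where `resolve_namespace` is a hypothetical helper written for this example and the fallback value stands in for whatever default an env file carries:

```shell
# Hypothetical helper showing the ${NAMESPACE:-default} pattern in isolation.
resolve_namespace() {
  # Uses the caller's NAMESPACE if it is set and non-empty, else the fallback in $1.
  echo "${NAMESPACE:-$1}"
}

( unset NAMESPACE; resolve_namespace tps-scenario )           # prints "tps-scenario"
( NAMESPACE=nightly-bench; resolve_namespace tps-scenario )   # prints "nightly-bench"
( NAMESPACE=""; resolve_namespace tps-scenario )              # prints "tps-scenario"
```

The last case matters for workflows: an input that expands to an empty string behaves like an unset variable, so the wait step has to pass a non-empty NAMESPACE for the override to take effect.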

Not in scope here

The genuine nightly scenario tests run from ci3.yml's ci-network-scenario job (tag-gated), not from test-network-scenarios.yml (a manual workflow_dispatch runner). ci-network-scenario has the same deploy-inside-SSM shape and could get the same treatment as a follow-up.

Test plan

These can't be validated from a sandbox: they need AWS/GCP secrets and a real deploy. Before relying on the schedule, run each workflow once via workflow_dispatch:

  • Nightly Bench 10 TPS — confirm deploy-network + wait-for-l2-block succeed, then the benchmark runs against the live network with SKIP_NETWORK_DEPLOY=1, and cleanup tears down.
  • Nightly Spartan Benchmarks — confirm all three deploy/wait/benchmark/cleanup chains run in parallel without namespace collisions and status reports correctly.

@AztecBot added the ci and claudebox (Owned by claudebox. it can push to this PR.) labels May 22, 2026
AztecBot added a commit that referenced this pull request May 22, 2026
@spypsy spypsy marked this pull request as ready for review May 22, 2026 09:16
@spypsy spypsy requested a review from charlielye as a code owner May 22, 2026 09:16
Restructures nightly-bench-10tps and nightly-spartan-bench (tps, proving,
block-capacity) to deploy the network via deploy-network.yml on the runner,
wait for the first L2 block, then run the benchmark with SKIP_NETWORK_DEPLOY=1,
cutting the SSM Undeliverable failures from the long deploy-inside-SSM command.
Adds SKIP_NETWORK_DEPLOY plumbing to the network-bench / network-block-capacity-bench
/ network-bench-10tps ci3 commands.

Bundles the helper changes from #23395 (deploy-network.yml notify_on_failure,
spartan wait_for_l2_block, proving-bench SKIP_NETWORK_DEPLOY) so the PR is
self-contained on next, where #23395 has not yet landed.
@AztecBot AztecBot changed the base branch from merge-train/spartan to next May 22, 2026 09:37
@spypsy spypsy added this pull request to the merge queue May 22, 2026

@alexghr (Contributor) left a comment

lgtm

Merged via the queue into next with commit a8b6900 May 22, 2026
33 of 39 checks passed
@spypsy spypsy deleted the cb/e47536c9d13d branch May 22, 2026 10:52
