chore: refactor weekly proving test wait #23395

Merged
alexghr merged 1 commit into merge-train/spartan from ag/chore-refactor-weekly-proving-test-wait
May 20, 2026

Conversation

@alexghr alexghr commented May 19, 2026

Contributor

.

@alexghr alexghr requested a review from charlielye as a code owner May 19, 2026 12:55
@alexghr alexghr merged commit c542360 into merge-train/spartan May 20, 2026
14 checks passed
@alexghr alexghr deleted the ag/chore-refactor-weekly-proving-test-wait branch May 20, 2026 10:19
AztecBot added a commit that referenced this pull request May 22, 2026
Restructures nightly-bench-10tps and nightly-spartan-bench (tps, proving,
block-capacity) to deploy the network via deploy-network.yml on the runner,
wait for the first L2 block, then run the benchmark with SKIP_NETWORK_DEPLOY=1,
cutting the SSM Undeliverable failures from the long deploy-inside-SSM command.
Adds SKIP_NETWORK_DEPLOY plumbing to the network-bench / network-block-capacity-bench
/ network-bench-10tps ci3 commands.

Bundles the helper changes from #23395 (deploy-network.yml notify_on_failure,
spartan wait_for_l2_block, proving-bench SKIP_NETWORK_DEPLOY) so the PR is
self-contained on next, where #23395 has not yet landed.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
…rks (AztecProtocol#23489)

## Summary

Applies the same restructuring introduced for the weekly proving
benchmark in AztecProtocol#23395 to the remaining spartan nightly benchmark
workflows, to cut the SSM `Undeliverable` failures we keep hitting (e.g.
the [nightly 10 TPS run that died mid
`network_deploy`](https://github.com/AztecProtocol/aztec-packages/actions/runs/26275218767)).

The core idea (unchanged from AztecProtocol#23395): stop doing the network deploy +
multi-epoch chain warmup **inside** the single long-running EC2/SSM
benchmark command. Instead:

1. **`deploy-network.yml`** (runs on the GitHub runner via GKE, no SSM)
deploys the network.
2. A **`wait-for-l2-block`** job polls until the chain produces its
first L2 block.
3. The benchmark then runs with **`SKIP_NETWORK_DEPLOY=1`**, so the SSM
command only builds + measures against the already-live network — a much
shorter SSM lifetime, which is where `Undeliverable` was striking.
4. A separate **`cleanup`** job (`if: always()`) tears the network down,
and **`notify-failure`** keeps the existing Slack/ClaudeBox alerts.
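
As a rough illustration of step 2, the wait can be thought of as a poll-with-deadline loop. The sketch below is not the `wait_for_l2_block.sh` script from AztecProtocol#23395 (which is RANDAO-aware); `poll_until`, `has_first_l2_block` and `get_l2_block_number` are hypothetical names used only to show the shape of the logic.

```shell
# Hypothetical sketch (POSIX sh). Not the real wait_for_l2_block.sh.

# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the timeout has elapsed.
poll_until() {
  pu_timeout="$1"
  pu_interval="$2"
  shift 2
  pu_deadline=$(( $(date +%s) + pu_timeout ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$pu_deadline" ]; then
      echo "timed out after ${pu_timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$pu_interval"
  done
}

# Succeeds once the chain reports a block number of at least 1.
# get_l2_block_number is an assumed helper that prints the latest L2 block
# number; a real job would query a node for it.
has_first_l2_block() {
  hf_block="$(get_l2_block_number)" || return 1
  case "$hf_block" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric output: not ready yet
  esac
  [ "$hf_block" -ge 1 ]
}
```

A job could then call something like `poll_until 600 10 has_first_l2_block` (illustrative timeout and interval) and fail fast on the runner, before any SSM command is started.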

## Changes

**Workflows restructured into `select-image → deploy-network →
wait-for-l2-block → benchmark (SKIP_NETWORK_DEPLOY=1) → cleanup →
notify`:**
- `.github/workflows/nightly-bench-10tps.yml` — single 10 TPS benchmark
(env/namespace `bench-10tps`).
- `.github/workflows/nightly-spartan-bench.yml` — all three benches:
`benchmark` (tps-scenario → ns `nightly-bench`), `proving-benchmark`
(prove-n-tps-fake), `block-capacity-benchmark` (block-capacity → ns
`nightly-block-capacity`). The `status` aggregator and per-bench Slack
messages are preserved. All jobs check out `ref: next`.

**`SKIP_NETWORK_DEPLOY` plumbing** (mirrors the proving-bench path from
AztecProtocol#23395) added to:
- `ci.sh`: `network-bench`, `network-block-capacity-bench`,
`network-bench-10tps`.
- `bootstrap.sh`: `ci-network-bench`, `ci-network-block-capacity-bench`,
`ci-network-bench-10tps`.
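
For readers unfamiliar with the pattern, the plumbing amounts to gating the deploy step on an environment variable. The following is a minimal sketch under assumptions, not the actual `ci.sh` / `bootstrap.sh` code: `deploy_network`, `run_benchmark` and `network_bench` are stand-in functions, and treating only the value `1` as "skip" is an assumption.

```shell
# Hypothetical sketch (POSIX sh) of gating the deploy step on SKIP_NETWORK_DEPLOY.
# deploy_network and run_benchmark are stand-ins, not the real commands.

deploy_network() { echo "deploying network"; }
run_benchmark() { echo "running benchmark"; }

network_bench() {
  # Unset, empty, or any value other than "1" keeps the deploy-then-measure flow.
  if [ "${SKIP_NETWORK_DEPLOY:-0}" = "1" ]; then
    echo "SKIP_NETWORK_DEPLOY=1: reusing the already-deployed network"
  else
    deploy_network
  fi
  run_benchmark
}
```

Defaulting to the deploy path when the variable is unset keeps existing callers unchanged, so only the restructured workflows need to opt in.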

## Bundled from AztecProtocol#23395 (not yet on `next`)

This PR targets `next`, but the benchmark changes depend on helpers that
currently only exist on `merge-train/spartan` via AztecProtocol#23395. To make this
PR self-contained and mergeable to `next` on its own, it also carries
AztecProtocol#23395's helper diff:
- `deploy-network.yml` — `notify_on_failure` input.
- `spartan/bootstrap.sh` — `wait_for_l2_block` command.
- `spartan/scripts/wait_for_l2_block.sh` — RANDAO-aware wait logic.
- `bootstrap.sh` / `ci.sh` — `SKIP_NETWORK_DEPLOY` for the proving-bench
path.
- `weekly-proving-bench.yml` — the original restructure from AztecProtocol#23395.

These hunks are identical to AztecProtocol#23395, so git will reconcile them when
AztecProtocol#23395 lands independently. If you'd rather land AztecProtocol#23395 first and have
me drop the bundled commit, say so.

## Note on namespaces

For jobs whose deploy namespace differs from the env-file name
(`nightly-bench`, `nightly-block-capacity`), the wait step sets
`NAMESPACE` explicitly so `wait_for_l2_block` polls the right pods (env
files use `${NAMESPACE:-default}`, so the override is respected).
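
The override relies on standard shell parameter expansion: `${VAR:-fallback}` yields `$VAR` when it is set and non-empty, and the fallback otherwise. A small sketch, where `resolve_namespace` is a hypothetical helper standing in for the line in an env file, and the assumption is that the env file's fallback is the env name (for example `tps-scenario`):

```shell
# Hypothetical sketch (POSIX sh): how a ${NAMESPACE:-fallback} default behaves.
# resolve_namespace stands in for the assignment inside an env file.

resolve_namespace() {
  # Uses $NAMESPACE when it is set and non-empty, otherwise the fallback in $1.
  echo "${NAMESPACE:-$1}"
}
```

With `NAMESPACE` unset, `resolve_namespace tps-scenario` prints `tps-scenario`; with `NAMESPACE=nightly-bench` it prints `nightly-bench`, which is why the wait step has to set `NAMESPACE` to reach the pods that were actually deployed.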

## Not in scope here

The genuine nightly **scenario tests** run from `ci3.yml`'s
`ci-network-scenario` job (tag-gated), not from
`test-network-scenarios.yml` (a manual `workflow_dispatch` runner).
`ci-network-scenario` has the same deploy-inside-SSM shape and could get
the same treatment as a follow-up.

## Test plan

These can't be validated from a sandbox (need AWS/GCP secrets + a real
deploy). Before relying on the schedule, run each via
`workflow_dispatch` once:
- [ ] `Nightly Bench 10 TPS` — confirm deploy-network +
wait-for-l2-block succeed, then the benchmark runs against the live
network with `SKIP_NETWORK_DEPLOY=1`, and cleanup tears down.
- [ ] `Nightly Spartan Benchmarks` — confirm all three
deploy/wait/benchmark/cleanup chains run in parallel without namespace
collisions and `status` reports correctly.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
refactor(p2p): merge FastTxCollection into TxCollection with sequential
pipeline (AztecProtocol#23245)
refactor(publisher): bundle-level simulate; drop per-action enqueue sims
(AztecProtocol#23165)
refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases
(AztecProtocol#23249)
test(e2e): fix race in 'proposer invalidates multiple checkpoints'
(AztecProtocol#23259)
fix: clean up old jobs regardless of pending status (AztecProtocol#23260)
refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273)
chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276)
feat: slash truncated checkpoint proposals (AztecProtocol#23250)
refactor: remove unused map in attestation pool (AztecProtocol#23284)
chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274)
refactor(l1-tx-utils): use DateProvider for fail-fast timeout check
(AztecProtocol#23257)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277)
test(e2e): fix race in broadcasted_invalid_block_proposal_slash under
pipelining (AztecProtocol#23302)
fix(archiver): atomic getter for L2 tips (AztecProtocol#23295)
fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under
pipelining (AztecProtocol#23296)
fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298)
test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275)
chore: update dashboard (AztecProtocol#23312)
chore: Revert "feat(sandbox): support proposer pipelining in local
network" (AztecProtocol#23313)
test: slash on bad attestation (AztecProtocol#23184)
feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116)
test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301)
test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300)
test(e2e): switch fee_settings to organic fee bumps under pipelining
(AztecProtocol#23303)
fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures
(AztecProtocol#23333)
test(e2e): wait for real oracle rotation in fee_settings inflate helper
(AztecProtocol#23334)
test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining
(AztecProtocol#23336)
fix(spartan-bench): tolerate older node images in SlasherConfig schema
(AztecProtocol#23351)
fix: interrupt prover jobs in stop (AztecProtocol#23358)
test(e2e): enable pipelining on bot, fees, and avm simulator tests
(AztecProtocol#23329)
feat(sentinel): end-of-epoch evaluation with re-execution outcomes
(AztecProtocol#23286)
feat: slash for invalid checkpoint proposals (AztecProtocol#23270)
fix: fork closure in epoch proving jobs (AztecProtocol#23390)
fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394)
fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396)
test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in
epochs tests (AztecProtocol#23424)
chore: refactor weekly proving test wait (AztecProtocol#23395)
refactor: add fifo set (AztecProtocol#23271)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327)
fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371)
chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373)
feat(sequencer): AutomineSequencer for single-sequencer e2e tests
(AztecProtocol#23354)
fix(prover): wait for previous epoch to be proven (AztecProtocol#23458)
chore: collocate provers (AztecProtocol#23439)
chore: rm staging-ignition (AztecProtocol#23440)
chore: rm unused networks (AztecProtocol#23441)
test(e2e): migrate block_building, multi_validator_node,
publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414)
fix(archiver): reconcile local blocks with L1 checkpoints by block
number (AztecProtocol#23461)
feat: Updated slash conditions on block proposals (AztecProtocol#23466)
test(e2e): migrate HA full test to pipelining (AztecProtocol#23463)
chore: update resource profiles (AztecProtocol#23442)
chore: update debug log levels (AztecProtocol#23456)
test: fix flaky sentinel_status_slash by asserting the fault on the
checkpoint slot (AztecProtocol#23483)
feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980)
(AztecProtocol#23436)
refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID ->
PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468)
fix: reject block proposals in poisoned slots (AztecProtocol#23411)
fix: retry nargo dep + solc downloads to survive transient DNS drops
(AztecProtocol#23490)
fix: enrich json-rpc tracing (AztecProtocol#23412)
feat: add trace export controls (AztecProtocol#23413)
test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496)
test: cover invalid checkpoint proposal slashing (AztecProtocol#23503)
test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482)
test: flag e2e_slashing_attested_invalid_proposal as flake under
pipelining (AztecProtocol#23501)
test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining
(AztecProtocol#23515)
test(e2e): require cross-observer agreement on sentinel fault slot
(AztecProtocol#23513)
test: flag e2e_ha_full afterAll hook timeout as flake under pipelining
(AztecProtocol#23524)
fix(e2e): propagate l1ContractsArgs into node config so archiver matches
L1 (AztecProtocol#23514)
test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as
flake (AztecProtocol#23528)
test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1
race (AztecProtocol#23533)
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline
(AztecProtocol#23539)
test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541)
test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542)
test(ci): skip e2e_multi_validator_node_key_store entirely on
merge-train/spartan (AztecProtocol#23544)
END_COMMIT_OVERRIDE
2 participants