
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline #23539

Merged
PhilWindle merged 1 commit into
merge-train/spartan from
cb/133ce6d845a4
May 24, 2026

@AztecBot

Collaborator

Why

e2e_ha_full.test.ts dequeued PR #23344 from the merge train (log): all 8 tests passed, but the afterAll cleanup hook exceeded its 20-minute jest timeout. The hook stops 5 HA peer nodes serially, and HA-2's sequencer.stop() blocked for ~23 minutes waiting on an in-flight L1 publish whose internal tx-timeout was computed on a test-warped dateProvider clock and therefore never fired.

The deeper bug (publish doesn't honor stop()) is being fixed separately. This PR is the minimum change to keep one stuck node from killing the whole hook and the merge train.

What

Replace the serial for loop with Promise.allSettled(... Promise.race([stop, 30s timeout])), so:

  • All five node.stop() calls run concurrently.
  • A node that fails to stop within 30s is logged and abandoned rather than blocking siblings.
  • The hook always completes within ~30s instead of consuming the full 20-minute budget.

The 30s deadline is comfortably above the ~5ms each healthy node took to stop in the failing log, so this is purely a safety net. If it ever fires, the explicit error in the log should point at the next investigation.
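
For illustration, the pattern described above can be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in this PR: the `Stoppable` interface, the `stopWithDeadline` and `stopAll` helpers, and the `HA-<n>` labels are hypothetical names chosen for the example, and the real hook presumably logs through the test's own logger rather than returning messages.

```typescript
// Hypothetical shape of a peer node; only stop() matters for this sketch.
interface Stoppable {
  stop(): Promise<void>;
}

// Races node.stop() against a timer that rejects after `ms` milliseconds.
// The timer is cleared once the race settles so it cannot keep the process alive.
function stopWithDeadline(node: Stoppable, label: string, ms: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} did not stop within ${ms}ms`)), ms);
  });
  return Promise.race([node.stop(), deadline]).finally(() => clearTimeout(timer));
}

// Stops all nodes concurrently. Promise.allSettled never rejects, so one stuck
// or failing node cannot prevent the others from being awaited.
// Returns the error messages of the nodes that failed or timed out.
async function stopAll(nodes: Stoppable[], ms = 30_000): Promise<string[]> {
  const results = await Promise.allSettled(
    nodes.map((node, i) => stopWithDeadline(node, `HA-${i + 1}`, ms)),
  );
  const failures: string[] = [];
  for (const result of results) {
    if (result.status === 'rejected') {
      failures.push(String(result.reason?.message ?? result.reason));
    }
  }
  return failures;
}
```

Note that `Promise.race` only stops waiting on the stuck `stop()`; it does not cancel it. The abandoned call keeps running in the background, which is why this is a safety net for the hook and not a fix for the underlying publish bug.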

Scope

Test-only change. No production code touched.


Created by claudebox · group: slackbot

…dline

The afterAll hook stopped HA peer nodes serially with no per-node timeout,
so a single sequencer.stop() that hangs (e.g. an L1 publish whose tx-timeout
was computed on a test-warped clock) burns the entire 20-minute jest hook
budget and dequeues the merge train.
AztecBot added the ci-no-fail-fast (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) and claudebox (Owned by claudebox. it can push to this PR.) labels May 24, 2026
PhilWindle marked this pull request as ready for review May 24, 2026 12:11
PhilWindle enabled auto-merge (squash) May 24, 2026 12:11
PhilWindle merged commit d38da91 into merge-train/spartan May 24, 2026
40 of 45 checks passed
PhilWindle deleted the cb/133ce6d845a4 branch May 24, 2026 12:37
PhilWindle pushed a commit that referenced this pull request May 24, 2026
Dequeued from merge-train/spartan again:
<http://ci.aztec-labs.com/136431da99834194>.

The HA full suite keeps failing under proposer pipelining with shifting
symptoms. In this run the dashboard log shows recurring
`validator:proposal-handler Timed out waiting for block with archive
matching checkpoint proposal` warnings (slot 98, 115, …) and an `Error
building checkpoint at slot 127: already proposed block for slot 127
index 0` on HA-4 — i.e. the 5 HA peers race on the same proposal. The
bundled #23539 (parallel peer teardown) and #23524 (afterAll hook
timeout) entries did not catch this run because jest's per-test summary
was not reached within the dashboard log capture.

This PR adds a broad regex-only entry under `.test_patterns.yml` to flag
any failure of `yarn-project/end-to-end/scripts/run_test.sh ha
src/composed/ha/e2e_ha_full.test.ts` as a flake. Owner: @PaLLa, matching
the existing pipelining-flavoured entries for this suite.

The intent is to unblock the merge queue while the HA pipelining
stabilisation work continues; narrow the regex (or add a real fix) once
the failure modes settle down.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/d394ef6145e749ff) ·
group: `slackbot`*
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
refactor(p2p): merge FastTxCollection into TxCollection with sequential
pipeline (AztecProtocol#23245)
refactor(publisher): bundle-level simulate; drop per-action enqueue sims
(AztecProtocol#23165)
refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases
(AztecProtocol#23249)
test(e2e): fix race in 'proposer invalidates multiple checkpoints'
(AztecProtocol#23259)
fix: clean up old jobs regardless of pending status (AztecProtocol#23260)
refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273)
chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276)
feat: slash truncated checkpoint proposals (AztecProtocol#23250)
refactor: remove unused map in attestation pool (AztecProtocol#23284)
chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274)
refactor(l1-tx-utils): use DateProvider for fail-fast timeout check
(AztecProtocol#23257)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277)
test(e2e): fix race in broadcasted_invalid_block_proposal_slash under
pipelining (AztecProtocol#23302)
fix(archiver): atomic getter for L2 tips (AztecProtocol#23295)
fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under
pipelining (AztecProtocol#23296)
fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298)
test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275)
chore: update dashboard (AztecProtocol#23312)
chore: Revert "feat(sandbox): support proposer pipelining in local
network" (AztecProtocol#23313)
test: slash on bad attestation (AztecProtocol#23184)
feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116)
test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301)
test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300)
test(e2e): switch fee_settings to organic fee bumps under pipelining
(AztecProtocol#23303)
fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures
(AztecProtocol#23333)
test(e2e): wait for real oracle rotation in fee_settings inflate helper
(AztecProtocol#23334)
test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining
(AztecProtocol#23336)
fix(spartan-bench): tolerate older node images in SlasherConfig schema
(AztecProtocol#23351)
fix: interrupt prover jobs in stop (AztecProtocol#23358)
test(e2e): enable pipelining on bot, fees, and avm simulator tests
(AztecProtocol#23329)
feat(sentinel): end-of-epoch evaluation with re-execution outcomes
(AztecProtocol#23286)
feat: slash for invalid checkpoint proposals (AztecProtocol#23270)
fix: fork closure in epoch proving jobs (AztecProtocol#23390)
fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394)
fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396)
test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in
epochs tests (AztecProtocol#23424)
chore: refactor weekly proving test wait (AztecProtocol#23395)
refactor: add fifo set (AztecProtocol#23271)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327)
fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371)
chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373)
feat(sequencer): AutomineSequencer for single-sequencer e2e tests
(AztecProtocol#23354)
fix(prover): wait for previous epoch to be proven (AztecProtocol#23458)
chore: collocate provers (AztecProtocol#23439)
chore: rm staging-ignition (AztecProtocol#23440)
chore: rm unused networks (AztecProtocol#23441)
test(e2e): migrate block_building, multi_validator_node,
publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414)
fix(archiver): reconcile local blocks with L1 checkpoints by block
number (AztecProtocol#23461)
feat: Updated slash conditions on block proposals (AztecProtocol#23466)
test(e2e): migrate HA full test to pipelining (AztecProtocol#23463)
chore: update resource profiles (AztecProtocol#23442)
chore: update debug log levels (AztecProtocol#23456)
test: fix flaky sentinel_status_slash by asserting the fault on the
checkpoint slot (AztecProtocol#23483)
feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980)
(AztecProtocol#23436)
refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID ->
PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468)
fix: reject block proposals in poisoned slots (AztecProtocol#23411)
fix: retry nargo dep + solc downloads to survive transient DNS drops
(AztecProtocol#23490)
fix: enrich json-rpc tracing (AztecProtocol#23412)
feat: add trace export controls (AztecProtocol#23413)
test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496)
test: cover invalid checkpoint proposal slashing (AztecProtocol#23503)
test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482)
test: flag e2e_slashing_attested_invalid_proposal as flake under
pipelining (AztecProtocol#23501)
test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining
(AztecProtocol#23515)
test(e2e): require cross-observer agreement on sentinel fault slot
(AztecProtocol#23513)
test: flag e2e_ha_full afterAll hook timeout as flake under pipelining
(AztecProtocol#23524)
fix(e2e): propagate l1ContractsArgs into node config so archiver matches
L1 (AztecProtocol#23514)
test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as
flake (AztecProtocol#23528)
test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1
race (AztecProtocol#23533)
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline
(AztecProtocol#23539)
test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541)
test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542)
test(ci): skip e2e_multi_validator_node_key_store entirely on
merge-train/spartan (AztecProtocol#23544)
END_COMMIT_OVERRIDE
2 participants