test(e2e): scale bad-checkpoint wait to selected slots in `proposer invalidates multiple checkpoints` #24016
Closed
AztecBot wants to merge 1 commit into
spalladino pushed a commit that referenced this pull request on Jun 11, 2026:
…idates multiple checkpoints` (#24017)

Fixes a flake in `proposer invalidates multiple checkpoints` (`e2e_epochs/epochs_invalidate_block.parallel.test.ts`) reported on `v5-next`: [failed run](http://ci.aztec-labs.com/e4076dd86c434c6f).

Replaces #24016 (was based on `merge-train/spartan`; this one targets the v5 line where the flake fired and restructures the test instead of just resizing the timeout).

## Root cause of the flake

`TimeoutError: Operation timed out after 256000ms`: this is the bare 8-slot `timeoutPromise` waiting for the two bad checkpoints.

The bad-slot search from #23608 rejects any candidate pair whose proposer also owns an earlier un-snapshotted pipelined slot, and the rejection window grows with each attempt. In the failed run the current slot was 21 and the search rejected (24,25)…(29,30) before accepting slots **30/31**, 9–10 slots out. The fixed 256s wait expired at 22:48:55, before slot 30 even began (~22:49:00), while the chain healthily mined checkpoints at slots 22–28 underneath; the run was unwinnable at selection time.

The race's `.then(() => [CheckpointNumber(0), …])` fallback was also dead code, since `timeoutPromise` rejects.

## Fix: search first, then warp

Instead of starting the sequencers and waiting in real time for whatever slots the search lands on:

- With sequencers stopped, search for a `warpSlot` such that the proposers of the three lead-in slots `warpSlot+1..warpSlot+3` are not the proposers of the bad slots `warpSlot+4`/`warpSlot+5`. A far-away candidate now costs a warp instead of a real-time wait, and `EpochNotStable` during the search is handled by warping forward one epoch (same pattern as the `archiver skips a descendant` test in this file).
- Warp to one L1 block before `warpSlot`, so sequencers get a full L2 slot to boot before the first pipelined build window we rely on (end of `warpSlot`, targeting `warpSlot+1`).
- Start the sequencers and wait for the first good checkpoint (lands at `warpSlot`, or up to `warpSlot+2` on a slow start).
- Apply the malicious config to the bad-slot proposers. The three good lead-in slots guarantee no pipelined job before `badSlot1` can snapshot it, since jobs snapshot config during the last L1 slot of the previous L2 slot.
- Fail fast with a clear assertion if config application was somehow late enough to reach `badSlot1`'s build window, rather than timing out opaquely.
- The 8-slot wait for the bad checkpoints is now correctly sized by construction (`badSlot2` is at most ~6 slots from the wait start), and gets a descriptive timeout message.

Worst case, the wait phase is bounded at ~6 slots regardless of how many candidates the search rejects, where previously each rejected candidate pushed the bad checkpoints one slot further past the fixed timeout.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/d509a218614bf4ac) · group: `slackbot`*
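The slot search in the first step of the fix in #24017 can be illustrated with a minimal TypeScript sketch. This is not the test's actual code: `findWarpSlot`, `ProposerLookup`, `WarpPlan`, and `maxCandidates` are hypothetical names, proposers are modelled as plain strings, and the `EpochNotStable` handling (warping forward one epoch) is left out.

```typescript
// Hypothetical stand-in for however the test resolves the proposer of a slot.
type ProposerLookup = (slot: number) => string;

interface WarpPlan {
  warpSlot: number;
  badSlot1: number;
  badSlot2: number;
}

/**
 * Finds the first warpSlot at or after startSlot such that none of the
 * proposers of the lead-in slots (warpSlot+1..warpSlot+3) is also the
 * proposer of a bad slot (warpSlot+4 or warpSlot+5).
 */
function findWarpSlot(
  startSlot: number,
  proposerAt: ProposerLookup,
  maxCandidates: number = 64, // assumed search bound, not taken from the PR
): WarpPlan {
  for (let warpSlot = startSlot; warpSlot < startSlot + maxCandidates; warpSlot++) {
    const leadInProposers = [1, 2, 3].map(offset => proposerAt(warpSlot + offset));
    const badProposer1 = proposerAt(warpSlot + 4);
    const badProposer2 = proposerAt(warpSlot + 5);

    // Accept only if no lead-in proposer will also propose a bad slot.
    const disjoint = leadInProposers.every(p => p !== badProposer1 && p !== badProposer2);
    if (disjoint) {
      return { warpSlot, badSlot1: warpSlot + 4, badSlot2: warpSlot + 5 };
    }
  }
  throw new Error(`No suitable warp slot in ${maxCandidates} candidates starting at slot ${startSlot}`);
}

// Toy schedule indexed by slot: candidate 0 is rejected because 'B' proposes
// both lead-in slot 1 and bad slot 4; candidate 1 is accepted.
const schedule = ['A', 'B', 'C', 'A', 'B', 'D', 'E', 'F', 'G', 'H'];
const plan = findWarpSlot(0, (slot: number) => schedule[slot % schedule.length]);
console.log(plan); // { warpSlot: 1, badSlot1: 5, badSlot2: 6 }
```

Because the search runs with sequencers stopped, a rejected candidate only moves the warp target; it does not consume any of the real-time wait budget.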
Fixes a flake in `proposer invalidates multiple checkpoints` (`e2e_epochs/epochs_invalidate_block.parallel.test.ts`) reported on `v5-next`: failed run.

## Failure

`TimeoutError: Operation timed out after 256000ms`. 256000ms = `L2_SLOT_DURATION_IN_S * 8 * 1000`, and the missing custom error message identifies the bare `timeoutPromise(...)` racing the "wait for two bad checkpoints" promise, the only timeout call in the file without a message.

## Root cause
The bad-slot search introduced in #23608 rejects any candidate pair whose proposer also owns an earlier un-snapshotted pipelined slot. With 6 validators this can reject many pairs in a row: in the failed run, the current slot was 21 and the search rejected (24,25) through (29,30) before accepting slots 30 and 31 (proposer `0x14dc…` first appears in the schedule at slot 30; every earlier pair had p1/p2 owning a slot in the growing pre-bad window).

The wait for the two bad checkpoints, however, is fixed at 8 slots (256s) regardless of how far out the selected slots are. Timeline from the failed log:

- `22:44:39`: "First checkpoint mined, current slot is 21"; search selects bad slots 30/31; wait starts
- `22:45:11`…`22:48:23`: healthy checkpoints 3–9 mined at slots 22–28, one per 32s slot (the chain was fine; the malicious config was never exercised)
- `22:48:55`: timeout fires, exactly 256s after the wait began and before slot 30 even started (~22:49:00); its checkpoint could not land on L1 until ~22:49:30

With `badSlot1 = currentSlot + 9`, the 8-slot wait is mathematically unable to succeed, so the run was doomed at slot-selection time. In passing runs the first or an early candidate pair is accepted (`currentSlot + 3`/`+4`), and the checkpoints land ~130–160s into the 256s window.

## Fix
- Scale the wait to the selected `badSlot2` (`badSlot2 - currentSlot + 4` slots) instead of a fixed 8 slots, and give it a descriptive error message.
- Remove the `.then(() => [CheckpointNumber(0), CheckpointNumber(0)])` fallback: `timeoutPromise` rejects on timeout, so the fallback was dead code (the graceful-failure path it implied never ran).

Note for porting: the test file is identical on `next` and `v5-next` (modulo an unrelated `l1PublishingTimeline`), and the flake fired on `v5-next`, so the same change applies there (`merge-train/spartan-v5`).

Created by claudebox · group: `slackbot`
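The timeout arithmetic from the Failure and Fix sections can be checked with a short TypeScript sketch. The helper names (`fixedWaitMs`, `scaledWaitMs`, `badCheckpointTimeoutMessage`) are hypothetical and not taken from the test file; the 32s slot duration is inferred from `256000ms = L2_SLOT_DURATION_IN_S * 8 * 1000`.

```typescript
// Inferred from the failure analysis: 256000 ms / (8 slots * 1000) = 32 s per L2 slot.
const L2_SLOT_DURATION_IN_S = 32;
const FIXED_WAIT_SLOTS = 8;

/** The old behaviour: a fixed 8-slot wait, independent of the selected slots. */
function fixedWaitMs(slotDurationS: number = L2_SLOT_DURATION_IN_S): number {
  return FIXED_WAIT_SLOTS * slotDurationS * 1000;
}

/** The proposed behaviour: wait (badSlot2 - currentSlot + 4) slots. */
function scaledWaitMs(
  currentSlot: number,
  badSlot2: number,
  slotDurationS: number = L2_SLOT_DURATION_IN_S,
): number {
  if (badSlot2 <= currentSlot) {
    throw new RangeError(`badSlot2 (${badSlot2}) must be after currentSlot (${currentSlot})`);
  }
  const waitSlots = badSlot2 - currentSlot + 4;
  return waitSlots * slotDurationS * 1000;
}

/** Illustrative wording only: the point is that the timeout names what it waited for. */
function badCheckpointTimeoutMessage(currentSlot: number, badSlot1: number, badSlot2: number): string {
  return `Timed out waiting for bad checkpoints at slots ${badSlot1}/${badSlot2} (wait started at slot ${currentSlot})`;
}

// Values from the failed run: current slot 21, bad slots 30/31.
console.log(fixedWaitMs()); // 256000
console.log(scaledWaitMs(21, 31)); // 448000 (14 slots)
console.log(badCheckpointTimeoutMessage(21, 30, 31));
```

When an early pair is accepted (`badSlot2 = currentSlot + 4`), the scaled wait works out to 8 slots, the same 256000ms as before, so under these assumptions only the far-out selections get a longer budget.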