test: skip flaky sentinel_status_slash on merge-train/spartan#23443
Closed
AztecBot wants to merge 1 commit into
spalladino pushed a commit that referenced this pull request on May 22, 2026
…heckpoint slot (#23483)

## Summary

Root-cause fix for the `sentinel_status_slash.parallel.test.ts` flake (added in #23286) instead of skipping it. The two malicious-proposer tests now discover the slot at which the fault is actually recorded rather than asserting it on the pre-warped block-proposer slot.

(This PR previously carried a `.test_patterns.yml` skip on `next`; per maintainer request the skip is dropped and the PR now carries the fix on `merge-train/spartan`, where the test lives.)

## Root cause

The failing merge-train run (test log http://ci.aztec-labs.com/83589a3b1ce4a8b7) shows every honest observer recording the target proposer as `checkpoint-missed` at the warped slot (15), with zero `checkpoint-invalid` / `checkpoint-unvalidated` / `checkpoint-valid` for it anywhere. The malicious node didn't even self-record `checkpoint-valid` for that slot.

`checkpoint-missed` means *block proposals were seen for the slot, but no checkpoint-proposal re-execution outcome was recorded for it* (`sentinel.ts` `getSlotActivity`). The re-execution outcome that drives `checkpoint-invalid` / `checkpoint-unvalidated` is keyed by `proposal.slotNumber`, the slot at which a node *closes a checkpoint* (`proposal_handler.ts:887`). A proposer self-records `checkpoint-valid` only when it actually builds a checkpoint proposal.

`warpToSlotBeforeTargetProposer` warps to the target's *block-proposer* slot and the test then asserts the fault at exactly that slot. But a proposer emits a checkpoint proposal only when its slot closes a checkpoint, and that does not reliably coincide with the warped block-proposer slot. When the warped slot is mid-checkpoint, no outcome is ever keyed to it → `checkpoint-missed`. Whether the proposer slot lines up with a checkpoint close shifts with pipelining/wall-clock timing, so it flakes (it fails on every retry, so the `e2e-p2p-epoch-flakes` group does not absorb it).

## Fix

Stop pinning the pre-warped slot. New helper `findObservedStatusSlot` polls an observer for the slot at which it records the expected fault status (`checkpoint-unvalidated` / `checkpoint-invalid`); the test then requires every honest observer to agree at that slot and the malicious node to self-record `checkpoint-valid` for the same slot. The warp is kept purely to cut wall-clock by advancing near the proposer region.

This stays faithful to what the test verifies: it still fails (times out) if the malicious proposal is never detected, so it does not mask a genuine propagation/detection bug. It only removes the brittle assumption that the fault lands on one specific pre-chosen slot.

## Verification

- Type/usage review only: `findObservedStatusSlot` reuses the same `getValidatorsStats()` → `history` shape and `retryUntil` pattern as the existing helpers; `SlotNumber` / `ValidatorStatusInSlot` / `retryUntil` are already imported.
- **Not run locally**: this is an intermittent e2e p2p flake; a single local run would not establish flake-freedom and a full build + repeated e2e grind is impractical to do reliably in-session. CI on this draft exercises the test; a rerun grind is the real confirmation bar.

## Related

- Replaces the skip approach (the prior content of this PR and the now-closed #23443).
- Adjacent to `palla/fix-sentinel-inactivity-want-to-slash` (changes the sentinel inactivity computation), which is independent of this test-timing fix.
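For illustration only, the slot-discovery idea described under "Fix" could look roughly like the sketch below. This is not the helper from the commit: the type shapes, the `Observer` interface, and the names `firstSlotWithStatus` and `pollForStatusSlot` are all assumptions made up for this example, and an inline polling loop stands in for the repository's `retryUntil`.

```typescript
// Hypothetical stand-ins for the real types; the shapes are assumptions.
type SlotNumber = number;
type ValidatorStatusInSlot =
  | 'checkpoint-valid'
  | 'checkpoint-invalid'
  | 'checkpoint-unvalidated'
  | 'checkpoint-missed'
  | 'blocks-missed';

interface SlotStatus {
  slot: SlotNumber;
  status: ValidatorStatusInSlot;
}
type ValidatorHistory = SlotStatus[];

// Assumed observer shape: validator address -> per-slot status history.
interface Observer {
  getValidatorsStats(): Promise<{ stats: Record<string, { history: ValidatorHistory }> }>;
}

// Pure step: the earliest slot at or after `fromSlot` whose recorded status
// is one of the expected fault statuses, or undefined if none is recorded yet.
function firstSlotWithStatus(
  history: ValidatorHistory,
  expected: ValidatorStatusInSlot[],
  fromSlot: SlotNumber,
): SlotNumber | undefined {
  const hits = history
    .filter(h => h.slot >= fromSlot && expected.indexOf(h.status) !== -1)
    .sort((a, b) => a.slot - b.slot);
  return hits.length > 0 ? hits[0].slot : undefined;
}

// Polling step: ask the observer repeatedly until a matching slot appears.
// Throws on timeout, so a fault that is never detected still fails the test.
async function pollForStatusSlot(
  observer: Observer,
  validator: string,
  expected: ValidatorStatusInSlot[],
  fromSlot: SlotNumber,
  timeoutMs = 60000,
  intervalMs = 500,
): Promise<SlotNumber> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { stats } = await observer.getValidatorsStats();
    const entry = stats[validator];
    const slot = firstSlotWithStatus(entry ? entry.history : [], expected, fromSlot);
    if (slot !== undefined) {
      return slot;
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out waiting for ${expected.join(' / ')} on ${validator}`);
}
```

Under these assumptions, a test would call `pollForStatusSlot` on one honest observer, then assert that the remaining observers report the same status at the returned slot, rather than asserting on the slot chosen before the warp.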
## Summary

Skip `sentinel_status_slash.parallel.test.ts` to unblock `merge-train/spartan`.

## Failure

Merge-train CI run http://ci.aztec-labs.com/1779299975599110 failed on:

Test log: http://ci.aztec-labs.com/852ace33ed5b6f03

## Root Cause

The test (added in #23286) is timing-sensitive. It spawns one malicious validator with `broadcastInvalidBlockProposal: true` and expects honest observers to:

- … `state_mismatch` …
- … `unvalidated` …

In the failing run, the observer recorded `blocks-missed`, meaning it never saw the malicious node's block proposal at all. Despite the warp-to-slot-before-target and `mockGossipSubNetwork: true` setup, the proposer-pipelining build phase for the target slot lands on the malicious node before the network has fully settled.

The test failed twice in a row (initial run + retry), so the existing `e2e-p2p-epoch-flakes` group didn't catch it.

## Fix

Skip the test (assigned to palla) so the merge-train can advance while the timing issue is investigated. Follows the same approach used for `epochs_l1_reorgs.parallel.test.ts`.

Slack: https://aztecprotocol.slack.com/archives/C0AU8BULZHC/p1779301250083769

ClaudeBox log: https://claudebox.work/s/856f19a40ad6abca?run=1