test: skip flaky sentinel_status_slash on merge-train/spartan#23443

Closed
AztecBot wants to merge 1 commit into
merge-train/spartan from
claudebox/skip-flaky-sentinel-status-slash

@AztecBot

Summary

Skip sentinel_status_slash.parallel.test.ts to unblock merge-train/spartan.

Failure

Merge-train CI run http://ci.aztec-labs.com/1779299975599110 failed on:

FAIL src/e2e_p2p/sentinel_status_slash.parallel.test.ts
  e2e_p2p_sentinel_status_slash
    ✕ slashes the proposer with INACTIVITY when checkpoint validation records unvalidated

Expected: "checkpoint-unvalidated"
Received: "blocks-missed"

Test log: http://ci.aztec-labs.com/852ace33ed5b6f03

Root Cause

The test (added in #23286) is timing-sensitive. It spawns one malicious validator with broadcastInvalidBlockProposal: true and expects honest observers to:

  1. Receive the malicious invalid block proposal
  2. Reject it via re-execution state_mismatch
  3. Skip pushing to their archiver
  4. Receive the malicious checkpoint proposal and fail to validate it → record unvalidated

In the failing run, the observer recorded blocks-missed, meaning it never saw the malicious node's block proposal at all. Even with the warp to the slot before the target proposer and mockGossipSubNetwork: true, the proposer-pipelining build phase for the target slot runs on the malicious node before the network has fully settled.

The test failed twice in a row (initial run + retry), so the existing e2e-p2p-epoch-flakes group didn't catch it.

Fix

Skip the test (assigned to palla) so the merge-train can advance while the timing issue is investigated. Follows the same approach used for epochs_l1_reorgs.parallel.test.ts.

Slack: https://aztecprotocol.slack.com/archives/C0AU8BULZHC/p1779301250083769

ClaudeBox log: https://claudebox.work/s/856f19a40ad6abca?run=1

AztecBot added the ci-draft (Run CI on draft PRs) and claudebox (Owned by claudebox; it can push to this PR) labels on May 20, 2026
AztecBot closed this on May 22, 2026
spalladino pushed a commit that referenced this pull request on May 22, 2026:
…heckpoint slot (#23483)

## Summary

Root-cause fix for the `sentinel_status_slash.parallel.test.ts` flake
(added in #23286) instead of skipping it. The two malicious-proposer
tests now discover the slot at which the fault is actually recorded
rather than asserting it on the pre-warped block-proposer slot.

(This PR previously carried a `.test_patterns.yml` skip on `next`; per
maintainer request the skip is dropped and the PR now carries the fix on
`merge-train/spartan`, where the test lives.)

## Root cause

The failing merge-train run (test log
http://ci.aztec-labs.com/83589a3b1ce4a8b7) shows every honest observer
recording the target proposer as `checkpoint-missed` at the warped slot
(15), with zero `checkpoint-invalid` / `checkpoint-unvalidated` /
`checkpoint-valid` for it anywhere — the malicious node didn't even
self-record `checkpoint-valid` for that slot.

`checkpoint-missed` means *block proposals were seen for the slot, but
no checkpoint-proposal re-execution outcome was recorded for it*
(`sentinel.ts` `getSlotActivity`). The re-execution outcome that drives
`checkpoint-invalid` / `checkpoint-unvalidated` is keyed by
`proposal.slotNumber` — the slot at which a node *closes a checkpoint*
(`proposal_handler.ts:887`). A proposer self-records `checkpoint-valid`
only when it actually builds a checkpoint proposal.

`warpToSlotBeforeTargetProposer` warps to the target's *block-proposer*
slot and the test then asserts the fault at exactly that slot. But a
proposer emits a checkpoint proposal only when its slot closes a
checkpoint, and that does not reliably coincide with the warped
block-proposer slot. When the warped slot is mid-checkpoint, no outcome
is ever keyed to it → `checkpoint-missed`. Whether the proposer slot
lines up with a checkpoint close shifts with pipelining/wall-clock
timing, so the test fails intermittently across runs. Within an affected
run it fails on every retry, so the `e2e-p2p-epoch-flakes` group does not
absorb it.
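
The slot mismatch described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in `sentinel.ts`: `ToySentinel`, its methods, and the closing slot 16 are hypothetical. Only the status names and the warped slot 15 come from the description. It also assumes that `blocks-missed` is what a slot gets when no block proposal was seen at all.

```typescript
// Toy model only: names and shapes are hypothetical, not the real sentinel API.
type CheckpointOutcome = 'valid' | 'invalid' | 'unvalidated';
type SlotStatus =
  | 'checkpoint-valid'
  | 'checkpoint-invalid'
  | 'checkpoint-unvalidated'
  | 'checkpoint-missed'
  | 'blocks-missed';

const OUTCOME_TO_STATUS: Record<CheckpointOutcome, SlotStatus> = {
  valid: 'checkpoint-valid',
  invalid: 'checkpoint-invalid',
  unvalidated: 'checkpoint-unvalidated',
};

class ToySentinel {
  // Slots for which at least one block proposal was seen.
  private blockProposalSlots = new Set<number>();
  // Re-execution outcomes, keyed by the slot that closes the checkpoint.
  private outcomesByClosingSlot = new Map<number, CheckpointOutcome>();

  recordBlockProposal(slot: number): void {
    this.blockProposalSlots.add(slot);
  }

  recordCheckpointOutcome(closingSlot: number, outcome: CheckpointOutcome): void {
    this.outcomesByClosingSlot.set(closingSlot, outcome);
  }

  statusAt(slot: number): SlotStatus {
    const outcome = this.outcomesByClosingSlot.get(slot);
    if (outcome !== undefined) {
      return OUTCOME_TO_STATUS[outcome];
    }
    // No outcome keyed to this slot: the status depends only on block proposals.
    return this.blockProposalSlots.has(slot) ? 'checkpoint-missed' : 'blocks-missed';
  }
}

const warpedSlot = 15; // slot from the failing run
const closingSlot = 16; // assumed for illustration: the checkpoint closes one slot later

const sentinel = new ToySentinel();
sentinel.recordBlockProposal(warpedSlot);
sentinel.recordBlockProposal(closingSlot);
sentinel.recordCheckpointOutcome(closingSlot, 'unvalidated');

console.log(sentinel.statusAt(warpedSlot)); // checkpoint-missed
console.log(sentinel.statusAt(closingSlot)); // checkpoint-unvalidated
```

An assertion pinned to `warpedSlot` sees `checkpoint-missed` even though the fault was detected, because the outcome is stored under `closingSlot`.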

## Fix

Stop pinning the pre-warped slot. New helper `findObservedStatusSlot`
polls an observer for the slot at which it records the expected fault
status (`checkpoint-unvalidated` / `checkpoint-invalid`); the test then
requires every honest observer to agree at that slot and the malicious
node to self-record `checkpoint-valid` for the same slot. The warp is
kept purely to cut wall-clock by advancing near the proposer region.

This stays faithful to what the test verifies: it still fails (times
out) if the malicious proposal is never detected, so it does not mask a
genuine propagation/detection bug — it only removes the brittle
assumption that the fault lands on one specific pre-chosen slot.
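
A minimal sketch of the polling approach, for orientation only. It is not the helper added in this PR: the types, the local `retryUntil` stand-in, the `fromSlot` parameter, and `makeFakeObserver` are all assumptions made so the example is self-contained. The repository's `getValidatorsStats()` result and `retryUntil` signature may differ.

```typescript
// Simplified stand-ins for the repository's types; the real shapes may differ.
type ValidatorStatusInSlot = string;
interface HistoryEntry {
  slot: number;
  status: ValidatorStatusInSlot;
}
interface StatsSource {
  getValidatorsStats(): Promise<{ stats: Record<string, { history: HistoryEntry[] }> }>;
}

// Local stand-in for a retry helper: polls `fn` until it yields a value or the timeout elapses.
async function retryUntil<T>(fn: () => Promise<T | undefined>, timeoutMs: number, intervalMs: number): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await fn();
    if (result !== undefined) {
      return result;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs}ms`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
  }
}

// Pure lookup: lowest slot at or after `fromSlot` whose recorded status matches.
function firstSlotWithStatus(
  history: HistoryEntry[],
  expected: ValidatorStatusInSlot,
  fromSlot: number,
): number | undefined {
  const slots = history.filter(e => e.slot >= fromSlot && e.status === expected).map(e => e.slot);
  return slots.length > 0 ? Math.min(...slots) : undefined;
}

// Polls one observer until it records `expected` for `validator`, then returns that slot.
// Rejects on timeout, so a fault that is never detected still fails the test.
async function findObservedStatusSlot(
  observer: StatsSource,
  validator: string,
  expected: ValidatorStatusInSlot,
  fromSlot: number,
  timeoutMs = 5_000,
  intervalMs = 20,
): Promise<number> {
  return retryUntil(
    async () => {
      const { stats } = await observer.getValidatorsStats();
      return firstSlotWithStatus(stats[validator]?.history ?? [], expected, fromSlot);
    },
    timeoutMs,
    intervalMs,
  );
}

// Demo-only fake: returns an empty history for the first `emptyPolls` calls.
function makeFakeObserver(validator: string, entries: HistoryEntry[], emptyPolls: number): StatsSource {
  let polls = 0;
  return {
    async getValidatorsStats() {
      polls++;
      return { stats: { [validator]: { history: polls > emptyPolls ? entries : [] } } };
    },
  };
}
```

In a test, the slot returned for one observer would then be the slot at which every other honest observer is checked for the same status, and at which the malicious node is checked for `checkpoint-valid`.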

## Verification

- Type/usage review only: `findObservedStatusSlot` reuses the same
`getValidatorsStats()` → `history` shape and `retryUntil` pattern as the
existing helpers; `SlotNumber` / `ValidatorStatusInSlot` / `retryUntil`
are already imported.
- **Not run locally**: this is an intermittent e2e p2p flake; a single
local run would not establish flake-freedom and a full build + repeated
e2e grind is impractical to do reliably in-session. CI on this draft
exercises the test; a rerun grind is the real confirmation bar.

## Related

- Replaces the skip approach (the prior content of this PR and the
now-closed #23443).
- Adjacent to `palla/fix-sentinel-inactivity-want-to-slash` (changes the
sentinel inactivity computation), which is independent of this
test-timing fix.