feat(sentinel): end-of-epoch evaluation with re-execution outcomes by PhilWindle · Pull Request #23286 · AztecProtocol/aztec-packages

PhilWindle · 2026-05-14T17:00:39Z

Summary

End-of-epoch sentinel evaluation. Sentinel now evaluates per-epoch validator performance once sentinelEpochEndBufferSlots (default 2) has elapsed past an epoch's last slot, instead of waiting for the chain-proven event. Inactivity slashing fires without depending on L1 proof publication (AZIP-7).
Six-case proposer status taxonomy driven by the checkpoint re-execution tracker: checkpoint-mined, checkpoint-valid, checkpoint-invalid, checkpoint-unvalidated, checkpoint-missed, blocks-missed. Missing-attestor faults apply only when status is checkpoint-mined or checkpoint-valid.
Re-execution tracker overhaul (CheckpointReexecutionTracker):
- Drops the redundant interface; single class exported under its plain name. Maps are brand-typed (Map<CheckpointNumber, …>, Map<SlotNumber, Entry>) so callers don't lose brand info on lookup.
- Records a three-way ReexecutionOutcome (valid / invalid / unvalidated) keyed by slot; hasReexecuted keeps its prior contract.
- New recordTxsCollected(slot, indexWithinCheckpoint, collected) / getTxsCollectedRecord(slot, idx). The validator's proposal handler calls this immediately after getTxsForBlockProposal; the data-withholding watcher short-circuits when every block in a published checkpoint has a true record. A false (deadline missed) does not slash by itself because the data-withholding tolerance window is longer than the re-execution deadline.
ProposalHandler.validateCheckpointProposal records an outcome at every early-return; invalid_signature and checkpoint_already_published deliberately skip.
New broadcastInvalidCheckpointProposalOnly config flag (SequencerConfig). Lets tests corrupt the checkpoint archive while keeping the underlying block proposals valid — exercises the 'invalid' path cleanly. When unset, behaviour follows broadcastInvalidBlockProposal (existing behaviour).

Scope

Closes A-538. The tracker enhancement also lays groundwork for A-537's PROPOSER_INVALID_PROPOSAL offence — invalid outcomes are recorded but not yet consumed by a watcher; that watcher is a follow-on owned separately.

Breaking changes

SentinelStore.SCHEMA_VERSION bumped 3 → 4; legacy provenMap data is wiped on upgrade.
SingleValidatorStats.allTimeProvenPerformance renamed to allTimeEpochPerformance.
provenMap / getProvenPerformance / updateProvenPerformance renamed to epochMap / getEpochPerformance / updateEpochPerformance.
Env var SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHS renamed to SENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS.
InMemoryCheckpointReexecutionTracker renamed to CheckpointReexecutionTracker (interface dropped).

Base branch

Based on phil/a-523-... because it shares the CheckpointReexecutionTracker plumbing. Will rebase onto merge-train/spartan once A-523 merges.

Test plan

Unit tests for the new tracker outcomes + recordTxsCollected / getTxsCollectedRecord (stdlib/src/checkpoint/checkpoint_reexecution_tracker.test.ts).
Sentinel unit tests rewritten around the six-case taxonomy and end-of-epoch trigger (aztec-node/src/sentinel/sentinel.test.ts — 46 tests passing).
Store unit tests updated for renamed APIs (aztec-node/src/sentinel/store.test.ts).
Existing proposal_handler.test.ts and data_withholding_watcher.test.ts continue to pass.
multiple_validators_sentinel.parallel.test.ts updated for new status names; new test asserts per-epoch performance is recorded shortly after epoch end without waiting for L1 proof.
New sentinel_status_slash.parallel.test.ts drives all three slashable sentinel statuses end-to-end via config flags (no jest stubbing):
- 'unvalidated' via broadcastInvalidBlockProposal
- 'invalid' via broadcastInvalidCheckpointProposalOnly
- missing-attestor via tryStop on one node
  Each malicious-proposer test warps the L1 clock to the slot just before the target's proposer slot (accounting for pipelining) to keep wall-clock time down. All three pass locally in ~4 min total.

Docs

New yarn-project/aztec-node/src/sentinel/README.md documenting the module.
yarn-project/slasher/README.md Inactivity section updated to match the new behaviour.

spalladino

Looks great! One thing that Claude found: since proposals-from-self skip the proposal handler, a node will always report itself as inactive in its sentinel. It's no big deal since nodes are smart enough not to slash themselves, but it can be annoying for a node that relies on its own sentinel for tracking its activity.

spalladino · 2026-05-16T01:02:09Z

      );
      const result: CheckpointProposalValidationResult = { isValid: false, reason: 'invalid_fee_asset_price_modifier' };
      this.lastCheckpointValidationResult = { payloadHash, result };
+      this.reexecutionTracker.recordOutcome(slot, proposal.archive, 'invalid');


Refactoring nit: I'd set up a mapping from validation result to outcome, and just record this on the caller.

spalladino · 2026-05-16T01:03:25Z

+  it('slashes the proposer with INACTIVITY when checkpoint validation records unvalidated', async () => {
+    // One malicious node broadcasts invalid block proposals; honest observers reject them via
+    // re-execution state_mismatch and therefore never push to their archivers, so the malicious
+    // node's checkpoint proposals can't find their last block and observers record `unvalidated`.
+    const targetAddress = await spawnMaliciousAndHonestNodes({ broadcastInvalidBlockProposal: true });
+    const targetSlot = await warpToSlotBeforeTargetProposer(targetAddress);
+    // nodes[0] is the malicious node; honest observers are nodes[1..].
+    const honestObservers = nodes.slice(1);
+    await assertAllObserversSentinelStatus(honestObservers, targetAddress, targetSlot, 'checkpoint-unvalidated');
+    await assertInactivityOffenseFor(targetAddress, nodes[1]);
+  });


All e2e tests should read like this

PhilWindle · 2026-05-18T10:06:24Z

Looks great! One thing that Claude found: since proposals-from-self skip the proposal handler, a node will always report itself as inactive in its sentinel. It's no big deal since nodes are smart enough not to slash themselves, but it can be annoying for a node that relies on its own sentinel for tracking its activity.

This should be addressed now. Proposers record their own proposals as valid.

Sentinel now evaluates per-epoch validator performance once sentinelEpochEndBufferSlots (default 2) has elapsed past an epoch's last slot, instead of waiting for the chain-proven event. Inactivity slashing fires without depending on L1 proof publication (AZIP-7). Proposer status follows a six-case taxonomy driven by the checkpoint re-execution tracker: checkpoint-mined, checkpoint-valid, checkpoint-invalid, checkpoint-unvalidated, checkpoint-missed, blocks-missed. Missing-attestor faults apply only when status is checkpoint-mined or checkpoint-valid. CheckpointReexecutionTracker overhaul: - Drops the redundant interface; single class exported under its plain name. Maps are brand-typed so callers don't lose brand info on lookup. - Records a three-way ReexecutionOutcome (valid / invalid / unvalidated) keyed by slot; hasReexecuted keeps its prior contract. - New recordTxsCollected(slot, indexWithinCheckpoint, collected) / getTxsCollectedRecord(slot, idx). The validator's proposal handler calls this immediately after getTxsForBlockProposal; the data-withholding watcher short-circuits when every block in a published checkpoint has a true record. A false (deadline missed) does not slash by itself because the DW tolerance window is longer than the re-execution deadline. ProposalHandler.validateCheckpointProposal records an outcome at every early-return; invalid_signature and checkpoint_already_published deliberately skip. New broadcastInvalidCheckpointProposalOnly config flag on SequencerConfig: lets tests corrupt the checkpoint archive while keeping the underlying block proposals valid — exercises the 'invalid' path cleanly. Unset preserves existing behaviour (follows broadcastInvalidBlockProposal). Breaking changes: - SentinelStore.SCHEMA_VERSION bumped 3 -> 4; legacy provenMap data is wiped on upgrade. - SingleValidatorStats.allTimeProvenPerformance renamed to allTimeEpochPerformance. - provenMap / getProvenPerformance / updateProvenPerformance renamed to epochMap / getEpochPerformance / updateEpochPerformance. - Env var SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHS renamed to SENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS. - InMemoryCheckpointReexecutionTracker renamed to CheckpointReexecutionTracker (interface dropped).

CI's container-name derivation rejects backticks (allowed pattern is [a-zA-Z0-9][a-zA-Z0-9_.-]+) so the test runner failed at docker run before either test could execute. Drop the inline-code formatting from the it() titles.

…heckpoint slot (#23483) ## Summary Root-cause fix for the `sentinel_status_slash.parallel.test.ts` flake (added in #23286) instead of skipping it. The two malicious-proposer tests now discover the slot at which the fault is actually recorded rather than asserting it on the pre-warped block-proposer slot. (This PR previously carried a `.test_patterns.yml` skip on `next`; per maintainer request the skip is dropped and the PR now carries the fix on `merge-train/spartan`, where the test lives.) ## Root cause The failing merge-train run (test log http://ci.aztec-labs.com/83589a3b1ce4a8b7) shows every honest observer recording the target proposer as `checkpoint-missed` at the warped slot (15), with zero `checkpoint-invalid` / `checkpoint-unvalidated` / `checkpoint-valid` for it anywhere — the malicious node didn't even self-record `checkpoint-valid` for that slot. `checkpoint-missed` means *block proposals were seen for the slot, but no checkpoint-proposal re-execution outcome was recorded for it* (`sentinel.ts` `getSlotActivity`). The re-execution outcome that drives `checkpoint-invalid` / `checkpoint-unvalidated` is keyed by `proposal.slotNumber` — the slot at which a node *closes a checkpoint* (`proposal_handler.ts:887`). A proposer self-records `checkpoint-valid` only when it actually builds a checkpoint proposal. `warpToSlotBeforeTargetProposer` warps to the target's *block-proposer* slot and the test then asserts the fault at exactly that slot. But a proposer emits a checkpoint proposal only when its slot closes a checkpoint, and that does not reliably coincide with the warped block-proposer slot. When the warped slot is mid-checkpoint, no outcome is ever keyed to it → `checkpoint-missed`. Whether the proposer slot lines up with a checkpoint close shifts with pipelining/wall-clock timing, so it flakes (it fails on every retry, so the `e2e-p2p-epoch-flakes` group does not absorb it). ## Fix Stop pinning the pre-warped slot. New helper `findObservedStatusSlot` polls an observer for the slot at which it records the expected fault status (`checkpoint-unvalidated` / `checkpoint-invalid`); the test then requires every honest observer to agree at that slot and the malicious node to self-record `checkpoint-valid` for the same slot. The warp is kept purely to cut wall-clock by advancing near the proposer region. This stays faithful to what the test verifies: it still fails (times out) if the malicious proposal is never detected, so it does not mask a genuine propagation/detection bug — it only removes the brittle assumption that the fault lands on one specific pre-chosen slot. ## Verification - Type/usage review only: `findObservedStatusSlot` reuses the same `getValidatorsStats()` → `history` shape and `retryUntil` pattern as the existing helpers; `SlotNumber` / `ValidatorStatusInSlot` / `retryUntil` are already imported. - **Not run locally**: this is an intermittent e2e p2p flake; a single local run would not establish flake-freedom and a full build + repeated e2e grind is impractical to do reliably in-session. CI on this draft exercises the test; a rerun grind is the real confirmation bar. ## Related - Replaces the skip approach (the prior content of this PR and the now-closed #23443). - Adjacent to `palla/fix-sentinel-inactivity-want-to-slash` (changes the sentinel inactivity computation), which is independent of this test-timing fix.

BEGIN_COMMIT_OVERRIDE refactor(p2p): merge FastTxCollection into TxCollection with sequential pipeline (AztecProtocol#23245) refactor(publisher): bundle-level simulate; drop per-action enqueue sims (AztecProtocol#23165) refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases (AztecProtocol#23249) test(e2e): fix race in 'proposer invalidates multiple checkpoints' (AztecProtocol#23259) fix: clean up old jobs regardless of pending status (AztecProtocol#23260) refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273) chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276) feat: slash truncated checkpoint proposals (AztecProtocol#23250) refactor: remove unused map in attestation pool (AztecProtocol#23284) chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274) refactor(l1-tx-utils): use DateProvider for fail-fast timeout check (AztecProtocol#23257) feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277) test(e2e): fix race in broadcasted_invalid_block_proposal_slash under pipelining (AztecProtocol#23302) fix(archiver): atomic getter for L2 tips (AztecProtocol#23295) fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under pipelining (AztecProtocol#23296) fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298) test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275) chore: update dashboard (AztecProtocol#23312) chore: Revert "feat(sandbox): support proposer pipelining in local network" (AztecProtocol#23313) test: slash on bad attestation (AztecProtocol#23184) feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116) test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301) test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300) test(e2e): switch fee_settings to organic fee bumps under pipelining (AztecProtocol#23303) fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures (AztecProtocol#23333) test(e2e): wait for real oracle rotation in fee_settings inflate helper (AztecProtocol#23334) test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining (AztecProtocol#23336) fix(spartan-bench): tolerate older node images in SlasherConfig schema (AztecProtocol#23351) fix: interrupt prover jobs in stop (AztecProtocol#23358) test(e2e): enable pipelining on bot, fees, and avm simulator tests (AztecProtocol#23329) feat(sentinel): end-of-epoch evaluation with re-execution outcomes (AztecProtocol#23286) feat: slash for invalid checkpoint proposals (AztecProtocol#23270) fix: fork closure in epoch proving jobs (AztecProtocol#23390) fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394) fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396) test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in epochs tests (AztecProtocol#23424) chore: refactor weekly proving test wait (AztecProtocol#23395) refactor: add fifo set (AztecProtocol#23271) feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327) fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371) chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373) feat(sequencer): AutomineSequencer for single-sequencer e2e tests (AztecProtocol#23354) fix(prover): wait for previous epoch to be proven (AztecProtocol#23458) chore: collocate provers (AztecProtocol#23439) chore: rm staging-ignition (AztecProtocol#23440) chore: rm unused networks (AztecProtocol#23441) test(e2e): migrate block_building, multi_validator_node, publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414) fix(archiver): reconcile local blocks with L1 checkpoints by block number (AztecProtocol#23461) feat: Updated slash conditions on block proposals (AztecProtocol#23466) test(e2e): migrate HA full test to pipelining (AztecProtocol#23463) chore: update resource profiles (AztecProtocol#23442) chore: update debug log levels (AztecProtocol#23456) test: fix flaky sentinel_status_slash by asserting the fault on the checkpoint slot (AztecProtocol#23483) feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980) (AztecProtocol#23436) refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID -> PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468) fix: reject block proposals in poisoned slots (AztecProtocol#23411) fix: retry nargo dep + solc downloads to survive transient DNS drops (AztecProtocol#23490) fix: enrich json-rpc tracing (AztecProtocol#23412) feat: add trace export controls (AztecProtocol#23413) test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496) test: cover invalid checkpoint proposal slashing (AztecProtocol#23503) test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482) test: flag e2e_slashing_attested_invalid_proposal as flake under pipelining (AztecProtocol#23501) test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining (AztecProtocol#23515) test(e2e): require cross-observer agreement on sentinel fault slot (AztecProtocol#23513) test: flag e2e_ha_full afterAll hook timeout as flake under pipelining (AztecProtocol#23524) fix(e2e): propagate l1ContractsArgs into node config so archiver matches L1 (AztecProtocol#23514) test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as flake (AztecProtocol#23528) test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1 race (AztecProtocol#23533) test(e2e_ha_full): parallel HA peer node teardown with per-node deadline (AztecProtocol#23539) test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541) test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542) test(ci): skip e2e_multi_validator_node_key_store entirely on merge-train/spartan (AztecProtocol#23544) END_COMMIT_OVERRIDE

PhilWindle added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 14, 2026

PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 425d9bf to cd34200 Compare May 14, 2026 17:39

PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch 2 times, most recently from f2ae630 to eb8a096 Compare May 15, 2026 15:47

Base automatically changed from phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 to merge-train/spartan May 15, 2026 15:56

PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch from eb8a096 to 4acc6d8 Compare May 15, 2026 17:55

PhilWindle marked this pull request as ready for review May 15, 2026 17:56

spalladino approved these changes May 16, 2026

View reviewed changes

ludamad force-pushed the merge-train/spartan branch from 4af2626 to db4ec58 Compare May 16, 2026 19:07

ludamad requested review from IlyasRidhuan, LHerskind, Maddiaa0, MirandaWood, jeanmon and just-mitch as code owners May 16, 2026 19:07

PhilWindle enabled auto-merge (squash) May 18, 2026 10:06

PhilWindle added 5 commits May 18, 2026 10:28

Address review comments

f1d61f8

Rafactor

b363064

chore: disable camelcase lint for protocol-reason map

eab8434

PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch from 8bb558d to eab8434 Compare May 18, 2026 10:29

PhilWindle disabled auto-merge May 18, 2026 10:33

PhilWindle enabled auto-merge (squash) May 18, 2026 10:37

PhilWindle merged commit 87c71ce into merge-train/spartan May 18, 2026
14 checks passed

PhilWindle deleted the phil/a-538-track-performance-at-the-end-of-every-epoch branch May 18, 2026 10:57

This was referenced May 18, 2026

feat: merge-train/spartan #23344

Merged

test(ci): tolerate pipelined-publisher flake in e2e_scope_isolation #23367

Closed

This was referenced May 19, 2026

test(ci): flake-group sentinel_status_slash warp helper #23408

Closed

test: skip flaky sentinel_status_slash on merge-train/spartan #23443

Closed

test: fix flaky sentinel_status_slash by asserting the fault on the checkpoint slot #23483

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sentinel): end-of-epoch evaluation with re-execution outcomes#23286

feat(sentinel): end-of-epoch evaluation with re-execution outcomes#23286
PhilWindle merged 5 commits into
merge-train/spartanfrom
phil/a-538-track-performance-at-the-end-of-every-epoch

PhilWindle commented May 14, 2026 •

edited

Loading

Uh oh!

spalladino left a comment

Uh oh!

spalladino May 16, 2026

Uh oh!

PhilWindle May 18, 2026

Uh oh!

spalladino May 16, 2026

Uh oh!

PhilWindle commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

PhilWindle commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Breaking changes

Base branch

Test plan

Docs

Uh oh!

spalladino left a comment

Choose a reason for hiding this comment

Uh oh!

spalladino May 16, 2026

Choose a reason for hiding this comment

Uh oh!

PhilWindle May 18, 2026

Choose a reason for hiding this comment

Uh oh!

spalladino May 16, 2026

Choose a reason for hiding this comment

Uh oh!

PhilWindle commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PhilWindle commented May 14, 2026 •

edited

Loading