Skip to content

feat(sentinel): end-of-epoch evaluation with re-execution outcomes#23286

Merged
PhilWindle merged 5 commits into
merge-train/spartanfrom
phil/a-538-track-performance-at-the-end-of-every-epoch
May 18, 2026
Merged

feat(sentinel): end-of-epoch evaluation with re-execution outcomes#23286
PhilWindle merged 5 commits into
merge-train/spartanfrom
phil/a-538-track-performance-at-the-end-of-every-epoch

Conversation

@PhilWindle

@PhilWindle PhilWindle commented May 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • End-of-epoch sentinel evaluation. Sentinel now evaluates per-epoch validator performance once sentinelEpochEndBufferSlots (default 2) has elapsed past an epoch's last slot, instead of waiting for the chain-proven event. Inactivity slashing fires without depending on L1 proof publication (AZIP-7).
  • Six-case proposer status taxonomy driven by the checkpoint re-execution tracker: checkpoint-mined, checkpoint-valid, checkpoint-invalid, checkpoint-unvalidated, checkpoint-missed, blocks-missed. Missing-attestor faults apply only when status is checkpoint-mined or checkpoint-valid.
  • Re-execution tracker overhaul (CheckpointReexecutionTracker):
    • Drops the redundant interface; single class exported under its plain name. Maps are brand-typed (Map<CheckpointNumber, …>, Map<SlotNumber, Entry>) so callers don't lose brand info on lookup.
    • Records a three-way ReexecutionOutcome (valid / invalid / unvalidated) keyed by slot; hasReexecuted keeps its prior contract.
    • New recordTxsCollected(slot, indexWithinCheckpoint, collected) / getTxsCollectedRecord(slot, idx). The validator's proposal handler calls this immediately after getTxsForBlockProposal; the data-withholding watcher short-circuits when every block in a published checkpoint has a true record. A false (deadline missed) does not slash by itself because the data-withholding tolerance window is longer than the re-execution deadline.
  • ProposalHandler.validateCheckpointProposal records an outcome at every early-return; invalid_signature and checkpoint_already_published deliberately skip.
  • New broadcastInvalidCheckpointProposalOnly config flag (SequencerConfig). Lets tests corrupt the checkpoint archive while keeping the underlying block proposals valid — exercises the 'invalid' path cleanly. When unset, behaviour follows broadcastInvalidBlockProposal (existing behaviour).

Scope

Closes A-538. The tracker enhancement also lays groundwork for A-537's PROPOSER_INVALID_PROPOSAL offence — invalid outcomes are recorded but not yet consumed by a watcher; that watcher is a follow-on owned separately.

Breaking changes

  • SentinelStore.SCHEMA_VERSION bumped 3 → 4; legacy provenMap data is wiped on upgrade.
  • SingleValidatorStats.allTimeProvenPerformance renamed to allTimeEpochPerformance.
  • provenMap / getProvenPerformance / updateProvenPerformance renamed to epochMap / getEpochPerformance / updateEpochPerformance.
  • Env var SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHS renamed to SENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS.
  • InMemoryCheckpointReexecutionTracker renamed to CheckpointReexecutionTracker (interface dropped).

Base branch

Based on phil/a-523-... because it shares the CheckpointReexecutionTracker plumbing. Will rebase onto merge-train/spartan once A-523 merges.

Test plan

  • Unit tests for the new tracker outcomes + recordTxsCollected / getTxsCollectedRecord (stdlib/src/checkpoint/checkpoint_reexecution_tracker.test.ts).
  • Sentinel unit tests rewritten around the six-case taxonomy and end-of-epoch trigger (aztec-node/src/sentinel/sentinel.test.ts — 46 tests passing).
  • Store unit tests updated for renamed APIs (aztec-node/src/sentinel/store.test.ts).
  • Existing proposal_handler.test.ts and data_withholding_watcher.test.ts continue to pass.
  • multiple_validators_sentinel.parallel.test.ts updated for new status names; new test asserts per-epoch performance is recorded shortly after epoch end without waiting for L1 proof.
  • New sentinel_status_slash.parallel.test.ts drives all three slashable sentinel statuses end-to-end via config flags (no jest stubbing):
    • 'unvalidated' via broadcastInvalidBlockProposal
    • 'invalid' via broadcastInvalidCheckpointProposalOnly
    • missing-attestor via tryStop on one node
      Each malicious-proposer test warps the L1 clock to the slot just before the target's proposer slot (accounting for pipelining) to keep wall-clock time down. All three pass locally in ~4 min total.

Docs

  • New yarn-project/aztec-node/src/sentinel/README.md documenting the module.
  • yarn-project/slasher/README.md Inactivity section updated to match the new behaviour.

@PhilWindle PhilWindle added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 14, 2026
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 425d9bf to cd34200 Compare May 14, 2026 17:39
@PhilWindle PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch 2 times, most recently from f2ae630 to eb8a096 Compare May 15, 2026 15:47
Base automatically changed from phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 to merge-train/spartan May 15, 2026 15:56
@PhilWindle PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch from eb8a096 to 4acc6d8 Compare May 15, 2026 17:55
@PhilWindle PhilWindle marked this pull request as ready for review May 15, 2026 17:56

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! One thing that Claude found: since proposals-from-self skip the proposal handler, a node will always report itself as inactive in its sentinel. It's no big deal since nodes are smart enough not to slash themselves, but it can be annoying for a node that relies on its own sentinel for tracking its activity.

);
const result: CheckpointProposalValidationResult = { isValid: false, reason: 'invalid_fee_asset_price_modifier' };
this.lastCheckpointValidationResult = { payloadHash, result };
this.reexecutionTracker.recordOutcome(slot, proposal.archive, 'invalid');

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactoring nit: I'd set up a mapping from validation result to outcome, and just record this on the caller.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +138 to +148
it('slashes the proposer with INACTIVITY when checkpoint validation records unvalidated', async () => {
// One malicious node broadcasts invalid block proposals; honest observers reject them via
// re-execution state_mismatch and therefore never push to their archivers, so the malicious
// node's checkpoint proposals can't find their last block and observers record `unvalidated`.
const targetAddress = await spawnMaliciousAndHonestNodes({ broadcastInvalidBlockProposal: true });
const targetSlot = await warpToSlotBeforeTargetProposer(targetAddress);
// nodes[0] is the malicious node; honest observers are nodes[1..].
const honestObservers = nodes.slice(1);
await assertAllObserversSentinelStatus(honestObservers, targetAddress, targetSlot, 'checkpoint-unvalidated');
await assertInactivityOffenseFor(targetAddress, nodes[1]);
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All e2e tests should read like this

@PhilWindle

Copy link
Copy Markdown
Collaborator Author

Looks great! One thing that Claude found: since proposals-from-self skip the proposal handler, a node will always report itself as inactive in its sentinel. It's no big deal since nodes are smart enough not to slash themselves, but it can be annoying for a node that relies on its own sentinel for tracking its activity.

This should be addressed now. Proposers record their own proposals as valid.

@PhilWindle PhilWindle enabled auto-merge (squash) May 18, 2026 10:06
Sentinel now evaluates per-epoch validator performance once
sentinelEpochEndBufferSlots (default 2) has elapsed past an epoch's
last slot, instead of waiting for the chain-proven event. Inactivity
slashing fires without depending on L1 proof publication (AZIP-7).

Proposer status follows a six-case taxonomy driven by the checkpoint
re-execution tracker: checkpoint-mined, checkpoint-valid,
checkpoint-invalid, checkpoint-unvalidated, checkpoint-missed,
blocks-missed. Missing-attestor faults apply only when status is
checkpoint-mined or checkpoint-valid.

CheckpointReexecutionTracker overhaul:
- Drops the redundant interface; single class exported under its
  plain name. Maps are brand-typed so callers don't lose brand info
  on lookup.
- Records a three-way ReexecutionOutcome (valid / invalid /
  unvalidated) keyed by slot; hasReexecuted keeps its prior contract.
- New recordTxsCollected(slot, indexWithinCheckpoint, collected) /
  getTxsCollectedRecord(slot, idx). The validator's proposal handler
  calls this immediately after getTxsForBlockProposal; the
  data-withholding watcher short-circuits when every block in a
  published checkpoint has a true record. A false (deadline missed)
  does not slash by itself because the DW tolerance window is longer
  than the re-execution deadline.

ProposalHandler.validateCheckpointProposal records an outcome at
every early-return; invalid_signature and checkpoint_already_published
deliberately skip.

New broadcastInvalidCheckpointProposalOnly config flag on
SequencerConfig: lets tests corrupt the checkpoint archive while
keeping the underlying block proposals valid — exercises the 'invalid'
path cleanly. Unset preserves existing behaviour (follows
broadcastInvalidBlockProposal).

Breaking changes:
- SentinelStore.SCHEMA_VERSION bumped 3 -> 4; legacy provenMap data
  is wiped on upgrade.
- SingleValidatorStats.allTimeProvenPerformance renamed to
  allTimeEpochPerformance.
- provenMap / getProvenPerformance / updateProvenPerformance renamed
  to epochMap / getEpochPerformance / updateEpochPerformance.
- Env var SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHS
  renamed to SENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS.
- InMemoryCheckpointReexecutionTracker renamed to
  CheckpointReexecutionTracker (interface dropped).
CI's container-name derivation rejects backticks (allowed pattern is
[a-zA-Z0-9][a-zA-Z0-9_.-]+) so the test runner failed at docker run
before either test could execute. Drop the inline-code formatting from
the it() titles.
@PhilWindle PhilWindle force-pushed the phil/a-538-track-performance-at-the-end-of-every-epoch branch from 8bb558d to eab8434 Compare May 18, 2026 10:29
@PhilWindle PhilWindle disabled auto-merge May 18, 2026 10:33
@PhilWindle PhilWindle enabled auto-merge (squash) May 18, 2026 10:37
@PhilWindle PhilWindle merged commit 87c71ce into merge-train/spartan May 18, 2026
14 checks passed
@PhilWindle PhilWindle deleted the phil/a-538-track-performance-at-the-end-of-every-epoch branch May 18, 2026 10:57
spalladino pushed a commit that referenced this pull request May 22, 2026
…heckpoint slot (#23483)

## Summary

Root-cause fix for the `sentinel_status_slash.parallel.test.ts` flake
(added in #23286) instead of skipping it. The two malicious-proposer
tests now discover the slot at which the fault is actually recorded
rather than asserting it on the pre-warped block-proposer slot.

(This PR previously carried a `.test_patterns.yml` skip on `next`; per
maintainer request the skip is dropped and the PR now carries the fix on
`merge-train/spartan`, where the test lives.)

## Root cause

The failing merge-train run (test log
http://ci.aztec-labs.com/83589a3b1ce4a8b7) shows every honest observer
recording the target proposer as `checkpoint-missed` at the warped slot
(15), with zero `checkpoint-invalid` / `checkpoint-unvalidated` /
`checkpoint-valid` for it anywhere — the malicious node didn't even
self-record `checkpoint-valid` for that slot.

`checkpoint-missed` means *block proposals were seen for the slot, but
no checkpoint-proposal re-execution outcome was recorded for it*
(`sentinel.ts` `getSlotActivity`). The re-execution outcome that drives
`checkpoint-invalid` / `checkpoint-unvalidated` is keyed by
`proposal.slotNumber` — the slot at which a node *closes a checkpoint*
(`proposal_handler.ts:887`). A proposer self-records `checkpoint-valid`
only when it actually builds a checkpoint proposal.

`warpToSlotBeforeTargetProposer` warps to the target's *block-proposer*
slot and the test then asserts the fault at exactly that slot. But a
proposer emits a checkpoint proposal only when its slot closes a
checkpoint, and that does not reliably coincide with the warped
block-proposer slot. When the warped slot is mid-checkpoint, no outcome
is ever keyed to it → `checkpoint-missed`. Whether the proposer slot
lines up with a checkpoint close shifts with pipelining/wall-clock
timing, so it flakes (it fails on every retry, so the
`e2e-p2p-epoch-flakes` group does not absorb it).

## Fix

Stop pinning the pre-warped slot. New helper `findObservedStatusSlot`
polls an observer for the slot at which it records the expected fault
status (`checkpoint-unvalidated` / `checkpoint-invalid`); the test then
requires every honest observer to agree at that slot and the malicious
node to self-record `checkpoint-valid` for the same slot. The warp is
kept purely to cut wall-clock by advancing near the proposer region.

This stays faithful to what the test verifies: it still fails (times
out) if the malicious proposal is never detected, so it does not mask a
genuine propagation/detection bug — it only removes the brittle
assumption that the fault lands on one specific pre-chosen slot.

## Verification

- Type/usage review only: `findObservedStatusSlot` reuses the same
`getValidatorsStats()` → `history` shape and `retryUntil` pattern as the
existing helpers; `SlotNumber` / `ValidatorStatusInSlot` / `retryUntil`
are already imported.
- **Not run locally**: this is an intermittent e2e p2p flake; a single
local run would not establish flake-freedom and a full build + repeated
e2e grind is impractical to do reliably in-session. CI on this draft
exercises the test; a rerun grind is the real confirmation bar.

## Related

- Replaces the skip approach (the prior content of this PR and the
now-closed #23443).
- Adjacent to `palla/fix-sentinel-inactivity-want-to-slash` (changes the
sentinel inactivity computation), which is independent of this
test-timing fix.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
refactor(p2p): merge FastTxCollection into TxCollection with sequential
pipeline (AztecProtocol#23245)
refactor(publisher): bundle-level simulate; drop per-action enqueue sims
(AztecProtocol#23165)
refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases
(AztecProtocol#23249)
test(e2e): fix race in 'proposer invalidates multiple checkpoints'
(AztecProtocol#23259)
fix: clean up old jobs regardless of pending status (AztecProtocol#23260)
refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273)
chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276)
feat: slash truncated checkpoint proposals (AztecProtocol#23250)
refactor: remove unused map in attestation pool (AztecProtocol#23284)
chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274)
refactor(l1-tx-utils): use DateProvider for fail-fast timeout check
(AztecProtocol#23257)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277)
test(e2e): fix race in broadcasted_invalid_block_proposal_slash under
pipelining (AztecProtocol#23302)
fix(archiver): atomic getter for L2 tips (AztecProtocol#23295)
fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under
pipelining (AztecProtocol#23296)
fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298)
test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275)
chore: update dashboard (AztecProtocol#23312)
chore: Revert "feat(sandbox): support proposer pipelining in local
network" (AztecProtocol#23313)
test: slash on bad attestation (AztecProtocol#23184)
feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116)
test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301)
test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300)
test(e2e): switch fee_settings to organic fee bumps under pipelining
(AztecProtocol#23303)
fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures
(AztecProtocol#23333)
test(e2e): wait for real oracle rotation in fee_settings inflate helper
(AztecProtocol#23334)
test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining
(AztecProtocol#23336)
fix(spartan-bench): tolerate older node images in SlasherConfig schema
(AztecProtocol#23351)
fix: interrupt prover jobs in stop (AztecProtocol#23358)
test(e2e): enable pipelining on bot, fees, and avm simulator tests
(AztecProtocol#23329)
feat(sentinel): end-of-epoch evaluation with re-execution outcomes
(AztecProtocol#23286)
feat: slash for invalid checkpoint proposals (AztecProtocol#23270)
fix: fork closure in epoch proving jobs (AztecProtocol#23390)
fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394)
fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396)
test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in
epochs tests (AztecProtocol#23424)
chore: refactor weekly proving test wait (AztecProtocol#23395)
refactor: add fifo set (AztecProtocol#23271)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327)
fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371)
chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373)
feat(sequencer): AutomineSequencer for single-sequencer e2e tests
(AztecProtocol#23354)
fix(prover): wait for previous epoch to be proven (AztecProtocol#23458)
chore: collocate provers (AztecProtocol#23439)
chore: rm staging-ignition (AztecProtocol#23440)
chore: rm unused networks (AztecProtocol#23441)
test(e2e): migrate block_building, multi_validator_node,
publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414)
fix(archiver): reconcile local blocks with L1 checkpoints by block
number (AztecProtocol#23461)
feat: Updated slash conditions on block proposals (AztecProtocol#23466)
test(e2e): migrate HA full test to pipelining (AztecProtocol#23463)
chore: update resource profiles (AztecProtocol#23442)
chore: update debug log levels (AztecProtocol#23456)
test: fix flaky sentinel_status_slash by asserting the fault on the
checkpoint slot (AztecProtocol#23483)
feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980)
(AztecProtocol#23436)
refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID ->
PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468)
fix: reject block proposals in poisoned slots (AztecProtocol#23411)
fix: retry nargo dep + solc downloads to survive transient DNS drops
(AztecProtocol#23490)
fix: enrich json-rpc tracing (AztecProtocol#23412)
feat: add trace export controls (AztecProtocol#23413)
test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496)
test: cover invalid checkpoint proposal slashing (AztecProtocol#23503)
test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482)
test: flag e2e_slashing_attested_invalid_proposal as flake under
pipelining (AztecProtocol#23501)
test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining
(AztecProtocol#23515)
test(e2e): require cross-observer agreement on sentinel fault slot
(AztecProtocol#23513)
test: flag e2e_ha_full afterAll hook timeout as flake under pipelining
(AztecProtocol#23524)
fix(e2e): propagate l1ContractsArgs into node config so archiver matches
L1 (AztecProtocol#23514)
test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as
flake (AztecProtocol#23528)
test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1
race (AztecProtocol#23533)
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline
(AztecProtocol#23539)
test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541)
test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542)
test(ci): skip e2e_multi_validator_node_key_store entirely on
merge-train/spartan (AztecProtocol#23544)
END_COMMIT_OVERRIDE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants