feat(sentinel): end-of-epoch evaluation with re-execution outcomes#23286
Conversation
425d9bf to
cd34200
Compare
f2ae630 to
eb8a096
Compare
eb8a096 to
4acc6d8
Compare
spalladino
left a comment
There was a problem hiding this comment.
Looks great! One thing that Claude found: since proposals-from-self skip the proposal handler, a node will always report itself as inactive in its sentinel. It's no big deal since nodes are smart enough not to slash themselves, but it can be annoying for a node that relies on its own sentinel for tracking its activity.
| ); | ||
| const result: CheckpointProposalValidationResult = { isValid: false, reason: 'invalid_fee_asset_price_modifier' }; | ||
| this.lastCheckpointValidationResult = { payloadHash, result }; | ||
| this.reexecutionTracker.recordOutcome(slot, proposal.archive, 'invalid'); |
There was a problem hiding this comment.
Refactoring nit: I'd set up a mapping from validation result to outcome, and just record this on the caller.
| it('slashes the proposer with INACTIVITY when checkpoint validation records unvalidated', async () => { | ||
| // One malicious node broadcasts invalid block proposals; honest observers reject them via | ||
| // re-execution state_mismatch and therefore never push to their archivers, so the malicious | ||
| // node's checkpoint proposals can't find their last block and observers record `unvalidated`. | ||
| const targetAddress = await spawnMaliciousAndHonestNodes({ broadcastInvalidBlockProposal: true }); | ||
| const targetSlot = await warpToSlotBeforeTargetProposer(targetAddress); | ||
| // nodes[0] is the malicious node; honest observers are nodes[1..]. | ||
| const honestObservers = nodes.slice(1); | ||
| await assertAllObserversSentinelStatus(honestObservers, targetAddress, targetSlot, 'checkpoint-unvalidated'); | ||
| await assertInactivityOffenseFor(targetAddress, nodes[1]); | ||
| }); |
There was a problem hiding this comment.
All e2e tests should read like this
4af2626 to
db4ec58
Compare
This should be addressed now. Proposers record their own proposals as valid. |
Sentinel now evaluates per-epoch validator performance once sentinelEpochEndBufferSlots (default 2) has elapsed past an epoch's last slot, instead of waiting for the chain-proven event. Inactivity slashing fires without depending on L1 proof publication (AZIP-7). Proposer status follows a six-case taxonomy driven by the checkpoint re-execution tracker: checkpoint-mined, checkpoint-valid, checkpoint-invalid, checkpoint-unvalidated, checkpoint-missed, blocks-missed. Missing-attestor faults apply only when status is checkpoint-mined or checkpoint-valid. CheckpointReexecutionTracker overhaul: - Drops the redundant interface; single class exported under its plain name. Maps are brand-typed so callers don't lose brand info on lookup. - Records a three-way ReexecutionOutcome (valid / invalid / unvalidated) keyed by slot; hasReexecuted keeps its prior contract. - New recordTxsCollected(slot, indexWithinCheckpoint, collected) / getTxsCollectedRecord(slot, idx). The validator's proposal handler calls this immediately after getTxsForBlockProposal; the data-withholding watcher short-circuits when every block in a published checkpoint has a true record. A false (deadline missed) does not slash by itself because the DW tolerance window is longer than the re-execution deadline. ProposalHandler.validateCheckpointProposal records an outcome at every early-return; invalid_signature and checkpoint_already_published deliberately skip. New broadcastInvalidCheckpointProposalOnly config flag on SequencerConfig: lets tests corrupt the checkpoint archive while keeping the underlying block proposals valid — exercises the 'invalid' path cleanly. Unset preserves existing behaviour (follows broadcastInvalidBlockProposal). Breaking changes: - SentinelStore.SCHEMA_VERSION bumped 3 -> 4; legacy provenMap data is wiped on upgrade. - SingleValidatorStats.allTimeProvenPerformance renamed to allTimeEpochPerformance. - provenMap / getProvenPerformance / updateProvenPerformance renamed to epochMap / getEpochPerformance / updateEpochPerformance. - Env var SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHS renamed to SENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS. - InMemoryCheckpointReexecutionTracker renamed to CheckpointReexecutionTracker (interface dropped).
CI's container-name derivation rejects backticks (allowed pattern is [a-zA-Z0-9][a-zA-Z0-9_.-]+) so the test runner failed at docker run before either test could execute. Drop the inline-code formatting from the it() titles.
8bb558d to
eab8434
Compare
…heckpoint slot (#23483) ## Summary Root-cause fix for the `sentinel_status_slash.parallel.test.ts` flake (added in #23286) instead of skipping it. The two malicious-proposer tests now discover the slot at which the fault is actually recorded rather than asserting it on the pre-warped block-proposer slot. (This PR previously carried a `.test_patterns.yml` skip on `next`; per maintainer request the skip is dropped and the PR now carries the fix on `merge-train/spartan`, where the test lives.) ## Root cause The failing merge-train run (test log http://ci.aztec-labs.com/83589a3b1ce4a8b7) shows every honest observer recording the target proposer as `checkpoint-missed` at the warped slot (15), with zero `checkpoint-invalid` / `checkpoint-unvalidated` / `checkpoint-valid` for it anywhere — the malicious node didn't even self-record `checkpoint-valid` for that slot. `checkpoint-missed` means *block proposals were seen for the slot, but no checkpoint-proposal re-execution outcome was recorded for it* (`sentinel.ts` `getSlotActivity`). The re-execution outcome that drives `checkpoint-invalid` / `checkpoint-unvalidated` is keyed by `proposal.slotNumber` — the slot at which a node *closes a checkpoint* (`proposal_handler.ts:887`). A proposer self-records `checkpoint-valid` only when it actually builds a checkpoint proposal. `warpToSlotBeforeTargetProposer` warps to the target's *block-proposer* slot and the test then asserts the fault at exactly that slot. But a proposer emits a checkpoint proposal only when its slot closes a checkpoint, and that does not reliably coincide with the warped block-proposer slot. When the warped slot is mid-checkpoint, no outcome is ever keyed to it → `checkpoint-missed`. Whether the proposer slot lines up with a checkpoint close shifts with pipelining/wall-clock timing, so it flakes (it fails on every retry, so the `e2e-p2p-epoch-flakes` group does not absorb it). ## Fix Stop pinning the pre-warped slot. New helper `findObservedStatusSlot` polls an observer for the slot at which it records the expected fault status (`checkpoint-unvalidated` / `checkpoint-invalid`); the test then requires every honest observer to agree at that slot and the malicious node to self-record `checkpoint-valid` for the same slot. The warp is kept purely to cut wall-clock by advancing near the proposer region. This stays faithful to what the test verifies: it still fails (times out) if the malicious proposal is never detected, so it does not mask a genuine propagation/detection bug — it only removes the brittle assumption that the fault lands on one specific pre-chosen slot. ## Verification - Type/usage review only: `findObservedStatusSlot` reuses the same `getValidatorsStats()` → `history` shape and `retryUntil` pattern as the existing helpers; `SlotNumber` / `ValidatorStatusInSlot` / `retryUntil` are already imported. - **Not run locally**: this is an intermittent e2e p2p flake; a single local run would not establish flake-freedom and a full build + repeated e2e grind is impractical to do reliably in-session. CI on this draft exercises the test; a rerun grind is the real confirmation bar. ## Related - Replaces the skip approach (the prior content of this PR and the now-closed #23443). - Adjacent to `palla/fix-sentinel-inactivity-want-to-slash` (changes the sentinel inactivity computation), which is independent of this test-timing fix.
BEGIN_COMMIT_OVERRIDE refactor(p2p): merge FastTxCollection into TxCollection with sequential pipeline (AztecProtocol#23245) refactor(publisher): bundle-level simulate; drop per-action enqueue sims (AztecProtocol#23165) refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases (AztecProtocol#23249) test(e2e): fix race in 'proposer invalidates multiple checkpoints' (AztecProtocol#23259) fix: clean up old jobs regardless of pending status (AztecProtocol#23260) refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273) chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276) feat: slash truncated checkpoint proposals (AztecProtocol#23250) refactor: remove unused map in attestation pool (AztecProtocol#23284) chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274) refactor(l1-tx-utils): use DateProvider for fail-fast timeout check (AztecProtocol#23257) feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277) test(e2e): fix race in broadcasted_invalid_block_proposal_slash under pipelining (AztecProtocol#23302) fix(archiver): atomic getter for L2 tips (AztecProtocol#23295) fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under pipelining (AztecProtocol#23296) fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298) test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275) chore: update dashboard (AztecProtocol#23312) chore: Revert "feat(sandbox): support proposer pipelining in local network" (AztecProtocol#23313) test: slash on bad attestation (AztecProtocol#23184) feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116) test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301) test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300) test(e2e): switch fee_settings to organic fee bumps under pipelining (AztecProtocol#23303) fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures (AztecProtocol#23333) test(e2e): wait for real oracle rotation in fee_settings inflate helper (AztecProtocol#23334) test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining (AztecProtocol#23336) fix(spartan-bench): tolerate older node images in SlasherConfig schema (AztecProtocol#23351) fix: interrupt prover jobs in stop (AztecProtocol#23358) test(e2e): enable pipelining on bot, fees, and avm simulator tests (AztecProtocol#23329) feat(sentinel): end-of-epoch evaluation with re-execution outcomes (AztecProtocol#23286) feat: slash for invalid checkpoint proposals (AztecProtocol#23270) fix: fork closure in epoch proving jobs (AztecProtocol#23390) fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394) fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396) test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in epochs tests (AztecProtocol#23424) chore: refactor weekly proving test wait (AztecProtocol#23395) refactor: add fifo set (AztecProtocol#23271) feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327) fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371) chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373) feat(sequencer): AutomineSequencer for single-sequencer e2e tests (AztecProtocol#23354) fix(prover): wait for previous epoch to be proven (AztecProtocol#23458) chore: collocate provers (AztecProtocol#23439) chore: rm staging-ignition (AztecProtocol#23440) chore: rm unused networks (AztecProtocol#23441) test(e2e): migrate block_building, multi_validator_node, publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414) fix(archiver): reconcile local blocks with L1 checkpoints by block number (AztecProtocol#23461) feat: Updated slash conditions on block proposals (AztecProtocol#23466) test(e2e): migrate HA full test to pipelining (AztecProtocol#23463) chore: update resource profiles (AztecProtocol#23442) chore: update debug log levels (AztecProtocol#23456) test: fix flaky sentinel_status_slash by asserting the fault on the checkpoint slot (AztecProtocol#23483) feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980) (AztecProtocol#23436) refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID -> PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468) fix: reject block proposals in poisoned slots (AztecProtocol#23411) fix: retry nargo dep + solc downloads to survive transient DNS drops (AztecProtocol#23490) fix: enrich json-rpc tracing (AztecProtocol#23412) feat: add trace export controls (AztecProtocol#23413) test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496) test: cover invalid checkpoint proposal slashing (AztecProtocol#23503) test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482) test: flag e2e_slashing_attested_invalid_proposal as flake under pipelining (AztecProtocol#23501) test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining (AztecProtocol#23515) test(e2e): require cross-observer agreement on sentinel fault slot (AztecProtocol#23513) test: flag e2e_ha_full afterAll hook timeout as flake under pipelining (AztecProtocol#23524) fix(e2e): propagate l1ContractsArgs into node config so archiver matches L1 (AztecProtocol#23514) test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as flake (AztecProtocol#23528) test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1 race (AztecProtocol#23533) test(e2e_ha_full): parallel HA peer node teardown with per-node deadline (AztecProtocol#23539) test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541) test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542) test(ci): skip e2e_multi_validator_node_key_store entirely on merge-train/spartan (AztecProtocol#23544) END_COMMIT_OVERRIDE
Summary
sentinelEpochEndBufferSlots(default 2) has elapsed past an epoch's last slot, instead of waiting for thechain-provenevent. Inactivity slashing fires without depending on L1 proof publication (AZIP-7).checkpoint-mined,checkpoint-valid,checkpoint-invalid,checkpoint-unvalidated,checkpoint-missed,blocks-missed. Missing-attestor faults apply only when status ischeckpoint-minedorcheckpoint-valid.CheckpointReexecutionTracker):Map<CheckpointNumber, …>,Map<SlotNumber, Entry>) so callers don't lose brand info on lookup.ReexecutionOutcome(valid/invalid/unvalidated) keyed by slot;hasReexecutedkeeps its prior contract.recordTxsCollected(slot, indexWithinCheckpoint, collected)/getTxsCollectedRecord(slot, idx). The validator's proposal handler calls this immediately aftergetTxsForBlockProposal; the data-withholding watcher short-circuits when every block in a published checkpoint has atruerecord. Afalse(deadline missed) does not slash by itself because the data-withholding tolerance window is longer than the re-execution deadline.ProposalHandler.validateCheckpointProposalrecords an outcome at every early-return;invalid_signatureandcheckpoint_already_publisheddeliberately skip.broadcastInvalidCheckpointProposalOnlyconfig flag (SequencerConfig). Lets tests corrupt the checkpoint archive while keeping the underlying block proposals valid — exercises the'invalid'path cleanly. When unset, behaviour followsbroadcastInvalidBlockProposal(existing behaviour).Scope
Closes A-538. The tracker enhancement also lays groundwork for A-537's
PROPOSER_INVALID_PROPOSALoffence —invalidoutcomes are recorded but not yet consumed by a watcher; that watcher is a follow-on owned separately.Breaking changes
SentinelStore.SCHEMA_VERSIONbumped 3 → 4; legacyprovenMapdata is wiped on upgrade.SingleValidatorStats.allTimeProvenPerformancerenamed toallTimeEpochPerformance.provenMap/getProvenPerformance/updateProvenPerformancerenamed toepochMap/getEpochPerformance/updateEpochPerformance.SENTINEL_HISTORIC_PROVEN_PERFORMANCE_LENGTH_IN_EPOCHSrenamed toSENTINEL_HISTORIC_EPOCH_PERFORMANCE_LENGTH_IN_EPOCHS.InMemoryCheckpointReexecutionTrackerrenamed toCheckpointReexecutionTracker(interface dropped).Base branch
Based on
phil/a-523-...because it shares theCheckpointReexecutionTrackerplumbing. Will rebase ontomerge-train/spartanonce A-523 merges.Test plan
recordTxsCollected/getTxsCollectedRecord(stdlib/src/checkpoint/checkpoint_reexecution_tracker.test.ts).aztec-node/src/sentinel/sentinel.test.ts— 46 tests passing).aztec-node/src/sentinel/store.test.ts).proposal_handler.test.tsanddata_withholding_watcher.test.tscontinue to pass.multiple_validators_sentinel.parallel.test.tsupdated for new status names; new test asserts per-epoch performance is recorded shortly after epoch end without waiting for L1 proof.sentinel_status_slash.parallel.test.tsdrives all three slashable sentinel statuses end-to-end via config flags (no jest stubbing):'unvalidated'viabroadcastInvalidBlockProposal'invalid'viabroadcastInvalidCheckpointProposalOnlytryStopon one nodeEach malicious-proposer test warps the L1 clock to the slot just before the target's proposer slot (accounting for pipelining) to keep wall-clock time down. All three pass locally in ~4 min total.
Docs
yarn-project/aztec-node/src/sentinel/README.mddocumenting the module.yarn-project/slasher/README.mdInactivity section updated to match the new behaviour.