
feat(slasher): per-slot data-withholding watcher (A-523, A-525)#23116

Merged
PhilWindle merged 7 commits into
merge-train/spartan from
phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2
May 15, 2026

@PhilWindle

Collaborator

Summary

Moves the data-withholding slash from the L1-prune path to a per-slot check at slotStart(checkpoint.slot + slashDataWithholdingToleranceSlots), and removes the now-unnecessary VALID_EPOCH_PRUNED offense and EpochPruneWatcher.

Per AZIP-7, validators are responsible for making tx data available, not for ensuring proofs land. The new DataWithholdingWatcher ticks at quarter-eth-slot cadence and, for each published checkpoint older than slashDataWithholdingToleranceSlots (default 3), probes the local mempool for the txs in the checkpoint's blocks. Missing txs trigger a DATA_WITHHOLDING slash for the validators who actually attested to that checkpoint.
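
As a rough illustration of the verdict described above, here is a minimal sketch of what one watcher check for a single checkpoint might look like. All type and function names (`PublishedCheckpoint`, `TxProviderSketch`, `findMissingTxs`, `buildOffenses`, `checkCheckpoint`) are hypothetical, and the assumption that `hasTxs` returns one boolean per hash is mine, not taken from the PR.

```typescript
// Hypothetical shapes for illustration only; the PR's real types are not shown in this thread.
type TxHash = string;
type Address = string;

interface PublishedCheckpoint {
  slot: number;
  txHashes: TxHash[];
  attesters: Address[]; // validators who attested to this checkpoint
}

interface TxProviderSketch {
  // Assumed contract: one boolean per requested hash, in the same order.
  hasTxs(txHashes: TxHash[]): Promise<boolean[]>;
}

interface DataWithholdingOffense {
  offenseType: 'DATA_WITHHOLDING';
  slot: number; // offense identity is keyed by the checkpoint's slot
  validator: Address;
}

// Pure step: which of the checkpoint's txs are absent from the local mempool?
function findMissingTxs(txHashes: TxHash[], present: boolean[]): TxHash[] {
  return txHashes.filter((_, i) => !present[i]);
}

// Pure step: one slot-keyed offense per attester, or none if all data is available.
function buildOffenses(checkpoint: PublishedCheckpoint, missing: TxHash[]): DataWithholdingOffense[] {
  if (missing.length === 0) {
    return [];
  }
  return checkpoint.attesters.map(validator => ({
    offenseType: 'DATA_WITHHOLDING' as const,
    slot: checkpoint.slot,
    validator,
  }));
}

// Probe the mempool, then turn any gap into offenses for the attesters.
async function checkCheckpoint(
  checkpoint: PublishedCheckpoint,
  txProvider: TxProviderSketch,
): Promise<DataWithholdingOffense[]> {
  const present = await txProvider.hasTxs(checkpoint.txHashes);
  return buildOffenses(checkpoint, findMissingTxs(checkpoint.txHashes, present));
}
```

Splitting the pure steps from the async probe keeps the slashing decision easy to unit-test without a mempool.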

Highlights

  • New DataWithholdingWatcher (yarn-project/slasher/src/watchers/data_withholding_watcher.ts) with full unit-test coverage. Sentinel-style tick + restart floor (no KV).
  • Slot-keyed DATA_WITHHOLDING — moved from 'epoch' to 'slot' in getTimeUnitForOffense. Offense identity is now per-checkpoint, not per-epoch.
  • Single source of truth for tolerance. P2PClient.collectingMissingTxs anchors its tx-collection deadline to slotStart(block.slot + slashDataWithholdingToleranceSlots) so the collection effort runs to exactly the wall-clock instant the watcher renders its verdict. The ad-hoc p2pMissingTxCollectionDeadlineMs is removed.
  • A-525 deletions bundled in: OffenseType.VALID_EPOCH_PRUNED, slashPrunePenalty config + env var + spartan plumbing, EpochPruneWatcher class + tests, valid_epoch_pruned_slash.test.ts.
  • e2e test rewritten in data_withholding_slash.test.ts: 4 validators, slashSelfAllowed, tx is mined normally then stubbed missing on every node, watcher fires, slash executes on-chain, committee is kicked. Asserts slot-keyed offense identity + on-chain effect.
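
The "single source of truth" highlight can be pictured as the watcher and the tx collector both deriving one timestamp from the same two inputs. The sketch below follows the `slotStart(block.slot + slashDataWithholdingToleranceSlots)` formula quoted in that bullet; the `SlotTiming` shape, its field names, and the helper names are assumptions made for illustration.

```typescript
// Hypothetical timing parameters; the real values come from the rollup's L1 constants.
interface SlotTiming {
  genesisTimeSec: number; // wall-clock time at which slot 0 starts
  slotDurationSec: number; // length of one L2 slot
}

// Wall-clock start of a slot, in seconds.
function slotStartSec(slot: number, timing: SlotTiming): number {
  return timing.genesisTimeSec + slot * timing.slotDurationSec;
}

// The instant at which the watcher renders its verdict for a checkpoint at `checkpointSlot`.
function verdictTimeSec(checkpointSlot: number, toleranceSlots: number, timing: SlotTiming): number {
  return slotStartSec(checkpointSlot + toleranceSlots, timing);
}

// Time left for tx collection, anchored to the same instant (never negative).
function collectionTimeoutMs(
  blockSlot: number,
  toleranceSlots: number,
  timing: SlotTiming,
  nowMs: number,
): number {
  return Math.max(0, verdictTimeSec(blockSlot, toleranceSlots, timing) * 1000 - nowMs);
}
```

Because both sides call the same function, changing the tolerance moves the collection deadline and the verdict together.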

Test plan

  • yarn workspace @aztec/slasher test — 76 tests pass (incl. 9 new DataWithholdingWatcher tests).
  • yarn workspace @aztec/stdlib test src/slashing — 55 tests pass after the keying flip.
  • yarn workspace @aztec/p2p test src/client/p2p_client.test.ts — 20 tests pass with the new slot-anchored deadline.
  • yarn build clean across the monorepo.
  • The e2e (e2e_p2p_data_withholding_slash) is in place but only runs in CI.

Out of scope

  • L1 contract changes — none needed (offense type code is purely off-chain).
  • The L2PruneUnproven event/emitter is left in place; nothing subscribes to it after this PR but the event itself stays available for future observers.
  • Re-execution of checkpoints (covered by sibling A-1022).

Closes A-523.
Closes A-525.

@PhilWindle PhilWindle added ci-draft Run CI on draft PRs. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 8, 2026
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch 2 times, most recently from 78d0044 to ec8293d Compare May 12, 2026 19:34
@PhilWindle PhilWindle marked this pull request as ready for review May 12, 2026 19:37
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from ec8293d to 51fe9e3 Compare May 13, 2026 08:53

@spalladino spalladino left a comment

Contributor

Looks good! Left a few comments. Aside from them, one thing I'm worried about is race conditions that cause the TxPool to evict txs before the data withholding watcher has had time to check availability. For instance: let's say an epoch gets proven really fast, so its txs get evicted from the mempool (since they are no longer needed) before the data withholding tolerance has elapsed. When the watcher kicks in, the txs will have been removed from the tx pool, and the slash will kick in. Or even worse: in the event of a prune due to a missing proof, the txs for all blocks in the epoch become unprotected (or removed if they're invalid, as they may have been anchored to pruned blocks), and the data withholding check will kick in a few slots after the prune and want to slash all attestors of the last few checkpoints in the epoch. I'm not sure what the best way to fix this is, though, and apologies for not catching this earlier.

Comment on lines +617 to +620
// Staleness: the proposer can only publish this checkpoint within its wallclock slot,
// which is `slot - pipeliningOffset`.
const currentSlot = this.epochCache.getSlotNow();
const lastAttestableWallclockSlot = slotNumber - this.epochCache.pipeliningOffset();

We should use the same helpers as in the attestation_validator here, which I think are more lenient.

More generally, when in doubt, we should probably broadcast our attestation anyway. Better to be penalized on p2p than slashed due to inactivity.
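
For reference, the strict staleness check in the quoted hunk reduces to a single comparison. The sketch below restates it as a pure function; `isProposalStale` is a hypothetical name, and this shows only the strict form from the diff, not the more lenient attestation_validator helpers the comment refers to.

```typescript
// Returns true when the proposer's wallclock window for `proposalSlot` has already elapsed.
// `pipeliningOffset` is 0 when pipelining is off (assumed), so the window is the slot itself.
function isProposalStale(proposalSlot: number, currentSlot: number, pipeliningOffset: number): boolean {
  const lastAttestableWallclockSlot = proposalSlot - pipeliningOffset;
  return currentSlot > lastAttestableWallclockSlot;
}
```

With an offset of 1, a proposal for slot 10 is attestable up to wallclock slot 9 and counts as stale from slot 10 onward.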

// before we declare its data missing. For checkpoint slot S, we therefore process S
// only once we are in slot `S + tolerance + 1` or later.
const tolerance = this.config.slashDataWithholdingToleranceSlots;
const currentSlot = this.epochCache.getSlotNow();

I'd change to L2BlockSource.getSyncedL2SlotNumber in case the archiver is lagging behind for some reason
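
To make the difference concrete, here is a small sketch of the "process S only once we are in slot `S + tolerance + 1` or later" rule from the quoted hunk, evaluated against two different reference slots. `isDueForCheck` is a hypothetical helper; the real watcher's structure is not shown in this thread.

```typescript
// A checkpoint at slot S is due once the reference slot has reached S + tolerance + 1.
// The reference slot may be the wallclock slot or the archiver's synced slot.
function isDueForCheck(checkpointSlot: number, referenceSlot: number, toleranceSlots: number): boolean {
  return referenceSlot >= checkpointSlot + toleranceSlots + 1;
}

// Example scenario: wallclock is at slot 14 but the archiver has only synced up to slot 12.
const exampleTolerance = 3;
const dueByWallclock = isDueForCheck(10, 14, exampleTolerance); // true: verdict would be rendered now
const dueBySyncedSlot = isDueForCheck(10, 12, exampleTolerance); // false: verdict waits for the archiver
```

Anchoring on the synced slot delays the verdict while the archiver lags, rather than judging checkpoints the node has not fully ingested.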

Comment on lines +140 to +143
this.log.warn(`Detected ${missingTxs.length} missing txs at slot ${slot} but no recoverable attesters`, {
  slot,
  missingTxs: missingTxs.map(h => h.toString()),
});

This is weird, the publisher itself should at least be an attester. But fine to leave this in just as a guard.

Comment on lines +222 to +223
// 4. Wait until the proposer is the designated current proposer.
t.logger.warn(`Waiting for ${proposerAddress} to be the current proposer`);

With pipelining, you should wait until proposerAddress is due to be the proposer in the next slot according to the rollup contract. I'm surprised the test passes anyway.

Comment on lines +266 to +267
// 7. Wait for the slash to execute on L1. Advance to the slashing tally epoch so B/C/D
// can start voting on the offense round.

I'd argue we should start dropping this part from all e2e slashing tests, so they run faster, and keep it in only one of them.

Comment on lines +190 to +198
const proposerP2pService: any = (proposerNode as any).p2pClient.p2pService;
const originalPropagate = proposerP2pService.propagate.bind(proposerP2pService);
jest.spyOn(proposerP2pService, 'propagate').mockImplementation(((msg: any) => {
  if (msg instanceof Tx) {
    t.logger.info(`Suppressing outbound tx gossip from proposer ${proposerAddress}`);
    return Promise.resolve();
  }
  return originalPropagate(msg);
}) as any);

I prefer adding test-only config flags to control this behaviour, so e2e tests don't need to mess with internals. But it's a personal preference.

@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch 3 times, most recently from e48f310 to cdf0290 Compare May 13, 2026 16:40
A-523: Replace the L1-prune-time data-withholding check with a per-slot
watcher that runs once `slashDataWithholdingToleranceSlots` full slots
have elapsed after a published checkpoint, probes the local mempool via
ITxProvider.hasTxs, and emits a slot-keyed DATA_WITHHOLDING offense for
every attester (union of L1-published + p2p-pool) that signed the
checkpoint when any of its txs are still missing. Restart-floor at boot
slot, ticks at quarter-eth-slot cadence.
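
A minimal sketch of the "union of L1-published + p2p-pool" attester set mentioned above, assuming attesters are identified by hex address strings and that addresses should be compared case-insensitively (both are assumptions for illustration; `unionAttesters` is a hypothetical name):

```typescript
// Merge attesters seen on L1 with those seen in the p2p attestation pool, without duplicates.
// The first spelling encountered is kept; comparison ignores hex letter case.
function unionAttesters(l1Attesters: string[], p2pAttesters: string[]): string[] {
  const seen: Record<string, boolean> = {};
  const result: string[] = [];
  for (const attester of l1Attesters.concat(p2pAttesters)) {
    const key = attester.toLowerCase();
    if (!seen[key]) {
      seen[key] = true;
      result.push(attester);
    }
  }
  return result;
}
```

Taking the union means a validator who signed on p2p but whose signature did not reach L1 is still held to the same availability duty.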

A-525: Remove the legacy VALID_EPOCH_PRUNED slashing path:
EpochPruneWatcher, slashPrunePenalty, and OffenseType.VALID_EPOCH_PRUNED.

Protocol fixes that close the data-withholding loophole:
- validateCheckpointProposal: refuse to attest if the block's enclosing
  checkpoint is already on L1 — we cannot legitimately attest based on
  L1-synced data we did not witness on p2p.
- shouldAttestToSlot: refuse a proposal whose proposer's wallclock window
  has elapsed (`currentSlot > slot - pipeliningOffset`).
- block-sync timeout in validateCheckpointProposal: use
  `getReexecutionDeadline` so the deadline is the end of the proposal's
  own slot when pipelining is off, and the start of the target slot when
  it is on.

P2P:
- ITxProvider.hasTxs(txHashes) added for the watcher's mempool probe.
- Optional p2pMissingTxCollectionDeadlineSlots clamped up to the
  tolerance window, so collection never gives up before the slash verdict.

E2E:
- Rewrite data_withholding_slash to model the realistic attack: stub
  proposer outbound tx gossip + two blind attesters + one honest member
  who naturally refuses via the new gates. Retry epoch advance until a
  full committee is found.
- Patch duplicate_attestation_slash: stub `notifyOwnCheckpointProposal`
  on the malicious nodes to break the self-loopback into local validation
  (with skipPushProposedBlocksToArchiver the proposer's block never lands
  in its archiver, so the loopback hung the proposer past the staleness
  gate and suppressed all duplicate attestations).
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from cdf0290 to 7c1cfff Compare May 13, 2026 17:29
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 7f1b01c to 425d9bf Compare May 14, 2026 17:37
…-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2
@PhilWindle PhilWindle force-pushed the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch from 425d9bf to cd34200 Compare May 14, 2026 17:39
}

public start(): Promise<void> {
this.initialSlot = this.epochCache.getSlotNow();

Should be l2BlockSource.getSyncedL2SlotNumber() here as well, sorry I missed it on the first pass

* driven externally via `removeBefore` (typically by the proposal handler, once a
* checkpoint reaches L1 finality).
*/
export class InMemoryCheckpointReexecutionTracker implements CheckpointReexecutionTracker {

Personally I've started removing intermediate interfaces when we have a single implementation, since it makes code navigation more annoying. We should agree on whether it's a good policy or not.

Comment on lines +975 to +976
// We successfully re-executed every block in this checkpoint locally, record for any observers
this.reexecutionTracker.recordReexecuted(checkpointNumber, proposal.archive);

Should we add a recordTxsCollected before we reexecute, in case we do manage to collect the txs but fail to reexecute within the deadline?
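
One way to picture this suggestion is a tracker that records two stages per checkpoint. This is a hypothetical sketch, not the PR's `InMemoryCheckpointReexecutionTracker`: the class name, the `getStage` accessor, the stage names, and the string-typed `archive` are all assumptions.

```typescript
type CheckpointStage = 'txs-collected' | 'reexecuted';

class SketchCheckpointTracker {
  private readonly stages = new Map<number, { archive: string; stage: CheckpointStage }>();

  // Record that every tx for the checkpoint was collected, even if re-execution later times out.
  recordTxsCollected(checkpointNumber: number, archive: string): void {
    const existing = this.stages.get(checkpointNumber);
    // Never downgrade a checkpoint that was already re-executed.
    if (!existing || existing.stage !== 'reexecuted') {
      this.stages.set(checkpointNumber, { archive, stage: 'txs-collected' });
    }
  }

  // Record that every block in the checkpoint was re-executed locally.
  recordReexecuted(checkpointNumber: number, archive: string): void {
    this.stages.set(checkpointNumber, { archive, stage: 'reexecuted' });
  }

  getStage(checkpointNumber: number): CheckpointStage | undefined {
    const entry = this.stages.get(checkpointNumber);
    return entry ? entry.stage : undefined;
  }

  // Drop all entries for checkpoints strictly below `checkpointNumber`.
  removeBefore(checkpointNumber: number): void {
    const toDelete: number[] = [];
    this.stages.forEach((_, key) => {
      if (key < checkpointNumber) {
        toDelete.push(key);
      }
    });
    toDelete.forEach(key => this.stages.delete(key));
  }
}
```

With two stages, an observer could distinguish "data was available but re-execution missed the deadline" from "data never arrived".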

Mirror the same archiver-synced-slot logic used in work(): use
l2BlockSource.getSyncedL2SlotNumber() rather than the wallclock to
floor processing on start. Avoids treating slots as already-started
before the archiver has ingested them. Per PR review.
@PhilWindle PhilWindle enabled auto-merge (squash) May 15, 2026 10:38
…kpoint_proposal_slash test

slashPrunePenalty was removed from SetupOptions as part of A-523/A-525.
This e2e test still referenced it, breaking the TS build.
@PhilWindle PhilWindle merged commit 58acdae into merge-train/spartan May 15, 2026
14 checks passed
@PhilWindle PhilWindle deleted the phil/a-523-move-data-withholding-check-to-end-of-slot-instead-of-at-l2 branch May 15, 2026 15:56
spalladino pushed a commit that referenced this pull request May 18, 2026
#23351)

## Summary

Fixes `block_capacity.test.ts` (nightly spartan bench) failing against a
stale nightly Aztec image.

The nightly-spartan-bench workflow checks out test code from
`merge-train/spartan` but runs against the prebuilt nightly Aztec docker
image, which is built from `next`. `merge-train/spartan` introduced two
new required fields on `SlasherConfigSchema`:

- `slashDataWithholdingToleranceSlots: z.number()` (PR #23116)
- `slashBroadcastedInvalidCheckpointProposalPenalty: schemas.BigInt`

The nightly image (built from `next`, which does not yet have these PRs)
does not include these fields in its config response. As a result the
very first call in the test — `updateSequencersConfig(config, {
minTxsPerBlock: 0 })` → `client.getConfig()` — fails Zod validation,
every subbench dies on `accountAddresses[0] === undefined` in
`beforeAll`, and `block_capacity.test.ts` reports 8/8 failures (see
<http://ci.aztec-labs.com/2679b1738ea39752>).

## Fix

Make the two new fields tolerate undefined in the response schema by
giving them sensible defaults (`3` and `0n`, matching
`spartan/environments/network-defaults.yml` and the slasher defaults).
The TypeScript `SlasherConfig` interface stays strict (output type is
unchanged); only the parse side becomes forgiving, matching the pattern
used by the other recently-added schema fields in `configs.ts`,
`p2p.ts`, and `validator.ts`.

`partial()` (used by `setConfig`) wraps each field with `.optional()` on
the outside, which bypasses `.default()` — so partial updates that omit
these fields will continue to omit them server-side rather than
accidentally clobbering existing values to the defaults.

## Test plan

- nightly-spartan-bench → block-capacity-benchmark passes against the
stale-image scenario
- existing `aztec-node-admin.test.ts` getConfig test continues to pass
(fields are present in the mock)


---
*Created by
[claudebox](https://claudebox.work/v2/sessions/5cbfb8fe5a58619c) ·
group: `slackbot`*
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
refactor(p2p): merge FastTxCollection into TxCollection with sequential
pipeline (AztecProtocol#23245)
refactor(publisher): bundle-level simulate; drop per-action enqueue sims
(AztecProtocol#23165)
refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases
(AztecProtocol#23249)
test(e2e): fix race in 'proposer invalidates multiple checkpoints'
(AztecProtocol#23259)
fix: clean up old jobs regardless of pending status (AztecProtocol#23260)
refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273)
chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276)
feat: slash truncated checkpoint proposals (AztecProtocol#23250)
refactor: remove unused map in attestation pool (AztecProtocol#23284)
chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274)
refactor(l1-tx-utils): use DateProvider for fail-fast timeout check
(AztecProtocol#23257)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277)
test(e2e): fix race in broadcasted_invalid_block_proposal_slash under
pipelining (AztecProtocol#23302)
fix(archiver): atomic getter for L2 tips (AztecProtocol#23295)
fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under
pipelining (AztecProtocol#23296)
fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298)
test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275)
chore: update dashboard (AztecProtocol#23312)
chore: Revert "feat(sandbox): support proposer pipelining in local
network" (AztecProtocol#23313)
test: slash on bad attestation (AztecProtocol#23184)
feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116)
test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301)
test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300)
test(e2e): switch fee_settings to organic fee bumps under pipelining
(AztecProtocol#23303)
fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures
(AztecProtocol#23333)
test(e2e): wait for real oracle rotation in fee_settings inflate helper
(AztecProtocol#23334)
test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining
(AztecProtocol#23336)
fix(spartan-bench): tolerate older node images in SlasherConfig schema
(AztecProtocol#23351)
fix: interrupt prover jobs in stop (AztecProtocol#23358)
test(e2e): enable pipelining on bot, fees, and avm simulator tests
(AztecProtocol#23329)
feat(sentinel): end-of-epoch evaluation with re-execution outcomes
(AztecProtocol#23286)
feat: slash for invalid checkpoint proposals (AztecProtocol#23270)
fix: fork closure in epoch proving jobs (AztecProtocol#23390)
fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394)
fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396)
test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in
epochs tests (AztecProtocol#23424)
chore: refactor weekly proving test wait (AztecProtocol#23395)
refactor: add fifo set (AztecProtocol#23271)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327)
fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371)
chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373)
feat(sequencer): AutomineSequencer for single-sequencer e2e tests
(AztecProtocol#23354)
fix(prover): wait for previous epoch to be proven (AztecProtocol#23458)
chore: collocate provers (AztecProtocol#23439)
chore: rm staging-ignition (AztecProtocol#23440)
chore: rm unused networks (AztecProtocol#23441)
test(e2e): migrate block_building, multi_validator_node,
publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414)
fix(archiver): reconcile local blocks with L1 checkpoints by block
number (AztecProtocol#23461)
feat: Updated slash conditions on block proposals (AztecProtocol#23466)
test(e2e): migrate HA full test to pipelining (AztecProtocol#23463)
chore: update resource profiles (AztecProtocol#23442)
chore: update debug log levels (AztecProtocol#23456)
test: fix flaky sentinel_status_slash by asserting the fault on the
checkpoint slot (AztecProtocol#23483)
feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980)
(AztecProtocol#23436)
refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID ->
PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468)
fix: reject block proposals in poisoned slots (AztecProtocol#23411)
fix: retry nargo dep + solc downloads to survive transient DNS drops
(AztecProtocol#23490)
fix: enrich json-rpc tracing (AztecProtocol#23412)
feat: add trace export controls (AztecProtocol#23413)
test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496)
test: cover invalid checkpoint proposal slashing (AztecProtocol#23503)
test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482)
test: flag e2e_slashing_attested_invalid_proposal as flake under
pipelining (AztecProtocol#23501)
test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining
(AztecProtocol#23515)
test(e2e): require cross-observer agreement on sentinel fault slot
(AztecProtocol#23513)
test: flag e2e_ha_full afterAll hook timeout as flake under pipelining
(AztecProtocol#23524)
fix(e2e): propagate l1ContractsArgs into node config so archiver matches
L1 (AztecProtocol#23514)
test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as
flake (AztecProtocol#23528)
test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1
race (AztecProtocol#23533)
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline
(AztecProtocol#23539)
test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541)
test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542)
test(ci): skip e2e_multi_validator_node_key_store entirely on
merge-train/spartan (AztecProtocol#23544)
END_COMMIT_OVERRIDE