chore: deflake duplicate proposals and attestations#20990
Conversation
spalladino
left a comment
There was a problem hiding this comment.
Have you confirmed the call to waitForP2PMeshConnectivity is needed? Both tests run with mockGossipSubNetwork, so connectivity should be guaranteed. Did you stumble upon runs where the issue was due to missed p2p messages? If so, we should review if mockGossipSubNetwork is working properly. If not, then we should remove the call to waitForP2PMeshConnectivity.
| logger.info(`Checking epoch ${currentEpoch} (slots ${startSlot}-${endSlot - 1}) for proposer ${targetProposer}`); | ||
|
|
||
| for (let s = startSlot; s < endSlot; s++) { | ||
| const proposer = await epochCache.getProposerAttesterAddressInSlot(SlotNumber(s)); |
There was a problem hiding this comment.
Nit: let's switch this in favor of getCurrentProposer (see ethereum/src/contracts/rollup.ts method getCurrentProposer) so we don't need an epoch cache as a dependency, and get the data directly from L1.
@spalladino I didn't - I actually assumed it is there for a reason (it was already there, I did not add it). I'll double check if that's needed |
|
Hah, fair! I didn't notice it was already there. I'm good to merge then, but still worth taking a look at that. |
BEGIN_COMMIT_OVERRIDE fix: track last seen nonce in case of stale fallback L1 RPC node (#20855) feat: Validate num txs in block proposals (#20850) fix(archiver): enforce checkpoint boundary on rollbackTo (#20908) fix: tps zero metrics (#20656) fix: handle scientific notation in bigintConfigHelper (#20929) feat(aztec): node enters standby mode on genesis root mismatch (#20938) fix: logging of class instances (#20807) feat(slasher): make slash grace period relative to rollup upgrade time (#20942) chore: add script to find PRs to backport (#20956) chore: remove unused prover-node dep (#20955) fix: increase minFeePadding in e2e_bot bridge resume tests and harden GasFees.mul() (#20962) feat(sequencer): (A-526) rotate publishers when send fails (#20888) chore: (A-554) bump reth version 1.6.0 -> 1.11.1 for eth devnet (#20889) chore: metric on how many epochs validator has been on committee (#20967) fix: set wallet minFeePadding in BotFactory constructor (#20992) chore: deflake epoch invalidate block test (#21001) chore(sequencer): e2e tests for invalid signature recovery in checkpoint attestations (#20971) chore: deflake duplicate proposals and attestations (#20990) chore: deflake epochs mbps test (#21003) feat: reenable function selectors in txPublicSetupAllowList (#20909) fix: limit offenses when voting in tally slashing mode by slashMaxPayloadSize (#20683) fix(spartan): wire SEQ_L1_PUBLISHING_TIME_ALLOWANCE_IN_SLOT env var (#21017) END_COMMIT_OVERRIDE
Fix flakiness in duplicate_proposal_slash and duplicate_attestation_slash e2e tests. Both tests flake at a 3-13% rate because the malicious proposer (1 of 4 validators) is never selected within the timeout window. Three root causes are addressed: 1. **GossipSub mesh checked only for TX topic** — The tests need block proposals and checkpoint proposals to propagate, but `waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a `topics` parameter so callers can specify which topics to wait on. Both slash tests now wait on `tx`, `block_proposal`, and `checkpoint_proposal`. 2. **Proposer selection is probabilistic** — With 4 validators and 2 slots/epoch, the malicious proposer has ~25% chance per slot. Added `awaitEpochWithProposer` helper that advances epochs (via L1 time warp) until the target proposer is deterministically selected for at least one slot in the current epoch. 3. **Race between node startup and first proposal** — Nodes started sequencing immediately upon creation, potentially proposing before the P2P mesh was ready. Now all nodes are created with `dontStartSequencer: true`, and sequencers are started simultaneously only after mesh formation, committee existence, and epoch advancement are confirmed. Fixes A-593 Fixes A-595
## Summary Fixes a race condition in the `duplicate_attestation_slash` and `duplicate_proposal_slash` e2e tests that caused intermittent timeouts waiting for slashing offenses to be detected. ## Root Cause Investigated CI failure [2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit `f71c235a670d`, merge queue for PR #20555). This run **did include** the fix from #20990, yet the `duplicate_attestation_slash` test still failed with a timeout after ~478 seconds. The failure sequence from the logs: 1. `awaitEpochWithProposer` found the malicious proposer at **slot 14** (epoch 7) 2. The function returned, having warped L1 time to the start of epoch 7 3. Sequencers were then started (`await Promise.all(nodes.map(n => n.getSequencer()!.start()))`) 4. By the time sequencers were ready to build blocks, L1 time had advanced past epoch 7 into **epoch 8** (first block built at slot 16) 5. The malicious proposer was never selected in epoch 8+, so no duplicate proposals/attestations were created 6. The slasher had nothing to detect, and the test timed out The core issue: `awaitEpochWithProposer` warped to the target epoch and returned, but starting sequencers takes real time. With only 2 slots per epoch (48s total), the epoch passed before sequencers could act. ## Fix Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and changed the approach to a two-phase pattern: 1. **Find phase**: The function now checks the **next** epoch's slots (N+1) instead of the current epoch's (N). When the target proposer is found, it returns `{ targetEpoch }` while staying at epoch N -- one full epoch before the target. 2. **Start + warp phase** (in the test): After the function returns, sequencers are started while still one epoch before the target. Only then does the test warp to the target epoch via `advanceToEpoch(targetEpoch)`. This eliminates the race because sequencers are already running when the target epoch begins. The function can safely query future epoch slots because `epochCache.getProposerAttesterAddressInSlot` works for any slot within the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and we only look 1 epoch ahead. ## Changes - **`shared.ts`**: Renamed `awaitEpochWithProposer` -> `advanceToEpochBeforeProposer`. Now checks next epoch's slots and returns `{ targetEpoch: EpochNumber }` instead of `void`. - **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers before warping to target epoch. - **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since the two malicious nodes sharing the same key each self-attest to their own (different) checkpoint proposals, causing honest nodes to also detect a `DUPLICATE_ATTESTATION` offense. Fixes A-632
Fix flakiness in duplicate_proposal_slash and duplicate_attestation_slash e2e tests. Both tests flake at a 3-13% rate because the malicious proposer (1 of 4 validators) is never selected within the timeout window. Three root causes are addressed: 1. **GossipSub mesh checked only for TX topic** — The tests need block proposals and checkpoint proposals to propagate, but `waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a `topics` parameter so callers can specify which topics to wait on. Both slash tests now wait on `tx`, `block_proposal`, and `checkpoint_proposal`. 2. **Proposer selection is probabilistic** — With 4 validators and 2 slots/epoch, the malicious proposer has ~25% chance per slot. Added `awaitEpochWithProposer` helper that advances epochs (via L1 time warp) until the target proposer is deterministically selected for at least one slot in the current epoch. 3. **Race between node startup and first proposal** — Nodes started sequencing immediately upon creation, potentially proposing before the P2P mesh was ready. Now all nodes are created with `dontStartSequencer: true`, and sequencers are started simultaneously only after mesh formation, committee existence, and epoch advancement are confirmed. Fixes A-593 Fixes A-595
## Summary Fixes a race condition in the `duplicate_attestation_slash` and `duplicate_proposal_slash` e2e tests that caused intermittent timeouts waiting for slashing offenses to be detected. ## Root Cause Investigated CI failure [2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit `f71c235a670d`, merge queue for PR #20555). This run **did include** the fix from #20990, yet the `duplicate_attestation_slash` test still failed with a timeout after ~478 seconds. The failure sequence from the logs: 1. `awaitEpochWithProposer` found the malicious proposer at **slot 14** (epoch 7) 2. The function returned, having warped L1 time to the start of epoch 7 3. Sequencers were then started (`await Promise.all(nodes.map(n => n.getSequencer()!.start()))`) 4. By the time sequencers were ready to build blocks, L1 time had advanced past epoch 7 into **epoch 8** (first block built at slot 16) 5. The malicious proposer was never selected in epoch 8+, so no duplicate proposals/attestations were created 6. The slasher had nothing to detect, and the test timed out The core issue: `awaitEpochWithProposer` warped to the target epoch and returned, but starting sequencers takes real time. With only 2 slots per epoch (48s total), the epoch passed before sequencers could act. ## Fix Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and changed the approach to a two-phase pattern: 1. **Find phase**: The function now checks the **next** epoch's slots (N+1) instead of the current epoch's (N). When the target proposer is found, it returns `{ targetEpoch }` while staying at epoch N -- one full epoch before the target. 2. **Start + warp phase** (in the test): After the function returns, sequencers are started while still one epoch before the target. Only then does the test warp to the target epoch via `advanceToEpoch(targetEpoch)`. This eliminates the race because sequencers are already running when the target epoch begins. The function can safely query future epoch slots because `epochCache.getProposerAttesterAddressInSlot` works for any slot within the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and we only look 1 epoch ahead. ## Changes - **`shared.ts`**: Renamed `awaitEpochWithProposer` -> `advanceToEpochBeforeProposer`. Now checks next epoch's slots and returns `{ targetEpoch: EpochNumber }` instead of `void`. - **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers before warping to target epoch. - **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since the two malicious nodes sharing the same key each self-attest to their own (different) checkpoint proposals, causing honest nodes to also detect a `DUPLICATE_ATTESTATION` offense. Fixes A-632
Fix flakiness in duplicate_proposal_slash and duplicate_attestation_slash e2e tests. Both tests flake at a 3-13% rate because the malicious proposer (1 of 4 validators) is never selected within the timeout window. Three root causes are addressed: 1. **GossipSub mesh checked only for TX topic** — The tests need block proposals and checkpoint proposals to propagate, but `waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a `topics` parameter so callers can specify which topics to wait on. Both slash tests now wait on `tx`, `block_proposal`, and `checkpoint_proposal`. 2. **Proposer selection is probabilistic** — With 4 validators and 2 slots/epoch, the malicious proposer has ~25% chance per slot. Added `awaitEpochWithProposer` helper that advances epochs (via L1 time warp) until the target proposer is deterministically selected for at least one slot in the current epoch. 3. **Race between node startup and first proposal** — Nodes started sequencing immediately upon creation, potentially proposing before the P2P mesh was ready. Now all nodes are created with `dontStartSequencer: true`, and sequencers are started simultaneously only after mesh formation, committee existence, and epoch advancement are confirmed. Fixes A-593 Fixes A-595
## Summary Ports the P2P mesh connectivity fix from `next` to `v4-next` for the `duplicate_attestation_slash` and `duplicate_proposal_slash` e2e tests. Cherry-picked commit: `8680abcca7` — chore: deflake duplicate proposals and attestations (#20990) ## Root Cause `waitForP2PMeshConnectivity` only waited for the `tx` GossipSub topic mesh to form. The slash tests also need `block_proposal` and `checkpoint_proposal` meshes ready before sequencers start proposing, otherwise proposals get dropped and offenses are never detected. ## Fix - Added `topics` parameter to `waitForP2PMeshConnectivity` (defaults to `[TopicType.tx]` for backward compat) - Slash tests now wait for all 3 relevant topics before proceeding - Also added `advanceToEpochBeforeProposer` helper so sequencers start before the target epoch arrives Co-authored-by: Michal Rzeszutko <michal.rzeszutko@gmail.com>
Fix flakiness in duplicate_proposal_slash and duplicate_attestation_slash e2e tests.
Both tests flake at a 3-13% rate because the malicious proposer (1 of 4 validators) is never selected within the timeout window. Three root causes are addressed:
GossipSub mesh checked only for TX topic — The tests need block proposals and checkpoint proposals to propagate, but
waitForP2PMeshConnectivityonly verified thetxtopic mesh. Added atopicsparameter so callers can specify which topics to wait on. Both slash tests now wait ontx,block_proposal, andcheckpoint_proposal.Proposer selection is probabilistic — With 4 validators and 2 slots/epoch, the malicious proposer has ~25% chance per slot. Added
awaitEpochWithProposerhelper that advances epochs (via L1 time warp) until the target proposer is deterministically selected for at least one slot in the current epoch.Race between node startup and first proposal — Nodes started sequencing immediately upon creation, potentially proposing before the P2P mesh was ready. Now all nodes are created with
dontStartSequencer: true, and sequencers are started simultaneously only after mesh formation, committee existence, and epoch advancement are confirmed.Fixes A-593
Fixes A-595