Skip to content

chore: deflake duplicate proposals and attestations#20990

Merged
spalladino merged 1 commit into
merge-train/spartanfrom
mr/deflake-duplicate-proposals-and-attestations
Mar 2, 2026
Merged

chore: deflake duplicate proposals and attestations#20990
spalladino merged 1 commit into
merge-train/spartanfrom
mr/deflake-duplicate-proposals-and-attestations

Conversation

@mrzeszutko

Copy link
Copy Markdown
Contributor

Fix flakiness in duplicate_proposal_slash and duplicate_attestation_slash e2e tests.

Both tests flake at a 3-13% rate because the malicious proposer (1 of 4 validators) is never selected within the timeout window. Three root causes are addressed:

  1. GossipSub mesh checked only for TX topic — The tests need block proposals and checkpoint proposals to propagate, but waitForP2PMeshConnectivity only verified the tx topic mesh. Added a topics parameter so callers can specify which topics to wait on. Both slash tests now wait on tx, block_proposal, and checkpoint_proposal.

  2. Proposer selection is probabilistic — With 4 validators and 2 slots/epoch, the malicious proposer has ~25% chance per slot. Added awaitEpochWithProposer helper that advances epochs (via L1 time warp) until the target proposer is deterministically selected for at least one slot in the current epoch.

  3. Race between node startup and first proposal — Nodes started sequencing immediately upon creation, potentially proposing before the P2P mesh was ready. Now all nodes are created with dontStartSequencer: true, and sequencers are started simultaneously only after mesh formation, committee existence, and epoch advancement are confirmed.

Fixes A-593
Fixes A-595

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you confirmed the call to waitForP2PMeshConnectivity is needed? Both tests run with mockGossipSubNetwork, so connectivity should be guaranteed. Did you stumble upon runs where the issue was due to missed p2p messages? If so, we should review if mockGossipSubNetwork is working properly. If not, then we should remove the call to waitForP2PMeshConnectivity.

logger.info(`Checking epoch ${currentEpoch} (slots ${startSlot}-${endSlot - 1}) for proposer ${targetProposer}`);

for (let s = startSlot; s < endSlot; s++) {
const proposer = await epochCache.getProposerAttesterAddressInSlot(SlotNumber(s));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's switch this in favor of getCurrentProposer (see ethereum/src/contracts/rollup.ts method getCurrentProposer) so we don't need an epoch cache as a dependency, and get the data directly from L1.

@mrzeszutko

Copy link
Copy Markdown
Contributor Author

Have you confirmed the call to waitForP2PMeshConnectivity is needed? Both tests run with mockGossipSubNetwork, so connectivity should be guaranteed. Did you stumble upon runs where the issue was due to missed p2p messages? If so, we should review if mockGossipSubNetwork is working properly. If not, then we should remove the call to waitForP2PMeshConnectivity.

@spalladino I didn't - I actually assumed it is there for a reason (it was already there, I did not add it). I'll double check if that's needed

@spalladino

Copy link
Copy Markdown
Contributor

Hah, fair! I didn't notice it was already there. I'm good to merge then, but still worth taking a look at that.

@spalladino spalladino merged commit 8680abc into merge-train/spartan Mar 2, 2026
10 checks passed
@spalladino spalladino deleted the mr/deflake-duplicate-proposals-and-attestations branch March 2, 2026 14:28
github-merge-queue Bot pushed a commit that referenced this pull request Mar 3, 2026
BEGIN_COMMIT_OVERRIDE
fix: track last seen nonce in case of stale fallback L1 RPC node
(#20855)
feat: Validate num txs in block proposals (#20850)
fix(archiver): enforce checkpoint boundary on rollbackTo (#20908)
fix: tps zero metrics (#20656)
fix: handle scientific notation in bigintConfigHelper (#20929)
feat(aztec): node enters standby mode on genesis root mismatch (#20938)
fix: logging of class instances (#20807)
feat(slasher): make slash grace period relative to rollup upgrade time
(#20942)
chore: add script to find PRs to backport (#20956)
chore: remove unused prover-node dep (#20955)
fix: increase minFeePadding in e2e_bot bridge resume tests and harden
GasFees.mul() (#20962)
feat(sequencer): (A-526) rotate publishers when send fails (#20888)
chore: (A-554) bump reth version 1.6.0 -> 1.11.1 for eth devnet (#20889)
chore: metric on how many epochs validator has been on committee
(#20967)
fix: set wallet minFeePadding in BotFactory constructor (#20992)
chore: deflake epoch invalidate block test (#21001)
chore(sequencer): e2e tests for invalid signature recovery in checkpoint
attestations (#20971)
chore: deflake duplicate proposals and attestations (#20990)
chore: deflake epochs mbps test (#21003)
feat: reenable function selectors in txPublicSetupAllowList (#20909)
fix: limit offenses when voting in tally slashing mode by
slashMaxPayloadSize (#20683)
fix(spartan): wire SEQ_L1_PUBLISHING_TIME_ALLOWANCE_IN_SLOT env var
(#21017)
END_COMMIT_OVERRIDE
johnathan79717 pushed a commit that referenced this pull request Mar 4, 2026
Fix flakiness in duplicate_proposal_slash and
duplicate_attestation_slash e2e tests.

Both tests flake at a 3-13% rate because the malicious proposer (1 of 4
validators) is never selected within the timeout window. Three root
causes are addressed:

1. **GossipSub mesh checked only for TX topic** — The tests need block
proposals and checkpoint proposals to propagate, but
`waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a
`topics` parameter so callers can specify which topics to wait on. Both
slash tests now wait on `tx`, `block_proposal`, and
`checkpoint_proposal`.

2. **Proposer selection is probabilistic** — With 4 validators and 2
slots/epoch, the malicious proposer has ~25% chance per slot. Added
`awaitEpochWithProposer` helper that advances epochs (via L1 time warp)
until the target proposer is deterministically selected for at least one
slot in the current epoch.

3. **Race between node startup and first proposal** — Nodes started
sequencing immediately upon creation, potentially proposing before the
P2P mesh was ready. Now all nodes are created with `dontStartSequencer:
true`, and sequencers are started simultaneously only after mesh
formation, committee existence, and epoch advancement are confirmed.

Fixes A-593
Fixes A-595
PhilWindle pushed a commit that referenced this pull request Mar 10, 2026
## Summary

Fixes a race condition in the `duplicate_attestation_slash` and
`duplicate_proposal_slash` e2e tests that caused intermittent timeouts
waiting for slashing offenses to be detected.

## Root Cause

Investigated CI failure
[2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit
`f71c235a670d`, merge queue for PR #20555). This run **did include** the
fix from #20990, yet the `duplicate_attestation_slash` test still failed
with a timeout after ~478 seconds.

The failure sequence from the logs:
1. `awaitEpochWithProposer` found the malicious proposer at **slot 14**
(epoch 7)
2. The function returned, having warped L1 time to the start of epoch 7
3. Sequencers were then started (`await Promise.all(nodes.map(n =>
n.getSequencer()!.start()))`)
4. By the time sequencers were ready to build blocks, L1 time had
advanced past epoch 7 into **epoch 8** (first block built at slot 16)
5. The malicious proposer was never selected in epoch 8+, so no
duplicate proposals/attestations were created
6. The slasher had nothing to detect, and the test timed out

The core issue: `awaitEpochWithProposer` warped to the target epoch and
returned, but starting sequencers takes real time. With only 2 slots per
epoch (48s total), the epoch passed before sequencers could act.

## Fix

Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and
changed the approach to a two-phase pattern:

1. **Find phase**: The function now checks the **next** epoch's slots
(N+1) instead of the current epoch's (N). When the target proposer is
found, it returns `{ targetEpoch }` while staying at epoch N -- one full
epoch before the target.

2. **Start + warp phase** (in the test): After the function returns,
sequencers are started while still one epoch before the target. Only
then does the test warp to the target epoch via
`advanceToEpoch(targetEpoch)`.

This eliminates the race because sequencers are already running when the
target epoch begins.

The function can safely query future epoch slots because
`epochCache.getProposerAttesterAddressInSlot` works for any slot within
the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and
we only look 1 epoch ahead.

## Changes

- **`shared.ts`**: Renamed `awaitEpochWithProposer` ->
`advanceToEpochBeforeProposer`. Now checks next epoch's slots and
returns `{ targetEpoch: EpochNumber }` instead of `void`.
- **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers
before warping to target epoch.
- **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered
offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since
the two malicious nodes sharing the same key each self-attest to their
own (different) checkpoint proposals, causing honest nodes to also
detect a `DUPLICATE_ATTESTATION` offense.

Fixes A-632
AztecBot pushed a commit that referenced this pull request Mar 19, 2026
Fix flakiness in duplicate_proposal_slash and
duplicate_attestation_slash e2e tests.

Both tests flake at a 3-13% rate because the malicious proposer (1 of 4
validators) is never selected within the timeout window. Three root
causes are addressed:

1. **GossipSub mesh checked only for TX topic** — The tests need block
proposals and checkpoint proposals to propagate, but
`waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a
`topics` parameter so callers can specify which topics to wait on. Both
slash tests now wait on `tx`, `block_proposal`, and
`checkpoint_proposal`.

2. **Proposer selection is probabilistic** — With 4 validators and 2
slots/epoch, the malicious proposer has ~25% chance per slot. Added
`awaitEpochWithProposer` helper that advances epochs (via L1 time warp)
until the target proposer is deterministically selected for at least one
slot in the current epoch.

3. **Race between node startup and first proposal** — Nodes started
sequencing immediately upon creation, potentially proposing before the
P2P mesh was ready. Now all nodes are created with `dontStartSequencer:
true`, and sequencers are started simultaneously only after mesh
formation, committee existence, and epoch advancement are confirmed.

Fixes A-593
Fixes A-595
AztecBot pushed a commit that referenced this pull request Mar 19, 2026
## Summary

Fixes a race condition in the `duplicate_attestation_slash` and
`duplicate_proposal_slash` e2e tests that caused intermittent timeouts
waiting for slashing offenses to be detected.

## Root Cause

Investigated CI failure
[2abee794200ae6a7](http://ci.aztec-labs.com/2abee794200ae6a7) (commit
`f71c235a670d`, merge queue for PR #20555). This run **did include** the
fix from #20990, yet the `duplicate_attestation_slash` test still failed
with a timeout after ~478 seconds.

The failure sequence from the logs:
1. `awaitEpochWithProposer` found the malicious proposer at **slot 14**
(epoch 7)
2. The function returned, having warped L1 time to the start of epoch 7
3. Sequencers were then started (`await Promise.all(nodes.map(n =>
n.getSequencer()!.start()))`)
4. By the time sequencers were ready to build blocks, L1 time had
advanced past epoch 7 into **epoch 8** (first block built at slot 16)
5. The malicious proposer was never selected in epoch 8+, so no
duplicate proposals/attestations were created
6. The slasher had nothing to detect, and the test timed out

The core issue: `awaitEpochWithProposer` warped to the target epoch and
returned, but starting sequencers takes real time. With only 2 slots per
epoch (48s total), the epoch passed before sequencers could act.

## Fix

Renamed `awaitEpochWithProposer` to `advanceToEpochBeforeProposer` and
changed the approach to a two-phase pattern:

1. **Find phase**: The function now checks the **next** epoch's slots
(N+1) instead of the current epoch's (N). When the target proposer is
found, it returns `{ targetEpoch }` while staying at epoch N -- one full
epoch before the target.

2. **Start + warp phase** (in the test): After the function returns,
sequencers are started while still one epoch before the target. Only
then does the test warp to the target epoch via
`advanceToEpoch(targetEpoch)`.

This eliminates the race because sequencers are already running when the
target epoch begins.

The function can safely query future epoch slots because
`epochCache.getProposerAttesterAddressInSlot` works for any slot within
the `lagInEpochsForValidatorSet` window (typically 2 epochs ahead), and
we only look 1 epoch ahead.

## Changes

- **`shared.ts`**: Renamed `awaitEpochWithProposer` ->
`advanceToEpochBeforeProposer`. Now checks next epoch's slots and
returns `{ targetEpoch: EpochNumber }` instead of `void`.
- **`duplicate_attestation_slash.test.ts`**: Updated to start sequencers
before warping to target epoch.
- **`duplicate_proposal_slash.test.ts`**: Same change. Also filtered
offense assertions to only check `DUPLICATE_PROPOSAL` offenses, since
the two malicious nodes sharing the same key each self-attest to their
own (different) checkpoint proposals, causing honest nodes to also
detect a `DUPLICATE_ATTESTATION` offense.

Fixes A-632
AztecBot pushed a commit that referenced this pull request Mar 20, 2026
Fix flakiness in duplicate_proposal_slash and
duplicate_attestation_slash e2e tests.

Both tests flake at a 3-13% rate because the malicious proposer (1 of 4
validators) is never selected within the timeout window. Three root
causes are addressed:

1. **GossipSub mesh checked only for TX topic** — The tests need block
proposals and checkpoint proposals to propagate, but
`waitForP2PMeshConnectivity` only verified the `tx` topic mesh. Added a
`topics` parameter so callers can specify which topics to wait on. Both
slash tests now wait on `tx`, `block_proposal`, and
`checkpoint_proposal`.

2. **Proposer selection is probabilistic** — With 4 validators and 2
slots/epoch, the malicious proposer has ~25% chance per slot. Added
`awaitEpochWithProposer` helper that advances epochs (via L1 time warp)
until the target proposer is deterministically selected for at least one
slot in the current epoch.

3. **Race between node startup and first proposal** — Nodes started
sequencing immediately upon creation, potentially proposing before the
P2P mesh was ready. Now all nodes are created with `dontStartSequencer:
true`, and sequencers are started simultaneously only after mesh
formation, committee existence, and epoch advancement are confirmed.

Fixes A-593
Fixes A-595
ludamad pushed a commit that referenced this pull request Mar 20, 2026
## Summary
Ports the P2P mesh connectivity fix from `next` to `v4-next` for the
`duplicate_attestation_slash` and `duplicate_proposal_slash` e2e tests.

Cherry-picked commit: `8680abcca7` — chore: deflake duplicate proposals
and attestations (#20990)

## Root Cause
`waitForP2PMeshConnectivity` only waited for the `tx` GossipSub topic
mesh to form. The slash tests also need `block_proposal` and
`checkpoint_proposal` meshes ready before sequencers start proposing,
otherwise proposals get dropped and offenses are never detected.

## Fix
- Added `topics` parameter to `waitForP2PMeshConnectivity` (defaults to
`[TopicType.tx]` for backward compat)
- Slash tests now wait for all 3 relevant topics before proceeding
- Also added `advanceToEpochBeforeProposer` helper so sequencers start
before the target epoch arrives

Co-authored-by: Michal Rzeszutko <michal.rzeszutko@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants