feat: merge-train/spartan#22660

Merged
AztecBot merged 3 commits into next from merge-train/spartan
Apr 20, 2026

@AztecBot AztecBot commented Apr 20, 2026

Collaborator

BEGIN_COMMIT_OVERRIDE
fix: increase gossipsub mcache length to prevent valid messages being dropped (#22498)
fix(p2p): bump proposal_tx_collector bench TIMEOUT 1200s -> 1800s (#22661)
END_COMMIT_OVERRIDE

… dropped (#22498)

## Summary

- Doubles the gossipsub `mcacheLength` default from 6 to 12, extending
the async validation window from ~4.2s to ~8.4s (~12% of a 72s slot)
- Adds a warning log when gossip validation time approaches 75% of the
mcache eviction window for early detection
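
The figures in the bullets above can be reproduced with a short calculation. This is only a sketch: the 700 ms heartbeat interval is inferred from the stated ~4.2s window (4.2s / 6), and the constant and function names, as well as the exact comparison used for the 75% warning, are assumptions rather than the code in the PR.

```typescript
// Assumption: 700 ms heartbeat, inferred from the ~4.2s window quoted above (4.2s / 6).
const HEARTBEAT_INTERVAL_MS = 700;
const SLOT_DURATION_MS = 72_000; // 72s slot, as stated in the summary
const WARN_FRACTION = 0.75; // warn once validation uses 75% of the window

// Eviction window = mcacheLength x heartbeatInterval.
function evictionWindowMs(mcacheLength: number, heartbeatMs: number = HEARTBEAT_INTERVAL_MS): number {
  return mcacheLength * heartbeatMs;
}

// Hypothetical check: true when a validation took long enough to deserve a warning.
function shouldWarn(validationMs: number, mcacheLength: number): boolean {
  return validationMs >= WARN_FRACTION * evictionWindowMs(mcacheLength);
}

const oldWindow = evictionWindowMs(6); // 4200 ms
const newWindow = evictionWindowMs(12); // 8400 ms
const slotShare = newWindow / SLOT_DURATION_MS; // roughly 0.117, i.e. ~12% of a slot

console.log({ oldWindow, newWindow, slotSharePct: (slotShare * 100).toFixed(1) });
console.log(shouldWarn(5000, 12)); // below the 6300 ms threshold
console.log(shouldWarn(6500, 12)); // above the 6300 ms threshold
```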

The gossipsub message cache eviction window (`mcacheLength ×
heartbeatInterval`) was too short at ~4.2s. When `asyncValidation` is
enabled, `reportMessageValidationResult(Accept)` only forwards a message
if it's still in the mcache. Checkpoint attestation validation involves
epoch cache lookups (potentially cold L1 RPC calls) and signature
recovery, which can exceed this window under load, causing valid
messages to be silently dropped.
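
The drop described above can be illustrated with a toy model of a sliding message cache. This is a simplified sketch, not the gossipsub implementation: `ToyMessageCache` and `survivesValidation` are hypothetical names, and the model only assumes that the cache holds `mcacheLength` windows and shifts one window per heartbeat.

```typescript
// Toy sliding-window cache: index 0 is the newest window, the last index is the oldest.
class ToyMessageCache {
  private windows: Set<string>[];

  constructor(length: number) {
    this.windows = Array.from({ length }, () => new Set<string>());
  }

  // New messages always land in the newest window.
  put(msgId: string): void {
    this.windows[0].add(msgId);
  }

  has(msgId: string): boolean {
    return this.windows.some((w) => w.has(msgId));
  }

  // Called once per heartbeat: evict the oldest window, open a fresh one.
  shift(): void {
    this.windows.pop();
    this.windows.unshift(new Set<string>());
  }
}

// Returns true if a message is still cached (and so could be forwarded)
// when its validation result arrives after the given number of heartbeats.
function survivesValidation(mcacheLength: number, heartbeatsUntilAccept: number): boolean {
  const cache = new ToyMessageCache(mcacheLength);
  cache.put("msg-1");
  for (let i = 0; i < heartbeatsUntilAccept; i++) {
    cache.shift();
  }
  return cache.has("msg-1");
}

// A validation that spans 7 heartbeats outlives a 6-window cache but not a 12-window one.
console.log(survivesValidation(6, 7));
console.log(survivesValidation(12, 7));
```

In this model a message survives only while fewer heartbeats than `mcacheLength` have elapsed, which is why doubling the length doubles the time validation may take before an accepted message can no longer be forwarded.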

Fixes
[A-763](https://linear.app/aztec-labs/issue/A-763/audit-94-gossip-validation-timeout-too-short-for-block)
AztecBot and others added 2 commits April 20, 2026 12:59
…2661)

## Summary

The `p2p_client.proposal_tx_collector.bench.test.ts` suite flakes when
the container-level `timeout -v 1200` (20 min) fires mid-suite. Bumping
the bash `TIMEOUT` to 1800s (30 min) leaves 12-15 min of headroom over
the typical 15-18 min happy-path runtime.

Example flake: http://ci.aztec-labs.com/e7d212f49389d8cf — suite
progressed through case 17 of 24 before being TERM'd at exactly 1201s.

## Root cause

24 cases (3 distributions x 4 missingTxCounts x 2 collectors) run under
`jest --runInBand`. Per-case parent budget is ~120s (30s collector + 60s
peer wait + 30s buffer). Stress cases like `sparse/missing=500` reliably
hit the 30s collector timeout, which degrades peer connectivity for
subsequent cases (see cascade in the log:
`sparse/500/send-batch-request` failing in 646ms with `Max retry
attempts 3 reached` after the preceding batch-requester case stressed
the mesh).
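
The numbers above can be checked with a small budget calculation. It is a sketch with hypothetical variable names, using only the figures quoted in this description (3 x 4 x 2 cases, a 30s + 60s + 30s per-case budget, a 15-18 min typical runtime, and the old and new container timeouts).

```typescript
// Case matrix: 3 distributions x 4 missingTxCounts x 2 collectors.
const caseCount = 3 * 4 * 2; // 24

// Per-case parent budget: collector timeout + peer wait + buffer, in seconds.
const perCaseBudgetS = 30 + 60 + 30; // 120

// Container-level timeouts before and after the change, in seconds.
const oldTimeoutS = 1200;
const newTimeoutS = 1800;

// Typical happy-path runtime range (15-18 min), in seconds.
const typicalMinS = 15 * 60;
const typicalMaxS = 18 * 60;

// Headroom against the slowest typical run.
const oldHeadroomS = oldTimeoutS - typicalMaxS; // 120s, i.e. 2 min
const newHeadroomS = newTimeoutS - typicalMaxS; // 720s, i.e. 12 min
const newHeadroomBestS = newTimeoutS - typicalMinS; // 900s, i.e. 15 min

// Nominal upper bound if every case consumed its full budget (not the typical run).
const nominalWorstCaseS = caseCount * perCaseBudgetS; // 2880s

console.log({ caseCount, perCaseBudgetS, oldHeadroomS, newHeadroomS, newHeadroomBestS, nominalWorstCaseS });
```

With the old timeout, a run at the slow end of the typical range had only about 2 min to spare, so a few stress cases hitting their 30s collector timeout were enough to trip the container-level limit.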

The anti-flake mitigations from #22240 (chunkSize 1->8, peer-score
reset, `waitForConnectivity` in `beforeEach`) and #22046 (graceful
worker exit) are already in place but don't shrink the overall
wall-clock — they only prevent some cascade failures.

## Change

One line in `yarn-project/bootstrap.sh` for the bench command:
`TIMEOUT=1200` -> `TIMEOUT=1800`. The test runs with `ISOLATE=1`, so its
container doesn't block other jobs.

Jest's `TEST_TIMEOUT_MS = 600_000` (10 min per `it`) still catches
runaway individual cases.

Full analysis:
https://gist.github.com/AztecBot/698da6dae84abf742f74ec10b36c79bc


ClaudeBox log: https://claudebox.work/s/e9700e98c090ffca?run=1

@ludamad ludamad left a comment

Collaborator

🤖 Auto-approved

@AztecBot

Collaborator Author

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

@AztecBot AztecBot added this pull request to the merge queue Apr 20, 2026

AztecBot commented Apr 20, 2026

Collaborator Author

Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/c62dd5f2c17df82e): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_invalidate_block.parallel.test.ts "proposer invalidates multiple checkpoints" (412s) (code: 0) group:e2e-p2p-epoch-flakes

Merged via the queue into next with commit 8fbd9d0 Apr 20, 2026
24 checks passed

3 participants