feat: merge-train/spartan#22660
Merged
… dropped (#22498)

## Summary

- Doubles the gossipsub `mcacheLength` default from 6 to 12, extending the async validation window from ~4.2s to ~8.4s (~12% of a 72s slot)
- Adds a warning log when gossip validation time approaches 75% of the mcache eviction window for early detection

The gossipsub message cache eviction window (`mcacheLength × heartbeatInterval`) was too short at ~4.2s. When `asyncValidation` is enabled, `reportMessageValidationResult(Accept)` only forwards a message if it's still in the mcache. Checkpoint attestation validation involves epoch cache lookups (potentially cold L1 RPC calls) and signature recovery, which can exceed this window under load, causing valid messages to be silently dropped.

Fixes [A-763](https://linear.app/aztec-labs/issue/A-763/audit-94-gossip-validation-timeout-too-short-for-block)
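The window arithmetic above can be sketched as follows. This is a minimal illustration, not the code from #22498: the `McacheConfig` type, the function names, and the 700 ms heartbeat interval are assumptions (the interval is inferred from the quoted 6 → ~4.2s figure); only `mcacheLength`, the heartbeat interval as a factor, and the 75% warning threshold come from the description.

```typescript
// Hypothetical config shape; only `mcacheLength` is named in the PR text.
interface McacheConfig {
  mcacheLength: number;
  heartbeatIntervalMs: number; // assumed 700 ms, inferred from 6 -> ~4.2s
}

// Fraction of the eviction window at which a warning is logged (from the PR summary).
const WARN_FRACTION = 0.75;

// Eviction window = mcacheLength × heartbeatInterval.
function evictionWindowMs(cfg: McacheConfig): number {
  return cfg.mcacheLength * cfg.heartbeatIntervalMs;
}

// True when a validation took long enough to approach the eviction window.
function shouldWarn(validationMs: number, cfg: McacheConfig): boolean {
  return validationMs >= WARN_FRACTION * evictionWindowMs(cfg);
}

const oldCfg: McacheConfig = { mcacheLength: 6, heartbeatIntervalMs: 700 };
const newCfg: McacheConfig = { mcacheLength: 12, heartbeatIntervalMs: 700 };

console.log(evictionWindowMs(oldCfg)); // 4200
console.log(evictionWindowMs(newCfg)); // 8400
console.log(shouldWarn(5000, newCfg)); // false
```

Under these assumptions, a 5s validation would already have outlived the old 4.2s window, while with the new default it stays below the 6.3s warning threshold.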
…2661)

## Summary

The `p2p_client.proposal_tx_collector.bench.test.ts` suite flakes when the container-level `timeout -v 1200` (20 min) fires mid-suite. Bumping the bash `TIMEOUT` to 1800s (30 min) gives the benchmark ~12 min of headroom over the typical 15-18 min happy-path runtime.

Example flake: http://ci.aztec-labs.com/e7d212f49389d8cf (suite progressed through case 17 of 24 before being TERM'd at exactly 1201s).

## Root cause

24 cases (3 distributions x 4 missingTxCounts x 2 collectors) run under `jest --runInBand`. Per-case parent budget is ~120s (30s collector + 60s peer wait + 30s buffer). Stress cases like `sparse/missing=500` reliably hit the 30s collector timeout, which degrades peer connectivity for subsequent cases (see cascade in the log: `sparse/500/send-batch-request` failing in 646ms with `Max retry attempts 3 reached` after the preceding batch-requester case stressed the mesh).

The anti-flake mitigations from #22240 (chunkSize 1->8, peer-score reset, `waitForConnectivity` in `beforeEach`) and #22046 (graceful worker exit) are already in place but don't shrink the overall wall-clock; they only prevent some cascade failures.

## Change

One line in `yarn-project/bootstrap.sh` for the bench command: `TIMEOUT=1200` -> `TIMEOUT=1800`. The test runs with `ISOLATE=1`, so its container doesn't block other jobs. Jest's `TEST_TIMEOUT_MS = 600_000` (10 min per `it`) still catches runaway individual cases.

Full analysis: https://gist.github.com/AztecBot/698da6dae84abf742f74ec10b36c79bc

ClaudeBox log: https://claudebox.work/s/e9700e98c090ffca?run=1
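The timing figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch only: all identifiers are hypothetical, and the worst-case total is simply the quoted per-case budget multiplied by the case count, not a measured runtime.

```typescript
// Case matrix from the root-cause section: 3 distributions x 4 missingTxCounts x 2 collectors.
const caseCount = 3 * 4 * 2;

// Per-case parent budget in seconds: collector timeout + peer wait + buffer.
const perCaseBudgetS = 30 + 60 + 30;

// Theoretical upper bound if every case used its full budget (illustrative only).
const worstCaseS = caseCount * perCaseBudgetS;

// Headroom in minutes between a container timeout (seconds) and a runtime (minutes).
function headroomMinutes(timeoutS: number, runtimeMin: number): number {
  return timeoutS / 60 - runtimeMin;
}

console.log(caseCount); // 24
console.log(perCaseBudgetS); // 120
console.log(headroomMinutes(1200, 18)); // 2
console.log(headroomMinutes(1800, 18)); // 12
```

With the old 1200s limit, an 18 min happy-path run leaves only 2 min of slack, so a handful of cases hitting their timeouts is enough to trip the container-level kill. The 1800s limit restores the ~12 min of headroom mentioned in the summary, while still sitting below the theoretical full-budget total.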
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.
Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a `.test_patterns.yml` entry.
BEGIN_COMMIT_OVERRIDE
fix: increase gossipsub mcache length to prevent valid messages being dropped (#22498)
fix(p2p): bump proposal_tx_collector bench TIMEOUT 1200s -> 1800s (#22661)
END_COMMIT_OVERRIDE