chore(e2e): tolerate one missing block per checkpoint in high tps test #22834
Closed
AztecBot wants to merge 2 commits into
Closing in favor of #22846
Deflakes `epochs_high_tps_block_building.test.ts`, which failed the merge-train/spartan CI run 25086618965 (log) with `Expected length: 4 / Received length: 3` at line 192.

Cause
Each sub-slot, the proposer (`yarn-project/sequencer-client/src/sequencer/checkpoint_proposal_job.ts`) enters `WAITING_FOR_TXS` and polls `p2pClient.getPendingTxCount()`. If the sub-slot's budget runs out with `availableTxs < minTxsPerBlock`, `tryBuildBlock` returns `failure: insufficient-txs` and the loop skips ahead via `waitUntilNextSubslot`. Two non-`txDelayer` causes can drop a sub-slot under CI load: (1) a prior block ran past its `blockDuration` budget, leaving no time for the next one; (2) the mempool was momentarily empty for that sub-slot's polling window. This is the same flake class the author already tolerates two lines below for tx counts (`// We don't test for exactly TXS_PER_BLOCK since CI delays make this flakey`).

(My earlier description blamed `txDelayerMaxInclusionTimeIntoSlot`. Santiago flagged that as wrong: it controls L1 publishing latency only, i.e. the proposer's L1 propose tx must land within 1s of the last L1 block, otherwise it is deferred to the next L1 block. That is the mechanism behind the existing `expect([0, 1]).toContain(l1OffsetInSlot)` assertion, not the cause of a sub-slot building zero blocks.)

Fix
Replace the strict `toHaveLength(BLOCKS_PER_CHECKPOINT)` with `>= BLOCKS_PER_CHECKPOINT - 1`, `<= BLOCKS_PER_CHECKPOINT`. The upper bound is preserved (it catches a regression that produces too many blocks). The `failEvents` assertion at the end of the test still catches sequencer errors, and `expect(checkedFullCheckpoints).toBe(CHECKPOINTS_TO_CHECK)` still requires two qualifying checkpoints.

The flake group `e2e-p2p-epoch-flakes` did not absorb this one because the test failed both the initial run and its retry: `flake_error_threshold` only counts FLAKED runs, not hard FAILs.

Full analysis: https://gist.github.com/AztecBot/c18984c05764251bc6136af08831517a
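
For reviewers, the relaxed bound from the Fix section can be sketched as a standalone predicate. This is a minimal illustration, not the code in the diff: `isAcceptableBlockCount` is a hypothetical helper, and the value `4` for `BLOCKS_PER_CHECKPOINT` is only inferred from the `Expected length: 4` message in the CI log.

```typescript
// Hypothetical sketch of the relaxed assertion; not taken from the actual test file.
// ASSUMPTION: BLOCKS_PER_CHECKPOINT is 4, inferred from "Expected length: 4" in the CI log.
const BLOCKS_PER_CHECKPOINT = 4;

/**
 * Returns true when a checkpoint's block count is within the tolerated range:
 * at most one sub-slot may have produced no block, and the count may never
 * exceed the configured number of blocks per checkpoint.
 */
function isAcceptableBlockCount(
  blockCount: number,
  blocksPerCheckpoint: number = BLOCKS_PER_CHECKPOINT,
): boolean {
  const lowerBound = blocksPerCheckpoint - 1; // tolerate one dropped sub-slot
  const upperBound = blocksPerCheckpoint; // still catch "too many blocks" regressions
  return blockCount >= lowerBound && blockCount <= upperBound;
}

// The failing CI run saw 3 blocks; the old strict check required exactly 4.
for (const count of [2, 3, 4, 5]) {
  console.log(`${count} blocks acceptable: ${isAcceptableBlockCount(count)}`);
}
```

In a Jest test the same bound would typically be written as a pair of `toBeGreaterThanOrEqual` / `toBeLessThanOrEqual` expectations on the block array's length; the exact variable names in the test file are not shown in this description.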