fix: fork closure in epoch proving jobs #23390
spalladino
left a comment
LGTM. Just left a few comments on possible refactors.
// temporary stack to control fork lifetime
await using cleanup = new AsyncDisposableStack();
private async processCheckpoints(
  parallelism: number,
  processCheckpoint: (checkpoint: Checkpoint) => Promise<void>,
): Promise<void> {
  let hasError = false;
  let firstError: unknown;

  await asyncPool(Math.max(parallelism, 1), this.checkpoints, async checkpoint => {
    if (hasError || this.abortController.signal.aborted) {
      return;
    }

    try {
      this.checkState();
      await processCheckpoint(checkpoint);
    } catch (err) {
      if (!hasError) {
        hasError = true;
        firstError = err;
        this.failProcessing();
      }
    }
  });

  if (hasError) {
    throw firstError;
  }

  if (this.abortController.signal.aborted) {
    this.checkState();
  }
}
Sounds like this is something we could bake into the asyncPool helper directly?
I was wondering the same thing, but I didn't because the pool was forked from some repo.
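For illustration, the fail-fast bookkeeping above could live in a pool helper along these lines. This is only a sketch under assumptions: `asyncPoolFailFast` is a hypothetical name, it is not the forked `asyncPool`, and the optional `signal` parameter stands in for `this.abortController.signal`.

```typescript
// Hypothetical helper, not the asyncPool used in this PR.
// Runs `worker` over `items` with at most `concurrency` in flight,
// stops handing out new items after the first failure or an abort,
// and rethrows the first error once in-flight work has settled.
async function asyncPoolFailFast<T>(
  concurrency: number,
  items: readonly T[],
  worker: (item: T) => Promise<void>,
  signal?: AbortSignal,
): Promise<void> {
  let hasError = false;
  let firstError: unknown;
  let next = 0;

  // Each runner claims the next unprocessed item until the list is drained,
  // a worker has failed, or the signal has aborted.
  const runner = async (): Promise<void> => {
    while (next < items.length && !hasError && !signal?.aborted) {
      const item = items[next++];
      try {
        await worker(item);
      } catch (err) {
        if (!hasError) {
          hasError = true;
          firstError = err;
        }
      }
    }
  };

  const width = Math.min(Math.max(concurrency, 1), items.length);
  await Promise.all(Array.from({ length: width }, () => runner()));

  if (hasError) {
    throw firstError;
  }
  // Surface an abort that happened without any worker failing.
  signal?.throwIfAborted();
}
```

With something like this, `processCheckpoints` would reduce to one call plus `failProcessing()` in a `catch`, assuming the pool does not need to return per-item results.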
  return await execWithSignal(
    () => processFn(),
    processingSignal,
    signal =>
      signal.reason?.name === 'TimeoutError' ? new PublicProcessorTimeoutError() : new PublicProcessorAbortError(),
  );
}

private getProcessingSignal(tx: Tx, deadline: Date | undefined, signal: AbortSignal | undefined) {
  if (!deadline) {
    return signal;
  }

  const timeout = +deadline - this.dateProvider.now();
  if (timeout <= 0) {
    throw new PublicProcessorTimeoutError();
  }

  const txHash = tx.getTxHash();
  this.log.debug(`Processing tx ${txHash.toString()} within ${timeout}ms`, {
    deadline: deadline.toISOString(),
    now: new Date(this.dateProvider.now()).toISOString(),
    txHash,
  });

  return await executeTimeout(
    () => processFn(),
    timeout,
    () => new PublicProcessorTimeoutError(),
  );
  const timeoutSignal = AbortSignal.timeout(timeout);
  return signal ? AbortSignal.any([signal, timeoutSignal]) : timeoutSignal;
}
Seems like this is calling for an overload of executeTimeout or execWithSignal that has both timeout and signal?
Honestly, I'd remove executeTimeout in favour of execWithSignal now that we can use AbortSignal.timeout in recent node versions.
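As a rough sketch of what a single entry point taking both a timeout and a signal might look like: `runWithDeadline`, `DeadlineExceededError`, and `CancelledError` below are hypothetical names used for illustration, not the existing `execWithSignal` or the `PublicProcessor*` errors. It relies only on `AbortSignal.timeout` and `AbortSignal.any`, which the diff already uses.

```typescript
// Hypothetical stand-ins for the timeout/abort error classes in the diff.
class DeadlineExceededError extends Error {}
class CancelledError extends Error {}

// Hypothetical helper: runs `fn` and rejects early if either the caller's
// signal aborts or the timeout elapses, whichever comes first.
async function runWithDeadline<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  opts: { timeoutMs?: number; signal?: AbortSignal },
): Promise<T> {
  const signals: AbortSignal[] = [];
  if (opts.signal) {
    signals.push(opts.signal);
  }
  if (opts.timeoutMs !== undefined) {
    signals.push(AbortSignal.timeout(opts.timeoutMs));
  }
  if (signals.length === 0) {
    // Nothing to race against: run with a signal that never aborts.
    return fn(new AbortController().signal);
  }

  const combined = signals.length === 1 ? signals[0] : AbortSignal.any(signals);
  // AbortSignal.timeout aborts with a DOMException named 'TimeoutError'.
  const toError = () =>
    combined.reason?.name === 'TimeoutError' ? new DeadlineExceededError() : new CancelledError();

  if (combined.aborted) {
    throw toError();
  }

  let onAbort: () => void = () => {};
  const aborted = new Promise<never>((_, reject) => {
    onAbort = () => reject(toError());
    combined.addEventListener('abort', onAbort, { once: true });
  });

  try {
    return await Promise.race([fn(combined), aborted]);
  } finally {
    // Detach the listener so a later abort cannot reject an unobserved promise.
    combined.removeEventListener('abort', onAbort);
  }
}
```

Note that racing only stops the caller from waiting; `fn` still has to observe the signal itself if the underlying work should actually stop.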
Handle fork lifetime correctly around checkpoints that might be cancelled part way through processing.
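A minimal sketch of the idea, assuming a fork exposes an async `close()`: `ForkHandle` and `withFork` are hypothetical names, and the PR itself scopes cleanup with `await using` and an `AsyncDisposableStack` rather than the plain `try`/`finally` shown here.

```typescript
// Hypothetical fork handle; the real world-state fork API is not shown here.
interface ForkHandle {
  close(): Promise<void>;
}

// Opens a fork, runs `body` against it, and always closes the fork afterwards,
// whether `body` resolves, throws, or is cancelled part way through.
async function withFork<T>(
  openFork: () => Promise<ForkHandle>,
  body: (fork: ForkHandle) => Promise<T>,
): Promise<T> {
  const fork = await openFork();
  try {
    return await body(fork);
  } finally {
    // Runs on success, failure, and cancellation alike.
    await fork.close();
  }
}
```

Tying the close to the scope that opened the fork means a checkpoint cancelled mid-processing cannot leave its fork open.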