feat(prover-node): checkpoint-driven optimistic proving via L2BlockStream (A-955)#22783
Conversation
46e40c7 to
36fda66
Compare
22623c3 to
ecb3a07
Compare
…ream (A-955) Reworks the prover node around an L2BlockStream-driven flow where each EpochProvingJob owns its own lifecycle: - The job stores all checkpoints (pending and tracked) in a single map keyed by checkpoint number, with the caller-supplied orchestrator index pinned at register time so registration and addCheckpoint can race in either order. - New `completeEpoch()` / `whenComplete()` API: the prover node hands the epoch off to the job once L1 has signalled completion and every expected checkpoint has been registered; the job waits for any in-flight gather to settle, optionally sleeps for `finalizationDelayMs` (lets late-arriving events be processed), then runs `finalizeAndProve` and resolves the completion promise. - `removeCheckpoint(N)` is now async, idempotent, and serialises against any in-flight `addCheckpoint` so a re-org's prune + replacement can re-register the same checkpoint number without racing the orchestrator. - Per-entry `attestations` and `previousBlockHeader` (and a per-entry `checkpointIndex`) eliminate the previous job-level state derived from addition order, fixing the subtle bug where a partially-proven epoch's checkpoints could be skipped. Orchestrator side gains `removeCheckpoint(index)` for arbitrary slot removal, removing the LIFO-only constraint and the tail bookkeeping on the job. Prover node simplifications: - Drops `epochsReadyToFinalize`, `epochsFinalizing`, `epochAttestations`, `tryFinalizeEpoch`, and `finalizeEpochJob` in favour of a single `tryCompleteEpoch` hand-off and a `cleanupCompletedJob` chained off `whenComplete()`. - `computeStartingBlock` and `handleCheckpointEvent` use the same "is the proven block the last block of its epoch?" rule via a new `isEpochFullyProven` helper, so partially-proven epochs (e.g. epoch with 32 slots but proven only at slot 10) ingest every checkpoint correctly. - Catch-up sleep removed from `gatherAndAddCheckpoint`; the same `proverNodeEpochProvingDelayMs` config now feeds the job's `finalizationDelayMs` and the EpochMonitor's existing delay. E2E coverage: - Happy-path single + multi-epoch checkpoint-driven proving. - Mid-epoch reorg with replacement, mid-epoch reorg without replacement, last-slot reorg without replacement. - Reorg during proving: gates `prover.finalizeEpoch`, computes reorg depth from the published L1 block, cheats `proven` on the rollup so the next epoch's proof can land, and asserts a subsequent epoch is proven on chain.
218b4b6 to
d79de4f
Compare
…eckpoint `monitor.run()` reads the rollup contract directly so its `checkpointNumber` can be ahead of what `node.getCheckpoints` (archiver-backed) has indexed, producing an empty array and a `Cannot read properties of undefined` crash on the destructured `cp`. Wrap each lookup in `retryUntil` so the helpers wait up to 30s for the archiver to catch up.
Flakey Tests🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry. |
spalladino
left a comment
There was a problem hiding this comment.
Got halfway through it
There was a problem hiding this comment.
Was this intended to be included with this PR?
| const block = removed.getBlockProvingStateByBlockNumber(blockNumber); | ||
| if (block) { | ||
| for (const tx of block.getProcessedTxs()) { | ||
| this.provingState.cachedChonkVerifierProofs.delete(tx.hash.toString()); |
There was a problem hiding this comment.
Heads up this could delete a useful verification if the tx gets added to another checkpoint in the same epoch after the reorg
|
|
||
| const removed = this.checkpoints[lastIndex]!; | ||
| this.checkpoints[lastIndex] = undefined; | ||
| this.checkpoints[checkpointIndex] = undefined; |
There was a problem hiding this comment.
I haven't checked who calls this yet, but shouldn't this method be removing this checkpoint and all subsequent ones, since later checkpoints probably depended on this one?
| if (this.isEpochStructureFinalized) { | ||
| throw new Error('Cannot remove checkpoints after epoch structure has been finalized.'); | ||
| } |
There was a problem hiding this comment.
What happens if we hit a reorg that removes the last checkpoint in the epoch after we processed it and finalized the epoch structure? Should we account for that?
|
|
||
| it( | ||
| 'after removing last checkpoint, remaining checkpoint state is consistent', | ||
| 'remove from middle of list leaves a hole and preserves surrounding checkpoints', |
There was a problem hiding this comment.
This feels dangerous, since later checkpoints have computed its outHashHint based on a now-removed checkpoint
| /** Marks the checkpoint as removed. Subsequent verifyState() returns false and reject() is a no-op. */ | ||
| public markRemoved() { | ||
| this.removed = true; | ||
| } |
There was a problem hiding this comment.
Should we also remove all related proving jobs from the prover broker?
There was a problem hiding this comment.
Note to self: still got to review this one
…rators directly EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a single EpochProver. The previous waitForAllCheckpointsReady step is gone — the top tree starts as soon as every checkpoint is tracked and pipelines its checkpoint root rollups against the still-pending sub-tree result promises. Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a prune) is now atomic and does not affect the others — the cross-checkpoint state coupling that triggered Palla's review concerns on #22783 is contained to the top-tree's lifetime. Also: - ProverClient implements a new EpochProverFactory interface with createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The legacy createEpochProver remains for the orchestrator_*.test.ts (deleted in commit 2b). - EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve, afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean patch surface — but the four affected tests migrate to spying on createTopTreeOrchestrator and patching prove(), which is the closer analog to the legacy finalizeEpoch patch. - BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so the job can manage its lifecycle. - CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the top-tree data assembly is synchronous (no awaiting block-level proving). epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests that used to spy on createEpochProver().finalizeEpoch are migrated to spy on createTopTreeOrchestrator().orchestrator.prove.
…rators directly EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a single EpochProver. The previous waitForAllCheckpointsReady step is gone — the top tree starts as soon as every checkpoint is tracked and pipelines its checkpoint root rollups against the still-pending sub-tree result promises. Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a prune) is now atomic and does not affect the others — the cross-checkpoint state coupling that triggered Palla's review concerns on #22783 is contained to the top-tree's lifetime. Also: - ProverClient implements a new EpochProverFactory interface with createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The legacy createEpochProver remains for the orchestrator_*.test.ts (deleted in commit 2b). - EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve, afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean patch surface — but the four affected tests migrate to spying on createTopTreeOrchestrator and patching prove(), which is the closer analog to the legacy finalizeEpoch patch. - BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so the job can manage its lifecycle. - CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the top-tree data assembly is synchronous (no awaiting block-level proving). epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests that used to spy on createEpochProver().finalizeEpoch are migrated to spy on createTopTreeOrchestrator().orchestrator.prove.
Summary
Refactors the prover node from epoch-driven to checkpoint-driven proving:
Fixes A-955