refactor(prover-node): CheckpointStore + SessionManager redesign #23552
Merged
spalladino
approved these changes
Jun 5, 2026
Contributor
Looks great! I'll miss the good old epoch-proving-job.
Out of all the comments, I think the only important one to address is the always-advancing of local chain tips when handling blockstream events in the prover-node.
PhilWindle
added a commit
that referenced
this pull request
Jun 9, 2026
…design (#23972)

Addresses the review feedback on the checkpoint-store redesign (PR #23552). Branches off `pw/checkpoint-store-redesign` so it can be folded back in.

### prover-node
- Advance the local tips store only after block-stream handling succeeds, and propagate checkpoint registration failures instead of swallowing them, so a transient handler failure leaves the tips unadvanced and the `L2BlockStream` re-emits the event (A-1041).
- Skip checkpoints for epochs already past their proof-submission window (avoids archiver catch-up noise).
- Bound the publishing-service shutdown with a timeout so an in-flight L1 tx can't hang `stop()`.
- `expireEpoch` fetches blocks via `getBlocks({ epoch })` instead of computing the range from `getCheckpointsData`.

### session-manager
- `archiverFullyCovered` compares by content-addressed id `(number, slot, archive root)` rather than checkpoint number, so a reorg that keeps the number but changes content is detected.
- `openPartialSession` reuses a live partial session whose checkpoint content already matches the canonical set instead of reconstructing one.
- `startProof` refuses to re-prove an epoch the L1 proven chain already encompasses.
- Collapsed the redundant proving/finalization delays into a single delay (they gated identical work).
- `startProof` returns the job id without awaiting completion; callers poll `getJobs()` (proving can outlast an HTTP request). RPC output changes `void` → `string`.

### proof publishing
- Wire the candidate's submission-window deadline into `l1-tx-utils` (`txTimeoutAt`) so the L1 tx stops retrying past the deadline.
- Drop the now-redundant `waitUntilStartBuildsOnProven` loop from `ProverNodePublisher`; the publishing service already gates publication on the predecessor being proven.

### orchestrator
- Abort the sub-tree when the chonk verifier proof fails (otherwise the checkpoint/epoch orchestrators hang forever).
- Cancel the proving state on sub-tree cancel, matching `TopTreeOrchestrator`.
- Name `BaseRollupHintsWithoutProofAndVK` at the `prepareBaseRollupInputs` boundary and correct the stale "mock proof and VK" comment.
- Clarify the proving-scheduler throttle docs: the shared `SerialQueue` paces job initiation, it does not bound in-flight broker concurrency.

### e2e
- Anchor the mid-epoch reorg test on a fresh epoch and add count-based assertions so it provably exercises in-epoch checkpoint removal: N checkpoints in the epoch, remove the last leaving N-1, prove up to and including the (N-1)th.

Prover-node unit tests (162) and prover-client orchestrator tests pass; build and lint clean.
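To make the first prover-node item concrete, here is a minimal sketch of "advance the tips only after the handler succeeds". Everything in it (`SketchBlockStreamEvent`, `SketchTipsStore`, `handleEvent`) is a hypothetical stand-in rather than the actual prover-node code, and it is synchronous for brevity where the real handlers are presumably async.

```typescript
// Hypothetical, simplified event shape; the real L2BlockStream events are richer.
type SketchBlockStreamEvent = { type: 'blocks-added' | 'chain-pruned'; blockNumber: number };

// Hypothetical stand-in for the local tips store.
class SketchTipsStore {
  private latest = 0;

  // Records that the event has been fully processed.
  advance(event: SketchBlockStreamEvent): void {
    this.latest = event.blockNumber;
  }

  getLatest(): number {
    return this.latest;
  }
}

// Runs the handler first and advances the tips only if it returned normally.
// If the handler throws, the error propagates and the tips stay where they were,
// so the stream can deliver the same event again.
function handleEvent(
  tips: SketchTipsStore,
  event: SketchBlockStreamEvent,
  handler: (event: SketchBlockStreamEvent) => void,
): void {
  handler(event); // may throw
  tips.advance(event); // reached only on success
}

// A handler that fails once and then succeeds, to simulate a transient failure.
let failNext = true;
const flakyHandler = (_event: SketchBlockStreamEvent): void => {
  if (failNext) {
    failNext = false;
    throw new Error('transient failure');
  }
};

const tipsStore = new SketchTipsStore();
const sampleEvent: SketchBlockStreamEvent = { type: 'blocks-added', blockNumber: 7 };

let firstAttemptFailed = false;
try {
  handleEvent(tipsStore, sampleEvent, flakyHandler);
} catch (_err) {
  firstAttemptFailed = true;
}
const tipsAfterFailure = tipsStore.getLatest(); // unchanged by the failed attempt

handleEvent(tipsStore, sampleEvent, flakyHandler); // re-delivery succeeds
const tipsAfterRetry = tipsStore.getLatest(); // now reflects the event
```

Placing `tips.advance` in a `finally` block instead would record progress even when the handler threw, which is the behaviour the review comment flagged.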
…tree schedulers

Replace the old ProvingOrchestrator + EpochProvingState + EpochProvingContext + TopTreeProvingScheduler stack with a checkpoint-driven design.

- `CheckpointSubTreeOrchestrator` extends `ProvingScheduler` directly, taking `ChonkCache + EpochNumber` instead of the EpochProvingContext indirection.
- `TopTreeOrchestrator` extends `ProvingScheduler` (collapses the former TopTreeProvingScheduler).
- `CheckpointProvingState` takes three discrete deps: `epochNumber`, `isAlive`, `onReject` (no CheckpointParent interface, no parentEpoch).
- `ChonkCache`: tx-hash-keyed cache of chonk-verifier proofs, released when the owning epoch expires.

Deletes the old orchestrator unit tests; new behaviour is covered per-component in the new prover-node tests landing in subsequent commits.
…ssion, ProofPublishingService

Adds the new prover-node primitives. No integration into ProverNode yet; the next commit wires them up.

- `CheckpointStore`: long-lived, content-keyed registry of `CheckpointProver` instances, id = (checkpoint number, slot, archive root). `reapExpired(epoch)` drops provers whose proof-submission window has closed; a SlotWatcher drops pruned provers once the chain moves past their slot.
- `CheckpointProver` (in `job/`): per-checkpoint sub-tree work that survives prune/re-add cycles via `markPruned()` / `markCanonical()`.
- `TopTreeJob` (in `job/`): drives the top tree over a frozen subset of CheckpointProvers and returns a TopTreeProof. Cooperatively cancellable via TopTreeCancelledError so a mid-prove reorg can short-circuit.
- `EpochSession` (in `job/`): TopTreeJob -> ProofPublishingService.submit -> terminal state. Hooks (`beforeTopTreeProve`, `afterTopTreeProve`, `topTreeProveOverride`) for test interpose.
- `SessionManager`: serial reconcile queue, periodic tick with a monotonic high-water mark, owns `fullSessions` / `partialSessions` maps and lifecycle.
- `ProofPublishingService`: SerialQueue drain, fresh ProverNodePublisher per publish via factory, per-candidate deadline against an injected DateProvider, no mid-publish interrupts.

Each new file has an exhaustive unit test alongside.
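The content-keyed registry described in this commit can be pictured roughly as follows. This is a sketch under assumptions, not the real `CheckpointStore`: the `Sketch*` names, the string key format, and the hex-string archive root are all invented for illustration. Only the id triple and the `markPruned()` / `markCanonical()` method names come from the commit message.

```typescript
// Hypothetical id shape: (checkpoint number, slot, archive root).
interface SketchCheckpointId {
  number: number;
  slot: number;
  archiveRoot: string; // assumed to be a hex string here
}

// Hypothetical stand-in for a per-checkpoint prover.
class SketchCheckpointProver {
  readonly id: SketchCheckpointId;
  status: 'canonical' | 'pruned' = 'canonical';

  constructor(id: SketchCheckpointId) {
    this.id = id;
  }

  markPruned(): void {
    this.status = 'pruned';
  }

  markCanonical(): void {
    this.status = 'canonical';
  }
}

// Derives a map key from the full content triple, not just the checkpoint number.
const keyOf = (id: SketchCheckpointId): string =>
  `${id.number}:${id.slot}:${id.archiveRoot.toLowerCase()}`;

class SketchCheckpointStore {
  private readonly provers = new Map<string, SketchCheckpointProver>();

  // Returns the existing prover when identical content is re-added, otherwise creates one.
  add(id: SketchCheckpointId): SketchCheckpointProver {
    const key = keyOf(id);
    const existing = this.provers.get(key);
    if (existing) {
      existing.markCanonical();
      return existing;
    }
    const prover = new SketchCheckpointProver(id);
    this.provers.set(key, prover);
    return prover;
  }

  // Marks a prover as pruned but keeps it, so in-flight work can be reused on re-add.
  prune(id: SketchCheckpointId): void {
    this.provers.get(keyOf(id))?.markPruned();
  }

  size(): number {
    return this.provers.size;
  }
}

const checkpointStore = new SketchCheckpointStore();
const original: SketchCheckpointId = { number: 5, slot: 40, archiveRoot: '0xaa' };

const firstProver = checkpointStore.add(original);
checkpointStore.prune(original);
const readded = checkpointStore.add({ ...original }); // same content: same instance
const replaced = checkpointStore.add({ ...original, archiveRoot: '0xbb' }); // same number, new content
```

Keying on the full triple is what lets a reorg that keeps the checkpoint number but changes its content show up as a different entry.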
Replaces the old EpochProvingJob-driven flow with the new checkpoint-driven primitives.

- `ProverNode` constructs `CheckpointStore`, `SessionManager`, and `ProofPublishingService` at startup and threads `DateProvider` through them.
- Per-event epoch expiry replaces chain-finalized: every block-stream event polls `getSyncedL2SlotNumber`, advances the `lastExpiredEpoch` high-water mark, and reaps each newly-expired epoch's CheckpointProvers + ChonkCache entries. `lastExpiredEpoch` is seeded from the last fully-proven epoch via `computeStartupState`.
- Adds ObservableGauges `aztec.prover_node.active_checkpoints` and `aztec.prover_node.active_epoch_sessions` (with EPOCH_SESSION_KIND attribute of full|partial). Drops the stale `recordChonkVerifier` and `recordAllCheckpointsProcessing` metrics.
- `ProverNodePublisher`: removes `interrupt()` and `restart()` along with the entire mid-publish interrupt path. Fixes the `l1TxUtils.interrupted` leak between publishes by construction (a single-signer setup shares the publisher instance across calls; any mid-publish interrupt without a paired restart strands the next publish).
- Stdlib `prover-node` / `prover-client` / `server` interfaces updated for the new RPC surface; the old `EpochProver` interface remains in place to be deleted in a later commit.
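As a rough illustration of the per-event expiry sweep and its `lastExpiredEpoch` high-water mark, consider the sketch below. The expiry rule is an assumption made up for the example (an epoch counts as expired once the current epoch is more than `windowEpochs` past it), and `newlyExpiredEpochs` and `ExpirySweeper` are hypothetical names. The actual window computation in the prover node may differ.

```typescript
// Hypothetical helper: epochs after `lastExpired` whose (assumed) proof-submission
// window has closed by `currentEpoch`.
function newlyExpiredEpochs(lastExpired: number, currentEpoch: number, windowEpochs: number): number[] {
  const latestExpired = currentEpoch - windowEpochs - 1;
  const epochs: number[] = [];
  for (let epoch = lastExpired + 1; epoch <= latestExpired; epoch++) {
    epochs.push(epoch);
  }
  return epochs;
}

class ExpirySweeper {
  private lastExpiredEpoch: number;
  private readonly windowEpochs: number;
  // Stand-in for reaping provers and cache entries: just records which epochs were swept.
  readonly reaped: number[] = [];

  // Seeding from the last fully-proven epoch means a restart never re-sweeps older epochs.
  constructor(lastFullyProvenEpoch: number, windowEpochs: number) {
    this.lastExpiredEpoch = lastFullyProvenEpoch;
    this.windowEpochs = windowEpochs;
  }

  // Called on every event with the epoch derived from the synced slot.
  onEvent(currentEpoch: number): void {
    for (const epoch of newlyExpiredEpochs(this.lastExpiredEpoch, currentEpoch, this.windowEpochs)) {
      this.reaped.push(epoch);
      this.lastExpiredEpoch = epoch; // the mark only ever moves forward
    }
  }
}

const sweeper = new ExpirySweeper(3, 1);
sweeper.onEvent(5); // nothing newly expired yet
sweeper.onEvent(6); // epoch 4 expires
sweeper.onEvent(6); // repeated event: no double sweep
sweeper.onEvent(8); // epochs 5 and 6 expire together
```

Because the mark is monotonic, repeated or late events cannot reap the same epoch twice.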
- New `epochs_optimistic_proving.parallel.test.ts` with eight scenarios covering the checkpoint-driven flow: happy path, multi-epoch, mid-epoch reorg with/without replacement, last-slot reorg, tx-moves-checkpoints reorg, reorg-while-proving, and prover-node-started-mid-epoch.
- Adds `EpochsTestContext.waitUntilNextEpochStarts()` that warps L1 to within two slots of the next epoch boundary, anchoring tests on a fresh epoch regardless of CI load.
- Fixes `waitUntilEpochStarts` timeout: the prior `30 * epochDuration` mixed slots and seconds (120s for `epochDuration=4`) and could time out before a 144s+ epoch's wall time. Now `2 * epochDuration * L2_SLOT_DURATION_IN_S`.
- Other e2e tests touched by the redesign (cross-chain, p2p, prover, slashing) receive small import / interface adjustments.
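The `waitUntilEpochStarts` timeout arithmetic can be checked with a few lines. The 36 s slot duration below is an assumption, chosen because it is what a 144 s epoch of 4 slots implies; the real `L2_SLOT_DURATION_IN_S` in the test context may be larger.

```typescript
// Assumed value for illustration only: 4 slots x 36 s = 144 s per epoch.
const L2_SLOT_DURATION_IN_S = 36;
const epochDuration = 4; // slots per epoch

// Wall-clock length of one epoch in seconds.
const epochWallTimeS = epochDuration * L2_SLOT_DURATION_IN_S;

// Old formula: a slot count times a bare constant, then treated as seconds.
const oldTimeoutS = 30 * epochDuration;

// New formula: two full epochs, expressed in seconds.
const newTimeoutS = 2 * epochDuration * L2_SLOT_DURATION_IN_S;

const oldCoversEpoch = oldTimeoutS >= epochWallTimeS;
const newCoversEpoch = newTimeoutS >= epochWallTimeS;

console.log({ epochWallTimeS, oldTimeoutS, newTimeoutS, oldCoversEpoch, newCoversEpoch });
```

Under these assumptions the old timeout (120 s) is shorter than a single epoch (144 s), while the new one (288 s) spans two epochs.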
These were the entry points of the pre-redesign proving stack. Their last references were removed by the integration commit, so they can now be deleted outright.
Summary
Replaces the monolithic `EpochProvingJob` with a content-keyed `CheckpointStore`, a long-running `SessionManager` that owns ephemeral `EpochSession`s, and a new `ProofPublishingService` that centralises L1 submission. `ProverNode` becomes a thin event translator: each L2BlockStream event is applied to the store / chonk cache, then dispatched to the session manager and publishing service via single method calls.

The redesign closes a class of optimistic-proving bugs that the old sticky `epochComplete` flag and per-session publish path made structurally hard to fix, and lays the groundwork for re-using sub-tree work across epochs.

See `yarn-project/prover-node/README.md` for architecture diagrams, state machines and event-flow sequences.
Architectural changes
- `CheckpointProver` is content-addressed by `(number, slot, archiveRoot)`. A prune followed by a re-add of the same content (e.g. brief L1 reorg) reuses the in-flight sub-tree work, with no replay. The `CheckpointProver` starts its own tx gather + sub-tree pipeline in its constructor; there's no `provideTxs` API.
- `CheckpointStore` owns the registry, the `SlotWatcher` (a `RunningPromise` reaping pruned-past-slot `CheckpointProver`s), and `reapExpired` (drops canonical `CheckpointProver`s once their epoch's proof-submission window has closed, so the proof can no longer be accepted on L1).
- `EpochSession` spec is slot-based: `[firstSlotOfEpoch(N), toSlot]`. Every session, full or partial, starts at the epoch's first slot because the L1 rollup requires every proof to extend from the previous proven tip. The session does three things: run a `TopTreeJob`, hand the proof to `ProofPublishingService` as a `PublishCandidate`, and translate the outcome into a terminal state. Predecessor gating, same-epoch dedup, deadline enforcement, and the L1 tx are all the service's concern.
- `ProofPublishingService` (new) is the single owner of L1 submission. It serialises one publish at a time against a freshly-created publisher; a per-candidate `deadline` arms a `setTimeout` (resolves `'expired'` if it fires before publishing starts); persistent `publisherFactory.create()` failures are retried on a 1s backoff (capped by the deadline). Once an L1 publish starts it runs to completion; `withdraw` is queue-only.
- `SessionManager` owns the `fullSessions` / `partialSessions` maps, the reconcile loop, and the periodic tick. Reconcile is uniform across kinds: any session whose canonical content shifts is cancelled and recreated with the same spec but new content. The tick high-water mark advances only after a session actually exists for the epoch, so transient blockers (max-pending-jobs reached, archiver still indexing) leave the mark in place and the next tick retries.
- `ChonkCache` moved from per-epoch to a single prover-node-wide cache in `prover-client/orchestrator`, keyed by tx hash. Entries are released by the per-event expiry sweep (`releaseForBlocks`) once an epoch's proof-submission window has closed, since there's no longer any proof to produce for those txs.
- … `SerialQueue` from `@aztec/foundation/queue` so concurrent events can't interleave on an `await` and race.

Removed
- `EpochProvingJob` and its sticky `epochComplete` flag, the `finalizationScheduled` flag, the in-class restart loop, the `'reorg'` state.
- `ProvingOrchestrator` and `EpochProvingState` (test-only legacy); `CheckpointSubTreeOrchestrator` now extends `ProvingScheduler` directly.
- `TopTreeProvingScheduler` collapsed into `TopTreeOrchestrator` (single concrete subclass).
- `EpochProvingContext` (thin facade over `ChonkCache`); the sub-tree takes `ChonkCache` + `EpochNumber` directly.
- `CheckpointParent` interface (vestige of `EpochProvingState` as a real parent); the per-checkpoint state takes three discrete `epochNumber` / `isAlive` / `onReject` deps from its owner.
- `ProverNodePublisher.interrupt()` / `.restart()` and the entire mid-publish interrupt code path. The bug where `l1TxUtils.interrupted` leaked between publishes (the publisher is created fresh per publish but wraps a pooled `L1TxUtils`) is gone by construction.
- `CheckpointStore.resume()` (dead code) and `implements Service` on `CheckpointStore`.
- `ReconcileTrigger` variant `'finalised'` and `SessionManager.onChainFinalised` (redundant nudge; every reconcile already runs the `recreateInvalidSessions` sweep).
- `ProofPublishingService.onPrune` (redundant with the session-manager path which already calls `withdraw(uuid)` for every cancelled session).

Smaller fixes folded in
- `tipsStore.handleBlockStreamEvent` moved to the `finally` block so a throwing handler doesn't claim progress that didn't happen.
- … `CheckpointProver` regardless of sub-tree completion.
- `ProverClient.stop()` uses `tryStop(facade)` to swallow already-stopped errors.
- `lastExpiredEpoch` seeded from the last fully-proven epoch in `start()` so a restart never re-sweeps epochs that already reached L1.
- `DateProvider` plumbed through `EpochSession`, `SessionManager`, `ProofPublishingService`; no direct `Date.now()` anywhere.
- `Map<EpochNumber, ...>` on session-manager and publishing-service Maps; `TopTreeJob.getRange()` returns `CheckpointNumber`.

Test plan
- `yarn workspace @aztec/prover-node test`: 161 unit tests pass (includes new `proof-publishing-service.test.ts`, `checkpoint-store.test.ts`).
- `yarn workspace @aztec/prover-client test src/orchestrator/ src/test/bb_prover_full_rollup.test.ts` with `FAKE_PROOFS=1`: 24 tests pass.
- `yarn workspace @aztec/stdlib test src/interfaces/prover-node.test.ts`: passes (state enum still includes legacy values for API compatibility).
- `yarn build`: full monorepo TypeScript build clean.
- `yarn format` / `yarn lint` clean across prover-node, prover-client, and end-to-end.
- `yarn workspace @aztec/end-to-end test:e2e e2e_optimistic_proving` and `e2e_multi_proof`: exercised by CI.
- … cancel-and-recreate path lands a valid proof on L1.