refactor(prover-node): CheckpointStore + SessionManager redesign#23552

Merged
PhilWindle merged 7 commits into merge-train/spartan-v5 from pw/checkpoint-store-redesign
Jun 9, 2026

@PhilWindle PhilWindle commented May 25, 2026

Collaborator

Summary

Replaces the monolithic EpochProvingJob with a content-keyed CheckpointStore,
a long-running SessionManager that owns ephemeral EpochSessions, and a new
ProofPublishingService that centralises L1 submission. ProverNode becomes a
thin event translator: each L2BlockStream event is applied to the store / chonk
cache, then dispatched to the session manager and publishing service via single
method calls.

The redesign closes a class of optimistic-proving bugs that the old sticky
epochComplete flag and per-session publish path made structurally hard to fix,
and lays the groundwork for re-using sub-tree work across epochs.

See yarn-project/prover-node/README.md for architecture diagrams, state
machines and event-flow sequences.

Architectural changes

  • CheckpointProver is content-addressed by (number, slot, archiveRoot).
    A prune followed by a re-add of the same content (e.g. brief L1 reorg) reuses
    the in-flight sub-tree work — no replay. The CheckpointProver starts its own
    tx gather + sub-tree pipeline in its constructor; there's no provideTxs API.
  • CheckpointStore owns the registry, the SlotWatcher (a RunningPromise
    reaping pruned-past-slot CheckpointProvers), and reapExpired (drops canonical
    CheckpointProvers once their epoch's proof-submission window has closed, so the
    proof can no longer be accepted on L1).
  • EpochSession spec is slot-based: [firstSlotOfEpoch(N), toSlot]. Every
    session — full or partial — starts at the epoch's first slot because the L1
    rollup requires every proof to extend from the previous proven tip. The
    session does three things: runs a TopTreeJob, hands the proof to
    ProofPublishingService as a PublishCandidate, and translates the outcome into a
    terminal state. Predecessor gating, same-epoch dedup, deadline enforcement,
    and the L1 tx are all the service's concern.
  • ProofPublishingService (new) is the single owner of L1 submission. It
    serialises publishes, one at a time, each against a freshly-created
    publisher. Each candidate's deadline arms a setTimeout that resolves
    'expired' if it fires before publishing starts. Persistent
    publisherFactory.create() failures are retried on a 1s backoff (capped by
    the deadline). Once an L1 publish starts it runs to completion; withdraw
    only affects candidates still in the queue.
  • SessionManager owns the fullSessions / partialSessions maps, the
    reconcile loop, and the periodic tick. Reconcile is uniform across kinds:
    any session whose canonical content shifts is cancelled and recreated with the
    same spec but new content. The tick high-water mark advances only after a
    session actually exists for the epoch, so transient blockers (max-pending-jobs
    reached, archiver still indexing) leave the mark in place and the next tick
    retries.
  • ChonkCache moved from per-epoch to a single prover-node-wide cache in
    prover-client/orchestrator, keyed by tx hash. Entries are released by the
    per-event expiry sweep (releaseForBlocks) once an epoch's proof-submission
    window has closed — there's no longer any proof to produce for those txs.
  • Reconcile and publishing-service drain each run on their own
    SerialQueue from @aztec/foundation/queue so concurrent events can't
    interleave on an await and race.
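
The content-keyed registry from the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the prover-node implementation: `SketchCheckpointStore`, `SketchProver`, `checkpointId` and `reapPrunedBefore` are hypothetical names, and the archive root is assumed to be a hex string. Only the `(number, slot, archiveRoot)` key and the `markPruned()` / `markCanonical()` transitions come from this PR.

```typescript
// Hypothetical sketch of a content-keyed checkpoint registry.
// Only the (number, slot, archiveRoot) key and the markPruned/markCanonical
// transitions are taken from the PR; every other name here is assumed.

interface CheckpointContent {
  number: number;
  slot: number;
  archiveRoot: string; // assumed to be a hex-encoded string in this sketch
}

/** Builds the registry key from the checkpoint's content. */
function checkpointId(c: CheckpointContent): string {
  return `${c.number}:${c.slot}:${c.archiveRoot.toLowerCase()}`;
}

class SketchProver {
  canonical = true;
  readonly content: CheckpointContent;

  constructor(content: CheckpointContent) {
    this.content = content;
  }

  markPruned(): void {
    this.canonical = false;
  }

  markCanonical(): void {
    this.canonical = true;
  }
}

class SketchCheckpointStore {
  private readonly provers = new Map<string, SketchProver>();

  /** Registers a checkpoint, reusing the existing prover when the content matches. */
  add(content: CheckpointContent): SketchProver {
    const id = checkpointId(content);
    const existing = this.provers.get(id);
    if (existing) {
      existing.markCanonical(); // re-add after a prune: keep the in-flight work
      return existing;
    }
    const prover = new SketchProver(content);
    this.provers.set(id, prover);
    return prover;
  }

  /** Marks a checkpoint as pruned without discarding its work. */
  prune(content: CheckpointContent): void {
    this.provers.get(checkpointId(content))?.markPruned();
  }

  /** Drops pruned provers whose slot is behind currentSlot; returns how many were dropped. */
  reapPrunedBefore(currentSlot: number): number {
    let dropped = 0;
    this.provers.forEach((prover, id) => {
      if (!prover.canonical && prover.content.slot < currentSlot) {
        this.provers.delete(id);
        dropped++;
      }
    });
    return dropped;
  }
}
```

In this sketch a reorg that changes the archive root yields a different key, so the replacement checkpoint gets a fresh prover, while the pruned one lingers until the chain moves past its slot.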

Removed

  • EpochProvingJob and its sticky epochComplete flag, the
    finalizationScheduled flag, the in-class restart loop, the 'reorg' state.
  • ProvingOrchestrator and EpochProvingState (test-only legacy);
    CheckpointSubTreeOrchestrator now extends ProvingScheduler directly.
  • TopTreeProvingScheduler collapsed into TopTreeOrchestrator (single
    concrete subclass).
  • EpochProvingContext (thin facade over ChonkCache); the sub-tree takes
    ChonkCache + EpochNumber directly.
  • CheckpointParent interface (vestige of EpochProvingState as a real parent);
    the per-checkpoint state takes three discrete epochNumber/isAlive/onReject
    deps from its owner.
  • ProverNodePublisher.interrupt() / .restart() and the entire mid-publish
    interrupt code path. The bug where l1TxUtils.interrupted leaked between
    publishes (the publisher is created fresh per publish but wraps a pooled
    L1TxUtils) is gone by construction.
  • CheckpointStore.resume() (dead code) and implements Service on
    CheckpointStore.
  • ReconcileTrigger variant 'finalised' and SessionManager.onChainFinalised
    (redundant nudge; every reconcile already runs the recreateInvalidSessions
    sweep).
  • ProofPublishingService.onPrune (redundant with the session-manager path
    which already calls withdraw(uuid) for every cancelled session).

Smaller fixes folded in

  • tipsStore.handleBlockStreamEvent moved to the finally block so a throwing
    handler doesn't claim progress that didn't happen.
  • Failure-upload snapshots every CheckpointProver regardless of sub-tree
    completion.
  • ProverClient.stop() uses tryStop(facade) to swallow already-stopped errors.
  • lastExpiredEpoch seeded from the last fully-proven epoch in start() so a
    restart never re-sweeps epochs that already reached L1.
  • DateProvider plumbed through EpochSession, SessionManager,
    ProofPublishingService — no direct Date.now() anywhere.
  • Branded types throughout: Map<EpochNumber, ...> on session-manager and
    publishing-service Maps; TopTreeJob.getRange() returns CheckpointNumber.

Test plan

  • yarn workspace @aztec/prover-node test — 161 unit tests pass (includes
    new proof-publishing-service.test.ts, checkpoint-store.test.ts).
  • yarn workspace @aztec/prover-client test src/orchestrator/ src/test/bb_prover_full_rollup.test.ts
    with FAKE_PROOFS=1 — 24 tests pass.
  • yarn workspace @aztec/stdlib test src/interfaces/prover-node.test.ts
    passes (state enum still includes legacy values for API compatibility).
  • yarn build — full monorepo TypeScript build clean.
  • yarn format / yarn lint clean across prover-node, prover-client,
    and end-to-end.
  • yarn workspace @aztec/end-to-end test:e2e e2e_optimistic_proving and
    e2e_multi_proof — exercised by CI.
  • Kind-mode network run with a synthetic L1 prune to confirm the
    cancel-and-recreate path lands a valid proof on L1.

@PhilWindle PhilWindle force-pushed the pw/checkpoint-store-redesign branch 3 times, most recently from 56fd460 to e5b43b2 May 26, 2026 09:14
@PhilWindle PhilWindle added the ci-no-fail-fast (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) and ci-full (Run all master checks.) labels May 26, 2026
@PhilWindle PhilWindle force-pushed the pw/checkpoint-store-redesign branch 6 times, most recently from a94348c to 068c248 June 2, 2026 12:56
@PhilWindle PhilWindle force-pushed the pw/checkpoint-store-redesign branch 7 times, most recently from 46b423c to d550bec June 4, 2026 09:38

@spalladino spalladino left a comment

Contributor

Looks great! I'll miss the good old epoch-proving-job.

Out of all the comments, I think the only important one to address is the always-advancing of local chain tips when handling blockstream events in the prover-node.

Comment thread yarn-project/prover-node/README.md
Comment thread yarn-project/prover-node/README.md
Comment thread yarn-project/prover-node/README.md
Comment thread yarn-project/prover-node/src/prover-node.ts Outdated
Comment thread yarn-project/prover-node/src/prover-node.ts
Comment thread yarn-project/prover-node/src/prover-node-publisher.ts Outdated
Comment thread yarn-project/prover-node/src/proof-publishing-service.ts
Comment thread yarn-project/prover-node/src/job/checkpoint-prover.ts
Comment thread yarn-project/prover-node/src/prover-node.ts
@PhilWindle PhilWindle changed the base branch from merge-train/spartan to merge-train/spartan-v5 June 8, 2026 21:59
PhilWindle added a commit that referenced this pull request Jun 9, 2026
…design (#23972)

Addresses the review feedback on the checkpoint-store redesign (PR
#23552). Branches off `pw/checkpoint-store-redesign` so it can be folded
back in.

### prover-node
- Advance the local tips store only after block-stream handling
succeeds, and propagate checkpoint registration failures instead of
swallowing them, so a transient handler failure leaves the tips
unadvanced and the `L2BlockStream` re-emits the event (A-1041).
- Skip checkpoints for epochs already past their proof-submission window
(avoids archiver catch-up noise).
- Bound the publishing-service shutdown with a timeout so an in-flight
L1 tx can't hang `stop()`.
- `expireEpoch` fetches blocks via `getBlocks({ epoch })` instead of
computing the range from `getCheckpointsData`.
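
The ordering described in the first bullet can be illustrated with a small sketch. It is synchronous for brevity (the real handler is presumably async), and `SketchTipsStore`, `applyEvent` and the event shape are hypothetical stand-ins, not the prover-node types.

```typescript
// Hypothetical sketch: advance the local tips only after the handler succeeds.

interface BlockStreamEvent {
  type: string;
  blockNumber: number;
}

class SketchTipsStore {
  latest = 0;

  handleBlockStreamEvent(event: BlockStreamEvent): void {
    this.latest = event.blockNumber;
  }
}

/**
 * Applies one event. If `handler` throws, the error propagates and the tips
 * store is left untouched, so the stream can re-emit the same event later.
 */
function applyEvent(
  event: BlockStreamEvent,
  handler: (event: BlockStreamEvent) => void,
  tips: SketchTipsStore,
): void {
  handler(event); // may throw: nothing below runs in that case
  tips.handleBlockStreamEvent(event); // reached only on success
}
```

Placing the tips update in a `finally` block instead would run it on the failure path too, which is the behaviour this change avoids.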

### session-manager
- `archiverFullyCovered` compares by content-addressed id `(number,
slot, archive root)` rather than checkpoint number, so a reorg that
keeps the number but changes content is detected.
- `openPartialSession` reuses a live partial session whose checkpoint
content already matches the canonical set instead of reconstructing one.
- `startProof` refuses to re-prove an epoch the L1 proven chain already
encompasses.
- Collapsed the redundant proving/finalization delays into a single
delay (they gated identical work).
- `startProof` returns the job id without awaiting completion; callers
poll `getJobs()` (proving can outlast an HTTP request). RPC output
changes `void` → `string`.

### proof publishing
- Wire the candidate's submission-window deadline into `l1-tx-utils`
(`txTimeoutAt`) so the L1 tx stops retrying past the deadline.
- Drop the now-redundant `waitUntilStartBuildsOnProven` loop from
`ProverNodePublisher` — the publishing service already gates publication
on the predecessor being proven.
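
The effect of wiring the deadline through can be pictured as a retry loop that checks the clock before every attempt. The sketch below is an assumption-laden illustration: `sendWithDeadline`, `attempt` and `now` are hypothetical, and apart from the `txTimeoutAt` name nothing here reflects the actual `l1-tx-utils` API. A real implementation would also wait between attempts rather than loop tightly.

```typescript
// Hypothetical sketch of a deadline-capped retry loop with an injected clock.

type SendOutcome = 'sent' | 'timed-out';

function sendWithDeadline(
  attempt: () => boolean, // returns true once the tx has been accepted
  now: () => number, // injected clock, in milliseconds
  txTimeoutAt: number, // absolute deadline, in milliseconds
): SendOutcome {
  // Check the deadline before each attempt so no retry starts past it.
  while (now() < txTimeoutAt) {
    if (attempt()) {
      return 'sent';
    }
  }
  return 'timed-out';
}
```

Injecting the clock mirrors the PR's use of a `DateProvider` and lets a test advance time deterministically.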

### orchestrator
- Abort the sub-tree when the chonk verifier proof fails (otherwise the
checkpoint/epoch orchestrators hang forever).
- Cancel the proving state on sub-tree cancel, matching
`TopTreeOrchestrator`.
- Name `BaseRollupHintsWithoutProofAndVK` at the
`prepareBaseRollupInputs` boundary and correct the stale "mock proof and
VK" comment.
- Clarify the proving-scheduler throttle docs: the shared `SerialQueue`
paces job initiation, it does not bound in-flight broker concurrency.

### e2e
- Anchor the mid-epoch reorg test on a fresh epoch and add count-based
assertions so it provably exercises in-epoch checkpoint removal: N
checkpoints in the epoch, remove the last leaving N-1, prove up to and
including the (N-1)th.

Prover-node unit tests (162) and prover-client orchestrator tests pass;
build and lint clean.
PhilWindle and others added 7 commits June 9, 2026 17:41
…tree schedulers

Replace the old ProvingOrchestrator + EpochProvingState + EpochProvingContext +
TopTreeProvingScheduler stack with a checkpoint-driven design.

- `CheckpointSubTreeOrchestrator` extends `ProvingScheduler` directly, taking
  `ChonkCache + EpochNumber` instead of the EpochProvingContext indirection.
- `TopTreeOrchestrator` extends `ProvingScheduler` (collapses the former
  TopTreeProvingScheduler).
- `CheckpointProvingState` takes three discrete deps: `epochNumber`, `isAlive`,
  `onReject` — no CheckpointParent interface, no parentEpoch.
- `ChonkCache`: tx-hash-keyed cache of chonk-verifier proofs, released when the
  owning epoch expires.

Deletes the old orchestrator unit tests; new behaviour is covered per-component
in the new prover-node tests landing in subsequent commits.
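
A tx-hash-keyed cache with block-driven release, as described for `ChonkCache`, might look roughly like this. The sketch is hypothetical: `SketchChonkCache`, `getOrCreate` and the `{ txHashes }` block shape are assumptions, and only the tx-hash keying and the `releaseForBlocks` name come from the PR. The real cache likely stores in-flight proof promises rather than plain values.

```typescript
// Hypothetical sketch of a tx-hash-keyed proof cache with block-driven release.

class SketchChonkCache<Proof> {
  private readonly proofs = new Map<string, Proof>();

  /** Returns the cached proof for txHash, creating it on first request. */
  getOrCreate(txHash: string, create: () => Proof): Proof {
    const cached = this.proofs.get(txHash);
    if (cached !== undefined) {
      return cached;
    }
    const proof = create();
    this.proofs.set(txHash, proof);
    return proof;
  }

  /** Releases entries for every tx in the given blocks; returns how many were released. */
  releaseForBlocks(blocks: { txHashes: string[] }[]): number {
    let released = 0;
    for (const block of blocks) {
      for (const txHash of block.txHashes) {
        if (this.proofs.delete(txHash)) {
          released++;
        }
      }
    }
    return released;
  }

  get size(): number {
    return this.proofs.size;
  }
}
```

Because the key is the tx hash rather than the epoch, a tx that moves between checkpoints during a reorg can still hit the same entry.
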
…ssion, ProofPublishingService

Adds the new prover-node primitives. No integration into ProverNode yet — the
next commit wires them up.

- `CheckpointStore`: long-lived, content-keyed registry of `CheckpointProver`
  instances, id = (checkpoint number, slot, archive root). `reapExpired(epoch)`
  drops provers whose proof-submission window has closed; a SlotWatcher drops
  pruned provers once the chain moves past their slot.
- `CheckpointProver` (in `job/`): per-checkpoint sub-tree work that survives
  prune/re-add cycles via `markPruned()` / `markCanonical()`.
- `TopTreeJob` (in `job/`): drives the top tree over a frozen subset of
  CheckpointProvers and returns a TopTreeProof. Cooperatively cancellable via
  TopTreeCancelledError so a mid-prove reorg can short-circuit.
- `EpochSession` (in `job/`): TopTreeJob -> ProofPublishingService.submit ->
  terminal state. Hooks (`beforeTopTreeProve`, `afterTopTreeProve`,
  `topTreeProveOverride`) for test interpose.
- `SessionManager`: serial reconcile queue, periodic tick with a monotonic
  high-water mark, owns `fullSessions` / `partialSessions` maps and lifecycle.
- `ProofPublishingService`: SerialQueue drain, fresh ProverNodePublisher per
  publish via factory, per-candidate deadline against an injected DateProvider,
  no mid-publish interrupts.

Each new file has an exhaustive unit test alongside.
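
The tick's monotonic high-water mark can be sketched as below. This is an illustration under assumptions: `SketchSessionTick`, `tryOpenSession` and `latestCompleteEpoch` are hypothetical names, and the real `SessionManager` runs this logic on its reconcile queue rather than synchronously.

```typescript
// Hypothetical sketch: the high-water mark advances only once a session exists.

class SketchSessionTick {
  private highWaterMark: number;

  constructor(startEpoch: number) {
    this.highWaterMark = startEpoch;
  }

  /**
   * Tries to open sessions for each epoch after the mark, in order.
   * `tryOpenSession` returns false on a transient blocker (for example the
   * max-pending-jobs limit); the mark then stays put and the next tick retries.
   */
  tick(latestCompleteEpoch: number, tryOpenSession: (epoch: number) => boolean): void {
    for (let epoch = this.highWaterMark + 1; epoch <= latestCompleteEpoch; epoch++) {
      if (!tryOpenSession(epoch)) {
        return; // leave the mark in place
      }
      this.highWaterMark = epoch; // advance only after the session exists
    }
  }

  getHighWaterMark(): number {
    return this.highWaterMark;
  }
}
```

Advancing the mark before the session is created would silently skip an epoch whenever a blocker hit, which is the failure mode this ordering rules out.
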
Replaces the old EpochProvingJob-driven flow with the new checkpoint-driven
primitives.

- `ProverNode` constructs `CheckpointStore`, `SessionManager`, and
  `ProofPublishingService` at startup and threads `DateProvider` through them.
- Per-event epoch expiry replaces chain-finalized: every block-stream event
  polls `getSyncedL2SlotNumber`, advances the `lastExpiredEpoch` high-water
  mark, and reaps each newly-expired epoch's CheckpointProvers + ChonkCache
  entries. `lastExpiredEpoch` is seeded from the last fully-proven epoch via
  `computeStartupState`.
- Adds ObservableGauges `aztec.prover_node.active_checkpoints` and
  `aztec.prover_node.active_epoch_sessions` (with EPOCH_SESSION_KIND
  attribute of full|partial). Drops the stale `recordChonkVerifier` and
  `recordAllCheckpointsProcessing` metrics.
- `ProverNodePublisher`: removes `interrupt()` and `restart()` along with the
  entire mid-publish interrupt path. Fixes the `l1TxUtils.interrupted` leak
  between publishes by construction (a single-signer setup shares the
  publisher instance across calls; any mid-publish interrupt without a paired
  restart strands the next publish).
- Stdlib `prover-node` / `prover-client` / `server` interfaces updated for the
  new RPC surface; the old `EpochProver` interface remains in place to be
  deleted in a later commit.
- New `epochs_optimistic_proving.parallel.test.ts` with eight scenarios
  covering the checkpoint-driven flow: happy path, multi-epoch, mid-epoch
  reorg with/without replacement, last-slot reorg, tx-moves-checkpoints
  reorg, reorg-while-proving, and prover-node-started-mid-epoch.
- Adds `EpochsTestContext.waitUntilNextEpochStarts()` that warps L1 to within
  two slots of the next epoch boundary, anchoring tests on a fresh epoch
  regardless of CI load.
- Fixes `waitUntilEpochStarts` timeout — the prior `30 * epochDuration` mixed
  slots and seconds (120s for `epochDuration=4`) and could time out before a
  144s+ epoch's wall time. Now `2 * epochDuration * L2_SLOT_DURATION_IN_S`.
- Other e2e tests touched by the redesign (cross-chain, p2p, prover, slashing)
  receive small import / interface adjustments.
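
The `waitUntilEpochStarts` timeout fix is easiest to see with the numbers plugged in. The helper names below are hypothetical, and the 36s slot duration is an assumed example value (it is what a 144s epoch of 4 slots implies); only the two formulas come from the commit message.

```typescript
// Hypothetical helpers reproducing the old and new timeout formulas.

/** Old formula: multiplies a slot count by 30 and treats the result as seconds. */
function oldTimeoutS(epochDuration: number): number {
  return 30 * epochDuration;
}

/** New formula: two full epochs of wall time, in seconds. */
function newTimeoutS(epochDuration: number, slotDurationS: number): number {
  return 2 * epochDuration * slotDurationS;
}

/** Wall time of one epoch, in seconds. */
function epochWallTimeS(epochDuration: number, slotDurationS: number): number {
  return epochDuration * slotDurationS;
}

const epochDuration = 4; // slots per epoch, as in the commit message
const slotDurationS = 36; // assumed example value for L2_SLOT_DURATION_IN_S

const oldTimeout = oldTimeoutS(epochDuration); // 30 * 4 = 120
const wallTime = epochWallTimeS(epochDuration, slotDurationS); // 4 * 36 = 144
const newTimeout = newTimeoutS(epochDuration, slotDurationS); // 2 * 4 * 36 = 288
```

With these values the old timeout (120s) is shorter than a single epoch (144s), while the new one (288s) covers two.
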
These were the entry points of the pre-redesign proving stack. Their last
references were removed by the integration commit, so they can now be deleted
outright.
…design (#23972)

@PhilWindle PhilWindle force-pushed the pw/checkpoint-store-redesign branch from 6a70e4b to 42b1c0c June 9, 2026 17:59
@PhilWindle PhilWindle enabled auto-merge (squash) June 9, 2026 18:00
@PhilWindle PhilWindle merged commit cb88eb6 into merge-train/spartan-v5 Jun 9, 2026
12 checks passed
@PhilWindle PhilWindle deleted the pw/checkpoint-store-redesign branch June 9, 2026 18:38