Skip to content

feat(prover-node): checkpoint-driven optimistic proving via L2BlockStream (A-955)#22783

Closed
PhilWindle wants to merge 2 commits into
phil/a-954-orchestrator-restructure-deferred-finalization-checkpointsfrom
phil/a-955-optimistic-proving-checkpoint-driven-trigger
Closed

feat(prover-node): checkpoint-driven optimistic proving via L2BlockStream (A-955)#22783
PhilWindle wants to merge 2 commits into
phil/a-954-orchestrator-restructure-deferred-finalization-checkpointsfrom
phil/a-955-optimistic-proving-checkpoint-driven-trigger

Conversation

@PhilWindle

Copy link
Copy Markdown
Collaborator

Summary

Refactors the prover node from epoch-driven to checkpoint-driven proving:

  • Incremental EpochProvingJob: Replaces single run() with addCheckpoint() / removeLastCheckpoint() / finalizeAndProve()
  • L2BlockStream integration: ProverNode implements L2BlockStreamEventHandler for checkpoint and reorg detection
  • EpochMonitor signals finalization: Monitor tells us when epoch is complete, stream delivers checkpoints
  • Reorg safety: Prune events remove checkpoints from active jobs

Fixes A-955

@PhilWindle PhilWindle added ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure ci-draft Run CI on draft PRs. labels Apr 24, 2026
@PhilWindle PhilWindle force-pushed the phil/a-955-optimistic-proving-checkpoint-driven-trigger branch 5 times, most recently from 46e40c7 to 36fda66 Compare April 28, 2026 16:51
@PhilWindle PhilWindle marked this pull request as ready for review April 28, 2026 21:23
@PhilWindle PhilWindle added the ci-full Run all master checks. label Apr 28, 2026
@PhilWindle PhilWindle force-pushed the phil/a-955-optimistic-proving-checkpoint-driven-trigger branch from 22623c3 to ecb3a07 Compare April 28, 2026 21:24
…ream (A-955)

Reworks the prover node around an L2BlockStream-driven flow where each
EpochProvingJob owns its own lifecycle:

- The job stores all checkpoints (pending and tracked) in a single map keyed
  by checkpoint number, with the caller-supplied orchestrator index pinned at
  register time so registration and addCheckpoint can race in either order.
- New `completeEpoch()` / `whenComplete()` API: the prover node hands the
  epoch off to the job once L1 has signalled completion and every expected
  checkpoint has been registered; the job waits for any in-flight gather to
  settle, optionally sleeps for `finalizationDelayMs` (lets late-arriving
  events be processed), then runs `finalizeAndProve` and resolves the
  completion promise.
- `removeCheckpoint(N)` is now async, idempotent, and serialises against any
  in-flight `addCheckpoint` so a re-org's prune + replacement can re-register
  the same checkpoint number without racing the orchestrator.
- Per-entry `attestations` and `previousBlockHeader` (and a per-entry
  `checkpointIndex`) eliminate the previous job-level state derived from
  addition order, fixing the subtle bug where a partially-proven epoch's
  checkpoints could be skipped.

Orchestrator side gains `removeCheckpoint(index)` for arbitrary slot removal,
removing the LIFO-only constraint and the tail bookkeeping on the job.

Prover node simplifications:
- Drops `epochsReadyToFinalize`, `epochsFinalizing`, `epochAttestations`,
  `tryFinalizeEpoch`, and `finalizeEpochJob` in favour of a single
  `tryCompleteEpoch` hand-off and a `cleanupCompletedJob` chained off
  `whenComplete()`.
- `computeStartingBlock` and `handleCheckpointEvent` use the same
  "is the proven block the last block of its epoch?" rule via a new
  `isEpochFullyProven` helper, so partially-proven epochs (e.g. epoch with
  32 slots but proven only at slot 10) ingest every checkpoint correctly.
- Catch-up sleep removed from `gatherAndAddCheckpoint`; the same
  `proverNodeEpochProvingDelayMs` config now feeds the job's
  `finalizationDelayMs` and the EpochMonitor's existing delay.

E2E coverage:
- Happy-path single + multi-epoch checkpoint-driven proving.
- Mid-epoch reorg with replacement, mid-epoch reorg without replacement,
  last-slot reorg without replacement.
- Reorg during proving: gates `prover.finalizeEpoch`, computes reorg depth
  from the published L1 block, cheats `proven` on the rollup so the next
  epoch's proof can land, and asserts a subsequent epoch is proven on chain.
@PhilWindle PhilWindle force-pushed the phil/a-955-optimistic-proving-checkpoint-driven-trigger branch from 218b4b6 to d79de4f Compare April 30, 2026 17:30
…eckpoint

`monitor.run()` reads the rollup contract directly so its `checkpointNumber`
can be ahead of what `node.getCheckpoints` (archiver-backed) has indexed,
producing an empty array and a `Cannot read properties of undefined`
crash on the destructured `cp`. Wrap each lookup in `retryUntil` so the
helpers wait up to 30s for the archiver to catch up.
@AztecBot

Copy link
Copy Markdown
Collaborator

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/a9af6e081b36c52a�a9af6e081b36c52a8;;�):  yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_optimistic_proving.parallel.test.ts "removes the last-slot checkpoint of an epoch via reorg and proves with survivors" (270s) (code: 0) group:e2e-p2p-epoch-flakes

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got halfway through it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this intended to be included with this PR?

const block = removed.getBlockProvingStateByBlockNumber(blockNumber);
if (block) {
for (const tx of block.getProcessedTxs()) {
this.provingState.cachedChonkVerifierProofs.delete(tx.hash.toString());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up this could delete a useful verification if the tx gets added to another checkpoint in the same epoch after the reorg


const removed = this.checkpoints[lastIndex]!;
this.checkpoints[lastIndex] = undefined;
this.checkpoints[checkpointIndex] = undefined;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked who calls this yet, but shouldn't this method be removing this checkpoint and all subsequent ones, since later checkpoints probably depended on this one?

Comment on lines 159 to 161
if (this.isEpochStructureFinalized) {
throw new Error('Cannot remove checkpoints after epoch structure has been finalized.');
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we hit a reorg that removes the last checkpoint in the epoch after we processed it and finalized the epoch structure? Should we account for that?


it(
'after removing last checkpoint, remaining checkpoint state is consistent',
'remove from middle of list leaves a hole and preserves surrounding checkpoints',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels dangerous, since later checkpoints have computed its outHashHint based on a now-removed checkpoint

Comment on lines +352 to +355
/** Marks the checkpoint as removed. Subsequent verifyState() returns false and reject() is a no-op. */
public markRemoved() {
this.removed = true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also remove all related proving jobs from the prover broker?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: still got to review this one

PhilWindle added a commit that referenced this pull request May 3, 2026
…rators directly

EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and
constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a
single EpochProver. The previous waitForAllCheckpointsReady step is gone — the
top tree starts as soon as every checkpoint is tracked and pipelines its
checkpoint root rollups against the still-pending sub-tree result promises.

Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a
prune) is now atomic and does not affect the others — the cross-checkpoint
state coupling that triggered Palla's review concerns on #22783 is contained
to the top-tree's lifetime.

Also:
- ProverClient implements a new EpochProverFactory interface with
  createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The
  legacy createEpochProver remains for the orchestrator_*.test.ts (deleted
  in commit 2b).
- EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve,
  afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean
  patch surface — but the four affected tests migrate to spying on
  createTopTreeOrchestrator and patching prove(), which is the closer analog
  to the legacy finalizeEpoch patch.
- BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so
  the job can manage its lifecycle.
- CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the
  top-tree data assembly is synchronous (no awaiting block-level proving).

epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory
and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests
that used to spy on createEpochProver().finalizeEpoch are migrated to spy on
createTopTreeOrchestrator().orchestrator.prove.
PhilWindle added a commit that referenced this pull request May 3, 2026
…rators directly

EpochProvingJob now holds a CheckpointSubTreeOrchestrator per checkpoint and
constructs a TopTreeOrchestrator inside finalizeAndProve, instead of holding a
single EpochProver. The previous waitForAllCheckpointsReady step is gone — the
top tree starts as soon as every checkpoint is tracked and pipelines its
checkpoint root rollups against the still-pending sub-tree result promises.

Each sub-tree owns its own per-checkpoint state, so removing one (e.g. via a
prune) is now atomic and does not affect the others — the cross-checkpoint
state coupling that triggered Palla's review concerns on #22783 is contained
to the top-tree's lifetime.

Also:
- ProverClient implements a new EpochProverFactory interface with
  createCheckpointSubTreeOrchestrator and createTopTreeOrchestrator. The
  legacy createEpochProver remains for the orchestrator_*.test.ts (deleted
  in commit 2b).
- EpochProvingJob accepts an EpochProvingJobHooks bag (beforeTopTreeProve,
  afterTopTreeProve, topTreeProveOverride) that gives the e2e tests a clean
  patch surface — but the four affected tests migrate to spying on
  createTopTreeOrchestrator and patching prove(), which is the closer analog
  to the legacy finalizeEpoch patch.
- BrokerCircuitProverFacade is exported from @aztec/prover-client/broker so
  the job can manage its lifecycle.
- CheckpointSubTreeOrchestrator gains getPreviousArchiveSiblingPath() so the
  top-tree data assembly is synchronous (no awaiting block-level proving).

epoch-proving-job.test.ts is rewritten end-to-end to mock the new factory
and the per-checkpoint sub-trees (28 tests, all passing). The four e2e tests
that used to spy on createEpochProver().finalizeEpoch are migrated to spy on
createTopTreeOrchestrator().orchestrator.prove.
@PhilWindle PhilWindle closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-draft Run CI on draft PRs. ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants