Skip to content

feat(prover-node): checkpoint-driven optimistic proving#23002

Closed
PhilWindle wants to merge 11 commits into
merge-train/spartanfrom
pw/optimistic-final
Closed

feat(prover-node): checkpoint-driven optimistic proving#23002
PhilWindle wants to merge 11 commits into
merge-train/spartanfrom
pw/optimistic-final

Conversation

@PhilWindle

@PhilWindle PhilWindle commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Drives the prover-node onto the split CheckpointSubTreeOrchestrator + TopTreeOrchestrator pair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.

Stacked on #22996 (orchestrator split). Diff above is just the prover-node + e2e + cleanup delta.

What's new

prover-nodeEpochProvingJob job-model rewrite

EpochProvingJob becomes an orchestrator over a Map<string, CheckpointJob> keyed by ${number}:${slot}. Each CheckpointJob owns a single CheckpointSubTreeOrchestrator with its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). A TopTreeJob drives the epoch root rollup once all checkpoint sub-trees have started block-level proving.

Public API:

  • registerCheckpoint — synchronous; sets up sub-tree, kicks off chonk-verifier cache fill, attaches the blockProofs promise to the eventual top-tree job.
  • provideTxs — supplies simulated txs, transitions the checkpoint job from registered → block-proving.
  • removeCheckpointsAfter(threshold) — sole removal API. Suffix-only: cancels every job whose checkpoint number sits strictly above the threshold, atomically. Suffix-only by design — middle-removal would leave a non-contiguous survivor set that cannot be proved.
  • getCheckpointCount, getCheckpointNumbers.

TopTreeJob constructor defensively asserts that its snapshot's checkpoint numbers form a contiguous run (prev + 1 === curr). With suffix-only removal this is structurally guaranteed; the assertion is a runtime tripwire against future regressions.

prover-nodeL2BlockStream-driven checkpoint pipeline

The prover-node consumes chain-checkpointed / chain-pruned events from an L2BlockStream rooted at the first block of the first unproven epoch. On each chain-checkpointed:

  1. Resolve the epoch via getEpochAtSlot.
  2. Get-or-create the per-epoch EpochProvingJob.
  3. Register the checkpoint synchronously and detach a tx-gathering task.
  4. Call tryCompleteEpoch for both older epochs (tertiary signal) and the current epoch (covers the case where the EpochMonitor already fired and we were waiting on this last checkpoint to arrive).

On chain-pruned: call removeCheckpointsAfter(threshold) on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled via AbortSignal.

Finalization is driven by the union of three signals — EpochMonitor (epoch closes on L1), tertiary (a checkpoint for a strictly later epoch arrives), or final-checkpoint registration on an already-monitored epoch. Each signal calls tryCompleteEpoch, which queries l2BlockSource.isEpochComplete directly: the archiver is the source of truth, so the prover-node carries no in-memory cache of "epochs complete on L1".

prover-node — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.

e2e

  • New epochs_optimistic_proving.parallel.test.ts: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
  • The two happy-path tests use a wall-clock-vs-job-epoch sampler to assert that the prover-node has registered checkpoints with the in-flight job for an epoch while still inside that epoch — proves checkpoints aren't deferred until epoch end. The multi-epoch test runs 4 epochs and asserts this for each.
  • epochs_proof_fails, epochs_upload_failed_proof, epochs_long_proving_time, epochs_multi_proof updated to assert the new in-flight epoch behaviour.

Logging

Structured-context logs added across the CheckpointJob / TopTreeJob / EpochProvingJob lifecycles: register/provide-txs/sub-tree-created/per-block-progress/blockProofs-ready/cancel/teardown for checkpoints, and create/cancel/prove-start/prove-success for top-tree. Every log entry indexes on epochNumber / checkpointNumber per repo convention.

What's removed

EpochProver interface and ServerEpochProver are removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers. ProvingOrchestrator survives only as a base class for CheckpointSubTreeOrchestrator and as the single-class driver used by prover-client's integration tests; it no longer implements EpochProver.

EpochProvingJob.removeCheckpoint(checkpointNumber) and cancelPendingCheckpoints() are also removed: middle-removal could leave a non-contiguous survivor set that the TopTreeJob constructor would reject. Suffix-only removeCheckpointsAfter is the only checkpoint-removal API now.

Test plan

  • yarn workspace @aztec/prover-client test — 261 tests pass.
  • yarn workspace @aztec/prover-node test — 116 tests pass (incl. new dedicated top-tree-job.test.ts and checkpoint-job.test.ts).
  • e2e tests covering optimistic proving, reorgs during proving, failed proof publish, and multi-checkpoint flows are included in this PR.

@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from 087e60b to f181313 Compare May 6, 2026 17:47
@PhilWindle PhilWindle changed the base branch from merge-train/spartan to pw/optimistic-orchestrator May 6, 2026 17:47
@PhilWindle PhilWindle added ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 6, 2026
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from f181313 to 3694a9b Compare May 6, 2026 17:51
…pair

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing
single-class proving orchestrator along the natural state-coupling boundary —
per-checkpoint block-level work vs. epoch-level top-tree work — while leaving
every existing API on the legacy `EpochProver` / `ProvingOrchestrator` /
`EpochProvingState` path untouched. The prover-node and e2e tests build
unchanged; this PR is purely additive in surface area, with structural
refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers
with the new `TopTreeOrchestrator`.

Split out from #22990 so it can land independently.

## What's new

- **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`):
  extends `ProvingOrchestrator`, single-checkpoint by construction. Drives
  chonk-verifier / base / merge / block-root / block-merge for one checkpoint
  and resolves a `SubTreeResult` instead of escalating to the checkpoint root —
  the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to
  short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty
  challenges)` to set up a single-checkpoint mini-epoch; the count and
  challenges are never read because the override prevents the parent's
  finalize / root path from running.

- **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver
  from checkpoint-root through epoch-root rollup. Takes per-checkpoint
  block-proof promises and pipelines its hint chain against them. Cancellation
  surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven
  cancel from a genuine proving failure.

- **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared
  cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that
  gets reorged out and re-appears in a replacement checkpoint reuses the
  cached proof.

- **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the
  `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller
  list, and a unified `deferredProving<S, T>(state, request, callback,
  isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is
  just `verifyState()` + `reject(reason)`.

- **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends
  `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup
  drivers (plus tree-walking helpers) shared by both orchestrators. Wraps
  circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans;
  top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to
  bridge the two states' differing `resolve` signatures. The per-checkpoint
  root driver stays subclass-specific because input-building flows differ.

- **`EpochProverFactory` interface on `ProverClient`**: new factory methods
  `createEpochProvingContext(epochNumber)`,
  `createCheckpointSubTreeOrchestrator(...)`, and
  `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is
  owned by `ProverClient` and shared across every orchestrator.

## What changes in existing code

- `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline
  broker-job submit envelope, queue lifecycle, and the top-tree-section
  drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs
  logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal
  methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`,
  `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree
  can override them; `provingState` and `provingPromise` likewise become
  `protected` so the sub-tree can hook the parent's failure stream onto
  `subTreeResult`. No public API change on `ProvingOrchestrator`.
- `CheckpointProvingState`: gains two read-only accessors used by the
  sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and
  `getLastArchiveSiblingPath()`. No state changes.
- `ProverClient` keeps `createEpochProver()` exactly as before (each call
  spawns its own `BrokerCircuitProverFacade`); the new factory methods share
  a `getFacade()` set up in `start()` and torn down in `stop()`.

`EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`,
the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`,
and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the
prover-node and e2e tests continue to build against the existing `EpochProver`
API. Migrating the prover-node onto the new factories (and the deferred-finalize
flow that goes with optimistic proving) is the follow-up PR.

## Test plan

- [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`).
- [x] `yarn build` clean against current merge-train/spartan (modulo the
  pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
@PhilWindle PhilWindle force-pushed the pw/optimistic-orchestrator branch from da67473 to 7342efc Compare May 6, 2026 18:09
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch 17 times, most recently from fd62ad7 to 093fdde Compare May 8, 2026 08:50
Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` +
`TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines
sub-trees against tx-gathering and the top-tree against the in-flight
sub-trees.

## What's new

### `prover-node` — `EpochProvingJob` job-model rewrite

`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>`
keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single
`CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs,
attestations, previous-block header, l1ToL2 messages, archive sibling path).
A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees
have started block-level proving.

Public API:
- `registerCheckpoint` — synchronous; sets up sub-tree, kicks off chonk-verifier
  cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from
  registered → block-proving.
- `removeCheckpoint(synchronous, idempotent)` — drops a single checkpoint by
  `(number, slot)`, fire-and-forget cancels its sub-tree. Tolerates re-add of
  the same checkpoint number under a different slot.
- `removeCheckpointsAfter`, `getCheckpointCount`, `getCheckpointNumbers`,
  `cancelPendingCheckpoints`.

### `prover-node` — `L2BlockStream`-driven checkpoint pipeline

The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an
`L2BlockStream` rooted at the first block of the first unproven epoch. On
each `chain-checkpointed`:
1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Detached-task gather txs + register the checkpoint with the job.

On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job
whose first checkpoint sits at or above the threshold. Pending gather tasks
are cancelled via `AbortSignal`.

Finalization is driven by the union of three signals: epoch-monitor sees
the epoch close on L1, a checkpoint for a strictly later epoch arrives, or
all expected checkpoints (per archiver) are registered while the epoch is
already complete on L1.

### `prover-node` — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an
epoch already in finalize, the prover-node aborts the in-flight publish,
clears the job, and restarts proving from the new tip.

### e2e

- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the
  pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`,
  `epochs_multi_proof` updated to assert the new in-flight epoch behaviour.

## What's removed

`EpochProver` interface and `ServerEpochProver` are removed: the prover-node
no longer drives a single-class epoch prover, so the legacy API has no
production callers. `ProvingOrchestrator` survives only as a base class for
`CheckpointSubTreeOrchestrator` and as the single-class driver used by
`prover-client`'s integration tests; it no longer implements `EpochProver`.

## Test plan

- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 89 tests pass.
- e2e tests covering optimistic proving, reorgs during proving, failed
  proof publish, and multi-checkpoint flows are included in this PR.
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from 093fdde to 81dc825 Compare May 8, 2026 09:23

@spalladino spalladino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got halfway through it, but looks good so far. Pasting here some observations by Claude that I didn't get to review myself:

  • Same-epoch replacement checkpoints are silently dropped after completeEpoch(). prover-node.ts:213-220 skips any checkpoint for an epoch whose job has isEpochComplete(). Sequence: (1) all checkpoints registered, completeEpoch() fires; (2) prune removes the suffix; (3) the L1 reorg replacement for the same epoch arrives. The replacement is dropped because existingJob.isEpochComplete() is true. The finalize loop then restarts against an incomplete prefix of the epoch and eventually publishes a proof for a checkpoint range that does not match the canonical L1 chain. This contradicts the stated "orphan + replacement coexist briefly" model. The check should be conditioned on the surviving checkpoint set, not on the boolean flag.

  • A prune that lands after topTreeJob.start() but during publishProof is not interceptable. epoch-proving-job.ts:478 clears this.topTreeJob immediately after the prove resolves; the publish at epoch-proving-job.ts:487 operates on the captured snapshot array. Meanwhile removeCheckpointsAfter (epoch-proving-job.ts:332-369) only acts on the live topTreeJob — which is now undefined — so the publish proceeds with stale data and the (out-of-date) attestations from snapshot.at(-1). Best case L1 rejects the proof (gas burn, retry confusion); worst case it lands against a window the canonical chain has since invalidated. Either re-validate the snapshot's checkpoint set (by id) against the live registry just before submitting, or hold the top-tree-job reference until publish completes and route prune through it.

  • EpochMonitor permanently records "processed" on a no-op tryCompleteEpoch. prover-node.ts:145-153 returns true from handleEpochReadyToProve whenever tryCompleteEpoch doesn't throw — even if it bailed at prover-node.ts:394 (archiver says incomplete) or prover-node.ts:399-405 (not all checkpoints known to the job). epoch-monitor.ts:93 then sets latestEpochNumber = epochToProve, and epoch-monitor.ts:76-79 ensures the same epoch is never re-considered. In a healthy network the next checkpoint event for a later epoch reseeds via the tertiary signal, but if the epoch in question is the last produced (network stall, end-of-test) it stays stuck forever. tryCompleteEpoch should return whether completeEpoch() was actually invoked, and handleEpochReadyToProve should propagate that.

Comment on lines +160 to +161
// First, update the local tips store so the stream can track our state.
await this.tipsStore.handleBlockStreamEvent(event);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that by updating the store first, any error in the handle methods below will not be retried in the next iteration of the L2BlockStream. Not sure if this is intentional.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should go at the end I think.

// (e.g. an early proof submission landed for an earlier epoch that ends mid-way
// through this epoch's slot range) does not prove the entire epoch. We need every
// checkpoint of a partially-proven epoch to feed the orchestrator.
if (await this.isEpochFullyProven(epochNumber, l1Constants)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check in the archiver and blockstream what happens first: do checkpoints get emitted, or does the proven block number get updated? I'm worried that, when starting a long sync after offline for a while, all the checkpoints are emitted first and then the proven tip is updated, so the prover node would trigger a ton of jobs only to cancel them immediately afterwards.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah just saw the compute starting block number for the blockstream. And as long as the prover node waits for the archiver initial sync to be completed before starting, we should be good I guess.

* to the job for finalization. If checkpoints are still being delivered by the
* L2BlockStream, the handoff happens later when the last one arrives.
*/
async handleEpochReadyToProve(epochNumber: EpochNumber): Promise<boolean> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can now replace the entire epoch monitor with a running promise that just calls l2BlockSource.isEpochComplete periodically.

Comment on lines +222 to +233
// Add to an existing job, or create a new one if allowed.
let job = existingJob;
if (!job) {
const { proverNodeMaxPendingJobs: maxPendingJobs } = this.config;
if (maxPendingJobs > 0 && this.jobs.size >= maxPendingJobs) {
this.log.debug(
`Skipping checkpoint ${checkpoint.number} for epoch ${epochNumber}: max pending jobs ${maxPendingJobs} reached`,
);
return;
}
job = await this.createEpochJob(epochNumber);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this create a hole in a given epoch? I don't see another code path where we patch an epoch job with checkpoint jobs we failed to add due to lack of capacity. I'm worried about a scenario where we don't add a checkpoint to an epoch, but then a prior epoch finishes, so we do add the next, and that epoch job ends up waiting for that missing checkpoint forever.

Speaking of which: is there anything that cleans up very old jobs when the epoch proof submission window closes?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, was going to ask about this. As it stands here there could be problems if this is set. Do you think this is needed or makes sense anymore? Proving is more of a continuous process rather than a set of large jobs.


const taskId = `${checkpoint.number} - ${checkpoint.slot}`;
const task = this.gatherAndAddCheckpoint(job, checkpoint, epochNumber, abortSignal);
this.pendingGatherTasks.set(taskId, task);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to move the responsibility of collecting their data to the checkpoint jobs, to keep the prover node leaner?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does. As a further refactor, I want to break the checkpoint proving out of EpochProvingJob. Then we maintain a collection of checkpoint jobs being proven. EpochProvingJob then becomes a collection of references to a subset of those checkpoints and a top tree. This would enable us to create instances of EpochProvingJob with arbitrary checkpoints (of a valid range of course) enabling efficient partial epoch proving.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks comprehensive. I'd add tests for cross-chain messages as well, those are usually a source of problems.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Comment on lines +73 to +77
const wallClockEpoch = Number(test.epochCache.getEpochAndSlotNow().epoch);
const job = proverNode.epochJobs.get(wallClockEpoch);
if (job && job.getCheckpointCount() > 0) {
observed.add(wallClockEpoch);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up this should pass even without optimistic proving: prover would start as soon as the last checkpoint of the epoch was pushed to L1, which could happen a few L1 slots before the actual end of the epoch. Maybe we should offset this a bit so the check is more meaningful, or check for specific checkpoint jobs vs all jobs in the epoch?

Comment on lines +172 to +177
// Wait for the 2nd checkpoint within this epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also add an assertion that the prover actually started the job for the given checkpoint before reorging it out. Otherwise, this test could pass if the prover just takes a bit more time to start assembling the checkpoint job.

Comment on lines +239 to +244
// Wait for 2 checkpoints mid-epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also assert that both checkpoints landed on the same epoch, which could not be the case if this test started near the end of an epoch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add txs. And it'd be interesting to see a case where one of the reorgs means that a tx that was mined in a checkpoint is now mined in a different one.

Comment on lines +391 to +408
const proverNode = test.proverNodes[0].getProverNode() as TestProverNode;
const proverManager = proverNode.getProver();
const origCreateTopTree = proverManager.createTopTreeOrchestrator.bind(proverManager);
let releaseProvingGate: () => void = () => {};
const provingGate = new Promise<void>(resolve => {
releaseProvingGate = resolve;
});
proverManager.createTopTreeOrchestrator = () => {
const topTree = origCreateTopTree();
const origProve = topTree.prove.bind(topTree);
topTree.prove = async (...args: Parameters<typeof origProve>) => {
logger.warn('Top-tree proving gated — waiting for test to release');
await provingGate;
logger.warn('Proving gate released');
return origProve(...args);
};
return topTree;
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if there's a cleaner way of doing this, perhaps reusing the EpochProvingJobHooks? Not sure if worth stressing over it though.

Comment on lines +435 to +437
if (snapshot.length === 0) {
throw new Error(`Cannot finalize epoch ${this.epochNumber}: no surviving checkpoints`);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a race condition where a reorg removes all checkpoints from an epoch and then new ones are immediately added again, but we hit this case before the new ones get added? Seems like something that would never happen in prod, but could affect e2e tests that have short epochs.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be. The prover node does this in it's prune handler.

      if (job.getCheckpointCount() === 0) {
        this.log.info(`Cancelling epoch ${epochNum} job — all checkpoints pruned`);
        await this.cancelAndCleanupJob(epochNum, job);
      }

So that epoch job instance is removed and replaced with another when a new checkpoint arrives for the same epoch number.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a test though

Comment on lines +579 to +588
// Capture proving data *before* we clear the live job map, so a failure-upload
// attempt that runs after teardown still has the per-checkpoint txs / messages /
// previous-block-header to work with.
if (!this.provingDataSnapshot) {
try {
this.provingDataSnapshot = this.buildProvingDataSnapshot();
} catch {
// No completed checkpoints — failure-upload will have nothing to upload, which
// is fine. Snapshot stays undefined and getProvingData will rebuild and rethrow.
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we move the responsibility of uploading this data to the epoch-proving-job itself, so we don't need to collect this just for the prover-node to maybe read if it's configured for uploads?


/** Stable identifier: `${checkpoint number}:${slot}`. Used as the parent's map key. */
public static idFor(checkpoint: Checkpoint): string {
return `${checkpoint.number}:${checkpoint.header.slotNumber}`;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes a difference based on how reorgs are handled, but I'd throw in the checkpoint archive as part of the identifier, just in case

Comment on lines +237 to +246
const publicTxs = allTxs.filter(tx => tx?.data.forPublic);
if (publicTxs.length > 0) {
this.deps.log.verbose(
`Kicking off ${publicTxs.length} chonk-verifier circuits for checkpoint ${this.checkpoint.number}`,
{
checkpointNumber: this.checkpoint.number,
publicTxCount: publicTxs.length,
},
);
await this.subTree.startChonkVerifierCircuits(publicTxs);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this didn't change with this PR, but for private-only txs, who verifies the chonk proof?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The private base rollup. Public txs have a different process. They require a dedicated chonk verifier circuit.

Comment on lines +123 to +124
public start(): Promise<TopTreeProof> {
void this.run();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it it's responsibility of the caller to check that all dependencies are ready, right?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which dependencies are you referring to? At the point this class is constructed, it's arguments contain all information required to start the proving process (e.g. compute blob challenges). If the checkpoint root input proofs aren't ready yet that is fine, the orchestrator fires the appropriate events as and when they are.

Base automatically changed from pw/optimistic-orchestrator to merge-train/spartan May 13, 2026 10:03

@alexghr alexghr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea! I'd consider extending it to Base/Root parity circuits too because they could benefit from the same caching behaviour across reorged checkpoints

fcarreiro and others added 6 commits May 13, 2026 17:31
…al pipeline (#23245)

## Summary

- Removes `FastTxCollection` as a separate class and absorbs all its
logic directly into `TxCollection`
- Replaces the old parallel file-store delay with a single sequential
pipeline: node RPC → reqresp → file store, where each phase blocks on
the previous (cancellation-aware)
- File store collection is now driven by `IRequestTracker` — the same
synchronization primitive used by node and reqresp paths. The tracker is
the single source of truth for "is this tx still missing?" and "is this
request still alive?"
- `FileStoreTxCollection` simplified: dropped
`start()`/`stop()`/persistent worker pool/`wakeSignal`.
`startCollecting(requestTracker, context)` returns `Promise<void>`,
spins up its own per-call worker pool, and workers self-terminate when
the tracker is cancelled (all-fetched / deadline / external)

## Collection flow inside `collectFast`

1. Start node RPC collection in the background
2. Wait `txCollectionFastNodesTimeoutBeforeReqRespMs` — interruptible by
cancellation **or by node exhaustion** (so when no nodes are configured,
reqresp starts immediately)
3. Start reqresp in the background (parallel with nodes)
4. Wait `txCollectionFileStoreFastDelayMs` — interruptible by
cancellation or reqresp completion
5. Start file store collection in the background (its workers
self-terminate)
6. `Promise.allSettled` on node + reqresp + file store

`txCollectionFileStoreFastDelayMs` description updated to reflect it is
now anchored to reqresp start, not collection start.

## File store / tracker integration

- `FileStoreTxCollection.startCollecting` no longer takes `(txHashes,
context, deadline)`; it takes `(requestTracker, context)` and reads the
missing txs + deadline from the tracker
- Workers check `requestTracker.isMissing(hash)` each scan — if the tx
was found via another path (node/reqresp/gossipsub), the entry is
dropped without an extra fetch
- Workers race their backoff sleeps against
`requestTracker.cancellationToken` — cancelling a request (deadline,
`stopCollectingForBlocksUpTo/After`, or `stop()`) propagates to file
store workers immediately
- Removed `foundTxs`/`clearPending` plumbing on `FileStoreTxCollection`
— the tracker handles both implicitly
- `startCollecting` yields once after building its entry set, so a
synchronous follow-up call (e.g. `markFetched` in tests, or the
gossipsub-found path in production) lands before workers begin scanning

## Tests

- `tx_collection.test.ts`: collapsed the `TestFastTxCollection`
subclass; all accesses go directly through `TxCollection`. Added "starts
reqresp immediately when no nodes are configured" covering the
node-exhaustion shortcut
- `file_store_tx_collection.test.ts`: rewritten for the new shape — no
`start()`/`stop()`, lifecycle driven by the tracker (cancel to terminate
workers). New "workers exit when tracker is cancelled" covers the
per-call worker-pool teardown

Closes
https://linear.app/aztec-labs/issue/A-933/tx-collection-dont-retrieve-transactions-that-have-already-been
via new synchronization with the request tracker.
…ims (#23165)

## Context

`SequencerPublisher` simulates each enqueued L1 action individually at
enqueue time, then sends them bundled through Multicall3. The `propose`
checkpoint action is validated at enqueue and send time (the latter via
a `preCheck` mechanism), but in isolation and relying on overrides.
There is no simulation of the multicall payload before sending it, so a
reverting tx is most likely not caught.

This refactor:

- Replaces the per-request `preCheck` mechanism with a **single
bundle-level `eth_simulateV1`** of the assembled `aggregate3` payload,
run right before send. If any entry reverts in sim it is dropped from
the bundle, the reduced bundle is re-simulated to get an honest
`gasUsed`, and the survivors are sent. Extracted to a
`SequencerBundleSimulator`.
- Drops the entire propose simulate at enqueue (`simulateProposeTx`,
`validateCheckpointForSubmission`). The bundle simulate covers it.
- Adds a new pre-broadcast `validateBlockHeader` call (calling
`validateHeaderWithAttestations` with empty attestations +
`ignoreSignatures: true`) that catches header-level bugs before we
gossip the proposal to peers. Emits a new `header-validation-failed`
event on failure.
- Drops every per-action simulate at enqueue (governance signal **and**
slashing votes/executes). Bundle simulate at send time is the single
decision point for every per-action revert. `simulateAndEnqueueRequest`
is deleted. We were enqueuing votes even if the simulation failed, after
all.
- Rewrites `sendRequestsAt` so it takes an L2 `SlotNumber`, derives the
timestamp for the start of that slot, and sleeps until one L1 slot
before that boundary, so we can land on the first L1 slot of the target
L2 slot.
- Centralises `SimulationOverridesPlan` construction into a single
`buildCheckpointSimulationOverridesPlan` helper. The plan **always**
pins both `pending` and `proven` chain tips (to the pipelined parent /
invalidation target, or to the current snapshot when neither applies),
so `STFLib.canPruneAtTime` cannot reintroduce a phantom prune during
simulation.
- Makes `SimulationOverridesBuilder.merge` undefined-safe: explicit
`undefined` fields in an incoming plan no longer erase previously-set
values. `withPendingTempCheckpointLogFields` now accepts a partial
subset of fields.
- Moves the payload-empty cache onto `GovernanceProposerContract` next
to its concern. Only `isPayloadEmpty=false` is cached (a CREATE2
redeploy could go empty → populated).
- Drops the old Multicall3 revert-recovery and per-request-resim
machinery, since with `allowFailure: true` the top-level multicall is
expected to land successfully. `Multicall3.forward` now throws
`MulticallForwarderRevertedError` if the receipt reports a reverted
status; the publisher does **not** rotate to a new publisher on that
error (on-chain failure, not a send failure). Adds `Multicall3.hasCode`
helper and a `simulateAggregate3` entrypoint used by the bundle
simulator.
- `L1TxUtils.sendTransaction` fails fast if `txTimeoutAt` has already
elapsed when called. `SequencerPublisher.forwardWithPublisherRotation`
re-checks the deadline at the head of each rotation iteration so it
doesn't keep cycling through publishers after the L2 slot's submission
window has closed.
- Sequencer escape-hatch (`voteInSlotWithoutSyncing`) and
full-escape-hatch (`voteOnSlotWithEscapeHatch`) vote-only paths now
submit via `sendRequestsAt(slot)` rather than `sendRequests()`, so the
bundle-simulate `block.timestamp` override matches the slot the EIP-712
vote signatures were generated for.

The intended outcome is a publisher with one explicit re-validation
point (the bundle simulate), measurable bundle gas (from the bundle
simulate's `gasUsed`), and dead/duplicated state-override plumbing
removed.

## Resulting simulations after this refactor

The full list of simulation / gas-estimation steps that remain in a
pipelined proposer slot, in execution order.

### Pre-build, in `Sequencer.doWork`

1. **`publisher.canProposeAt`** — rollup view call simulated with the
centralised override plan. Cheap pre-check gate before any block-build
work.
2. **`publisher.simulateInvalidateCheckpoint`** (conditional) — runs
**only** if `syncedTo.pendingChainValidationStatus.valid === false` AND
`!syncedTo.hasProposedCheckpoint`. Simulates the invalidate call against
the rollup. Result becomes the `invalidateCheckpoint` package passed
into `CheckpointProposalJob`. The previous code called this even when
there's a proposed parent and discarded the result; this refactor adds
the `!hasProposedCheckpoint` gate so we skip the wasted RPC.

### Per-slot, in `CheckpointProposalJob.proposeCheckpoint`

3. **CheckpointVoter votes** — `CheckpointVoter.enqueueVotes()` runs at
the top of `execute()`, returning two promises that are awaited in
parallel with block-build. It enqueues two kinds of votes via the
publisher, **neither of which simulates at enqueue time** after this
refactor:
- **`enqueueGovernanceCastSignal`** — does an `isPayloadEmpty`
pre-flight check (now on `GovernanceProposerContract`), then enqueues.
No `eth_simulateV1`.
- **`enqueueSlashingActions`** (one call per slashing action, type
`vote-offenses` or `execute-slash`) — builds the request and enqueues.
No `eth_simulateV1`.

Real reverts on any of these are caught by the bundle simulate at send
time, which drops the failing entry and proceeds with the survivors.
4. **`publisher.validateBlockHeader` (NEW: pre-broadcast)** — replaces
the old `simulateProposeTx`-at-enqueue. Calls
`validateHeaderWithAttestations` with empty attestations and
`ignoreSignatures: true` so the rollup runs the header checks (archive
match, slot match, timestamp, mana-min-fee, …) without needing real
attestations. Runs **before** we gossip the proposal to peers. If it
fails, abort the slot — log an error, emit `header-validation-failed`,
don't broadcast, don't enqueue.
5. **`prepareProposeTx → validateBlobs estimateGas`** — kept as the
blob-commitment **consistency check** (detects locally-built commitments
not matching the blob sidecars). Returns `blobEvaluationGas`, which we
stash on the propose `RequestWithExpiry` for use by the bundle gasLimit
later. The simulate-step that previously paired with this
(`simulateProposeTx`) is removed.

### Background pipeline, in
`waitForAttestationsAndEnqueueSubmissionAsync`

6. **`publisher.simulateInvalidateCheckpoint` (conditional)** — runs
**only** in the fallback path where attestation collection failed AND
the pending chain turned out to be invalid. Triggered from
`CheckpointProposalJob.enqueueInvalidation`. This is the second, late
trigger for invalidation simulation — distinct from step 2's pre-build
trigger.

### Send time, in `sendRequestsAt(targetSlot)`

7. **Bundle simulate (NEW)** — single `eth_simulateV1` of the assembled
`aggregate3` payload, with `block.timestamp` overridden to the start of
`targetSlot`, and state overrides = `[disableBlobCheck]` iff `propose`
is in the bundle and `[]` otherwise. Per-entry result decoded from the
returned `Result[]`. This is the **only** post-pipeline-sleep
re-validation; it replaces the per-request `preCheck` mechanism
entirely.
8. **Bundle re-simulate (NEW, conditional)** — runs **only** when step 7
dropped at least one entry. Re-runs the bundle simulate on the reduced
payload to get an honest `gasUsed`, and applies the same per-entry
decode so additional drops are caught. If the re-simulate falls back
(node doesn't support `eth_simulateV1`), the publisher sends the
**first-pass survivors only** with `MAX_L1_TX_LIMIT`; the entries that
the first pass already proved would revert stay dropped and are reported
as failed actions.

### Post-send

No diagnostic-only simulate paths remain. `Multicall3.forward` throws
`MulticallForwarderRevertedError` on a reverted receipt and re-throws on
a send error; per-request revert resimulation has been removed.

## Known caveats

- **`sendRequestsAt` early lead**: sleeps until `startOfTargetSlot -
ethereumSlotDuration` to maximise inclusion in the first L1 block of the
L2 slot. There is a known correctness risk: a tx mined in the L1 block
immediately preceding the L2-slot boundary would revert via
`ProposeLib.validateHeader`'s `slot ==
block.timestamp.slotFromTimestamp()` check. In practice the prior L1
block is usually already committed before this send wakes; if observed
to be unreliable in production, tune the lead down, especially on tests.
- **`validateBlockHeader` pre-broadcast coverage**: covers the
`validateHeader` checks (archive, slot/timestamp, mana-min-fee, …) and
the empty-attestation path of `validateHeaderWithAttestations`, but does
NOT cover proposer-signature verification, inbox consumption
(`Rollup__InvalidInHash`), or `header.inHash` match. Those still execute
inside the full `propose` and are caught by the bundle simulate at send
time. The cost of a rare miss is one wasted broadcast.
- **Top-level `aggregate3` revert diagnostics removed**: the previous
`Multicall3.forward` code decoded receipt-reverted reasons via
`tryGetErrorFromRevertedTx` and did a per-request resim on send-throw.
Both paths are gone. With `allowFailure: true` and `Multicall3.hasCode`
covering the no-bytecode case, a reverted forwarder receipt is genuinely
unexpected (OOG, forwarder bug). The throw of
`MulticallForwarderRevertedError` is the only diagnostic surface —
operators will need the transaction hash from the log to investigate.
@ludamad ludamad force-pushed the merge-train/spartan branch from 4af2626 to db4ec58 Compare May 16, 2026 19:07
tryCompleteEpoch now returns whether the epoch is in a handled state
(handed off now, previously handed off, or no job for an already-
finalized epoch). EpochMonitor only advances latestEpochNumber on a
true return, so the monitor retries while the archiver is still
catching up or expected checkpoints are not yet registered.

Reorders the archiver isEpochComplete check before the job lookup so
the monitor keeps firing until L1 confirms completion; only then does
the "no job" case short-circuit as nothing-to-do.

A-1051.
PhilWindle added a commit that referenced this pull request May 27, 2026
Adds the scenario flagged in the PR #23002 review: a message-bearing tx that is
reorged out of its checkpoint and remined into a fresh one must still prove. The
test sends an L2->L1 message tx, reorgs out its checkpoint, waits for the node to
detect the prune and the tx to be remined, advances the epoch to proven, then
consumes the message from the L1 outbox — which only succeeds if the message's
out-hash was correctly carried through the remine into a valid epoch proof (A-1039).
PhilWindle added a commit that referenced this pull request May 27, 2026
Addresses the remaining half of the PR #23002 review comment ("add txs ... a tx
that was mined in a checkpoint is now mined in a different one") directly in
epochs_optimistic_proving.parallel.test.ts: sends a real tx, reorgs out the
checkpoint containing it, waits for the node to detect the prune and the tx to be
remined into a fresh checkpoint, then proves that checkpoint on L1.
PhilWindle added a commit that referenced this pull request May 27, 2026
The session/top-tree hooks (beforeTopTreeProve/afterTopTreeProve/topTreeProveOverride)
existed on the types but had no injection path — dead code. Wire a setSessionHooks
path through ProverNode -> SessionManager so hooks apply to every session constructed
after the call.

Use it to replace the monkey-patching of createTopTreeOrchestrator in two e2e tests
(addresses the PR #23002 review comment suggesting hook reuse):
- epochs_optimistic_proving 'reorg during proving' gates via beforeTopTreeProve.
- epochs_upload_failed_proof forces a failure via topTreeProveOverride.
PhilWindle added a commit that referenced this pull request May 27, 2026
Addresses the PR #23002 review question about a reorg removing all of an epoch's
checkpoints and new ones being added immediately after. Reconcile drops the session
with no error when canonical content goes empty, and reopens a fresh session for the
same epoch once a checkpoint is re-added — this test asserts that full cycle.
PhilWindle added a commit that referenced this pull request May 27, 2026
…chive root

Switches the CheckpointProver identity from (number, slot, prevArchiveRoot) to
(number, slot, checkpoint.archive.root) — the checkpoint's own post-state archive,
as suggested in PR #23002 review.

prevArchiveRoot only distinguished reorg branches that diverged before the checkpoint;
it could not tell apart two checkpoints built on the same predecessor at the same slot
with different content (e.g. a reorg replacement with a different tx set), which would
have wrongly reused the stale prover. Keying on the post-state archive root closes that:
identical content collapses to one prover (reusing its in-flight sub-tree), any
difference in history or content yields a distinct prover.

prevArchiveRoot was only ever used to build the id, so it's removed from the prover/
register-data entirely.
@PhilWindle PhilWindle closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants