feat(prover-node): checkpoint-driven optimistic proving by PhilWindle · Pull Request #23002 · AztecProtocol/aztec-packages

PhilWindle · 2026-05-06T17:29:54Z

Drives the prover-node onto the split CheckpointSubTreeOrchestrator + TopTreeOrchestrator pair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.

Stacked on #22996 (orchestrator split). Diff above is just the prover-node + e2e + cleanup delta.

What's new

`prover-node` — `EpochProvingJob` job-model rewrite

EpochProvingJob becomes an orchestrator over a Map<string, CheckpointJob> keyed by ${number}:${slot}. Each CheckpointJob owns a single CheckpointSubTreeOrchestrator with its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). A TopTreeJob drives the epoch root rollup once all checkpoint sub-trees have started block-level proving.

Public API:

registerCheckpoint — synchronous; sets up sub-tree, kicks off chonk-verifier cache fill, attaches the blockProofs promise to the eventual top-tree job.
provideTxs — supplies simulated txs, transitions the checkpoint job from registered → block-proving.
removeCheckpointsAfter(threshold) — sole removal API. Atomically cancels every job whose checkpoint number sits strictly above the threshold. Removal is suffix-only by design: middle-removal would leave a non-contiguous survivor set that cannot be proved.
getCheckpointCount, getCheckpointNumbers.
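
As a rough illustration of the suffix-only removal semantics, here is a minimal sketch. The `SketchCheckpointRegistry` class and the `SketchCheckpointJob` shape are hypothetical stand-ins, not the PR's actual types; only the `${number}:${slot}` key format and the "strictly above the threshold" rule come from the description above.

```typescript
// Hypothetical, simplified stand-in for a checkpoint job.
interface SketchCheckpointJob {
  checkpointNumber: number;
  slot: number;
  cancelled: boolean;
}

class SketchCheckpointRegistry {
  // Keyed by `${number}:${slot}`, mirroring the key format in the description.
  private readonly jobs = new Map<string, SketchCheckpointJob>();

  register(checkpointNumber: number, slot: number): SketchCheckpointJob {
    const job: SketchCheckpointJob = { checkpointNumber, slot, cancelled: false };
    this.jobs.set(`${checkpointNumber}:${slot}`, job);
    return job;
  }

  // Cancels and drops every job strictly above the threshold; returns the removed jobs.
  removeCheckpointsAfter(threshold: number): SketchCheckpointJob[] {
    const removed: SketchCheckpointJob[] = [];
    // Iterate over a snapshot of the entries so deletion cannot disturb iteration.
    Array.from(this.jobs.entries()).forEach(([key, job]) => {
      if (job.checkpointNumber > threshold) {
        job.cancelled = true;
        this.jobs.delete(key);
        removed.push(job);
      }
    });
    return removed;
  }

  getCheckpointNumbers(): number[] {
    return Array.from(this.jobs.values())
      .map(job => job.checkpointNumber)
      .sort((a, b) => a - b);
  }
}
```

Because only a suffix can be removed, the surviving checkpoint numbers always form a prefix of what was registered, so they stay contiguous.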

TopTreeJob constructor defensively asserts that its snapshot's checkpoint numbers form a contiguous run (prev + 1 === curr). With suffix-only removal this is structurally guaranteed; the assertion is a runtime tripwire against future regressions.
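
The contiguity tripwire could look roughly like the following. `assertContiguousCheckpoints` is a hypothetical helper name; only the `prev + 1 === curr` condition is taken from the description.

```typescript
// Hypothetical helper: throws if the checkpoint numbers do not form a contiguous run.
function assertContiguousCheckpoints(checkpointNumbers: number[]): void {
  let prev: number | undefined;
  for (const curr of checkpointNumbers) {
    // Every element after the first must be exactly one above its predecessor.
    if (prev !== undefined && prev + 1 !== curr) {
      throw new Error(`Non-contiguous checkpoints: ${prev} is followed by ${curr}`);
    }
    prev = curr;
  }
}
```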

`prover-node` — `L2BlockStream`-driven checkpoint pipeline

The prover-node consumes chain-checkpointed / chain-pruned events from an L2BlockStream rooted at the first block of the first unproven epoch. On each chain-checkpointed:

Resolve the epoch via getEpochAtSlot.
Get-or-create the per-epoch EpochProvingJob.
Register the checkpoint synchronously and detach a tx-gathering task.
Call tryCompleteEpoch for both older epochs (tertiary signal) and the current epoch (covers the case where the EpochMonitor already fired and we were waiting on this last checkpoint to arrive).

On chain-pruned: call removeCheckpointsAfter(threshold) on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled via AbortSignal.

Finalization is driven by the union of three signals — EpochMonitor (epoch closes on L1), tertiary (a checkpoint for a strictly later epoch arrives), or final-checkpoint registration on an already-monitored epoch. Each signal calls tryCompleteEpoch, which queries l2BlockSource.isEpochComplete directly: the archiver is the source of truth, so the prover-node carries no in-memory cache of "epochs complete on L1".
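
A minimal sketch of how the three signals might funnel into one decision point. `SketchFinalizer` and its callbacks are hypothetical, and the sketch is synchronous for brevity, whereas the real `l2BlockSource.isEpochComplete` query is presumably asynchronous.

```typescript
// Hypothetical model: every signal calls tryCompleteEpoch, which always asks the
// source of truth instead of caching "complete on L1" locally.
class SketchFinalizer {
  readonly completed = new Set<number>();
  private readonly isEpochComplete: (epoch: number) => boolean;
  private readonly allCheckpointsRegistered: (epoch: number) => boolean;

  constructor(
    isEpochComplete: (epoch: number) => boolean,
    allCheckpointsRegistered: (epoch: number) => boolean,
  ) {
    this.isEpochComplete = isEpochComplete;
    this.allCheckpointsRegistered = allCheckpointsRegistered;
  }

  tryCompleteEpoch(epoch: number): void {
    if (this.completed.has(epoch)) return; // already handed off for finalization
    if (!this.isEpochComplete(epoch)) return; // archiver says not complete yet
    if (!this.allCheckpointsRegistered(epoch)) return; // still waiting on the stream
    this.completed.add(epoch);
  }

  // Signal 1: the epoch monitor saw the epoch close on L1.
  onEpochMonitor(epoch: number): void {
    this.tryCompleteEpoch(epoch);
  }

  // Signals 2 and 3: a checkpoint arrived, so retry older epochs and the current one.
  onCheckpoint(currentEpoch: number, olderEpochs: number[]): void {
    olderEpochs.forEach(epoch => this.tryCompleteEpoch(epoch));
    this.tryCompleteEpoch(currentEpoch);
  }
}
```

Whichever signal fires last is the one that completes the epoch, which is why the order in which they arrive does not matter.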

`prover-node` — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.

e2e

New epochs_optimistic_proving.parallel.test.ts: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
The two happy-path tests use a wall-clock-vs-job-epoch sampler to assert that the prover-node has registered checkpoints with the in-flight job for an epoch while still inside that epoch — proves checkpoints aren't deferred until epoch end. The multi-epoch test runs 4 epochs and asserts this for each.
epochs_proof_fails, epochs_upload_failed_proof, epochs_long_proving_time, epochs_multi_proof updated to assert the new in-flight epoch behaviour.

Logging

Structured-context logs added across the CheckpointJob / TopTreeJob / EpochProvingJob lifecycles: register/provide-txs/sub-tree-created/per-block-progress/blockProofs-ready/cancel/teardown for checkpoints, and create/cancel/prove-start/prove-success for top-tree. Every log entry indexes on epochNumber / checkpointNumber per repo convention.

What's removed

EpochProver interface and ServerEpochProver are removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers. ProvingOrchestrator survives only as a base class for CheckpointSubTreeOrchestrator and as the single-class driver used by prover-client's integration tests; it no longer implements EpochProver.

EpochProvingJob.removeCheckpoint(checkpointNumber) and cancelPendingCheckpoints() are also removed: middle-removal could leave a non-contiguous survivor set that the TopTreeJob constructor would reject. Suffix-only removeCheckpointsAfter is the only checkpoint-removal API now.

Test plan

yarn workspace @aztec/prover-client test — 261 tests pass.
yarn workspace @aztec/prover-node test — 116 tests pass (incl. new dedicated top-tree-job.test.ts and checkpoint-job.test.ts).
e2e tests covering optimistic proving, reorgs during proving, failed proof publish, and multi-checkpoint flows are included in this PR.

…pair

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy `EpochProver` / `ProvingOrchestrator` / `EpochProvingState` path untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers with the new `TopTreeOrchestrator`. Split out from #22990 so it can land independently.

## What's new

- **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`): extends `ProvingOrchestrator`, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves a `SubTreeResult` instead of escalating to the checkpoint root — the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty challenges)` to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running.
- **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven cancel from a genuine proving failure.
- **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof.
- **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller list, and a unified `deferredProving<S, T>(state, request, callback, isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is just `verifyState()` + `reject(reason)`.
- **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans; top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to bridge the two states' differing `resolve` signatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ.
- **`EpochProverFactory` interface on `ProverClient`**: new factory methods `createEpochProvingContext(epochNumber)`, `createCheckpointSubTreeOrchestrator(...)`, and `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is owned by `ProverClient` and shared across every orchestrator.

## What changes in existing code

- `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`, `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree can override them; `provingState` and `provingPromise` likewise become `protected` so the sub-tree can hook the parent's failure stream onto `subTreeResult`. No public API change on `ProvingOrchestrator`.
- `CheckpointProvingState`: gains two read-only accessors used by the sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and `getLastArchiveSiblingPath()`. No state changes.
- `ProverClient` keeps `createEpochProver()` exactly as before (each call spawns its own `BrokerCircuitProverFacade`); the new factory methods share a `getFacade()` set up in `start()` and torn down in `stop()`.

`EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`, the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`, and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the prover-node and e2e tests continue to build against the existing `EpochProver` API. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR.

## Test plan

- [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`).
- [x] `yarn build` clean against current merge-train/spartan (modulo the pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).

Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` + `TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.

## What's new

### `prover-node` — `EpochProvingJob` job-model rewrite

`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>` keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single `CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees have started block-level proving.

Public API:

- `registerCheckpoint` — synchronous; sets up sub-tree, kicks off chonk-verifier cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from registered → block-proving.
- `removeCheckpoint(synchronous, idempotent)` — drops a single checkpoint by `(number, slot)`, fire-and-forget cancels its sub-tree. Tolerates re-add of the same checkpoint number under a different slot.
- `removeCheckpointsAfter`, `getCheckpointCount`, `getCheckpointNumbers`, `cancelPendingCheckpoints`.

### `prover-node` — `L2BlockStream`-driven checkpoint pipeline

The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an `L2BlockStream` rooted at the first block of the first unproven epoch. On each `chain-checkpointed`:

1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Detached-task gather txs + register the checkpoint with the job.

On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled via `AbortSignal`.

Finalization is driven by the union of three signals: epoch-monitor sees the epoch close on L1, a checkpoint for a strictly later epoch arrives, or all expected checkpoints (per archiver) are registered while the epoch is already complete on L1.

### `prover-node` — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.

### e2e

- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`, `epochs_multi_proof` updated to assert the new in-flight epoch behaviour.

## What's removed

`EpochProver` interface and `ServerEpochProver` are removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers. `ProvingOrchestrator` survives only as a base class for `CheckpointSubTreeOrchestrator` and as the single-class driver used by `prover-client`'s integration tests; it no longer implements `EpochProver`.

## Test plan

- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 89 tests pass.
- e2e tests covering optimistic proving, reorgs during proving, failed proof publish, and multi-checkpoint flows are included in this PR.

spalladino

Got halfway through it, but looks good so far. Pasting here some observations by Claude that I didn't get to review myself:

Same-epoch replacement checkpoints are silently dropped after completeEpoch(). prover-node.ts:213-220 skips any checkpoint for an epoch whose job has isEpochComplete(). Sequence: (1) all checkpoints registered, completeEpoch() fires; (2) prune removes the suffix; (3) the L1 reorg replacement for the same epoch arrives. The replacement is dropped because existingJob.isEpochComplete() is true. The finalize loop then restarts against an incomplete prefix of the epoch and eventually publishes a proof for a checkpoint range that does not match the canonical L1 chain. This contradicts the stated "orphan + replacement coexist briefly" model. The check should be conditioned on the surviving checkpoint set, not on the boolean flag.
A prune that lands after topTreeJob.start() but during publishProof is not interceptable. epoch-proving-job.ts:478 clears this.topTreeJob immediately after the prove resolves; the publish at epoch-proving-job.ts:487 operates on the captured snapshot array. Meanwhile removeCheckpointsAfter (epoch-proving-job.ts:332-369) only acts on the live topTreeJob — which is now undefined — so the publish proceeds with stale data and the (out-of-date) attestations from snapshot.at(-1). Best case L1 rejects the proof (gas burn, retry confusion); worst case it lands against a window the canonical chain has since invalidated. Either re-validate the snapshot's checkpoint set (by id) against the live registry just before submitting, or hold the top-tree-job reference until publish completes and route prune through it.
EpochMonitor permanently records "processed" on a no-op tryCompleteEpoch. prover-node.ts:145-153 returns true from handleEpochReadyToProve whenever tryCompleteEpoch doesn't throw — even if it bailed at prover-node.ts:394 (archiver says incomplete) or prover-node.ts:399-405 (not all checkpoints known to the job). epoch-monitor.ts:93 then sets latestEpochNumber = epochToProve, and epoch-monitor.ts:76-79 ensures the same epoch is never re-considered. In a healthy network the next checkpoint event for a later epoch reseeds via the tertiary signal, but if the epoch in question is the last produced (network stall, end-of-test) it stays stuck forever. tryCompleteEpoch should return whether completeEpoch() was actually invoked, and handleEpochReadyToProve should propagate that.
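
The fix suggested in the last observation can be sketched as follows. `SketchEpochMonitor` is a hypothetical, simplified model of the monitor (synchronous for brevity), not the code in `epoch-monitor.ts`; the only idea taken from the comment is that `latestEpochNumber` should advance only when the handler reports that completion was actually invoked.

```typescript
// Hypothetical monitor: an epoch counts as processed only if the handler made progress.
class SketchEpochMonitor {
  latestEpochNumber: number | undefined = undefined;
  private readonly handler: (epoch: number) => boolean;

  constructor(handler: (epoch: number) => boolean) {
    this.handler = handler;
  }

  // Called periodically with the epoch that currently looks ready to prove.
  poll(epochToProve: number): void {
    if (this.latestEpochNumber !== undefined && epochToProve <= this.latestEpochNumber) {
      return; // already processed
    }
    // The handler returns true only when completeEpoch() was really invoked.
    const handled = this.handler(epochToProve);
    if (handled) {
      this.latestEpochNumber = epochToProve;
    }
    // Otherwise the same epoch is offered again on the next poll.
  }
}
```

With this shape, a no-op `tryCompleteEpoch` leaves the epoch eligible for the next poll instead of marking it as done.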

spalladino · 2026-05-08T22:48:55Z

+    // First, update the local tips store so the stream can track our state.
+    await this.tipsStore.handleBlockStreamEvent(event);


Note that by updating the store first, any error in the handle methods below will not be retried in the next iteration of the L2BlockStream. Not sure if this is intentional.

Yes it should go at the end I think.
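
A minimal sketch of the ordering being agreed on here, assuming the stream re-delivers any event beyond the stored tip. `SketchStreamHandler` and `SketchStreamEvent` are hypothetical stand-ins, not the prover-node's actual types.

```typescript
// Hypothetical event shape for illustration.
interface SketchStreamEvent {
  type: string;
  blockNumber: number;
}

class SketchStreamHandler {
  // Stand-in for the tips store: the last block number we acknowledged.
  tip = 0;
  private readonly processEvent: (event: SketchStreamEvent) => void;

  constructor(processEvent: (event: SketchStreamEvent) => void) {
    this.processEvent = processEvent;
  }

  handle(event: SketchStreamEvent): void {
    // Process first: if this throws, the tip stays where it was,
    // so the same event can be delivered again on the next iteration.
    this.processEvent(event);
    // Only acknowledge the event once processing has succeeded.
    this.tip = event.blockNumber;
  }
}
```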

spalladino · 2026-05-08T22:54:42Z

+    // (e.g. an early proof submission landed for an earlier epoch that ends mid-way
+    // through this epoch's slot range) does not prove the entire epoch. We need every
+    // checkpoint of a partially-proven epoch to feed the orchestrator.
+    if (await this.isEpochFullyProven(epochNumber, l1Constants)) {


We should check in the archiver and blockstream what happens first: do checkpoints get emitted, or does the proven block number get updated? I'm worried that, when starting a long sync after being offline for a while, all the checkpoints are emitted first and then the proven tip is updated, so the prover node would trigger a ton of jobs only to cancel them immediately afterwards.

Ah just saw the compute starting block number for the blockstream. And as long as the prover node waits for the archiver initial sync to be completed before starting, we should be good I guess.

spalladino · 2026-05-08T22:57:01Z

+   * to the job for finalization. If checkpoints are still being delivered by the
+   * L2BlockStream, the handoff happens later when the last one arrives.
   */
  async handleEpochReadyToProve(epochNumber: EpochNumber): Promise<boolean> {


I think we can now replace the entire epoch monitor with a running promise that just calls l2BlockSource.isEpochComplete periodically.

spalladino · 2026-05-08T23:04:16Z

+    // Add to an existing job, or create a new one if allowed.
+    let job = existingJob;
+    if (!job) {
+      const { proverNodeMaxPendingJobs: maxPendingJobs } = this.config;
+      if (maxPendingJobs > 0 && this.jobs.size >= maxPendingJobs) {
+        this.log.debug(
+          `Skipping checkpoint ${checkpoint.number} for epoch ${epochNumber}: max pending jobs ${maxPendingJobs} reached`,
+        );
+        return;
+      }
+      job = await this.createEpochJob(epochNumber);
+    }


Won't this create a hole in a given epoch? I don't see another code path where we patch an epoch job with checkpoint jobs we failed to add due to lack of capacity. I'm worried about a scenario where we don't add a checkpoint to an epoch, but then a prior epoch finishes, so we do add the next, and that epoch job ends up waiting for that missing checkpoint forever.

Speaking of which: is there anything that cleans up very old jobs when the epoch proof submission window closes?

Yes, was going to ask about this. As it stands here there could be problems if this is set. Do you think this is needed or makes sense anymore? Proving is more of a continuous process rather than a set of large jobs.

spalladino · 2026-05-08T23:04:44Z

+
+    const taskId = `${checkpoint.number} - ${checkpoint.slot}`;
+    const task = this.gatherAndAddCheckpoint(job, checkpoint, epochNumber, abortSignal);
+    this.pendingGatherTasks.set(taskId, task);


Maybe it makes sense to move the responsibility of collecting their data to the checkpoint jobs, to keep the prover node leaner?

I think it does. As a further refactor, I want to break the checkpoint proving out of EpochProvingJob. Then we maintain a collection of checkpoint jobs being proven. EpochProvingJob then becomes a collection of references to a subset of those checkpoints and a top tree. This would enable us to create instances of EpochProvingJob with arbitrary checkpoints (of a valid range of course) enabling efficient partial epoch proving.

spalladino · 2026-05-08T23:13:39Z

Looks comprehensive. I'd add tests for cross-chain messages as well, those are usually a source of problems.

spalladino · 2026-05-11T12:56:02Z

+        const wallClockEpoch = Number(test.epochCache.getEpochAndSlotNow().epoch);
+        const job = proverNode.epochJobs.get(wallClockEpoch);
+        if (job && job.getCheckpointCount() > 0) {
+          observed.add(wallClockEpoch);
+        }


Heads up this should pass even without optimistic proving: prover would start as soon as the last checkpoint of the epoch was pushed to L1, which could happen a few L1 slots before the actual end of the epoch. Maybe we should offset this a bit so the check is more meaningful, or check for specific checkpoint jobs vs all jobs in the epoch?

spalladino · 2026-05-11T13:02:29Z

+      // Wait for the 2nd checkpoint within this epoch.
+      const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
+      const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
+      await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
+      const checkpointBeforeReorg = test.monitor.checkpointNumber;
+      logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);


I'd also add an assertion that the prover actually started the job for the given checkpoint before reorging it out. Otherwise, this test could pass if the prover just takes a bit more time to start assembling the checkpoint job.

spalladino · 2026-05-11T13:03:38Z

+      // Wait for 2 checkpoints mid-epoch.
+      const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
+      const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
+      await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
+      const checkpointBeforeReorg = test.monitor.checkpointNumber;
+      logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);


This should also assert that both checkpoints landed in the same epoch, which might not be the case if this test started near the end of an epoch.
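
One way to express that assertion, as a sketch only: it assumes an epoch is a fixed number of consecutive slots, so that the epoch is `floor(slot / epochDuration)`. Both helper names are hypothetical, and a real test would more likely go through `getEpochAtSlot` with the L1 constants.

```typescript
// Assumption: epochs are fixed-size runs of slots, numbered from slot 0.
function sketchEpochAtSlot(slot: number, epochDuration: number): number {
  return Math.floor(slot / epochDuration);
}

// Hypothetical test helper: returns the shared epoch, or throws if the slots span epochs.
function assertSameEpoch(slots: number[], epochDuration: number): number {
  const epochs = slots.map(slot => sketchEpochAtSlot(slot, epochDuration));
  const first = epochs[0];
  if (first === undefined) {
    throw new Error("No slots given");
  }
  if (epochs.some(epoch => epoch !== first)) {
    throw new Error(`Checkpoints span several epochs: ${epochs.join(", ")}`);
  }
  return first;
}
```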

spalladino · 2026-05-11T13:10:24Z

We should also add txs. And it'd be interesting to see a case where one of the reorgs means that a tx that was mined in a checkpoint is now mined in a different one.

spalladino · 2026-05-11T13:12:56Z

+      const proverNode = test.proverNodes[0].getProverNode() as TestProverNode;
+      const proverManager = proverNode.getProver();
+      const origCreateTopTree = proverManager.createTopTreeOrchestrator.bind(proverManager);
+      let releaseProvingGate: () => void = () => {};
+      const provingGate = new Promise<void>(resolve => {
+        releaseProvingGate = resolve;
+      });
+      proverManager.createTopTreeOrchestrator = () => {
+        const topTree = origCreateTopTree();
+        const origProve = topTree.prove.bind(topTree);
+        topTree.prove = async (...args: Parameters<typeof origProve>) => {
+          logger.warn('Top-tree proving gated — waiting for test to release');
+          await provingGate;
+          logger.warn('Proving gate released');
+          return origProve(...args);
+        };
+        return topTree;
+      };


I'm wondering if there's a cleaner way of doing this, perhaps reusing the EpochProvingJobHooks? Not sure if worth stressing over it though.
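
Independently of whether `EpochProvingJobHooks` can be reused, the gating pattern in the quoted test could be factored into a small helper. This is a sketch with hypothetical names (`createGate`, `gated`), not something that exists in the repo.

```typescript
// Hypothetical one-shot gate: callers await `wait` until `release` is called.
interface SketchGate {
  wait: Promise<void>;
  release: () => void;
  isReleased: () => boolean;
}

function createGate(): SketchGate {
  let released = false;
  let resolveGate: () => void = () => {};
  const wait = new Promise<void>(resolve => {
    resolveGate = resolve;
  });
  return {
    wait,
    release: () => {
      released = true;
      resolveGate();
    },
    isReleased: () => released,
  };
}

// Wraps an async function so that every call blocks until the gate is released.
function gated<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  gate: SketchGate,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    await gate.wait;
    return fn(...args);
  };
}
```

In the test above, this would reduce the override to something like `topTree.prove = gated(origProve, gate)`, with the test calling `gate.release()` once the reorg has been triggered.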

spalladino · 2026-05-11T13:55:55Z

+        if (snapshot.length === 0) {
+          throw new Error(`Cannot finalize epoch ${this.epochNumber}: no surviving checkpoints`);
+        }


Is there a race condition where a reorg removes all checkpoints from an epoch and then new ones are immediately added again, but we hit this case before the new ones get added? Seems like something that would never happen in prod, but could affect e2e tests that have short epochs.

There shouldn't be. The prover node does this in its prune handler.

if (job.getCheckpointCount() === 0) {
  this.log.info(`Cancelling epoch ${epochNum} job — all checkpoints pruned`);
  await this.cancelAndCleanupJob(epochNum, job);
}

So that epoch job instance is removed and replaced with another when a new checkpoint arrives for the same epoch number.

I will add a test though

spalladino · 2026-05-11T14:04:00Z

+    // Capture proving data *before* we clear the live job map, so a failure-upload
+    // attempt that runs after teardown still has the per-checkpoint txs / messages /
+    // previous-block-header to work with.
+    if (!this.provingDataSnapshot) {
+      try {
+        this.provingDataSnapshot = this.buildProvingDataSnapshot();
+      } catch {
+        // No completed checkpoints — failure-upload will have nothing to upload, which
+        // is fine. Snapshot stays undefined and getProvingData will rebuild and rethrow.
+      }


How about we move the responsibility of uploading this data to the epoch-proving-job itself, so we don't need to collect this just for the prover-node to maybe read if it's configured for uploads?

spalladino · 2026-05-11T14:14:12Z

+
+  /** Stable identifier: `${checkpoint number}:${slot}`. Used as the parent's map key. */
+  public static idFor(checkpoint: Checkpoint): string {
+    return `${checkpoint.number}:${checkpoint.header.slotNumber}`;


I don't think it makes a difference based on how reorgs are handled, but I'd throw in the checkpoint archive as part of the identifier, just in case
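
A sketch of what that might look like. The `SketchCheckpoint` shape and its `archiveRoot` field are assumptions made for illustration; the real `Checkpoint` type may expose the archive differently.

```typescript
// Hypothetical, flattened checkpoint shape; `archiveRoot` is an assumed field name.
interface SketchCheckpoint {
  number: number;
  slotNumber: number;
  archiveRoot: string;
}

// Identifier of the form `${number}:${slot}:${archive}`, so two checkpoints with the
// same number and slot but different contents never share a map key.
function sketchIdFor(checkpoint: SketchCheckpoint): string {
  return `${checkpoint.number}:${checkpoint.slotNumber}:${checkpoint.archiveRoot}`;
}
```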

spalladino · 2026-05-11T14:17:47Z

+      const publicTxs = allTxs.filter(tx => tx?.data.forPublic);
+      if (publicTxs.length > 0) {
+        this.deps.log.verbose(
+          `Kicking off ${publicTxs.length} chonk-verifier circuits for checkpoint ${this.checkpoint.number}`,
+          {
+            checkpointNumber: this.checkpoint.number,
+            publicTxCount: publicTxs.length,
+          },
+        );
+        await this.subTree.startChonkVerifierCircuits(publicTxs);


I know this didn't change with this PR, but for private-only txs, who verifies the chonk proof?

The private base rollup. Public txs have a different process. They require a dedicated chonk verifier circuit.

spalladino · 2026-05-11T14:30:40Z

+  public start(): Promise<TopTreeProof> {
+    void this.run();


I take it it's the responsibility of the caller to check that all dependencies are ready, right?

Which dependencies are you referring to? At the point this class is constructed, its arguments contain all information required to start the proving process (e.g. compute blob challenges). If the checkpoint root input proofs aren't ready yet that is fine, the orchestrator fires the appropriate events as and when they are.

alexghr

Looks good!

alexghr · 2026-05-13T10:29:35Z

I think this is a great idea! I'd consider extending it to Base/Root parity circuits too because they could benefit from the same caching behaviour across reorged checkpoints

…al pipeline (#23245)

## Summary

- Removes `FastTxCollection` as a separate class and absorbs all its logic directly into `TxCollection`
- Replaces the old parallel file-store delay with a single sequential pipeline: node RPC → reqresp → file store, where each phase blocks on the previous (cancellation-aware)
- File store collection is now driven by `IRequestTracker` — the same synchronization primitive used by node and reqresp paths. The tracker is the single source of truth for "is this tx still missing?" and "is this request still alive?"
- `FileStoreTxCollection` simplified: dropped `start()`/`stop()`/persistent worker pool/`wakeSignal`. `startCollecting(requestTracker, context)` returns `Promise<void>`, spins up its own per-call worker pool, and workers self-terminate when the tracker is cancelled (all-fetched / deadline / external)

## Collection flow inside `collectFast`

1. Start node RPC collection in the background
2. Wait `txCollectionFastNodesTimeoutBeforeReqRespMs` — interruptible by cancellation **or by node exhaustion** (so when no nodes are configured, reqresp starts immediately)
3. Start reqresp in the background (parallel with nodes)
4. Wait `txCollectionFileStoreFastDelayMs` — interruptible by cancellation or reqresp completion
5. Start file store collection in the background (its workers self-terminate)
6. `Promise.allSettled` on node + reqresp + file store

`txCollectionFileStoreFastDelayMs` description updated to reflect it is now anchored to reqresp start, not collection start.

## File store / tracker integration

- `FileStoreTxCollection.startCollecting` no longer takes `(txHashes, context, deadline)`; it takes `(requestTracker, context)` and reads the missing txs + deadline from the tracker
- Workers check `requestTracker.isMissing(hash)` each scan — if the tx was found via another path (node/reqresp/gossipsub), the entry is dropped without an extra fetch
- Workers race their backoff sleeps against `requestTracker.cancellationToken` — cancelling a request (deadline, `stopCollectingForBlocksUpTo/After`, or `stop()`) propagates to file store workers immediately
- Removed `foundTxs`/`clearPending` plumbing on `FileStoreTxCollection` — the tracker handles both implicitly
- `startCollecting` yields once after building its entry set, so a synchronous follow-up call (e.g. `markFetched` in tests, or the gossipsub-found path in production) lands before workers begin scanning

## Tests

- `tx_collection.test.ts`: collapsed the `TestFastTxCollection` subclass; all accesses go directly through `TxCollection`. Added "starts reqresp immediately when no nodes are configured" covering the node-exhaustion shortcut
- `file_store_tx_collection.test.ts`: rewritten for the new shape — no `start()`/`stop()`, lifecycle driven by the tracker (cancel to terminate workers). New "workers exit when tracker is cancelled" covers the per-call worker-pool teardown

Closes https://linear.app/aztec-labs/issue/A-933/tx-collection-dont-retrieve-transactions-that-have-already-been via new synchronization with the request tracker.

…ims (#23165) ## Context `SequencerPublisher` simulates each enqueued L1 action individually at enqueue time, then sends them bundled through Multicall3. The `propose` checkpoint action is validated at enqueue and send time (the latter via a `preCheck` mechanism), but in isolation and relying on overrides. There is no simulation of the multicall payload before sending it, so a reverting tx is most likely not caught. This refactor: - Replaces the per-request `preCheck` mechanism with a **single bundle-level `eth_simulateV1`** of the assembled `aggregate3` payload, run right before send. If any entry reverts in sim it is dropped from the bundle, the reduced bundle is re-simulated to get an honest `gasUsed`, and the survivors are sent. Extracted to a `SequencerBundleSimulator`. - Drops the entire propose simulate at enqueue (`simulateProposeTx`, `validateCheckpointForSubmission`). The bundle simulate covers it. - Adds a new pre-broadcast `validateBlockHeader` call (calling `validateHeaderWithAttestations` with empty attestations + `ignoreSignatures: true`) that catches header-level bugs before we gossip the proposal to peers. Emits a new `header-validation-failed` event on failure. - Drops every per-action simulate at enqueue (governance signal **and** slashing votes/executes). Bundle simulate at send time is the single decision point for every per-action revert. `simulateAndEnqueueRequest` is deleted. We were enqueuing votes even if the simulation failed, after all. - Rewrites `sendRequestsAt` so it takes an L2 `SlotNumber`, derives the timestamp for the start of that slot, and sleeps until one L1 slot before that boundary, so we can land on the first L1 slot of the target L2 slot. - Centralises `SimulationOverridesPlan` construction into a single `buildCheckpointSimulationOverridesPlan` helper. 
The plan **always** pins both `pending` and `proven` chain tips (to the pipelined parent / invalidation target, or to the current snapshot when neither applies), so `STFLib.canPruneAtTime` cannot reintroduce a phantom prune during simulation. - Makes `SimulationOverridesBuilder.merge` undefined-safe: explicit `undefined` fields in an incoming plan no longer erase previously-set values. `withPendingTempCheckpointLogFields` now accepts a partial subset of fields. - Moves the payload-empty cache onto `GovernanceProposerContract` next to its concern. Only `isPayloadEmpty=false` is cached (a CREATE2 redeploy could go empty → populated). - Drops the old Multicall3 revert-recovery and per-request-resim machinery, since with `allowFailure: true` the top-level multicall is expected to land successfully. `Multicall3.forward` now throws `MulticallForwarderRevertedError` if the receipt reports a reverted status; the publisher does **not** rotate to a new publisher on that error (on-chain failure, not a send failure). Adds `Multicall3.hasCode` helper and a `simulateAggregate3` entrypoint used by the bundle simulator. - `L1TxUtils.sendTransaction` fails fast if `txTimeoutAt` has already elapsed when called. `SequencerPublisher.forwardWithPublisherRotation` re-checks the deadline at the head of each rotation iteration so it doesn't keep cycling through publishers after the L2 slot's submission window has closed. - Sequencer escape-hatch (`voteInSlotWithoutSyncing`) and full-escape-hatch (`voteOnSlotWithEscapeHatch`) vote-only paths now submit via `sendRequestsAt(slot)` rather than `sendRequests()`, so the bundle-simulate `block.timestamp` override matches the slot the EIP-712 vote signatures were generated for. The intended outcome is a publisher with one explicit re-validation point (the bundle simulate), measurable bundle gas (from the bundle simulate's `gasUsed`), and dead/duplicated state-override plumbing removed. 
## Resulting simulations after this refactor

The full list of simulation / gas-estimation steps that remain in a pipelined proposer slot, in execution order.

### Pre-build, in `Sequencer.doWork`

1. **`publisher.canProposeAt`**: rollup view call simulated with the centralised override plan. Cheap pre-check gate before any block-build work.
2. **`publisher.simulateInvalidateCheckpoint`** (conditional): runs **only** if `syncedTo.pendingChainValidationStatus.valid === false` AND `!syncedTo.hasProposedCheckpoint`. Simulates the invalidate call against the rollup. Result becomes the `invalidateCheckpoint` package passed into `CheckpointProposalJob`. The previous code called this even when there's a proposed parent and discarded the result; this refactor adds the `!hasProposedCheckpoint` gate so we skip the wasted RPC.

### Per-slot, in `CheckpointProposalJob.proposeCheckpoint`

3. **CheckpointVoter votes**: `CheckpointVoter.enqueueVotes()` runs at the top of `execute()`, returning two promises that are awaited in parallel with block-build. It enqueues two kinds of votes via the publisher, **neither of which simulates at enqueue time** after this refactor:
   - **`enqueueGovernanceCastSignal`**: does an `isPayloadEmpty` pre-flight check (now on `GovernanceProposerContract`), then enqueues. No `eth_simulateV1`.
   - **`enqueueSlashingActions`** (one call per slashing action, type `vote-offenses` or `execute-slash`): builds the request and enqueues. No `eth_simulateV1`.

   Real reverts on any of these are caught by the bundle simulate at send time, which drops the failing entry and proceeds with the survivors.
4. **`publisher.validateBlockHeader` (NEW: pre-broadcast)**: replaces the old `simulateProposeTx`-at-enqueue. Calls `validateHeaderWithAttestations` with empty attestations and `ignoreSignatures: true` so the rollup runs the header checks (archive match, slot match, timestamp, mana-min-fee, …) without needing real attestations. Runs **before** we gossip the proposal to peers. If it fails, abort the slot: log an error, emit `header-validation-failed`, don't broadcast, don't enqueue.
5. **`prepareProposeTx → validateBlobs estimateGas`**: kept as the blob-commitment **consistency check** (detects locally-built commitments not matching the blob sidecars). Returns `blobEvaluationGas`, which we stash on the propose `RequestWithExpiry` for use by the bundle gasLimit later. The simulate-step that previously paired with this (`simulateProposeTx`) is removed.

### Background pipeline, in `waitForAttestationsAndEnqueueSubmissionAsync`

6. **`publisher.simulateInvalidateCheckpoint` (conditional)**: runs **only** in the fallback path where attestation collection failed AND the pending chain turned out to be invalid. Triggered from `CheckpointProposalJob.enqueueInvalidation`. This is the second, late trigger for invalidation simulation, distinct from step 2's pre-build trigger.

### Send time, in `sendRequestsAt(targetSlot)`

7. **Bundle simulate (NEW)**: single `eth_simulateV1` of the assembled `aggregate3` payload, with `block.timestamp` overridden to the start of `targetSlot`, and state overrides = `[disableBlobCheck]` iff `propose` is in the bundle and `[]` otherwise. Per-entry result decoded from the returned `Result[]`. This is the **only** post-pipeline-sleep re-validation; it replaces the per-request `preCheck` mechanism entirely.
8. **Bundle re-simulate (NEW, conditional)**: runs **only** when step 7 dropped at least one entry. Re-runs the bundle simulate on the reduced payload to get an honest `gasUsed`, and applies the same per-entry decode so additional drops are caught. If the re-simulate falls back (node doesn't support `eth_simulateV1`), the publisher sends the **first-pass survivors only** with `MAX_L1_TX_LIMIT`; the entries that the first pass already proved would revert stay dropped and are reported as failed actions.

### Post-send

No diagnostic-only simulate paths remain. `Multicall3.forward` throws `MulticallForwarderRevertedError` on a reverted receipt and re-throws on a send error; per-request revert resimulation has been removed.

## Known caveats

- **`sendRequestsAt` early lead**: sleeps until `startOfTargetSlot - ethereumSlotDuration` to maximise inclusion in the first L1 block of the L2 slot. There is a known correctness risk: a tx mined in the L1 block immediately preceding the L2-slot boundary would revert via `ProposeLib.validateHeader`'s `slot == block.timestamp.slotFromTimestamp()` check. In practice the prior L1 block is usually already committed before this send wakes; if observed to be unreliable in production, tune the lead down, especially on tests.
- **`validateBlockHeader` pre-broadcast coverage**: covers the `validateHeader` checks (archive, slot/timestamp, mana-min-fee, …) and the empty-attestation path of `validateHeaderWithAttestations`, but does NOT cover proposer-signature verification, inbox consumption (`Rollup__InvalidInHash`), or `header.inHash` match. Those still execute inside the full `propose` and are caught by the bundle simulate at send time. The cost of a rare miss is one wasted broadcast.
- **Top-level `aggregate3` revert diagnostics removed**: the previous `Multicall3.forward` code decoded receipt-reverted reasons via `tryGetErrorFromRevertedTx` and did a per-request resim on send-throw. Both paths are gone. With `allowFailure: true` and `Multicall3.hasCode` covering the no-bytecode case, a reverted forwarder receipt is genuinely unexpected (OOG, forwarder bug). The throw of `MulticallForwarderRevertedError` is the only diagnostic surface, so operators will need the transaction hash from the log to investigate.
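The send-time flow in steps 7 and 8 (simulate the bundle, drop reverting entries, re-simulate the reduced bundle for an honest `gasUsed`) can be sketched as follows. This is a simplified illustration, not the `SequencerBundleSimulator` code: `BundleEntry`, `SimulateFn`, and `simulateAndPrune` are hypothetical names, the simulator is a synchronous stand-in for what would really be an async `eth_simulateV1` RPC, and the fallback path for nodes without `eth_simulateV1` is omitted.

```typescript
// Hypothetical types; the real bundle entries carry calldata, expiry, etc.
type BundleEntry = { action: string };
type EntryResult = { success: boolean };
type SimulateFn = (entries: BundleEntry[]) => { results: EntryResult[]; gasUsed: number };

// Simulate the bundle, drop entries that revert, then re-simulate the
// reduced bundle once so gasUsed reflects only what will be sent.
function simulateAndPrune(entries: BundleEntry[], simulate: SimulateFn) {
  let survivors = entries;
  const dropped: BundleEntry[] = [];
  let gasUsed = 0;
  // At most two passes: the initial simulate and one re-simulate.
  for (let pass = 0; pass < 2 && survivors.length > 0; pass++) {
    const { results, gasUsed: passGas } = simulate(survivors);
    gasUsed = passGas;
    // A missing result is treated as a failure.
    const failing = survivors.filter((_, i) => !results[i]?.success);
    if (failing.length === 0) break;
    dropped.push(...failing);
    survivors = survivors.filter((_, i) => results[i]?.success === true);
    // Note: if the second pass also drops entries, gasUsed still includes
    // them; this sketch does not simulate a third time.
  }
  return { survivors, dropped, gasUsed };
}

// Stand-in simulator: slashing votes revert, every entry costs 100 gas.
const fakeSimulate: SimulateFn = entries => ({
  results: entries.map(e => ({ success: e.action !== "vote-offenses" })),
  gasUsed: entries.length * 100,
});

const outcome = simulateAndPrune(
  [{ action: "propose" }, { action: "vote-offenses" }, { action: "governance-signal" }],
  fakeSimulate,
);
console.log(outcome.survivors.map(e => e.action), outcome.gasUsed);
```

With the stand-in simulator, the first pass reports the slashing vote as reverting, and the second pass measures gas for the two remaining entries only.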

…imistic-final

tryCompleteEpoch now returns whether the epoch is in a handled state (handed off now, previously handed off, or no job for an already-finalized epoch). EpochMonitor only advances latestEpochNumber on a true return, so the monitor retries while the archiver is still catching up or expected checkpoints are not yet registered. Reorders the archiver isEpochComplete check before the job lookup so the monitor keeps firing until L1 confirms completion; only then does the "no job" case short-circuit as nothing-to-do. A-1051.
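The retry behaviour this commit describes can be sketched as below. The `EpochHandler` interface and `MonitorSketch` class are hypothetical and synchronous for brevity; they only illustrate the contract that `latestEpochNumber` advances on a `true` return, and are not the real `EpochMonitor`.

```typescript
// Hypothetical handler contract: true means the epoch is in a handled state.
interface EpochHandler {
  tryCompleteEpoch(epoch: number): boolean;
}

class MonitorSketch {
  latestEpochNumber: number | undefined = undefined;
  private readonly handler: EpochHandler;

  constructor(handler: EpochHandler) {
    this.handler = handler;
  }

  // One polling tick for a candidate epoch.
  tick(candidateEpoch: number): void {
    // Already handled: nothing to do.
    if (this.latestEpochNumber !== undefined && candidateEpoch <= this.latestEpochNumber) {
      return;
    }
    // Advance only when the handler reports a handled state; on false the
    // same epoch is offered again on the next tick.
    if (this.handler.tryCompleteEpoch(candidateEpoch)) {
      this.latestEpochNumber = candidateEpoch;
    }
  }
}

// Stand-in handler: not ready for the first two calls (for example, the
// archiver is still catching up), handled on the third.
let calls = 0;
const handler: EpochHandler = { tryCompleteEpoch: () => ++calls >= 3 };
const monitor = new MonitorSketch(handler);

monitor.tick(4);
monitor.tick(4);
const afterTwoTicks = monitor.latestEpochNumber; // still undefined
monitor.tick(4); // handler returns true, monitor advances
monitor.tick(4); // already handled, handler is not called again
console.log(afterTwoTicks, monitor.latestEpochNumber, calls);
```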

Adds the scenario flagged in the PR #23002 review: a message-bearing tx that is reorged out of its checkpoint and remined into a fresh one must still prove. The test sends an L2->L1 message tx, reorgs out its checkpoint, waits for the node to detect the prune and the tx to be remined, advances the epoch to proven, then consumes the message from the L1 outbox — which only succeeds if the message's out-hash was correctly carried through the remine into a valid epoch proof (A-1039).

Addresses the remaining half of the PR #23002 review comment ("add txs ... a tx that was mined in a checkpoint is now mined in a different one") directly in epochs_optimistic_proving.parallel.test.ts: sends a real tx, reorgs out the checkpoint containing it, waits for the node to detect the prune and the tx to be remined into a fresh checkpoint, then proves that checkpoint on L1.

The session/top-tree hooks (beforeTopTreeProve/afterTopTreeProve/topTreeProveOverride) existed on the types but had no injection path, so they were dead code. Wire a setSessionHooks path through ProverNode -> SessionManager so hooks apply to every session constructed after the call. Use it to replace the monkey-patching of createTopTreeOrchestrator in two e2e tests (addresses the PR #23002 review comment suggesting hook reuse):
- epochs_optimistic_proving 'reorg during proving' gates via beforeTopTreeProve.
- epochs_upload_failed_proof forces a failure via topTreeProveOverride.

Addresses the PR #23002 review question about a reorg removing all of an epoch's checkpoints and new ones being added immediately after. Reconcile drops the session with no error when canonical content goes empty, and reopens a fresh session for the same epoch once a checkpoint is re-added — this test asserts that full cycle.

…chive root

Switches the CheckpointProver identity from (number, slot, prevArchiveRoot) to (number, slot, checkpoint.archive.root), the checkpoint's own post-state archive, as suggested in PR #23002 review. prevArchiveRoot only distinguished reorg branches that diverged before the checkpoint; it could not tell apart two checkpoints built on the same predecessor at the same slot with different content (e.g. a reorg replacement with a different tx set), which would have wrongly reused the stale prover. Keying on the post-state archive root closes that: identical content collapses to one prover (reusing its in-flight sub-tree), any difference in history or content yields a distinct prover. prevArchiveRoot was only ever used to build the id, so it's removed from the prover/register-data entirely.
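The keying change can be illustrated with a small sketch. `CheckpointIdentity`, `proverKey`, and `ProverRegistry` are hypothetical names, and the archive roots are placeholder strings; the real id format and prover lifecycle are not shown in this commit message.

```typescript
// Hypothetical identity triple: the checkpoint's own post-state archive root
// replaces the previous-archive root in the key.
type CheckpointIdentity = { number: number; slot: number; archiveRoot: string };

function proverKey(id: CheckpointIdentity): string {
  return `${id.number}:${id.slot}:${id.archiveRoot}`;
}

// Minimal registry: identical identity reuses the existing prover, any
// difference in the key creates a new one.
class ProverRegistry<P> {
  private readonly provers = new Map<string, P>();

  getOrCreate(id: CheckpointIdentity, create: () => P): P {
    const key = proverKey(id);
    let prover = this.provers.get(key);
    if (prover === undefined) {
      prover = create();
      this.provers.set(key, prover);
    }
    return prover;
  }

  get size(): number {
    return this.provers.size;
  }
}

let created = 0;
const makeProver = (label: string) => () => {
  created++;
  return { label };
};

const registry = new ProverRegistry<{ label: string }>();
// Same number and slot; only the post-state archive root differs.
const original = registry.getOrCreate({ number: 12, slot: 340, archiveRoot: "0xaaa" }, makeProver("original"));
const same = registry.getOrCreate({ number: 12, slot: 340, archiveRoot: "0xaaa" }, makeProver("duplicate"));
const replacement = registry.getOrCreate({ number: 12, slot: 340, archiveRoot: "0xbbb" }, makeProver("replacement"));
console.log(same === original, replacement === original, created);
```

Two checkpoints built on the same predecessor at the same slot would share a `prevArchiveRoot`, so the old key could not separate `original` from `replacement`; the post-state root differs whenever the content differs.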

PhilWindle force-pushed the pw/optimistic-final branch from 087e60b to f181313 Compare May 6, 2026 17:47

PhilWindle changed the base branch from merge-train/spartan to pw/optimistic-orchestrator May 6, 2026 17:47

PhilWindle added the ci-full (Run all master checks.) and ci-no-fail-fast (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) labels May 6, 2026

PhilWindle force-pushed the pw/optimistic-final branch from f181313 to 3694a9b Compare May 6, 2026 17:51

PhilWindle force-pushed the pw/optimistic-orchestrator branch from da67473 to 7342efc Compare May 6, 2026 18:09

PhilWindle force-pushed the pw/optimistic-final branch 17 times, most recently from fd62ad7 to 093fdde Compare May 8, 2026 08:50

PhilWindle force-pushed the pw/optimistic-final branch from 093fdde to 81dc825 Compare May 8, 2026 09:23

spalladino reviewed May 8, 2026


spalladino reviewed May 11, 2026


Base automatically changed from pw/optimistic-orchestrator to merge-train/spartan May 13, 2026 10:03

alexghr reviewed May 13, 2026


fcarreiro and others added 6 commits May 13, 2026 17:31

Merge branch 'next' into merge-train/spartan

4b52402

Merge branch 'next' into merge-train/spartan

ad97697

Merge branch 'merge-train/spartan' into pw/optimistic-final

d595768

Fix

ec141e8

PhilWindle requested review from LeilaWang, Thunkar, charlielye and nventuro as code owners May 13, 2026 17:39

PhilWindle added 2 commits May 13, 2026 17:45

Merge remote-tracking branch 'origin/merge-train/spartan' into pw/opt…

62bdebb

…imistic-final

Fix

7cc1d23

ludamad force-pushed the merge-train/spartan branch from 4af2626 to db4ec58 Compare May 16, 2026 19:07

PhilWindle closed this Jun 10, 2026

		// First, update the local tips store so the stream can track our state.
		await this.tipsStore.handleBlockStreamEvent(event);


PhilWindle commented May 6, 2026 (edited)


spalladino left a comment


alexghr left a comment



4 participants
