feat(prover-node): checkpoint-driven optimistic proving#23002
feat(prover-node): checkpoint-driven optimistic proving#23002PhilWindle wants to merge 11 commits into
Conversation
087e60b to
f181313
Compare
f181313 to
3694a9b
Compare
…pair Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy `EpochProver` / `ProvingOrchestrator` / `EpochProvingState` path untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers with the new `TopTreeOrchestrator`. Split out from #22990 so it can land independently. ## What's new - **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`): extends `ProvingOrchestrator`, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves a `SubTreeResult` instead of escalating to the checkpoint root — the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty challenges)` to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running. - **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven cancel from a genuine proving failure. - **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof. - **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller list, and a unified `deferredProving<S, T>(state, request, callback, isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is just `verifyState()` + `reject(reason)`. - **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans; top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to bridge the two states' differing `resolve` signatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ. - **`EpochProverFactory` interface on `ProverClient`**: new factory methods `createEpochProvingContext(epochNumber)`, `createCheckpointSubTreeOrchestrator(...)`, and `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is owned by `ProverClient` and shared across every orchestrator. ## What changes in existing code - `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`, `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree can override them; `provingState` and `provingPromise` likewise become `protected` so the sub-tree can hook the parent's failure stream onto `subTreeResult`. No public API change on `ProvingOrchestrator`. - `CheckpointProvingState`: gains two read-only accessors used by the sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and `getLastArchiveSiblingPath()`. No state changes. - `ProverClient` keeps `createEpochProver()` exactly as before (each call spawns its own `BrokerCircuitProverFacade`); the new factory methods share a `getFacade()` set up in `start()` and torn down in `stop()`. `EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`, the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`, and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the prover-node and e2e tests continue to build against the existing `EpochProver` API. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR. ## Test plan - [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`). - [x] `yarn build` clean against current merge-train/spartan (modulo the pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
da67473 to
7342efc
Compare
fd62ad7 to
093fdde
Compare
Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` +
`TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines
sub-trees against tx-gathering and the top-tree against the in-flight
sub-trees.
## What's new
### `prover-node` — `EpochProvingJob` job-model rewrite
`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>`
keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single
`CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs,
attestations, previous-block header, l1ToL2 messages, archive sibling path).
A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees
have started block-level proving.
Public API:
- `registerCheckpoint` — synchronous; sets up sub-tree, kicks off chonk-verifier
cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from
registered → block-proving.
- `removeCheckpoint(synchronous, idempotent)` — drops a single checkpoint by
`(number, slot)`, fire-and-forget cancels its sub-tree. Tolerates re-add of
the same checkpoint number under a different slot.
- `removeCheckpointsAfter`, `getCheckpointCount`, `getCheckpointNumbers`,
`cancelPendingCheckpoints`.
### `prover-node` — `L2BlockStream`-driven checkpoint pipeline
The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an
`L2BlockStream` rooted at the first block of the first unproven epoch. On
each `chain-checkpointed`:
1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Detached-task gather txs + register the checkpoint with the job.
On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job
whose first checkpoint sits at or above the threshold. Pending gather tasks
are cancelled via `AbortSignal`.
Finalization is driven by the union of three signals: epoch-monitor sees
the epoch close on L1, a checkpoint for a strictly later epoch arrives, or
all expected checkpoints (per archiver) are registered while the epoch is
already complete on L1.
### `prover-node` — reorg-after-finalization restart
When the L2BlockStream emits a prune that retroactively invalidates an
epoch already in finalize, the prover-node aborts the in-flight publish,
clears the job, and restarts proving from the new tip.
### e2e
- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the
pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`,
`epochs_multi_proof` updated to assert the new in-flight epoch behaviour.
## What's removed
`EpochProver` interface and `ServerEpochProver` are removed: the prover-node
no longer drives a single-class epoch prover, so the legacy API has no
production callers. `ProvingOrchestrator` survives only as a base class for
`CheckpointSubTreeOrchestrator` and as the single-class driver used by
`prover-client`'s integration tests; it no longer implements `EpochProver`.
## Test plan
- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 89 tests pass.
- e2e tests covering optimistic proving, reorgs during proving, failed
proof publish, and multi-checkpoint flows are included in this PR.
093fdde to
81dc825
Compare
spalladino
left a comment
There was a problem hiding this comment.
Got halfway through it, but looks good so far. Pasting here some observations by Claude that I didn't get to review myself:
-
Same-epoch replacement checkpoints are silently dropped after
completeEpoch().prover-node.ts:213-220skips any checkpoint for an epoch whose job hasisEpochComplete(). Sequence: (1) all checkpoints registered,completeEpoch()fires; (2) prune removes the suffix; (3) the L1 reorg replacement for the same epoch arrives. The replacement is dropped becauseexistingJob.isEpochComplete()is true. The finalize loop then restarts against an incomplete prefix of the epoch and eventually publishes a proof for a checkpoint range that does not match the canonical L1 chain. This contradicts the stated "orphan + replacement coexist briefly" model. The check should be conditioned on the surviving checkpoint set, not on the boolean flag. -
A prune that lands after
topTreeJob.start()but duringpublishProofis not interceptable.epoch-proving-job.ts:478clearsthis.topTreeJobimmediately after the prove resolves; the publish atepoch-proving-job.ts:487operates on the capturedsnapshotarray. MeanwhileremoveCheckpointsAfter(epoch-proving-job.ts:332-369) only acts on the livetopTreeJob— which is nowundefined— so the publish proceeds with stale data and the (out-of-date) attestations fromsnapshot.at(-1). Best case L1 rejects the proof (gas burn, retry confusion); worst case it lands against a window the canonical chain has since invalidated. Either re-validate the snapshot's checkpoint set (by id) against the live registry just before submitting, or hold the top-tree-job reference until publish completes and route prune through it. -
EpochMonitor permanently records "processed" on a no-op
tryCompleteEpoch.prover-node.ts:145-153returnstruefromhandleEpochReadyToProvewhenevertryCompleteEpochdoesn't throw — even if it bailed atprover-node.ts:394(archiver says incomplete) orprover-node.ts:399-405(not all checkpoints known to the job).epoch-monitor.ts:93then setslatestEpochNumber = epochToProve, andepoch-monitor.ts:76-79ensures the same epoch is never re-considered. In a healthy network the next checkpoint event for a later epoch reseeds via the tertiary signal, but if the epoch in question is the last produced (network stall, end-of-test) it stays stuck forever.tryCompleteEpochshould return whethercompleteEpoch()was actually invoked, andhandleEpochReadyToProveshould propagate that.
| // First, update the local tips store so the stream can track our state. | ||
| await this.tipsStore.handleBlockStreamEvent(event); |
There was a problem hiding this comment.
Note that by updating the store first, any error in the handle methods below will not be retried in the next iteration of the L2BlockStream. Not sure if this is intentional.
There was a problem hiding this comment.
Yes it should go at the end I think.
| // (e.g. an early proof submission landed for an earlier epoch that ends mid-way | ||
| // through this epoch's slot range) does not prove the entire epoch. We need every | ||
| // checkpoint of a partially-proven epoch to feed the orchestrator. | ||
| if (await this.isEpochFullyProven(epochNumber, l1Constants)) { |
There was a problem hiding this comment.
We should check in the archiver and blockstream what happens first: do checkpoints get emitted, or does the proven block number get updated? I'm worried that, when starting a long sync after offline for a while, all the checkpoints are emitted first and then the proven tip is updated, so the prover node would trigger a ton of jobs only to cancel them immediately afterwards.
There was a problem hiding this comment.
Ah just saw the compute starting block number for the blockstream. And as long as the prover node waits for the archiver initial sync to be completed before starting, we should be good I guess.
| * to the job for finalization. If checkpoints are still being delivered by the | ||
| * L2BlockStream, the handoff happens later when the last one arrives. | ||
| */ | ||
| async handleEpochReadyToProve(epochNumber: EpochNumber): Promise<boolean> { |
There was a problem hiding this comment.
I think we can now replace the entire epoch monitor with a running promise that just calls l2BlockSource.isEpochComplete periodically.
| // Add to an existing job, or create a new one if allowed. | ||
| let job = existingJob; | ||
| if (!job) { | ||
| const { proverNodeMaxPendingJobs: maxPendingJobs } = this.config; | ||
| if (maxPendingJobs > 0 && this.jobs.size >= maxPendingJobs) { | ||
| this.log.debug( | ||
| `Skipping checkpoint ${checkpoint.number} for epoch ${epochNumber}: max pending jobs ${maxPendingJobs} reached`, | ||
| ); | ||
| return; | ||
| } | ||
| job = await this.createEpochJob(epochNumber); | ||
| } |
There was a problem hiding this comment.
Won't this create a hole in a given epoch? I don't see another code path where we patch an epoch job with checkpoint jobs we failed to add due to lack of capacity. I'm worried about a scenario where we don't add a checkpoint to an epoch, but then a prior epoch finishes, so we do add the next, and that epoch job ends up waiting for that missing checkpoint forever.
Speaking of which: is there anything that cleans up very old jobs when the epoch proof submission window closes?
There was a problem hiding this comment.
Yes, was going to ask about this. As it stands here there could be problems if this is set. Do you think this is needed or makes sense anymore? Proving is more of a continuous process rather than a set of large jobs.
|
|
||
| const taskId = `${checkpoint.number} - ${checkpoint.slot}`; | ||
| const task = this.gatherAndAddCheckpoint(job, checkpoint, epochNumber, abortSignal); | ||
| this.pendingGatherTasks.set(taskId, task); |
There was a problem hiding this comment.
Maybe it makes sense to move the responsibility of collecting their data to the checkpoint jobs, to keep the prover node leaner?
There was a problem hiding this comment.
I think it does. As a further refactor, I want to break the checkpoint proving out of EpochProvingJob. Then we maintain a collection of checkpoint jobs being proven. EpochProvingJob then becomes a collection of references to a subset of those checkpoints and a top tree. This would enable us to create instances of EpochProvingJob with arbitrary checkpoints (of a valid range of course) enabling efficient partial epoch proving.
There was a problem hiding this comment.
Looks comprehensive. I'd add tests for cross-chain messages as well, those are usually a source of problems.
| const wallClockEpoch = Number(test.epochCache.getEpochAndSlotNow().epoch); | ||
| const job = proverNode.epochJobs.get(wallClockEpoch); | ||
| if (job && job.getCheckpointCount() > 0) { | ||
| observed.add(wallClockEpoch); | ||
| } |
There was a problem hiding this comment.
Heads up this should pass even without optimistic proving: prover would start as soon as the last checkpoint of the epoch was pushed to L1, which could happen a few L1 slots before the actual end of the epoch. Maybe we should offset this a bit so the check is more meaningful, or check for specific checkpoint jobs vs all jobs in the epoch?
| // Wait for the 2nd checkpoint within this epoch. | ||
| const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber; | ||
| const midCheckpoint = CheckpointNumber(initialCheckpoint + 2); | ||
| await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6); | ||
| const checkpointBeforeReorg = test.monitor.checkpointNumber; | ||
| logger.info(`Reached checkpoint ${checkpointBeforeReorg}`); |
There was a problem hiding this comment.
I'd also add an assertion that the prover actually started the job for the given checkpoint before reorging it out. Otherwise, this test could pass if the prover just takes a bit more time to start assembling the checkpoint job.
| // Wait for 2 checkpoints mid-epoch. | ||
| const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber; | ||
| const midCheckpoint = CheckpointNumber(initialCheckpoint + 2); | ||
| await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6); | ||
| const checkpointBeforeReorg = test.monitor.checkpointNumber; | ||
| logger.info(`Reached checkpoint ${checkpointBeforeReorg}`); |
There was a problem hiding this comment.
This should also assert that both checkpoints landed on the same epoch, which could not be the case if this test started near the end of an epoch.
There was a problem hiding this comment.
We should also add txs. And it'd be interesting to see a case where one of the reorgs means that a tx that was mined in a checkpoint is now mined in a different one.
| const proverNode = test.proverNodes[0].getProverNode() as TestProverNode; | ||
| const proverManager = proverNode.getProver(); | ||
| const origCreateTopTree = proverManager.createTopTreeOrchestrator.bind(proverManager); | ||
| let releaseProvingGate: () => void = () => {}; | ||
| const provingGate = new Promise<void>(resolve => { | ||
| releaseProvingGate = resolve; | ||
| }); | ||
| proverManager.createTopTreeOrchestrator = () => { | ||
| const topTree = origCreateTopTree(); | ||
| const origProve = topTree.prove.bind(topTree); | ||
| topTree.prove = async (...args: Parameters<typeof origProve>) => { | ||
| logger.warn('Top-tree proving gated — waiting for test to release'); | ||
| await provingGate; | ||
| logger.warn('Proving gate released'); | ||
| return origProve(...args); | ||
| }; | ||
| return topTree; | ||
| }; |
There was a problem hiding this comment.
I'm wondering if there's a cleaner way of doing this, perhaps reusing the EpochProvingJobHooks? Not sure if worth stressing over it though.
| if (snapshot.length === 0) { | ||
| throw new Error(`Cannot finalize epoch ${this.epochNumber}: no surviving checkpoints`); | ||
| } |
There was a problem hiding this comment.
Is there a race condition where a reorg removes all checkpoints from an epoch and then new ones are immediately added again, but we hit this case before the new ones get added? Seems like something that would never happen in prod, but could affect e2e tests that have short epochs.
There was a problem hiding this comment.
There shouldn't be. The prover node does this in it's prune handler.
if (job.getCheckpointCount() === 0) {
this.log.info(`Cancelling epoch ${epochNum} job — all checkpoints pruned`);
await this.cancelAndCleanupJob(epochNum, job);
}
So that epoch job instance is removed and replaced with another when a new checkpoint arrives for the same epoch number.
There was a problem hiding this comment.
I will add a test though
| // Capture proving data *before* we clear the live job map, so a failure-upload | ||
| // attempt that runs after teardown still has the per-checkpoint txs / messages / | ||
| // previous-block-header to work with. | ||
| if (!this.provingDataSnapshot) { | ||
| try { | ||
| this.provingDataSnapshot = this.buildProvingDataSnapshot(); | ||
| } catch { | ||
| // No completed checkpoints — failure-upload will have nothing to upload, which | ||
| // is fine. Snapshot stays undefined and getProvingData will rebuild and rethrow. | ||
| } |
There was a problem hiding this comment.
How about we move the responsibility of uploading this data to the epoch-proving-job itself, so we don't need to collect this just for the prover-node to maybe read if it's configured for uploads?
|
|
||
| /** Stable identifier: `${checkpoint number}:${slot}`. Used as the parent's map key. */ | ||
| public static idFor(checkpoint: Checkpoint): string { | ||
| return `${checkpoint.number}:${checkpoint.header.slotNumber}`; |
There was a problem hiding this comment.
I don't think it makes a difference based on how reorgs are handled, but I'd throw in the checkpoint archive as part of the identifier, just in case
| const publicTxs = allTxs.filter(tx => tx?.data.forPublic); | ||
| if (publicTxs.length > 0) { | ||
| this.deps.log.verbose( | ||
| `Kicking off ${publicTxs.length} chonk-verifier circuits for checkpoint ${this.checkpoint.number}`, | ||
| { | ||
| checkpointNumber: this.checkpoint.number, | ||
| publicTxCount: publicTxs.length, | ||
| }, | ||
| ); | ||
| await this.subTree.startChonkVerifierCircuits(publicTxs); |
There was a problem hiding this comment.
I know this didn't change with this PR, but for private-only txs, who verifies the chonk proof?
There was a problem hiding this comment.
The private base rollup. Public txs have a different process. They require a dedicated chonk verifier circuit.
| public start(): Promise<TopTreeProof> { | ||
| void this.run(); |
There was a problem hiding this comment.
I take it it's responsibility of the caller to check that all dependencies are ready, right?
There was a problem hiding this comment.
Which dependencies are you referring to? At the point this class is constructed, it's arguments contain all information required to start the proving process (e.g. compute blob challenges). If the checkpoint root input proofs aren't ready yet that is fine, the orchestrator fires the appropriate events as and when they are.
There was a problem hiding this comment.
I think this is a great idea! I'd consider extending it to Base/Root parity circuits too because they could benefit from the same caching behaviour across reorged checkpoints
…al pipeline (#23245) ## Summary - Removes `FastTxCollection` as a separate class and absorbs all its logic directly into `TxCollection` - Replaces the old parallel file-store delay with a single sequential pipeline: node RPC → reqresp → file store, where each phase blocks on the previous (cancellation-aware) - File store collection is now driven by `IRequestTracker` — the same synchronization primitive used by node and reqresp paths. The tracker is the single source of truth for "is this tx still missing?" and "is this request still alive?" - `FileStoreTxCollection` simplified: dropped `start()`/`stop()`/persistent worker pool/`wakeSignal`. `startCollecting(requestTracker, context)` returns `Promise<void>`, spins up its own per-call worker pool, and workers self-terminate when the tracker is cancelled (all-fetched / deadline / external) ## Collection flow inside `collectFast` 1. Start node RPC collection in the background 2. Wait `txCollectionFastNodesTimeoutBeforeReqRespMs` — interruptible by cancellation **or by node exhaustion** (so when no nodes are configured, reqresp starts immediately) 3. Start reqresp in the background (parallel with nodes) 4. Wait `txCollectionFileStoreFastDelayMs` — interruptible by cancellation or reqresp completion 5. Start file store collection in the background (its workers self-terminate) 6. `Promise.allSettled` on node + reqresp + file store `txCollectionFileStoreFastDelayMs` description updated to reflect it is now anchored to reqresp start, not collection start. ## File store / tracker integration - `FileStoreTxCollection.startCollecting` no longer takes `(txHashes, context, deadline)`; it takes `(requestTracker, context)` and reads the missing txs + deadline from the tracker - Workers check `requestTracker.isMissing(hash)` each scan — if the tx was found via another path (node/reqresp/gossipsub), the entry is dropped without an extra fetch - Workers race their backoff sleeps against `requestTracker.cancellationToken` — cancelling a request (deadline, `stopCollectingForBlocksUpTo/After`, or `stop()`) propagates to file store workers immediately - Removed `foundTxs`/`clearPending` plumbing on `FileStoreTxCollection` — the tracker handles both implicitly - `startCollecting` yields once after building its entry set, so a synchronous follow-up call (e.g. `markFetched` in tests, or the gossipsub-found path in production) lands before workers begin scanning ## Tests - `tx_collection.test.ts`: collapsed the `TestFastTxCollection` subclass; all accesses go directly through `TxCollection`. Added "starts reqresp immediately when no nodes are configured" covering the node-exhaustion shortcut - `file_store_tx_collection.test.ts`: rewritten for the new shape — no `start()`/`stop()`, lifecycle driven by the tracker (cancel to terminate workers). New "workers exit when tracker is cancelled" covers the per-call worker-pool teardown Closes https://linear.app/aztec-labs/issue/A-933/tx-collection-dont-retrieve-transactions-that-have-already-been via new synchronization with the request tracker.
…ims (#23165) ## Context `SequencerPublisher` simulates each enqueued L1 action individually at enqueue time, then sends them bundled through Multicall3. The `propose` checkpoint action is validated at enqueue and send time (the latter via a `preCheck` mechanism), but in isolation and relying on overrides. There is no simulation of the multicall payload before sending it, so a reverting tx is most likely not caught. This refactor: - Replaces the per-request `preCheck` mechanism with a **single bundle-level `eth_simulateV1`** of the assembled `aggregate3` payload, run right before send. If any entry reverts in sim it is dropped from the bundle, the reduced bundle is re-simulated to get an honest `gasUsed`, and the survivors are sent. Extracted to a `SequencerBundleSimulator`. - Drops the entire propose simulate at enqueue (`simulateProposeTx`, `validateCheckpointForSubmission`). The bundle simulate covers it. - Adds a new pre-broadcast `validateBlockHeader` call (calling `validateHeaderWithAttestations` with empty attestations + `ignoreSignatures: true`) that catches header-level bugs before we gossip the proposal to peers. Emits a new `header-validation-failed` event on failure. - Drops every per-action simulate at enqueue (governance signal **and** slashing votes/executes). Bundle simulate at send time is the single decision point for every per-action revert. `simulateAndEnqueueRequest` is deleted. We were enqueuing votes even if the simulation failed, after all. - Rewrites `sendRequestsAt` so it takes an L2 `SlotNumber`, derives the timestamp for the start of that slot, and sleeps until one L1 slot before that boundary, so we can land on the first L1 slot of the target L2 slot. - Centralises `SimulationOverridesPlan` construction into a single `buildCheckpointSimulationOverridesPlan` helper. The plan **always** pins both `pending` and `proven` chain tips (to the pipelined parent / invalidation target, or to the current snapshot when neither applies), so `STFLib.canPruneAtTime` cannot reintroduce a phantom prune during simulation. - Makes `SimulationOverridesBuilder.merge` undefined-safe: explicit `undefined` fields in an incoming plan no longer erase previously-set values. `withPendingTempCheckpointLogFields` now accepts a partial subset of fields. - Moves the payload-empty cache onto `GovernanceProposerContract` next to its concern. Only `isPayloadEmpty=false` is cached (a CREATE2 redeploy could go empty → populated). - Drops the old Multicall3 revert-recovery and per-request-resim machinery, since with `allowFailure: true` the top-level multicall is expected to land successfully. `Multicall3.forward` now throws `MulticallForwarderRevertedError` if the receipt reports a reverted status; the publisher does **not** rotate to a new publisher on that error (on-chain failure, not a send failure). Adds `Multicall3.hasCode` helper and a `simulateAggregate3` entrypoint used by the bundle simulator. - `L1TxUtils.sendTransaction` fails fast if `txTimeoutAt` has already elapsed when called. `SequencerPublisher.forwardWithPublisherRotation` re-checks the deadline at the head of each rotation iteration so it doesn't keep cycling through publishers after the L2 slot's submission window has closed. - Sequencer escape-hatch (`voteInSlotWithoutSyncing`) and full-escape-hatch (`voteOnSlotWithEscapeHatch`) vote-only paths now submit via `sendRequestsAt(slot)` rather than `sendRequests()`, so the bundle-simulate `block.timestamp` override matches the slot the EIP-712 vote signatures were generated for. The intended outcome is a publisher with one explicit re-validation point (the bundle simulate), measurable bundle gas (from the bundle simulate's `gasUsed`), and dead/duplicated state-override plumbing removed. ## Resulting simulations after this refactor The full list of simulation / gas-estimation steps that remain in a pipelined proposer slot, in execution order. ### Pre-build, in `Sequencer.doWork` 1. **`publisher.canProposeAt`** — rollup view call simulated with the centralised override plan. Cheap pre-check gate before any block-build work. 2. **`publisher.simulateInvalidateCheckpoint`** (conditional) — runs **only** if `syncedTo.pendingChainValidationStatus.valid === false` AND `!syncedTo.hasProposedCheckpoint`. Simulates the invalidate call against the rollup. Result becomes the `invalidateCheckpoint` package passed into `CheckpointProposalJob`. The previous code called this even when there's a proposed parent and discarded the result; this refactor adds the `!hasProposedCheckpoint` gate so we skip the wasted RPC. ### Per-slot, in `CheckpointProposalJob.proposeCheckpoint` 3. **CheckpointVoter votes** — `CheckpointVoter.enqueueVotes()` runs at the top of `execute()`, returning two promises that are awaited in parallel with block-build. It enqueues two kinds of votes via the publisher, **neither of which simulates at enqueue time** after this refactor: - **`enqueueGovernanceCastSignal`** — does an `isPayloadEmpty` pre-flight check (now on `GovernanceProposerContract`), then enqueues. No `eth_simulateV1`. - **`enqueueSlashingActions`** (one call per slashing action, type `vote-offenses` or `execute-slash`) — builds the request and enqueues. No `eth_simulateV1`. Real reverts on any of these are caught by the bundle simulate at send time, which drops the failing entry and proceeds with the survivors. 4. **`publisher.validateBlockHeader` (NEW: pre-broadcast)** — replaces the old `simulateProposeTx`-at-enqueue. Calls `validateHeaderWithAttestations` with empty attestations and `ignoreSignatures: true` so the rollup runs the header checks (archive match, slot match, timestamp, mana-min-fee, …) without needing real attestations. Runs **before** we gossip the proposal to peers. If it fails, abort the slot — log an error, emit `header-validation-failed`, don't broadcast, don't enqueue. 5. **`prepareProposeTx → validateBlobs estimateGas`** — kept as the blob-commitment **consistency check** (detects locally-built commitments not matching the blob sidecars). Returns `blobEvaluationGas`, which we stash on the propose `RequestWithExpiry` for use by the bundle gasLimit later. The simulate-step that previously paired with this (`simulateProposeTx`) is removed. ### Background pipeline, in `waitForAttestationsAndEnqueueSubmissionAsync` 6. **`publisher.simulateInvalidateCheckpoint` (conditional)** — runs **only** in the fallback path where attestation collection failed AND the pending chain turned out to be invalid. Triggered from `CheckpointProposalJob.enqueueInvalidation`. This is the second, late trigger for invalidation simulation — distinct from step 2's pre-build trigger. ### Send time, in `sendRequestsAt(targetSlot)` 7. **Bundle simulate (NEW)** — single `eth_simulateV1` of the assembled `aggregate3` payload, with `block.timestamp` overridden to the start of `targetSlot`, and state overrides = `[disableBlobCheck]` iff `propose` is in the bundle and `[]` otherwise. Per-entry result decoded from the returned `Result[]`. This is the **only** post-pipeline-sleep re-validation; it replaces the per-request `preCheck` mechanism entirely. 8. **Bundle re-simulate (NEW, conditional)** — runs **only** when step 7 dropped at least one entry. Re-runs the bundle simulate on the reduced payload to get an honest `gasUsed`, and applies the same per-entry decode so additional drops are caught. If the re-simulate falls back (node doesn't support `eth_simulateV1`), the publisher sends the **first-pass survivors only** with `MAX_L1_TX_LIMIT`; the entries that the first pass already proved would revert stay dropped and are reported as failed actions. ### Post-send No diagnostic-only simulate paths remain. `Multicall3.forward` throws `MulticallForwarderRevertedError` on a reverted receipt and re-throws on a send error; per-request revert resimulation has been removed. ## Known caveats - **`sendRequestsAt` early lead**: sleeps until `startOfTargetSlot - ethereumSlotDuration` to maximise inclusion in the first L1 block of the L2 slot. There is a known correctness risk: a tx mined in the L1 block immediately preceding the L2-slot boundary would revert via `ProposeLib.validateHeader`'s `slot == block.timestamp.slotFromTimestamp()` check. In practice the prior L1 block is usually already committed before this send wakes; if observed to be unreliable in production, tune the lead down, especially on tests. - **`validateBlockHeader` pre-broadcast coverage**: covers the `validateHeader` checks (archive, slot/timestamp, mana-min-fee, …) and the empty-attestation path of `validateHeaderWithAttestations`, but does NOT cover proposer-signature verification, inbox consumption (`Rollup__InvalidInHash`), or `header.inHash` match. Those still execute inside the full `propose` and are caught by the bundle simulate at send time. The cost of a rare miss is one wasted broadcast. - **Top-level `aggregate3` revert diagnostics removed**: the previous `Multicall3.forward` code decoded receipt-reverted reasons via `tryGetErrorFromRevertedTx` and did a per-request resim on send-throw. Both paths are gone. With `allowFailure: true` and `Multicall3.hasCode` covering the no-bytecode case, a reverted forwarder receipt is genuinely unexpected (OOG, forwarder bug). The throw of `MulticallForwarderRevertedError` is the only diagnostic surface — operators will need the transaction hash from the log to investigate.
4af2626 to
db4ec58
Compare
tryCompleteEpoch now returns whether the epoch is in a handled state (handed off now, previously handed off, or no job for an already- finalized epoch). EpochMonitor only advances latestEpochNumber on a true return, so the monitor retries while the archiver is still catching up or expected checkpoints are not yet registered. Reorders the archiver isEpochComplete check before the job lookup so the monitor keeps firing until L1 confirms completion; only then does the "no job" case short-circuit as nothing-to-do. A-1051.
Adds the scenario flagged in the PR #23002 review: a message-bearing tx that is reorged out of its checkpoint and remined into a fresh one must still prove. The test sends an L2->L1 message tx, reorgs out its checkpoint, waits for the node to detect the prune and the tx to be remined, advances the epoch to proven, then consumes the message from the L1 outbox — which only succeeds if the message's out-hash was correctly carried through the remine into a valid epoch proof (A-1039).
Addresses the remaining half of the PR #23002 review comment ("add txs ... a tx that was mined in a checkpoint is now mined in a different one") directly in epochs_optimistic_proving.parallel.test.ts: sends a real tx, reorgs out the checkpoint containing it, waits for the node to detect the prune and the tx to be remined into a fresh checkpoint, then proves that checkpoint on L1.
The session/top-tree hooks (beforeTopTreeProve/afterTopTreeProve/topTreeProveOverride) existed on the types but had no injection path — dead code. Wire a setSessionHooks path through ProverNode -> SessionManager so hooks apply to every session constructed after the call. Use it to replace the monkey-patching of createTopTreeOrchestrator in two e2e tests (addresses the PR #23002 review comment suggesting hook reuse): - epochs_optimistic_proving 'reorg during proving' gates via beforeTopTreeProve. - epochs_upload_failed_proof forces a failure via topTreeProveOverride.
Addresses the PR #23002 review question about a reorg removing all of an epoch's checkpoints and new ones being added immediately after. Reconcile drops the session with no error when canonical content goes empty, and reopens a fresh session for the same epoch once a checkpoint is re-added — this test asserts that full cycle.
…chive root Switches the CheckpointProver identity from (number, slot, prevArchiveRoot) to (number, slot, checkpoint.archive.root) — the checkpoint's own post-state archive, as suggested in PR #23002 review. prevArchiveRoot only distinguished reorg branches that diverged before the checkpoint; it could not tell apart two checkpoints built on the same predecessor at the same slot with different content (e.g. a reorg replacement with a different tx set), which would have wrongly reused the stale prover. Keying on the post-state archive root closes that: identical content collapses to one prover (reusing its in-flight sub-tree), any difference in history or content yields a distinct prover. prevArchiveRoot was only ever used to build the id, so it's removed from the prover/ register-data entirely.
Drives the prover-node onto the split
CheckpointSubTreeOrchestrator+TopTreeOrchestratorpair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.Stacked on #22996 (orchestrator split). Diff above is just the prover-node + e2e + cleanup delta.
What's new
prover-node—EpochProvingJobjob-model rewriteEpochProvingJobbecomes an orchestrator over aMap<string, CheckpointJob>keyed by${number}:${slot}. EachCheckpointJobowns a singleCheckpointSubTreeOrchestratorwith its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). ATopTreeJobdrives the epoch root rollup once all checkpoint sub-trees have started block-level proving.Public API:
registerCheckpoint— synchronous; sets up sub-tree, kicks off chonk-verifier cache fill, attaches theblockProofspromise to the eventual top-tree job.provideTxs— supplies simulated txs, transitions the checkpoint job from registered → block-proving.removeCheckpointsAfter(threshold)— sole removal API. Suffix-only: cancels every job whose checkpoint number sits strictly above the threshold, atomically. Suffix-only by design — middle-removal would leave a non-contiguous survivor set that cannot be proved.getCheckpointCount,getCheckpointNumbers.TopTreeJobconstructor defensively asserts that its snapshot's checkpoint numbers form a contiguous run (prev + 1 === curr). With suffix-only removal this is structurally guaranteed; the assertion is a runtime tripwire against future regressions.prover-node—L2BlockStream-driven checkpoint pipelineThe prover-node consumes
chain-checkpointed/chain-prunedevents from anL2BlockStreamrooted at the first block of the first unproven epoch. On eachchain-checkpointed:getEpochAtSlot.EpochProvingJob.tryCompleteEpochfor both older epochs (tertiary signal) and the current epoch (covers the case where the EpochMonitor already fired and we were waiting on this last checkpoint to arrive).On
chain-pruned: callremoveCheckpointsAfter(threshold)on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled viaAbortSignal.Finalization is driven by the union of three signals — EpochMonitor (epoch closes on L1), tertiary (a checkpoint for a strictly later epoch arrives), or final-checkpoint registration on an already-monitored epoch. Each signal calls
tryCompleteEpoch, which queriesl2BlockSource.isEpochCompletedirectly: the archiver is the source of truth, so the prover-node carries no in-memory cache of "epochs complete on L1".prover-node— reorg-after-finalization restartWhen the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.
e2e
epochs_optimistic_proving.parallel.test.ts: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.epochs_proof_fails,epochs_upload_failed_proof,epochs_long_proving_time,epochs_multi_proofupdated to assert the new in-flight epoch behaviour.Logging
Structured-context logs added across the
CheckpointJob/TopTreeJob/EpochProvingJoblifecycles: register/provide-txs/sub-tree-created/per-block-progress/blockProofs-ready/cancel/teardown for checkpoints, and create/cancel/prove-start/prove-success for top-tree. Every log entry indexes onepochNumber/checkpointNumberper repo convention.What's removed
EpochProverinterface andServerEpochProverare removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers.ProvingOrchestratorsurvives only as a base class forCheckpointSubTreeOrchestratorand as the single-class driver used byprover-client's integration tests; it no longer implementsEpochProver.EpochProvingJob.removeCheckpoint(checkpointNumber)andcancelPendingCheckpoints()are also removed: middle-removal could leave a non-contiguous survivor set that theTopTreeJobconstructor would reject. Suffix-onlyremoveCheckpointsAfteris the only checkpoint-removal API now.Test plan
yarn workspace @aztec/prover-client test— 261 tests pass.yarn workspace @aztec/prover-node test— 116 tests pass (incl. new dedicatedtop-tree-job.test.tsandcheckpoint-job.test.ts).