feat: merge-train/spartan-v5#23965

Merged
PhilWindle merged 7 commits into v5-next from merge-train/spartan-v5 on Jun 9, 2026

@AztecBot AztecBot commented Jun 9, 2026

BEGIN_COMMIT_OVERRIDE
docs(stdlib): clarify checkpoint capacity ceiling is the provable max (#23952)
test: always capture local network logs for compose tests (#23912)
fix: pin getAttesters reads to a single L1 block (A-819) (#23920)
refactor(prover-node): CheckpointStore + SessionManager redesign (#23552)
END_COMMIT_OVERRIDE

AztecBot and others added 3 commits June 9, 2026 14:02
Always print the local-network service logs at the end of normal compose
e2e tests, regardless of whether the test passed or failed.

Changes:
- Keep local-network output in /logs/local-network.log via the compose
service tee.
- Print those logs after the end-to-end test process exits for every
local-network compose run.
- Keep /logs in a per-project Docker named volume (local-network-logs)
and remove it on teardown via REMOVE_COMPOSE_VOLUMES=1, so logs don't
persist across runs. No host bind mount.
- Remove the private_transfer.sh-specific LOCAL_NETWORK_LOG_LEVEL
override and run the compose flows at LOG_LEVEL=verbose so the captured
node logs are detailed enough for diagnostics.
- Add a short flush window in run_compose_test after the container
exits, so the trailing local-network logs aren't truncated by docker
compose down.
- Leave the proposed-chain mint behavior unchanged; there is no
checkpoint wait in the mint helper.

Example run: http://ci.aztec-labs.com/0245ece3810f8624
@spalladino spalladino requested a review from charlielye as a code owner June 9, 2026 17:00
PhilWindle and others added 3 commits June 9, 2026 14:42
Fixes A-819 (Audit #164).

## Problem

`RollupContract.getAttesters`
(`yarn-project/ethereum/src/contracts/rollup.ts`) made several
sequential RPC reads with no pinned block:

- `getActiveAttesterCount()` (read at `latest`)
- N chunked `getAttestersFromIndicesAtTime(...)` reads, one per 1000
indices (each read at `latest`)

The `ts` timestamp argument was already captured once and reused
consistently across chunks, so the literal "stale timestamp across
chunks" framing of the title doesn't occur. The real defect is that the
**reads are not pinned to a single L1 block**: across a block boundary
or reorg, the count and the individual chunk reads can observe different
attester sets, yielding an inconsistent or truncated result. This only
bites for attester sets larger than the 1000-entry chunk size, read
precisely across a set-changing block — hence low impact, but real.

## Fix

Fetch the current block once in `getAttesters`, then thread its `number`
as a `blockNumber` option through `getActiveAttesterCount` and every
chunked `getAttestersFromIndicesAtTime` read so they all evaluate
against the same L1 block. This follows the existing
`checkBlockTag(options?.blockNumber, ...)` pattern already used by many
reads in `rollup.ts` (e.g. `getCheckpointNumber`, `status`,
`canPruneAtTime`).

- `getActiveAttesterCount` and
`GSEContract.getAttestersFromIndicesAtTime` now accept an optional `{
blockNumber }`.
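
The pinning pattern described above can be sketched as follows. This is a minimal illustration, not the code from `rollup.ts`: the `AttesterReader` interface, `getAttestersPinned`, and the `FakeChain` test double are hypothetical names, and block numbers are plain `number`s here for brevity.

```typescript
// Hypothetical read options, loosely modelled on the `{ blockNumber }` option in the PR text.
type ReadOptions = { blockNumber?: number };

// Hypothetical minimal reader; the real RollupContract / GSEContract APIs differ.
interface AttesterReader {
  getBlockNumber(): Promise<number>;
  getActiveAttesterCount(options?: ReadOptions): Promise<number>;
  getAttestersFromIndicesAtTime(ts: number, indices: number[], options?: ReadOptions): Promise<string[]>;
}

const CHUNK_SIZE = 1000;

async function getAttestersPinned(reader: AttesterReader, ts: number): Promise<string[]> {
  // Capture the block once; every later read is evaluated against it.
  const blockNumber = await reader.getBlockNumber();
  const options: ReadOptions = { blockNumber };

  const count = await reader.getActiveAttesterCount(options);
  const attesters: string[] = [];
  for (let start = 0; start < count; start += CHUNK_SIZE) {
    const end = Math.min(start + CHUNK_SIZE, count);
    const indices = Array.from({ length: end - start }, (_, i) => start + i);
    const chunk = await reader.getAttestersFromIndicesAtTime(ts, indices, options);
    attesters.push(...chunk);
  }
  return attesters;
}

// In-memory fake: the attester set differs per block, and the head advances on every read,
// which is the drift that unpinned reads would observe.
class FakeChain implements AttesterReader {
  head = 5;
  readBlocks: number[] = [];

  private setAt(block: number): string[] {
    return Array.from({ length: 1500 + block }, (_, i) => `attester-${i}`);
  }

  private resolve(options?: ReadOptions): number {
    const block = options?.blockNumber ?? this.head;
    this.readBlocks.push(block);
    this.head += 1; // a new L1 block lands between reads
    return block;
  }

  async getBlockNumber(): Promise<number> {
    return this.head;
  }

  async getActiveAttesterCount(options?: ReadOptions): Promise<number> {
    return this.setAt(this.resolve(options)).length;
  }

  async getAttestersFromIndicesAtTime(_ts: number, indices: number[], options?: ReadOptions): Promise<string[]> {
    const set = this.setAt(this.resolve(options));
    return indices.map(i => set[i]);
  }
}
```

With the fake above, every read made by `getAttestersPinned` is recorded against the block captured at the start, even though the simulated head keeps moving. Dropping the `options` argument from any one call would make that read fall back to the moving head, which is the inconsistency the fix removes.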

## Testing

Verified the full TypeScript build passes. No automated test added:
reproducing the block-drift race deterministically would require anvil
plus a hook to advance an L1 block between the count read and the chunk
reads (or deep viem-client mocking), which isn't justified for this
low-impact, pattern-following change. The block-pinning behavior mirrors
other pinned reads in the same file.

## Summary

Replaces the monolithic `EpochProvingJob` with a content-keyed `CheckpointStore`,
a long-running `SessionManager` that owns ephemeral `EpochSession`s, and a new
`ProofPublishingService` that centralises L1 submission. `ProverNode` becomes a
thin event translator: each L2BlockStream event is applied to the store / chonk
cache, then dispatched to the session manager and publishing service via single
method calls.

The redesign closes a class of optimistic-proving bugs that the old sticky
`epochComplete` flag and per-session publish path made structurally hard to fix,
and lays the groundwork for re-using sub-tree work across epochs.

See `yarn-project/prover-node/README.md` for architecture diagrams, state
machines and event-flow sequences.

### Architectural changes

- **`CheckpointProver`** is content-addressed by `(number, slot,
archiveRoot)`.
A prune followed by a re-add of the same content (e.g. brief L1 reorg)
reuses
the in-flight sub-tree work — no replay. The `CheckpointProver` starts
its own
tx gather + sub-tree pipeline in its constructor; there's no
`provideTxs` API.
- **`CheckpointStore`** owns the registry, the `SlotWatcher` (a
`RunningPromise`
reaping pruned-past-slot `CheckpointProver`s), and `reapExpired` (drops
canonical
`CheckpointProver`s once their epoch's proof-submission window has
closed, so the
  proof can no longer be accepted on L1).
- **`EpochSession`** spec is slot-based: `[firstSlotOfEpoch(N),
toSlot]`. Every
session — full or partial — starts at the epoch's first slot because the
L1
rollup requires every proof to extend from the previous proven tip. The
  session does three things: run a `TopTreeJob`, hand the proof to
`ProofPublishingService` as a `PublishCandidate`, translate the outcome
into a
terminal state. Predecessor gating, same-epoch dedup, deadline
enforcement,
  and the L1 tx are all the service's concern.
- **`ProofPublishingService`** (new) is the single owner of L1 submission. It
  serialises one publish at a time against a freshly created publisher. Each
  candidate's `deadline` arms a `setTimeout` that resolves `'expired'` if it
  fires before publishing starts, and persistent `publisherFactory.create()`
  failures are retried on a 1s backoff (capped by the deadline). Once an L1
  publish starts it runs to completion; `withdraw` is queue-only.
- **`SessionManager`** owns the `fullSessions` / `partialSessions` maps,
the
reconcile loop, **and** the periodic tick. Reconcile is uniform across
kinds:
any session whose canonical content shifts is cancelled and recreated
with the
same spec but new content. The tick high-water mark advances only after
a
session actually exists for the epoch, so transient blockers
(max-pending-jobs
reached, archiver still indexing) leave the mark in place and the next
tick
  retries.
- **`ChonkCache`** moved from per-epoch to a single prover-node-wide
cache in
`prover-client/orchestrator`, keyed by tx hash. Entries are released by
the
per-event expiry sweep (`releaseForBlocks`) once an epoch's
proof-submission
window has closed — there's no longer any proof to produce for those
txs.
- Reconcile and publishing-service drain each run on their own
`SerialQueue` from `@aztec/foundation/queue` so concurrent events can't
  interleave on an `await` and race.
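
To illustrate the content-addressing idea from the list above, here is a minimal sketch of a registry keyed by `(number, slot, archiveRoot)`. `ContentKeyedStore`, `ProverStub`, and `reapPrunedBefore` are hypothetical names for illustration only; the real `CheckpointStore` and `CheckpointProver` do considerably more (tx gathering, sub-tree proving, the `SlotWatcher`, expiry by proof-submission window).

```typescript
// Hypothetical content tuple, taken from the "(number, slot, archiveRoot)" description.
type CheckpointContent = { number: number; slot: number; archiveRoot: string };

// Stand-in for a prover that would own in-flight sub-tree work.
class ProverStub {
  content: CheckpointContent;
  pruned = false;
  constructor(content: CheckpointContent) {
    this.content = content;
  }
}

class ContentKeyedStore {
  private provers = new Map<string, ProverStub>();

  private static key(c: CheckpointContent): string {
    return `${c.number}:${c.slot}:${c.archiveRoot}`;
  }

  // Re-adding identical content returns the existing prover, so its work is reused.
  add(content: CheckpointContent): ProverStub {
    const key = ContentKeyedStore.key(content);
    let prover = this.provers.get(key);
    if (!prover) {
      prover = new ProverStub(content);
      this.provers.set(key, prover);
    }
    prover.pruned = false;
    return prover;
  }

  // A prune only marks the entry; it stays registered until it is reaped.
  prune(content: CheckpointContent): void {
    const prover = this.provers.get(ContentKeyedStore.key(content));
    if (prover) {
      prover.pruned = true;
    }
  }

  // Drop pruned entries whose slot is strictly before the current slot.
  reapPrunedBefore(currentSlot: number): number {
    let reaped = 0;
    this.provers.forEach((prover, key) => {
      if (prover.pruned && prover.content.slot < currentSlot) {
        this.provers.delete(key);
        reaped++;
      }
    });
    return reaped;
  }

  get size(): number {
    return this.provers.size;
  }
}
```

In this sketch a prune followed by a re-add of the same tuple hands back the same `ProverStub` instance, while a checkpoint with a different `archiveRoot` at the same number and slot gets its own entry. Only entries that are still marked pruned once their slot has passed are dropped.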

### Removed

- `EpochProvingJob` and its sticky `epochComplete` flag, the
`finalizationScheduled` flag, the in-class restart loop, the `'reorg'`
state.
- `ProvingOrchestrator` and `EpochProvingState` (test-only legacy);
`CheckpointSubTreeOrchestrator` now extends `ProvingScheduler` directly.
- `TopTreeProvingScheduler` collapsed into `TopTreeOrchestrator` (single
  concrete subclass).
- `EpochProvingContext` (thin facade over `ChonkCache`); the sub-tree
takes
  `ChonkCache` + `EpochNumber` directly.
- `CheckpointParent` interface (vestige of `EpochProvingState` as a real
parent);
the per-checkpoint state takes three discrete
`epochNumber`/`isAlive`/`onReject`
  deps from its owner.
- `ProverNodePublisher.interrupt()` / `.restart()` and the entire
mid-publish
interrupt code path. The bug where `l1TxUtils.interrupted` leaked
between
publishes (the publisher is created fresh per publish but wraps a pooled
  `L1TxUtils`) is gone by construction.
- `CheckpointStore.resume()` (dead code) and `implements Service` on
  `CheckpointStore`.
- `ReconcileTrigger` variant `'finalised'` and
`SessionManager.onChainFinalised`
(redundant nudge; every reconcile already runs the
`recreateInvalidSessions`
  sweep).
- `ProofPublishingService.onPrune` (redundant with the session-manager
path
  which already calls `withdraw(uuid)` for every cancelled session).

### Smaller fixes folded in

- `tipsStore.handleBlockStreamEvent` moved to the `finally` block so a
throwing
  handler doesn't claim progress that didn't happen.
- Failure-upload snapshots every `CheckpointProver` regardless of
sub-tree
  completion.
- `ProverClient.stop()` uses `tryStop(facade)` to swallow
already-stopped errors.
- `lastExpiredEpoch` seeded from the last fully-proven epoch in
`start()` so a
  restart never re-sweeps epochs that already reached L1.
- `DateProvider` plumbed through `EpochSession`, `SessionManager`,
  `ProofPublishingService` — no direct `Date.now()` anywhere.
- Branded types throughout: `Map<EpochNumber, ...>` on session-manager
and
publishing-service Maps; `TopTreeJob.getRange()` returns
`CheckpointNumber`.
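
As a rough illustration of why threading a clock through the services helps, the sketch below checks a candidate's deadline against an injected clock rather than `Date.now()`. `Clock`, `ManualClock`, `Candidate`, and `candidateStatus` are hypothetical names and are not the repo's `DateProvider` API.

```typescript
// Hypothetical stand-in for an injected time source.
interface Clock {
  now(): number;
}

// Test clock that only moves when told to.
class ManualClock implements Clock {
  private current: number;
  constructor(start: number) {
    this.current = start;
  }
  now(): number {
    return this.current;
  }
  advance(ms: number): void {
    this.current += ms;
  }
}

// Hypothetical queued publish candidate with an absolute deadline in milliseconds.
type Candidate = { epoch: number; deadlineMs: number };

// Decides whether a queued candidate may still start publishing.
function candidateStatus(clock: Clock, candidate: Candidate): 'ready' | 'expired' {
  return clock.now() >= candidate.deadlineMs ? 'expired' : 'ready';
}
```

Because the function never reads the wall clock itself, a test can call `advance` on a `ManualClock` and step across the deadline deterministically, with no real waiting.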

## Test plan

- [x] `yarn workspace @aztec/prover-node test` — 161 unit tests pass
(includes
new `proof-publishing-service.test.ts`, `checkpoint-store.test.ts`).
- [x] `yarn workspace @aztec/prover-client test src/orchestrator/
src/test/bb_prover_full_rollup.test.ts`
      with `FAKE_PROOFS=1` — 24 tests pass.
- [x] `yarn workspace @aztec/stdlib test
src/interfaces/prover-node.test.ts` —
passes (state enum still includes legacy values for API compatibility).
- [x] `yarn build` — full monorepo TypeScript build clean.
- [x] `yarn format` / `yarn lint` clean across prover-node,
prover-client,
      and end-to-end.
- [ ] `yarn workspace @aztec/end-to-end test:e2e e2e_optimistic_proving`
and
      `e2e_multi_proof` — exercised by CI.
- [ ] Kind-mode network run with a synthetic L1 prune to confirm the
      cancel-and-recreate path lands a valid proof on L1.
@PhilWindle PhilWindle enabled auto-merge June 9, 2026 19:26
@PhilWindle PhilWindle added this pull request to the merge queue Jun 9, 2026
AztecBot commented Jun 9, 2026

Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/c26a71708225292d): yarn-project/end-to-end/scripts/run_test.sh ha src/composed/ha/e2e_ha_full.test.ts (267s) (code: 0)

Merged via the queue into v5-next with commit 7baab83 Jun 9, 2026
17 checks passed