
feat(sequencer): build optimistically across pruning epoch boundary #23056

Merged: PhilWindle merged 2 commits into merge-train/spartan from palla/optimistic-build-boundary on May 8, 2026
@spalladino (Contributor) commented May 7, 2026

Motivation

At a pruning epoch boundary, today's canProposeAtTime simulation pre-emptively reverts when an unproven epoch's deadline is about to expire — even if the proof lands seconds later. The slot is silently skipped. This loses a checkpoint window for no good reason: the publisher's preCheck right before L1 submission is the authoritative gate.

Similarly, under pipelining the simulation overrides were also applied to the preCheck simulation (overriding the pending chain with the last mined slot), so we silently missed the case where the epoch prune did trigger: we sent the tx and it reverted. This is fixed by using different override plans for the first simulation and the right-before-submission simulation. That said, the sequencer publisher checks are a bit convoluted now, so I plan to simplify them in a later PR.

Approach

Apply a proven-override at the three pre-submission simulation sites (canProposeAt, the globals builder, and enqueue-time validateCheckpointForSubmission) that forces pending == proven so STFLib.canPruneAtTime short-circuits to false. Submission's preCheck runs without the override against real L1 state and decides whether to actually send. A new structured preparing-checkpoint sequencer event surfaces the override/parent state for tests. Tip storage now goes through a single makeChainTipsOverride to avoid same-slot state-diff clobbering.
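
The effect of the override can be illustrated with a toy model. This is not the real STFLib.canPruneAtTime, which lives in the L1 contracts and is not shown in this description; `ChainTips`, `canPrune`, `withProvenOverride`, and the `deadlinePassed` flag are hypothetical names. The sketch also assumes the override raises proven to pending, whereas the description only says the two are forced equal.

```typescript
// Toy model only: names and logic are illustrative assumptions,
// not the real contract or sequencer code.
interface ChainTips {
  pending: number; // latest proposed checkpoint number
  proven: number; // latest proven checkpoint number
}

// Assumed shape of the prune check: nothing can be pruned when every pending
// checkpoint is already proven; otherwise a prune is due once the proof
// deadline has passed.
function canPrune(tips: ChainTips, deadlinePassed: boolean): boolean {
  if (tips.pending === tips.proven) {
    return false; // short-circuit: no unproven checkpoints
  }
  return deadlinePassed;
}

// Simulation-only override that pretends the pending chain is fully proven.
function withProvenOverride(tips: ChainTips): ChainTips {
  return { pending: tips.pending, proven: tips.pending };
}

const realTips: ChainTips = { pending: 12, proven: 8 };

// Pre-submission simulation: override applied, so building can continue.
const pruneInSimulation = canPrune(withProvenOverride(realTips), true);

// preCheck right before submission: no override, real state decides.
const pruneOnRealState = canPrune(realTips, true);

console.log({ pruneInSimulation, pruneOnRealState });
```

In this model the overridden simulation never sees a prune, while the un-overridden check still does, which is why the preCheck has to remain the final gate.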

Changes

  • archiver: isPruneDueAtSlot(slot) on L2BlockSource replicates STFLib.canPruneAtTime locally (no L1 RPC).
  • ethereum: RollupContract.makeChainTipsOverride({pending?, proven?}) writes a single combined state-diff and guards against proven > pending. forPendingCheckpoint(n)withChainTips({pending?, proven?}) on the simulation overrides builder.
  • sequencer-client (publisher): enqueueProposeCheckpoint accepts preCheckSimulationOverridesPlan separately from simulationOverridesPlan; the preCheck closure uses it (no fallback) so the parent / proven overrides never reach pre-send validation.
  • sequencer-client (sequencer): applies the proven override at the canProposeAt site, plumbs it through prunePending to CheckpointProposalJob so the globals builder and enqueue-time validation see it. New pauseProposingForSlots test-only config.
  • sequencer-client (events): new preparing-checkpoint event with targetSlot, checkpointNumber, hadProposedParent, provenOverride.
  • ethereum (test infra): Delayer.pauseNextTxUntil* accept a per-call timeout to support boundary tests that need to wait > 180s.
  • end-to-end (new tests): epochs_proof_at_boundary.parallel.test.ts covers smoke + four boundary scenarios — proof lands during pipeline sleep; proof lands well before deadline; proof never lands (with parent); proof lands / never lands without proposed parent — using structured events and retryUntil rather than log greps.
  • stdlib + interfaces: schemas and configs updated for the new RPC method and the new sequencer config knob.
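
The "single combined state-diff" point in the ethereum bullet can be sketched as follows. This is an illustration under assumptions, not the actual RollupContract.makeChainTipsOverride: it assumes both tips share one storage word (pending in the first half, proven in the second), and `TIPS_SLOT`, `packTips`, `makeChainTipsOverrideSketch`, and the `current` argument are hypothetical. It shows why two separate overrides for the same slot clobber each other and how one combined write avoids that.

```typescript
// Illustrative sketch: slot key, word layout, and helper names are assumptions.
type StateDiff = Record<string, string>; // storage slot -> 32-byte hex word

interface Tips {
  pending: number;
  proven: number;
}

const TIPS_SLOT = "tips"; // hypothetical slot key shared by both tips

// Encode a number as a 16-byte (32 hex chars) big-endian half word.
function halfWord(n: number): string {
  const hex = n.toString(16);
  return "0".repeat(32 - hex.length) + hex;
}

// One word holds both tips (assumed layout: pending first, then proven).
function packTips(tips: Tips): string {
  return "0x" + halfWord(tips.pending) + halfWord(tips.proven);
}

// Build a single state-diff entry from optional overrides, filling the
// non-overridden side from the current tips and rejecting proven > pending.
function makeChainTipsOverrideSketch(current: Tips, override: Partial<Tips>): StateDiff {
  const pending = override.pending ?? current.pending;
  const proven = override.proven ?? current.proven;
  if (proven > pending) {
    throw new Error(`proven (${proven}) cannot exceed pending (${pending})`);
  }
  return { [TIPS_SLOT]: packTips({ pending, proven }) };
}

const current: Tips = { pending: 10, proven: 7 };

// Clobbering: merging two independent diffs for the same slot keeps only the
// last one, so the pending override is silently lost here.
const clobbered: StateDiff = {
  ...makeChainTipsOverrideSketch(current, { pending: 11 }),
  ...makeChainTipsOverrideSketch(current, { proven: 10 }),
};

// Combined: both overrides land in the same word.
const combined = makeChainTipsOverrideSketch(current, { pending: 11, proven: 11 });

console.log({ clobbered, combined });
```

Routing every tip override through one builder means there is only ever one write per slot, so the order in which callers add overrides no longer matters.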

@spalladino added the ci-no-fail-fast label (Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure) on May 7, 2026
When the proof-submission deadline expires at slot N, the proposer for
slot N+1 normally cannot build because the L1 contract's canProposeAt
simulation reverts (it expects an in-tx prune to reset tips.pending).
This conservative gate forfeits the slot even when the proof lands
moments later.

Add a proven-override applied at canProposeAt, the globals builder, and
enqueue-time validation that pretends pending == proven, so canPruneAtTime
returns false in simulation. The publisher's preCheck (without the
override) runs against real L1 right before submission and is the
authoritative gate: if the proof landed, propose; if not, discard.
@spalladino force-pushed the palla/optimistic-build-boundary branch from 3cc390b to 23404e4 on May 7, 2026 18:22
@PhilWindle merged commit d14b8a0 into merge-train/spartan on May 8, 2026
14 checks passed
@PhilWindle deleted the palla/optimistic-build-boundary branch on May 8, 2026 15:23
spalladino pushed a commit that referenced this pull request May 8, 2026
…dary (#23108)

## Summary

Fixes flaky CI on `merge-train/spartan`
([run](https://github.com/AztecProtocol/aztec-packages/actions/runs/25570963690),
[log](http://ci.aztec-labs.com/1778262953204813)) where
`epochs_proof_at_boundary.parallel.test.ts > proof never lands so no
checkpoint submission is attempted` failed with:

```
expect(received).toBe(expected)
  Expected: 31
  Received: 32
> 312 |     expect(Number(firstPostBoundary.slot)).toBe(Number(boundarySlot) + 1);
```

## Root cause

The assertion's inline comment explicitly acknowledges this is
*empirical*: whether the on-chain prune fires in-tx at `boundarySlot+1`
or only at `boundarySlot+2` depends on real-time L1 / proposer-rebuild
timing. In this run, slot 31's pipelined propose still failed
(`Rollup__InvalidArchive`) and slot 32 was the first slot where the
propose was accepted and the checkpoint published.

The merge-train head — #23098 (one-line log-context fix) — cannot
influence this timing. The flake originated from #23056
(`feat(sequencer): build optimistically across pruning epoch boundary`)
earlier in the same train.

## Fix

Relax `toBe(boundarySlot + 1)` → `toBeLessThanOrEqual(boundarySlot + 2)`
for both the no-parent and with-parent variants of "proof never lands".
The lower bound is already enforced by
`waitForFirstCheckpointAfterBoundary` filtering for `slot >
boundarySlot`. The test's intent (a checkpoint lands in the new epoch
shortly after the boundary) is preserved.

The other two boundary tests where the proof DOES land use
`checkpointNumber >= boundaryPublished.checkpoint`, not slot equality,
so they aren't affected.
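
The relaxed bound can be written as a small predicate. This is a sketch with a hypothetical helper name (`isAcceptablePostBoundarySlot`), not the test's actual code, which uses Jest matchers directly. The sample values come from the failure above, where `boundarySlot + 1` was 31 and the received slot was 32.

```typescript
// Hypothetical helper mirroring the relaxed assertion.
function isAcceptablePostBoundarySlot(slot: number, boundarySlot: number): boolean {
  // Lower bound: in the test this is enforced by the wait helper's
  // `slot > boundarySlot` filter.
  const afterBoundary = slot > boundarySlot;
  // Relaxed upper bound: the prune may fire at boundarySlot + 1 or + 2.
  const withinWindow = slot <= boundarySlot + 2;
  return afterBoundary && withinWindow;
}

const boundarySlot = 30;
const onTime = isAcceptablePostBoundarySlot(31, boundarySlot); // previously required value
const oneLate = isAcceptablePostBoundarySlot(32, boundarySlot); // value seen in the flaky run
const tooLate = isAcceptablePostBoundarySlot(33, boundarySlot);

console.log({ onTime, oneLate, tooLate });
```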

Full analysis:
https://gist.github.com/AztecBot/b4010e694332cca93a51024915867e9a

## Test plan

CI on this PR. The container ClaudeBox runs in lacks docker / writeable
cache, so local `./bootstrap.sh ci` could not be executed.


ClaudeBox log: https://claudebox.work/s/d49b46d7e0cb49a6?run=1
rangozd pushed a commit to rangozd/aztec-packages that referenced this pull request May 16, 2026
BEGIN_COMMIT_OVERRIDE
fix(test): warp L1 forward when proposer scan hits EpochNotStable
(AztecProtocol#22967)
test(e2e): fail epochs tests on proposer-rollup-check-failed (AztecProtocol#22965)
fix: grafana switch to aztec_status="proposed" (AztecProtocol#22978)
chore: update benchmark scraper (AztecProtocol#22984)
test(e2e): migrate simple epoch tests to pipelining (AztecProtocol#22973)
chore: remove top-level yarn.lock (AztecProtocol#22987)
refactor(archiver)!: unify L2BlockSource checkpoint lookups via query
objects (AztecProtocol#22933)
fix(sequencer): bounded sweep instead of event scan for governance
proposal check (AztecProtocol#22989)
fix(docs): allow webapp-tutorial yarn install to populate empty lockfile
in CI (AztecProtocol#23000)
test(e2e): enable pipelining in l1-reorgs and mbps redistribution tests
(AztecProtocol#23009)
fix(archiver): restore pending block height metric under pipelining
(AztecProtocol#22994)
chore(p2p): remove skipped validation result option (AztecProtocol#23034)
refactor(p2p)!: remove slow tx collection flow (AztecProtocol#22878)
chore(spartan): add next-net-clone environment config (AztecProtocol#22995)
chore(sequencer): add context to proposer-rollup-check-failed logs
(AztecProtocol#23071)
test(e2e): wait for archiver sync before asserting pipelining (AztecProtocol#22997)
refactor(node-rpc)!: remove deprecated AztecNode methods and
L2BlockSource tip helpers (AztecProtocol#22934)
feat(p2p): detect and track announce IP changes at runtime (AztecProtocol#22405)
test: mark tx_stats_bench 10 TPS as flake-retryable on
merge-train/spartan (AztecProtocol#23083)
fix(sequencer): bind vote-only multicalls to target slot under
pipelining (AztecProtocol#23090)
feat(sequencer): build optimistically across pruning epoch boundary
(AztecProtocol#23056)
fix(sequencer): use chainTipsOverride.pending for log context (AztecProtocol#23098)
test(e2e): relax post-boundary slot assertion in
epochs_proof_at_boundary (AztecProtocol#23108)
fix(bb-prover): pool long-lived bb verifier processes instead of
spawning per-call (AztecProtocol#23093)
fix(sequencer): anchor fee asset price modifier to predicted parent
(AztecProtocol#23113)
chore: error log when L1 head timestamp drifts (AztecProtocol#22947)
fix(sequencer): override full parent checkpoint cell in pipelined
simulation (AztecProtocol#23073)
test(e2e): enable pipelining on missed l1 slot test (AztecProtocol#23068)
fix: more robust metrics reporting in IRM monitor (AztecProtocol#23038)
fix: preserve LMDB slashing protection (AztecProtocol#23145)
test(e2e): enable pipelining on p2p tests (AztecProtocol#23070)
fix(archiver): move L2 tips cache refresh out of write transactions
(AztecProtocol#23110)
test(e2e): fix data_withholding_slash flake by freezing L1 across
restart (AztecProtocol#23162)
fix(validator): include proposed checkpoint out-hashes when validating
checkpoint proposals (AztecProtocol#23119)
refactor(config): drop nested config option, flatten l1Contracts
(AztecProtocol#23143)
test(e2e): bump bash TIMEOUT for e2e_p2p/add_rollup to match jest 20m
(AztecProtocol#23177)
fix(p2p): chunk archive of mined txs on block finalization (A-969)
(AztecProtocol#23085)
fix(p2p): stream tx pool hydration to bound startup memory (A-968)
(AztecProtocol#23086)
chore: remove orphan --archiver flag usages from start invocations
(AztecProtocol#23186)
feat(ci): daily merge-train/spartan stale-PR notifier (AztecProtocol#23189)
fix: preserve contract artifact permissions (AztecProtocol#23174)
fix(ci3): accept slashes in /list/<path:key> for merge-train
history (AztecProtocol#23160)
feat(ci): route merge-train/spartan flake notifications to
#team-alpha-ci (AztecProtocol#23219)
fix(cheat-codes): wait for post-warp L2 block in warpL2TimeAtLeastTo
(AztecProtocol#23213)
feat: slash attesters signing over bad checkpoints (AztecProtocol#23180)
refactor(prover-client): split orchestrator into sub-tree + top-tree
pair (AztecProtocol#22996)
fix(srs): retry transient CRS HTTP downloads with exponential backoff
(AztecProtocol#23244)
refactor(p2p): remove old reqresp mode (AztecProtocol#23158)
docs(sequencer-client): rewrite top-level and timing READMEs (AztecProtocol#23149)
fix(aztec-node): include upcoming checkpoint's L1 to L2 messages in
simulatePublicCalls (AztecProtocol#23163)
END_COMMIT_OVERRIDE