feat: merge-train/spartan#23580
Merged
…23502)

## Motivation

`archiver/src/modules/l1_synchronizer.ts` skipped checkpoints with insufficient/invalid attestations under the assumption that the next proposer would invalidate them before publishing. When that assumption was violated (i.e., proposer P2 published a valid-attestations checkpoint that extended P1's invalid one), the archiver hit `InitialCheckpointNumberNotSequentialError` in `block_store.addCheckpoints`, the catch handler rolled back the L1 sync point, and the next poll re-fetched the same range and re-threw. The archiver looped indefinitely.

The protocol already defines `OffenseType.PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS` for exactly this case, but the slasher couldn't see valid-attestations descendants because the archiver threw before emitting any event.

### Human Note

This is particularly relevant under pipelining. Attestors now attest to a checkpoint _before_ the previous one is pushed to L1, so they can inadvertently attest to a checkpoint built on top of one that became invalid when it was published to the rollup contract with wrong attestations. So an honest attestor could get slashed if the proposer was malicious.

## Approach

In the synchronizer, persist rejected ancestors in the block store keyed by archive root. On each new checkpoint, before attestation validation, compare its `header.lastArchiveRoot` against the persisted set. If it matches, skip the checkpoint as a descendant of an invalid ancestor and emit a new `L2BlockSourceEvents.CheckpointBuiltOnInvalidAncestorDetected` event with enough metadata to resolve the proposer.

The slasher's `AttestationsBlockWatcher` is fixed to slash the proposer (not the attestors) under the new event.

Fixes A-1072
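The approach above can be sketched roughly as follows. This is a minimal illustration and not the archiver's actual code: the `Checkpoint` shape, the `RejectedAncestorTracker` class, and the outcome names are hypothetical, and an in-memory `Set` stands in for the persisted block-store set. Recording the skipped descendant's own archive root as rejected is an assumption made here so that deeper descendants are also skipped.

```typescript
// Hypothetical, simplified checkpoint shape; the real archiver types are richer.
type Checkpoint = {
  archiveRoot: string; // archive root after applying this checkpoint
  lastArchiveRoot: string; // header.lastArchiveRoot, i.e. the parent's archive root
  hasValidAttestations: boolean; // stand-in for the real attestation validation
};

type Outcome = 'accepted' | 'invalid-attestations' | 'built-on-invalid-ancestor';

class RejectedAncestorTracker {
  // In-memory stand-in for the set persisted in the block store, keyed by archive root.
  private readonly rejected = new Set<string>();

  process(checkpoint: Checkpoint): Outcome {
    // The ancestry check runs before attestation validation.
    if (this.rejected.has(checkpoint.lastArchiveRoot)) {
      // Assumption: remember the descendant too, so its own children are skipped as well.
      this.rejected.add(checkpoint.archiveRoot);
      return 'built-on-invalid-ancestor'; // this is where the new event would be emitted
    }
    if (!checkpoint.hasValidAttestations) {
      this.rejected.add(checkpoint.archiveRoot);
      return 'invalid-attestations';
    }
    return 'accepted';
  }
}

// P1 publishes an invalid checkpoint, P2 extends it with valid attestations.
const tracker = new RejectedAncestorTracker();
const p1Outcome = tracker.process({ archiveRoot: '0xp1', lastArchiveRoot: '0xgenesis', hasValidAttestations: false });
const p2Outcome = tracker.process({ archiveRoot: '0xp2', lastArchiveRoot: '0xp1', hasValidAttestations: true });
```

With this shape the P2 checkpoint is skipped with a dedicated outcome instead of reaching the block store and throwing, which is the behavior the slasher needs in order to observe the offense.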
## Summary

- Run `gcp_auth` before `setup_gcp_secrets` in `source_network_env` so EC2 benchmark jobs can read Secret Manager (e.g. `otel-collector-url`).
- Improve `setup_gcp_secrets.sh` diagnostics and activate the CI service account before secret fetches.
- Install Terraform on Linux in `install_deps.sh`; add `setup-terraform` on nightly wait jobs.
- Fix `deploy-network` checkout for pinned submodules (`fetch-depth: 0`, `lfs: true`).
- Checkout `github.sha` on the benchmark job so workflow_dispatch from a feature branch runs that branch on EC2 (not `next`).

Validated manually via Nightly Bench 10 TPS workflow_dispatch on this branch (run succeeded).

## Test plan

- [x] Nightly Bench 10 TPS workflow_dispatch from `spy/10tps-bench-terraform` (deploy, wait, benchmark)
## Summary

- Stabilizes the multiple-validator sentinel e2e by waiting for a post-warmup checkpoint before recording the assertion window.
- Reuses the same warm-up helper in the second test so isolated runs avoid the same fresh-network startup noise before stopping a validator.

## Failed run

Failed CI run: http://ci.aztec-labs.com/07fb31bc0706159f

The failing test was `e2e_p2p_multiple_validators_sentinel > collects attestations for all validators on a node`. The test expected no `attestation-missed` entries, but the assertion window started while the network was still in the first pipelined slots after startup. In the failed run, slot 8 was built on a pending, not-yet-checkpointed parent, so some remote validators could not validate/attest in time and the sentinel recorded a missed attestation.

## Fix

The test now waits for one warm-up slot and then waits for the observed checkpoint number to advance before capturing `initialSlot`. That keeps startup pipelining behavior out of the strict sentinel assertion window while preserving the test's actual coverage: once the network is past warm-up, every validator should be observed attesting or proposing as expected.

## Verification

- `yarn format end-to-end`
- `yarn build`
- `yarn workspace @aztec/end-to-end test:e2e e2e_p2p/multiple_validators_sentinel.parallel.test.ts -t 'collects attestations for all validators on a node'`
…est.ts` (#23568) Fix web3signer e2e `e2e_multi_validator_node_key_store.test.ts` by removing the minTxsPerBlock override so the pipelining preset can publish empty checkpoints while txs arrive. Also anchors the test PXE to the checkpointed chain tip to prevent checkpoint prunes from killing sent txs.
Running ci.sh grind was failing with `sethostname: invalid argument`. Codex attributed the failure to the long branch name, which produced an instance name too long for `sethostname`. Confirmed that switching to a shorter branch name fixed the issue.

```
--- request build instance (SSH) ---
Requesting m6a.48xlarge spot instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Waiting for instance id for spot request: sir-dvtzjepj...
Timeout waiting for spot request.
Requesting m6a.48xlarge on-demand instance (name: spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de) (type: m6a.48xlarge) (ami: ami-067627aa971a1dcbb) (bid: 8.3136)...
Instance id: i-0fd2be01d28ec47e5
Waiting for SSH at 13.58.96.227...
--- connect via SSH ---
Stdout is not a tty, running in background...
Host processes pinned to OS CPUs: 88-95,184-191
HOST: fetching EC2 metadata token...
HOST: metadata token acquired.
HOST: decoding credentials...
HOST: starting devbox container...
HOST: devbox container launched (pid=10513). Monitoring for spot termination...
HOST: preparing devbox (uid/gid, docker run)...
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: sethostname: invalid argument
```
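One way to cap a generated instance name is sketched below. It assumes the limit being hit is Linux's 64-byte hostname maximum and that names are ASCII; `capHostname` and its keep-the-suffix strategy are hypothetical and are not taken from the actual fix, which lives in the CI scripts.

```typescript
// Assumption: Linux rejects hostnames longer than 64 bytes.
const HOST_NAME_MAX = 64;
// Assumption: the trailing 16 characters are the unique run id worth preserving.
const SUFFIX_LENGTH = 16;

function capHostname(name: string, max: number = HOST_NAME_MAX): string {
  if (name.length <= max) {
    return name;
  }
  if (max <= SUFFIX_LENGTH + 1) {
    // Not enough room for head + separator + suffix: plain truncation.
    return name.slice(0, max);
  }
  // Keep as much of the head as fits, then the unique suffix.
  const head = name.slice(0, max - SUFFIX_LENGTH - 1);
  return `${head}-${name.slice(-SUFFIX_LENGTH)}`;
}

// The instance name from the log above exceeds the limit.
const capped = capHostname('spl_fix-web3signer-pipelining-test_amd64_grind-test-cdfb13e6637062de');
```

Keeping the suffix rather than truncating from the right preserves the part of the name that distinguishes one run from another.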
Fixes the invalid checkpoint descendant e2e timing by keeping sequencers stopped until the test has selected adjacent target proposers, installed listeners, applied malicious configs, and warped to the intended pipelined build window. This avoids applying malicious config to an earlier slot owned by the same validator, which is what caused the CI run for PR #23502 to miss the intended P1/P2 checkpoint pair.
…iple checkpoints` (#23590)

Summary:
- Scan for consecutive bad checkpoint slots whose prior pipelined target slot is not owned by either intended bad proposer.
- Keep the malicious-config injection tied to the selected bad proposers and remove the now-unnecessary non-null assertion.
- Add an inline comment documenting why the prior pipelined target slot matters.

Why:
The test applies malicious checkpoint config while sequencers are already running. With proposer pipelining, the previous target slot can snapshot that config before the intended bad slots are built. If that prior proposer is one of the intended bad proposers, the test may spend the malicious config on the wrong checkpoint and stop validating the intended two-checkpoint invalidation path.

This mirrors the slot-selection issue fixed for the invalid proposal slashing test, but applies it to the consecutive checkpoint invalidation scenario.

Testing:
- yarn format end-to-end
- yarn build
- LOG_LEVEL="info; debug:sequencer,publisher,validator" yarn workspace @aztec/end-to-end test:e2e e2e_epochs/epochs_invalidate_block.parallel.test.ts -t "proposer invalidates multiple checkpoints"
…ed_invalid_proposal` (#23589)

## Summary

- skip target slots in attested invalid proposal slashing when the previous pipelined target slot has the same bad proposer
- log the previous pipelined target proposer while selecting the test slot

## Why

CI run http://ci.aztec-labs.com/bf99262466eae1dd selected slot 21 for the invalid checkpoint scenario, but the same bad proposer could first run a prior pipelined slot and build only a partial checkpoint. That left the test waiting for block-proposed events on the intended slot that never arrived. Requiring the previous pipelined target slot to have a different proposer keeps the malicious config from being consumed by the wrong slot after the epoch warp.

## Testing

- yarn format end-to-end
- yarn build
- LOG_LEVEL='info; debug:sequencer,publisher,validator' yarn workspace @aztec/end-to-end test:e2e e2e_slashing/attested_invalid_proposal.test.ts
## Summary
Adds a zero fast-path to `toBufferBE`, the bigint→big-endian-buffer
conversion underlying `Fr.toBuffer()`. Field elements serialized in
protocol structs are overwhelmingly zero (kernel public inputs are
mostly fixed-size zero-padding), so short-circuiting the zero case
avoids a wasteful `bigint → hex string → Buffer.from(hex)` round-trip.
```ts
if (num === 0n) {
return Buffer.alloc(width);
}
```
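For context, a self-contained sketch of how the fast-path could sit inside the whole conversion is shown here. It assumes the hex round-trip described above; the error messages and the overflow check are illustrative choices, not a copy of the foundation implementation.

```typescript
import { Buffer } from 'node:buffer';

// Sketch of a bigint -> big-endian buffer conversion with a zero fast-path.
function toBufferBE(num: bigint, width: number): Buffer {
  if (num < 0n) {
    throw new Error(`Cannot convert negative bigint ${num} to a buffer`);
  }
  if (num === 0n) {
    // Buffer.alloc returns a zero-filled buffer, so no hex round-trip is needed.
    return Buffer.alloc(width);
  }
  const hex = num.toString(16);
  if (hex.length > width * 2) {
    // Assumption: overflow is reported rather than silently truncated.
    throw new Error(`Value 0x${hex} does not fit in ${width} bytes`);
  }
  // Left-pad to the full width so the output is big-endian and fixed-size.
  return Buffer.from(hex.padStart(width * 2, '0'), 'hex');
}

const zeroField = toBufferBE(0n, 32);
const smallField = toBufferBE(1n, 4);
```

The zero check is a single comparison, so non-zero inputs fall through to the same hex path as before.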
## Why
Profiling `Tx.toBuffer()` showed it spends ~6.7ms almost entirely in
per-field `Fr.toBuffer()` across ~3900 fields, and **96% of those fields
are zero**. The scalar conversion is already near-optimal otherwise — a
64-bit-words variant (`writeBigUInt64BE`×4) is actually *slower* on real
(non-zero) field elements because V8's bigint shifts allocate.
Micro-benchmark of `toBufferBE` variants (width=32, correctness-checked
against current):
| variant | 96%-zero (real) | all-random (worst case) |
|---|---|---|
| current | 452 ns | 382 ns |
| 64-bit words | 215 ns | 503 ns (slower) |
| **zero fast-path** | **55 ns** | 387 ns (free) |
The fast-path is ~8× on the real workload and costs one `=== 0n` compare
on the worst case.
## Impact
End-to-end on `mockTx(42)`:
| | before | after |
|---|---|---|
| `tx.toBuffer()` total | 6.66 ms | 4.20 ms (−37%) |
| `data.toBuffer()` | 4.34 ms | 2.25 ms (−48%) |
`data.toBuffer()` (the kernel public inputs) is the production-relevant
figure: the mock serializes an uncompressed proof, whereas real txs
carry a compressed proof that serializes as a single blob. The benefit
applies to every `Fr.toBuffer()` / serialization path in the monorepo,
not just txs.
The remaining cost is structural — a Buffer is allocated per field and
then `Buffer.concat`'d across thousands of them. Eliminating that needs
a single-preallocated-buffer serializer; this change is the safe,
broadly-beneficial first step.
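A rough idea of what a single-preallocated-buffer serializer could look like is sketched below. `serializeFields` is hypothetical and unbenchmarked; it only illustrates writing each field at its offset in one zero-filled buffer instead of allocating a buffer per field and concatenating.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical sketch: serialize many field elements into one preallocated buffer.
function serializeFields(fields: bigint[], width: number = 32): Buffer {
  // One zero-filled allocation for all fields; zero fields need no write at all.
  const out = Buffer.alloc(fields.length * width);
  fields.forEach((field, index) => {
    if (field < 0n) {
      throw new Error(`Cannot serialize negative field ${field}`);
    }
    if (field === 0n) {
      return;
    }
    const hex = field.toString(16);
    if (hex.length > width * 2) {
      throw new Error(`Field 0x${hex} does not fit in ${width} bytes`);
    }
    // Write the left-padded big-endian bytes directly at this field's offset.
    out.write(hex.padStart(width * 2, '0'), index * width, width, 'hex');
  });
  return out;
}

const packed = serializeFields([0n, 1n, 0xffn], 4);
```

Because the output starts zero-filled, the mostly-zero workload described above would skip the write entirely for the common case.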
## Testing
`toBufferBE` previously had no direct unit tests; added coverage for the
zero path, big-endian left-padding, exact-width values, and the
negative-input throw. The conversion is otherwise byte-identical to
before.
Fix A-1109
Symlinking the `.codex` folders causes the Codex sandbox, and with it the `apply_patch` command, to fail. The fix is to remove the symlinks for all the `.codex` folders and instead create actual folders whose contents are symlinks. A pre-commit hook checks that all contents are symlinked.

> The issue is the tracked symlink:
>
> yarn-project/.codex -> .claude
>
> The sandbox is trying to enforce /home/santiago/Projects/aztec-4/yarn-project/.codex as a read-only
> path, but yarn-project is also a writable root. Since .codex is a symlink inside that writable root,
> bubblewrap refuses to set up the sandbox:
>
> Fatal error: cannot enforce sandbox read-only path .../.codex
> because it crosses writable symlink .../.codex
>
> So apply_patch is not uniquely broken. I reproduced the same sandbox setup failure with simple
> sandboxed commands like pwd and ls. Commands that are already approved or explicitly escalated can
> still run because they bypass that sandbox path setup.

This issue had been introduced in #23400.
Fixes an issue introduced in #23593. Also fixes the content hash so that the tests run on any change to the claude or codex folders; the stale hash is what let the test failure go unnoticed in the PR where it was introduced.
…l` (#23604) Instead of checking a range of slots, we only check the slot we're interested in. This prevents any build errors that occurred before things stabilized from interfering. For instance, the sequencer we stop could cause the _next_ sequencer to miss their block. Looking only at the `sentinelSlot` removes this nondeterminism.
…23608) Fixes flake in `proposer invalidates multiple checkpoints` `e2e_epochs/epochs_invalidate_block.parallel.test.ts` test that caused a timeout (see [this run](http://ci.aztec-labs.com/8b1c0f4ec6031f2b)). See below for the Codex analysis and fix.

---

**Test Summary**

`proposer invalidates multiple checkpoints` verifies that two intended bad checkpoints land with insufficient attestations, a later good proposer invalidates the first bad checkpoint, and the chain then progresses.

**Failed Run Error**

CI run `8b1c0f4ec6031f2b` timed out at Jest’s 600s limit. The failure was not the shutdown L1 send error; that happened after the timeout while teardown was interrupting pending work.

**Failed vs Successful Divergence**

First meaningful divergence: checkpoint 4 at slot 23.

Failed log: slot 23 published checkpoint 4 with only 1 attestation, then archivers reported `Insufficient attestations ... actualAttestations:1`.

Successful log: slot 23 collected all 5 attestations before publishing checkpoint 4, so the first intentionally bad checkpoints were later.

**Timeline**

Failed:
- `15:59:11` selected intended bad slots 25/26, applied bad config to proposer `0x15...`
- `15:59:35` slot 23 job prepared by that same proposer
- `16:00:15` checkpoint 4 at slot 23 landed with 1 attestation
- repeated rollback/retry consumed enough time to hit Jest timeout

Successful:
- slot 23 checkpoint landed cleanly with 5 attestations
- intended bad checkpoints at slots 24/25 landed with 1 attestation
- checkpoint 5 was invalidated
- test completed successfully

**Hypothesis**

High confidence: the test’s bad-slot selection only excluded `candidateSlot1 - 1` as a pre-bad pipelined target. In the failed run, `candidateSlot1 - 2` was still unsnapshotted and owned by a bad proposer, so applying malicious config leaked into slot 23.

**Evidence**

- Logs: failed run selected slots 25/26 but slot 23 later published with 1 attestation from the newly bad proposer.
- Source: pipelined checkpoint jobs snapshot sequencer config when the target-slot job is created, so applying config while sequencers are running can affect any not-yet-created pre-bad job.
- Skeptic check: no contradiction found; it also caught a broken local timeout race.

**Proposed Fix**

Implemented in [epochs_invalidate_block.parallel.test.ts](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:393): the selector now excludes bad proposers from every pre-bad target slot from `currentSlot + 2` through `candidateSlot1 - 1`, not just the immediately prior slot. Also fixed the broken timeout race at [line 475](/home/santiago/Projects/aztec-1/yarn-project/end-to-end/src/e2e_epochs/epochs_invalidate_block.parallel.test.ts:475) by removing the accidental inner `await`.
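The widened selection rule can be sketched as below. This is a hypothetical reconstruction from the description, not the test's code: `selectBadSlots`, the `proposerAt` lookup, and starting the candidate search at `currentSlot + 3` are all assumptions.

```typescript
type Selection = { slot1: number; slot2: number; badProposers: string[] };

// Pick two consecutive slots whose proposers do not own any pre-bad target slot.
function selectBadSlots(
  currentSlot: number,
  lastSlot: number,
  proposerAt: (slot: number) => string,
): Selection | undefined {
  // Per the description, pre-bad target slots start at currentSlot + 2.
  const firstPreBadSlot = currentSlot + 2;
  // Assumption: the first candidate leaves at least one pre-bad slot to check.
  for (let slot1 = firstPreBadSlot + 1; slot1 + 1 <= lastSlot; slot1++) {
    const bad = new Set<string>([proposerAt(slot1), proposerAt(slot1 + 1)]);
    let clean = true;
    // Check every slot from currentSlot + 2 through slot1 - 1, not only slot1 - 1.
    for (let slot = firstPreBadSlot; slot < slot1; slot++) {
      if (bad.has(proposerAt(slot))) {
        clean = false;
        break;
      }
    }
    if (clean) {
      return { slot1, slot2: slot1 + 1, badProposers: Array.from(bad) };
    }
  }
  return undefined;
}

// Hypothetical proposer schedule: proposer A owns slots 22 and 24.
const schedule = new Map<number, string>([
  [22, 'A'], [23, 'B'], [24, 'A'], [25, 'C'], [26, 'D'], [27, 'E'],
]);
const selection = selectBadSlots(20, 27, slot => schedule.get(slot) ?? 'unknown');
```

In this schedule a rule that only inspects `slot1 - 1` would accept slots 24/25, even though proposer A also owns slot 22 and could consume the malicious config there; checking the whole pre-bad range moves the selection to slots 25/26.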
- record zero-amount slashing offenses instead of treating penalty 0 as disabled
- keep slash vote signaling gated by existing summed vote output
- upgrade stored duplicate offenses when a later observation has a higher amount

Fix A-1075
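The first and third bullets can be illustrated with a small store sketch. The `Offense` shape, the key format, and `OffenseStore` are hypothetical stand-ins for the slasher's real types.

```typescript
// Hypothetical offense record; field names are illustrative.
type Offense = { validator: string; offenseType: string; epochOrSlot: number; amount: bigint };

class OffenseStore {
  private readonly offenses = new Map<string, Offense>();

  private static keyOf(offense: Offense): string {
    return `${offense.validator}:${offense.offenseType}:${offense.epochOrSlot}`;
  }

  add(offense: Offense): 'added' | 'upgraded' | 'ignored' {
    const key = OffenseStore.keyOf(offense);
    const existing = this.offenses.get(key);
    if (existing === undefined) {
      // A zero amount is still recorded: penalty 0 does not mean "disabled".
      this.offenses.set(key, offense);
      return 'added';
    }
    if (offense.amount > existing.amount) {
      // A later observation of the same offense with a higher amount replaces the stored one.
      this.offenses.set(key, offense);
      return 'upgraded';
    }
    return 'ignored';
  }

  amountFor(offense: Offense): bigint | undefined {
    return this.offenses.get(OffenseStore.keyOf(offense))?.amount;
  }
}

const store = new OffenseStore();
const base = { validator: '0xabc', offenseType: 'INACTIVITY', epochOrSlot: 7 };
const first = store.add({ ...base, amount: 0n });
const second = store.add({ ...base, amount: 5n });
```

Whether a recorded offense leads to a slash vote is a separate decision; per the second bullet, that stays gated by the summed vote output.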
Fix A-1075
## Overview

Adds a tx validation cache to the p2p layer so that repeated validation of the same transaction by the same validator reuses the prior result instead of redoing the work (notably the expensive proof verification).

Downside: using this cache adds up to 7ms of overhead to each validation when the object needs to be hashed. That overhead is dominated (over 90%) by `.toBuffer()` time.

Cached validation is added for on-demand tx collection, but NOT for gossip and RPC ingress.

## Changes

**Cache core (`p2p/src/msg_validators/tx_validator/`)**

- `TxValidationCache`: bounded, LRU-evicting cache keyed by `(validatorSymbol, txHash)`. Stores the in-flight promise before awaiting, so concurrent validations of the same tx coalesce into a single call. `get`/`set`/`delete` take the cache key directly; `key(validatorSymbol, tx)` builds it.
- `CachedTxValidator`: wraps any `TxValidator` to route `validateTx` through the cache using the validator's `identifier` symbol. `DataTxValidator` and `TxProofValidator` gained stable `identifier`s.
- `factory.ts`: threads an optional `TxValidationCache` through the gossip (first/second stage), block-proposal, on-demand, and RPC validator builders, wrapping the state-independent validators (`DataTxValidator`, `TxProofValidator`, and the minimum-integrity aggregate) in `CachedTxValidator`.

**LRU map extracted to foundation (`foundation/src/collection/lru_map.ts`)**

- The hand-rolled doubly-linked-list LRU bookkeeping was factored out of `TxValidationCache` into a generic `LruMap<K, V>`, mirroring the existing `LruSet`. `TxValidationCache` now composes an `LruMap<string, Promise<TxValidationResult>>`. Added `LruMap` unit tests.

**Wiring**

- New `P2P_TX_VALIDATION_CACHE_SIZE` env var / `txValidationCacheSize` config (cache disabled when `0`).
- `createP2PClient` constructs the cache and passes it to `LibP2PService` (gossip + block-proposal paths) and to the batch-tx-requester's on-demand validator config.

**Benchmarks**

- Added a sha256-based TX hash benchmark.

Closes https://linear.app/aztec-labs/issue/A-934/dont-repeatedly-verify-retrieved-transactions .
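The core idea of caching the in-flight promise in an LRU structure can be sketched as follows. `PromiseLruCache` is a hypothetical simplification: it relies on `Map` insertion order for recency instead of the doubly-linked list in `LruMap`, and it takes a plain string key where the real cache derives one from `(validatorSymbol, txHash)`.

```typescript
// Hypothetical sketch of a bounded LRU cache that stores in-flight promises.
class PromiseLruCache<V> {
  private readonly entries = new Map<string, Promise<V>>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  getOrCompute(key: string, compute: () => Promise<V>): Promise<V> {
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, cached);
      return cached;
    }
    // Store the promise before it settles so concurrent callers share one computation.
    const promise = compute();
    this.entries.set(key, promise);
    if (this.entries.size > this.maxSize) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
    return promise;
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}

// Two concurrent validations of the same key trigger a single computation.
let validations = 0;
const validate = (): Promise<string> => {
  validations++;
  return Promise.resolve('valid');
};
const cache = new PromiseLruCache<string>(2);
const firstCall = cache.getOrCompute('proof:0x01', validate);
const secondCall = cache.getOrCompute('proof:0x01', validate);
```

Storing the promise rather than the resolved value is what makes the coalescing work: the second caller finds an entry even though the first validation has not finished yet.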
delete vm + ip + private-keys + dry run
Adds `bbapi/chonk_pinned_inputs.test` to `.test_patterns.yml` with `skip: true`. The test fails in CI with `No execution steps in ivc-inputs.msgpack` (pinned Chonk IVC inputs missing/stale), e.g. http://ci.aztec-labs.com/d7647a841ee811a0 — both the native and wasm backend cases fail at `loadPinnedFlow`. Requested by Santiago Palladino to unblock the merge train; owner set to palla for follow-up. --- *Created by [claudebox](https://claudebox.work/v2/sessions/08e2c5b593fedd23) · group: `slackbot`*
spalladino pushed a commit that referenced this pull request on May 29, 2026
…lake (#23653)

## Why

The `merge-train/spartan` train PR (#23580) was dequeued from the merge queue. The merge-queue CI3 run ([run 26608568295](https://github.com/AztecProtocol/aztec-packages/actions/runs/26608568295)) failed in the `x2-full amd64 ci-full-no-test-cache` grind during the **aztec-nr warnings check**, after only ~11s:

```
Checking aztec-nr for warnings...
Cloning into '.../noir-lang/poseidon/v0.3.0'...
Cloning into '.../noir-lang/sha256/v0.3.0'...
fatal: unable to access 'https://github.com/noir-lang/sha256/': Could not resolve host: github.com
Cannot read file .../noir-lang/sha256/v0.3.0/Nargo.toml - does it exist?
make: *** [Makefile:303: aztec-nr] Error 1
```

A transient DNS/network flake on the runner, not a code defect. `aztec-nr/aztec/Nargo.toml` declares external git dependencies (`noir-lang/sha256`, `noir-lang/poseidon`, pinned at `v0.3.0`) which `nargo check` resolves by cloning from `github.com` on a cold cache. When the runner momentarily can't resolve `github.com`, the clone fails and dequeues the whole train.

## What

A blanket `retry` around `nargo check` would also re-run on genuine check failures (type errors, denied warnings), wasting CI time and masking intent. So instead:

- **`ci3/retry` gains a `-p <regex>` option.** It captures the command's combined output and only retries when a failure matches the regex; any non-matching failure exits immediately with the original code. Without `-p`, behavior is unchanged (the heavily-used default path is untouched). `pipefail` ensures the wrapped command's exit code (not `tee`'s) is what's checked, and `tee` finishes before inspection so the captured output is complete.
- **`aztec-nr/bootstrap.sh`** wraps its two network-touching nargo calls (`check`, `doc --check`) with `retry -p "<git transport errors>"`, matching only `Could not resolve host`, `unable to access`, `Connection timed out/refused`, `Failed to connect`, `TLS connect error`, `early EOF`, `RPC failed`. These never overlap with nargo's `error:`/`warning:` output, so a genuine check failure still fails on the first attempt.

`nargo` has no standalone dependency-install/fetch command (its subcommands are `check, compile, dap, debug, doc, execute, expand, export, fmt, fuzz, info, init, interpret, lsp, new, test`); resolution only happens inside `check`/`compile`/`test`, so the regex-gated retry is the workable option of the two suggested.

## Verification

- `bash -n` on all three files; `ci3/tests/retry_test` (new, auto-discovered by the ci3 test runner) passes all 6 cases:
  - default mode retries a transient failure then succeeds / gives up after 3 attempts
  - pattern mode retries a matching network failure then succeeds
  - **pattern mode fails fast (1 attempt) on a non-matching genuine error** ← the behavior requested
  - pattern mode gives up after 3 attempts on a persistent matching failure
  - `RETRY_DISABLED` runs the command exactly once

The full `./bootstrap.sh ci` run is the same orchestrated remote-EC2 CI that failed here and isn't reproducible on a dev host; the transient DNS failure also can't be reproduced where DNS works. Verification is therefore at the retry/wrapper level, which is exactly what this change touches.
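The regex-gated retry logic is implemented in a shell script, but its control flow can be modelled in a few lines. The sketch below is a hypothetical TypeScript model: `retryOnPattern`, the `RunResult` shape, and the exact regex are illustrative, with the pattern approximating the transport errors listed above.

```typescript
// Hypothetical result of one command run: exit code plus combined stdout/stderr.
type RunResult = { code: number; output: string };

function retryOnPattern(
  run: () => RunResult,
  pattern?: RegExp,
  maxAttempts: number = 3,
): { code: number; attempts: number } {
  let last: RunResult = { code: 1, output: '' };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = run();
    if (last.code === 0) {
      return { code: 0, attempts: attempt };
    }
    // With a pattern, a failure that does not match exits immediately with the original code.
    if (pattern !== undefined && !pattern.test(last.output)) {
      return { code: last.code, attempts: attempt };
    }
  }
  // Persistent failure: give up and surface the last exit code.
  return { code: last.code, attempts: maxAttempts };
}

// Approximation of the git transport errors listed above (no global flag, so test() is stateless).
const transportErrors =
  /Could not resolve host|unable to access|Connection (timed out|refused)|Failed to connect|TLS connect error|early EOF|RPC failed/;

// A genuine check failure does not match and therefore is not retried.
const genuine = retryOnPattern(() => ({ code: 1, output: 'error: type mismatch' }), transportErrors);
```

Without a pattern the function retries on any failure, mirroring the unchanged default mode.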
BEGIN_COMMIT_OVERRIDE
fix(archiver): skip descendants of invalid-attestations checkpoints (#23502)
chore: scale network validators (#23579)
fix(ci): nightly 10 TPS bench GCP auth and checkout (#23586)
chore: set eth node resource profile (#23583)
fix: wait for checkpoint before sentinel assertions (#23573)
fix: slash attestations for invalid checkpoint proposals (#23506)
test: fix web3signer pipelining `e2e_multi_validator_node_key_store.test.ts` (#23568)
fix: cap CI devbox hostname (#23591)
test: stabilize invalid checkpoint descendant e2e (#23582)
test(e2e): stabilize invalidation slots in `proposer invalidates multiple checkpoints` (#23590)
test(e2e): stabilize invalid proposal slashing target slot in `attested_invalid_proposal` (#23589)
chore(foundation): faster toBufferBE via zero fast-path (#23592)
fix: honour BB_BINARY_PATH (#23570)
chore: bump reth and lighthouse (#23588)
chore: add web3signer and postgres node selectors (#23598)
fix: do not symlink .codex folders (#23593)
chore: fix claude and codex symlinking tests (#23599)
test(e2e): narrow down sentinel check in `multiple_validators_sentinel` (#23604)
test(e2e): fix `proposer invalidates multiple checkpoints` timeout (#23608)
fix: record zero-amount slashing offenses (#23556)
fix: log slashing offense names (#23565)
feat(p2p): tx validation cache (#23585)
chore: add KEDA deployment module (#23553)
chore: add KEDA prover agent autoscaling (#23554)
chore: update destroy_bootnode.sh (#23626)
chore: skip failing chonk_pinned_inputs.test in CI (#23643)
chore(ci): tolerate public authwit P2P receipt flake (#23648)
END_COMMIT_OVERRIDE