fix(ci): retry nargo dep + solc downloads to survive transient DNS drops #23490

Merged
alexghr merged 1 commit into merge-train/spartan from cb/2c63a974cfc6
May 22, 2026

@AztecBot AztecBot commented May 22, 2026

Collaborator

Why

Merge-train/spartan keeps failing on transient DNS resolution errors, e.g.:

Cloning into '/home/aztec-dev/nargo/github.com/noir-lang/poseidon/v0.3.0'...
fatal: unable to access 'https://github.com/noir-lang/poseidon/': Could not resolve host: github.com
Cannot read file .../poseidon/v0.3.0/Nargo.toml - does it exist?

7 such failures in the last week (github.com ×3 via nargo, binaries.soliditylang.org ×1 via solc; release-assets.githubusercontent.com ×3 already fixed by #23333). Root cause is almost certainly the EC2 VPC resolver's ~1024 packets/sec-per-ENI cap being exhausted by heavy parallel builds, so lookups are silently dropped.

This is the cheap, immediate mitigation: retry the two un-retried network fetches that bite the merge train. It does not fix the root cause — a host-local caching resolver does (dnsmasq spike linked below, for the future).

What

  • noir-projects/bootstrap.sh — wrap the nargo dependency-download (fmt --check prep step) in ci3/retry. On failure it wipes the partial dependency cache ($HOME/nargo) before retrying: a half-finished clone is exactly what produces the Cannot read file .../Nargo.toml error, so a naive retry would just re-hit the poisoned dir. A warm cache is left intact on success.
  • l1-contracts/bootstrap.sh — wrap the forge build --use svm solc download in ci3/retry. The merge queue disables the S3 cache, so this download path runs on every merge-train build.

Both use the existing ci3/retry helper (3 attempts; RETRY_SLEEP=10 for the nargo step to give DNS a little longer to recover).
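
The wipe-then-retry behaviour described for the nargo step can be sketched roughly as below. This is an illustration only: `retry_with_cleanup` is a hypothetical function, not the real `ci3/retry` helper (whose interface is not shown in this PR), and the command being retried is left to the caller.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual ci3/retry helper.
#
# retry_with_cleanup <attempts> <sleep_seconds> <cleanup_dir> <command...>
# Runs <command...> up to <attempts> times. After every failed attempt it
# deletes <cleanup_dir>, so a half-finished clone cannot poison the next try.
# On success the directory is left untouched, so a warm cache survives.
retry_with_cleanup() {
  local attempts=$1 sleep_s=$2 cleanup_dir=$3
  shift 3
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Guard against an empty path before removing anything.
    if [ -n "$cleanup_dir" ]; then
      rm -rf -- "$cleanup_dir"
    fi
    # Give DNS a moment to recover, except after the final attempt.
    if ((i < attempts)); then
      sleep "$sleep_s"
    fi
  done
  return 1
}
```

Under these assumptions, a call mirroring the numbers quoted above (3 attempts, 10 second sleep, `$HOME/nargo` as the cache) would look like `retry_with_cleanup 3 10 "$HOME/nargo" <dependency-download command>`. The solc download needs no cleanup step, so a plain retry is enough there.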

Dropped

An earlier commit hardened the runner's /etc/resolv.conf (options timeout:1 attempts:5 rotate + a public fallback nameserver). Removed: with rotate, queries are spread across the listed nameservers, so some would go to the public resolver, which is brittle and risks breaking any VPC-private name resolution. Not worth the blast radius for a mitigation.

Future: root-cause fix

Host-local dnsmasq caching resolver on the runner — what it would look like: https://gist.github.com/AztecBot/a22cc18bd30ec0bd3dff72b70d675304

@AztecBot added the ci and claudebox labels May 22, 2026
@AztecBot changed the title from "fix(ci): retry nargo dep + solc downloads to survive transient DNS drops" to "fix(ci): DNS relief for merge-train — retry flaky downloads + harden resolv.conf" May 22, 2026
@AztecBot changed the title from "fix(ci): DNS relief for merge-train — retry flaky downloads + harden resolv.conf" to "fix(ci): retry nargo dep + solc downloads to survive transient DNS drops" May 22, 2026
@alexghr alexghr marked this pull request as ready for review May 22, 2026 09:43
@alexghr alexghr enabled auto-merge (squash) May 22, 2026 09:43
@alexghr alexghr disabled auto-merge May 22, 2026 09:43
@alexghr alexghr enabled auto-merge (squash) May 22, 2026 09:43
@alexghr alexghr disabled auto-merge May 22, 2026 11:08
@alexghr alexghr enabled auto-merge (squash) May 22, 2026 11:08
@alexghr alexghr force-pushed the cb/2c63a974cfc6 branch from 9380c26 to 6cea711 May 22, 2026 11:08
@alexghr alexghr merged commit 9675beb into merge-train/spartan May 22, 2026
21 of 27 checks passed
@alexghr alexghr deleted the cb/2c63a974cfc6 branch May 22, 2026 11:09
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
refactor(p2p): merge FastTxCollection into TxCollection with sequential
pipeline (AztecProtocol#23245)
refactor(publisher): bundle-level simulate; drop per-action enqueue sims
(AztecProtocol#23165)
refactor(stdlib): remove deprecated RevertCode/TxExecutionResult aliases
(AztecProtocol#23249)
test(e2e): fix race in 'proposer invalidates multiple checkpoints'
(AztecProtocol#23259)
fix: clean up old jobs regardless of pending status (AztecProtocol#23260)
refactor(p2p): remove unused sendBatchRequest (AztecProtocol#23273)
chore(p2p): remove proposal_tx_collector leftovers (AztecProtocol#23276)
feat: slash truncated checkpoint proposals (AztecProtocol#23250)
refactor: remove unused map in attestation pool (AztecProtocol#23284)
chore(p2p): assert last block in checkpoint proposal is correct (AztecProtocol#23274)
refactor(l1-tx-utils): use DateProvider for fail-fast timeout check
(AztecProtocol#23257)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23277)
test(e2e): fix race in broadcasted_invalid_block_proposal_slash under
pipelining (AztecProtocol#23302)
fix(archiver): atomic getter for L2 tips (AztecProtocol#23295)
fix(sequencer): use targetSlot in tryVoteWhenEscapeHatchOpen under
pipelining (AztecProtocol#23296)
fix(world-state): make fork close idempotent for pruned forks (AztecProtocol#23298)
test(e2e): migrate passing tests to proposer pipelining (AztecProtocol#23275)
chore: update dashboard (AztecProtocol#23312)
chore: Revert "feat(sandbox): support proposer pipelining in local
network" (AztecProtocol#23313)
test: slash on bad attestation (AztecProtocol#23184)
feat(slasher): per-slot data-withholding watcher (A-523, A-525) (AztecProtocol#23116)
test(e2e): enable pipelining on e2e debug trace (AztecProtocol#23301)
test(e2e): enable pipelining on l1-to-l2 test (AztecProtocol#23300)
test(e2e): switch fee_settings to organic fee bumps under pipelining
(AztecProtocol#23303)
fix(ci): retry sqlite3mc-wasm download on transient DNS/TLS failures
(AztecProtocol#23333)
test(e2e): wait for real oracle rotation in fee_settings inflate helper
(AztecProtocol#23334)
test(e2e): anchor e2e_amm PXE to checkpointed tip under pipelining
(AztecProtocol#23336)
fix(spartan-bench): tolerate older node images in SlasherConfig schema
(AztecProtocol#23351)
fix: interrupt prover jobs in stop (AztecProtocol#23358)
test(e2e): enable pipelining on bot, fees, and avm simulator tests
(AztecProtocol#23329)
feat(sentinel): end-of-epoch evaluation with re-execution outcomes
(AztecProtocol#23286)
feat: slash for invalid checkpoint proposals (AztecProtocol#23270)
fix: fork closure in epoch proving jobs (AztecProtocol#23390)
fix(slasher): anchor watcher scans at archiver synced L2 slot (AztecProtocol#23394)
fix: avoid npm uplink for aztec-up local publishes (AztecProtocol#23396)
test(e2e): ignore benign 'Insufficient valid txs' block-build-failed in
epochs tests (AztecProtocol#23424)
chore: refactor weekly proving test wait (AztecProtocol#23395)
refactor: add fifo set (AztecProtocol#23271)
feat(sandbox): support proposer pipelining in local network (AztecProtocol#23327)
fix(p2p): validate BLOCK_TXS in BatchTxRequester (AztecProtocol#23371)
chore(p2p): simplify IBatchRequestTxValidator (AztecProtocol#23373)
feat(sequencer): AutomineSequencer for single-sequencer e2e tests
(AztecProtocol#23354)
fix(prover): wait for previous epoch to be proven (AztecProtocol#23458)
chore: collocate provers (AztecProtocol#23439)
chore: rm staging-ignition (AztecProtocol#23440)
chore: rm unused networks (AztecProtocol#23441)
test(e2e): migrate block_building, multi_validator_node,
publisher_funding, invalid_checkpoint_proposal to pipelining (AztecProtocol#23414)
fix(archiver): reconcile local blocks with L1 checkpoints by block
number (AztecProtocol#23461)
feat: Updated slash conditions on block proposals (AztecProtocol#23466)
test(e2e): migrate HA full test to pipelining (AztecProtocol#23463)
chore: update resource profiles (AztecProtocol#23442)
chore: update debug log levels (AztecProtocol#23456)
test: fix flaky sentinel_status_slash by asserting the fault on the
checkpoint slot (AztecProtocol#23483)
feat(slasher): slash checkpoint equivocation between P2P and L1 (A-980)
(AztecProtocol#23436)
refactor(slasher): rename ATTESTED_DESCENDANT_OF_INVALID ->
PROPOSED_DESCENDANT_OF_CHECKPOINT_WITH_INVALID_ATTESTATIONS (AztecProtocol#23468)
fix: reject block proposals in poisoned slots (AztecProtocol#23411)
fix: retry nargo dep + solc downloads to survive transient DNS drops
(AztecProtocol#23490)
fix: enrich json-rpc tracing (AztecProtocol#23412)
feat: add trace export controls (AztecProtocol#23413)
test(e2e): assert no equivocation offenses in HA full test (AztecProtocol#23496)
test: cover invalid checkpoint proposal slashing (AztecProtocol#23503)
test(e2e): migrate more e2e suites to proposer pipelining (AztecProtocol#23482)
test: flag e2e_slashing_attested_invalid_proposal as flake under
pipelining (AztecProtocol#23501)
test: flag e2e_p2p_duplicate_proposal_slash as flake under pipelining
(AztecProtocol#23515)
test(e2e): require cross-observer agreement on sentinel fault slot
(AztecProtocol#23513)
test: flag e2e_ha_full afterAll hook timeout as flake under pipelining
(AztecProtocol#23524)
fix(e2e): propagate l1ContractsArgs into node config so archiver matches
L1 (AztecProtocol#23514)
test: flag e2e_multi_validator_node_key_store P2P tx-dropped failure as
flake (AztecProtocol#23528)
test(cheat-codes): retry warpL2TimeAtLeastTo in-current-slot test on L1
race (AztecProtocol#23533)
test(e2e_ha_full): parallel HA peer node teardown with per-node deadline
(AztecProtocol#23539)
test: flag e2e_ha_full as flake under HA pipelining (AztecProtocol#23541)
test(ci): skip e2e_ha_full entirely on merge-train/spartan (AztecProtocol#23542)
test(ci): skip e2e_multi_validator_node_key_store entirely on
merge-train/spartan (AztecProtocol#23544)
END_COMMIT_OVERRIDE
Labels: ci, ci-skip, claudebox

2 participants