chore: conflicts by alexghr · Pull Request #24019 · AztecProtocol/aztec-packages

alexghr · 2026-06-11T14:39:03Z

.

Merge-queue runs route through `multi_job_run`, which pipes the runner-side orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`), so the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR modes called `bootstrap_ec2` directly, so that output only reached the GitHub Actions console; you had to leave the dashboard to see which instance was created. Route the PR-facing single-instance modes (fast/docs/barretenberg/barretenberg-full, full/full-no-test-cache, chonk-input-update) through `multi_job_run` with a single job, matching merge-queue. The job id is kept as `x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix keeps its raw (un-denoised) output but now pipes through `cache_log` so it too gets a parent log.

Fix A-1163

bootstrap_ec2 terminates any existing instance sharing the target Name tag, to reap orphans left by a cancelled GA run on the same ref. But the name was just <ref>_<arch>[_postfix], with no repo component — so aztec-packages and aztec-packages-private, which build the same tags/refs concurrently under the same OIDC role, computed identical names and reaped each other's live instances. Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public run's pre-launch reap terminated the private run's in-progress arm64 release instance ~7 min in, failing that build. Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default aztec-packages). The key stays stable across re-runs within a repo, so the intended orphan cleanup still works; it only stops the two repos from colliding. ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.

#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the target `Name` tag before launching. This intentionally reaps orphans left when a GA run is cancelled (e.g. by a new push) on the same ref. But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**, so `aztec-packages` and `aztec-packages-private`, which build the same tags/refs concurrently under the **same OIDC role**, computed identical names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos. Instance `i-02e5d6a6c148ec726` (`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the public repo's run** for the same tag (its pre-launch reap step), ~7 min in, failing the private build. CloudTrail confirms a `TerminateInstances` from a different `ci3-<run_id>` session, not a spot interruption.

## Fix

Prefix the instance name with the repo basename (`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local runs):

- **Within a repo**, the key is unchanged in spirit (`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of the same ref, so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private → `aztec-packages-private_…`, so they no longer match and can't reap each other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip` dev commands) is kept in sync so it still resolves instances launched by a CI run for the same repo.

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is ~61 chars. The reap match uses the full `Name` tag, so the cosmetic 63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code won't be reaped by name-match from new runs; they fall back to the shutdown timer / 1.5h reaper. Self-heals within a couple hours.
- This stops the *collision*. Whether public **and** private *should* both build the same nightly tag (duplicated work) is a separate question; happy to follow up if you want one gated off.
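The naming scheme described above can be sketched as a small shell function. This is only an illustration: `instance_name` here is a hypothetical re-creation rather than the actual helper in `ci.sh` or `ci3/bootstrap_ec2`, and the dot-to-underscore sanitizing is an assumption inferred from the observed instance name.

```shell
#!/bin/sh
# Sketch only: a hypothetical re-creation, not the real ci.sh helper.
# Builds <repo>_<ref>_<arch>[_<postfix>] from the repo basename.
instance_name() (
  # Assumption: dots in the ref become underscores, as in the observed
  # v5_0_0-nightly_20260610_arm64_a-release name.
  ref=$(printf '%s' "$1" | tr '.' '_')
  arch=$2
  postfix=${3:-}
  # GITHUB_REPOSITORY is "owner/repo" on GitHub Actions; default for local runs.
  repo=${GITHUB_REPOSITORY:-aztec-packages}
  repo=${repo##*/}
  name="${repo}_${ref}_${arch}"
  if [ -n "$postfix" ]; then
    name="${name}_${postfix}"
  fi
  printf '%s\n' "$name"
)

# Same tag, arch and postfix in both repos now yield different names.
GITHUB_REPOSITORY=AztecProtocol/aztec-packages
public=$(instance_name v5.0.0-nightly.20260610 arm64 a-release)
GITHUB_REPOSITORY=AztecProtocol/aztec-packages-private
private=$(instance_name v5.0.0-nightly.20260610 arm64 a-release)
echo "$public"
echo "$private"
```

Because the prefix depends only on the repository, a re-run of the same ref in the same repo still computes the same name, which is what keeps the orphan cleanup working.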

## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that includes the **spot/instance request** (which instance type was created, spot vs on-demand). Standard PR runs don't: to see what instance a PR run got, you have to leave the dashboard and dig into the GitHub Actions console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line from `aws_request_instance_type`) is printed on the GA runner, *before* the remote build streams to its per-instance `CI_LOG_ID` log. Where that runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with `PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly (no `cache_log`, no parent log), so the instance request only reached the GA console.

## Change

Route the PR-facing single-instance modes through the same `multi_job_run` path (with a single job), so they get an identical `"CI run" $RUN_ID` parent log with the instance request visible and the build log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status name is **unchanged** (no impact on required checks).

`socket-fix` (which takes extra args and is an interactive debug mode) keeps its raw, un-denoised output but now pipes through `cache_log` so it also gets a parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the converted modes, matching merge-queue; the full log lives in the dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the job id (e.g. `x-fast`), so the EC2 `Name` tag becomes `<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path (`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised locally (needs the GA + AWS orchestration), but **this PR's own CI run is the test**: the `fast` job should now produce a `CI run` parent log on the dashboard showing the instance request, reachable without opening GitHub Actions.
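The shape of the change (a single job routed through the same pipe that merge-queue uses) can be illustrated with stand-ins. Every `*_stub` function, `LOG_DIR`, and the log file layout below are hypothetical; the real `cache_log`, `bootstrap_ec2`, and `multi_job_run` in `ci.sh` do considerably more, and the real path runs jobs via `parallel` rather than a loop.

```shell
#!/bin/sh
# Sketch only: every *_stub function is a hypothetical stand-in, not the
# real cache_log, bootstrap_ec2, or multi_job_run from ci.sh.
LOG_DIR=$(mktemp -d)
RUN_ID=demo-run

# Stand-in for cache_log: store stdin under the log id and echo it through.
# Args: title (unused here), log id.
cache_log_stub() {
  tee "$LOG_DIR/$2.log"
}

# Stand-in for bootstrap_ec2: runner-side output first, then the build.
bootstrap_stub() {
  echo "instance request for job $1 (stub)"
  echo "build output for job $1 (stub)"
}

# Stand-in for multi_job_run: all jobs share one parent log.
multi_job_run_stub() {
  for job in "$@"; do
    bootstrap_stub "$job"
  done | cache_log_stub "CI run" "$RUN_ID"
}

# A PR mode becomes a single-job run; the job id keeps the x-$cmd form.
cmd=fast
multi_job_run_stub "x-$cmd"
```

The point of the sketch is only that the runner-side line now passes through the parent-log pipe instead of going straight to the console.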

## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly

`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true' &&
  github.repository == 'AztecProtocol/aztec-packages-private' &&
  startsWith(github.ref_name, 'v5.')
) ||
contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.

### 2. Stop tagging `next` with a nightly tag in public

`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.

## Why

Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
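To check the gating logic by hand, the workflow expression can be mirrored as a shell predicate. `should_run_scenarios` and its argument order are hypothetical; the real gate is the GitHub Actions expression in `ci3.yml`, and the tag names in the demo are illustrative.

```shell
#!/bin/sh
# Sketch only: a hypothetical shell mirror of the workflow expression.
# Args: is_current ("true"/"false"), repository, ref_name, has_label ("true"/"false")
should_run_scenarios() {
  if [ "$4" = true ]; then
    return 0   # manual ci-network-scenario PR label always opts in
  fi
  [ "$1" = true ] || return 1
  [ "$2" = "AztecProtocol/aztec-packages-private" ] || return 1
  case "$3" in
    v5.*) return 0 ;;   # v5-next nightly tags look like v5.x.x-nightly.*
    *) return 1 ;;      # e.g. v6.x.x-nightly.* from next
  esac
}

# Illustrative checks: only the private v5 nightly passes without a label.
for ref in v5.0.0-nightly.20260610 v6.0.0-nightly.20260610; do
  if should_run_scenarios true AztecProtocol/aztec-packages-private "$ref" false; then
    echo "$ref: run scenarios"
  else
    echo "$ref: skip"
  fi
done
```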

…letes (#24012)

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow ci3.yml` (which returns only ~20 newest runs) and filtered by `headSha` client-side. By the time the deploy polled, the 03:04 nightly run had aged off that first page, so the match never fired and the script timed out, even though the run existed.
2. **Conclusion gated the deploy.** Even once found, `gh run watch --exit-status` would fail the deploy if the CI3 nightly itself was red (e.g. #2208). The nightly bundles many jobs, so an unrelated red job blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>` via `gh api`, which filters server-side by SHA and finds the run instantly no matter how old it is.
2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion no longer gates), and instead gate specifically on the two release jobs: the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and arm64 (`ci/a-release`). These are posted as **GitHub commit statuses** on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status ci/<job-id>`). The script now waits for both statuses to reach a terminal state (polling, since the runner posts them asynchronously) and fails only if either is not `success`. It still fails if no CI3 run ever appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs succeeded, independent of unrelated nightly failures.
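The two-part fix can be sketched roughly as follows. All function names are hypothetical and this is not the deploy script itself. The run lookup uses the endpoint quoted above; reading the statuses from the REST commit-statuses endpoint is an assumption about how the script obtains them. The two `gh` helpers are defined but not called, since they need an authenticated `gh`.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not the deploy script's own.

# Server-side SHA filter: prints the newest matching ci3.yml run id, if any.
# Args: repo ("owner/name"), sha.
find_ci3_run() {
  gh api "repos/$1/actions/workflows/ci3.yml/runs?head_sha=$2" \
    --jq '.workflow_runs[0].id // empty'
}

# Latest state of one commit-status context, or "missing" if not posted yet.
# Assumption: statuses come from the REST commit-statuses endpoint.
# Args: repo, sha, context (e.g. ci/x-release).
status_state() {
  gh api "repos/$1/commits/$2/statuses?per_page=100" \
    --jq "[.[] | select(.context == \"$3\")][0].state // \"missing\""
}

# A commit status is terminal once it is success, failure, or error.
is_terminal() {
  case "$1" in
    success|failure|error) return 0 ;;
    *) return 1 ;;
  esac
}

# Decide what the deploy should do given the ci/x-release and ci/a-release
# states. Prints one of: wait, pass, fail.
release_gate() (
  for state in "$1" "$2"; do
    is_terminal "$state" || { echo wait; exit 0; }
  done
  if [ "$1" = success ] && [ "$2" = success ]; then
    echo pass
  else
    echo fail
  fi
)

# Offline demo of the gate on hand-written states.
release_gate success pending
release_gate success success
release_gate success failure
```

A polling loop would call `status_state` for both contexts, feed the results to `release_gate`, and sleep while it prints `wait`; unrelated red jobs in the nightly never enter the decision.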

charlielye and others added 11 commits June 9, 2026 16:50

chore: deployments

01f1bc5

Fix A-1163

chore: deployments (#23959)

7e94c2c

Fix A-1163

Merge remote-tracking branch 'origin/next' into ag/fix-mt-conflicts

73f9207

alexghr requested a review from charlielye as a code owner June 11, 2026 14:39

alexghr enabled auto-merge (squash) June 11, 2026 14:39

alexghr mentioned this pull request Jun 11, 2026

feat: merge-train/spartan #23971

Open

ludamad approved these changes Jun 11, 2026


alexghr merged commit d6187fc into merge-train/spartan Jun 11, 2026
12 checks passed

alexghr deleted the ag/fix-mt-conflicts branch June 11, 2026 16:34

alexghr merged 11 commits into merge-train/spartan from ag/fix-mt-conflicts

5 participants
