chore: conflicts#24019
Merged
Merge-queue runs route through `multi_job_run`, which pipes the runner-side orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`), so the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR modes called `bootstrap_ec2` directly, so that output only reached the GitHub Actions console; you had to leave the dashboard to see which instance was created. Route the PR-facing single-instance modes (fast/docs/barretenberg/barretenberg-full, full/full-no-test-cache, chonk-input-update) through `multi_job_run` with a single job, matching merge-queue. The job id is kept as `x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix keeps its raw (un-denoised) output but now pipes through `cache_log` so it too gets a parent log.
Fix A-1163
bootstrap_ec2 terminates any existing instance sharing the target Name tag, to reap orphans left by a cancelled GA run on the same ref. But the name was just <ref>_<arch>[_postfix], with no repo component — so aztec-packages and aztec-packages-private, which build the same tags/refs concurrently under the same OIDC role, computed identical names and reaped each other's live instances. Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public run's pre-launch reap terminated the private run's in-progress arm64 release instance ~7 min in, failing that build. Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default aztec-packages). The key stays stable across re-runs within a repo, so the intended orphan cleanup still works; it only stops the two repos from colliding. ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.
#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the target `Name` tag before launching. This intentionally reaps orphans left when a GA run is cancelled (e.g. by a new push) on the same ref. But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**, so `aztec-packages` and `aztec-packages-private`, which build the same tags/refs concurrently under the **same OIDC role**, computed identical names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos. Instance `i-02e5d6a6c148ec726` (`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the public repo's run** for the same tag (its pre-launch reap step), ~7 min in, failing the private build. CloudTrail confirms a `TerminateInstances` from a different `ci3-<run_id>` session, not a spot interruption.

## Fix

Prefix the instance name with the repo basename (`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local runs):

- **Within a repo**, the key is unchanged in spirit (`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of the same ref, so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private → `aztec-packages-private_…`, so they no longer match and can't reap each other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip` dev commands) is kept in sync so it still resolves instances launched by a CI run for the same repo.

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is ~61 chars. The reap match uses the full `Name` tag, so the cosmetic 63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code won't be reaped by name-match from new runs; they fall back to the shutdown timer / 1.5h reaper. Self-heals within a couple hours.
- This stops the *collision*. Whether public **and** private *should* both build the same nightly tag (duplicated work) is a separate question; happy to follow up if you want one gated off.
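
For illustration, the naming scheme described above can be sketched roughly as follows. This is not the actual `ci3/bootstrap_ec2` or `ci.sh` code: the function signature and the ref sanitization rule (anything outside `[A-Za-z0-9-]` becomes `_`) are assumptions inferred from the instance name seen in the incident.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a repo-prefixed instance name.
# Usage: instance_name <ref> <arch> [postfix]
instance_name() {
  local ref=$1 arch=$2 postfix=${3:-}

  # Repo basename: strip everything up to the last '/'.
  # Falls back to aztec-packages when GITHUB_REPOSITORY is unset (local runs).
  local repo=${GITHUB_REPOSITORY:-aztec-packages}
  repo=${repo##*/}

  # Assumption: characters other than letters, digits and '-' become '_'.
  local safe_ref=${ref//[^A-Za-z0-9-]/_}

  local name="${repo}_${safe_ref}_${arch}"
  if [ -n "$postfix" ]; then
    name+="_${postfix}"
  fi
  echo "$name"
}
```

With `GITHUB_REPOSITORY=AztecProtocol/aztec-packages-private`, calling `instance_name v5.0.0-nightly.20260610 arm64 a-release` yields a name starting with `aztec-packages-private_`, while the public repo yields one starting with `aztec-packages_`, so a name-match reap in one repo no longer selects the other repo's instance.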
## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that includes the **spot/instance request** (which instance type was created, spot vs on-demand). Standard PR runs don't: to see what instance a PR run got, you have to leave the dashboard and dig into the GitHub Actions console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line from `aws_request_instance_type`) is printed on the GA runner, *before* the remote build streams to its per-instance `CI_LOG_ID` log. Where that runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with `PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly (no `cache_log`, no parent log), so the instance request only reached the GA console.

## Change

Route the PR-facing single-instance modes through the same `multi_job_run` path (with a single job), so they get an identical `"CI run" $RUN_ID` parent log with the instance request visible and the build log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status name is **unchanged** (no impact on required checks).

`socket-fix` (which takes extra args and is an interactive debug mode) keeps its raw, un-denoised output but now pipes through `cache_log` so it also gets a parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the converted modes, matching merge-queue; the full log lives in the dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the job id (e.g. `x-fast`), so the EC2 `Name` tag becomes `<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path (`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised locally (needs the GA + AWS orchestration), but **this PR's own CI run is the test**: the `fast` job should now produce a `CI run` parent log on the dashboard showing the instance request, reachable without opening GitHub Actions.
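
As a rough illustration of why routing through `multi_job_run` makes the instance request show up in the parent log, here is a toy model with stubs. None of these bodies are the real `ci.sh` functions: `cache_log` is stubbed as a `tee` into a local file (`LOG_DIR` is a made-up variable), `bootstrap_ec2` just prints a placeholder line, and a plain loop stands in for `parallel`.

```shell
#!/usr/bin/env bash
# Stub for cache_log: passes stdin through and also keeps a copy as the
# "parent log". The real one publishes to the CI dashboard; the arguments
# mirror the `cache_log "CI run" $RUN_ID` call but the body is an assumption.
cache_log() {
  local title=$1 log_id=$2
  tee "${LOG_DIR:-/tmp}/${log_id}.log"
}

# Stub for bootstrap_ec2: emits the kind of runner-side output that used to
# reach only the GitHub Actions console.
bootstrap_ec2() {
  echo "[stub] instance request for job ${INSTANCE_POSTFIX}, parent log ${PARENT_LOG_ID}"
}

# One job: wrap bootstrap_ec2 so it knows its parent log and job id.
run() {
  local job_id=$1
  PARENT_LOG_ID=$RUN_ID INSTANCE_POSTFIX=$job_id bootstrap_ec2
}

# All jobs share one pipe into the parent log, so runner-side output from
# every job (even a single one) is captured there.
multi_job_run() {
  local job
  for job in "$@"; do
    run "$job"
  done | cache_log "CI run" "$RUN_ID"
}
```

In this model, a PR mode that calls `bootstrap_ec2` directly never touches `cache_log`, whereas `RUN_ID=42 multi_job_run x-fast` writes the instance-request line to both the console and `42.log`.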
## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly

`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true' &&
  github.repository == 'AztecProtocol/aztec-packages-private' &&
  startsWith(github.ref_name, 'v5.')
) ||
contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.

### 2. Stop tagging `next` with a nightly tag in public

`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.

## Why

Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
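
To make the gating logic easier to reason about, the same predicate can be mirrored as a small shell function. This is only a sketch for checking the truth table by hand; the function name and the space-separated `labels` argument are invented here, and unlike GitHub expressions (which compare strings case-insensitively) this version is case-sensitive.

```shell
#!/usr/bin/env bash
# Hypothetical mirror of the ci-network-scenario `if:` condition.
# Usage: should_run_network_scenarios <is_current> <repository> <ref_name> [labels]
should_run_network_scenarios() {
  local is_current=$1 repository=$2 ref_name=$3 labels=${4:-}

  # Nightly path: current nightly tag, private repo, and a v5.* tag name.
  if [[ $is_current == true \
        && $repository == AztecProtocol/aztec-packages-private \
        && $ref_name == v5.* ]]; then
    return 0
  fi

  # Manual path: the PR carries the ci-network-scenario label.
  # Padding with spaces gives an exact-element match on the label list.
  [[ " $labels " == *" ci-network-scenario "* ]]
}
```

For example, a current `v5.0.0-nightly.20260610` tag in the private repo passes, the same tag in the public repo does not, and any ref passes when the `ci-network-scenario` label is present.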
…letes (#24012)

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow ci3.yml` (which returns only ~20 newest runs) and filtered by `headSha` client-side. By the time the deploy polled, the 03:04 nightly run had aged off that first page, so the match never fired and the script timed out, even though the run existed.
2. **Conclusion gated the deploy.** Even once found, `gh run watch --exit-status` would fail the deploy if the CI3 nightly itself was red (e.g. #2208). The nightly bundles many jobs, so an unrelated red job blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>` via `gh api`, which filters server-side by SHA and finds the run instantly no matter how old it is.
2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion no longer gates), and instead gate specifically on the two release jobs: the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and arm64 (`ci/a-release`). These are posted as **GitHub commit statuses** on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status ci/<job-id>`). The script now waits for both statuses to reach a terminal state (polling, since the runner posts them asynchronously) and fails only if either is not `success`. It still fails if no CI3 run ever appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs succeeded, independent of unrelated nightly failures.
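
The two steps above could look roughly like the sketch below. It is not the deploy script itself: the function names, the `POLL_INTERVAL` variable and the exit-code convention are made up for illustration, it assumes the commit statuses endpoint lists the newest status first, and it omits the overall timeout a real script would need.

```shell
#!/usr/bin/env bash
# Find the CI3 run id for a commit, filtering server-side by SHA.
# Prints nothing if no run exists yet.
find_ci3_run() {
  local repo=$1 sha=$2
  gh api "repos/${repo}/actions/workflows/ci3.yml/runs?head_sha=${sha}" \
    --jq '.workflow_runs[0].id // empty'
}

# Latest state of one commit-status context; "pending" if not posted yet.
latest_status() {
  local repo=$1 sha=$2 context=$3
  gh api "repos/${repo}/commits/${sha}/statuses?per_page=100" \
    --jq "[.[] | select(.context == \"${context}\")][0].state // \"pending\""
}

# Pure decision step. Returns 0 if every state is success, 2 if any state is
# still pending, and 1 if all are terminal but at least one is not success.
release_gate() {
  local state failed=0
  for state in "$@"; do
    case $state in
      pending) return 2 ;;
      success) ;;
      *) failed=1 ;;   # failure or error
    esac
  done
  return "$failed"
}

# Poll both release-build statuses until they are terminal.
wait_for_release_builds() {
  local repo=$1 sha=$2 x a rc
  while :; do
    x=$(latest_status "$repo" "$sha" ci/x-release)
    a=$(latest_status "$repo" "$sha" ci/a-release)
    rc=0
    release_gate "$x" "$a" || rc=$?
    [ "$rc" -eq 2 ] || return "$rc"
    sleep "${POLL_INTERVAL:-30}"
  done
}
```

Keeping `release_gate` free of network calls makes the gating rule easy to check in isolation: an unrelated red job never enters the decision, because only the `ci/x-release` and `ci/a-release` states are passed in.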
ludamad
approved these changes
Jun 11, 2026