chore: conflicts #24019

Merged
alexghr merged 11 commits into merge-train/spartan from ag/fix-mt-conflicts
Jun 11, 2026
@alexghr alexghr commented Jun 11, 2026

.

charlielye and others added 11 commits June 9, 2026 16:50
Merge-queue runs route through `multi_job_run`, which pipes the runner-side
orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`) — so
the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR
modes called `bootstrap_ec2` directly, so that output only reached the GitHub
Actions console; you had to leave the dashboard to see which instance was
created.

Route the PR-facing single-instance modes (fast/docs/barretenberg/
barretenberg-full, full/full-no-test-cache, chonk-input-update) through
`multi_job_run` with a single job, matching merge-queue. The job id is kept as
`x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix
keeps its raw (un-denoised) output but now pipes through `cache_log` so it too
gets a parent log.
bootstrap_ec2 terminates any existing instance sharing the target Name tag, to
reap orphans left by a cancelled GA run on the same ref. But the name was just
<ref>_<arch>[_postfix], with no repo component — so aztec-packages and
aztec-packages-private, which build the same tags/refs concurrently under the
same OIDC role, computed identical names and reaped each other's live instances.

Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public
run's pre-launch reap terminated the private run's in-progress arm64 release
instance ~7 min in, failing that build.

Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default
aztec-packages). The key stays stable across re-runs within a repo, so the
intended orphan cleanup still works; it only stops the two repos from colliding.
ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.
#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the
target `Name` tag before launching — this intentionally reaps orphans
left when a GA run is cancelled (e.g. by a new push) on the same ref.
But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**,
so `aztec-packages` and `aztec-packages-private` — which build the same
tags/refs concurrently under the **same OIDC role** — computed identical
names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos.
Instance `i-02e5d6a6c148ec726`
(`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private
repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the
public repo's run** for the same tag (its pre-launch reap step), ~7 min
in — failing the private build. CloudTrail confirms a
`TerminateInstances` from a different `ci3-<run_id>` session, not a spot
interruption.

## Fix

Prefix the instance name with the repo basename
(`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local
runs):

- **Within a repo**, the key is unchanged in spirit
(`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of
the same ref — so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private →
`aztec-packages-private_…`, so they no longer match and can't reap each
other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip`
dev commands) is kept in sync so it still resolves instances launched by
a CI run for the same repo.
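
As a rough illustration of the naming scheme described above, the sketch below builds a `<repo>_<ref>_<arch>[_<postfix>]` name. The function body, its argument order, and the assumption that the ref is already sanitized (dots replaced with underscores, as in the observed instance name) are illustrative and not taken from `ci3/bootstrap_ec2` itself.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the prefixed instance name; not the actual ci3 code.
# Args: <sanitized ref> <arch> [postfix]
instance_name() {
  local ref=$1 arch=$2 postfix=${3:-}
  # Default to the public repo name when GITHUB_REPOSITORY is unset (local runs).
  local repo=${GITHUB_REPOSITORY:-aztec-packages}
  # Keep only the basename, e.g. AztecProtocol/aztec-packages -> aztec-packages.
  repo=${repo##*/}
  local name="${repo}_${ref}_${arch}"
  if [ -n "$postfix" ]; then
    name+="_${postfix}"
  fi
  echo "$name"
}
```

Called with `GITHUB_REPOSITORY=AztecProtocol/aztec-packages-private` and the same ref and arch, the function yields a name that differs from the public repo's, which is what keeps the pre-launch reap from matching across repos.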

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is
~61 chars. The reap match uses the full `Name` tag, so the cosmetic
63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code
won't be reaped by name-match from new runs; they fall back to the
shutdown timer / 1.5h reaper. This self-heals within a couple of hours.
- This stops the *collision*. Whether public **and** private *should*
both build the same nightly tag (duplicated work) is a separate question
— happy to follow up if you want one gated off.
## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that
includes the **spot/instance request** (which instance type was created,
spot vs on-demand). Standard PR runs don't — to see what instance a PR
run got, you have to leave the dashboard and dig into the GitHub Actions
console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line
from `aws_request_instance_type`) is printed on the GA runner, *before*
the remote build streams to its per-instance `CI_LOG_ID` log. Where that
runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything
into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI
run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with
`PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log
and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly — no `cache_log`, no
parent log — so the instance request only reached the GA console.
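
The difference between the two paths can be illustrated with stand-ins. In the sketch below, `parent_log` and `fake_bootstrap` are hypothetical placeholders for `cache_log` and `bootstrap_ec2`; they only show that runner-side output reaches a log when the call sits inside the pipe, and say nothing about how the real helpers are implemented.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins; the real cache_log and bootstrap_ec2 do much more.

# Stand-in for cache_log: copy stdin to stdout and append it to a log file.
parent_log() {
  tee -a "$1"
}

# Stand-in for bootstrap_ec2: prints a runner-side line, then build output.
fake_bootstrap() {
  echo "Requesting spot fleet…"
  echo "build output for $1"
}

# PR-mode path before the change: output goes to the console only.
run_direct() {
  fake_bootstrap "$1"
}

# Merge-queue style path: the same output also lands in the parent log.
run_routed() {
  local job=$1 log=$2
  fake_bootstrap "$job" | parent_log "$log"
}
```

After `run_routed x-fast /tmp/ci-run.log`, the log file holds the instance request line, whereas `run_direct x-fast` leaves no log behind.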

## Change

Route the PR-facing single-instance modes through the same
`multi_job_run` path (with a single job), so they get an identical `"CI
run" $RUN_ID` parent log with the instance request visible and the build
log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status
name is **unchanged** (no impact on required checks). `socket-fix`
(which takes extra args and is an interactive debug mode) keeps its raw,
un-denoised output but now pipes through `cache_log` so it also gets a
parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the
converted modes, matching merge-queue; the full log lives in the
dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the
job id (e.g. `x-fast`), so the EC2 `Name` tag becomes
`<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path
(`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised
locally (needs the GA + AWS orchestration), but **this PR's own CI run
is the test**: the `fast` job should now produce a `CI run` parent log
on the dashboard showing the instance request, reachable without opening
GitHub Actions.
## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly
`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true'
  && github.repository == 'AztecProtocol/aztec-packages-private'
  && startsWith(github.ref_name, 'v5.')
)
|| contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.
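
For anyone who wants to check the gating logic outside of Actions, the condition can be mirrored in shell. This is only a sketch: the function name and its positional arguments are assumptions, and the `v6` tag used when trying it out is just an example of a `next` nightly.

```shell
#!/usr/bin/env bash
# Hypothetical mirror of the workflow condition above.
# Args: <is_current: true|false> <repository> <ref_name> [has_label: true|false]
should_run_network_scenario() {
  local is_current=$1 repository=$2 ref_name=$3 has_label=${4:-false}
  # The manual PR-label path always enables the job.
  if [ "$has_label" = true ]; then
    return 0
  fi
  # Nightly path: current tag, private repo, and a tag starting with "v5.".
  [ "$is_current" = true ] &&
    [ "$repository" = "AztecProtocol/aztec-packages-private" ] &&
    [[ "$ref_name" == v5.* ]]
}
```

With these inputs, a current `v5.0.0-nightly.20260610` tag in the private repo passes, while the same tag in the public repo or a `v6.` tag in the private repo does not.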

### 2. Stop tagging `next` with a nightly tag in public
`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.

## Why
Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
…letes (#24012)

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow
ci3.yml` (which returns only ~20 newest runs) and filtered by `headSha`
client-side. By the time the deploy polled, the 03:04 nightly run had
aged off that first page, so the match never fired and the script timed
out — even though the run existed.

2. **Conclusion gated the deploy.** Even once found, `gh run watch
--exit-status` would fail the deploy if the CI3 nightly itself was red
(e.g. #2208). The nightly bundles many jobs, so an unrelated red job
blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>`
via `gh api`, which filters server-side by SHA and finds the run
instantly no matter how old it is.

2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion
no longer gates), and instead gate specifically on the two release jobs
— the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and
arm64 (`ci/a-release`). These are posted as **GitHub commit statuses**
on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status
ci/<job-id>`). The script now waits for both statuses to reach a
terminal state (polling, since the runner posts them asynchronously) and
fails only if either is not `success`. It still fails if no CI3 run ever
appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs
succeeded, independent of unrelated nightly failures.
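
The two steps of the fix could look roughly like the sketch below. It is not the deploy script itself: the function names, the 30-second poll interval, and the one-hour default timeout are assumptions. The `gh api` calls use the workflow-runs endpoint quoted above and the combined commit status endpoint, which reports the latest state per context.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names, interval and timeout are assumptions.

# Step 1: find the CI3 run for a commit, filtered server-side by SHA.
find_ci3_run() {
  local repo=$1 sha=$2
  gh api "repos/$repo/actions/workflows/ci3.yml/runs?head_sha=$sha" \
    --jq '.workflow_runs[0].id // empty'
}

# Latest state of one status context (e.g. ci/x-release); empty if not posted yet.
status_state() {
  local repo=$1 sha=$2 context=$3
  gh api "repos/$repo/commits/$sha/status" \
    --jq ".statuses[] | select(.context == \"$context\") | .state"
}

# Returns 0 if all states are success, 1 if all are terminal but one is not
# success, and 2 if any state is still pending or missing.
gate_decision() {
  local state failed=0
  for state in "$@"; do
    case "$state" in
      success) ;;
      failure|error) failed=1 ;;
      *) return 2 ;;
    esac
  done
  return "$failed"
}

# Step 2: poll both release statuses until they are terminal or time runs out.
wait_for_release_statuses() {
  local repo=$1 sha=$2 timeout=${3:-3600}
  local deadline=$((SECONDS + timeout)) rc
  while [ "$SECONDS" -lt "$deadline" ]; do
    rc=0
    gate_decision "$(status_state "$repo" "$sha" ci/x-release)" \
                  "$(status_state "$repo" "$sha" ci/a-release)" || rc=$?
    if [ "$rc" -ne 2 ]; then
      return "$rc"
    fi
    sleep 30
  done
  return 1  # timed out waiting for both statuses
}
```

Keeping the decision in `gate_decision`, separate from the `gh` calls, means the pass/fail rule can be checked without network access; a deploy would first require `find_ci3_run` to return a run id and then call `wait_for_release_statuses`.
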
@alexghr alexghr requested a review from charlielye as a code owner June 11, 2026 14:39
@alexghr alexghr enabled auto-merge (squash) June 11, 2026 14:39
@alexghr alexghr merged commit d6187fc into merge-train/spartan Jun 11, 2026
12 checks passed
@alexghr alexghr deleted the ag/fix-mt-conflicts branch June 11, 2026 16:34