chore: merge next into merge-train/spartan (resolve conflicts) #24052

Merged: PhilWindle merged 23 commits into merge-train/spartan from cb/merge-train-spartan-resolve on Jun 12, 2026
Conversation

@AztecBot

Collaborator

Why

merge-train/spartan (PR #23971) has been in the dirty state and has not merged into next for ~2 days. This PR merges current next into the train branch and resolves the conflicts so #23971 becomes mergeable again.

Conflicts resolved

Both conflicts were cron-schedule differences. The train branch's commit chore(ci): align nightly scheduled workflow times (#24045) deliberately set these times, and that alignment is exactly what the train is bringing into next — so the train (HEAD) side was kept in both:

  • .github/workflows/deploy-staging-internal.yml — kept cron: "0 6 * * *" (train) over "0 7 * * *" (next)
  • .github/workflows/nightly-release-tag.yml — kept cron: "0 4 * * *" (train) over "0 2 * * *" (next)

After resolution, origin/next is an ancestor of this branch; the only net file change versus the current train tip is spartan/terraform/gke-cluster/iam.tf (+24, brought in from next).

⚠️ Merge with a merge commit, not squash

This PR carries a real merge commit so that next stays an ancestor of merge-train/spartan. It must be landed with Create a merge commit (hence the ci-no-squash label) — squashing would drop the merge and leave #23971 dirty.


Created by claudebox · group: slackbot

charlielye and others added 23 commits June 9, 2026 16:50
Merge-queue runs route through `multi_job_run`, which pipes the runner-side
orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`) — so
the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR
modes called `bootstrap_ec2` directly, so that output only reached the GitHub
Actions console; you had to leave the dashboard to see which instance was
created.

Route the PR-facing single-instance modes (fast/docs/barretenberg/
barretenberg-full, full/full-no-test-cache, chonk-input-update) through
`multi_job_run` with a single job, matching merge-queue. The job id is kept as
`x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix
keeps its raw (un-denoised) output but now pipes through `cache_log` so it too
gets a parent log.

bootstrap_ec2 terminates any existing instance sharing the target Name tag, to
reap orphans left by a cancelled GA run on the same ref. But the name was just
<ref>_<arch>[_postfix], with no repo component — so aztec-packages and
aztec-packages-private, which build the same tags/refs concurrently under the
same OIDC role, computed identical names and reaped each other's live instances.

Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public
run's pre-launch reap terminated the private run's in-progress arm64 release
instance ~7 min in, failing that build.

Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default
aztec-packages). The key stays stable across re-runs within a repo, so the
intended orphan cleanup still works; it only stops the two repos from colliding.
ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.
#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the
target `Name` tag before launching — this intentionally reaps orphans
left when a GA run is cancelled (e.g. by a new push) on the same ref.
But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**,
so `aztec-packages` and `aztec-packages-private` — which build the same
tags/refs concurrently under the **same OIDC role** — computed identical
names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos.
Instance `i-02e5d6a6c148ec726`
(`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private
repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the
public repo's run** for the same tag (its pre-launch reap step), ~7 min
in — failing the private build. CloudTrail confirms a
`TerminateInstances` from a different `ci3-<run_id>` session, not a spot
interruption.

## Fix

Prefix the instance name with the repo basename
(`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local
runs):

- **Within a repo**, the key is unchanged in spirit
(`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of
the same ref — so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private →
`aztec-packages-private_…`, so they no longer match and can't reap each
other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip`
dev commands) is kept in sync so it still resolves instances launched by
a CI run for the same repo.
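
As a rough illustration of the naming scheme above, the sketch below builds a repo-prefixed name in POSIX shell. The function body and the `tr`-based sanitization of the ref are assumptions inferred from the example name in this description (`v5_0_0-nightly_20260610_arm64_a-release`); the actual logic lives in `ci3/bootstrap_ec2` and `ci.sh` and may differ.

```shell
# Illustrative only: the real implementation lives in ci3/bootstrap_ec2 and ci.sh.
# Usage: instance_name <ref> <arch> [postfix]
instance_name() (
  ref=$1
  arch=$2
  postfix=${3:-}
  # Repo basename, defaulting to aztec-packages for local runs.
  repo=${GITHUB_REPOSITORY:-aztec-packages}
  repo=${repo##*/}
  # Assumption: dots and slashes in the ref are mapped to underscores.
  safe_ref=$(printf '%s' "$ref" | tr './' '__')
  name="${repo}_${safe_ref}_${arch}"
  if [ -n "$postfix" ]; then
    name="${name}_${postfix}"
  fi
  printf '%s\n' "$name"
)
```

With `GITHUB_REPOSITORY=AztecProtocol/aztec-packages-private`, calling `instance_name v5.0.0-nightly.20260610 arm64 a-release` produces a name starting with `aztec-packages-private_`, while the public repo produces one starting with `aztec-packages_`, so the two runs no longer compute the same `Name` tag.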

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is
~61 chars. The reap match uses the full `Name` tag, so the cosmetic
63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code
won't be reaped by name-match from new runs; they fall back to the
shutdown timer / 1.5h reaper. Self-heals within a couple of hours.
- This stops the *collision*. Whether public **and** private *should*
both build the same nightly tag (duplicated work) is a separate question
— happy to follow up if you want one gated off.
## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that
includes the **spot/instance request** (which instance type was created,
spot vs on-demand). Standard PR runs don't — to see what instance a PR
run got, you have to leave the dashboard and dig into the GitHub Actions
console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line
from `aws_request_instance_type`) is printed on the GA runner, *before*
the remote build streams to its per-instance `CI_LOG_ID` log. Where that
runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything
into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI
run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with
`PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log
and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly — no `cache_log`, no
parent log — so the instance request only reached the GA console.

## Change

Route the PR-facing single-instance modes through the same
`multi_job_run` path (with a single job), so they get an identical `"CI
run" $RUN_ID` parent log with the instance request visible and the build
log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status
name is **unchanged** (no impact on required checks). `socket-fix`
(which takes extra args and is an interactive debug mode) keeps its raw,
un-denoised output but now pipes through `cache_log` so it also gets a
parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the
converted modes, matching merge-queue; the full log lives in the
dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the
job id (e.g. `x-fast`), so the EC2 `Name` tag becomes
`<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path
(`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised
locally (needs the GA + AWS orchestration), but **this PR's own CI run
is the test**: the `fast` job should now produce a `CI run` parent log
on the dashboard showing the instance request, reachable without opening
GitHub Actions.
## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly
`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true'
  && github.repository == 'AztecProtocol/aztec-packages-private'
  && startsWith(github.ref_name, 'v5.')
)
|| contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.
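
To make the gating logic easier to reason about, here is the same condition restated as a POSIX shell predicate. It is only a model of the workflow expression, not code from the repo: the function name and argument order are hypothetical, and unlike GitHub's `startsWith`, the `case` match here is case-sensitive.

```shell
# Hypothetical model of the ci-network-scenario condition.
# Usage: should_run_scenarios <is_current> <repository> <ref_name> <has_label>
should_run_scenarios() {
  is_current=$1
  repo=$2
  ref_name=$3
  has_label=$4
  # Manual path: the ci-network-scenario PR label always triggers a run.
  if [ "$has_label" = "true" ]; then
    return 0
  fi
  # Nightly path: current nightly tag, private repo, v5 line only.
  [ "$is_current" = "true" ] || return 1
  [ "$repo" = "AztecProtocol/aztec-packages-private" ] || return 1
  case $ref_name in
    v5.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Under this model a `v6.x.x-nightly.*` tag in the private repo and a `v5.x.x-nightly.*` tag in the public repo both return non-zero, which matches the intent described above.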

### 2. Stop tagging `next` with a nightly tag in public
`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.
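
The repo-dependent branch list can be pictured as a small selector. The sketch below is a shell restatement under that reading, not the workflow's actual matrix expression; `nightly_branches` is a hypothetical name, and `v4-next` is left out because it is tagged by its own dedicated job.

```shell
# Hypothetical selector for the branches the nightly tagger covers.
# Usage: nightly_branches <owner/repo>
nightly_branches() {
  case $1 in
    */aztec-packages-private) echo "next v5-next" ;;
    *) echo "v5-next" ;;
  esac
}
```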

## Why
Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
)

…letes

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow ci3.yml` (which returns only the ~20 newest runs) and filtered by `headSha` client-side. By the time the deploy polled, the 03:04 nightly run had aged off that first page, so the match never fired and the script timed out, even though the run existed.

2. **Conclusion gated the deploy.** Even once found, `gh run watch --exit-status` would fail the deploy if the CI3 nightly itself was red (e.g. #2208). The nightly bundles many jobs, so an unrelated red job blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>` via `gh api`, which filters server-side by SHA and finds the run instantly no matter how old it is.

2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion no longer gates), and instead gate specifically on the two release jobs — the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and arm64 (`ci/a-release`). These are posted as **GitHub commit statuses** on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status ci/<job-id>`). The script now waits for both statuses to reach a terminal state (polling, since the runner posts them asynchronously) and fails only if either is not `success`. It still fails if no CI3 run ever appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs succeeded, independent of unrelated nightly failures.
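
The two steps of the fix can be sketched roughly as follows. This is a minimal shell sketch, not the deploy script itself: the function names are hypothetical, and it assumes the release statuses are read from GitHub's combined-status endpoint for the tag's commit. Only `release_gate` is pure; the other two helpers need an authenticated `gh`.

```shell
# Server-side lookup of the CI3 run for a commit (prints the run id, or nothing).
# Usage: find_ci3_run <owner/repo> <sha>
find_ci3_run() {
  gh api "repos/$1/actions/workflows/ci3.yml/runs?head_sha=$2" \
    --jq '.workflow_runs[0].id // empty'
}

# State of one commit status context, e.g. ci/x-release (empty if not posted yet).
# Usage: status_state <owner/repo> <sha> <context>
status_state() {
  gh api "repos/$1/commits/$2/status" \
    --jq ".statuses[] | select(.context == \"$3\") | .state"
}

# Decide from the two release-job states: wait, pass, or fail.
# Usage: release_gate <x-release state> <a-release state>
release_gate() {
  for state in "$1" "$2"; do
    case $state in
      success|failure|error) ;;          # terminal states
      *) echo wait; return 0 ;;          # pending or not yet posted
    esac
  done
  if [ "$1" = success ] && [ "$2" = success ]; then
    echo pass
  else
    echo fail
  fi
}
```

A polling loop would call `status_state` for `ci/x-release` and `ci/a-release`, feed both results to `release_gate`, sleep while it prints `wait`, and exit non-zero on `fail`. Keeping the decision separate from the API calls makes it easy to check without network access.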
…letes (#24012)

## What

Make all three nightly deployments run the deploy from the tip of `next` (latest scripts + helm) while keeping the correct image for each target network.

### Deploy ref → `next`
- `deploy-staging-internal.yml`, `deploy-staging-public.yml`: pass `ref: next` to `deploy-network.yml` so the `spartan/` deploy scripts and helm charts come from `next`.
- `deploy-next-net.yml` already passed `ref: next` (unchanged).

### `determine-tag` job (staging)
- Check out a single commit at the tip of `next` (`ref: next`, `fetch-depth: 1`) instead of `v5-next` with full history.
- Tag resolution: if an explicit `tag` input is given, use it as-is. Otherwise construct `v5.0.0-nightly.<date>` and verify it actually exists with `git ls-remote --exit-code --tags origin`, failing the deploy early if the nightly tag is missing rather than proceeding to deploy a non-existent image.
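
A minimal sketch of the tag resolution described above, in POSIX shell. The function names are hypothetical, and the use of a UTC `YYYYMMDD` date is an assumption based on the `v5.0.0-nightly.20260610` tag mentioned elsewhere in this PR; the workflow's actual step may be written differently.

```shell
# Print the explicit tag if given, otherwise the constructed nightly tag.
# Usage: resolve_tag <explicit-tag-or-empty> [yyyymmdd]
resolve_tag() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    # Assumption: the nightly date is the current UTC date unless overridden.
    printf 'v5.0.0-nightly.%s\n' "${2:-$(date -u +%Y%m%d)}"
  fi
}

# Succeeds only if the tag exists on origin (exit code 2 from git when missing).
# Usage: tag_exists_on_origin <tag>
tag_exists_on_origin() {
  git ls-remote --exit-code --tags origin "refs/tags/$1" >/dev/null
}

# Example wiring (TAG_INPUT is a hypothetical variable holding the `tag` input):
#   tag=$(resolve_tag "$TAG_INPUT")
#   if [ -z "$TAG_INPUT" ] && ! tag_exists_on_origin "$tag"; then
#     echo "nightly tag $tag not found on origin" >&2
#     exit 1
#   fi
```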

## Why

`deploy-network.yml` checks out `inputs.ref` to run the deploy scripts/helm; when unset it falls back to `github.ref` (default branch on `schedule`, dispatch branch on `workflow_dispatch`), making the scripts/helm implicit and branch-dependent. Pinning to `next` keeps staging on the latest infra while `semver`/`source_tag` continue to select the v5-line image (`v5.0.0-nightly.<date>`), which is the correct image for the staging networks.

The `v5.0.0-nightly.<date>` tag is created on both the public and private repos (the nightly tagger tags `v5-next` on both), so the `git ls-remote origin` check resolves against whichever repo the workflow runs in.
## Summary

- Update the testnet SponsoredFPC address in the networks table and
getting-started guides.
- Adjust release docs guidance to reflect that SponsoredFPC is deployed
on testnet and devnet, but not mainnet.

## Validation

- `yarn spellcheck` from `docs/`
#24026)

Adds an `<agent_and_workflow_restraint>` block to the root `CLAUDE.md`
telling Claude to do work inline in the current session and not spawn
parallel subagents or launch dynamic workflows unless the user
explicitly asks.

## Why

Operators have reported burning through their token budget from a single
prompt that quietly fanned out — in one case a "summarize recent ZK
advancements" query started ~30 agents, and another exhausted a 5h
budget spinning up subagents. Parallel agents and dynamic workflows
multiply spend (≈2x for one helper, far more for a swarm) and the user
can neither see the fan-out coming nor stop it. This appears to be a
current tendency of Fable. The guidance reasserts: handle
search/summarize/research/multi-file edits inline, reserve subagents for
explicit user requests or a single read-heavy isolation case, and never
start a dynamic workflow by default.

Passes the repo's `<editorial_test>`: the line would have prevented the
~30-agent fan-out on an ordinary research prompt described above.

Same change is being opened against `v5-next`, and an equivalent shared
rule is being added in the claudebox repo so it applies to every managed
session.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/7d5ecfdd5f37c5cd) ·
group: `slackbot`*
…4033)

Replicates the change in #24024 directly on a repo branch, and also
applies it to the versioned docs snapshot.

Updates the **Aztec & Noir Developer Office Hours** Google Meet link
from `https://meet.google.com/sdd-rdsr-shu` →
`https://meet.google.com/vev-waao-mab` in:

- `docs/docs-developers/docs/resources/community_calls.md` (current docs
— identical to #24024)
-
`docs/developer_versioned_docs/version-v4.3.1/docs/resources/community_calls.md`
(versioned snapshot — the only versioned copy that still carried the old
link)

No occurrences of the old `sdd-rdsr-shu` link remain anywhere under
`docs/`.
…ing (#24039)

## Problem

The dashboard `grind` option always fails to SSH into the build
instance:

```
Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.
```

The instance launches fine (spot/on-demand fulfilled, IP assigned) but
SSH never connects, so grind cycles through every instance type and
gives up.

## Root cause

CI build boxes were migrated from SSH to **SSM**. In `ci3/bootstrap_ec2`
the default is now `CI_USE_SSH=0` (SSM); only `shell-new` forces SSH,
and `grind-test` does not. So on current `next`, grind runs over SSM
like the rest of CI.

But the dashboard launches grind from a long-lived checkout at
`REPO_PATH` (the `/grind` handler in `rk.py` shells out to `cd
$REPO_PATH && ./ci.sh grind-test ...`). That checkout had drifted to a
pre-SSM commit, so grind alone still took the legacy SSH branch —
launching into the retired SSH security group + `build-instance` key
pair, whose port-22 / key-injection preconditions were torn down during
the SSM lockdown. The stale checkout also explains the old AMI
(`ami-09d27244b23be8891`) in the logs vs. current `next`'s
`ami-067627aa971a1dcbb`.

Nothing kept `REPO_PATH` current: the `ci3-dashboard-deploy.yml`
workflow only rebuilds the `rkapp` Flask container (and is path-filtered
to `ci3/dashboard/**`), so changes to the `ci3/` launcher scripts never
refreshed it.

## Fix

Refresh the launcher checkout to `origin/next` at grind launch time,
before shelling out. This is self-healing and independent of deploys. It
matches the existing design where the launcher always runs
current-`next` orchestration scripts while the grind *target commit* is
checked out on the remote box — so this does **not** restrict which
branch/commit you can grind. If the refresh fails (e.g. transient
network), the error is surfaced in the run log instead of silently
grinding on a stale tree.
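
The refresh step might look roughly like the following. The real change is in `ci3/dashboard/rk.py` (Python); this is a shell rendition for illustration only, and both the function name and the choice of `git fetch` followed by a detached checkout of `origin/next` are assumptions rather than the actual implementation.

```shell
# Hypothetical stand-in for the refresh performed before `./ci.sh grind-test`.
# Runs in a subshell so the caller's working directory is untouched.
# Usage: refresh_launcher_checkout <repo_path>
refresh_launcher_checkout() (
  cd "$1" || { echo "refresh failed: cannot enter $1" >&2; exit 1; }
  git fetch --quiet origin next \
    || { echo "refresh failed: git fetch origin next" >&2; exit 1; }
  git checkout --quiet --detach origin/next \
    || { echo "refresh failed: git checkout origin/next" >&2; exit 1; }
)
```

Each failure path prints a message and returns non-zero, so a caller can surface the error in the run log and stop instead of grinding on a stale tree.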

## Testing

`python3 -m py_compile ci3/dashboard/rk.py` passes. The behavior change
is host-side (requires the dashboard's `REPO_PATH` checkout) and can't
be exercised in unit CI; it will take effect on the next dashboard
deploy. The immediate one-time unblock is still to refresh `REPO_PATH`
on `ci.aztec-labs.com` and restart `rkapp`.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/1c05a513cb601b21) ·
group: `slackbot`*
…n-resolve

# Conflicts:
#	.github/workflows/deploy-staging-internal.yml
#	.github/workflows/nightly-release-tag.yml
AztecBot added labels on Jun 12, 2026:
  • ci-draft: Run CI on draft PRs.
  • ci-no-fail-fast: Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure
  • ci-no-squash
  • claudebox: Owned by claudebox. It can push to this PR.
PhilWindle marked this pull request as ready for review on June 12, 2026 11:55
PhilWindle enabled auto-merge on June 12, 2026 14:30
PhilWindle merged commit 0fcd113 into merge-train/spartan on Jun 12, 2026
51 of 55 checks passed
PhilWindle deleted the cb/merge-train-spartan-resolve branch on June 12, 2026 14:33