chore(ci): align nightly scheduled workflow times #24045
Merged
PhilWindle merged 24 commits on Jun 12, 2026
Merge-queue runs route through `multi_job_run`, which pipes the runner-side orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`), so the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR modes called `bootstrap_ec2` directly, so that output only reached the GitHub Actions console; you had to leave the dashboard to see which instance was created. Route the PR-facing single-instance modes (fast/docs/barretenberg/barretenberg-full, full/full-no-test-cache, chonk-input-update) through `multi_job_run` with a single job, matching merge-queue. The job id is kept as `x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix keeps its raw (un-denoised) output but now pipes through `cache_log` so it too gets a parent log.
Fix A-1163
bootstrap_ec2 terminates any existing instance sharing the target Name tag, to reap orphans left by a cancelled GA run on the same ref. But the name was just <ref>_<arch>[_postfix], with no repo component — so aztec-packages and aztec-packages-private, which build the same tags/refs concurrently under the same OIDC role, computed identical names and reaped each other's live instances. Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public run's pre-launch reap terminated the private run's in-progress arm64 release instance ~7 min in, failing that build. Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default aztec-packages). The key stays stable across re-runs within a repo, so the intended orphan cleanup still works; it only stops the two repos from colliding. ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.
#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the target `Name` tag before launching. This intentionally reaps orphans left when a GA run is cancelled (e.g. by a new push) on the same ref. But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**, so `aztec-packages` and `aztec-packages-private`, which build the same tags/refs concurrently under the **same OIDC role**, computed identical names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos. Instance `i-02e5d6a6c148ec726` (`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the public repo's run** for the same tag (its pre-launch reap step), ~7 min in, failing the private build. CloudTrail confirms a `TerminateInstances` from a different `ci3-<run_id>` session, not a spot interruption.

## Fix

Prefix the instance name with the repo basename (`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local runs):

- **Within a repo**, the key is unchanged in spirit (`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of the same ref, so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private → `aztec-packages-private_…`, so they no longer match and can't reap each other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip` dev commands) is kept in sync so it still resolves instances launched by a CI run for the same repo.

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is ~61 chars. The reap match uses the full `Name` tag, so the cosmetic 63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code won't be reaped by name-match from new runs; they fall back to the shutdown timer / 1.5h reaper. Self-heals within a couple hours.
- This stops the *collision*. Whether public **and** private *should* both build the same nightly tag (duplicated work) is a separate question; happy to follow up if you want one gated off.
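The naming scheme described above can be sketched as a small bash helper. This is an illustration only, not the code in `ci3/bootstrap_ec2` or `ci.sh`: the function name `instance_name_sketch` is hypothetical, and the sanitisation rule (every character other than letters, digits and `-` becomes `_`) is an assumption inferred from the observed name `v5_0_0-nightly_20260610_arm64_a-release`.

```shell
#!/usr/bin/env bash

# Hypothetical helper: builds <repo>_<ref>_<arch>[_<postfix>].
instance_name_sketch() {
  local ref=$1 arch=$2 postfix=${3:-}
  # Repo basename, defaulting to aztec-packages for local runs (as in the PR text).
  local repo=${GITHUB_REPOSITORY:-aztec-packages}
  repo=${repo##*/}
  # Assumed sanitisation: anything except alphanumerics and '-' becomes '_'.
  local safe_ref=${ref//[^[:alnum:]-]/_}
  local name="${repo}_${safe_ref}_${arch}"
  if [ -n "$postfix" ]; then
    name+="_${postfix}"
  fi
  echo "$name"
}
```

With `GITHUB_REPOSITORY=AztecProtocol/aztec-packages-private`, the same tag and architecture yield a name starting with `aztec-packages-private_`, while the public repo yields one starting with `aztec-packages_`, so a name-match reap in one repo cannot hit the other's instance.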
## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that includes the **spot/instance request** (which instance type was created, spot vs on-demand). Standard PR runs don't: to see what instance a PR run got, you have to leave the dashboard and dig into the GitHub Actions console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line from `aws_request_instance_type`) is printed on the GA runner, *before* the remote build streams to its per-instance `CI_LOG_ID` log. Where that runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with `PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly (no `cache_log`, no parent log), so the instance request only reached the GA console.

## Change

Route the PR-facing single-instance modes through the same `multi_job_run` path (with a single job), so they get an identical `"CI run" $RUN_ID` parent log with the instance request visible and the build log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status name is **unchanged** (no impact on required checks).

`socket-fix` (which takes extra args and is an interactive debug mode) keeps its raw, un-denoised output but now pipes through `cache_log` so it also gets a parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the converted modes, matching merge-queue; the full log lives in the dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the job id (e.g. `x-fast`), so the EC2 `Name` tag becomes `<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path (`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised locally (needs the GA + AWS orchestration), but **this PR's own CI run is the test**: the `fast` job should now produce a `CI run` parent log on the dashboard showing the instance request, reachable without opening GitHub Actions.
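To make the routing difference concrete, here is a toy model of the single-job path. None of these functions are the real ones: `cache_log_sketch`, `run_sketch` and `single_job_run_sketch` are hypothetical stand-ins, `LOG_DIR` is an invented local directory in place of the dashboard, and the meaning given to `DUP=1` (also echo the stream to stdout) is an assumption rather than something the PR states.

```shell
#!/usr/bin/env bash

# Stand-in for cache_log: writes stdin to $LOG_DIR/<id>.log.
# Assumption: DUP=1 means "also pass the stream through to stdout".
cache_log_sketch() {
  local title=$1 id=$2   # title is unused in this toy version
  mkdir -p "$LOG_DIR"
  if [ "${DUP:-0}" = 1 ]; then
    tee "$LOG_DIR/$id.log"
  else
    cat > "$LOG_DIR/$id.log"
  fi
}

# Stand-in for run(): emits the runner-side line that previously
# only reached the GitHub Actions console.
run_sketch() {
  local job=$1
  echo "[$job] Requesting instance (parent log: $PARENT_LOG_ID)"
}

# Single-job version of the piped path: the job's runner-side output
# lands in the parent log keyed by RUN_ID.
single_job_run_sketch() {
  local job=$1
  PARENT_LOG_ID=$RUN_ID run_sketch "$job" | DUP=1 cache_log_sketch "CI run" "$RUN_ID"
}
```

Calling `single_job_run_sketch x-fast` with `LOG_DIR` and `RUN_ID` set writes the instance-request line to `$LOG_DIR/$RUN_ID.log` and still prints it, which mirrors the behaviour the PR wants: one place (the parent log) where the request is always visible.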
## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly

`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true' &&
  github.repository == 'AztecProtocol/aztec-packages-private' &&
  startsWith(github.ref_name, 'v5.')
) || contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.

### 2. Stop tagging `next` with a nightly tag in public

`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.

## Why

Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
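The two rules can be restated as plain bash predicates, which is handy for sanity-checking the gating logic outside of GitHub Actions. This is a sketch only: the function names and the positional arguments (`is_current`, `repo`, `ref_name`, `has_label`) are hypothetical, and the real logic lives in the workflow expressions, not in a script.

```shell
#!/usr/bin/env bash

# Mirrors the ci-network-scenario condition:
# (current nightly AND private repo AND v5.* tag) OR PR carries the label.
should_run_scenarios() {
  local is_current=$1 repo=$2 ref_name=$3 has_label=$4
  if [ "$is_current" = true ] &&
     [ "$repo" = "AztecProtocol/aztec-packages-private" ] &&
     [[ $ref_name == v5.* ]]; then
    return 0
  fi
  [ "$has_label" = true ]
}

# Mirrors the repo-dependent nightly tag matrix (v4-next is handled by
# its own dedicated job and is therefore not listed here).
nightly_branches() {
  local repo=$1
  case "$repo" in
    AztecProtocol/aztec-packages-private) echo "next v5-next" ;;
    *) echo "v5-next" ;;
  esac
}
```

For example, a current `v6.0.0-nightly.*` tag in the private repo fails the first predicate unless the PR label is present, which is exactly the case the follow-up tightening was meant to exclude.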
…letes (#24012)

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow ci3.yml` (which returns only ~20 newest runs) and filtered by `headSha` client-side. By the time the deploy polled, the 03:04 nightly run had aged off that first page, so the match never fired and the script timed out, even though the run existed.
2. **Conclusion gated the deploy.** Even once found, `gh run watch --exit-status` would fail the deploy if the CI3 nightly itself was red (e.g. #2208). The nightly bundles many jobs, so an unrelated red job blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>` via `gh api`, which filters server-side by SHA and finds the run instantly no matter how old it is.
2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion no longer gates), and instead gate specifically on the two release jobs: the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and arm64 (`ci/a-release`). These are posted as **GitHub commit statuses** on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status ci/<job-id>`). The script now waits for both statuses to reach a terminal state (polling, since the runner posts them asynchronously) and fails only if either is not `success`. It still fails if no CI3 run ever appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs succeeded, independent of unrelated nightly failures.
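A minimal bash sketch of the two steps in the fix might look as follows. It is not the deploy script itself: the function names, the retry count and the delay are hypothetical, and reading the states from the combined-status endpoint (`commits/<sha>/status`) is one possible way to poll; the PR does not say which endpoint the script uses.

```shell
#!/usr/bin/env bash

# Step 1: find the CI3 run for a commit with a server-side SHA filter.
find_ci3_run() {
  local repo=$1 sha=$2
  gh api "repos/$repo/actions/workflows/ci3.yml/runs?head_sha=$sha" \
    --jq '.workflow_runs[0].id // empty'
}

# Latest state of one commit-status context, or "missing" if not posted yet.
status_state() {
  local repo=$1 sha=$2 context=$3
  gh api "repos/$repo/commits/$sha/status" \
    --jq "[.statuses[] | select(.context == \"$context\") | .state][0] // \"missing\""
}

# Step 2: poll until both release statuses are terminal; succeed only if
# both are "success". tries/delay defaults are illustrative.
wait_for_release_builds() {
  local repo=$1 sha=$2 tries=${3:-60} delay=${4:-30}
  local i ctx state pending
  for ((i = 0; i < tries; i++)); do
    pending=0
    for ctx in ci/x-release ci/a-release; do
      state=$(status_state "$repo" "$sha" "$ctx")
      case "$state" in
        success) ;;
        failure | error) echo "$ctx finished as $state" >&2; return 1 ;;
        *) pending=1 ;;   # pending or not posted yet: keep polling
      esac
    done
    if [ "$pending" -eq 0 ]; then return 0; fi
    sleep "$delay"
  done
  echo "timed out waiting for release statuses" >&2
  return 1
}
```

A caller would first require `find_ci3_run` to return a non-empty id (so a tag with no CI3 run still fails), then call `wait_for_release_builds`. Keeping `status_state` separate makes the polling loop easy to exercise with a stub.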
## What

Make all three nightly deployments run the deploy from the tip of `next` (latest scripts + helm) while keeping the correct image for each target network.

### Deploy ref → `next`

- `deploy-staging-internal.yml`, `deploy-staging-public.yml`: pass `ref: next` to `deploy-network.yml` so the `spartan/` deploy scripts and helm charts come from `next`.
- `deploy-next-net.yml` already passed `ref: next` (unchanged).

### `determine-tag` job (staging)

- Checkout a single commit at the tip of `next` (`ref: next`, `fetch-depth: 1`) instead of `v5-next` with full history.
- Tag resolution: if an explicit `tag` input is given, use it as-is. Otherwise construct `v5.0.0-nightly.<date>` and verify it actually exists with `git ls-remote --exit-code --tags origin`, failing the deploy early if the nightly tag is missing rather than proceeding to deploy a non-existent image.

## Why

`deploy-network.yml` checks out `inputs.ref` to run the deploy scripts/helm; when unset it falls back to `github.ref` (default branch on `schedule`, dispatch branch on `workflow_dispatch`), making the scripts/helm implicit and branch-dependent. Pinning to `next` keeps staging on the latest infra while `semver`/`source_tag` continue to select the v5-line image (`v5.0.0-nightly.<date>`), which is the correct image for the staging networks.

The `v5.0.0-nightly.<date>` tag is created on both the public and private repos (the nightly tagger tags `v5-next` on both), so the `git ls-remote origin` check resolves against whichever repo the workflow runs in.
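The tag-resolution rule in the `determine-tag` job can be sketched in bash as below. The function name `resolve_nightly_tag` and its arguments are hypothetical, and the date format (`YYYYMMDD`, taken in UTC) is an assumption based on tags such as `v5.0.0-nightly.20260610`; the workflow's actual step may differ.

```shell
#!/usr/bin/env bash

# Usage: resolve_nightly_tag [explicit_tag] [date]
# Prints the tag to deploy, or fails if the nightly tag does not exist.
resolve_nightly_tag() {
  local explicit=${1:-}
  local date=${2:-$(date -u +%Y%m%d)}   # assumed format: YYYYMMDD, UTC
  if [ -n "$explicit" ]; then
    # An explicit tag input is used as-is, without verification.
    echo "$explicit"
    return 0
  fi
  local tag="v5.0.0-nightly.$date"
  # --exit-code makes ls-remote exit non-zero when no ref matches.
  if ! git ls-remote --exit-code --tags origin "refs/tags/$tag" >/dev/null; then
    echo "nightly tag $tag not found on origin" >&2
    return 1
  fi
  echo "$tag"
}
```

Failing here, before any deploy step runs, is what turns a missing nightly tag into an early, readable error instead of a deploy that references a non-existent image.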
## Summary

- Update the testnet SponsoredFPC address in the networks table and getting-started guides.
- Adjust release docs guidance to reflect that SponsoredFPC is deployed on testnet and devnet, but not mainnet.

## Validation

- `yarn spellcheck` from `docs/`
#24026)

Adds an `<agent_and_workflow_restraint>` block to the root `CLAUDE.md` telling Claude to do work inline in the current session and not spawn parallel subagents or launch dynamic workflows unless the user explicitly asks.

## Why

Operators have reported burning through their token budget from a single prompt that quietly fanned out: in one case a "summarize recent ZK advancements" query started ~30 agents, and another exhausted a 5h budget spinning up subagents. Parallel agents and dynamic workflows multiply spend (≈2x for one helper, far more for a swarm) and the user can neither see the fan-out coming nor stop it. This appears to be a current tendency of Fable.

The guidance reasserts: handle search/summarize/research/multi-file edits inline, reserve subagents for explicit user requests or a single read-heavy isolation case, and never start a dynamic workflow by default.

Passes the repo's `<editorial_test>`: the line would have prevented the ~30-agent fan-out on an ordinary research prompt described above.

Same change is being opened against `v5-next`, and an equivalent shared rule is being added in the claudebox repo so it applies to every managed session.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/7d5ecfdd5f37c5cd) · group: `slackbot`*
…4033)

Replicates the change in #24024 directly on a repo branch, and also applies it to the versioned docs snapshot.

Updates the **Aztec & Noir Developer Office Hours** Google Meet link from `https://meet.google.com/sdd-rdsr-shu` → `https://meet.google.com/vev-waao-mab` in:

- `docs/docs-developers/docs/resources/community_calls.md` (current docs, identical to #24024)
- `docs/developer_versioned_docs/version-v4.3.1/docs/resources/community_calls.md` (versioned snapshot, the only versioned copy that still carried the old link)

No occurrences of the old `sdd-rdsr-shu` link remain anywhere under `docs/`.
…ing (#24039)

## Problem

The dashboard `grind` option always fails to SSH into the build instance:

```
Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.
```

The instance launches fine (spot/on-demand fulfilled, IP assigned) but SSH never connects, so grind cycles through every instance type and gives up.

## Root cause

CI build boxes were migrated from SSH to **SSM**. In `ci3/bootstrap_ec2` the default is now `CI_USE_SSH=0` (SSM); only `shell-new` forces SSH, and `grind-test` does not. So on current `next`, grind runs over SSM like the rest of CI.

But the dashboard launches grind from a long-lived checkout at `REPO_PATH` (the `/grind` handler in `rk.py` shells out to `cd $REPO_PATH && ./ci.sh grind-test ...`). That checkout had drifted to a pre-SSM commit, so grind alone still took the legacy SSH branch, launching into the retired SSH security group + `build-instance` key pair, whose port-22 / key-injection preconditions were torn down during the SSM lockdown. The stale checkout also explains the old AMI (`ami-09d27244b23be8891`) in the logs vs. current `next`'s `ami-067627aa971a1dcbb`.

Nothing kept `REPO_PATH` current: the `ci3-dashboard-deploy.yml` workflow only rebuilds the `rkapp` Flask container (and is path-filtered to `ci3/dashboard/**`), so changes to the `ci3/` launcher scripts never refreshed it.

## Fix

Refresh the launcher checkout to `origin/next` at grind launch time, before shelling out. This is self-healing and independent of deploys. It matches the existing design where the launcher always runs current-`next` orchestration scripts while the grind *target commit* is checked out on the remote box, so this does **not** restrict which branch/commit you can grind.

If the refresh fails (e.g. transient network), the error is surfaced in the run log instead of silently grinding on a stale tree.

## Testing

`python3 -m py_compile ci3/dashboard/rk.py` passes. The behavior change is host-side (requires the dashboard's `REPO_PATH` checkout) and can't be exercised in unit CI; it will take effect on the next dashboard deploy. The immediate one-time unblock is still to refresh `REPO_PATH` on `ci.aztec-labs.com` and restart `rkapp`.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/1c05a513cb601b21) · group: `slackbot`*
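The refresh step can be illustrated in bash. The actual change lives in `ci3/dashboard/rk.py` (Python), so this is a translation of the idea rather than the code in the PR: the function name is hypothetical, and the choice of `git fetch` followed by a detached checkout of `origin/next` is an assumption about how "refresh to `origin/next`" might be done.

```shell
#!/usr/bin/env bash

# Hypothetical helper: bring the launcher checkout to origin/next, and
# fail loudly instead of continuing on a stale tree.
refresh_launcher_checkout() {
  local repo_path=$1
  if ! git -C "$repo_path" fetch --quiet origin next; then
    echo "refresh failed: could not fetch origin/next in $repo_path" >&2
    return 1
  fi
  if ! git -C "$repo_path" checkout --quiet --detach origin/next; then
    echo "refresh failed: could not check out origin/next in $repo_path" >&2
    return 1
  fi
}
```

A launcher would run something like `refresh_launcher_checkout "$REPO_PATH" && (cd "$REPO_PATH" && ./ci.sh grind-test ...)`, so a failed refresh stops the launch and its message ends up in the run log.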
alexghr approved these changes Jun 12, 2026

This PR is stacked on top of #24044

PhilWindle approved these changes Jun 12, 2026

PhilWindle added a commit that referenced this pull request Jun 12, 2026
## Why

`merge-train/spartan` ([PR #23971](#23971)) has been in state `dirty` and has not merged into `next` for ~2 days. This PR merges current `next` into the train branch and resolves the conflicts so #23971 becomes mergeable again.

## Conflicts resolved

Both conflicts were cron-schedule differences. The train branch's commit `chore(ci): align nightly scheduled workflow times (#24045)` deliberately set these times, and that alignment is exactly what the train is bringing into `next`, so the train (HEAD) side was kept in both:

- `.github/workflows/deploy-staging-internal.yml`: kept `cron: "0 6 * * *"` (train) over `"0 7 * * *"` (next)
- `.github/workflows/nightly-release-tag.yml`: kept `cron: "0 4 * * *"` (train) over `"0 2 * * *"` (next)

After resolution, `origin/next` is an ancestor of this branch; the only net file change versus the current train tip is `spartan/terraform/gke-cluster/iam.tf` (+24, brought in from `next`).

## ⚠️ Merge with a merge commit, not squash

This PR carries a real merge commit so that `next` stays an ancestor of `merge-train/spartan`. It must be landed with **Create a merge commit** (hence the `ci-no-squash` label); squashing would drop the merge and leave #23971 dirty.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/011e27d5de05b232) · group: `slackbot`*
Updates the scheduled (cron) times for the nightly workflows to the agreed schedule. Stacked on top of #24044 (base branch `ag/chore-mt-conflicts-2`).

All times UTC:

| Workflow | Cron (minute hour) |
| --- | --- |
| `nightly-release-tag.yml` (next + v5-next) | `0 2` → `0 4` |
| `ensure-funded-environments.yml` | `0 2` → `0 4` |
| `deploy-next-net.yml` | `0 6`, unchanged |
| `deploy-staging-internal.yml` | `0 7` → `0 6` |
| `deploy-staging-public.yml` | `0 7` → `0 6` |
| `nightly-spartan-bench.yml` | `30 7` → `0 6` |

The tag/fund jobs now run at 04:00 and the deploy/bench jobs at 06:00, so deploys still happen after the nightly tag is created.
Created by claudebox · group: `slackbot`