chore(ci): align nightly scheduled workflow times #24045

Merged
PhilWindle merged 24 commits into merge-train/spartan from cb/update-nightly-schedule-times on Jun 12, 2026

@AztecBot

Collaborator

Updates the scheduled (cron) times for the nightly workflows to the agreed schedule. Stacked on top of #24044 (base branch ag/chore-mt-conflicts-2).

All times UTC:

| Time | Workflow | Change |
| --- | --- | --- |
| 04:00 | Nightly Release Tag (`nightly-release-tag.yml`, `next` + `v5-next`) | `0 2` → `0 4` |
| 04:00 | Ensure Funded Environments (`ensure-funded-environments.yml`) | `0 2` → `0 4` |
| 06:00 | Deploy Next Net (`deploy-next-net.yml`) | already `0 6`, unchanged |
| 06:00 | Deploy to staging internal (`deploy-staging-internal.yml`) | `0 7` → `0 6` |
| 06:00 | Deploy to staging public (`deploy-staging-public.yml`) | `0 7` → `0 6` |
| 06:00 | Nightly Spartan Benchmarks (`nightly-spartan-bench.yml`) | `30 7` → `0 6` |

The tag/fund jobs now run at 04:00 and the deploy/bench jobs at 06:00, so deploys still happen after the nightly tag is created.
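
For reference, GitHub Actions `schedule` triggers take five-field cron strings (minute, hour, day of month, month, day of week) interpreted in UTC, so `0 4 * * *` fires at 04:00 and `0 6 * * *` at 06:00. The helper below is a hypothetical sketch, not part of the repository, that converts such a string to its daily UTC time. It assumes the minute and hour fields are plain numbers without leading zeros.

```bash
#!/usr/bin/env bash
# cron_to_utc is a hypothetical helper for illustration only.
# It handles daily schedules whose minute and hour fields are plain numbers.
cron_to_utc() {
  local minute hour rest
  # Read from a here-document so the `*` fields are never glob-expanded.
  read -r minute hour rest <<EOF
$1
EOF
  printf '%02d:%02d\n' "$hour" "$minute"
}

cron_to_utc "0 4 * * *"   # tag and funding jobs
cron_to_utc "0 6 * * *"   # deploy and benchmark jobs
```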


Created by claudebox · group: slackbot

charlielye and others added 24 commits June 9, 2026 16:50
Merge-queue runs route through `multi_job_run`, which pipes the runner-side
orchestration into a parent dashboard log (`cache_log "CI run" $RUN_ID`) — so
the spot/instance request is visible on ci.aztec-labs.com. Single-instance PR
modes called `bootstrap_ec2` directly, so that output only reached the GitHub
Actions console; you had to leave the dashboard to see which instance was
created.

Route the PR-facing single-instance modes (fast/docs/barretenberg/
barretenberg-full, full/full-no-test-cache, chonk-input-update) through
`multi_job_run` with a single job, matching merge-queue. The job id is kept as
`x-$cmd` so the `ci/<job>` GitHub status check name is unchanged. socket-fix
keeps its raw (un-denoised) output but now pipes through `cache_log` so it too
gets a parent log.
bootstrap_ec2 terminates any existing instance sharing the target Name tag, to
reap orphans left by a cancelled GA run on the same ref. But the name was just
<ref>_<arch>[_postfix], with no repo component — so aztec-packages and
aztec-packages-private, which build the same tags/refs concurrently under the
same OIDC role, computed identical names and reaped each other's live instances.

Observed: nightly tag v5.0.0-nightly.20260610 built in both repos; the public
run's pre-launch reap terminated the private run's in-progress arm64 release
instance ~7 min in, failing that build.

Prefix the instance name with the repo basename (GITHUB_REPOSITORY##*/, default
aztec-packages). The key stays stable across re-runs within a repo, so the
intended orphan cleanup still works; it only stops the two repos from colliding.
ci.sh's helper instance_name (shell/kill/get-ip) is kept in sync.
#23987)

## Problem

`ci3/bootstrap_ec2` terminates any existing instance that shares the
target `Name` tag before launching — this intentionally reaps orphans
left when a GA run is cancelled (e.g. by a new push) on the same ref.
But the name was `<ref>_<arch>[_<postfix>]` with **no repo component**,
so `aztec-packages` and `aztec-packages-private` — which build the same
tags/refs concurrently under the **same OIDC role** — computed identical
names and reaped each other's live instances.

### Observed incident

Nightly tag `v5.0.0-nightly.20260610` was built in **both** repos.
Instance `i-02e5d6a6c148ec726`
(`v5_0_0-nightly_20260610_arm64_a-release`) was launched by the private
repo's run at 03:06:01 UTC and **terminated at 03:13:12 UTC by the
public repo's run** for the same tag (its pre-launch reap step), ~7 min
in — failing the private build. CloudTrail confirms a
`TerminateInstances` from a different `ci3-<run_id>` session, not a spot
interruption.

## Fix

Prefix the instance name with the repo basename
(`${GITHUB_REPOSITORY##*/}`, defaulting to `aztec-packages` for local
runs):

- **Within a repo**, the key is unchanged in spirit
(`<repo>_<ref>_<arch>`) and stays stable across re-runs/new-pushes of
the same ref — so the intended orphan-on-cancel cleanup still works.
- **Across repos**, public → `aztec-packages_…` and private →
`aztec-packages-private_…`, so they no longer match and can't reap each
other.

`ci.sh`'s helper `instance_name` (used by the `shell`/`kill`/`get-ip`
dev commands) is kept in sync so it still resolves instances launched by
a CI run for the same repo.
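
The naming rule can be sketched as below. This is an illustration under assumptions, not the code in `ci3/bootstrap_ec2`: the function signature and the already-sanitized `ref` argument are hypothetical. Only the `${GITHUB_REPOSITORY##*/}` basename expansion, the `aztec-packages` default, and the `<repo>_<ref>_<arch>[_<postfix>]` shape come from the description above.

```bash
#!/usr/bin/env bash
# Sketch only: argument order and names are assumptions for illustration.
# instance_name REF ARCH [POSTFIX]
instance_name() {
  local ref arch postfix repo name
  ref=$1
  arch=$2
  postfix=${3:-}
  # Fall back to the public repo name for local runs, then keep the basename
  # (e.g. "AztecProtocol/aztec-packages-private" -> "aztec-packages-private").
  repo=${GITHUB_REPOSITORY:-aztec-packages}
  repo=${repo##*/}
  name="${repo}_${ref}_${arch}"
  if [ -n "$postfix" ]; then
    name="${name}_${postfix}"
  fi
  printf '%s\n' "$name"
}
```

For the same ref, arch, and postfix, the public and private repos now produce different names, so a name-match reap in one repo cannot select an instance launched by the other.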

### Notes

- The EC2 `Name` tag limit is 256 chars; the longest prefixed name is
~61 chars. The reap match uses the full `Name` tag, so the cosmetic
63-char `docker_hostname` truncation doesn't affect correctness.
- One-time transition: instances launched by the old (un-prefixed) code
won't be reaped by name-match from new runs; they fall back to the
shutdown timer / 1.5h reaper. Self-heals within a couple of hours.
- This stops the *collision*. Whether public **and** private *should*
both build the same nightly tag (duplicated work) is a separate question
— happy to follow up if you want one gated off.
## Problem

Merge-queue runs show a top-level "parent log" on the CI dashboard that
includes the **spot/instance request** (which instance type was created,
spot vs on-demand). Standard PR runs don't — to see what instance a PR
run got, you have to leave the dashboard and dig into the GitHub Actions
console.

## Cause

The runner-side orchestration output (the `Requesting spot fleet…` line
from `aws_request_instance_type`) is printed on the GA runner, *before*
the remote build streams to its per-instance `CI_LOG_ID` log. Where that
runner-side output lands depends on the path in `ci.sh`:

- **Merge-queue** goes through `multi_job_run`, which pipes everything
into a parent dashboard log: `parallel … 'run …' | DUP=1 cache_log "CI
run" $RUN_ID`. Each `run()` wraps `bootstrap_ec2` with
`PARENT_LOG_ID=$RUN_ID`, so the instance request lands in the parent log
and the build log links underneath it.
- **PR modes** called `bootstrap_ec2` directly — no `cache_log`, no
parent log — so the instance request only reached the GA console.

## Change

Route the PR-facing single-instance modes through the same
`multi_job_run` path (with a single job), so they get an identical `"CI
run" $RUN_ID` parent log with the instance request visible and the build
log linked beneath it:

- `fast` / `docs` / `barretenberg` / `barretenberg-full`
- `full` / `full-no-test-cache`
- `chonk-input-update`

The job id is kept as `x-$cmd`, so the `ci/<job>` GitHub commit-status
name is **unchanged** (no impact on required checks). `socket-fix`
(which takes extra args and is an interactive debug mode) keeps its raw,
un-denoised output but now pipes through `cache_log` so it also gets a
parent log.

## Behavior notes

- PR-run GA console output is now denoised (condensed progress) for the
converted modes, matching merge-queue; the full log lives in the
dashboard parent log.
- Instances for these modes now carry an `INSTANCE_POSTFIX` equal to the
job id (e.g. `x-fast`), so the EC2 `Name` tag becomes
`<ref>_amd64_x-fast`. Same-mode re-runs still dedupe correctly.

## Validation

This is a structural reuse of the already-proven merge-queue path
(`multi_job_run`), and `bash -n ci.sh` passes. It can't be exercised
locally (needs the GA + AWS orchestration), but **this PR's own CI run
is the test**: the `fast` job should now produce a `CI run` parent log
on the dashboard showing the instance request, reachable without opening
GitHub Actions.
## What

Two changes scoped to the **public** repo (`AztecProtocol/aztec-packages`) nightly flow, plus a follow-up tightening of the scenario-test trigger. Private tagging is unchanged.

### 1. Network scenario tests run only on the private v5-next nightly
`ci3.yml`'s `ci-network-scenario` job fired on any current nightly tag in both repos. Private produces both a `next` (v6) and a `v5-next` (v5) nightly tag, so simply gating to the private repo still ran scenarios against the v6 nightly. The nightly-triggered path is now gated to **private repo + a `v5.` nightly tag**:

```yaml
(
  needs.validate-nightly-tag.outputs.is_current == 'true'
  && github.repository == 'AztecProtocol/aztec-packages-private'
  && startsWith(github.ref_name, 'v5.')
)
|| contains(github.event.pull_request.labels.*.name, 'ci-network-scenario')
```

`v5-next` is at `5.x.x` (tag `v5.x.x-nightly.*`) and `next` is at `6.x.x` (tag `v6.x.x-nightly.*`), so `startsWith(github.ref_name, 'v5.')` selects the v5-next nightly only. The manual PR-label path (`ci-network-scenario`) is preserved for ad-hoc dev runs.
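
To make the gate concrete, the condition can be modelled as a small shell predicate. This is a sketch for reasoning about the cases, not the workflow itself: the function name and argument order are hypothetical, and unlike GitHub's `startsWith`, which ignores case, the `case` pattern here is case-sensitive.

```bash
#!/usr/bin/env bash
# should_run_network_scenarios IS_CURRENT REPO REF_NAME HAS_LABEL
# Returns 0 (run) or 1 (skip). Hypothetical model of the ci3.yml condition.
should_run_network_scenarios() {
  local is_current repo ref_name has_label
  is_current=$1
  repo=$2
  ref_name=$3
  has_label=$4

  # Manual path: the ci-network-scenario PR label always enables the job.
  if [ "$has_label" = "true" ]; then
    return 0
  fi

  # Nightly path: current nightly tag, private repo, v5 line only.
  if [ "$is_current" = "true" ] &&
     [ "$repo" = "AztecProtocol/aztec-packages-private" ]; then
    case $ref_name in
      v5.*) return 0 ;;
    esac
  fi
  return 1
}
```

Under this model a current `v5.x.x-nightly.*` tag in the private repo runs the scenarios, while a `v6.x.x-nightly.*` tag in the private repo or any nightly tag in the public repo is skipped unless the label is present.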

### 2. Stop tagging `next` with a nightly tag in public
`nightly-release-tag.yml`'s matrix tagged `[next, v5-next]` in both repos. The branch list is now repo-dependent: private keeps `[next, v5-next]`, public tags only `v5-next` (and `v4-next` via its existing dedicated job). Net result: **public tags `v4-next` + `v5-next` only**, private is untouched.

## Why
Nightly network scenario tests should run only against the private v5-next nightly, and public should not produce a `next` nightly tag.
…letes (#24012)

## Problem

A devnet deploy failed waiting on CI3 for two distinct reasons:

1. **Lookup window bug.** The script used `gh run list --workflow
ci3.yml` (which returns only ~20 newest runs) and filtered by `headSha`
client-side. By the time the deploy polled, the 03:04 nightly run had
aged off that first page, so the match never fired and the script timed
out — even though the run existed.

2. **Conclusion gated the deploy.** Even once found, `gh run watch
--exit-status` would fail the deploy if the CI3 nightly itself was red
(e.g. #2208). The nightly bundles many jobs, so an unrelated red job
blocked release even though the release build was fine.

## Fix

1. Query `repos/<repo>/actions/workflows/ci3.yml/runs?head_sha=<sha>`
via `gh api`, which filters server-side by SHA and finds the run
instantly no matter how old it is.

2. Drop `--exit-status` from `gh run watch` (so the whole-run conclusion
no longer gates), and instead gate specifically on the two release jobs
— the `./bootstrap.sh ci-release` builds on amd64 (`ci/x-release`) and
arm64 (`ci/a-release`). These are posted as **GitHub commit statuses**
on the tag's commit by `ci3/bootstrap_ec2` (`post_github_status
ci/<job-id>`). The script now waits for both statuses to reach a
terminal state (polling, since the runner posts them asynchronously) and
fails only if either is not `success`. It still fails if no CI3 run ever
appears for the tag.

The deploy now proceeds iff CI3 ran **and** both release-build jobs
succeeded, independent of unrelated nightly failures.
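
The gating rule in step 2 can be sketched as a pure decision function over the two commit-status states. This is an illustration under assumptions, not the deploy script: `release_gate` and its return codes are hypothetical, and it relies only on the documented commit-status states (`pending`, `success`, `failure`, `error`). The commented `gh api` calls show where the inputs could come from; the `--jq` filters are one plausible way to extract them.

```bash
#!/usr/bin/env bash
# Hypothetical sketch. Inputs could be fetched with, for example:
#   gh api "repos/$repo/actions/workflows/ci3.yml/runs?head_sha=$sha" \
#     --jq '.workflow_runs[0].id'            # server-side lookup by SHA
#   gh api "repos/$repo/commits/$sha/status" \
#     --jq '.statuses[] | select(.context == "ci/x-release") | .state'

# release_gate X_STATE A_STATE
# Returns 0 when both release builds succeeded, 2 when either is still
# pending or not yet posted (keep polling), and 1 on any terminal failure.
release_gate() {
  local state
  # First pass: if anything is non-terminal, the caller should keep polling.
  for state in "$1" "$2"; do
    case $state in
      ""|pending) return 2 ;;
    esac
  done
  # Second pass: both are terminal, so anything but success fails the gate.
  for state in "$1" "$2"; do
    [ "$state" = "success" ] || return 1
  done
  return 0
}
```

A polling loop would call `release_gate` with the latest `ci/x-release` and `ci/a-release` states, sleep and retry on return code 2, and stop on 0 or 1. The conclusion of the overall CI3 run never enters the decision.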
## What

Make all three nightly deployments run the deploy from the tip of `next` (latest scripts + helm) while keeping the correct image for each target network.

### Deploy ref → `next`
- `deploy-staging-internal.yml`, `deploy-staging-public.yml`: pass `ref: next` to `deploy-network.yml` so the `spartan/` deploy scripts and helm charts come from `next`.
- `deploy-next-net.yml` already passed `ref: next` (unchanged).

### `determine-tag` job (staging)
- Check out a single commit at the tip of `next` (`ref: next`, `fetch-depth: 1`) instead of `v5-next` with full history.
- Tag resolution: if an explicit `tag` input is given, use it as-is. Otherwise construct `v5.0.0-nightly.<date>` and verify it actually exists with `git ls-remote --exit-code --tags origin`, failing the deploy early if the nightly tag is missing rather than proceeding to deploy a non-existent image.
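
The resolution order can be sketched as follows. It is a minimal illustration, assuming a `YYYYMMDD` date suffix (as in `v5.0.0-nightly.20260610`); the function names and the `TAG_INPUT` variable are hypothetical, and the workflow's actual step may be structured differently. `git ls-remote --exit-code` exits with status 2 when no matching ref exists, which is what lets the check fail early.

```bash
#!/usr/bin/env bash
# resolve_tag [EXPLICIT_TAG] [DAY]
# Prints the explicit tag if given, else v5.0.0-nightly.<day> (today in UTC
# unless DAY is passed). Hypothetical helper for illustration.
resolve_tag() {
  local explicit day
  explicit=${1:-}
  day=${2:-$(date -u +%Y%m%d)}
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  else
    printf 'v5.0.0-nightly.%s\n' "$day"
  fi
}

# tag_exists TAG [REMOTE]
# Succeeds only if the tag is present on the remote (default: origin).
tag_exists() {
  git ls-remote --exit-code --tags "${2:-origin}" "refs/tags/$1" >/dev/null
}

# Usage sketch (only the constructed tag is verified):
#   tag=$(resolve_tag "$TAG_INPUT")
#   if [ -z "$TAG_INPUT" ] && ! tag_exists "$tag"; then
#     echo "nightly tag $tag not found on origin" >&2
#     exit 1
#   fi
```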

## Why

`deploy-network.yml` checks out `inputs.ref` to run the deploy scripts/helm; when unset it falls back to `github.ref` (default branch on `schedule`, dispatch branch on `workflow_dispatch`), making the scripts/helm implicit and branch-dependent. Pinning to `next` keeps staging on the latest infra while `semver`/`source_tag` continue to select the v5-line image (`v5.0.0-nightly.<date>`), which is the correct image for the staging networks.

The `v5.0.0-nightly.<date>` tag is created on both the public and private repos (the nightly tagger tags `v5-next` on both), so the `git ls-remote origin` check resolves against whichever repo the workflow runs in.
## Summary

- Update the testnet SponsoredFPC address in the networks table and
getting-started guides.
- Adjust release docs guidance to reflect that SponsoredFPC is deployed
on testnet and devnet, but not mainnet.

## Validation

- `yarn spellcheck` from `docs/`
#24026)

Adds an `<agent_and_workflow_restraint>` block to the root `CLAUDE.md`
telling Claude to do work inline in the current session and not spawn
parallel subagents or launch dynamic workflows unless the user
explicitly asks.

## Why

Operators have reported burning through their token budget from a single
prompt that quietly fanned out — in one case a "summarize recent ZK
advancements" query started ~30 agents, and another exhausted a 5h
budget spinning up subagents. Parallel agents and dynamic workflows
multiply spend (≈2x for one helper, far more for a swarm) and the user
can neither see the fan-out coming nor stop it. This appears to be a
current tendency of Fable. The guidance reasserts: handle
search/summarize/research/multi-file edits inline, reserve subagents for
explicit user requests or a single read-heavy isolation case, and never
start a dynamic workflow by default.

Passes the repo's `<editorial_test>`: the line would have prevented the
~30-agent fan-out on an ordinary research prompt described above.

Same change is being opened against `v5-next`, and an equivalent shared
rule is being added in the claudebox repo so it applies to every managed
session.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/7d5ecfdd5f37c5cd) ·
group: `slackbot`*
…4033)

Replicates the change in #24024 directly on a repo branch, and also
applies it to the versioned docs snapshot.

Updates the **Aztec & Noir Developer Office Hours** Google Meet link
from `https://meet.google.com/sdd-rdsr-shu` →
`https://meet.google.com/vev-waao-mab` in:

- `docs/docs-developers/docs/resources/community_calls.md` (current docs
— identical to #24024)
-
`docs/developer_versioned_docs/version-v4.3.1/docs/resources/community_calls.md`
(versioned snapshot — the only versioned copy that still carried the old
link)

No occurrences of the old `sdd-rdsr-shu` link remain anywhere under
`docs/`.
…ing (#24039)

## Problem

The dashboard `grind` option always fails to SSH into the build
instance:

```
Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.
```

The instance launches fine (spot/on-demand fulfilled, IP assigned) but
SSH never connects, so grind cycles through every instance type and
gives up.

## Root cause

CI build boxes were migrated from SSH to **SSM**. In `ci3/bootstrap_ec2`
the default is now `CI_USE_SSH=0` (SSM); only `shell-new` forces SSH,
and `grind-test` does not. So on current `next`, grind runs over SSM
like the rest of CI.

But the dashboard launches grind from a long-lived checkout at
`REPO_PATH` (the `/grind` handler in `rk.py` shells out to `cd
$REPO_PATH && ./ci.sh grind-test ...`). That checkout had drifted to a
pre-SSM commit, so grind alone still took the legacy SSH branch —
launching into the retired SSH security group + `build-instance` key
pair, whose port-22 / key-injection preconditions were torn down during
the SSM lockdown. The stale checkout also explains the old AMI
(`ami-09d27244b23be8891`) in the logs vs. current `next`'s
`ami-067627aa971a1dcbb`.

Nothing kept `REPO_PATH` current: the `ci3-dashboard-deploy.yml`
workflow only rebuilds the `rkapp` Flask container (and is path-filtered
to `ci3/dashboard/**`), so changes to the `ci3/` launcher scripts never
refreshed it.

## Fix

Refresh the launcher checkout to `origin/next` at grind launch time,
before shelling out. This is self-healing and independent of deploys. It
matches the existing design where the launcher always runs
current-`next` orchestration scripts while the grind *target commit* is
checked out on the remote box — so this does **not** restrict which
branch/commit you can grind. If the refresh fails (e.g. transient
network), the error is surfaced in the run log instead of silently
grinding on a stale tree.

## Testing

`python3 -m py_compile ci3/dashboard/rk.py` passes. The behavior change
is host-side (requires the dashboard's `REPO_PATH` checkout) and can't
be exercised in unit CI; it will take effect on the next dashboard
deploy. The immediate one-time unblock is still to refresh `REPO_PATH`
on `ci.aztec-labs.com` and restart `rkapp`.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/1c05a513cb601b21) ·
group: `slackbot`*
@AztecBot added the ci-draft, ci-no-fail-fast, and claudebox labels Jun 12, 2026
@alexghr marked this pull request as ready for review June 12, 2026 06:17
@alexghr requested a review from charlielye as a code owner June 12, 2026 06:17
alexghr commented Jun 12, 2026

Contributor

This PR is stacked on top of #24044

Base automatically changed from ag/chore-mt-conflicts-2 to merge-train/spartan June 12, 2026 09:12
@PhilWindle merged commit e09e0a7 into merge-train/spartan Jun 12, 2026
39 of 47 checks passed
@PhilWindle deleted the cb/update-nightly-schedule-times branch June 12, 2026 09:14
PhilWindle added a commit that referenced this pull request Jun 12, 2026
## Why

`merge-train/spartan` ([PR
#23971](#23971)) has
been in state `dirty` and has not merged into `next` for ~2 days. This
PR merges current `next` into the train branch and resolves the
conflicts so #23971 becomes mergeable again.

## Conflicts resolved

Both conflicts were cron-schedule differences. The train branch's commit
`chore(ci): align nightly scheduled workflow times (#24045)`
deliberately set these times, and that alignment is exactly what the
train is bringing into `next` — so the train (HEAD) side was kept in
both:

- `.github/workflows/deploy-staging-internal.yml` — kept `cron: "0 6 * *
*"` (train) over `"0 7 * * *"` (next)
- `.github/workflows/nightly-release-tag.yml` — kept `cron: "0 4 * * *"`
(train) over `"0 2 * * *"` (next)

After resolution, `origin/next` is an ancestor of this branch; the only
net file change versus the current train tip is
`spartan/terraform/gke-cluster/iam.tf` (+24, brought in from `next`).

## ⚠️ Merge with a merge commit, not squash

This PR carries a real merge commit so that `next` stays an ancestor of
`merge-train/spartan`. It must be landed with **Create a merge commit**
(hence the `ci-no-squash` label) — squashing would drop the merge and
leave #23971 dirty.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/011e27d5de05b232) ·
group: `slackbot`*