fix(ci): refresh grind launcher checkout to origin/next before launching #24039

Merged
spalladino merged 1 commit into next from cb/grind-refresh-launcher-checkout on Jun 11, 2026

@AztecBot

Collaborator

Problem

The dashboard grind option always fails to SSH into the build instance:

Waiting for SSH at 3.144.255.68...
Timeout: SSH could not login to 3.144.255.68 within 60 seconds.

The instance launches fine (spot/on-demand fulfilled, IP assigned) but SSH never connects, so grind cycles through every instance type and gives up.

Root cause

CI build boxes were migrated from SSH to SSM. In ci3/bootstrap_ec2 the default is now CI_USE_SSH=0 (SSM); only shell-new forces SSH, and grind-test does not. So on current next, grind runs over SSM like the rest of CI.
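The transport choice described above can be modelled in a few lines. This is only a Python illustration of the stated behaviour, not the real logic, which lives in the ci3 shell scripts; the function name `uses_ssh` and the rule that a value of `1` opts in to SSH are assumptions.

```python
import os


def uses_ssh(command, env=None):
    """Return True if the (modelled) launcher would connect over SSH.

    Hypothetical model: shell-new always forces SSH; every other command,
    including grind-test, follows CI_USE_SSH, which defaults to 0 (SSM).
    """
    env = os.environ if env is None else env
    if command == "shell-new":
        return True
    return env.get("CI_USE_SSH", "0") == "1"
```

Under this model, `uses_ssh("grind-test", {})` is false, which is why grind on current next should never wait on port 22.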

But the dashboard launches grind from a long-lived checkout at REPO_PATH (the /grind handler in rk.py shells out to cd $REPO_PATH && ./ci.sh grind-test ...). That checkout had drifted to a pre-SSM commit, so grind alone still took the legacy SSH branch. It launched into the retired SSH security group and build-instance key pair, whose port-22 and key-injection preconditions were torn down during the SSM lockdown. The stale checkout also explains the old AMI in the logs (ami-09d27244b23be8891), where current next uses ami-067627aa971a1dcbb.
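One way to confirm this kind of drift on the dashboard host is to count how many commits the checkout is missing relative to origin/next. The sketch below is not part of this PR; `commits_behind` and the injectable `git` helper are hypothetical names, and it assumes the remote is called `origin`.

```python
import subprocess


def _git(args, cwd):
    """Run a git subcommand in cwd and return its trimmed stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def commits_behind(repo_path, upstream="origin/next", git=_git):
    """Count commits reachable from upstream but not from HEAD."""
    # Fetch first so the remote-tracking ref is not itself stale.
    git(["fetch", "origin"], repo_path)
    return int(git(["rev-list", "--count", f"HEAD..{upstream}"], repo_path))
```

A non-zero result for the REPO_PATH checkout would indicate that the launcher scripts there are older than current next.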

Nothing kept REPO_PATH current: the ci3-dashboard-deploy.yml workflow only rebuilds the rkapp Flask container (and is path-filtered to ci3/dashboard/**), so changes to the ci3/ launcher scripts never refreshed it.

Fix

Refresh the launcher checkout to origin/next at grind launch time, before shelling out. This is self-healing and independent of deploys. It matches the existing design, where the launcher always runs current-next orchestration scripts while the grind target commit is checked out on the remote box, so this does not restrict which branch or commit you can grind. If the refresh fails (e.g. a transient network error), the error is surfaced in the run log instead of silently grinding on a stale tree.
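A minimal sketch of the refresh-then-launch flow follows. It is not the code merged in rk.py: the function names, the `log` callback, the injectable `run` helper, and the specific choice of `git fetch` followed by `git reset --hard` are all assumptions made for illustration.

```python
import subprocess


def _run(cmd, cwd):
    """Run cmd in cwd, capturing output so failures can be logged."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)


def refresh_launcher_checkout(repo_path, log, run=_run):
    """Bring the launcher checkout to origin/next; return True on success."""
    steps = [
        ["git", "fetch", "origin", "next"],
        ["git", "reset", "--hard", "origin/next"],  # assumed refresh strategy
    ]
    for cmd in steps:
        result = run(cmd, repo_path)
        if result.returncode != 0:
            # Surface the failure in the run log rather than continuing.
            log(f"launcher refresh failed: {' '.join(cmd)}: {result.stderr.strip()}")
            return False
    return True


def launch_grind(repo_path, grind_args, log, run=_run):
    """Refresh the checkout, then shell out to ./ci.sh grind-test."""
    if not refresh_launcher_checkout(repo_path, log, run):
        return False  # never grind on a stale tree
    result = run(["./ci.sh", "grind-test", *grind_args], repo_path)
    return result.returncode == 0
```

Returning early on a failed refresh is the property that matters here: a stale tree produces a visible error in the log instead of another SSH timeout.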

Testing

python3 -m py_compile ci3/dashboard/rk.py passes. The behavior change is host-side (requires the dashboard's REPO_PATH checkout) and can't be exercised in unit CI; it will take effect on the next dashboard deploy. The immediate one-time unblock is still to refresh REPO_PATH on ci.aztec-labs.com and restart rkapp.


Created by claudebox · group: slackbot

@AztecBot added the ci-draft, ci-no-fail-fast, and claudebox labels Jun 11, 2026
@spalladino marked this pull request as ready for review June 11, 2026 21:58
@spalladino requested a review from charlielye as a code owner June 11, 2026 21:58
@spalladino enabled auto-merge June 11, 2026 21:58
@spalladino added this pull request to the merge queue Jun 11, 2026
@AztecBot

Collaborator Author

Flakey Tests

🤖 says: This CI run detected 1 test that failed but was tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/3a45cda231c3747c): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_p2p/multiple_validators_sentinel.parallel.test.ts "collects attestations for all validators on a node" (400s) (code: 0) group:e2e-p2p-epoch-flakes

Merged via the queue into next with commit fa06339 Jun 11, 2026
51 of 55 checks passed
@spalladino deleted the cb/grind-refresh-launcher-checkout branch June 11, 2026 22:55