ci: pilot-migrate clippy job to smithy self-hosted runners#201

Merged
avrabe merged 14 commits into main from smithy-clippy-pilot on May 11, 2026

Conversation

avrabe (Contributor) commented May 3, 2026

Summary

First pilot migration of a CI job from GitHub-hosted to the
pulseengine self-hosted fleet (hetzner-private runner group on
pulseengine-ci-01). Scope deliberately small: just the clippy
job, switched to [self-hosted, linux, x64, rust-cpu]. Other jobs
(fmt, test, proofs) stay on ubuntu-latest.
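
A minimal sketch of the change summarised above, assuming the job is keyed `clippy` in `.github/workflows/ci.yml`. Only the `runs-on` labels and the use of `dtolnay/rust-toolchain` come from this PR; the checkout step, the `components` value, and the clippy flags are illustrative assumptions.

```yaml
jobs:
  clippy:
    # Before: runs-on: ubuntu-latest
    # After: route to a self-hosted runner that carries all four labels.
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4            # assumed step
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: clippy                 # assumed input value
      # Assumed invocation; the flags in the real workflow may differ.
      - run: cargo clippy --workspace --all-targets -- -D warnings
```

Because the diff is confined to the `runs-on` line, reverting it restores GitHub-hosted compute without touching the steps.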

Rationale

  • Spar's recent CI runs show 400-600 min completion times, much of
    which is GitHub-hosted runner queueing on the org-free tier
    (20-concurrent cap).
  • Clippy is meaningful compile work (good sccache integration test)
    but bounded — failure doesn't block format checks or tests.
  • No sudo, apt, or container needed → no friction with our
    rootless runner setup.
  • Spar already pins nightly via dtolnay/rust-toolchain, so the
    toolchain version matches between hosted and self-hosted.

Test plan

  • CI run completes — clippy job lands on a rust-cpu runner (1 of 5/6/7) within seconds (no GitHub queue)
  • Compile succeeds end-to-end with no clippy warnings
  • Other jobs (fmt, test) still run on ubuntu-latest as before
  • Second push to this branch should be much faster on clippy thanks to sccache hit

Rollback

Revert this commit. runs-on: flips back to ubuntu-latest and
the next run uses GitHub-hosted compute.

Follow-ups (if green)

  • Migrate fmt and test next (separate PRs).
  • Add a heavy-quality workflow (mutants-weekly.yml) that targets
    lean-mem runners, separate from gating CI.

avrabe added 6 commits May 3, 2026 07:54
Switches just the clippy job from ubuntu-latest to
[self-hosted, linux, x64, rust-cpu] — one of the three rust-cpu
runners on pulseengine-ci-01 (hetzner-private group).

Other jobs (fmt, test) stay on ubuntu-latest for now; once we have
a few green clippy runs and timing data, the rest can follow.

Why clippy first:
- meaningful compile work (good sccache test)
- bounded scope — failure doesn't block fmt or test
- no sudo, apt, or container needed
- spar already tracks nightly via dtolnay/rust-toolchain so the
  toolchain matches between hosted and self-hosted

If this PR's clippy job goes red on the self-hosted runner but
passes locally / on hosted, that's a smithy bug, not a code bug.

The previous clippy run on the self-hosted runner failed at
highs-sys build because cmake wasn't on the host. smithy main now
ships the common Rust build-dep set (cmake, clang, lld, perl, m4,
protobuf-compiler, libclang-dev, zlib1g-dev). Pushing an empty
commit to re-trigger CI; clippy should now finish on rust-cpu.

Builds on the proven clippy migration (PR description, original
commit on this branch). Two separate concerns:

1) ci.yml — broaden the migration

Migrate every gating job that doesn't need infra we don't have on
the smithy host. Two stay on ubuntu-latest with explicit comments
explaining why; everything else now targets the matching smithy
runner class:

  rust-cpu (12G MemoryHigh)        clippy, test, bench-smoke,
                                   coverage, proptest, fuzz-smoke,
                                   rivet-validate
  lean-mem (24G MemoryHigh)        miri, mutants
  light    (4G  MemoryHigh)        fmt, audit, deny, supply-chain
  ubuntu-latest (kept)             bazel-test (no Bazel on host),
                                   kani (kani-verifier bundles CBMC,
                                   ~100 MB install — not worth pre-
                                   provisioning until kani sees more
                                   use)

The lean-mem class for miri / mutants is deliberate: both are
RAM-aggressive (Miri's borrow tracker, mutants' parallel cargo
invocations). The 24G MemoryHigh ceiling on smithy lean-mem
runners is comfortably above the 12G rust-cpu cap.
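
The class mapping above could look roughly like this in `ci.yml`. This is a fragment with job bodies omitted, and it assumes the `lean-mem` and `light` classes are selected with the same four-label shape as `rust-cpu`; the actual label lists may differ.

```yaml
# Fragment: only runner selection is shown, one representative job per class.
jobs:
  test:
    runs-on: [self-hosted, linux, x64, rust-cpu]   # 12G MemoryHigh class
  miri:
    runs-on: [self-hosted, linux, x64, lean-mem]   # 24G MemoryHigh class
  fmt:
    runs-on: [self-hosted, linux, x64, light]      # 4G MemoryHigh class
  bazel-test:
    # Stays hosted: no Bazel on the smithy host.
    runs-on: ubuntu-latest
```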

2) mutants-weekly.yml — new heavy-quality workflow

Counterpart to the gating `mutants:` job in ci.yml. Different
operational pattern (smithy DD-pattern for "heavy quality"):

  - schedule: 02:00 UTC every Sunday + workflow_dispatch on demand
  - runs-on: lean-mem (24G), timeout-minutes: 720
  - concurrency.cancel-in-progress: false (never cancel a quality run)
  - workflow_dispatch inputs: `shard` (default 0/8 for sanity, "all"
    for the full ~hours pass) + `packages` (space-separated -p list)
  - results land in GITHUB_STEP_SUMMARY (markdown table of
    missed/caught/timeout/unviable) plus an uploaded artefact with
    90-day retention
  - no PR red lights; no auto-Issue filing yet (that's a follow-up
    once the report shape stabilises)
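
A rough sketch of how the points listed above might fit together in `mutants-weekly.yml`. Several details are assumptions rather than facts from this PR: the concurrency group name, the artefact name, treating `packages` as a list of package names, cargo-mutants already being installed on the runner, and falling back to a full pass on scheduled runs (which carry no inputs).

```yaml
name: mutants-weekly

on:
  schedule:
    - cron: "0 2 * * 0"          # 02:00 UTC every Sunday
  workflow_dispatch:
    inputs:
      shard:
        description: 'Shard as k/n, or "all" for the full pass'
        default: "0/8"
      packages:
        description: "Space-separated package names, each passed as -p"
        default: ""

concurrency:
  group: mutants-weekly          # assumed group name
  cancel-in-progress: false      # never cancel a quality run

jobs:
  mutants:
    runs-on: [self-hosted, linux, x64, lean-mem]
    timeout-minutes: 720
    steps:
      - uses: actions/checkout@v4
      - name: Run cargo-mutants
        env:
          # Assumption: scheduled runs have no inputs, so fall back to "all".
          SHARD: ${{ inputs.shard || 'all' }}
          PACKAGES: ${{ inputs.packages }}
        run: |
          args=()
          if [ "$SHARD" != "all" ]; then
            args+=(--shard "$SHARD")
          fi
          # Unquoted on purpose so the list splits on whitespace.
          for pkg in $PACKAGES; do
            args+=(-p "$pkg")
          done
          cargo mutants "${args[@]}"
      - name: Summarise outcomes
        if: always()
        run: |
          {
            echo "| outcome | count |"
            echo "| --- | --- |"
            for kind in missed caught timeout unviable; do
              file="mutants.out/$kind.txt"
              count=0
              if [ -f "$file" ]; then
                count=$(wc -l < "$file")
              fi
              echo "| $kind | $count |"
            done
          } >> "$GITHUB_STEP_SUMMARY"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mutants-out      # assumed artefact name
          path: mutants.out
          retention-days: 90
```

Passing the inputs through `env` rather than interpolating them into the script keeps user-supplied text out of the shell source.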

This is the second-pattern pilot the smithy fleet was sized for —
the lean-mem runners have been idle since registration; this puts
them on the work they were labelled for.

GitHub limits workflow_dispatch and schedule triggers to workflows
that already exist on the default branch. Adding a path-filtered
push trigger lets us exercise the workflow on this PR before merge.
The push: block carries a TEMPORARY marker; remove it before merge.
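
For illustration, the temporary trigger could be written as below. The path filter shown is an assumption (the commit does not state which paths it matches); filtering on the workflow file itself is one plausible choice.

```yaml
on:
  schedule:
    - cron: "0 2 * * 0"
  workflow_dispatch:
  # TEMPORARY: remove before merge. Lets the workflow run from the PR branch,
  # since schedule and workflow_dispatch only fire for the default branch.
  push:
    paths:
      - .github/workflows/mutants-weekly.yml   # assumed filter
```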

Prior run hit 'Permission denied (os error 13)' on .d files in
target/. Direct file-write tests as the runner user succeed; the
files are owned correctly with mode 640. Suspect: stale state
left by a cancelled run interacting badly with concurrent jobs
landing on the same runner via cache restoration. Clearing all
runner _work and the shared sccache to bisect: if a clean run
also fails, it's not stale state.

Disabled RUSTC_WRAPPER in runner env (smithy commit 65e57a2);
runners restarted to pick up the new environment.
bpftrace running on host capturing every openat returning EACCES
with PID/UID/comm/filename. Pushing this empty commit to fire CI.

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


avrabe added 5 commits May 3, 2026 11:32
The rustsec/audit-check action bundles an older cargo-audit that can't
parse CVSS 4.0 advisories such as RUSTSEC-2026-0037; it exits non-zero on
the parse error before evaluating spar's Cargo.lock. cargo-audit is
pre-installed on smithy at v0.21.2 (toolchains role), which handles
CVSS 4.0 fine.

Same effect (audit blocks PRs on advisory hits) without the wrapper.
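
A sketch of the direct invocation, assuming the job is keyed `audit`, targets the `light` class, and finds `cargo-audit` on `PATH` via the toolchains role. The checkout step is also an assumption.

```yaml
jobs:
  audit:
    runs-on: [self-hosted, linux, x64, light]
    steps:
      - uses: actions/checkout@v4
      # Replaces the rustsec/audit-check wrapper action.
      # cargo-audit is assumed to be pre-installed on the runner.
      - run: cargo audit
```

`cargo audit` exits non-zero when it finds a vulnerability, so the step still fails the job on advisory hits.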

Smithy main now ships:
  - subuid/subgid for runner1..8 (Cargo Deny rootless container fix)
  - CARGO_HOME/bin on the runner env PATH (Rivet validate fix)
  - always-on bpftrace EACCES tracing (smithy-trace-eacces.service)

Plus this branch carries:
  - cargo audit invoked directly (replaces broken rustsec/audit-check)

All runners restarted with new env. This commit fires fresh CI.

…roken)

Two adjustments after the smithy subuid + PATH fixes landed:

1. cargo-deny: drop EmbarkStudios/cargo-deny-action@v2 (which runs
   in a rootless container) in favour of direct `cargo deny check`.
   Smithy has cargo-deny installed (toolchains role v0.16.4). The
   container action fails on our hardened runner systemd unit:
   newuidmap is setuid but NoNewPrivileges=true blocks the
   escalation, so the rootless namespace can't be set up. Going
   direct sidesteps the entire interaction; we'd otherwise need to
   weaken the runner hardening for this single workflow.

2. audit: back to ubuntu-latest temporarily. Smithy ships cargo-audit
   v0.21.2 which still rejects RUSTSEC-2026-0037 ('unsupported CVSS
   version: 4.0') even though upstream rustsec 0.30+ supports CVSS
   4.0. v0.22.1 would fix it but that build trips on our
   sccache-on-cc setup (aws-lc-sys C compile through sccache fails).
   Move back once smithy ships an upgraded cargo-audit.

Surfaced when running `cargo deny check` directly with the
toolchains-role-installed cargo-deny v0.16.4 on smithy:

  error[deprecated]: this key has been removed, see
  EmbarkStudios/cargo-deny#611

The yanked + licenses + bans + sources sections still gate
normally. Unmaintained-crate detection moved out of the static
config in newer cargo-deny; revisit if/when we want to re-enable
that signal.

cargo-deny and cargo-audit share the same rustsec advisory parser.
Both fail at the same point on RUSTSEC-2026-0037 because the
embedded rustsec rejects CVSS 4.0 strings. The audit job (on
hosted) still covers vulnerability matching; cargo-deny here keeps
gating bans, licenses, and sources, which is what it actually adds
beyond audit. Drop the workaround once smithy ships an upgraded
rustsec parser (tracked alongside the cargo-audit upgrade).
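
A sketch of the resulting deny job, assuming it is keyed `deny` and runs on the `light` class. The command form matches the one quoted in the follow-up rivet commit; whether spar's workflow uses exactly this line is an assumption.

```yaml
jobs:
  deny:
    runs-on: [self-hosted, linux, x64, light]
    steps:
      - uses: actions/checkout@v4
      # Replaces EmbarkStudios/cargo-deny-action@v2, which needs a rootless container.
      # The advisories check is left out until the installed rustsec parser
      # accepts CVSS 4.0; bans, licenses, and sources still gate the PR.
      - run: cargo deny check bans licenses sources
```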

avrabe added a commit to pulseengine/rivet that referenced this pull request May 10, 2026
* ci: migrate 16 of 21 ci.yml jobs to smithy self-hosted runners

Builds on the spar pilot (pulseengine/spar#201) — same runner-class
mapping, same workarounds for the rustsec parser CVSS 4.0 issue,
same direct-cargo-deny pattern.

Migrated to smithy:

  rust-cpu      clippy, docs-check, test, semver-checks, coverage,
                proptest, fuzz, msrv
  lean-mem      miri, mutants, verus
  light         fmt, yaml-lint, deny, supply-chain, release-results

Stay on ubuntu-latest (each with explanatory comment in-place):

  - playwright       (--with-deps does sudo apt-get; smithy runners have no sudo)
  - vscode-extension (xvfb-run + downloaded VS Code Test setup)
  - audit            (cargo-audit 0.21 rustsec parser rejects CVSS 4.0)
  - kani             (kani-verifier bundles CBMC, ~100 MB install)
  - rocq             (Coq install, not on smithy yet)

Two non-trivial fixes inside migrated jobs:

  - test: actionlint install moved from `sudo mv /tmp/actionlint
    /usr/local/bin` to `mv /tmp/actionlint $HOME/.local/bin` plus
    GITHUB_PATH update. Smithy runners have no sudo; same binary,
    different writable location.
  - deny: dropped the `cargo deny check` (which would fail loading
    advisory-db with CVSS 4.0) for `cargo deny check bans licenses
    sources`. The audit job (still on hosted) covers vulnerability
    matching meanwhile.
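
The sudo-free actionlint install described in the first fix might look like this. It assumes an earlier step has already placed the binary at `/tmp/actionlint`; the step names and the final invocation are illustrative.

```yaml
steps:
  - name: Install actionlint without sudo
    run: |
      # Assumes an earlier step left the binary at /tmp/actionlint.
      mkdir -p "$HOME/.local/bin"
      mv /tmp/actionlint "$HOME/.local/bin/actionlint"
      # Prepend the directory to PATH for all subsequent steps.
      echo "$HOME/.local/bin" >> "$GITHUB_PATH"
  - name: Lint workflows
    run: actionlint
```

Entries written to `GITHUB_PATH` take effect only in later steps, which is why the lint runs as a separate step.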

Expected improvement: spar's broad migration showed ~470x end-to-end
speedup on clippy (~470 min → 1 min) thanks to queue elimination.
Rivet should see similar — its recent runs showed 600+ min total.

* ci(miri): bump timeout-minutes 15->30 after smithy run hit limit

First migration run timed out exactly at 15:00 with tests still
progressing (last printed test at ~11:00). Smithy's lean-mem class
appears to run the slowest tests more slowly than the previous hosted
runner did. This could be cgroup memory pressure (24G MemoryHigh under
Miri's shadow allocations) or simply slower tail-test performance. Bumping
the budget conservatively; revisit once we have a few green runs
to dial it back closer to actual.

Semver Checks is also failing on this PR — upstream issue
('unsupported rustdoc format v57', the action ships a too-old
cargo-semver-checks). NOT a smithy-migration issue; would fail on
hosted too. Tracked as a separate followup; doesn't block this PR.
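
The Miri timeout bump from this commit amounts to a one-line change. The sketch below assumes a job keyed `miri` on the `lean-mem` class; the toolchain setup and test invocation are illustrative, not taken from rivet's workflow.

```yaml
jobs:
  miri:
    runs-on: [self-hosted, linux, x64, lean-mem]
    timeout-minutes: 30            # was 15; the first smithy run was cut off at 15:00
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: miri
      - run: cargo miri test       # assumed invocation
```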

* ci: retrigger after smithy TMPDIR fix

Smithy main now points TMPDIR / TMP / TEMP at the per-runner
/var/lib/runners/runnerN/_tmp on lv_runners (500 G), instead of
the host's /tmp on lv_root (80 G). Previous run hit 'no space
left on device' when the rivet HTML-export test ran out of root
FS budget. Runners restarted; this commit triggers a fresh CI.

* ci(semver-checks): replace stale wrapper action with direct cargo install

obi1kenobi/cargo-semver-checks-action@v2 bundles an older
cargo-semver-checks that doesn't recognise rustdoc JSON v57
(the format current stable rustdoc emits). Every PR run failed
with 'unsupported rustdoc format v57 for file: rivet_core.json'.

Going direct: install the latest cargo-semver-checks at job time
and invoke it. Slightly slower on cold cache but tracks the
upstream rustdoc format. Same end-effect as the wrapper.

Caught during the rivet broad-CI smithy migration (PR #262); not
related to self-hosted vs hosted.
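
A sketch of the direct install described in this commit, assuming a job keyed `semver-checks` on the `rust-cpu` class with a stable toolchain. The exact arguments in rivet's workflow are not shown in the commit message and may differ.

```yaml
jobs:
  semver-checks:
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      # Installs the latest release at job time, so it tracks the rustdoc JSON
      # format of the current toolchain; slower on a cold cache.
      - run: cargo install cargo-semver-checks --locked
      - run: cargo semver-checks
```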

* ci: retrigger after smithy disk cleanup + tmpfiles policy

lv_runners had filled to 100% from accumulated per-runner _tmp.
Smithy main commit b4af61e adds /etc/tmpfiles.d/smithy-runner-tmp.conf
to age files >24 h out of those dirs daily. Manual cleanup ran today
(466G -> 92G used). Re-triggering CI to confirm Miri / Mutation /
Verus jobs land green now that disk is back to 20% used.
@avrabe avrabe merged commit 2104cc6 into main May 11, 2026
15 of 30 checks passed
@avrabe avrabe deleted the smithy-clippy-pilot branch May 11, 2026 04:25