ci: pilot-migrate clippy job to smithy self-hosted runners #201
Merged
Switches just the clippy job from ubuntu-latest to [self-hosted, linux, x64, rust-cpu], which selects one of the three rust-cpu runners on pulseengine-ci-01 (hetzner-private group). Other jobs (fmt, test) stay on ubuntu-latest for now; once we have a few green clippy runs and timing data, the rest can follow.
Why clippy first:
- meaningful compile work (good sccache test)
- bounded scope: failure doesn't block fmt or test
- no sudo, apt, or container needed
- spar already tracks nightly via dtolnay/rust-toolchain, so the toolchain matches between hosted and self-hosted
If this PR's clippy job goes red on the self-hosted runner but passes locally / on hosted, that's a smithy bug, not a code bug.
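As a rough illustration of the change described above, the clippy job might look like the sketch below. Only the `runs-on` label list and the use of `dtolnay/rust-toolchain` come from this PR; the checkout version, the `components` input value, and the clippy flags are assumptions for illustration.

```yaml
jobs:
  clippy:
    # Label list from this PR. A job is only picked up by a runner that
    # carries every label in the list.
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4          # version is an assumption
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: clippy
      # Flags below are illustrative, not taken from the repository.
      - run: cargo clippy --workspace --all-targets -- -D warnings
```

Rolling back is the reverse edit: set `runs-on: ubuntu-latest` again.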
The previous clippy run on the self-hosted runner failed at highs-sys build because cmake wasn't on the host. smithy main now ships the common Rust build-dep set (cmake, clang, lld, perl, m4, protobuf-compiler, libclang-dev, zlib1g-dev). Pushing an empty commit to re-trigger CI; clippy should now finish on rust-cpu.
Builds on the proven clippy migration (PR description, original
commit on this branch). Two separate concerns:
1) ci.yml — broaden the migration
Migrate every gating job that doesn't need infra we don't have on
the smithy host. Two stay on ubuntu-latest with explicit comments
explaining why; everything else now targets the matching smithy
runner class:
- rust-cpu (12G MemoryHigh): clippy, test, bench-smoke, coverage, proptest, fuzz-smoke, rivet-validate
- lean-mem (24G MemoryHigh): miri, mutants
- light (4G MemoryHigh): fmt, audit, deny, supply-chain
- ubuntu-latest (kept): bazel-test (no Bazel on host), kani (kani-verifier bundles CBMC, ~100 MB install; not worth pre-provisioning until kani sees more use)
The lean-mem class for miri / mutants is deliberate: both are
RAM-aggressive (Miri's borrow tracker, mutants' parallel cargo
invocations). The 24G MemoryHigh ceiling on smithy lean-mem
runners is comfortably above the 12G rust-cpu cap.
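A minimal sketch of how the class mapping above could appear in ci.yml, one job per class plus one that stays hosted. The `lean-mem` and `light` label lists are assumed by analogy with the `rust-cpu` list given in the PR description, and the step commands are placeholders rather than the repository's actual steps.

```yaml
jobs:
  miri:
    # RAM-aggressive, so it targets the 24G class (label list assumed).
    runs-on: [self-hosted, linux, x64, lean-mem]
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: miri
      - run: cargo miri test

  fmt:
    # Cheap job, so it targets the 4G class (label list assumed).
    runs-on: [self-hosted, linux, x64, light]
    steps:
      - uses: actions/checkout@v4
      - run: cargo fmt --all -- --check

  bazel-test:
    # Kept on GitHub-hosted compute: no Bazel on the smithy host.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bazel test //...              # placeholder target pattern
```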
2) mutants-weekly.yml — new heavy-quality workflow
Counterpart to the gating `mutants:` job in ci.yml. Different
operational pattern (smithy DD-pattern for "heavy quality"):
- schedule: 02:00 UTC every Sunday + workflow_dispatch on demand
- runs-on: lean-mem (24G), timeout-minutes: 720
- concurrency.cancel-in-progress: false (never cancel a quality run)
- workflow_dispatch inputs: `shard` (default 0/8 for sanity, "all"
for the full ~hours pass) + `packages` (space-separated -p list)
- results land in GITHUB_STEP_SUMMARY (markdown table of
missed/caught/timeout/unviable) plus an uploaded artefact with
90-day retention
- no PR red lights; no auto-Issue filing yet (that's a follow-up
once the report shape stabilises)
This is the second-pattern pilot the smithy fleet was sized for —
the lean-mem runners have been idle since registration; this puts
them on the work they were labelled for.
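The workflow described above could be sketched roughly as follows. The trigger times, timeout, concurrency setting, input names, and 90-day retention come from the commit message; the label list, step layout, artifact name, and the shell that builds the summary table are assumptions. The sketch also assumes `packages` already holds `-p name` pairs and that cargo-mutants writes its per-outcome lists under `mutants.out/`.

```yaml
name: mutants-weekly
on:
  schedule:
    - cron: "0 2 * * 0"                    # 02:00 UTC every Sunday
  workflow_dispatch:
    inputs:
      shard:
        description: 'Shard as k/n, or "all" for the full pass'
        default: "0/8"
      packages:
        description: "Space-separated -p list"
        default: ""
concurrency:
  group: mutants-weekly
  cancel-in-progress: false                # never cancel a quality run
jobs:
  mutants:
    runs-on: [self-hosted, linux, x64, lean-mem]   # label list assumed
    timeout-minutes: 720
    steps:
      - uses: actions/checkout@v4
      - name: Run cargo-mutants
        env:
          # inputs are empty on scheduled runs, so fall back to the default.
          SHARD: ${{ inputs.shard || '0/8' }}
          PACKAGES: ${{ inputs.packages }}
        run: |
          args=()
          if [ "$SHARD" != "all" ]; then args+=(--shard "$SHARD"); fi
          # PACKAGES is intentionally unquoted so "-p a -p b" splits into words.
          cargo mutants "${args[@]}" $PACKAGES
      - name: Summarise outcomes
        if: always()
        run: |
          {
            echo "| outcome | count |"
            echo "| --- | --- |"
            for f in missed caught timeout unviable; do
              echo "| $f | $(cat "mutants.out/$f.txt" 2>/dev/null | wc -l) |"
            done
          } >> "$GITHUB_STEP_SUMMARY"
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mutants-report             # hypothetical artifact name
          path: mutants.out
          retention-days: 90
```

Because the workflow has no `pull_request` trigger, a failing run never shows up as a red check on a PR.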
GitHub limits workflow_dispatch and schedule triggers to workflows that already exist on the default branch. Adding a path-filtered push trigger lets us exercise the workflow on this PR before merge. The push: block carries a TEMPORARY marker; remove it before merge.
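A sketch of what such a temporary trigger could look like. The path filter value is an assumption (the workflow file's own path); the comment stands in for the TEMPORARY marker mentioned above.

```yaml
on:
  # TEMPORARY: lets the workflow run on this PR branch; remove before merge.
  push:
    paths:
      - .github/workflows/mutants-weekly.yml
  schedule:
    - cron: "0 2 * * 0"
  workflow_dispatch:
```

With the path filter, only pushes that touch the workflow file itself start a run, so ordinary commits on the branch do not trigger the long mutation pass.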
Prior run hit 'Permission denied (os error 13)' on .d files in target/. Direct file-write tests as the runner user succeed; the files are owned correctly with mode 640. Suspect: stale state left by a cancelled run interacting badly with concurrent jobs landing on the same runner via cache restoration. Clearing all runner _work and the shared sccache to bisect: if a clean run also fails, it's not stale state.
Disabled RUSTC_WRAPPER in runner env (smithy commit 65e57a2); runners restarted to pick up the new environment. bpftrace running on host capturing every openat returning EACCES with PID/UID/comm/filename. Pushing this empty commit to fire CI.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
The rustsec/audit-check action bundles an older cargo-audit that can't parse CVSS 4.0 advisories like RUSTSEC-2026-0037 and exits non-zero on the parse error before evaluating spar's Cargo.lock. cargo-audit is pre-installed on smithy at v0.21.2 (toolchains role), which handles CVSS 4.0 fine. Same effect (audit blocks PRs on advisory hits) without the wrapper.
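Invoking the pre-installed binary directly might look like the sketch below. The job name and label list are assumptions; `cargo audit` exits non-zero when it finds a vulnerability, which is what keeps the job gating.

```yaml
jobs:
  audit:
    runs-on: [self-hosted, linux, x64, light]   # label list assumed
    steps:
      - uses: actions/checkout@v4
      # Uses the cargo-audit already on the runner instead of a wrapper action.
      - run: cargo audit
```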
Smithy main now ships:
- subuid/subgid for runner1..8 (Cargo Deny rootless container fix)
- CARGO_HOME/bin on the runner env PATH (Rivet validate fix)
- always-on bpftrace EACCES tracing (smithy-trace-eacces.service)
Plus this branch carries:
- cargo audit invoked directly (replaces broken rustsec/audit-check)
All runners restarted with new env. This commit fires fresh CI.
…roken)
Two adjustments after the smithy subuid + PATH fixes landed:
1. cargo-deny: drop EmbarkStudios/cargo-deny-action@v2 (which runs
in a rootless container) in favour of direct `cargo deny check`.
Smithy has cargo-deny installed (toolchains role v0.16.4). The
container action fails on our hardened runner systemd unit:
newuidmap is setuid but NoNewPrivileges=true blocks the
escalation, so the rootless namespace can't be set up. Going
direct sidesteps the entire interaction; we'd otherwise need to
weaken the runner hardening for this single workflow.
2. audit: back to ubuntu-latest temporarily. Smithy ships cargo-audit
v0.21.2 which still rejects RUSTSEC-2026-0037 ('unsupported CVSS
version: 4.0') even though upstream rustsec 0.30+ supports CVSS
4.0. v0.22.1 would fix it but that build trips on our
sccache-on-cc setup (aws-lc-sys C compile through sccache fails).
Move back once smithy ships an upgraded cargo-audit.
Surfaced when running `cargo deny check` directly with the toolchains-role-installed cargo-deny v0.16.4 on smithy:
error[deprecated]: this key has been removed, see EmbarkStudios/cargo-deny#611
The yanked + licenses + bans + sources sections still gate normally. Unmaintained-crate detection moved out of the static config in newer cargo-deny; revisit if/when we want to re-enable that signal.
cargo-deny and cargo-audit share the same rustsec advisory parser. Both fail at the same point on RUSTSEC-2026-0037 because the embedded rustsec rejects CVSS 4.0 strings. The audit job (on hosted) still covers vulnerability matching; cargo-deny here keeps gating bans, licenses, and sources, which is what it actually adds beyond audit. Drop the workaround once smithy ships an upgraded rustsec parser (tracked alongside the cargo-audit upgrade).
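One way to express that split is to name the checks explicitly so the advisories check is not run. This is a sketch under assumptions: the job name and label list are illustrative, and the exact command used on this branch is not shown in the commit message.

```yaml
jobs:
  deny:
    runs-on: [self-hosted, linux, x64, light]   # label list assumed
    steps:
      - uses: actions/checkout@v4
      # Naming the checks skips "advisories", which is where the CVSS 4.0
      # parse failure occurs; vulnerability matching stays with the audit job.
      - run: cargo deny check bans licenses sources
```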
This was referenced May 3, 2026
avrabe added a commit to pulseengine/rivet that referenced this pull request on May 10, 2026
* ci: migrate 16 of 21 ci.yml jobs to smithy self-hosted runners
Builds on the spar pilot (pulseengine/spar#201): same runner-class mapping, same workarounds for the rustsec parser CVSS 4.0 issue, same direct-cargo-deny pattern.
Migrated to smithy:
- rust-cpu: clippy, docs-check, test, semver-checks, coverage, proptest, fuzz, msrv
- lean-mem: miri, mutants, verus
- light: fmt, yaml-lint, deny, supply-chain, release-results
Stay on ubuntu-latest (each with explanatory comment in-place):
- playwright (--with-deps does sudo apt-get; smithy runners no sudo)
- vscode-extension (xvfb-run + downloaded VS Code Test setup)
- audit (cargo-audit 0.21 rustsec parser rejects CVSS 4.0)
- kani (kani-verifier bundles CBMC, ~100 MB install)
- rocq (Coq install, not on smithy yet)
Two non-trivial fixes inside migrated jobs:
- test: actionlint install moved from `sudo mv /tmp/actionlint /usr/local/bin` to `mv /tmp/actionlint $HOME/.local/bin` plus GITHUB_PATH update. Smithy runners have no sudo; same binary, different writable location.
- deny: dropped the `cargo deny check` (which would fail loading advisory-db with CVSS 4.0) for `cargo deny check bans licenses sources`. The audit job (still on hosted) covers vulnerability matching meanwhile.
Expected improvement: spar's broad migration showed ~470x end-to-end speedup on clippy (~470 min → 1 min) thanks to queue elimination. Rivet should see similar; its recent runs showed 600+ min total.

* ci(miri): bump timeout-minutes 15->30 after smithy run hit limit
First migration run timed out exactly at 15:00 with tests still progressing (last printed test at ~11:00). Smithy's lean-mem class appears to run the slow tail tests slower than the previous hosted runner did. This could be cgroup memory pressure (24G MemoryHigh under Miri's shadow allocations) or just longer tail test perf. Bumping the budget conservatively; revisit once we have a few green runs to dial it back closer to actual.
Semver Checks is also failing on this PR: upstream issue ('unsupported rustdoc format v57', the action ships a too-old cargo-semver-checks). NOT a smithy-migration issue; would fail on hosted too. Tracked as a separate followup; doesn't block this PR.

* ci: retrigger after smithy TMPDIR fix
Smithy main now points TMPDIR / TMP / TEMP at the per-runner /var/lib/runners/runnerN/_tmp on lv_runners (500 G), instead of the host's /tmp on lv_root (80 G). Previous run hit 'no space left on device' when the rivet HTML-export test ran out of root FS budget. Runners restarted; this commit triggers a fresh CI.

* ci(semver-checks): replace stale wrapper action with direct cargo install
obi1kenobi/cargo-semver-checks-action@v2 bundles an older cargo-semver-checks that doesn't recognise rustdoc JSON v57 (the format current stable rustdoc emits). Every PR run failed with 'unsupported rustdoc format v57 for file: rivet_core.json'. Going direct: install the latest cargo-semver-checks at job time and invoke it. Slightly slower on cold cache but tracks the upstream rustdoc format. Same end-effect as the wrapper. Caught during the rivet broad-CI smithy migration (PR #262); not related to self-hosted vs hosted.

* ci: retrigger after smithy disk cleanup + tmpfiles policy
lv_runners had filled to 100% from accumulated per-runner _tmp. Smithy main commit b4af61e adds /etc/tmpfiles.d/smithy-runner-tmp.conf to age files >24 h out of those dirs daily. Manual cleanup ran today (466G -> 92G used). Re-triggering CI to confirm Miri / Mutation / Verus jobs land green now that disk is back to 20% used.
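The job-level fixes described in these rivet commits could be sketched as below. The `mv` target, the GITHUB_PATH update, the 30-minute Miri budget, and the direct cargo-semver-checks install come from the commit messages; the label lists, step names, and surrounding steps are assumptions, and the actionlint step assumes an earlier step already placed the binary at /tmp/actionlint.

```yaml
jobs:
  test:
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4
      - name: Install actionlint without sudo
        run: |
          # Assumes /tmp/actionlint was downloaded by an earlier step.
          mkdir -p "$HOME/.local/bin"
          mv /tmp/actionlint "$HOME/.local/bin/actionlint"
          # Makes the directory visible to later steps in this job.
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"

  miri:
    runs-on: [self-hosted, linux, x64, lean-mem]   # label list assumed
    timeout-minutes: 30                            # bumped from 15
    steps:
      - uses: actions/checkout@v4
      - run: cargo miri test

  semver-checks:
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4
      # Installs the current release at job time instead of the wrapper action.
      - run: cargo install cargo-semver-checks --locked
      - run: cargo semver-checks
```

Entries appended to `$GITHUB_PATH` take effect for subsequent steps, not the step that writes them, which is why the install gets its own step.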
Summary
First pilot migration of a CI job from GitHub-hosted to the pulseengine self-hosted fleet (`hetzner-private` runner group on `pulseengine-ci-01`). Scope deliberately small: just the `clippy` job, switched to `[self-hosted, linux, x64, rust-cpu]`. Other jobs (`fmt`, `test`, `proofs`) stay on `ubuntu-latest`.
Rationale
- … which is GitHub-hosted runner queueing on the org-free tier (20-concurrent cap).
- … but bounded: failure doesn't block format checks or tests.
- No `sudo`, `apt`, or container needed → no friction with our rootless runner setup.
- spar already tracks nightly via `dtolnay/rust-toolchain`, so the toolchain version matches between hosted and self-hosted.
Test plan
- … `rust-cpu` runner (1 of 5/6/7) within seconds (no GitHub queue)
- … `ubuntu-latest` as before
Rollback
Revert this commit. `runs-on:` flips back to `ubuntu-latest` and the next run uses GitHub-hosted compute.
Follow-ups (if green)
- … `fmt` and `test` next (separate PRs).
- … `mutants-weekly.yml`) that targets `lean-mem` runners, separate from gating CI.