ci: pilot-migrate clippy job to smithy self-hosted runners by avrabe · Pull Request #201 · pulseengine/spar

avrabe · 2026-05-03T06:01:38Z

Summary

First pilot migration of a CI job from GitHub-hosted to the
pulseengine self-hosted fleet (hetzner-private runner group on
pulseengine-ci-01). Scope deliberately small: just the clippy
job, switched to [self-hosted, linux, x64, rust-cpu]. Other jobs
(fmt, test, proofs) stay on ubuntu-latest.
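
The change amounts to swapping the clippy job's `runs-on` value. A minimal sketch, assuming a conventional job layout: the checkout version, the `components` input, and the clippy flags are illustrative assumptions and are not copied from spar's ci.yml.

```yaml
jobs:
  clippy:
    # Was: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux, x64, rust-cpu]
    steps:
      - uses: actions/checkout@v4
      # Nightly toolchain, installed the same way on hosted and self-hosted.
      - uses: dtolnay/rust-toolchain@nightly
        with:
          components: clippy
      # Assumed invocation; the real job may pass different flags.
      - run: cargo clippy --workspace --all-targets -- -D warnings
```

A runner picks up the job only if it carries every label in the list, which is how the job ends up on the rust-cpu class rather than on any self-hosted machine.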

Rationale

- Spar's recent CI runs show 400-600 min completion times, much of
  which is GitHub-hosted runner queueing on the org-free tier
  (20-concurrent cap).
- Clippy is meaningful compile work (good sccache integration test)
  but bounded: failure doesn't block format checks or tests.
- No sudo, apt, or container needed → no friction with our
  rootless runner setup.
- Spar already pins nightly via dtolnay/rust-toolchain, so the
  toolchain version matches between hosted and self-hosted.

Test plan

- CI run completes: clippy job lands on a rust-cpu runner (one of runners 5/6/7) within seconds (no GitHub queue)
- Compile succeeds end-to-end with no clippy warnings
- Other jobs (fmt, test) still run on ubuntu-latest as before
- A second push to this branch should be much faster on clippy thanks to sccache hits

Rollback

Revert this commit. runs-on: flips back to ubuntu-latest and
the next run uses GitHub-hosted compute.

Follow-ups (if green)

- Migrate fmt and test next (separate PRs).
- Add a heavy-quality workflow (mutants-weekly.yml) that targets
  lean-mem runners, separate from gating CI.

Switches just the clippy job from ubuntu-latest to [self-hosted, linux, x64, rust-cpu], i.e. one of the three rust-cpu runners on pulseengine-ci-01 (hetzner-private group). Other jobs (fmt, test) stay on ubuntu-latest for now; once we have a few green clippy runs and timing data, the rest can follow.

Why clippy first:
- meaningful compile work (good sccache test)
- bounded scope: failure doesn't block fmt or test
- no sudo, apt, or container needed
- spar already tracks nightly via dtolnay/rust-toolchain so the toolchain matches between hosted and self-hosted

If this PR's clippy job goes red on the self-hosted runner but passes locally / on hosted, that's a smithy bug, not a code bug.

The previous clippy run on the self-hosted runner failed at highs-sys build because cmake wasn't on the host. smithy main now ships the common Rust build-dep set (cmake, clang, lld, perl, m4, protobuf-compiler, libclang-dev, zlib1g-dev). Pushing an empty commit to re-trigger CI; clippy should now finish on rust-cpu.

Builds on the proven clippy migration (PR description, original commit on this branch). Two separate concerns:

1) ci.yml: broaden the migration

Migrate every gating job that doesn't need infra we don't have on the smithy host. Two stay on ubuntu-latest with explicit comments explaining why; everything else now targets the matching smithy runner class:
- rust-cpu (12G MemoryHigh): clippy, test, bench-smoke, coverage, proptest, fuzz-smoke, rivet-validate
- lean-mem (24G MemoryHigh): miri, mutants
- light (4G MemoryHigh): fmt, audit, deny, supply-chain
- ubuntu-latest (kept): bazel-test (no Bazel on host), kani (kani-verifier bundles CBMC, ~100 MB install; not worth pre-provisioning until kani sees more use)

The lean-mem class for miri / mutants is deliberate: both are RAM-aggressive (Miri's borrow tracker, mutants' parallel cargo invocations). The 24G MemoryHigh ceiling on smithy lean-mem runners is comfortably above the 12G rust-cpu cap.

2) mutants-weekly.yml: new heavy-quality workflow

Counterpart to the gating `mutants:` job in ci.yml. Different operational pattern (smithy DD-pattern for "heavy quality"):
- schedule: 02:00 UTC every Sunday + workflow_dispatch on demand
- runs-on: lean-mem (24G), timeout-minutes: 720
- concurrency.cancel-in-progress: false (never cancel a quality run)
- workflow_dispatch inputs: `shard` (default 0/8 for sanity, "all" for the full ~hours pass) + `packages` (space-separated -p list)
- results land in GITHUB_STEP_SUMMARY (markdown table of missed/caught/timeout/unviable) plus an uploaded artefact with 90-day retention
- no PR red lights; no auto-Issue filing yet (that's a follow-up once the report shape stabilises)

This is the second-pattern pilot the smithy fleet was sized for: the lean-mem runners have been idle since registration; this puts them on the work they were labelled for.
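
The operational settings listed for mutants-weekly.yml can be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow from this branch: the runner label set, the concurrency group name, the artefact name, and the shell logic around `cargo mutants` are assumed, and the GITHUB_STEP_SUMMARY table step is left out.

```yaml
name: mutants-weekly
on:
  schedule:
    - cron: "0 2 * * 0"          # 02:00 UTC every Sunday
  workflow_dispatch:
    inputs:
      shard:
        description: 'Shard such as "0/8", or "all" for the full pass'
        default: "0/8"
      packages:
        description: "Space-separated -p list"
        default: ""
concurrency:
  group: mutants-weekly           # assumed group name
  cancel-in-progress: false       # never cancel a quality run
jobs:
  mutants:
    runs-on: [self-hosted, linux, x64, lean-mem]   # assumed label set
    timeout-minutes: 720
    steps:
      - uses: actions/checkout@v4
      - name: Run cargo-mutants
        env:
          # Scheduled runs have no inputs, so fall back to the sanity shard.
          SHARD: ${{ inputs.shard || '0/8' }}
          PACKAGES: ${{ inputs.packages }}
        run: |
          # PACKAGES is deliberately unquoted so "-p a -p b" splits into arguments.
          if [ "$SHARD" = "all" ]; then
            cargo mutants $PACKAGES
          else
            cargo mutants --shard "$SHARD" $PACKAGES
          fi
      - uses: actions/upload-artifact@v4
        if: always()              # keep the report even when mutants are missed
        with:
          name: mutants-out       # assumed artefact name
          path: mutants.out
          retention-days: 90
```

Because cargo-mutants exits non-zero when mutants are missed, the upload step runs under `if: always()` so the report survives a failing pass.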

GitHub limits workflow_dispatch and schedule triggers to workflows that already exist on the default branch. Adding a path-filtered push trigger lets us exercise the workflow on this PR before merge. The push: block carries a TEMPORARY marker; remove it before merge.
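
A sketch of what such a temporary trigger could look like. The exact path filter is an assumption (the commit message only says the trigger is path-filtered); here it is limited to the workflow file itself.

```yaml
on:
  # TEMPORARY: lets the workflow run on this PR branch; remove before merge.
  push:
    paths:
      - ".github/workflows/mutants-weekly.yml"   # assumed filter
  schedule:
    - cron: "0 2 * * 0"
  workflow_dispatch:
```

With this filter, only pushes that touch the workflow file start a run, so ordinary commits on the branch do not start a multi-hour mutation pass.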

Prior run hit 'Permission denied (os error 13)' on .d files in target/. Direct file-write tests as the runner user succeed; the files are owned correctly with mode 640. Suspect: stale state left by a cancelled run interacting badly with concurrent jobs landing on the same runner via cache restoration. Clearing all runner _work and the shared sccache to bisect: if a clean run also fails, it's not stale state.

Disabled RUSTC_WRAPPER in runner env (smithy commit 65e57a2); runners restarted to pick up the new environment. bpftrace running on host capturing every openat returning EACCES with PID/UID/comm/filename. Pushing this empty commit to fire CI.

codecov · 2026-05-03T08:55:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


The rustsec/audit-check action bundles an older cargo-audit that can't parse CVSS 4.0 advisories like RUSTSEC-2026-0037 and exits non-zero on the parse error before evaluating spar's Cargo.lock. cargo-audit is pre-installed on smithy at v0.21.2 (toolchains role), which handles CVSS 4.0 fine. Invoking it directly has the same effect (audit blocks PRs on advisory hits) without the wrapper.

Smithy main now ships:
- subuid/subgid for runner1..8 (Cargo Deny rootless container fix)
- CARGO_HOME/bin on the runner env PATH (Rivet validate fix)
- always-on bpftrace EACCES tracing (smithy-trace-eacces.service)

Plus this branch carries:
- cargo audit invoked directly (replaces broken rustsec/audit-check)

All runners restarted with new env. This commit fires fresh CI.

…roken)

Two adjustments after the smithy subuid + PATH fixes landed:

1. cargo-deny: drop EmbarkStudios/cargo-deny-action@v2 (which runs in a rootless container) in favour of direct `cargo deny check`. Smithy has cargo-deny installed (toolchains role v0.16.4). The container action fails on our hardened runner systemd unit: newuidmap is setuid but NoNewPrivileges=true blocks the escalation, so the rootless namespace can't be set up. Going direct sidesteps the entire interaction; we'd otherwise need to weaken the runner hardening for this single workflow.

2. audit: back to ubuntu-latest temporarily. Smithy ships cargo-audit v0.21.2 which still rejects RUSTSEC-2026-0037 ('unsupported CVSS version: 4.0') even though upstream rustsec 0.30+ supports CVSS 4.0. v0.22.1 would fix it but that build trips on our sccache-on-cc setup (aws-lc-sys C compile through sccache fails). Move back once smithy ships an upgraded cargo-audit.

Surfaced when running `cargo deny check` directly with the toolchains-role-installed cargo-deny v0.16.4 on smithy:

  error[deprecated]: this key has been removed, see EmbarkStudios/cargo-deny#611

The yanked + licenses + bans + sources sections still gate normally. Unmaintained-crate detection moved out of the static config in newer cargo-deny; revisit if/when we want to re-enable that signal.

cargo-deny and cargo-audit share the same rustsec advisory parser. Both fail at the same point on RUSTSEC-2026-0037 because the embedded rustsec rejects CVSS 4.0 strings. The audit job (on hosted) still covers vulnerability matching; cargo-deny here keeps gating bans, licenses, and sources, which is what it actually adds beyond audit. Drop the workaround once smithy ships an upgraded rustsec parser (tracked alongside the cargo-audit upgrade).
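
The resulting deny job might look roughly like this. It is a sketch under assumptions: the runner label set and the checkout version are illustrative, and cargo-deny is taken to be on the runner's PATH already, as the commit messages describe for smithy.

```yaml
jobs:
  deny:
    runs-on: [self-hosted, linux, x64, light]   # assumed label set
    steps:
      - uses: actions/checkout@v4
      # Was: uses: EmbarkStudios/cargo-deny-action@v2 (rootless container).
      # The advisories check is skipped until the rustsec parser accepts CVSS 4.0;
      # vulnerability matching stays with the audit job on ubuntu-latest.
      - run: cargo deny check bans licenses sources
```

Naming the checks explicitly means cargo-deny never loads the advisory database, which is where the CVSS 4.0 parse failure occurs.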

* ci: migrate 16 of 21 ci.yml jobs to smithy self-hosted runners

  Builds on the spar pilot (pulseengine/spar#201): same runner-class mapping, same workarounds for the rustsec parser CVSS 4.0 issue, same direct-cargo-deny pattern.

  Migrated to smithy:
  - rust-cpu: clippy, docs-check, test, semver-checks, coverage, proptest, fuzz, msrv
  - lean-mem: miri, mutants, verus
  - light: fmt, yaml-lint, deny, supply-chain, release-results

  Stay on ubuntu-latest (each with explanatory comment in-place):
  - playwright (--with-deps does sudo apt-get; smithy runners have no sudo)
  - vscode-extension (xvfb-run + downloaded VS Code Test setup)
  - audit (cargo-audit 0.21 rustsec parser rejects CVSS 4.0)
  - kani (kani-verifier bundles CBMC, ~100 MB install)
  - rocq (Coq install, not on smithy yet)

  Two non-trivial fixes inside migrated jobs:
  - test: actionlint install moved from `sudo mv /tmp/actionlint /usr/local/bin` to `mv /tmp/actionlint $HOME/.local/bin` plus GITHUB_PATH update. Smithy runners have no sudo; same binary, different writable location.
  - deny: replaced the full `cargo deny check` (which would fail loading advisory-db with CVSS 4.0) with `cargo deny check bans licenses sources`. The audit job (still on hosted) covers vulnerability matching meanwhile.

  Expected improvement: spar's broad migration showed ~470x end-to-end speedup on clippy (~470 min → 1 min) thanks to queue elimination. Rivet should see similar; its recent runs showed 600+ min total.

* ci(miri): bump timeout-minutes 15->30 after smithy run hit limit

  First migration run timed out exactly at 15:00 with tests still progressing (last printed test at ~11:00). Smithy's lean-mem class appears to run the slow tail tests slower than the previous hosted runner did. This could be cgroup memory pressure (24G MemoryHigh under Miri's shadow allocations) or just longer tail test perf. Bumping the budget conservatively; revisit once we have a few green runs to dial it back closer to actual.

  Semver Checks is also failing on this PR due to an upstream issue ('unsupported rustdoc format v57'; the action ships a too-old cargo-semver-checks). NOT a smithy-migration issue; would fail on hosted too. Tracked as a separate followup; doesn't block this PR.

* ci: retrigger after smithy TMPDIR fix

  Smithy main now points TMPDIR / TMP / TEMP at the per-runner /var/lib/runners/runnerN/_tmp on lv_runners (500 G), instead of the host's /tmp on lv_root (80 G). Previous run hit 'no space left on device' when the rivet HTML-export test ran out of root FS budget. Runners restarted; this commit triggers a fresh CI.

* ci(semver-checks): replace stale wrapper action with direct cargo install

  obi1kenobi/cargo-semver-checks-action@v2 bundles an older cargo-semver-checks that doesn't recognise rustdoc JSON v57 (the format current stable rustdoc emits). Every PR run failed with 'unsupported rustdoc format v57 for file: rivet_core.json'.

  Going direct: install the latest cargo-semver-checks at job time and invoke it. Slightly slower on cold cache but tracks the upstream rustdoc format. Same end-effect as the wrapper.

  Caught during the rivet broad-CI smithy migration (PR #262); not related to self-hosted vs hosted.

* ci: retrigger after smithy disk cleanup + tmpfiles policy

  lv_runners had filled to 100% from accumulated per-runner _tmp. Smithy main commit b4af61e adds /etc/tmpfiles.d/smithy-runner-tmp.conf to age files >24 h out of those dirs daily. Manual cleanup ran today (466G -> 92G used). Re-triggering CI to confirm Miri / Mutation / Verus jobs land green now that disk is back to 20% used.
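
The two sudo-free and wrapper-free steps described in these commits can be sketched as step fragments. The step names, the `mkdir -p`, and the `--locked` flag are assumptions added for illustration; the two steps belong to different jobs (test and semver-checks) and are shown together only for brevity.

```yaml
steps:
  # test job: install actionlint into a user-writable directory instead of /usr/local/bin.
  - name: Install actionlint without sudo
    run: |
      mkdir -p "$HOME/.local/bin"
      mv /tmp/actionlint "$HOME/.local/bin/actionlint"
      # Make the directory visible on PATH for all later steps in the job.
      echo "$HOME/.local/bin" >> "$GITHUB_PATH"

  # semver-checks job: install the current release at job time instead of using the wrapper action.
  - name: Semver checks (direct)
    run: |
      cargo install cargo-semver-checks --locked
      cargo semver-checks
```

Entries appended to `GITHUB_PATH` take effect for subsequent steps only, so any step that calls actionlint has to come after the install step.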

avrabe added 6 commits May 3, 2026 07:54

avrabe added 5 commits May 3, 2026 11:32

avrabe enabled auto-merge (squash) May 3, 2026 13:35

Merge branch 'main' into smithy-clippy-pilot

cd06c70

avrabe added 2 commits May 10, 2026 09:16

Merge branch 'main' into smithy-clippy-pilot

002e0dc

Merge branch 'main' into smithy-clippy-pilot

1c7e8a2

avrabe merged commit 2104cc6 into main May 11, 2026
15 of 30 checks passed

avrabe deleted the smithy-clippy-pilot branch May 11, 2026 04:25

avrabe merged 14 commits into main from smithy-clippy-pilot
