ci: stop the Verus/Mutation/cargo-vet self-hosted runner flakes by avrabe · Pull Request #281 · pulseengine/rivet

avrabe · 2026-05-14T18:42:03Z

Summary

Three flake classes have been hitting every PR today on self-hosted runners, blocking CI green on changes that have nothing to do with the underlying subsystem.

| Flake | Job | Symptom | Fix |
|---|---|---|---|
| 1 | Verus Proofs + Rocq Proofs | `sudo: "no new privileges" flag is set` | Switch from `cachix/install-nix-action` to `DeterminateSystems/nix-installer-action` (daemonless, no sudo) |
| 2 | Mutation Testing (rivet-cli) | `No space left on device` during artifact upload | Prune `mutants-out/` + restrict upload path to text/JSON reports only |
| 3 | Supply Chain (cargo-vet) | `The runner has received a shutdown signal` | Wrap in `nick-fields/retry@v3` (2 attempts, retry on error) |

1. Verus + Rocq

The self-hosted runners run with systemd `NoNewPrivileges=true`. `cachix/install-nix-action@v31` shells out to `sudo` mid-install, which dies with "no new privileges". Switch to `DeterminateSystems/nix-installer-action@main` with `init: none` (daemonless single-user mode). Add a follow-up step to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install, comparable speed once cached. Keeps the lean-mem requirement intact.
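
The installer swap can be sketched as the two workflow steps below. This is an illustration under assumptions, not the exact diff: the step names are hypothetical, and only the action reference, `init: none`, and the `~/.nix-profile/bin` PATH entry come from the description above.

```yaml
steps:
  # Install Nix without an init system, so no daemon is started.
  - name: Install Nix (daemonless)   # hypothetical step name
    uses: DeterminateSystems/nix-installer-action@main
    with:
      init: none

  # Appending to GITHUB_PATH makes the profile binaries visible to all later steps.
  - name: Add Nix profile to PATH    # hypothetical step name
    run: echo "$HOME/.nix-profile/bin" >> "$GITHUB_PATH"
```

Writing to `$GITHUB_PATH` rather than exporting `PATH` inside a single `run:` block matters here because each step starts a fresh shell.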

2. Mutation Testing

cargo-mutants writes a per-mutant target directory under `mutants-out/`. Seventeen shards landing on the same pool with the Swatinem cache hot leave 5-15 GB behind each. The next shard's upload step dies with ENOSPC during the upload itself — after cargo-mutants finished cleanly. Two changes:

- Add a `Prune stale mutants artefacts` step before the run.
- Restrict `upload-artifact` path to text/JSON reports only, skipping the per-mutant target dirs that drive the bloat. The text reports are what matter for triage.
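
A rough sketch of the two changes, under stated assumptions: cargo-mutants is taken to run with `--output mutants-out` (so reports land in `mutants-out/mutants.out/`), and stale per-mutant build directories are taken to live in `$RUNNER_TEMP` under a `cargo-mutants-*` name. The artifact name and both paths are illustrative and may not match the real workflow.

```yaml
steps:
  - name: Prune stale mutants artefacts
    run: |
      # Drop the previous run's output directory.
      rm -rf mutants-out
      # Assumed location and name pattern of leftover build dirs;
      # -mmin +1440 selects entries last modified more than 24 hours ago.
      find "$RUNNER_TEMP" -mindepth 1 -maxdepth 1 -type d \
        -name 'cargo-mutants-*' -mmin +1440 -exec rm -rf {} +

  # ... the cargo-mutants run goes here ...

  - name: Upload mutants report
    uses: actions/upload-artifact@v4
    with:
      name: mutants-report   # hypothetical artifact name
      # Only the small text/JSON reports, never the per-mutant target dirs.
      path: |
        mutants-out/mutants.out/*.txt
        mutants-out/mutants.out/*.json
      if-no-files-found: ignore
```

Listing explicit report globs instead of the whole directory keeps the upload small even if a prune is skipped or incomplete.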

3. cargo-vet

GitHub Actions doesn't expose "runner restarted under me" to `if:` conditions, but `nick-fields/retry@v3` with `retry_on: error` handles it correctly — if the runner agent dies mid-step the step exits non-zero, the action retries on a different runner. Two attempts is enough; a third shutdown in ten minutes would point at runner-pool sizing, not this job.
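
The retry wrapper might look roughly like the step below. `max_attempts: 2` and `retry_on: error` follow the description; the step name, the `timeout_minutes` value (the action requires a timeout input), and the exact `cargo vet` invocation are assumptions.

```yaml
steps:
  - name: cargo vet (with retry)   # hypothetical step name
    uses: nick-fields/retry@v3
    with:
      max_attempts: 2
      retry_on: error              # retry on a non-zero exit, not on timeout
      timeout_minutes: 10          # assumed value
      command: cargo vet --locked  # assumed invocation
```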

Cross-cutting (host-level follow-ups)

Each fix has a durable host-level counterpart that should land in the runner Ansible repo:

| Flake | Host fix |
|---|---|
| Verus / Rocq | Drop `NoNewPrivileges=` from the runner systemd unit |
| Mutation Testing | Lower `post-job.sh` disk threshold from 70% to ~50% |
| cargo-vet | Grow the lean-mem / light pool size |

The workflow-level changes here unblock CI today; the host-level work is the durable fix.

Test plan

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml').read())"` parses OK
- `git diff --stat` confirms only the three targeted jobs changed
- CI run on this PR: Verus, Rocq, Mutation Testing, and Supply Chain should all complete cleanly. Verus may still report an SMT proof failure under `continue-on-error: true`; that's fine, because the flake is the installer, not the proof.

Trailer

Refs: CI infrastructure; no artifact trailers required (CI-only change, exempt per CLAUDE.md).

🤖 Generated with Claude Code

Three flake classes hitting every PR today, all on self-hosted runners:

1. Verus + Rocq Proofs: "sudo: 'no new privileges' flag is set"
2. Mutation Testing (rivet-cli): "No space left on device"
3. Supply Chain (cargo-vet): "runner has received a shutdown signal"

All three block green CI on PRs that have nothing to do with the underlying subsystem. Each fix is workflow-level so it lands without runner-host changes.

**1) Verus + Rocq: switch Nix installer.** The self-hosted runners run with systemd `NoNewPrivileges=true`, which breaks `cachix/install-nix-action@v31` because it shells out to sudo mid-install. Switch to `DeterminateSystems/nix-installer-action@main` with `init: none` (daemonless single-user mode). Add a follow-up step to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install, comparable speed once cached. Keeps the lean-mem requirement intact: the Verus solver still wants the RAM.

**2) Mutation Testing: prune + restrict upload.** cargo-mutants writes a per-mutant target directory under `mutants-out/`. Seventeen shards landing on the same pool, with the Swatinem cache hot, leave 5-15 GB behind each. The next shard's `Upload mutants report` step dies with ENOSPC during the upload itself, after cargo-mutants has finished cleanly. Two changes:

- Add a `Prune stale mutants artefacts` step before the run (rm mutants-out, find + rm matching target subdirs older than a day).
- Restrict `upload-artifact` path to the text/JSON reports only, skipping the per-mutant target directories that drive the bloat. The text reports are what matter for triage.

**3) cargo-vet: wrap in retry.** GitHub Actions doesn't expose "runner restarted under me" to `if:` conditions, but `nick-fields/retry@v3` with `retry_on: error` handles it correctly: if the runner agent dies mid-step the step exits non-zero, the action retries on a different runner. Two attempts is enough; a third shutdown in ten minutes would point at runner-pool sizing, not at this job. Same pattern fits other light jobs (`fmt`, `yaml-lint`, `msrv`, `docs-check`) but those aren't currently flaking, so leave them alone; speculative retry on stable jobs hides real bugs.

Cross-cutting note for runner-ops: the underlying fixes for these are host-level too: drop `NoNewPrivileges` from the runner systemd unit (fixes #1), lower the `post-job.sh` disk threshold from 70% to ~50% (helps #2), grow the lean-mem pool size (helps #3). The workflow-level changes here unblock CI today; the host-level work is the durable fix and worth a separate Ansible PR.

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: `b3b1b39` | Previous: `2754ae1` | Ratio |
|---|---|---|---|
| `store_insert/10000` | `15289412` ns/iter (`± 583865`) | `12571152` ns/iter (`± 678868`) | `1.22` |
| `query/10000` | `154943` ns/iter (`± 610`) | `89985` ns/iter (`± 362`) | `1.72` |

This comment was automatically generated by workflow using github-action-benchmark.

…284)

Three PRs today (#279, #281, and previously #275/#278) failed CI on the same test: `graph_focused_view_renders_svg`. The cause is structural, not a real flake: the test's `fetch()` helper hardcodes a 5s read timeout, and the focused /graph endpoint takes ~5s on the dogfood corpus (742 nodes, 1477 edges) for the BFS frontier + etch layout pass. The test sits exactly on the edge; CI runner load tips it over.

Add a `fetch_with_timeout(port, path, htmx, timeout)` variant. Keep the default 5s `fetch()` for everything else. Bump the graph test's deadline to 15s, which is well past the genuine wall-clock for this endpoint and short enough that a real performance regression still bubbles up. The wall-clock `Instant::elapsed()` log line stays, so an actual slow regression would still be visible in the test output even though the read timeout no longer blocks it.

Verified locally: `cargo test -p rivet-cli --test serve_integration graph_focused_view_renders_svg` passes in 2.79s with the new helper.

Refs: REQ-007 (CLI surface)

github-actions Bot reviewed May 15, 2026


avrabe mentioned this pull request May 15, 2026

test(serve_integration): graph_focused_view 5s timeout chronic flake #284

Merged

3 tasks

avrabe merged commit fd4cf19 into main May 15, 2026
18 of 21 checks passed

avrabe deleted the ci/runner-flake-fixes branch May 15, 2026 06:11

avrabe merged 1 commit into main from ci/runner-flake-fixes
