ci: stop the Verus/Mutation/cargo-vet self-hosted runner flakes #281
Merged
Three flake classes hitting every PR today, all on self-hosted runners:
1. Verus + Rocq Proofs: "sudo: 'no new privileges' flag is set"
2. Mutation Testing (rivet-cli): "No space left on device"
3. Supply Chain (cargo-vet): "runner has received a shutdown signal"
All three block green CI on PRs that have nothing to do with the
underlying subsystem. Each fix is workflow-level so it lands without
runner-host changes.
**1) Verus + Rocq: switch Nix installer.**
The self-hosted runners run with systemd `NoNewPrivileges=true`, which
breaks `cachix/install-nix-action@v31` because it shells out to sudo
mid-install. Switch to `DeterminateSystems/nix-installer-action@main`
with `init: none` (daemonless single-user mode). Add a follow-up step
to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install,
comparable speed once cached. Keeps the lean-mem requirement intact —
the Verus solver still wants the RAM.
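A minimal sketch of what the follow-up PATH step could run, assuming a Linux runner. `GITHUB_PATH` is the file GitHub Actions reads to extend PATH for later steps; the function name `add_nix_profile_to_path` is made up for illustration and is not taken from the actual workflow.

```shell
# Sketch of a "put the Nix profile on PATH" step body.
# add_nix_profile_to_path is a hypothetical name, not from the real workflow.
add_nix_profile_to_path() {
  profile_bin="${HOME}/.nix-profile/bin"

  # Later steps: append to the file named by GITHUB_PATH.
  # The variable is only set inside GitHub Actions, so guard for local runs.
  if [ -n "${GITHUB_PATH:-}" ]; then
    echo "${profile_bin}" >> "${GITHUB_PATH}"
  fi

  # Current step: prepend directly, unless the entry is already present.
  case ":${PATH}:" in
    *":${profile_bin}:"*) ;;
    *) export PATH="${profile_bin}:${PATH}" ;;
  esac
}

add_nix_profile_to_path
```

Writing to `GITHUB_PATH` covers the steps that follow; the `export` only matters if the same step also needs `nix` immediately.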
**2) Mutation Testing: prune + restrict upload.**
cargo-mutants writes a per-mutant target directory under `mutants-out/`.
Seventeen shards landing on the same pool, with the Swatinem cache hot,
leave 5-15 GB behind each. The next shard's `Upload mutants report`
step dies with ENOSPC during the upload itself — after cargo-mutants
has finished cleanly. Two changes:
- Add a `Prune stale mutants artefacts` step before the run (rm
mutants-out, find + rm matching target subdirs older than a day).
- Restrict `upload-artifact` path to the text/JSON reports only —
skip the per-mutant target directories that drive the bloat. The
text reports are what matter for triage.
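The prune step described above might look roughly like this. It is a sketch under assumptions: the report directory is `mutants-out` in the workspace root, and the stale build trees match `target/mutants-*`. That glob and the function name `prune_mutants_artefacts` are hypothetical, so adjust them to the real layout.

```shell
# Hypothetical body for a "Prune stale mutants artefacts" step.
# Assumes ./mutants-out and target/mutants-* (the glob is a placeholder).
prune_mutants_artefacts() {
  workspace="${1:-.}"

  # Drop the previous run's report directory outright.
  rm -rf "${workspace}/mutants-out"

  # Remove matching target subdirectories last modified 24 hours ago or more.
  # -maxdepth 1 keeps find from descending into directories it is deleting.
  if [ -d "${workspace}/target" ]; then
    find "${workspace}/target" -mindepth 1 -maxdepth 1 -type d \
      -name 'mutants-*' -mtime +0 -exec rm -rf {} +
  fi
}
```

A step would call it as `prune_mutants_artefacts "$GITHUB_WORKSPACE"` before cargo-mutants starts. `-mtime +0` is the "older than a day" test: find truncates the age to whole days, so `+0` means at least 24 hours.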
**3) cargo-vet: wrap in retry.**
GitHub Actions doesn't expose "runner restarted under me" to `if:`
conditions, but `nick-fields/retry@v3` with `retry_on: error` handles
it correctly — if the runner agent dies mid-step the step exits
non-zero, the action retries on a different runner. Two attempts is
enough; a third shutdown in ten minutes would point at runner-pool
sizing, not at this job. Same pattern fits other light jobs (`fmt`,
`yaml-lint`, `msrv`, `docs-check`) but those aren't currently flaking,
so leave them alone — speculative retry on stable jobs hides real
bugs.
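For intuition, "two attempts, retry on error" can be written as a plain shell loop. This is not how `nick-fields/retry` is implemented, and it models only the in-step part: rerun the command when it exits non-zero, give up after the last attempt. The `retry` function and the `RETRY_WAIT_SECONDS` variable are illustrative names.

```shell
# Sketch of bounded retry-on-error; names are illustrative only.
retry() {
  max_attempts="$1"
  shift
  attempt=1
  while true; do
    # Success ends the loop; on failure the list's status is the command's.
    "$@" && return 0
    status=$?
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "giving up after ${attempt} attempts (exit ${status})" >&2
      return "${status}"
    fi
    echo "attempt ${attempt} failed (exit ${status}); retrying" >&2
    attempt=$((attempt + 1))
    sleep "${RETRY_WAIT_SECONDS:-10}"
  done
}
```

Usage would be along the lines of `retry 2 cargo vet --locked`. Capping at two attempts keeps a persistent failure visible instead of looping until the job times out.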
Cross-cutting note for runner-ops: each of these also has a host-level
fix: drop `NoNewPrivileges` from the runner systemd unit (fixes flake
1), lower the `post-job.sh` disk threshold from 70% to ~50% (helps
flake 2), grow the lean-mem pool size (helps flake 3). The
workflow-level changes here unblock CI today; the host-level work is
the durable fix and worth a separate Ansible PR.
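To make the disk-threshold change concrete, here is a sketch of a threshold check. The real `post-job.sh` is not shown in this PR, so both function names and the overall shape are assumptions.

```shell
# Hypothetical sketch of a post-job disk check; not the real post-job.sh.

# Print the used-space percentage (digits only) for the given mount point.
disk_usage_percent() {
  # POSIX "df -P": field 5 of the second line is Capacity, e.g. "63%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage has reached the threshold, i.e. cleanup should run.
needs_cleanup() {
  usage="$1"
  threshold="$2"
  [ "${usage}" -ge "${threshold}" ]
}
```

With a disk at 63%, `needs_cleanup 63 70` fails and nothing is cleaned, while `needs_cleanup 63 50` succeeds. That is the effect of lowering the threshold: cleanup starts before a shard that leaves 5-15 GB behind can fill the volume.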
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.
| Benchmark suite | Current: b3b1b39 | Previous: 2754ae1 | Ratio |
|---|---|---|---|
| store_insert/10000 | 15289412 ns/iter (± 583865) | 12571152 ns/iter (± 678868) | 1.22 |
| query/10000 | 154943 ns/iter (± 610) | 89985 ns/iter (± 362) | 1.72 |
This comment was automatically generated by workflow using github-action-benchmark.
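The Ratio column is current time divided by previous time. As a quick check of the two reported rows, assuming the alert rounds to two decimals (the `ratio` helper is illustrative, not part of github-action-benchmark):

```shell
# Recompute Ratio = current / previous, rounded to two decimals.
ratio() {
  awk -v cur="$1" -v prev="$2" 'BEGIN { printf "%.2f\n", cur / prev }'
}

ratio 15289412 12571152   # store_insert/10000 -> 1.22
ratio 154943 89985        # query/10000        -> 1.72
```

Both values are above the 1.20 alert threshold, which is why both rows appear in the alert.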
avrabe added a commit that referenced this pull request May 15, 2026
…284)

Three PRs today (#279, #281, and previously #275/#278) failed CI on the same test: `graph_focused_view_renders_svg`. Cause is structural, not a real flake: the test's `fetch()` helper hardcodes a 5s read timeout, and the focused /graph endpoint takes ~5s on the dogfood corpus (742 nodes, 1477 edges) for the BFS frontier + etch layout pass. The test sits exactly on the edge; CI runner load tips it over.

Add a `fetch_with_timeout(port, path, htmx, timeout)` variant. Keep the default 5s `fetch()` for everything else. Bump the graph test's deadline to 15s: well past the genuine wall-clock for this endpoint and short enough that a real performance regression still bubbles up. The wall-clock `Instant::elapsed()` log line stays, so an actual slow regression would still be visible in the test output even though the read timeout no longer blocks it.

Verified locally: `cargo test -p rivet-cli --test serve_integration graph_focused_view_renders_svg` passes in 2.79s with the new helper.

Refs: REQ-007 (CLI surface)
Summary
Three flake classes have been hitting every PR today on self-hosted runners, blocking CI green on changes that have nothing to do with the underlying subsystem.
1. Verus + Rocq (`sudo: "no new privileges" flag is set`)
The self-hosted runners run with systemd `NoNewPrivileges=true`. `cachix/install-nix-action@v31` shells out to `sudo` mid-install, which dies with "no new privileges". Switch to `DeterminateSystems/nix-installer-action@main` with `init: none` (daemonless single-user mode). Add a follow-up step to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install, comparable speed once cached. Keeps the lean-mem requirement intact.
2. Mutation Testing
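cargo-mutants writes a per-mutant target directory under `mutants-out/`. Seventeen shards landing on the same pool with the Swatinem cache hot leave 5-15 GB behind each. The next shard's upload step dies with ENOSPC during the upload itself, after cargo-mutants finished cleanly. Two changes:
- Add a `Prune stale mutants artefacts` step before the run (rm mutants-out, find + rm matching target subdirs older than a day).
- Restrict the `upload-artifact` path to the text/JSON reports only, skipping the per-mutant target directories that drive the bloat.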
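For the upload restriction, one option is to stage only the reports into a separate directory and point the upload path at that. This is a different technique from narrowing the `upload-artifact` path directly and is offered only as a sketch. It assumes the text/JSON reports sit at the top level of the output directory; `stage_mutants_reports` is a hypothetical name.

```shell
# Hypothetical: copy only top-level *.txt / *.json reports into a staging dir,
# leaving per-mutant build trees behind.
stage_mutants_reports() {
  src="$1"
  dest="$2"
  mkdir -p "${dest}"
  # -maxdepth 1 skips subdirectories, so nested build output is never copied.
  find "${src}" -maxdepth 1 -type f \( -name '*.txt' -o -name '*.json' \) \
    -exec cp {} "${dest}/" \;
}
```

Called as `stage_mutants_reports mutants-out mutants-report`, this leaves a small directory holding just the triage files, so the upload no longer needs headroom for the large trees.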
3. cargo-vet
GitHub Actions doesn't expose "runner restarted under me" to `if:` conditions, but `nick-fields/retry@v3` with `retry_on: error` handles it correctly — if the runner agent dies mid-step the step exits non-zero, the action retries on a different runner. Two attempts is enough; a third shutdown in ten minutes would point at runner-pool sizing, not this job.
Cross-cutting (host-level follow-ups)
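Each fix has a durable host-level counterpart that should land in the runner Ansible repo:
- Drop `NoNewPrivileges` from the runner systemd unit (fixes 1).
- Lower the `post-job.sh` disk threshold from 70% to ~50% (helps 2).
- Grow the lean-mem pool size (helps 3).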
The workflow-level changes here unblock CI today; the host-level work is the durable fix.
Test plan
Trailer
Refs: CI infrastructure; no artifact trailers required (CI-only change, exempt per CLAUDE.md).
🤖 Generated with Claude Code