ci: stop the Verus/Mutation/cargo-vet self-hosted runner flakes #281

Merged: avrabe merged 1 commit into `main` from `ci/runner-flake-fixes` on May 15, 2026

@avrabe avrabe commented May 14, 2026

Summary

Three flake classes have been hitting every PR today on self-hosted runners, blocking green CI on changes that have nothing to do with the underlying subsystem.

| Flake | Job | Symptom | Fix |
|---|---|---|---|
| 1 | Verus Proofs + Rocq Proofs | sudo: "no new privileges" flag is set | Switch from `cachix/install-nix-action` to `DeterminateSystems/nix-installer-action` (daemonless, no sudo) |
| 2 | Mutation Testing (rivet-cli) | `No space left on device` during artifact upload | Prune `mutants-out/` + restrict upload path to text/JSON reports only |
| 3 | Supply Chain (cargo-vet) | `The runner has received a shutdown signal` | Wrap in `nick-fields/retry@v3` (2 attempts, retry on error) |

1. Verus + Rocq

The self-hosted runners run with systemd `NoNewPrivileges=true`. `cachix/install-nix-action@v31` shells out to `sudo` mid-install, which dies with "no new privileges". Switch to `DeterminateSystems/nix-installer-action@main` with `init: none` (daemonless single-user mode). Add a follow-up step to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install, comparable speed once cached. Keeps the lean-mem requirement intact.

2. Mutation Testing

cargo-mutants writes a per-mutant target directory under `mutants-out/`. Seventeen shards landing on the same pool with the Swatinem cache hot leave 5-15 GB behind each. The next shard's upload step dies with ENOSPC during the upload itself — after cargo-mutants finished cleanly. Two changes:

  • Add a `Prune stale mutants artefacts` step before the run.
  • Restrict `upload-artifact` path to text/JSON reports only — skip the per-mutant target dirs that drive the bloat. The text reports are what matter for triage.
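
The upload restriction can be pictured as a simple file filter. The sketch below is a hypothetical Python model of it, not the workflow change itself (in the workflow the same effect would come from narrowing the `path` globs of `upload-artifact`). It assumes the reports are the `.txt` and `.json` files at the top level of the output directory; the function name and that layout are illustrative assumptions.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: reports are the .txt and .json files directly under the
# cargo-mutants output directory; everything else is treated as build bloat.
REPORT_SUFFIXES = {".txt", ".json"}


def select_reports(out_dir: Path) -> list[Path]:
    """Return the top-level text/JSON report files in out_dir, sorted by name."""
    if not out_dir.is_dir():
        return []
    return sorted(
        p for p in out_dir.iterdir()
        if p.is_file() and p.suffix in REPORT_SUFFIXES
    )
```

Per-mutant target directories never match because only regular files with a report suffix are selected, which is the property that keeps the artifact small.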

3. cargo-vet

GitHub Actions doesn't expose "runner restarted under me" to `if:` conditions, but `nick-fields/retry@v3` with `retry_on: error` covers the common case: if the shutdown signal kills the command mid-step, the command exits non-zero and the action re-runs it within the same step. The retry stays on the same runner, so it only helps when the runner itself survives the signal. Two attempts is enough; a third shutdown in ten minutes would point at runner-pool sizing, not this job.
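
As a rough model of the "two attempts, retry on error" policy, the sketch below re-runs a command in place until it exits zero or the attempt budget is spent. It is written in Python for illustration and does not use or reproduce `nick-fields/retry`; `run_with_retry` and its return shape are hypothetical.

```python
from __future__ import annotations

import subprocess


def run_with_retry(cmd: list[str], max_attempts: int = 2) -> tuple[int, int]:
    """Run cmd until it exits 0 or max_attempts is reached.

    Returns (exit_code_of_last_attempt, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code = 1
    for attempt in range(1, max_attempts + 1):
        # A non-zero exit code is the only retry trigger (retry on error).
        code = subprocess.run(cmd).returncode
        if code == 0:
            return code, attempt
    return code, max_attempts
```

A call such as `run_with_retry(["cargo", "vet"])` would give the job one second chance and then surface the failure, which matches the intent of capping the retry at two attempts.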

Cross-cutting (host-level follow-ups)

Each fix has a durable host-level counterpart that should land in the runner Ansible repo:

| Flake | Host fix |
|---|---|
| Verus / Rocq | Drop `NoNewPrivileges=` from the runner systemd unit |
| Mutation Testing | Lower `post-job.sh` disk threshold from 70% to ~50% |
| cargo-vet | Grow the lean-mem / light pool size |

The workflow-level changes here unblock CI today; the host-level work is the durable fix.
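
To make the effect of the disk-threshold change concrete, here is a small Python sketch of a threshold check. It assumes `post-job.sh` triggers cleanup once used space reaches the threshold; that behaviour, and both function names, are assumptions for illustration rather than a description of the actual script.

```python
import shutil


def over_threshold(used: int, total: int, threshold: float) -> bool:
    """True when the used fraction of the disk is at or above threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total >= threshold


def needs_cleanup(path: str, threshold: float = 0.50) -> bool:
    """Check the filesystem holding path against the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return over_threshold(usage.used, usage.total, threshold)
```

Under these assumptions a disk that is 60% full is left alone at the 70% setting but cleaned at 50%, which is the extra headroom the host-level change is after.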

Test plan

  • `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml').read())"` parses OK
  • `git diff --stat` confirms only the targeted three jobs changed
  • CI run on this PR: Verus, Rocq, Mutation Testing, and Supply Chain should all complete cleanly (Verus may still report an SMT proof failure under `continue-on-error: true`; that's fine, because the flake is the installer, not the proof)

Trailer

Refs: CI infrastructure; no artifact trailers required (CI-only change, exempt per CLAUDE.md).

🤖 Generated with Claude Code

Three flake classes hitting every PR today, all on self-hosted runners:

1. Verus + Rocq Proofs: "sudo: 'no new privileges' flag is set"
2. Mutation Testing (rivet-cli): "No space left on device"
3. Supply Chain (cargo-vet): "runner has received a shutdown signal"

All three block green CI on PRs that have nothing to do with the
underlying subsystem. Each fix is workflow-level so it lands without
runner-host changes.

**1) Verus + Rocq: switch Nix installer.**
The self-hosted runners run with systemd `NoNewPrivileges=true`, which
breaks `cachix/install-nix-action@v31` because it shells out to sudo
mid-install. Switch to `DeterminateSystems/nix-installer-action@main`
with `init: none` (daemonless single-user mode). Add a follow-up step
to put `~/.nix-profile/bin` on PATH explicitly. Cost: ~30s install,
comparable speed once cached. Keeps the lean-mem requirement intact —
the Verus solver still wants the RAM.

**2) Mutation Testing: prune + restrict upload.**
cargo-mutants writes a per-mutant target directory under `mutants-out/`.
Seventeen shards landing on the same pool, with the Swatinem cache hot,
leave 5-15 GB behind each. The next shard's `Upload mutants report`
step dies with ENOSPC during the upload itself — after cargo-mutants
has finished cleanly. Two changes:
  - Add a `Prune stale mutants artefacts` step before the run (rm
    mutants-out, find + rm matching target subdirs older than a day).
  - Restrict `upload-artifact` path to the text/JSON reports only —
    skip the per-mutant target directories that drive the bloat. The
    text reports are what matter for triage.
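
The prune step described above (remove `mutants-out`, then remove matching target subdirectories older than a day) could look roughly like the Python sketch below. The real step is a workflow shell step; the `target/*` glob, the function name, and the parameters are hypothetical, since the commit message does not say which subdirectories match.

```python
from __future__ import annotations

import shutil
import time
from pathlib import Path

ONE_DAY_SECONDS = 24 * 60 * 60


def prune_stale(
    workspace: Path,
    out_name: str = "mutants-out",
    target_glob: str = "target/*",  # assumption: which subdirs count as "matching"
    max_age: float = ONE_DAY_SECONDS,
    now: float | None = None,
) -> list[Path]:
    """Delete the mutants output dir and stale target subdirs; return what was removed."""
    now = time.time() if now is None else now
    removed: list[Path] = []

    # Always drop the previous run's output directory.
    out_dir = workspace / out_name
    if out_dir.exists():
        shutil.rmtree(out_dir)
        removed.append(out_dir)

    # Drop matching subdirectories whose mtime is older than max_age.
    for sub in sorted(workspace.glob(target_glob)):
        if sub.is_dir() and now - sub.stat().st_mtime > max_age:
            shutil.rmtree(sub)
            removed.append(sub)
    return removed
```

Running the prune before the mutation run rather than after it means a shard that crashed without cleaning up cannot starve the next shard of disk.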

**3) cargo-vet: wrap in retry.**
GitHub Actions doesn't expose "runner restarted under me" to `if:`
conditions, but `nick-fields/retry@v3` with `retry_on: error` covers
the common case: if the shutdown signal kills the command mid-step,
the command exits non-zero and the action re-runs it within the same
step. The retry stays on the same runner, so it only helps when the
runner itself survives the signal. Two attempts is enough; a third
shutdown in ten minutes would point at runner-pool sizing, not at
this job. Same pattern fits other light jobs (`fmt`, `yaml-lint`,
`msrv`, `docs-check`) but those aren't currently flaking, so leave
them alone: speculative retry on stable jobs hides real bugs.

Cross-cutting note for runner-ops: the underlying fixes for these are
host-level too — drop `NoNewPrivileges` from the runner systemd unit
(fixes #1), lower the `post-job.sh` disk threshold from 70% to ~50%
(helps #2), grow the lean-mem pool size (helps #3). The workflow-level
changes here unblock CI today; the host-level work is the durable fix
and worth a separate Ansible PR.

@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: b3b1b39 | Previous: 2754ae1 | Ratio |
|---|---|---|---|
| `store_insert/10000` | `15289412` ns/iter (`± 583865`) | `12571152` ns/iter (`± 678868`) | `1.22` |
| `query/10000` | `154943` ns/iter (`± 610`) | `89985` ns/iter (`± 362`) | `1.72` |

This comment was automatically generated by workflow using github-action-benchmark.
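
The ratios in the alert can be checked by hand: each is the current time divided by the previous time, compared against the 1.20 threshold. The Python sketch below reproduces that arithmetic from the figures in the comment; treating the comparison as strictly greater-than is an assumption, and the names are illustrative.

```python
from __future__ import annotations

ALERT_THRESHOLD = 1.20

# (current ns/iter, previous ns/iter), copied from the alert above.
RESULTS = {
    "store_insert/10000": (15289412, 12571152),
    "query/10000": (154943, 89985),
}


def regressions(
    results: dict[str, tuple[int, int]],
    threshold: float = ALERT_THRESHOLD,
) -> dict[str, float]:
    """Return benchmarks whose current/previous ratio exceeds threshold."""
    flagged = {}
    for name, (current, previous) in results.items():
        ratio = current / previous
        if ratio > threshold:
            flagged[name] = round(ratio, 2)
    return flagged
```

Both benchmarks clear the threshold, `store_insert/10000` only narrowly, which is worth keeping in mind given the run-to-run variance shown in the `±` columns.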

avrabe added a commit that referenced this pull request May 15, 2026
…284)

Three PRs today (#279, #281, and previously #275/#278) failed CI on
the same test: `graph_focused_view_renders_svg`. Cause is structural,
not a real-flake: the test's `fetch()` helper hardcodes a 5s read
timeout, and the focused /graph endpoint takes ~5s on the dogfood
corpus (742 nodes, 1477 edges) — BFS frontier + etch layout pass.
The test sits exactly on the edge; CI runner load tips it over.

Add a `fetch_with_timeout(port, path, htmx, timeout)` variant. Keep
the default 5s `fetch()` for everything else. Bump the graph test's
deadline to 15s — well past the genuine wall-clock for this endpoint
and short enough that a real performance regression still bubbles up.

The wall-clock `Instant::elapsed()` log line stays, so an actual slow
regression would still be visible in the test output even though the
read timeout no longer blocks it.
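
The shape of the change, a timeout-taking variant plus a default wrapper that keeps the 5s behaviour, can be sketched as follows. This is a Python analogue for illustration only; the actual helper lives in the Rust integration test and is not shown here. The parameter order follows the signature quoted above, while the meaning of `htmx` (sending an `HX-Request: true` header), the `fetch` signature, and the return type are assumptions.

```python
from __future__ import annotations

import http.client

DEFAULT_TIMEOUT_S = 5.0


def fetch_with_timeout(
    port: int, path: str, htmx: bool, timeout: float
) -> tuple[int, bytes]:
    """GET path from localhost:port, failing if no response within timeout seconds."""
    # Assumption: the htmx flag marks the request as an htmx partial request.
    headers = {"HX-Request": "true"} if htmx else {}
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()  # raises a timeout error (an OSError) if too slow
        return resp.status, resp.read()
    finally:
        conn.close()


def fetch(port: int, path: str, htmx: bool = False) -> tuple[int, bytes]:
    """Default variant: every other test keeps the 5s deadline."""
    return fetch_with_timeout(port, path, htmx, DEFAULT_TIMEOUT_S)
```

Keeping `fetch` as a thin wrapper means only the one slow endpoint opts into the longer deadline, so a regression elsewhere still trips the 5s limit.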

Verified locally: `cargo test -p rivet-cli --test serve_integration
graph_focused_view_renders_svg` passes in 2.79s with the new helper.

Refs: REQ-007 (CLI surface)
@avrabe avrabe merged commit fd4cf19 into main May 15, 2026
18 of 21 checks passed
@avrabe avrabe deleted the ci/runner-flake-fixes branch May 15, 2026 06:11