
S41: recover stale runner job claims #306

Merged
mfwolffe merged 1 commit into trunk from s41/actions-stale-run-recovery
May 17, 2026

@espadonne
Contributor

Summary

  • add runner heartbeat active_job_ids reconciliation so abandoned running jobs assigned to an idle runner are cancelled before capacity is counted
  • make shithubd-runner send an explicit empty active job set on idle heartbeats
  • update runner API docs and incident/deploy runbooks with the production queue-wedge recovery path

Production finding

On 2026-05-17, production had one stale running job assigned to shithub-runner-shared-linux-1 while the runner was heartbeating idle. That consumed the runner's only capacity slot, backed up queued jobs, and eventually suppressed fresh repo-tree status dots when repo queued-run caps were reached. I manually cancelled run/job 67, verified the queue drained, and confirmed production is now idle with zero running workflow jobs.

Deploy note

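Deploy web first: the heartbeat decoder rejects unknown JSON fields, so a runner that sends active_job_ids to a web build that predates this change would have its heartbeats rejected. Then redeploy shithubd-runner from the same trunk build so it starts sending active_job_ids.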
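The ordering constraint can be reproduced with a strict decoder from the standard library. This is a sketch under assumptions: the struct names and the status field are hypothetical, and it assumes the handler uses encoding/json with DisallowUnknownFields or an equivalent strict mode, which the PR description implies but does not show.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// oldHeartbeat models a hypothetical request struct from a web
// build that predates this change (no active_job_ids field).
type oldHeartbeat struct {
	Status string `json:"status"` // hypothetical field
}

// newHeartbeat models the request struct after this change.
type newHeartbeat struct {
	Status       string  `json:"status"` // hypothetical field
	ActiveJobIDs []int64 `json:"active_job_ids"`
}

// decodeStrict decodes body into dst and fails on any JSON field
// that dst does not declare.
func decodeStrict(body string, dst any) error {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	return dec.Decode(dst)
}

func main() {
	// Payload as a redeployed runner would send it while idle.
	body := `{"status":"idle","active_job_ids":[]}`

	// Old web build: the unknown field causes a decode error.
	fmt.Println("old web:", decodeStrict(body, &oldHeartbeat{}))

	// New web build: the same payload decodes cleanly.
	fmt.Println("new web:", decodeStrict(body, &newHeartbeat{}))
}
```

Rolling out the runner first would therefore leave it sending a field the deployed decoder does not yet know about.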

Tests

  • GOCACHE=/private/tmp/shithub-actions-stale-run-hotfix/.gocache SHITHUB_TEST_DATABASE_URL='postgres://shithub:shithub_dev@127.0.0.1:5432/postgres?sslmode=disable' go test ./internal/actions/lifecycle ./internal/web/handlers/api ./internal/runner/api ./internal/runner
  • GOCACHE=/private/tmp/shithub-actions-stale-run-hotfix/.gocache make build
  • git diff --check

@mfwolffe mfwolffe merged commit 32ce2c9 into trunk May 17, 2026
1 check passed