Inject spawn seam in RunWorkloadDetached to stop orphan test processes #5346

Merged
tgrunnagle merged 2 commits into main from issue_5344
May 20, 2026
Conversation

@tgrunnagle
Contributor

Summary

pkg/workloads unit tests were spawning orphan workloads.test child processes that survived go test and accumulated on developer machines (one was observed at 81% CPU after 11 hours). The cause was DefaultManager.RunWorkloadDetached calling os.Executable() + exec.Command(...).Start() directly with Setsid: true. Under go test, os.Executable() resolves to the test binary, so the call re-execs the test binary itself. Go's testing.Main ignores the unknown positional args and reruns the whole suite, which calls RunWorkloadDetached again and recursively spawns more detached children.
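To make the failure mode concrete, here is a minimal sketch of the argv that such a re-exec produces under go test. reexecArgs is a hypothetical helper written for this illustration, not the project's code, and the test-binary path is a made-up example; only the `start <name> --foreground` shape comes from the description.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// reexecArgs is a hypothetical helper mirroring the argv shape described
// above: the current executable re-invoked as "start <name> --foreground".
func reexecArgs(exePath, workloadName string) []string {
	return []string{exePath, "start", workloadName, "--foreground"}
}

func main() {
	// Under `go test`, os.Executable() returns the compiled test binary
	// (illustrative path below), not the real CLI.
	exe := "/tmp/go-build123/b001/workloads.test"
	args := reexecArgs(exe, "test-workload")

	// The child is therefore the test binary with positional args that the
	// test framework does not act on, so it runs the suite again.
	fmt.Println(filepath.Base(args[0]), args[1:])
}
```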

This PR introduces a detachedProcessSpawner seam on DefaultManager so tests can inject a no-op spawner instead of re-execing the test binary. Production behavior is unchanged — the default spawner still re-execs via os.Executable() with setsid. The triggering test (TestDefaultManager_updateSingleWorkload "successful stop and delete operations complete correctly") now uses the injected spawner, and two new tests cover the seam's success and error paths.
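A minimal sketch of the seam pattern, assuming simplified stand-in types: RunConfig replaces runner.RunConfig, manager replaces DefaultManager, and the production spawner is stubbed. The names detachedProcessSpawner, withDetachedSpawner, and detachedSpawnerOrDefault follow the description; everything else (newManager, spawnWithFake, spawnWithDefault) is hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// RunConfig is a stand-in for runner.RunConfig (assumption for this sketch).
type RunConfig struct{ BaseName string }

// detachedProcessSpawner starts a detached process and returns its PID.
type detachedProcessSpawner func(ctx context.Context, cfg *RunConfig) (pid int, err error)

// manager is a simplified stand-in for DefaultManager.
type manager struct {
	detachedSpawner detachedProcessSpawner // nil means "use the production spawner"
}

type managerOption func(*manager)

// withDetachedSpawner overrides the spawner; intended for tests.
func withDetachedSpawner(s detachedProcessSpawner) managerOption {
	return func(m *manager) { m.detachedSpawner = s }
}

func newManager(opts ...managerOption) *manager {
	m := &manager{}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

// detachedSpawnerOrDefault returns the injected spawner if one was set,
// otherwise the production spawner.
func (m *manager) detachedSpawnerOrDefault() detachedProcessSpawner {
	if m.detachedSpawner != nil {
		return m.detachedSpawner
	}
	return spawnDetached
}

// spawnDetached is where the real re-exec would live; it is stubbed here so
// the sketch never starts a process.
func spawnDetached(context.Context, *RunConfig) (int, error) {
	return 0, errors.New("production spawner is not modeled in this sketch")
}

// spawnWithFake shows the test-side usage: inject a spawner that returns a
// fixed fake PID instead of re-execing anything.
func spawnWithFake(fakePID int) (int, error) {
	m := newManager(withDetachedSpawner(func(context.Context, *RunConfig) (int, error) {
		return fakePID, nil
	}))
	return m.detachedSpawnerOrDefault()(context.Background(), &RunConfig{BaseName: "test-workload"})
}

// spawnWithDefault exercises the fallback path (the stub above).
func spawnWithDefault() error {
	_, err := newManager().detachedSpawnerOrDefault()(context.Background(), &RunConfig{})
	return err
}

func main() {
	pid, err := spawnWithFake(4242)
	fmt.Println(pid, err)
	fmt.Println(spawnWithDefault())
}
```

Keeping the field nil by default means production callers need no changes; only tests pass the option.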

Fixes #5344

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual verification:

  1. Ran task test against the pkg/workloads package on main and confirmed orphan workloads.test start test-workload --foreground processes with PPID 1 via ps -eo pid,ppid,etime,args | grep workloads.test.
  2. Repeated on this branch — no orphan processes remain after the suite finishes.
  3. Added TestDefaultManager_RunWorkloadDetached_SpawnerSuccess and TestDefaultManager_RunWorkloadDetached_SpawnerError to lock in the seam contract (PID forwarding, error wrapping, no PID written on spawn failure).
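The contract those two tests lock in can be sketched roughly as follows. This is not the PR's test code: the real tests use gomock, while this sketch uses a hand-rolled pidRecorder fake, and runDetached, checkSuccess, and checkError are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// spawner is a simplified form of the seam: it returns the child's PID.
type spawner func(ctx context.Context, name string) (int, error)

// pidRecorder is a hand-rolled fake standing in for the status manager mock.
type pidRecorder struct {
	calls int
	pid   int
}

func (r *pidRecorder) SetWorkloadPID(_ context.Context, _ string, pid int) error {
	r.calls++
	r.pid = pid
	return nil
}

// runDetached calls the spawner, wraps its error, and records the PID only
// when the spawn succeeded.
func runDetached(ctx context.Context, spawn spawner, rec *pidRecorder, name string) error {
	pid, err := spawn(ctx, name)
	if err != nil {
		return fmt.Errorf("failed to start detached process: %w", err)
	}
	return rec.SetWorkloadPID(ctx, name, pid)
}

var errSpawn = errors.New("spawn failed")

// checkSuccess verifies PID forwarding on the happy path.
func checkSuccess() bool {
	rec := &pidRecorder{}
	ok := func(context.Context, string) (int, error) { return 4242, nil }
	err := runDetached(context.Background(), ok, rec, "test-workload")
	return err == nil && rec.calls == 1 && rec.pid == 4242
}

// checkError verifies error wrapping and that no PID is written on failure.
func checkError() bool {
	rec := &pidRecorder{}
	fail := func(context.Context, string) (int, error) { return 0, errSpawn }
	err := runDetached(context.Background(), fail, rec, "test-workload")
	return errors.Is(err, errSpawn) && rec.calls == 0
}

func main() {
	fmt.Println(checkSuccess(), checkError())
}
```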

Changes

| File | Change |
|------|--------|
| pkg/workloads/manager.go | Add detachedProcessSpawner type, withDetachedSpawner option, and detachedSpawnerOrDefault accessor on DefaultManager. Split RunWorkloadDetached so the re-exec logic lives in spawnDetached (the production spawner) and RunWorkloadDetached only validates inputs, calls the spawner, and records the returned PID. Move the WorkloadStatusStarting commit point into the spawner so it sits immediately before the actual exec, and have the spawner own the WorkloadStatusError rollback on Start() failure. |
| pkg/workloads/manager_test.go | Inject a no-op spawner in TestDefaultManager_updateSingleWorkload so the success-path subtest no longer re-execs the test binary. Tighten its SetWorkloadPID expectation from gomock.Any() to a fixed fake PID. Add TestDefaultManager_RunWorkloadDetached_SpawnerSuccess and TestDefaultManager_RunWorkloadDetached_SpawnerError to cover the seam directly. |

Does this introduce a user-facing change?

No.

Implementation plan

Approved implementation plan

Use option 1 from the issue: introduce a spawn seam.

  • Define detachedProcessSpawner func(ctx, *runner.RunConfig) (pid int, err error) and store it on DefaultManager alongside the existing mcpRunnerFactory seam.
  • Add a withDetachedSpawner managerOption and a detachedSpawnerOrDefault accessor mirroring the newRunner / retryConfig pattern already used in this file.
  • Keep the production implementation (spawnDetached) bit-for-bit equivalent to the current behavior, including Setsid, log file redirection, secrets password env var, and the --debug / --foreground args.
  • Move the WorkloadStatusStarting set call into spawnDetached so it remains the commit point right before detachedCmd.Start(). Keep WorkloadStatusError rollback there too. RunWorkloadDetached itself should only handle validation, spawner invocation, and PID recording — pre-spawn failures must not mutate workload status.
  • In TestDefaultManager_updateSingleWorkload, inject a no-op spawner that still emits the Starting status transition so mock expectations match production.
  • Add focused tests for the seam: one happy path verifying PID forwarding and one failure path verifying error wrapping and absence of SetWorkloadPID.
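The status ordering in the plan above can be sketched as follows, under stated assumptions: spawnSteps and its prepare/start hooks are hypothetical stand-ins for the pre-exec setup and detachedCmd.Start(), and record stands in for the status manager call. The real spawnDetached also takes a context and a run config, which are omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type status string

const (
	statusStarting status = "starting"
	statusError    status = "error"
)

// spawnSteps models the two stages of the spawner with hypothetical hooks.
type spawnSteps struct {
	prepare func() error        // pre-exec setup (executable lookup, log file, ...)
	start   func() (int, error) // stand-in for detachedCmd.Start()
}

// spawnDetached records Starting immediately before start, and Error only if
// start fails. A prepare failure returns without touching status.
func spawnDetached(steps spawnSteps, record func(status)) (int, error) {
	if err := steps.prepare(); err != nil {
		return 0, fmt.Errorf("pre-exec setup failed: %w", err)
	}
	record(statusStarting) // commit point
	pid, err := steps.start()
	if err != nil {
		record(statusError) // rollback
		return 0, fmt.Errorf("failed to start detached process: %w", err)
	}
	return pid, nil
}

// trace runs the spawner with the given outcomes and returns the recorded
// status transitions, comma-separated.
func trace(prepareFails, startFails bool) string {
	var seen []string
	steps := spawnSteps{
		prepare: func() error {
			if prepareFails {
				return errors.New("log file creation failed")
			}
			return nil
		},
		start: func() (int, error) {
			if startFails {
				return 0, errors.New("exec failed")
			}
			return 4242, nil
		},
	}
	// The returned error is ignored: only the transitions matter here.
	_, _ = spawnDetached(steps, func(s status) { seen = append(seen, string(s)) })
	return strings.Join(seen, ",")
}

func main() {
	fmt.Printf("success:       %q\n", trace(false, false))
	fmt.Printf("setup failure: %q\n", trace(true, false))
	fmt.Printf("start failure: %q\n", trace(false, true))
}
```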

Special notes for reviewers

  • spawnDetached deliberately does not propagate ctx to exec.Command — the spawned process must outlive the parent (that is the point of "detached"); exec.CommandContext would kill the child on parent cancellation. ctx is still threaded through for status manager calls. This is called out in a comment on spawnDetached.
  • The status state machine is preserved: Starting is set immediately before detachedCmd.Start(), Error is set if Start() fails, and failures during pre-exec setup (e.g. os.Executable() or log file creation) return without touching workload status — matching the previous behavior.
  • Other RunWorkloadDetached call sites in tests (pkg/api/v1/..., pkg/mcp/server/...) use mocks.MockManager.RunWorkloadDetached, which never spawns, so they are unaffected. The validation-failure-only test TestDefaultManager_RunWorkloadDetached returns before reaching the spawner and also remains unchanged.
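For the first note, a rough sketch of building a detached command without binding it to a context. newDetachedCmd is a hypothetical helper, not the project's spawnDetached, and the path in main is a placeholder. The command is only constructed, never started. syscall.SysProcAttr.Setsid exists on Unix-like platforms only, so this sketch is assumed to be built on Linux or macOS.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// newDetachedCmd builds (but does not start) a command meant to outlive its
// parent. Unix-only: the Setsid field is not available on Windows.
func newDetachedCmd(exePath string, args ...string) *exec.Cmd {
	// exec.Command rather than exec.CommandContext: a context-bound command
	// is killed when its context is done, which would defeat detachment.
	cmd := exec.Command(exePath, args...)

	// Setsid starts the child in a new session, detaching it from the
	// parent's session and controlling terminal.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	return cmd
}

func main() {
	// Placeholder path; nothing is executed in this sketch.
	cmd := newDetachedCmd("/usr/local/bin/example-cli", "start", "test-workload", "--foreground")
	fmt.Println(cmd.Args, cmd.SysProcAttr.Setsid)
}
```

Because nothing ties the child's lifetime to the caller, a test that reaches a real spawner like this leaves the child running, which is why the seam matters.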

DefaultManager.RunWorkloadDetached re-execs os.Executable() with
`start <basename> --foreground`. Under `go test`, os.Executable()
resolves to the test binary, so the unit test for updateSingleWorkload
spawned `workloads.test start test-workload --foreground` as a
detached child via Setsid. Go's testing.Main ignores the positional
args and re-ran the entire test suite, which called
RunWorkloadDetached again — recursive spawn. Each child was orphaned
to launchd/init and survived after the test binary exited, eventually
consuming high CPU on developer machines.

Extract the spawn step into a function-typed field on DefaultManager
(detachedProcessSpawner) with a withDetachedSpawner option, following
the same pattern as withRetryConfig and withRunnerFactory. Production
uses spawnDetached (unchanged behavior); tests inject a no-op spawner
that returns a fake PID.

Closes #5344
- MEDIUM: Preserve original WorkloadStatus ordering. Move SetWorkloadStatus(Starting) and the Error rollback into spawnDetached, so failures during pre-exec setup (os.Executable, xdg.DataFile, GetSecretsPassword) return without any status change — matching the pre-refactor behavior.
- MEDIUM: Add TestDefaultManager_RunWorkloadDetached_SpawnerError to cover the new spawner-error path.
- MEDIUM: Add TestDefaultManager_RunWorkloadDetached_SpawnerSuccess to verify the spawner is invoked and its PID is forwarded.
- LOW: Document that ctx is intentionally not propagated to exec.Command, so the child can outlive the parent.
- LOW: Expand detachedProcessSpawner doc to call out the platform-specific detachment that makes orphan children possible.
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 20, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review May 20, 2026 16:07
@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 58.33333% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.40%. Comparing base (8d22ac5) to head (d70fbd9).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pkg/workloads/manager.go | 58.33% | 8 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5346      +/-   ##
==========================================
- Coverage   68.41%   68.40%   -0.01%     
==========================================
  Files         621      624       +3     
  Lines       63278    63442     +164     
==========================================
+ Hits        43293    43399     +106     
- Misses      16757    16807      +50     
- Partials     3228     3236       +8     


@tgrunnagle tgrunnagle merged commit 2894982 into main May 20, 2026
45 checks passed
@tgrunnagle tgrunnagle deleted the issue_5344 branch May 20, 2026 16:56
Development

Successfully merging this pull request may close these issues.

Unit tests spawn orphan workloads.test processes via RunWorkloadDetached

2 participants