Skip to content

Terminate factory agent processes on release#263

Merged
kjgbot merged 5 commits into
mainfrom
factory-sdk/v2fix11
Jun 12, 2026
Merged

Terminate factory agent processes on release#263
kjgbot merged 5 commits into
mainfrom
factory-sdk/v2fix11

Conversation

@kjgbot

@kjgbot kjgbot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Reuse the factory reaper termination path for factory-owned agent release.
  • Resolve the live broker roster PID by agent name before broker release can de-register the agent.
  • Treat listAgents() liveness as authoritative: present+pid means terminate/persist, present-without-pid is a loud unresolved error, absent means already exited and not an orphan.
  • Persist resolved live agent PIDs/process identities into the durable D9 in-flight registry so the external crash reaper has a positive-ID target.
  • Let factory reap-orphans resolve name-only registry entries by agent name via the fleet before terminating, closing the spawn-registration lag crash window.
  • Add an anchored, fail-closed ps/pgrep launcher-root lookup (--agent-name <name>) in both graceful termination and D9 reaping; broker worker PIDs are supplementary liveness data, not tree roots.
  • Walk the process tree downward at termination time and signal descendants before parents, while protected broker/factory/ancestor PIDs are structurally refused.
  • Apply release-plus-terminate behavior on both stop() rollback and issue completion.

Verification

  • npx vitest run packages/factory-sdk -> 266 passed
  • npx tsc --noEmit -p tsconfig.node.json -> exit 0

V0 coverage

  • Internal fleet: exposes a bounded retrying resolveAgentPid(name) backed by broker listAgents().
  • Internal fleet: PID resolution distinguishes found, missing (absent/exited), and unresolved (present but no pid).
  • Internal fleet: roster PID fallback retries when /api/spawned registration lags.
  • Internal fleet: broker PID is surfaced as protected process state.
  • Process lookup: anchored --agent-name <name> matching rejects loose substrings, excludes protected PIDs, fails closed on zero/disjoint multiple matches, and allows multiple matches only when they form one coherent agent tree.
  • Factory registry: live-shaped pid-less spawn ack plus roster PID resolution persists PID identity signatures; no injected SpawnResult.pids[].
  • Factory registry: live-shaped pid-less spawn ack with registration still pending persists agent names with empty PID/process arrays instead of dropping the entry.
  • Factory reaper: stale name-only registry entry resolves by agent name through the fleet and reaps the resolved PID.
  • Factory reaper: anchored ps lookup selects the topmost matched launcher root before broker worker PID fallback; launcher-root plus worker child are both reaped.
  • Factory reaper: broker unresolved plus ps-found agent shell reaps; broker unresolved plus ps-missing skips; foreign ps identity is rejected.
  • Factory reaper: name-only resolved PID with mismatched identity is skipped and never signaled, pinning the PID-reuse guard on the resolve-by-name path.
  • Factory reaper: broker/protected PID resolution is passed through to terminatePids; a hermetic child-tree test proves a name-only entry resolving to a protected broker PID is skipped instead of signaled or child-walked.
  • Factory stop: anchored ps lookup selects the launcher root even when the broker resolves a worker child; launcher-root plus worker child are both terminated.
  • Factory stop: broker unresolved plus ps-found agent shell terminates with no agentTerminateMissingPid; broker unresolved plus ps-missing is process-less and does not over-count.
  • Factory stop: pid-less spawn ack plus roster PID fallback resolves before release de-registers the agent, then root plus discovered child are terminated.
  • Factory stop: already-exited/absent agents do not increment agentTerminateMissingPid.
  • Missing live terminate roots are logged/counted loudly instead of silently certifying a no-op.
  • Completion path: issue-done releases and terminates tracked pair process trees, including child PIDs discovered at kill time.
  • Reaper helper: recursive downward tree termination kills children before parent roots, never signals the broker/ancestor parent, and continues when an earlier PID is already gone.

Notes

  • Spawn results still may be pid-less live; the registry and graceful termination paths resolve via broker roster liveness instead of trusting SpawnResult.
  • V0 proves the factory-sdk behavior and live-shaped PID fallback. fv2's V1 remains the authoritative proof for real broker roster pid population, graceful zero-survivors, crash registry reaping, and broker-alive safety.
  • node_modules in the worktree is an untracked symlink only.

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

PID-aware termination: ports add process-control hooks and AgentPidResolution; InternalFleetClient resolves broker/connection/roster PIDs; new findAgentProcessByName resolves anchored agent processes; terminatePids() kills PID trees child-first with protected-PID handling; factory integrates termination into release/completion flows and tests exercise many PID-edge cases.

Changes

Agent Process Termination via PID

Layer / File(s) Summary
Port interface contracts and CLI wiring
packages/factory-sdk/src/ports/fleet.ts, packages/factory-sdk/src/ports/index.ts, packages/factory-sdk/src/types.ts, packages/factory-sdk/src/cli/fleet.ts
Adds AgentPidResolution, optional FleetClient.resolveAgentPid() and FleetClient.protectedPids(); FactoryPorts gains processFinder, kill, readChildPids, and terminationGraceMs; CLI forwards fleet into factory commands and reap-orphans.
Fleet client PID modeling and resolution
packages/factory-sdk/src/fleet/internal-fleet-client.ts, packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
InternalFleetClient reads connection PID, exposes protectedPids() and resolveAgentPid(name) with retry/backoff, stores optional #connectionPath, allows spawn-result PID override; test harness updated to model brokerPid and per-agent pids and adds PID-resolution tests.
Generic PID termination helper and reaper refactor
packages/factory-sdk/src/orchestrator/reaper.ts
Adds exported terminatePids() (post-order child-first termination, protected set, liveness checks, SIGTERM→grace→SIGKILL) and rewires reapFactoryOrphansOnce to use it; adds PID-inspection hooks.
Process identity resolver and tests
packages/factory-sdk/src/orchestrator/process-identity.ts, packages/factory-sdk/src/orchestrator/process-identity.test.ts
Adds findAgentProcessByName with anchored --agent-name matching and ancestry coherence checks; tests cover exact/ambiguous/missing/protected cases.
Factory integration and tests
packages/factory-sdk/src/orchestrator/factory.ts, packages/factory-sdk/src/orchestrator/factory.test.ts
FactoryLoop injects process-control hooks and processFinder; stop/completion flows compute termination roots, call terminatePids(), filter protected PIDs, then call fleet.release; extensive tests and fakes validate termination ordering, roster fallbacks, discovery before release, missing-root reporting, and completion termination behavior.
Reaper tests
packages/factory-sdk/src/orchestrator/reaper.test.ts
Adds/updates tests for terminatePids semantics (depth-first ordering, skipped/dead PIDs) and many reapFactoryOrphansOnce scenarios covering selection, identity mismatches, protected PIDs, and fallback logic.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

🐰 I watched the processes sleep and sigh,
I chased their PIDs across the sky,
With gentle SIGTERM and one last shove,
I pruned the trees with careful love 🌿
Now the factory hums — all processes lie.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: terminating factory agent processes during the release lifecycle.
Description check ✅ Passed The description is detailed and directly related to the changeset, covering the objectives, verification, coverage, and implementation notes for the PID termination feature.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch factory-sdk/v2fix11

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@agent-relay-code

Copy link
Copy Markdown
Contributor

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

No code changes were needed. I reviewed the PR diff and traced the factory cleanup/reaper changes through callers, types, registry writing, and tests. I did not find a validated PR-scoped defect to fix.

Addressed comments

  • No bot or reviewer comments were present in the supplied .workforce metadata/files, so there were no external review threads to address.

Validation

Passed locally:

  • npm ci
  • npx vitest run packages/factory-sdk/src/orchestrator/reaper.test.ts packages/factory-sdk/src/orchestrator/factory.test.ts
  • npm run typecheck:node
  • npm run lint (warnings only, no errors)
  • npm run typecheck:web
  • npm test
  • npm run verify:mcp-resources-drift
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.fidelity.config.ts
  • npx playwright test --config playwright.redraw.config.ts

I could not reproduce the macOS-only npm run dist:mac / packaged MCP smoke job on this Linux runner, and I cannot verify GitHub mergeability or remote CI state from the checkout.

@kjgbot kjgbot force-pushed the factory-sdk/v2fix11 branch 2 times, most recently from d116fbc to e4d9fbf Compare June 12, 2026 19:14
@kjgbot kjgbot force-pushed the factory-sdk/v2fix11 branch from e4d9fbf to 5c62ecd Compare June 12, 2026 19:23
@agent-relay-code

Copy link
Copy Markdown
Contributor

⚠️ pr-reviewer push failed (exit 1) — fixes were not applied to the PR. The notes below are advisory and were not pushed.

Implemented one PR-scope fix: factory agent process trees are now terminated before broker sessions are released, so child PID discovery cannot be lost if fleet.release() tears down the root process first. The helper also preserves deterministic release ordering. See factory.ts.

Added a regression test that simulates release() making children undiscoverable and verifies the child PID is still killed. See factory.test.ts.

Addressed comments

  • No bot/reviewer comment artifacts were present in .workforce; there were no external review comments available to validate or address.

Advisory Notes

  • None.

Validation run locally:

  • npx vitest run packages/factory-sdk/src/orchestrator/reaper.test.ts packages/factory-sdk/src/orchestrator/factory.test.ts packages/factory-sdk/src/fleet/internal-fleet-client.test.ts passed.
  • npm test passed.
  • npx vitest run passed: 51 files, 666 tests.
  • npm run verify:mcp-resources-drift passed.
  • npm run lint passed with existing warnings only.
  • npm run typecheck:web passed.
  • npm run typecheck:node passed.
  • npm run build passed.
  • npx playwright test --config playwright.fidelity.config.ts passed when run sequentially.
  • npx playwright test --config playwright.redraw.config.ts passed when run sequentially.

@agent-relay-code

Copy link
Copy Markdown
Contributor

Fixed a PR-scoped type break in internal-fleet-client.ts: HarnessDriverClientLike.listAgents() now returns the local DriverAgentLike[] shape, keeping pid optional. The prior Pick<ListAgent, 'name' | 'pid'> made pid required and broke the fake harness clients even though the runtime fallback explicitly handles missing PIDs. Also removed the now-unused ListAgent import.

Addressed comments

  • gemini-code-assist[bot]: quota-limit notice only; no code finding to address.
  • coderabbitai[bot]: review-limit notice only; no code finding to address.
  • agent-relay-code[bot]: prior review-only note reported no findings; current checkout had a validated TypeScript mismatch, fixed in packages/factory-sdk/src/fleet/internal-fleet-client.ts:10 and :16.

Validation

Passed:

  • npm ci
  • npx vitest run packages/factory-sdk/src/orchestrator/reaper.test.ts packages/factory-sdk/src/orchestrator/factory.test.ts packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • npm run typecheck:node
  • npm run verify:mcp-resources-drift
  • npm run typecheck:web
  • npm test
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.redraw.config.ts on sequential retry

Not fully verified in this sandbox:

  • npm run lint was killed by the environment under both Node 25 and Node 22 before reporting lint results.
  • npx playwright test --config playwright.fidelity.config.ts hung/was not collectable after reporting one passing test; I stopped the local Playwright process to avoid leaving it running.
  • macOS packaging smoke was not runnable on this Linux runner.

@agent-relay-code

Copy link
Copy Markdown
Contributor

Fixed one current PR breakage: HarnessDriverClientLike.listAgents() now allows roster entries with an omitted pid, matching the PR’s own lagging-registration fallback behavior and tests. Fixed in internal-fleet-client.ts.

Addressed comments

  • No bot or reviewer comments were present in .workforce/context.json or other .workforce files, so there were no external review threads to validate or fix.

Advisory Notes

  • First npx vitest run hit an unrelated unhandled async renderer error from src/renderer/src/components/chat/ChatView.dom.test.ts after all tests passed. A full rerun passed cleanly, and this file is outside the PR scope, so I left it unchanged.
  • I could not verify GitHub-reported mergeability or hosted CI status from this checkout without using GitHub tooling.

Validation

Passed locally:

  • npm ci
  • focused factory SDK Vitest files: 105 tests passed
  • npm run verify:mcp-resources-drift
  • npm run lint with warnings only
  • npm run typecheck
  • npm test
  • npx vitest run on rerun: 668 tests passed
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.fidelity.config.ts
  • npx playwright test --config playwright.redraw.config.ts

I did not run the macOS dist:mac packaged smoke job because this environment is Linux.

@agent-relay-code

Copy link
Copy Markdown
Contributor

No code changes were needed in the current checkout. I validated the prior bot notes against the current code: the DriverAgentLike[] type fix and terminate-before-release cleanup behavior are already present.

Addressed comments

  • gemini-code-assist[bot]: quota-limit notice only; no code finding to address.
  • coderabbitai[bot]: review-limit notice only; no code finding to address.
  • agent-relay-code[bot] comment 4694415206: review-only note, no finding to fix.
  • agent-relay-code[bot] comment 4694569645: stale; the terminate-before-release behavior is already present in packages/factory-sdk/src/orchestrator/factory.ts:755.
  • agent-relay-code[bot] comment 4694584933: stale; the DriverAgentLike[] fix is already present in packages/factory-sdk/src/fleet/internal-fleet-client.ts:12 and packages/factory-sdk/src/fleet/internal-fleet-client.ts:19.

Advisory Notes

  • None.

Validation run locally:

  • npx vitest run packages/factory-sdk/src/orchestrator/reaper.test.ts packages/factory-sdk/src/orchestrator/factory.test.ts packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • npm run verify:mcp-resources-drift
  • npm run lint passed with existing warnings only
  • npm run typecheck:web
  • npm run typecheck:node
  • npm test
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.fidelity.config.ts --workers=1
  • npx playwright test --config playwright.redraw.config.ts --workers=1

npm ci was attempted, but this sandbox killed it before a normal exit; after rerunning the install path and installing Playwright Chromium, the repo gates above completed. GitHub API reported the latest PR head mergeable and required checks completed successfully.

@agent-relay-code

Copy link
Copy Markdown
Contributor

Fixed one PR-scoped issue: factory agent PID trees are now terminated before releasing the broker session, so child PID discovery is not lost when release tears down broker-side state. The helper now processes agents sequentially to preserve release ordering. Fixed in factory.ts.

Added a regression test that makes child PID lookup return empty after release and verifies children are still terminated. Added in factory.test.ts.

Addressed comments

  • gemini-code-assist[bot] comment 4694352679: quota-limit notice only; no code finding to address.
  • coderabbitai[bot] comment 4694353280: review-limit notice only; no code finding to address.
  • agent-relay-code[bot] comment 4694415206: review-only note reported no findings; no code finding to address.
  • agent-relay-code[bot] comment 4694569645: validated current checkout still released before terminating; fixed in packages/factory-sdk/src/orchestrator/factory.ts:755 and covered by packages/factory-sdk/src/orchestrator/factory.test.ts:934.
  • agent-relay-code[bot] comment 4694584933: stale; current checkout already keeps pid optional in packages/factory-sdk/src/fleet/internal-fleet-client.ts:15.
  • agent-relay-code[bot] comment 4694659332: stale; same optional roster PID fix is already present in packages/factory-sdk/src/fleet/internal-fleet-client.ts:15.
  • agent-relay-code[bot] comment 4694758318: prior “no changes needed” note was superseded by the validated release-before-terminate issue fixed above.

Advisory Notes

None.

Validation

Passed locally:

  • npm ci
  • npx vitest run packages/factory-sdk/src/orchestrator/reaper.test.ts packages/factory-sdk/src/orchestrator/factory.test.ts packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • npm run verify:mcp-resources-drift
  • npm run lint with existing warnings only
  • npm run typecheck:web
  • npm run typecheck:node
  • npm test
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.fidelity.config.ts --workers=1
  • npx playwright test --config playwright.redraw.config.ts --workers=1

I did not run the macOS-only dist:mac packaged smoke job on this Linux runner. GitHub reported the pre-edit PR head as mergeable, but hosted CI for these local edits has not run yet.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
packages/factory-sdk/src/fleet/internal-fleet-client.test.ts (1)

22-25: 📐 Maintainability & Code Quality | ⚡ Quick win

Make the fake spawn handle return pid when configured.

FakeHarnessDriverClient.spawnPty() records pid in roster but always omits it from the returned handle, so the direct handle.pid path is not testable from this double.

Suggested fix
 async spawnPty(input: SpawnPtyInput): Promise<{ name: string; session_ref: string; pid?: number }> {
   this.spawned.push(input)
-  this.agents.push({ name: input.name, pid: this.nextPid })
-  return { name: input.name, session_ref: this.nextSessionRef }
+  const pid = this.nextPid
+  this.agents.push({ name: input.name, pid })
+  return { name: input.name, session_ref: this.nextSessionRef, pid }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/fleet/internal-fleet-client.test.ts` around lines 22
- 25, FakeHarnessDriverClient.spawnPty currently records pid in agents but never
includes it in the returned handle; update spawnPty to include a pid property in
the returned object (e.g., return { name: input.name, session_ref:
this.nextSessionRef, pid: this.nextPid }), ensuring pid can be undefined when
not configured so tests can exercise the direct handle.pid path; locate the
method spawnPty and add the pid field to the returned handle.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/factory-sdk/src/fleet/internal-fleet-client.ts`:
- Around line 149-151: The PID validation currently accepts any number
(agent?.pid) which allows 0, negatives or non-integers; update the checks after
this.#client.listAgents() and the equivalent roster/handle PID extraction to
require a positive integer by replacing typeof agent?.pid === 'number' with a
stricter check such as Number.isInteger(agent?.pid) && agent!.pid > 0, and apply
the same validation where PIDs are read for the termination flow so only >0
integer PIDs are passed to signaling/kill logic.

In `@packages/factory-sdk/src/orchestrator/reaper.ts`:
- Around line 124-128: The current catch around await readChildren(pid) swallows
enumeration failures and treats them as empty children which breaks
pidTreePostOrder() guarantees; instead, modify the catch to capture the thrown
error, add an entry to the existing skipped array (or call the module logger)
with the pid and error details, and preserve the error context so the caller
knows child discovery failed for that root; do the same for the other block (the
catch around readChildPids()/lines 141-147), referencing readChildren(pid),
readChildPids(), children, pidTreePostOrder(), and skipped so failures are
surfaced rather than downgraded to "no children".
- Around line 75-88: The catch block currently treats any error as a skipped
PID, but if the error is an ESRCH (no such process) that can occur after the
liveness probe and should be treated as already-exited; change the catch handler
in the reaper logic to detect error.code === 'ESRCH' (or equivalent on the
thrown Error) and, in that case, record the PID as terminated (push into
terminated with the same signals array) instead of pushing to skipped; for other
errors keep the existing skipped behavior and preserve the existing reason
string logic. Use the existing local symbols kill, signals, termGraceMs,
terminated and skipped to implement the change.

---

Nitpick comments:
In `@packages/factory-sdk/src/fleet/internal-fleet-client.test.ts`:
- Around line 22-25: FakeHarnessDriverClient.spawnPty currently records pid in
agents but never includes it in the returned handle; update spawnPty to include
a pid property in the returned object (e.g., return { name: input.name,
session_ref: this.nextSessionRef, pid: this.nextPid }), ensuring pid can be
undefined when not configured so tests can exercise the direct handle.pid path;
locate the method spawnPty and add the pid field to the returned handle.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8c29c23c-ffb5-4eb2-9791-c92f7597b4d7

📥 Commits

Reviewing files that changed from the base of the PR and between ebedd0f and b142e92.

📒 Files selected for processing (8)
  • packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • packages/factory-sdk/src/fleet/internal-fleet-client.ts
  • packages/factory-sdk/src/orchestrator/factory.test.ts
  • packages/factory-sdk/src/orchestrator/factory.ts
  • packages/factory-sdk/src/orchestrator/reaper.test.ts
  • packages/factory-sdk/src/orchestrator/reaper.ts
  • packages/factory-sdk/src/ports/fleet.ts
  • packages/factory-sdk/src/types.ts

Comment on lines +149 to +151
const agent = (await this.#client.listAgents()).find((candidate) => candidate.name === name)
if (typeof agent?.pid === 'number') {
return agent.pid

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Harden PID ingress validation for roster/handle sources.

Line 150 and Line 425 currently accept any number, unlike Line 170 which requires a positive integer. This can propagate invalid PIDs (e.g., 0, negative, non-integer) into termination flow and risk signaling the wrong target.

Suggested fix
 async resolveAgentPid(name: string): Promise<number | undefined> {
   try {
     for (let attempt = 1; attempt <= PID_RESOLVE_ATTEMPTS; attempt += 1) {
       const agent = (await this.#client.listAgents()).find((candidate) => candidate.name === name)
-      if (typeof agent?.pid === 'number') {
+      if (Number.isInteger(agent?.pid) && (agent?.pid ?? 0) > 0) {
         return agent.pid
       }
       if (attempt < PID_RESOLVE_ATTEMPTS) {
         await sleep(PID_RESOLVE_BACKOFF_MS)
       }
@@
 function spawnResultFrom(handle: SpawnedHandleLike, resolvedPid = handle.pid): SpawnResult {
   const result: SpawnResult = { name: handle.name }
   const sessionRef = sessionRefFrom(handle)
   if (sessionRef) result.sessionRef = sessionRef
-  if (typeof resolvedPid === 'number') result.pid = resolvedPid
+  if (Number.isInteger(resolvedPid) && resolvedPid > 0) result.pid = resolvedPid
   return result
 }

Also applies to: 421-426

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/fleet/internal-fleet-client.ts` around lines 149 -
151, The PID validation currently accepts any number (agent?.pid) which allows
0, negatives or non-integers; update the checks after this.#client.listAgents()
and the equivalent roster/handle PID extraction to require a positive integer by
replacing typeof agent?.pid === 'number' with a stricter check such as
Number.isInteger(agent?.pid) && agent!.pid > 0, and apply the same validation
where PIDs are read for the termination flow so only >0 integer PIDs are passed
to signaling/kill logic.

Comment on lines +75 to +88
try {
kill(pid, 'SIGTERM')
signals.push('SIGTERM')
if (termGraceMs > 0) {
await sleep(termGraceMs)
}
if (isPidLive(pid, kill)) {
kill(pid, 'SIGKILL')
signals.push('SIGKILL')
}
terminated.push({ pid, signals })
} catch (error) {
skipped.push({ pid, reason: error instanceof Error ? error.message : String(error) })
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Treat post-probe ESRCH as an already-exited process, not a failed termination.

A PID can disappear between the liveness probe and the subsequent SIGTERM/SIGKILL. In that race, this catch block records the PID as skipped, even though the process is already gone and may have exited cleanly after SIGTERM. That will undercount real terminations and create false "missing PID" noise for the factory/reaper callers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/reaper.ts` around lines 75 - 88, The
catch block currently treats any error as a skipped PID, but if the error is an
ESRCH (no such process) that can occur after the liveness probe and should be
treated as already-exited; change the catch handler in the reaper logic to
detect error.code === 'ESRCH' (or equivalent on the thrown Error) and, in that
case, record the PID as terminated (push into terminated with the same signals
array) instead of pushing to skipped; for other errors keep the existing skipped
behavior and preserve the existing reason string logic. Use the existing local
symbols kill, signals, termGraceMs, terminated and skipped to implement the
change.

Comment on lines +124 to +128
try {
children = await readChildren(pid)
} catch {
children = []
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Don't silently downgrade tree termination to root-only termination when child discovery fails.

Both pidTreePostOrder() and the default readChildPids() implementation turn child-enumeration failures into []. If pgrep is unavailable or the injected reader fails for a live root, the helper loses its child-before-parent guarantee and kills only the root, which can leave descendants running after release/reap. Please surface that failure in skipped or logs instead of treating it as "no children".

Also applies to: 141-147

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/reaper.ts` around lines 124 - 128, The
current catch around await readChildren(pid) swallows enumeration failures and
treats them as empty children which breaks pidTreePostOrder() guarantees;
instead, modify the catch to capture the thrown error, add an entry to the
existing skipped array (or call the module logger) with the pid and error
details, and preserve the error context so the caller knows child discovery
failed for that root; do the same for the other block (the catch around
readChildPids()/lines 141-147), referencing readChildren(pid), readChildPids(),
children, pidTreePostOrder(), and skipped so failures are surfaced rather than
downgraded to "no children".

@kjgbot kjgbot force-pushed the factory-sdk/v2fix11 branch 5 times, most recently from 10b650b to 1a76c1e Compare June 12, 2026 22:26
@kjgbot kjgbot force-pushed the factory-sdk/v2fix11 branch from 1a76c1e to e1ede3f Compare June 12, 2026 22:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/factory-sdk/src/orchestrator/process-identity.ts`:
- Around line 10-26: Add an 'unresolved' variant to the AgentProcessLookupResult
union and wire it through the scan path so failures to run pgrep/ps surface as {
status: 'unresolved' } instead of collapsing into 'missing'; specifically,
update the AgentProcessLookupResult type to include | { status: 'unresolved';
error?: Error | string } and modify the scan helpers (the function
findAgentProcessByName and any callers that use
AgentProcessLookupOptions.listPidsByCommand, readProcessIdentity, readParentPid)
to catch execution errors and return status:'unresolved' (propagating the error
message where helpful) so downstream logic in factory.ts can distinguish real
missing processes from scan failures.

In `@packages/factory-sdk/src/orchestrator/reaper.ts`:
- Around line 334-339: Replace the loose substring check on the process command
line with the same anchored --agent-name matcher used by findAgentProcessByName:
in the block around readProcessIdentity usage (the identity check that currently
does cmdline.includes(agent.name)) and the similar check at lines ~358-362,
validate identity.cmdline against the anchored `--agent-name` pattern (reuse the
exact matcher/regex or helper used by findAgentProcessByName) so only exact
agent-name tokens (not substrings) are accepted before pushing to skipped or
returning the identity.
- Around line 286-297: The loop treats agent.pids as "handled" via
agentHasPersistedPid() but never materializes ProcessInfo entries, causing
durable entries with only pids to be skipped; update the logic in the agent
iteration (the loop over registry.agents and the use of agentHasPersistedPid,
resolveRegistryAgentProcess, and processes Map) to also materialize process
records for any positive integers found in agent.pids (e.g., create
ProcessInfo-like entries or map the pid into processes with minimal metadata) or
else only consider agentHasPersistedPid true when a corresponding processes Map
entry exists; apply the same fix to the analogous code at the later block (lines
~302-304) that checks agentHasPersistedPid.
- Around line 312-333: When resolving an agent by name, prefer the broker's
answer (opts.fleet.resolveAgentPid) over a stale/ambiguous local scan: after
calling scanRegistryAgentProcess(agent, opts, protectedPids) do not immediately
convert scan ambiguity/identity-mismatch into terminal skips if the broker
returns a resolution; instead call opts.fleet.resolveAgentPid(agent.name)
whenever scan did not return status 'found', and if resolution exists and
resolution.status === 'found' return the resolved process, if resolution exists
and resolution.status !== 'found' push a skipped entry that preserves
resolution.status (e.g. 'unresolved') rather than rewriting it to "pid missing";
only when no resolution exists fall back to the original scan-based skipped
reasons (use scanned.status === 'ambiguous' / 'identity-mismatch' to push the
same skip reasons).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c20ceb4f-dc6f-448c-98d4-65939ff71871

📥 Commits

Reviewing files that changed from the base of the PR and between b142e92 and e1ede3f.

📒 Files selected for processing (12)
  • packages/factory-sdk/src/cli/fleet.ts
  • packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • packages/factory-sdk/src/fleet/internal-fleet-client.ts
  • packages/factory-sdk/src/orchestrator/factory.test.ts
  • packages/factory-sdk/src/orchestrator/factory.ts
  • packages/factory-sdk/src/orchestrator/process-identity.test.ts
  • packages/factory-sdk/src/orchestrator/process-identity.ts
  • packages/factory-sdk/src/orchestrator/reaper.test.ts
  • packages/factory-sdk/src/orchestrator/reaper.ts
  • packages/factory-sdk/src/ports/fleet.ts
  • packages/factory-sdk/src/ports/index.ts
  • packages/factory-sdk/src/types.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/factory-sdk/src/fleet/internal-fleet-client.test.ts
  • packages/factory-sdk/src/orchestrator/factory.ts
  • packages/factory-sdk/src/orchestrator/factory.test.ts

Comment on lines +10 to +26
export type AgentProcessLookupResult =
| { status: 'found'; identity: ProcessIdentity }
| { status: 'missing' }
| { status: 'ambiguous' }

export interface AgentProcessFinderOptions {
protectedPids?: number[]
}

export type AgentProcessFinder = (agentName: string, opts?: AgentProcessFinderOptions) => Promise<AgentProcessLookupResult>

export interface AgentProcessLookupOptions {
listPidsByCommand?: (pattern: string) => Promise<number[]>
readProcessIdentity?: (pid: number) => Promise<ProcessIdentity | undefined>
readParentPid?: (pid: number) => Promise<number | undefined>
protectedPids?: number[]
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Add an unresolved result for process-scan failures.

This union leaves findAgentProcessByName() no way to surface pgrep/ps execution failures, so the helper stack collapses those cases into { status: 'missing' }. Downstream, packages/factory-sdk/src/orchestrator/factory.ts treats scan.status === 'missing' plus broker unresolved as "already exited", which can silently skip termination for a still-live agent when the local scan path is broken.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/process-identity.ts` around lines 10 -
26, Add an 'unresolved' variant to the AgentProcessLookupResult union and wire
it through the scan path so failures to run pgrep/ps surface as { status:
'unresolved' } instead of collapsing into 'missing'; specifically, update the
AgentProcessLookupResult type to include | { status: 'unresolved'; error?: Error
| string } and modify the scan helpers (the function findAgentProcessByName and
any callers that use AgentProcessLookupOptions.listPidsByCommand,
readProcessIdentity, readParentPid) to catch execution errors and return
status:'unresolved' (propagating the error message where helpful) so downstream
logic in factory.ts can distinguish real missing processes from scan failures.

Comment on lines 286 to +297
for (const agent of registry.agents ?? []) {
for (const processInfo of agent.processes ?? []) {
if (Number.isInteger(processInfo.pid) && processInfo.pid > 0) {
processes.set(processInfo.pid, processInfo)
}
}
if (!agentHasPersistedPid(agent)) {
const resolved = await resolveRegistryAgentProcess(agent, opts, skipped, protectedPids)
if (resolved) {
processes.set(resolved.pid, resolved)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Don't treat agent.pids as handled unless you also materialize a process record.

registryProcesses() only queues agent.processes, but agentHasPersistedPid() returns true for a positive PID in agent.pids too. A durable registry entry with pids populated and processes absent will therefore skip both termination and name re-resolution, leaking a live orphan.

Also applies to: 302-304

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/reaper.ts` around lines 286 - 297, The
loop treats agent.pids as "handled" via agentHasPersistedPid() but never
materializes ProcessInfo entries, causing durable entries with only pids to be
skipped; update the logic in the agent iteration (the loop over registry.agents
and the use of agentHasPersistedPid, resolveRegistryAgentProcess, and processes
Map) to also materialize process records for any positive integers found in
agent.pids (e.g., create ProcessInfo-like entries or map the pid into processes
with minimal metadata) or else only consider agentHasPersistedPid true when a
corresponding processes Map entry exists; apply the same fix to the analogous
code at the later block (lines ~302-304) that checks agentHasPersistedPid.

Comment on lines +312 to +333
const scanned = await scanRegistryAgentProcess(agent, opts, protectedPids)
if (scanned.status === 'found') {
return scanned.process
}
if (scanned.status === 'ambiguous') {
skipped.push({ reason: `pid ambiguous for ${agent.name}` })
return undefined
}
if (scanned.status === 'identity-mismatch') {
skipped.push({ pid: scanned.pid, reason: 'pid identity mismatch' })
return undefined
}

const resolution = await opts.fleet?.resolveAgentPid?.(agent.name)
if (!resolution) {
skipped.push({ reason: `pid missing for ${agent.name}` })
return undefined
}
if (resolution.status !== 'found') {
skipped.push({ reason: resolution.status === 'unresolved' ? `pid missing for ${agent.name}` : `pid ${resolution.status} for ${agent.name}` })
return undefined
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Keep broker liveness authoritative in the name-only resolution path.

This method trusts scanRegistryAgentProcess() before fleet.resolveAgentPid(), returns early on scan ambiguity/mismatch, and then rewrites broker unresolved to pid missing. That inverts the contract described in the PR: a stale or ambiguous local scan can block the broker's answer, and a present-without-PID agent is silently downgraded instead of being reported as unresolved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/reaper.ts` around lines 312 - 333, When
resolving an agent by name, prefer the broker's answer
(opts.fleet.resolveAgentPid) over a stale/ambiguous local scan: after calling
scanRegistryAgentProcess(agent, opts, protectedPids) do not immediately convert
scan ambiguity/identity-mismatch into terminal skips if the broker returns a
resolution; instead call opts.fleet.resolveAgentPid(agent.name) whenever scan
did not return status 'found', and if resolution exists and resolution.status
=== 'found' return the resolved process, if resolution exists and
resolution.status !== 'found' push a skipped entry that preserves
resolution.status (e.g. 'unresolved') rather than rewriting it to "pid missing";
only when no resolution exists fall back to the original scan-based skipped
reasons (use scanned.status === 'ambiguous' / 'identity-mismatch' to push the
same skip reasons).

Comment on lines +334 to +339
const identity = await (opts.readProcessIdentity ?? readProcessIdentity)(resolution.pid)
if (!identity || !identity.cmdline.includes(agent.name)) {
skipped.push({ pid: resolution.pid, reason: 'pid identity mismatch' })
return undefined
}
return { ...identity, agentName: agent.name }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use anchored agent-name validation here, not substring matching.

cmdline.includes(agent.name) accepts ar-1-impl inside --agent-name ar-1-impl-extra, so both the broker fallback and an injected processFinder can bless the wrong PID for termination. Reuse the same anchored --agent-name matcher that findAgentProcessByName() already applies.

Also applies to: 358-362

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/reaper.ts` around lines 334 - 339,
Replace the loose substring check on the process command line with the same
anchored --agent-name matcher used by findAgentProcessByName: in the block
around readProcessIdentity usage (the identity check that currently does
cmdline.includes(agent.name)) and the similar check at lines ~358-362, validate
identity.cmdline against the anchored `--agent-name` pattern (reuse the exact
matcher/regex or helper used by findAgentProcessByName) so only exact agent-name
tokens (not substrings) are accepted before pushing to skipped or returning the
identity.

@kjgbot kjgbot merged commit 97ae199 into main Jun 12, 2026
4 checks passed
@kjgbot kjgbot deleted the factory-sdk/v2fix11 branch June 12, 2026 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant