Preserve dispatch-failed agents for reaper handoff#264
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
📝 WalkthroughWalkthroughThis PR adds dispatch-failure handoff tracking to the factory orchestrator, persisting orphaned agent spawn metadata into the durable in-flight reaper registry when dispatch fails. The factory collects spawned agents before dispatch execution, persists them on failure, and clears them on success or issue completion. The reaper later reads and consumes these handoffs to release stranded processes. ChangesDispatch-Failure Handoff Tracking and Reaper Registry Persistence
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint install failed due to a network error. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Found and fixed one real issue: dispatch-failure handoff entries could remain in memory after the same issue later succeeded/completed, causing stale agents to keep appearing in later registry writes. The fix clears those handoffs after successful dispatch and completion in factory.ts, factory.ts, and factory.ts. Added a regression test in factory.test.ts. Addressed comments
Advisory NotesNone. Local validationPassed locally:
I installed Playwright Chromium locally after the first browser run failed due to a missing executable, then reran the Playwright gates successfully. I did not run the macOS packaging smoke job in this Linux sandbox. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/factory-sdk/src/orchestrator/factory.test.ts`:
- Around line 966-969: Add a final explicit assertion after the existing
expectations in the V1 reaper test to verify process liveness: assert that no
killed entry has a nonzero signal except the protected broker PID 68009 and that
the set of alive PIDs equals exactly [68009]; specifically, add checks against
the killed array (e.g., ensure killed.every(e => e.pid === 68009 || e.signal ===
0)) and assert report.alive or an equivalent liveness source contains only
68009, referencing the existing symbols report, killed, and the protected pid
68009.
In `@packages/factory-sdk/src/orchestrator/factory.ts`:
- Around line 924-945: appendAgent currently persists agents with pids returned
from this.#terminationRoots without removing protected broker PIDs, which causes
reaper to skip name-based resolution; after obtaining const { pids } = await
this.#terminationRoots(agentName, tracked) filter out protected PIDs (e.g.,
using the existing protected-PID checker on the class such as
this.#isProtectedPid or a protectedPidSet) before building the processes array
and before pushing into agents so pids = pids.filter(pid =>
!this.#isProtectedPid(pid)); also filter processes to match the filtered pids
and ensure seenAgents/registryHandoffKey logic still applies so protected PIDs
are never persisted in agents.
- Around line 885-899: The catch block currently deletes all handoff keys for
handoffAgents which can remove pre-existing entries; change the code to track
which registry keys you actually added before calling `#writeInFlightRegistry`
(e.g., create a local array/set newKeys and push registryHandoffKey(agent.issue,
agent.name) when adding to this.#dispatchFailureReaperHandoffs inside the try),
and in the catch only delete those tracked newKeys instead of deleting every
handoff for handoffAgents; keep uses of
`#increment`('dispatchFailureReaperHandoffFailures') and the logging as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: ac57f142-7502-473b-a740-30671ffbeaa8
📒 Files selected for processing (2)
packages/factory-sdk/src/orchestrator/factory.test.tspackages/factory-sdk/src/orchestrator/factory.ts
| expect(report.reaped.map((entry) => entry.pid)).toEqual([7_302, 7_303, 7_301]) | ||
| expect(killed.some((entry) => entry.pid === 68_009 && entry.signal !== 0)).toBe(false) | ||
| expect(report.skipped).toContainEqual({ pid: 68_009, reason: 'protected pid' }) | ||
| } finally { |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Add an explicit zero-survivor assertion for the V1 reaper scenario.
This test verifies kill/reap reporting but doesn’t assert final process liveness. Add a post-condition that only protected broker PID 68_009 remains alive.
🔧 Suggested fix
expect(report.reaped.map((entry) => entry.pid)).toEqual([7_302, 7_303, 7_301])
expect(killed.some((entry) => entry.pid === 68_009 && entry.signal !== 0)).toBe(false)
expect(report.skipped).toContainEqual({ pid: 68_009, reason: 'protected pid' })
+ expect(alive).toEqual(new Set([68_009]))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| expect(report.reaped.map((entry) => entry.pid)).toEqual([7_302, 7_303, 7_301]) | |
| expect(killed.some((entry) => entry.pid === 68_009 && entry.signal !== 0)).toBe(false) | |
| expect(report.skipped).toContainEqual({ pid: 68_009, reason: 'protected pid' }) | |
| } finally { | |
| expect(report.reaped.map((entry) => entry.pid)).toEqual([7_302, 7_303, 7_301]) | |
| expect(killed.some((entry) => entry.pid === 68_009 && entry.signal !== 0)).toBe(false) | |
| expect(report.skipped).toContainEqual({ pid: 68_009, reason: 'protected pid' }) | |
| expect(alive).toEqual(new Set([68_009])) | |
| } finally { |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/factory-sdk/src/orchestrator/factory.test.ts` around lines 966 -
969, Add a final explicit assertion after the existing expectations in the V1
reaper test to verify process liveness: assert that no killed entry has a
nonzero signal except the protected broker PID 68009 and that the set of alive
PIDs equals exactly [68009]; specifically, add checks against the killed array
(e.g., ensure killed.every(e => e.pid === 68009 || e.signal === 0)) and assert
report.alive or an equivalent liveness source contains only 68009, referencing
the existing symbols report, killed, and the protected pid 68009.
| try { | ||
| for (const agent of handoffAgents) { | ||
| this.#dispatchFailureReaperHandoffs.set(registryHandoffKey(agent.issue, agent.name), agent) | ||
| } | ||
| await this.#writeInFlightRegistry() | ||
| this.#increment('dispatchFailureReaperHandoffs') | ||
| this.#logger.warn?.('[factory] persisted dispatch-failed agents for orphan reaper', { | ||
| issue: record.issue, | ||
| agents: handoffAgents.map((agent) => agent.name).sort(), | ||
| }) | ||
| } catch (error) { | ||
| this.#increment('dispatchFailureReaperHandoffFailures') | ||
| for (const agent of handoffAgents) { | ||
| this.#dispatchFailureReaperHandoffs.delete(registryHandoffKey(agent.issue, agent.name)) | ||
| } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Preserve prior handoff entries on persistence failure rollback.
Lines 897-899 always delete keys after a write failure. If the same issue/agent handoff was already persisted earlier, this drops that prior entry from memory and a subsequent registry rewrite can erase the durable handoff.
🔧 Suggested fix
async `#persistDispatchFailureReaperHandoff`(record: InFlightIssue, handoffAgents: RegistryHandoffAgent[]): Promise<void> {
if (record.dryRun || handoffAgents.length === 0) {
return
}
+ const previousByKey = new Map<string, RegistryHandoffAgent | undefined>()
try {
for (const agent of handoffAgents) {
- this.#dispatchFailureReaperHandoffs.set(registryHandoffKey(agent.issue, agent.name), agent)
+ const key = registryHandoffKey(agent.issue, agent.name)
+ previousByKey.set(key, this.#dispatchFailureReaperHandoffs.get(key))
+ this.#dispatchFailureReaperHandoffs.set(key, agent)
}
await this.#writeInFlightRegistry()
this.#increment('dispatchFailureReaperHandoffs')
this.#logger.warn?.('[factory] persisted dispatch-failed agents for orphan reaper', {
issue: record.issue,
agents: handoffAgents.map((agent) => agent.name).sort(),
})
} catch (error) {
this.#increment('dispatchFailureReaperHandoffFailures')
- for (const agent of handoffAgents) {
- this.#dispatchFailureReaperHandoffs.delete(registryHandoffKey(agent.issue, agent.name))
+ for (const [key, previous] of previousByKey) {
+ if (previous) {
+ this.#dispatchFailureReaperHandoffs.set(key, previous)
+ } else {
+ this.#dispatchFailureReaperHandoffs.delete(key)
+ }
}
this.#logger.error?.('[factory] failed to persist dispatch-failed agents for orphan reaper', {
issue: record.issue,
error,
})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try { | |
| for (const agent of handoffAgents) { | |
| this.#dispatchFailureReaperHandoffs.set(registryHandoffKey(agent.issue, agent.name), agent) | |
| } | |
| await this.#writeInFlightRegistry() | |
| this.#increment('dispatchFailureReaperHandoffs') | |
| this.#logger.warn?.('[factory] persisted dispatch-failed agents for orphan reaper', { | |
| issue: record.issue, | |
| agents: handoffAgents.map((agent) => agent.name).sort(), | |
| }) | |
| } catch (error) { | |
| this.#increment('dispatchFailureReaperHandoffFailures') | |
| for (const agent of handoffAgents) { | |
| this.#dispatchFailureReaperHandoffs.delete(registryHandoffKey(agent.issue, agent.name)) | |
| } | |
| async `#persistDispatchFailureReaperHandoff`(record: InFlightIssue, handoffAgents: RegistryHandoffAgent[]): Promise<void> { | |
| if (record.dryRun || handoffAgents.length === 0) { | |
| return | |
| } | |
| const previousByKey = new Map<string, RegistryHandoffAgent | undefined>() | |
| try { | |
| for (const agent of handoffAgents) { | |
| const key = registryHandoffKey(agent.issue, agent.name) | |
| previousByKey.set(key, this.#dispatchFailureReaperHandoffs.get(key)) | |
| this.#dispatchFailureReaperHandoffs.set(key, agent) | |
| } | |
| await this.#writeInFlightRegistry() | |
| this.#increment('dispatchFailureReaperHandoffs') | |
| this.#logger.warn?.('[factory] persisted dispatch-failed agents for orphan reaper', { | |
| issue: record.issue, | |
| agents: handoffAgents.map((agent) => agent.name).sort(), | |
| }) | |
| } catch (error) { | |
| this.#increment('dispatchFailureReaperHandoffFailures') | |
| for (const [key, previous] of previousByKey) { | |
| if (previous) { | |
| this.#dispatchFailureReaperHandoffs.set(key, previous) | |
| } else { | |
| this.#dispatchFailureReaperHandoffs.delete(key) | |
| } | |
| } | |
| this.#logger.error?.('[factory] failed to persist dispatch-failed agents for orphan reaper', { | |
| issue: record.issue, | |
| error, | |
| }) | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/factory-sdk/src/orchestrator/factory.ts` around lines 885 - 899, The
catch block currently deletes all handoff keys for handoffAgents which can
remove pre-existing entries; change the code to track which registry keys you
actually added before calling `#writeInFlightRegistry` (e.g., create a local
array/set newKeys and push registryHandoffKey(agent.issue, agent.name) when
adding to this.#dispatchFailureReaperHandoffs inside the try), and in the catch
only delete those tracked newKeys instead of deleting every handoff for
handoffAgents; keep uses of `#increment`('dispatchFailureReaperHandoffFailures')
and the logging as-is.
| const appendAgent = async (issue: IssueRef, agentName: string, tracked: TrackedAgent): Promise<void> => { | ||
| const key = registryHandoffKey(issue, agentName) | ||
| if (seenAgents.has(key)) { | ||
| return | ||
| } | ||
| seenAgents.add(key) | ||
| const { pids } = await this.#terminationRoots(agentName, tracked) | ||
| const processes = [] | ||
| for (const pid of pids) { | ||
| const identity = await this.#processIdentityReader(pid) | ||
| if (identity && identity.cmdline.includes(agentName)) { | ||
| processes.push({ ...identity, agentName }) | ||
| } | ||
| } | ||
| agents.push({ | ||
| name: agentName, | ||
| role: tracked.spec.role, | ||
| issue, | ||
| sessionRef: tracked.sessionRef, | ||
| pids, | ||
| processes, | ||
| }) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Filter protected PIDs before persisting handoff registry agents.
Line 930 resolves roots without a protected-PID filter. If a protected broker PID is persisted, reaper treats the agent as already PID-resolved and skips name-based resolution, which can leave child orphan processes unreaped.
🔧 Suggested fix
async `#writeInFlightRegistry`(
path = this.#config.loop.registryPath,
heartbeatPath = this.#config.loop.heartbeatPath,
empty = false,
): Promise<void> {
const updatedAtMs = this.#clock.now()
const agents: FactoryInFlightRegistryAgent[] = []
const seenAgents = new Set<string>()
+ const protectedPids = await this.#protectedPids()
+ const protectedPidSet = new Set(protectedPids)
const appendAgent = async (issue: IssueRef, agentName: string, tracked: TrackedAgent): Promise<void> => {
const key = registryHandoffKey(issue, agentName)
if (seenAgents.has(key)) {
return
}
seenAgents.add(key)
- const { pids } = await this.#terminationRoots(agentName, tracked)
+ const { pids } = await this.#terminationRoots(agentName, tracked, protectedPids)
+ const persistedPids = pids.filter((pid) => !protectedPidSet.has(pid))
const processes = []
- for (const pid of pids) {
+ for (const pid of persistedPids) {
const identity = await this.#processIdentityReader(pid)
if (identity && identity.cmdline.includes(agentName)) {
processes.push({ ...identity, agentName })
}
}
agents.push({
name: agentName,
role: tracked.spec.role,
issue,
sessionRef: tracked.sessionRef,
- pids,
+ pids: persistedPids,
processes,
})
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/factory-sdk/src/orchestrator/factory.ts` around lines 924 - 945,
appendAgent currently persists agents with pids returned from
this.#terminationRoots without removing protected broker PIDs, which causes
reaper to skip name-based resolution; after obtaining const { pids } = await
this.#terminationRoots(agentName, tracked) filter out protected PIDs (e.g.,
using the existing protected-PID checker on the class such as
this.#isProtectedPid or a protectedPidSet) before building the processes array
and before pushing into agents so pids = pids.filter(pid =>
!this.#isProtectedPid(pid)); also filter processes to match the filtered pids
and ensure seenAgents/registryHandoffKey logic still applies so protected PIDs
are never persisted in agents.
|
Implemented fixes for the validated PR review findings. Changed:
Addressed comments
Advisory NotesNone. Local validationPassed locally:
I did not run the macOS packaging job in this Linux sandbox. |
c77a088 to
d5c2ae3
Compare
|
Implemented the validated PR fixes. Changed:
Addressed comments
Advisory NotesNone. Local validationPassed locally:
I did not run the macOS packaging job in this Linux sandbox, and I did not verify GitHub-side mergeability/check status from here. |
Summary
#batch.abandon(...), loop heartbeat rewrites, and stop-time registry rewrites so the externalfleet factory reap-orphanspass can still resolve and reap them.Verification
npx vitest run packages/factory-sdknpx tsc --noEmitV1 Required
V0 proves the dispatch-failure path writes and preserves the durable handoff, and that the reaper consumes matching name-only entries with protected PIDs honored. It cannot prove the live failure-path end to end. fv2 V1 is mandatory: induce a real post-spawn writeback failure, confirm the abandoned pair is present in the durable registry, then confirm the running external reaper consumes that handoff and kills the whole tree with zero survivors while broker
68009remains alive.