Harden factory loop dedup and teardown #248

Merged
kjgbot merged 1 commit into main from factory-sdk/v2fix2
Jun 12, 2026

@kjgbot kjgbot commented Jun 12, 2026

Contributor

Summary

Implements V2FIX-2 hardening for the factory loop:

  • D1 dispatch dedup by Linear issue key, independent of duplicate path/uuid observations.
  • D2 dispatch error backoff with bounded attempts and no tight retry storm.
  • D3 Slack status responder dedup by human message ts.
  • D4 bounded V3 loop runner with max iteration cap and batch cap preserved.
  • D5 un-orphanable teardown: dispose(), loop finally stop, Slack watcher timer cleanup, and CLI SIGINT/SIGTERM stop-before-exit handling.
  • D6 heartbeat/liveness plus CLI factory loop-status and factory kill-loop single-command kill path.
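
D1 and D3 share one idea: dedup on a stable identity (the Linear issue key, or the human message ts) rather than on ids that change between observations. A minimal sketch of that idea follows; `Observation` and `KeyedDeduper` are hypothetical names for illustration and are not taken from the SDK.

```typescript
// Minimal sketch. `Observation` and `KeyedDeduper` are hypothetical names,
// not part of the factory-sdk API.
interface Observation {
  issueKey: string // stable identity, e.g. a Linear issue key
  path: string // can differ between observations of the same issue
  uuid: string // can differ between observations of the same issue
}

class KeyedDeduper<T> {
  private readonly seen = new Set<string>()
  private readonly keyOf: (item: T) => string

  constructor(keyOf: (item: T) => string) {
    this.keyOf = keyOf
  }

  // Returns true the first time a key is seen and false on every repeat.
  admit(item: T): boolean {
    const key = this.keyOf(item)
    if (this.seen.has(key)) return false
    this.seen.add(key)
    return true
  }
}

// Key on the issue, so a changed path or uuid cannot trigger a second dispatch.
const dispatchDedup = new KeyedDeduper<Observation>((o) => o.issueKey)
const first = dispatchDedup.admit({ issueKey: 'ENG-1', path: '/a', uuid: 'u1' })
const reread = dispatchDedup.admit({ issueKey: 'ENG-1', path: '/b', uuid: 'u2' })
const other = dispatchDedup.admit({ issueKey: 'ENG-2', path: '/a', uuid: 'u1' })
```

Here `first` and `other` are admitted and `reread` is rejected. The same class could presumably cover the Slack case by keying on the message ts instead of an event id.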

V0 Evidence

  • npx vitest run packages/factory-sdk -> 205 passed / 18 files.
  • npx tsc --noEmit -p tsconfig.node.json -> exit 0.
  • Non-vacuous tests added for duplicate key dispatch, cooldown retry gap, Slack message-ts reread dedup, bounded loop + batch cap, watcher timer teardown, heartbeat stale detection, loop-status, kill-loop, and signal cleanup.

V1 Handoff

Ready for factory-verify live checks:

  • A Ready [factory-e2e] issue dispatches once, backs off on error, and does not trigger a retry storm.
  • One human Slack status message produces one responder reply even if inbound is reread.
  • Bounded loop advances heartbeat, idles after K iterations, and factory kill-loop stops via SIGTERM.
  • Exit/crash/SIGTERM paths leave zero lingering listeners/processes.

Guardrails

  • No src/main changes.
  • No dependency changes.
  • Canary AR-133/AR-134 untouched.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Walkthrough

This PR extends the factory-sdk with bounded loop execution, heartbeat persistence, dispatch attempt gating with backoff/retry logic, and new CLI actions. It introduces loop types and configuration, implements runLoop with state transitions written to a heartbeat JSON file, tracks dispatch attempts to prevent repeated failed dispatches, refactors Slack polling for cleaner cancellation, and adds signal handlers for graceful shutdown.
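
The Slack polling refactor mentioned here (a timer-scheduled poll that can be cancelled on stop) can be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `Timers` and `Poller` are hypothetical names, the tick is synchronous for brevity, and the timer functions are injected so the behavior can be checked without waiting.

```typescript
// Minimal sketch of a cancellable poller. `Timers` and `Poller` are
// hypothetical names; the tick is synchronous here for brevity.
interface Timers {
  set(fn: () => void, ms: number): unknown
  clear(handle: unknown): void
}

const realTimers: Timers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
}

class Poller {
  private timer: unknown = undefined
  private stopped = true
  private readonly tick: () => void
  private readonly intervalMs: number
  private readonly timers: Timers

  constructor(tick: () => void, intervalMs: number, timers: Timers = realTimers) {
    this.tick = tick
    this.intervalMs = intervalMs
    this.timers = timers
  }

  start(): void {
    if (!this.stopped) return
    this.stopped = false
    this.schedule()
  }

  // Cancels the pending timer so nothing is left scheduled after stop.
  stop(): void {
    this.stopped = true
    if (this.timer !== undefined) {
      this.timers.clear(this.timer)
      this.timer = undefined
    }
  }

  private schedule(): void {
    this.timer = this.timers.set(() => {
      this.timer = undefined
      if (this.stopped) return
      try {
        this.tick()
      } finally {
        // Re-arm only after the tick finishes, and only if still running.
        if (!this.stopped) this.schedule()
      }
    }, this.intervalMs)
  }
}
```

Because at most one timer is armed at any time and its handle is tracked, `stop()` has a single thing to cancel, which is the property a teardown test can assert on.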

Changes

Factory Loop Execution and Management

| Layer / File(s) | Summary |
| --- | --- |
| Loop execution types and configuration: src/types.ts, src/config/schema.ts, src/index.ts, src/orchestrator/index.ts | Factory interface gains runLoop() and dispose() methods. New types define loop execution options, heartbeat structure (pid, status, iteration counts, timestamps), and liveness status. Config schema extends liveSubscription with dispatch (errorCooldownMs, maxAttempts) and loop (maxIterations, heartbeatPath, heartbeatStaleMs) fields. |
| Loop execution and heartbeat persistence: src/orchestrator/factory.ts, src/orchestrator/factory.test.ts | runLoop() executes runOnce() for a bounded iteration count while writing heartbeat state (running, idle, stopping) to a configured JSON file. Exports readFactoryLoopHeartbeat() to parse heartbeat files and checkFactoryLoopLiveness() to compute staleness and report health. Tests verify iteration limits, heartbeat updates, and liveness detection of missing/stale heartbeats. |
| Dispatch attempt tracking and refusal gating: src/orchestrator/factory.ts, src/orchestrator/factory.test.ts | Introduces DispatchAttemptState to track per-issue attempt counts, in-flight status, terminal failures, and backoff-until timestamps. runOnce() and #handleChange() short-circuit when #dispatchBlockReason() indicates an issue is blocked (in-flight, terminal, backoff-active, or retry-limit exceeded). On failure, records attempt state with cooldown or terminal marking. Tests verify deduplication across differing issue paths and deterministic backoff spacing using a manual clock. |
| Slack polling refactoring and deduplication: src/orchestrator/factory.ts, src/orchestrator/factory.test.ts | Refactors reply watcher polling from a tight while loop to a recursive setTimeout-scheduled poll() with explicit timer tracking and cancellation on stop. Reply deduplication now uses seenReplyMessages keyed by (threadTs, messageTs) for message-level precision. Tests verify dedup across re-reads with fresh event IDs and dispose() cleanup of watchers/timers. |
| Signal handling and CLI loop management: src/cli/fleet.ts, src/cli/fleet.test.ts | Adds installFactoryStopSignalHandlers() to register SIGINT/SIGTERM listeners that stop the factory and exit with code 130 (SIGINT) or 143 (SIGTERM). CLI expands with loop-status (reads heartbeat, reports liveness) and kill-loop (sends SIGTERM to heartbeat PID) actions. Factory loop action now invokes runLoop() with signal handlers, guaranteed cleanup in finally, and report emission. Tests verify signal delivery, handler unregistration, CLI output, and heartbeat state transitions. |
| BatchTracker issue cleanup: src/orchestrator/batch-tracker.ts | BatchTracker gains abandon() to remove an in-flight issue record and its associated invocation tracking without advancing to the next queued issue, enabling targeted cleanup in dispatch failure paths. |
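
The dispatch gating summarized in the table can be illustrated with a small, self-contained sketch. It is an assumption-laden approximation, not the SDK's DispatchAttemptState: `AttemptState`, `BlockReason`, and `DispatchGate` are hypothetical names, terminal failures are folded into the retry limit, and the clock is injected in the spirit of the manual-clock tests the summary mentions.

```typescript
// Minimal sketch of per-issue dispatch gating with an injectable clock.
// All names here are hypothetical; this is not the SDK's DispatchAttemptState.
interface AttemptState {
  attempts: number
  inFlight: boolean
  backoffUntilMs: number
}

type BlockReason = 'in-flight' | 'retry-limit' | 'backoff-active'

class DispatchGate {
  private readonly states = new Map<string, AttemptState>()
  private readonly errorCooldownMs: number
  private readonly maxAttempts: number
  private readonly now: () => number

  constructor(opts: { errorCooldownMs: number; maxAttempts: number; now?: () => number }) {
    this.errorCooldownMs = opts.errorCooldownMs
    this.maxAttempts = opts.maxAttempts
    this.now = opts.now ?? (() => Date.now())
  }

  // Returns why an issue may not dispatch right now, or undefined if it may.
  blockReason(issueKey: string): BlockReason | undefined {
    const state = this.states.get(issueKey)
    if (!state) return undefined
    if (state.inFlight) return 'in-flight'
    if (state.attempts >= this.maxAttempts) return 'retry-limit'
    if (this.now() < state.backoffUntilMs) return 'backoff-active'
    return undefined
  }

  // Claims the issue for one attempt; returns false when it is blocked.
  begin(issueKey: string): boolean {
    if (this.blockReason(issueKey) !== undefined) return false
    const state = this.states.get(issueKey) ?? { attempts: 0, inFlight: false, backoffUntilMs: 0 }
    state.attempts += 1
    state.inFlight = true
    this.states.set(issueKey, state)
    return true
  }

  // Records a failed attempt and starts the cooldown before the next retry.
  fail(issueKey: string): void {
    const state = this.states.get(issueKey)
    if (!state) return
    state.inFlight = false
    state.backoffUntilMs = this.now() + this.errorCooldownMs
  }
}
```

With `maxAttempts: 2` and `errorCooldownMs: 1000`, a failed first attempt blocks retries until the clock reaches the cooldown boundary, and a failed second attempt blocks the issue for good. A success path is deliberately left out of the sketch.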

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • AgentWorkforce/pear#229: This PR and PR #229 both modify the FactoryConfigSchema exported from packages/factory-sdk/src/config/schema.ts. PR #229 introduces the schema, and this PR extends it with the liveSubscription.dispatch and liveSubscription.loop fields that the new CLI and orchestrator loop/heartbeat logic consumes.

🐰 A factory loop hops and bounds,
Each heartbeat a steady sound,
When signals arrive, it stops with grace,
Then clears the slate—dispatch-backoff's in place!
Dispatch attempts won't race,
Slack polls now know their place. 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes: hardening factory loop with dispatch deduplication and teardown improvements, which matches the core objectives. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing detailed coverage of all six hardening objectives (D1-D6) with evidence and test results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a factory loop runner (runLoop) with heartbeat-backed status monitoring, loop-status, and kill-loop CLI commands, alongside dispatch attempt deduplication, error backoff, and Slack watcher cleanup. The review feedback focuses on improving the graceful shutdown of the factory loop. Key recommendations include implementing a dispose() method that awaits the active loop promise, tracking this promise with a private field, updating signal handlers and CLI commands to call dispose() instead of stop(), and enhancing runLoop to support graceful cancellation and robust error handling.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +419 to 421
  if (!stopping) {
    stopping = factory.stop()
  }

high

Call factory.dispose() instead of factory.stop() in the signal handler. This ensures that if a SIGINT or SIGTERM is received while the factory loop is running, the process will wait for the current iteration to finish gracefully before exiting, rather than terminating abruptly mid-execution.

Suggested change
-  if (!stopping) {
-    stopping = factory.stop()
-  }
+  if (!stopping) {
+    stopping = factory.dispose()
+  }

Comment on lines +193 to +195
  async dispose(): Promise<void> {
    await this.stop()
  }

high

Implement dispose() to await the active loop promise if it exists. This ensures that when the factory is disposed (e.g., during a graceful shutdown or signal handling), the current loop iteration completes cleanly before the process exits.

  async dispose(): Promise<void> {
    await this.stop()
    if (this.#activeLoopPromise) {
      try {
        await this.#activeLoopPromise
      } catch {
        // ignore
      }
    }
  }

Comment on lines +385 to +406
  async runLoop(opts: FactoryLoopRunOptions = {}): Promise<IterationReport[]> {
    const maxIterations = Math.min(5, Math.max(1, Math.trunc(opts.maxIterations ?? this.#config.loop.maxIterations)))
    const heartbeatPath = opts.heartbeatPath ?? this.#config.loop.heartbeatPath
    const reports: IterationReport[] = []
    let completed = false
    try {
      for (let iteration = 0; iteration < maxIterations; iteration += 1) {
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration, maxIterations)
        reports.push(await this.runOnce({ dryRun: opts.dryRun }))
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration + 1, maxIterations)
      }
      this.#increment('loopIdle')
      await this.#writeLoopHeartbeat(heartbeatPath, 'idle', reports.length, maxIterations)
      completed = true
      return reports
    } finally {
      if (!completed) {
        await this.#writeLoopHeartbeat(heartbeatPath, 'stopping', reports.length, maxIterations)
      }
      await this.stop()
    }
  }

high

Enhance runLoop to support graceful cancellation and robust error handling:

  1. Set this.#started = true at the start of the loop and check if (!this.#started) break to allow graceful early exit when stop() is called.
  2. Wrap runOnce in a try/catch block so that a single iteration failure (e.g., due to a transient network or dispatch error) does not crash the entire loop runner.
  3. Track the active loop promise in this.#activeLoopPromise and clear it in the finally block to avoid deadlocks during shutdown.
  async runLoop(opts: FactoryLoopRunOptions = {}): Promise<IterationReport[]> {
    this.#started = true
    const maxIterations = Math.min(5, Math.max(1, Math.trunc(opts.maxIterations ?? this.#config.loop.maxIterations)))
    const heartbeatPath = opts.heartbeatPath ?? this.#config.loop.heartbeatPath
    const reports: IterationReport[] = []
    let completed = false
    const promise = (async () => {
      for (let iteration = 0; iteration < maxIterations; iteration += 1) {
        if (!this.#started) {
          break
        }
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration, maxIterations)
        try {
          reports.push(await this.runOnce({ dryRun: opts.dryRun }))
        } catch (error) {
          this.#logger.error?.(`[factory] loop iteration ${iteration} failed`, error)
        }
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration + 1, maxIterations)
      }
      if (this.#started) {
        this.#increment('loopIdle')
        await this.#writeLoopHeartbeat(heartbeatPath, 'idle', reports.length, maxIterations)
        completed = true
      }
      return reports
    })()
    this.#activeLoopPromise = promise
    try {
      return await promise
    } finally {
      this.#activeLoopPromise = undefined
      if (!completed) {
        await this.#writeLoopHeartbeat(heartbeatPath, 'stopping', reports.length, maxIterations)
      }
      await this.stop()
    }
  }

Comment on lines +220 to +221
removeSignalHandlers()
await factory.stop()

medium

Use factory.dispose() instead of factory.stop() to ensure that any active loop execution is gracefully awaited and cleaned up before the CLI command exits.

Suggested change
-  removeSignalHandlers()
-  await factory.stop()
+  removeSignalHandlers()
+  await factory.dispose()

Comment on lines +97 to 100
  readonly #dispatchAttempts = new Map<string, DispatchAttemptState>()
  #subscription?: Subscription
  #livePollTimer?: ReturnType<typeof setTimeout>
  #livePollInFlight = false

medium

Add a private field #activeLoopPromise to track the active loop execution promise so that graceful shutdown can await its completion.

Suggested change
-  readonly #dispatchAttempts = new Map<string, DispatchAttemptState>()
-  #subscription?: Subscription
-  #livePollTimer?: ReturnType<typeof setTimeout>
-  #livePollInFlight = false
+  readonly #dispatchAttempts = new Map<string, DispatchAttemptState>()
+  #activeLoopPromise?: Promise<unknown>
+  #subscription?: Subscription
+  #livePollTimer?: ReturnType<typeof setTimeout>
+  #livePollInFlight = false

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/factory-sdk/src/orchestrator/factory.ts (1)

461-500: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Release partially spawned agents before abandoning the batch record.

If any spawn succeeds and a later step in this try fails, Line 498 forgets the issue without releasing the agents already recorded on record.agents. That leaks live workers and allows the next retry to dispatch the same issue again while the first wave is still running.

Suggested direction
     } catch (error) {
+      const spawnedAgents = [...record.agents.keys()]
+      await Promise.allSettled(
+        spawnedAgents.map((name) => this.#fleet.release(name, 'dispatch-failed')),
+      )
       this.#recordDispatchFailure(decision.issue)
       this.#batch.abandon(decision.issue)
       this.#error(error, decision.issue)
       throw error
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/factory.ts` around lines 461 - 500, If
any agents are successfully spawned by this.#spawnAgent but a later step in the
try block fails, the code currently abandons the batch (this.#batch.abandon) and
records failure (this.#recordDispatchFailure) without releasing the live agents
recorded on record.agents, leaking workers; fix this by, in the catch path
before abandoning the batch and rethrowing, iterating the spawned agents (use
the local agents array or record.agents) and call the appropriate
release/cleanup for each (e.g., await this.#releaseAgent(...) or the
agent.release method), handle any release errors with logging, then proceed to
call this.#recordDispatchFailure and this.#batch.abandon and rethrow the
original error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/factory-sdk/src/cli/fleet.ts`:
- Line 111: The loop-status and kill-loop commands should be handled before
constructing runtime dependencies (fleet, mount, factory): detect when command
=== 'loop-status' or 'kill-loop' (and the same logic referenced in the 183-223
region) and short-circuit to read loaded.config.loop.* and return the
heartbeat/kill result immediately instead of calling runFactoryCommand or
initializing fleet/mount/factory; move that branch above the code that builds
factory/fleet/mount so these actions only use loaded.config and globals and do
not trigger runtime bootstrapping.
- Around line 205-211: For the 'kill-loop' branch (command.action ===
'kill-loop'), don't blindly trust readFactoryLoopHeartbeat; call
checkFactoryLoopLiveness(heartbeat, { staleMs: config.loop.heartbeatStaleMs })
and if it reports not alive or returns false/throws, reject by throwing an error
(refuse stale/missing heartbeats). Only after liveness check succeeds, call
process.kill(heartbeat.pid, 'SIGTERM') and writeJson(out, { killed:
heartbeat.pid, signal: 'SIGTERM' }); use the existing symbols
readFactoryLoopHeartbeat, checkFactoryLoopLiveness,
config.loop.heartbeatStaleMs, process.kill, and writeJson to locate and
implement the change.
- Around line 418-425: stopAndExit currently attaches a new .finally() each time
it's called, so multiple signals can trigger multiple exits; change it so the
function is single-fire: as soon as shutdown starts (when setting the stopping
promise from factory.stop()), immediately remove the signal listeners (or call
remove()) and return early on subsequent invocations by checking if stopping is
already set (use the stopping variable as the gate). Ensure you only attach the
.finally() once to the stopping promise so remove() and exit(code) run exactly
once; reference stopAndExit, stopping, factory.stop(), remove(), and exit() when
making this change.

In `@packages/factory-sdk/src/orchestrator/factory.ts`:
- Around line 1205-1233: The heartbeat reading/trust issue: ensure malformed
JSON is rejected by validating the parsed object in readFactoryLoopHeartbeat
(and/or at the start of checkFactoryLoopLiveness) so bad files are treated as
"missing" not "live"; specifically, after parseJsonContent in
readFactoryLoopHeartbeat validate that the returned object has a finite numeric
updatedAtMs, an allowed status string (e.g., 'running'|'stopping' if those are
used), and if present a numeric pid, otherwise return undefined; alternatively
add the same sanity checks at the top of checkFactoryLoopLiveness (verify
Number.isFinite(heartbeat.updatedAtMs) before computing ageMs and reject
non-finite values) so NaN ages and invalid pids cannot cause ok:true or be used
by kill-loop.

---

Outside diff comments:
In `@packages/factory-sdk/src/orchestrator/factory.ts`:
- Around line 461-500: If any agents are successfully spawned by
this.#spawnAgent but a later step in the try block fails, the code currently
abandons the batch (this.#batch.abandon) and records failure
(this.#recordDispatchFailure) without releasing the live agents recorded on
record.agents, leaking workers; fix this by, in the catch path before abandoning
the batch and rethrowing, iterating the spawned agents (use the local agents
array or record.agents) and call the appropriate release/cleanup for each (e.g.,
await this.#releaseAgent(...) or the agent.release method), handle any release
errors with logging, then proceed to call this.#recordDispatchFailure and
this.#batch.abandon and rethrow the original error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c608c53b-4eb9-4871-a600-bd832cfc8d59

📥 Commits

Reviewing files that changed from the base of the PR and between 4331644 and 5bc6470.

📒 Files selected for processing (9)
  • packages/factory-sdk/src/cli/fleet.test.ts
  • packages/factory-sdk/src/cli/fleet.ts
  • packages/factory-sdk/src/config/schema.ts
  • packages/factory-sdk/src/index.ts
  • packages/factory-sdk/src/orchestrator/batch-tracker.ts
  • packages/factory-sdk/src/orchestrator/factory.test.ts
  • packages/factory-sdk/src/orchestrator/factory.ts
  • packages/factory-sdk/src/orchestrator/index.ts
  • packages/factory-sdk/src/types.ts

   const mount = await buildMount(loaded, deps)
   const factory = createFactory(loaded.config, { mount, fleet })
-  return await runFactoryCommand(command, factory, mount, globals, out)
+  return await runFactoryCommand(command, factory, mount, loaded.config, globals, out)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decouple loop-status / kill-loop from Fleet/Mount bootstrap.

These two actions only read config.loop.*, but this path still requires a fully initialized fleet, mount, and factory before it can reach the heartbeat logic. That makes the recovery commands fail on unrelated broker or mount startup errors, which is exactly when operators need them to keep working. Short-circuit these actions before runtime dependency construction.

Also applies to: 183-223
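
One way to read this suggestion is as an early return keyed on the action name, placed before any runtime dependency is built. The sketch below is synchronous for brevity and uses hypothetical names throughout (`Action`, `LoopConfig`, `Handlers`, `runCommand`); it does not reproduce the CLI's actual signatures.

```typescript
// Minimal sketch, synchronous for brevity. All names are hypothetical.
type Action = 'loop' | 'loop-status' | 'kill-loop'

interface LoopConfig {
  heartbeatPath: string
  heartbeatStaleMs: number
}

interface Handlers {
  // Needs only config: reads the heartbeat file or signals its pid.
  configOnly(action: Action, config: LoopConfig): string
  // Builds fleet, mount, and factory; may throw on broker or mount errors.
  bootstrapAndRun(action: Action, config: LoopConfig): string
}

const CONFIG_ONLY_ACTIONS: ReadonlySet<Action> = new Set<Action>(['loop-status', 'kill-loop'])

function runCommand(action: Action, config: LoopConfig, handlers: Handlers): string {
  // Recovery commands return before any runtime dependency is constructed.
  if (CONFIG_ONLY_ACTIONS.has(action)) {
    return handlers.configOnly(action, config)
  }
  return handlers.bootstrapAndRun(action, config)
}
```

With this shape, a bootstrap that throws still leaves `loop-status` and `kill-loop` usable, which is the operational property the comment asks for.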

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/cli/fleet.ts` at line 111, The loop-status and
kill-loop commands should be handled before constructing runtime dependencies
(fleet, mount, factory): detect when command === 'loop-status' or 'kill-loop'
(and the same logic referenced in the 183-223 region) and short-circuit to read
loaded.config.loop.* and return the heartbeat/kill result immediately instead of
calling runFactoryCommand or initializing fleet/mount/factory; move that branch
above the code that builds factory/fleet/mount so these actions only use
loaded.config and globals and do not trigger runtime bootstrapping.

Comment on lines +205 to +211
  if (command.action === 'kill-loop') {
    const heartbeat = await readFactoryLoopHeartbeat(config.loop.heartbeatPath)
    if (!heartbeat?.pid) {
      throw new Error(`No factory loop heartbeat at ${config.loop.heartbeatPath}`)
    }
    process.kill(heartbeat.pid, 'SIGTERM')
    writeJson(out, { killed: heartbeat.pid, signal: 'SIGTERM' })

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject stale heartbeats before sending SIGTERM.

kill-loop currently trusts any PID found in the heartbeat file. If that file is stale and the OS has already reused the PID, this can terminate an unrelated process. Gate the kill with checkFactoryLoopLiveness(..., { staleMs: config.loop.heartbeatStaleMs }) and refuse stale/missing heartbeats.
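
A guarded kill along these lines might look like the following sketch. It inlines a freshness check instead of calling checkFactoryLoopLiveness, and `Heartbeat`, `killLoopIfFresh`, and the injected `kill` callback are hypothetical names chosen so the example never signals a real process. In the CLI, `kill` would presumably be `process.kill`.

```typescript
// Minimal sketch. `Heartbeat` and `killLoopIfFresh` are hypothetical names;
// `kill` is injected so the example never signals a real process.
interface Heartbeat {
  pid: number
  updatedAtMs: number
}

function killLoopIfFresh(
  heartbeat: Heartbeat | undefined,
  opts: { nowMs: number; staleMs: number; kill: (pid: number, signal: 'SIGTERM') => void },
): { killed: number; signal: 'SIGTERM' } {
  if (!heartbeat) {
    throw new Error('No factory loop heartbeat')
  }
  const ageMs = opts.nowMs - heartbeat.updatedAtMs
  // A non-finite age means the timestamp is unusable, so treat it as stale.
  if (!Number.isFinite(ageMs) || ageMs > opts.staleMs) {
    throw new Error(`Refusing to signal pid ${heartbeat.pid}: heartbeat is stale`)
  }
  // Only a positive integer names a single process.
  if (!Number.isInteger(heartbeat.pid) || heartbeat.pid <= 0) {
    throw new Error('Refusing to signal: heartbeat pid is invalid')
  }
  opts.kill(heartbeat.pid, 'SIGTERM')
  return { killed: heartbeat.pid, signal: 'SIGTERM' }
}
```

The pid check is an addition beyond what the comment requests. It is there because zero and negative pids address process groups rather than one process under POSIX kill semantics.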

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/cli/fleet.ts` around lines 205 - 211, For the
'kill-loop' branch (command.action === 'kill-loop'), don't blindly trust
readFactoryLoopHeartbeat; call checkFactoryLoopLiveness(heartbeat, { staleMs:
config.loop.heartbeatStaleMs }) and if it reports not alive or returns
false/throws, reject by throwing an error (refuse stale/missing heartbeats).
Only after liveness check succeeds, call process.kill(heartbeat.pid, 'SIGTERM')
and writeJson(out, { killed: heartbeat.pid, signal: 'SIGTERM' }); use the
existing symbols readFactoryLoopHeartbeat, checkFactoryLoopLiveness,
config.loop.heartbeatStaleMs, process.kill, and writeJson to locate and
implement the change.

Comment on lines +418 to +425
+  const stopAndExit = (code: number) => {
+    if (!stopping) {
+      stopping = factory.stop()
     }
-    process.once('SIGINT', stop)
-    process.once('SIGTERM', stop)
-  })
+    void stopping.finally(() => {
+      remove()
+      exit(code)
+    })

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make shutdown handling single-fire across repeated signals.

If a second SIGINT/SIGTERM arrives before factory.stop() resolves, this attaches another .finally() to the same promise, so exit() can run multiple times with conflicting codes when a custom exit callback is injected. Remove the listeners or return early as soon as shutdown starts, not after the stop promise settles.
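
A single-fire variant could be structured as in the sketch below, where the `stopping` promise doubles as the gate and the listeners are removed as soon as shutdown starts. The dependency shape (`SignalDeps`) and the function name `installStopHandlers` are hypothetical; the exit codes 130 and 143 are the ones the walkthrough reports for SIGINT and SIGTERM.

```typescript
// Minimal sketch with injected dependencies; all names are hypothetical.
type StopSignal = 'SIGINT' | 'SIGTERM'

interface SignalDeps {
  stop(): Promise<void>
  exit(code: number): void
  on(signal: StopSignal, handler: () => void): void
  off(signal: StopSignal, handler: () => void): void
}

function installStopHandlers(deps: SignalDeps): () => void {
  let stopping: Promise<void> | undefined
  const onInt = () => stopAndExit(130)
  const onTerm = () => stopAndExit(143)
  const remove = () => {
    deps.off('SIGINT', onInt)
    deps.off('SIGTERM', onTerm)
  }

  function stopAndExit(code: number): void {
    // Gate on `stopping`: only the first signal starts shutdown.
    if (stopping) return
    remove()
    stopping = deps.stop()
    const done = () => deps.exit(code)
    // Exit once, whether stop() resolves or rejects.
    void stopping.then(done, done)
  }

  deps.on('SIGINT', onInt)
  deps.on('SIGTERM', onTerm)
  return remove
}
```

Because the continuation is attached only inside the gated branch, repeated signals cannot queue additional `exit()` calls, and the exit code is the one for whichever signal arrived first.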

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/cli/fleet.ts` around lines 418 - 425, stopAndExit
currently attaches a new .finally() each time it's called, so multiple signals
can trigger multiple exits; change it so the function is single-fire: as soon as
shutdown starts (when setting the stopping promise from factory.stop()),
immediately remove the signal listeners (or call remove()) and return early on
subsequent invocations by checking if stopping is already set (use the stopping
variable as the gate). Ensure you only attach the .finally() once to the
stopping promise so remove() and exit(code) run exactly once; reference
stopAndExit, stopping, factory.stop(), remove(), and exit() when making this
change.

Comment on lines +1205 to +1233
  export async function readFactoryLoopHeartbeat(
    path = DEFAULT_FACTORY_LOOP_HEARTBEAT_PATH,
  ): Promise<FactoryLoopHeartbeat | undefined> {
    try {
      return parseJsonContent(await readFile(path, 'utf8')) as FactoryLoopHeartbeat
    } catch {
      return undefined
    }
  }

  export function checkFactoryLoopLiveness(
    heartbeat: FactoryLoopHeartbeat | undefined,
    opts: { nowMs?: number; staleMs?: number } = {},
  ): FactoryLoopLiveness {
    if (!heartbeat) {
      return { ok: false, stale: true, reason: 'heartbeat missing' }
    }

    const nowMs = opts.nowMs ?? Date.now()
    const staleMs = opts.staleMs ?? 60_000
    const ageMs = Math.max(0, nowMs - heartbeat.updatedAtMs)
    const stale = ageMs > staleMs
    if (stale) {
      return { ok: false, stale: true, ageMs, heartbeat, reason: 'heartbeat stale' }
    }
    if (heartbeat.status === 'stopping') {
      return { ok: false, stale: false, ageMs, heartbeat, reason: 'loop stopping' }
    }
    return { ok: true, stale: false, ageMs, heartbeat }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the heartbeat file before using it for liveness or PID-based control.

This helper trusts arbitrary JSON as FactoryLoopHeartbeat. A malformed file can make checkFactoryLoopLiveness() report ok: true because nowMs - heartbeat.updatedAtMs becomes NaN, and the same unchecked pid is what the CLI later uses for kill-loop. Invalid heartbeat content should be rejected as unreadable, not treated as live.

Suggested direction
 export async function readFactoryLoopHeartbeat(
   path = DEFAULT_FACTORY_LOOP_HEARTBEAT_PATH,
 ): Promise<FactoryLoopHeartbeat | undefined> {
   try {
-    return parseJsonContent(await readFile(path, 'utf8')) as FactoryLoopHeartbeat
+    const parsed = parseJsonContent(await readFile(path, 'utf8'))
+    if (
+      parsed &&
+      typeof parsed === 'object' &&
+      typeof (parsed as Record<string, unknown>).pid === 'number' &&
+      typeof (parsed as Record<string, unknown>).updatedAtMs === 'number' &&
+      typeof (parsed as Record<string, unknown>).updatedAt === 'string' &&
+      typeof (parsed as Record<string, unknown>).iteration === 'number' &&
+      typeof (parsed as Record<string, unknown>).maxIterations === 'number' &&
+      ['running', 'idle', 'stopping'].includes(String((parsed as Record<string, unknown>).status))
+    ) {
+      return parsed as FactoryLoopHeartbeat
+    }
+    return undefined
   } catch {
     return undefined
   }
 }
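
The NaN concern can be reproduced in isolation. The sketch below mirrors the age and staleness arithmetic from the reviewed snippet in two hypothetical helpers (`isStaleUnchecked`, `isStaleChecked`) and feeds them a heartbeat with no updatedAtMs and one whose updatedAtMs overflows to Infinity. The second case is an extra observation, not something the review states: it would pass a `typeof === 'number'` check, which is why the sketch uses `Number.isFinite`.

```typescript
// Minimal sketch. Function names are hypothetical; the arithmetic mirrors the
// age and staleness computation in the snippet under review.
function isStaleUnchecked(nowMs: number, updatedAtMs: number, staleMs: number): boolean {
  const ageMs = Math.max(0, nowMs - updatedAtMs)
  return ageMs > staleMs
}

function isStaleChecked(nowMs: number, updatedAtMs: unknown, staleMs: number): boolean {
  // Reject missing, NaN, and infinite timestamps before doing any arithmetic.
  if (typeof updatedAtMs !== 'number' || !Number.isFinite(updatedAtMs)) return true
  return Math.max(0, nowMs - updatedAtMs) > staleMs
}

const nowMs = 1_000_000
// A heartbeat file with no updatedAtMs: the unchecked cast lets undefined through.
const missing = (JSON.parse('{"pid": 1}') as { updatedAtMs: number }).updatedAtMs
// A numeric literal too large for a double parses to Infinity.
const huge = (JSON.parse('{"updatedAtMs": 1e999}') as { updatedAtMs: number }).updatedAtMs

// NaN > staleMs is false, so the missing timestamp is not flagged as stale.
const missingLooksFresh = !isStaleUnchecked(nowMs, missing, 60_000)
// The age clamps to 0, so the overflowing timestamp is not flagged either.
const hugeLooksFresh = !isStaleUnchecked(nowMs, huge, 60_000)
const missingRejected = isStaleChecked(nowMs, missing, 60_000)
const hugeRejected = isStaleChecked(nowMs, huge, 60_000)
```

All four flags evaluate to true: the unchecked comparison reads both bad timestamps as fresh, and the finite check rejects both.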
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/factory.ts` around lines 1205 - 1233,
The heartbeat reading/trust issue: ensure malformed JSON is rejected by
validating the parsed object in readFactoryLoopHeartbeat (and/or at the start of
checkFactoryLoopLiveness) so bad files are treated as "missing" not "live";
specifically, after parseJsonContent in readFactoryLoopHeartbeat validate that
the returned object has a finite numeric updatedAtMs, an allowed status string
(e.g., 'running'|'stopping' if those are used), and if present a numeric pid,
otherwise return undefined; alternatively add the same sanity checks at the top
of checkFactoryLoopLiveness (verify Number.isFinite(heartbeat.updatedAtMs)
before computing ageMs and reject non-finite values) so NaN ages and invalid
pids cannot cause ok:true or be used by kill-loop.

@kjgbot kjgbot merged commit 3adc586 into main Jun 12, 2026
5 checks passed
@kjgbot kjgbot deleted the factory-sdk/v2fix2 branch June 12, 2026 13:19
@agent-relay-code

Contributor

Implemented one scoped fix in the PR:

dispatch() now records retry attempts only after BatchTracker.start() actually starts an issue, so queued issues no longer burn retry budget before any spawn happens. Added a regression test for maxAttempts: 1 with queued work draining after the active issue completes.
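
The ordering described here (count an attempt only once the tracker has really started the issue) can be sketched as follows. `TinyBatch` and this `dispatch` function are hypothetical stand-ins written for illustration; they do not reproduce BatchTracker's real interface or the SDK's dispatch().

```typescript
// Minimal sketch. `TinyBatch` and `dispatch` are hypothetical stand-ins and do
// not reproduce BatchTracker's real interface.
type StartResult = 'started' | 'queued'

class TinyBatch {
  private active: string | undefined = undefined
  private readonly queue: string[] = []

  // Starts the issue if nothing is active; otherwise queues it.
  start(issue: string): StartResult {
    if (this.active === undefined) {
      this.active = issue
      return 'started'
    }
    this.queue.push(issue)
    return 'queued'
  }
}

function dispatch(batch: TinyBatch, attempts: Map<string, number>, issue: string): StartResult {
  const result = batch.start(issue)
  // Count an attempt only when the issue really started.
  if (result === 'started') {
    attempts.set(issue, (attempts.get(issue) ?? 0) + 1)
  }
  return result
}

const batch = new TinyBatch()
const attempts = new Map<string, number>()
const firstResult = dispatch(batch, attempts, 'ENG-1')
const secondResult = dispatch(batch, attempts, 'ENG-2')
```

After these two calls, only `ENG-1` has an attempt recorded. The queued `ENG-2` keeps its full retry budget, so a limit such as maxAttempts: 1 would not turn it terminal before it ever starts.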

Addressed comments

  • Self-review: queued factory dispatches consumed retry budget without a real dispatch attempt, causing queued work to become terminal before it could start. Fixed in factory.ts and covered in factory.test.ts.
  • Bot/reviewer comments: none were present in .workforce/context.json or other .workforce metadata files.

Advisory Notes

None.

Validation run:

  • npm ci
  • npm run verify:mcp-resources-drift
  • npm run lint passed with existing warnings only
  • npm run typecheck:web
  • npm run typecheck:node
  • npm test
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright test --config playwright.fidelity.config.ts
  • npx playwright test --config playwright.redraw.config.ts

Also checked for merge conflict markers with rg; none found. I did not print READY because GitHub mergeability/check status cannot be verified from this sandbox without the prohibited GitHub tooling.
