Harden factory loop dedup and teardown by kjgbot · Pull Request #248 · AgentWorkforce/pear

kjgbot · 2026-06-12T13:09:56Z

Summary

Implements V2FIX-2 hardening for the factory loop:

D1 dispatch dedup by Linear issue key, independent of duplicate path/uuid observations.
D2 dispatch error backoff with bounded attempts and no tight retry storm.
D3 Slack status responder dedup by human message ts.
D4 bounded V3 loop runner with max iteration cap and batch cap preserved.
D5 un-orphanable teardown: dispose(), loop finally stop, Slack watcher timer cleanup, and CLI SIGINT/SIGTERM stop-before-exit handling.
D6 heartbeat/liveness plus CLI factory loop-status and factory kill-loop single-command kill path.
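
A minimal sketch of how D1 and D2 could fit together, assuming a per-issue-key attempt map and an injectable clock. The names `createDispatchGate`, `begin`, and `recordFailure` are hypothetical and chosen for illustration; the PR's own implementation uses `DispatchAttemptState` and `#dispatchBlockReason()`, whose internals are not shown on this page.

```typescript
// Hypothetical sketch: gate dispatches per Linear issue key with bounded attempts and a cooldown.
type AttemptState = { attempts: number; inFlight: boolean; terminal: boolean; backoffUntilMs: number }
type BlockReason = 'in-flight' | 'terminal' | 'backoff-active' | 'retry-limit'
type GateOptions = { maxAttempts: number; errorCooldownMs: number; now?: () => number }

function createDispatchGate(opts: GateOptions) {
  const now = opts.now ?? Date.now
  // Keyed by issue key only, so duplicate path/uuid observations collapse to one entry.
  const attempts = new Map<string, AttemptState>()

  function blockReason(issueKey: string): BlockReason | undefined {
    const state = attempts.get(issueKey)
    if (!state) return undefined
    if (state.inFlight) return 'in-flight'
    if (state.terminal) return 'terminal'
    if (state.attempts >= opts.maxAttempts) return 'retry-limit'
    if (now() < state.backoffUntilMs) return 'backoff-active'
    return undefined
  }

  // Returns true when the caller may dispatch; false when the issue is blocked.
  function begin(issueKey: string): boolean {
    if (blockReason(issueKey) !== undefined) return false
    const state = attempts.get(issueKey) ?? { attempts: 0, inFlight: false, terminal: false, backoffUntilMs: 0 }
    state.attempts += 1
    state.inFlight = true
    attempts.set(issueKey, state)
    return true
  }

  // On failure, either mark the issue terminal or start the cooldown window.
  function recordFailure(issueKey: string): void {
    const state = attempts.get(issueKey)
    if (!state) return
    state.inFlight = false
    state.terminal = state.attempts >= opts.maxAttempts
    state.backoffUntilMs = now() + opts.errorCooldownMs
  }

  return { blockReason, begin, recordFailure }
}
```

Injecting `now` is what makes the cooldown gap testable with a manual clock instead of real timers.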

V0 Evidence

npx vitest run packages/factory-sdk -> 205 passed / 18 files.
npx tsc --noEmit -p tsconfig.node.json -> exit 0.
Non-vacuous tests added for duplicate key dispatch, cooldown retry gap, Slack message-ts reread dedup, bounded loop + batch cap, watcher timer teardown, heartbeat stale detection, loop-status, kill-loop, and signal cleanup.

V1 Handoff

Ready for factory-verify live checks:

Ready [factory-e2e] issue dispatches once, backs off on error, and does not storm.
One human Slack status message produces one responder reply even if inbound is reread.
Bounded loop advances heartbeat, idles after K iterations, and factory kill-loop stops via SIGTERM.
Exit/crash/SIGTERM paths leave zero lingering listeners/processes.

Guardrails

No src/main changes.
No dependency changes.
Canary AR-133/AR-134 untouched.

coderabbitai · 2026-06-12T13:10:09Z

📝 Walkthrough

Walkthrough

This PR extends the factory-sdk with bounded loop execution, heartbeat persistence, dispatch attempt gating with backoff/retry logic, and new CLI actions. It introduces loop types and configuration, implements runLoop with state transitions written to a heartbeat JSON file, tracks dispatch attempts to prevent repeated failed dispatches, refactors Slack polling for cleaner cancellation, and adds signal handlers for graceful shutdown.

Changes

Factory Loop Execution and Management

Layer / File(s)	Summary
Loop execution types and configuration `src/types.ts`, `src/config/schema.ts`, `src/index.ts`, `src/orchestrator/index.ts`	Factory interface gains `runLoop()` and `dispose()` methods. New types define loop execution options, heartbeat structure (pid, status, iteration counts, timestamps), and liveness status. Config schema extends `liveSubscription` with `dispatch` (errorCooldownMs, maxAttempts) and `loop` (maxIterations, heartbeatPath, heartbeatStaleMs) fields.
Loop execution and heartbeat persistence `src/orchestrator/factory.ts`, `src/orchestrator/factory.test.ts`	`runLoop()` executes `runOnce()` for a bounded iteration count while writing heartbeat state (`running`, `idle`, `stopping`) to a configured JSON file. Exports `readFactoryLoopHeartbeat()` to parse heartbeat files and `checkFactoryLoopLiveness()` to compute staleness and report health. Tests verify iteration limits, heartbeat updates, and liveness detection of missing/stale heartbeats.
Dispatch attempt tracking and refusal gating `src/orchestrator/factory.ts`, `src/orchestrator/factory.test.ts`	Introduces `DispatchAttemptState` to track per-issue attempt counts, in-flight status, terminal failures, and backoff-until timestamps. `runOnce()` and `#handleChange()` short-circuit when `#dispatchBlockReason()` indicates an issue is blocked (in-flight, terminal, backoff-active, or retry-limit exceeded). On failure, records attempt state with cooldown or terminal marking. Tests verify deduplication across differing issue paths and deterministic backoff spacing using a manual clock.
Slack polling refactoring and deduplication `src/orchestrator/factory.ts`, `src/orchestrator/factory.test.ts`	Refactors reply watcher polling from a tight `while` loop to a recursive `setTimeout`-scheduled `poll()` with explicit timer tracking and cancellation on stop. Reply deduplication now uses `seenReplyMessages` keyed by `(threadTs, messageTs)` for message-level precision. Tests verify dedup across re-reads with fresh event IDs and `dispose()` cleanup of watchers/timers.
Signal handling and CLI loop management `src/cli/fleet.ts`, `src/cli/fleet.test.ts`	Adds `installFactoryStopSignalHandlers()` to register SIGINT/SIGTERM listeners that stop the factory and exit with code 130 (SIGINT) or 143 (SIGTERM). CLI expands with `loop-status` (reads heartbeat, reports liveness) and `kill-loop` (sends SIGTERM to heartbeat PID) actions. Factory `loop` action now invokes `runLoop()` with signal handlers, guaranteed cleanup in `finally`, and report emission. Tests verify signal delivery, handler unregistration, CLI output, and heartbeat state transitions.
BatchTracker issue cleanup `src/orchestrator/batch-tracker.ts`	`BatchTracker` gains `abandon()` to remove an in-flight issue record and its associated invocation tracking without advancing to the next queued issue, enabling targeted cleanup in dispatch failure paths.
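
To illustrate the message-level Slack dedup described in the table, here is a small sketch that keys replies by `(threadTs, messageTs)` rather than by event ID. The `SlackReply` shape and `createReplyDeduper` helper are assumptions made for this example; only the `seenReplyMessages` name and the key choice come from the walkthrough.

```typescript
// Hypothetical sketch: dedup Slack replies by (threadTs, messageTs), not by event id.
type SlackReply = { eventId: string; threadTs: string; messageTs: string; text: string }

function createReplyDeduper() {
  const seenReplyMessages = new Set<string>()
  return {
    // Returns true the first time a message is seen and false on any re-read,
    // even if the re-read arrives with a fresh event id.
    shouldRespond(reply: SlackReply): boolean {
      const key = `${reply.threadTs}:${reply.messageTs}`
      if (seenReplyMessages.has(key)) return false
      seenReplyMessages.add(key)
      return true
    },
  }
}
```

Keying on the message timestamp means a reread of the same human message cannot trigger a second responder reply, which is the behavior D3 asks for.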

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

AgentWorkforce/pear#229: Main PR and PR #229 both modify packages/factory-sdk/src/config/schema.ts's exported FactoryConfigSchema—PR #229 introduces the schema, and the main PR extends it with liveSubscription.dispatch/liveSubscription.loop fields that the new CLI/orchestrator loop/heartbeat logic consumes.

🐰 A factory loop hops and bounds,
Each heartbeat a steady sound,
When signals arrive, it stops with grace,
Then clears the slate—dispatch-backoff's in place!
Dispatch attempts won't race,
Slack polls now know their place. 🎵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: hardening factory loop with dispatch deduplication and teardown improvements, which matches the core objectives.
Description check	✅ Passed	The description is directly related to the changeset, providing detailed coverage of all six hardening objectives (D1-D6) with evidence and test results.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.

gemini-code-assist

Code Review

This pull request introduces a factory loop runner (runLoop) with heartbeat-backed status monitoring, loop-status, and kill-loop CLI commands, alongside dispatch attempt deduplication, error backoff, and Slack watcher cleanup. The review feedback focuses on improving the graceful shutdown of the factory loop. Key recommendations include implementing a dispose() method that awaits the active loop promise, tracking this promise with a private field, updating signal handlers and CLI commands to call dispose() instead of stop(), and enhancing runLoop to support graceful cancellation and robust error handling.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T13:12:32Z

+    if (!stopping) {
+      stopping = factory.stop()
    }


Call factory.dispose() instead of factory.stop() in the signal handler. This ensures that if a SIGINT or SIGTERM is received while the factory loop is running, the process will wait for the current iteration to finish gracefully before exiting, rather than terminating abruptly mid-execution.

Suggested change

     if (!stopping) {
-      stopping = factory.stop()
+      stopping = factory.dispose()
     }

gemini-code-assist · 2026-06-12T13:12:33Z

+  async dispose(): Promise<void> {
+    await this.stop()
+  }


Implement dispose() to await the active loop promise if it exists. This ensures that when the factory is disposed (e.g., during a graceful shutdown or signal handling), the current loop iteration completes cleanly before the process exits.

  async dispose(): Promise<void> {
    await this.stop()
    if (this.#activeLoopPromise) {
      try {
        await this.#activeLoopPromise
      } catch {
        // ignore
      }
    }
  }

gemini-code-assist · 2026-06-12T13:12:33Z

+  async runLoop(opts: FactoryLoopRunOptions = {}): Promise<IterationReport[]> {
+    const maxIterations = Math.min(5, Math.max(1, Math.trunc(opts.maxIterations ?? this.#config.loop.maxIterations)))
+    const heartbeatPath = opts.heartbeatPath ?? this.#config.loop.heartbeatPath
+    const reports: IterationReport[] = []
+    let completed = false
+    try {
+      for (let iteration = 0; iteration < maxIterations; iteration += 1) {
+        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration, maxIterations)
+        reports.push(await this.runOnce({ dryRun: opts.dryRun }))
+        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration + 1, maxIterations)
+      }
+      this.#increment('loopIdle')
+      await this.#writeLoopHeartbeat(heartbeatPath, 'idle', reports.length, maxIterations)
+      completed = true
+      return reports
+    } finally {
+      if (!completed) {
+        await this.#writeLoopHeartbeat(heartbeatPath, 'stopping', reports.length, maxIterations)
+      }
+      await this.stop()
+    }
+  }


Enhance runLoop to support graceful cancellation and robust error handling:

Set this.#started = true at the start of the loop and check if (!this.#started) break to allow graceful early exit when stop() is called.

Wrap runOnce in a try/catch block so that a single iteration failure (e.g., due to a transient network or dispatch error) does not crash the entire loop runner.

Track the active loop promise in this.#activeLoopPromise and clear it in the finally block to avoid deadlocks during shutdown.

  async runLoop(opts: FactoryLoopRunOptions = {}): Promise<IterationReport[]> {
    this.#started = true
    const maxIterations = Math.min(5, Math.max(1, Math.trunc(opts.maxIterations ?? this.#config.loop.maxIterations)))
    const heartbeatPath = opts.heartbeatPath ?? this.#config.loop.heartbeatPath
    const reports: IterationReport[] = []
    let completed = false
    const promise = (async () => {
      for (let iteration = 0; iteration < maxIterations; iteration += 1) {
        if (!this.#started) {
          break
        }
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration, maxIterations)
        try {
          reports.push(await this.runOnce({ dryRun: opts.dryRun }))
        } catch (error) {
          this.#logger.error?.(`[factory] loop iteration ${iteration} failed`, error)
        }
        await this.#writeLoopHeartbeat(heartbeatPath, 'running', iteration + 1, maxIterations)
      }
      if (this.#started) {
        this.#increment('loopIdle')
        await this.#writeLoopHeartbeat(heartbeatPath, 'idle', reports.length, maxIterations)
        completed = true
      }
      return reports
    })()
    this.#activeLoopPromise = promise
    try {
      return await promise
    } finally {
      this.#activeLoopPromise = undefined
      if (!completed) {
        await this.#writeLoopHeartbeat(heartbeatPath, 'stopping', reports.length, maxIterations)
      }
      await this.stop()
    }
  }

gemini-code-assist · 2026-06-12T13:12:33Z

+      removeSignalHandlers()
+      await factory.stop()


Use factory.dispose() instead of factory.stop() to ensure that any active loop execution is gracefully awaited and cleaned up before the CLI command exits.

Suggested change

       removeSignalHandlers()
-      await factory.stop()
+      await factory.dispose()

gemini-code-assist · 2026-06-12T13:12:33Z

+  readonly #dispatchAttempts = new Map<string, DispatchAttemptState>()
  #subscription?: Subscription
  #livePollTimer?: ReturnType<typeof setTimeout>
  #livePollInFlight = false


Add a private field #activeLoopPromise to track the active loop execution promise so that graceful shutdown can await its completion.

Suggested change

   readonly #dispatchAttempts = new Map<string, DispatchAttemptState>()
+  #activeLoopPromise?: Promise<unknown>
   #subscription?: Subscription
   #livePollTimer?: ReturnType<typeof setTimeout>
   #livePollInFlight = false

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/factory-sdk/src/orchestrator/factory.ts (1)

461-500: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Release partially spawned agents before abandoning the batch record.

If any spawn succeeds and a later step in this try fails, Line 498 forgets the issue without releasing the agents already recorded on record.agents. That leaks live workers and allows the next retry to dispatch the same issue again while the first wave is still running.

Suggested direction

     } catch (error) {
+      const spawnedAgents = [...record.agents.keys()]
+      await Promise.allSettled(
+        spawnedAgents.map((name) => this.#fleet.release(name, 'dispatch-failed')),
+      )
       this.#recordDispatchFailure(decision.issue)
       this.#batch.abandon(decision.issue)
       this.#error(error, decision.issue)
       throw error
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/factory-sdk/src/orchestrator/factory.ts` around lines 461 - 500, If
any agents are successfully spawned by this.#spawnAgent but a later step in the
try block fails, the code currently abandons the batch (this.#batch.abandon) and
records failure (this.#recordDispatchFailure) without releasing the live agents
recorded on record.agents, leaking workers; fix this by, in the catch path
before abandoning the batch and rethrowing, iterating the spawned agents (use
the local agents array or record.agents) and call the appropriate
release/cleanup for each (e.g., await this.#releaseAgent(...) or the
agent.release method), handle any release errors with logging, then proceed to
call this.#recordDispatchFailure and this.#batch.abandon and rethrow the
original error.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/factory-sdk/src/cli/fleet.ts`:
- Line 111: The loop-status and kill-loop commands should be handled before
constructing runtime dependencies (fleet, mount, factory): detect when command
=== 'loop-status' or 'kill-loop' (and the same logic referenced in the 183-223
region) and short-circuit to read loaded.config.loop.* and return the
heartbeat/kill result immediately instead of calling runFactoryCommand or
initializing fleet/mount/factory; move that branch above the code that builds
factory/fleet/mount so these actions only use loaded.config and globals and do
not trigger runtime bootstrapping.
- Around line 205-211: For the 'kill-loop' branch (command.action ===
'kill-loop'), don't blindly trust readFactoryLoopHeartbeat; call
checkFactoryLoopLiveness(heartbeat, { staleMs: config.loop.heartbeatStaleMs })
and if it reports not alive or returns false/throws, reject by throwing an error
(refuse stale/missing heartbeats). Only after liveness check succeeds, call
process.kill(heartbeat.pid, 'SIGTERM') and writeJson(out, { killed:
heartbeat.pid, signal: 'SIGTERM' }); use the existing symbols
readFactoryLoopHeartbeat, checkFactoryLoopLiveness,
config.loop.heartbeatStaleMs, process.kill, and writeJson to locate and
implement the change.
- Around line 418-425: stopAndExit currently attaches a new .finally() each time
it's called, so multiple signals can trigger multiple exits; change it so the
function is single-fire: as soon as shutdown starts (when setting the stopping
promise from factory.stop()), immediately remove the signal listeners (or call
remove()) and return early on subsequent invocations by checking if stopping is
already set (use the stopping variable as the gate). Ensure you only attach the
.finally() once to the stopping promise so remove() and exit(code) run exactly
once; reference stopAndExit, stopping, factory.stop(), remove(), and exit() when
making this change.

In `@packages/factory-sdk/src/orchestrator/factory.ts`:
- Around line 1205-1233: The heartbeat reading/trust issue: ensure malformed
JSON is rejected by validating the parsed object in readFactoryLoopHeartbeat
(and/or at the start of checkFactoryLoopLiveness) so bad files are treated as
"missing" not "live"; specifically, after parseJsonContent in
readFactoryLoopHeartbeat validate that the returned object has a finite numeric
updatedAtMs, an allowed status string (e.g., 'running'|'stopping' if those are
used), and if present a numeric pid, otherwise return undefined; alternatively
add the same sanity checks at the top of checkFactoryLoopLiveness (verify
Number.isFinite(heartbeat.updatedAtMs) before computing ageMs and reject
non-finite values) so NaN ages and invalid pids cannot cause ok:true or be used
by kill-loop.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c608c53b-4eb9-4871-a600-bd832cfc8d59

📥 Commits

Reviewing files that changed from the base of the PR and between 4331644 and 5bc6470.

📒 Files selected for processing (9)

packages/factory-sdk/src/cli/fleet.test.ts
packages/factory-sdk/src/cli/fleet.ts
packages/factory-sdk/src/config/schema.ts
packages/factory-sdk/src/index.ts
packages/factory-sdk/src/orchestrator/batch-tracker.ts
packages/factory-sdk/src/orchestrator/factory.test.ts
packages/factory-sdk/src/orchestrator/factory.ts
packages/factory-sdk/src/orchestrator/index.ts
packages/factory-sdk/src/types.ts

coderabbitai · 2026-06-12T13:16:35Z

        const mount = await buildMount(loaded, deps)
        const factory = createFactory(loaded.config, { mount, fleet })
-        return await runFactoryCommand(command, factory, mount, globals, out)
+        return await runFactoryCommand(command, factory, mount, loaded.config, globals, out)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Decouple loop-status / kill-loop from Fleet/Mount bootstrap.

These two actions only read config.loop.*, but this path still requires a fully initialized fleet, mount, and factory before it can reach the heartbeat logic. That makes the recovery commands fail on unrelated broker or mount startup errors, which is exactly when operators need them to keep working. Short-circuit these actions before runtime dependency construction.

Also applies to: 183-223
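
The short-circuit suggested here can be sketched as a small router that answers the recovery actions from config alone and only bootstraps the runtime for everything else. All names in this sketch (`route`, `bootstrapRuntime`, `handleRecovery`, and the `Command` and `LoopConfig` shapes) are hypothetical stand-ins, not the CLI's real signatures.

```typescript
// Hypothetical sketch: handle recovery actions before any fleet/mount/factory bootstrap.
type Command = { action: 'loop' | 'loop-status' | 'kill-loop' }
type LoopConfig = { heartbeatPath: string; heartbeatStaleMs: number }

type RouteDeps = {
  // Expensive bootstrap; may throw on broker or mount startup errors.
  bootstrapRuntime: () => Promise<{ run: (command: Command) => Promise<string> }>
  // Cheap handler that only needs the loop config.
  handleRecovery: (command: Command, loop: LoopConfig) => Promise<string>
}

const RECOVERY_ACTIONS = new Set<string>(['loop-status', 'kill-loop'])

async function route(command: Command, loop: LoopConfig, deps: RouteDeps): Promise<string> {
  if (RECOVERY_ACTIONS.has(command.action)) {
    // No runtime construction on this path, so unrelated startup failures cannot block it.
    return deps.handleRecovery(command, loop)
  }
  const runtime = await deps.bootstrapRuntime()
  return runtime.run(command)
}
```

With this ordering, a broken broker or mount still leaves `loop-status` and `kill-loop` usable, which is the scenario the comment is concerned about.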

coderabbitai · 2026-06-12T13:16:35Z

+    if (command.action === 'kill-loop') {
+      const heartbeat = await readFactoryLoopHeartbeat(config.loop.heartbeatPath)
+      if (!heartbeat?.pid) {
+        throw new Error(`No factory loop heartbeat at ${config.loop.heartbeatPath}`)
+      }
+      process.kill(heartbeat.pid, 'SIGTERM')
+      writeJson(out, { killed: heartbeat.pid, signal: 'SIGTERM' })


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject stale heartbeats before sending SIGTERM.

kill-loop currently trusts any PID found in the heartbeat file. If that file is stale and the OS has already reused the PID, this can terminate an unrelated process. Gate the kill with checkFactoryLoopLiveness(..., { staleMs: config.loop.heartbeatStaleMs }) and refuse stale/missing heartbeats.
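
As a rough sketch of the gate described above, the decision to send a signal can be separated from the act of sending it. The `decideKill` function and its return shape are hypothetical; the actual fix would presumably go through `checkFactoryLoopLiveness` as the comment suggests.

```typescript
// Hypothetical sketch: decide whether a heartbeat is fresh enough to justify a SIGTERM.
type Heartbeat = { pid: number; status: 'running' | 'idle' | 'stopping'; updatedAtMs: number }
type KillDecision = { kill: true; pid: number } | { kill: false; reason: string }

function decideKill(heartbeat: Heartbeat | undefined, nowMs: number, staleMs: number): KillDecision {
  if (!heartbeat) return { kill: false, reason: 'heartbeat missing' }
  const ageMs = Math.max(0, nowMs - heartbeat.updatedAtMs)
  // Written as a negated <= so that a NaN age is also rejected as stale.
  if (!(ageMs <= staleMs)) return { kill: false, reason: 'heartbeat stale' }
  if (!Number.isInteger(heartbeat.pid) || heartbeat.pid <= 0) {
    return { kill: false, reason: 'invalid pid' }
  }
  return { kill: true, pid: heartbeat.pid }
}
```

A caller would only invoke `process.kill(decision.pid, 'SIGTERM')` when `decision.kill` is true, and otherwise surface `decision.reason` as an error.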

coderabbitai · 2026-06-12T13:16:35Z

+  const stopAndExit = (code: number) => {
+    if (!stopping) {
+      stopping = factory.stop()
    }
-    process.once('SIGINT', stop)
-    process.once('SIGTERM', stop)
-  })
+    void stopping.finally(() => {
+      remove()
+      exit(code)
+    })


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make shutdown handling single-fire across repeated signals.

If a second SIGINT/SIGTERM arrives before factory.stop() resolves, this attaches another .finally() to the same promise, so exit() can run multiple times with conflicting codes when a custom exit callback is injected. Remove the listeners or return early as soon as shutdown starts, not after the stop promise settles.
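
One way to make the handler single-fire, sketched with injected dependencies so it can be exercised without real signals. The `createStopAndExit` helper and its `deps` shape are assumptions for illustration; in the PR the equivalents are `factory.stop()`, `remove()`, and `exit()`.

```typescript
// Hypothetical sketch: a shutdown handler that runs stop/remove/exit exactly once.
type StopHandlerDeps = {
  stop: () => Promise<void>
  removeListeners: () => void
  exit: (code: number) => void
}

function createStopAndExit(deps: StopHandlerDeps): (code: number) => void {
  let stopping: Promise<void> | undefined
  return (code) => {
    // Later signals are ignored, so the first exit code wins.
    if (stopping) return
    // Remove listeners as soon as shutdown starts, not after stop() settles.
    deps.removeListeners()
    const done = () => deps.exit(code)
    // Exit whether stop() resolves or rejects, without leaving a dangling rejection.
    stopping = deps.stop().then(done, done)
  }
}
```

Using the `stopping` promise as the gate means a second SIGINT or SIGTERM during teardown cannot attach another exit callback.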

coderabbitai · 2026-06-12T13:16:35Z

+export async function readFactoryLoopHeartbeat(
+  path = DEFAULT_FACTORY_LOOP_HEARTBEAT_PATH,
+): Promise<FactoryLoopHeartbeat | undefined> {
+  try {
+    return parseJsonContent(await readFile(path, 'utf8')) as FactoryLoopHeartbeat
+  } catch {
+    return undefined
+  }
+}
+
+export function checkFactoryLoopLiveness(
+  heartbeat: FactoryLoopHeartbeat | undefined,
+  opts: { nowMs?: number; staleMs?: number } = {},
+): FactoryLoopLiveness {
+  if (!heartbeat) {
+    return { ok: false, stale: true, reason: 'heartbeat missing' }
+  }
+
+  const nowMs = opts.nowMs ?? Date.now()
+  const staleMs = opts.staleMs ?? 60_000
+  const ageMs = Math.max(0, nowMs - heartbeat.updatedAtMs)
+  const stale = ageMs > staleMs
+  if (stale) {
+    return { ok: false, stale: true, ageMs, heartbeat, reason: 'heartbeat stale' }
+  }
+  if (heartbeat.status === 'stopping') {
+    return { ok: false, stale: false, ageMs, heartbeat, reason: 'loop stopping' }
+  }
+  return { ok: true, stale: false, ageMs, heartbeat }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the heartbeat file before using it for liveness or PID-based control.

This helper trusts arbitrary JSON as FactoryLoopHeartbeat. A malformed file can make checkFactoryLoopLiveness() report ok: true because nowMs - heartbeat.updatedAtMs becomes NaN, and the same unchecked pid is what the CLI later uses for kill-loop. Invalid heartbeat content should be rejected as unreadable, not treated as live.
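
The NaN path is easy to reproduce in isolation. The sketch below mirrors only the staleness arithmetic from `checkFactoryLoopLiveness`; `isStale` and `isStaleStrict` are hypothetical helper names, and the strict variant is one possible guard rather than the fix the PR adopted.

```typescript
// Mirrors the staleness arithmetic quoted above, applied to a heartbeat file missing updatedAtMs.
function isStale(updatedAtMs: number, nowMs: number, staleMs: number): boolean {
  const ageMs = Math.max(0, nowMs - updatedAtMs) // NaN when updatedAtMs is not a number
  return ageMs > staleMs // any comparison with NaN is false
}

// Hypothetical guard: treat a non-finite timestamp as stale instead of live.
function isStaleStrict(updatedAtMs: unknown, nowMs: number, staleMs: number): boolean {
  if (typeof updatedAtMs !== 'number' || !Number.isFinite(updatedAtMs)) return true
  return Math.max(0, nowMs - updatedAtMs) > staleMs
}

// The cast stands in for the unchecked `as FactoryLoopHeartbeat` in the reader.
const malformed = JSON.parse('{"status":"running"}') as { updatedAtMs: number }
console.log(isStale(malformed.updatedAtMs, 1_000_000, 60_000)) // false, so the file looks live
console.log(isStaleStrict(malformed.updatedAtMs, 1_000_000, 60_000)) // true
```

Because `NaN > staleMs` evaluates to false, the unguarded check falls through to the `ok: true` branch for a file that never carried a timestamp.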

Suggested direction

 export async function readFactoryLoopHeartbeat(
   path = DEFAULT_FACTORY_LOOP_HEARTBEAT_PATH,
 ): Promise<FactoryLoopHeartbeat | undefined> {
   try {
-    return parseJsonContent(await readFile(path, 'utf8')) as FactoryLoopHeartbeat
+    const parsed = parseJsonContent(await readFile(path, 'utf8'))
+    if (
+      parsed &&
+      typeof parsed === 'object' &&
+      typeof (parsed as Record<string, unknown>).pid === 'number' &&
+      typeof (parsed as Record<string, unknown>).updatedAtMs === 'number' &&
+      typeof (parsed as Record<string, unknown>).updatedAt === 'string' &&
+      typeof (parsed as Record<string, unknown>).iteration === 'number' &&
+      typeof (parsed as Record<string, unknown>).maxIterations === 'number' &&
+      ['running', 'idle', 'stopping'].includes(String((parsed as Record<string, unknown>).status))
+    ) {
+      return parsed as FactoryLoopHeartbeat
+    }
+    return undefined
   } catch {
     return undefined
   }
 }

agent-relay-code · 2026-06-12T13:24:54Z

Implemented one scoped fix in the PR:

dispatch() now records retry attempts only after BatchTracker.start() actually starts an issue, so queued issues no longer burn retry budget before any spawn happens. Added a regression test for maxAttempts: 1 with queued work draining after the active issue completes.
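
A minimal sketch of the accounting change described above, assuming the tracker reports whether an issue actually started or was only queued. The `StartResult` type and `createRetryBudget` helper are hypothetical; the real return value of `BatchTracker.start()` is not shown on this page.

```typescript
// Hypothetical sketch: only real starts consume retry budget; queued issues do not.
type StartResult = 'started' | 'queued'

function createRetryBudget(maxAttempts: number) {
  const attempts = new Map<string, number>()
  return {
    // Count an attempt only when the tracker actually started the issue.
    onDispatch(issueKey: string, result: StartResult): void {
      if (result !== 'started') return
      attempts.set(issueKey, (attempts.get(issueKey) ?? 0) + 1)
    },
    // True once the issue has used up its allowed attempts.
    exhausted(issueKey: string): boolean {
      return (attempts.get(issueKey) ?? 0) >= maxAttempts
    },
  }
}
```

With `maxAttempts: 1`, a queued issue stays eligible until it is actually started, which matches the regression case named in the comment.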

Addressed comments

Self-review: queued factory dispatches consumed retry budget without a real dispatch attempt, causing queued work to become terminal before it could start. Fixed in factory.ts and covered in factory.test.ts.
Bot/reviewer comments: none were present in .workforce/context.json or other .workforce metadata files.

Advisory Notes

None.

Validation run:

npm ci
npm run verify:mcp-resources-drift
npm run lint passed with existing warnings only
npm run typecheck:web
npm run typecheck:node
npm test
npx vitest run
npm run build
npm run build:web
npx playwright test --config playwright.fidelity.config.ts
npx playwright test --config playwright.redraw.config.ts

Also checked for merge conflict markers with rg; none found. I did not print READY because GitHub mergeability/check status cannot be verified from this sandbox without the prohibited GitHub tooling.

Harden factory loop dedup and teardown

5bc6470

gemini-code-assist Bot reviewed Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

kjgbot merged commit 3adc586 into main Jun 12, 2026
5 checks passed

kjgbot deleted the factory-sdk/v2fix2 branch June 12, 2026 13:19
